Today, we will have a look at this dataset on “Sonar: Mines vs. Rocks” by Gorman and Sejnowski [1] as part of my “Exploring Less Known Datasets for Machine Learning” series.
The dataset is hosted on the UCI machine learning repository [2].
Contents
- Dataset exploration and preprocessing
- Applying classical machine learning algorithms
- Discussion of results
Dataset exploration and preprocessing
This dataset dates back to the Cold War and consists of sonar data. The name and setup suggest that the aim of this dataset is to distinguish between rocks and metal structures such as sea mines on the seafloor. The experimental setup was as follows:
- a metal cylinder and a cylindrical rock, both of length 5 ft, placed on sandy ocean floor
- sonar impulse: wide-band linear FM
- return sampling at a distance of 10 meters
- return sample aspect angle: spanning 90° (metal cylinder) and 180° (rock)
- the input data is a normalized spectral envelope covering 60 samples (hence it is already heavily pre-processed: not a time series itself, but the result of one)
- 208 samples selected from 1200 (by which criteria?); 111 metal samples, 97 rock samples
import pandas as pd

input_data = pd.read_csv('data/sonar.csv', header=None)
input_data.head(2)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.0200 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.018 | 0.0084 | 0.0090 | 0.0032 | R |
1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.014 | 0.0049 | 0.0052 | 0.0044 | R |
Let us rename the header to something more accessible and understandable. Next, we can have a look at some basic statistics:
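A minimal sketch of these two steps might look like this (the column names A0–A59 and target are my own choice, picked to match the summary table below):

# rename the 60 frequency bins to A0..A59 and the class column to 'target'
input_data.columns = [f'A{i}' for i in range(60)] + ['target']

# basic summary statistics of the numeric columns
input_data.describe()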
 | A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | ... | A50 | A51 | A52 | A53 | A54 | A55 | A56 | A57 | A58 | A59
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | ... | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 |
mean | 0.029164 | 0.038437 | 0.043832 | 0.053892 | 0.075202 | 0.104570 | 0.121747 | 0.134799 | 0.178003 | 0.208259 | ... | 0.016069 | 0.013420 | 0.010709 | 0.010941 | 0.009290 | 0.008222 | 0.007820 | 0.007949 | 0.007941 | 0.006507 |
std | 0.022991 | 0.032960 | 0.038428 | 0.046528 | 0.055552 | 0.059105 | 0.061788 | 0.085152 | 0.118387 | 0.134416 | ... | 0.012008 | 0.009634 | 0.007060 | 0.007301 | 0.007088 | 0.005736 | 0.005785 | 0.006470 | 0.006181 | 0.005031 |
min | 0.001500 | 0.000600 | 0.001500 | 0.005800 | 0.006700 | 0.010200 | 0.003300 | 0.005500 | 0.007500 | 0.011300 | ... | 0.000000 | 0.000800 | 0.000500 | 0.001000 | 0.000600 | 0.000400 | 0.000300 | 0.000300 | 0.000100 | 0.000600 |
25% | 0.013350 | 0.016450 | 0.018950 | 0.024375 | 0.038050 | 0.067025 | 0.080900 | 0.080425 | 0.097025 | 0.111275 | ... | 0.008425 | 0.007275 | 0.005075 | 0.005375 | 0.004150 | 0.004400 | 0.003700 | 0.003600 | 0.003675 | 0.003100 |
50% | 0.022800 | 0.030800 | 0.034300 | 0.044050 | 0.062500 | 0.092150 | 0.106950 | 0.112100 | 0.152250 | 0.182400 | ... | 0.013900 | 0.011400 | 0.009550 | 0.009300 | 0.007500 | 0.006850 | 0.005950 | 0.005800 | 0.006400 | 0.005300 |
75% | 0.035550 | 0.047950 | 0.057950 | 0.064500 | 0.100275 | 0.134125 | 0.154000 | 0.169600 | 0.233425 | 0.268700 | ... | 0.020825 | 0.016725 | 0.014900 | 0.014500 | 0.012100 | 0.010575 | 0.010425 | 0.010350 | 0.010325 | 0.008525 |
max | 0.137100 | 0.233900 | 0.305900 | 0.426400 | 0.401000 | 0.382300 | 0.372900 | 0.459000 | 0.682800 | 0.710600 | ... | 0.100400 | 0.070900 | 0.039000 | 0.035200 | 0.044700 | 0.039400 | 0.035500 | 0.044000 | 0.036400 | 0.043900 |
A brief look at boxplots of the frequency bins makes the dataset summary a bit more accessible and tells us that there are some outliers. Maybe those are exactly the values we are interested in:
import matplotlib.pyplot as plt

input_data.plot.box(figsize=(12,7), xticks=[])
plt.title('Boxplots of all frequency bins')
plt.xlabel('Frequency bin')
plt.ylabel('Power spectral density (normalized)')
plt.show()
Our classes are more or less evenly distributed:
input_data['target'].value_counts()
M 111
R 97
Name: target, dtype: int64
Let us have a look at our actual input data. The input is not raw sonar data or spectrograms but spectral envelopes. Unfortunately, we do not have further information on the frequencies (understandable, given that the paper dates from the Cold War era).
This is how an example spectral envelope looks for each of the classes:
plt.figure(figsize=(8,5))
# take the first sample of each class and drop the trailing 'target' label
plt.plot(input_data[input_data['target'] == 'R'].values[0][:-1], label='Rock', color='black')
plt.plot(input_data[input_data['target'] == 'M'].values[0][:-1], label='Metal', color='lightgray', linestyle='--')
plt.legend()
plt.title('Example of both classes')
plt.xlabel('Frequency bin')
plt.ylabel('Power spectral density (normalized)')
plt.tight_layout()
plt.show()
The dataset is normalized already. Therefore, the only preprocessing we have to do is to encode the string class labels as integers and split our data into training and testing sets. Using 20 % of the data for testing leaves us with 166 samples for training (and validation) and 42 samples for the final evaluation.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# encode the string labels ('M'/'R') as integers
y = input_data['target'].copy()
y = LabelEncoder().fit_transform(y)

# features only, without the target column
X_df = input_data.copy()
X_df.drop(['target'], inplace=True, axis=1)

X_train, X_test, y_train, y_test = train_test_split(X_df.values, y, test_size=0.2, shuffle=True, random_state=42)
Baseline results
Features | Average accuracy on training sets | Average accuracy on testing sets |
---|---|---|
Aspect angle independent | 89.4 - 99.8 % | 77.1 - 84.7 % |
Aspect angle dependent | 79.3 - 100 % | 73.1 - 90.4 % |
NB: We have to be careful with these results since the authors drew different training and testing sets from their full dataset, so these numbers resemble cross-validation accuracies rather than test accuracies. Their validation set size was 13.
Applying classical machine learning algorithms
Now we can apply some classification algorithms. Since our dataset is small and training times are therefore short, we can pass a fairly large grid of parameters to each algorithm for grid-search-based hyperparameter optimization.
An example of a support vector machine for classification looks like this:
import time

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report


def train_test_SVC_classification(X_train, X_test, y_train, y_test, scorer, dataset_id):
    SVC_classification = SVC()
    # libsvm is quite slow
    grid_parameters_SVC_classification = {'C': [1, 5, 7, 10, 30, 50, 75, 90, 100, 110],
                                          'kernel': ['rbf', 'linear'],
                                          'shrinking': [False, True],
                                          'tol': [0.001, 0.0001, 0.00001]}
    start_time = time.time()
    grid_obj = GridSearchCV(SVC_classification, param_grid=grid_parameters_SVC_classification,
                            cv=4, n_jobs=-1, scoring=scorer)
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    best_SVC_classification = grid_fit.best_estimator_
    prediction = best_SVC_classification.predict(X_test)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    return {'Classification type': 'SVM Classification', 'model': grid_fit, 'Predictions': prediction,
            'Accuracy': accuracy, 'Classification Report': classification_rep,
            'Training time': training_time, 'dataset': dataset_id}
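A call of this helper might then look as follows; the 'accuracy' scorer and the dataset label are illustrative choices, not taken from the original code:

svc_results = train_test_SVC_classification(X_train, X_test, y_train, y_test,
                                             scorer='accuracy', dataset_id='original')
print(svc_results['Accuracy'])
print(svc_results['Classification Report'])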
We end up with the following accuracies of different classifiers:
Classification type | Accuracy |
---|---|
Gaussian Naive Bayes Classification | 0.738095 |
Decision Tree Classification | 0.761905 |
SVM Classification | 0.880952 |
Random Forest Classification | 0.809524 |
AdaBoost Classification | 0.833333 |
XGBoost Classification | 0.809524 |
The SVM turns out to be the best classifier. The majority of the algorithms perform in the range of the baseline neural network for the aspect-angle-independent data.
The SVM model outperforms the baseline.
Applying neural networks
We can use Keras for the neural networks since this is a very simple architecture and there is therefore no need to optimize it to the fullest.
Next, let us have a look at all neural network architectures from the original publication. I cheated a bit with the optimizer ;).
The simplest one consists of an input layer with one unit per input feature, followed directly by the output layer:
from keras.models import Sequential
from keras.layers import Dense


def build_baseline_model_60_flat(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='sigmoid'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
Further, they tested various neural networks with one hidden layer consisting of 2, 3, 6, 12 and 24 units.
The NN from the original paper with a hidden layer of 3 units looks like this:
def build_baseline_model_60_1_layer_3_hidden_units(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='sigmoid'))
    model.add(Dense(3, activation='sigmoid'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
Since the datasets are relatively small, we can give it a shot, try another, deeper neural network without wasting much time, and see how it performs:
from keras.layers import BatchNormalization


def build_model_3_layers_deep(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(24, activation='relu'))
    model.add(Dense(12, activation='relu'))
    model.add(Dense(6, activation='relu'))
    model.add(Dense(3, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
All NNs are trained for 250 epochs with a batch size of 32. Further, the model with the lowest validation loss is selected for further analysis using callbacks.ModelCheckpoint(callback_file_path, monitor='val_loss', save_best_only=True, save_weights_only=True). Some initial testing showed that 250 epochs are more than enough; beyond that point, the models start to overfit aggressively to the training data.
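A minimal training sketch along these lines might look as follows. The 20 % validation split and the checkpoint file name are my own assumptions, and the labels are one-hot encoded because the models end in a two-unit softmax trained with categorical cross-entropy:

from keras.utils import to_categorical
from keras import callbacks

# one-hot encode the integer labels for the 2-unit softmax output
y_train_cat = to_categorical(y_train, num_classes=2)
y_test_cat = to_categorical(y_test, num_classes=2)

model = build_baseline_model_60_flat(input_dim=X_train.shape[1])

# keep only the weights with the lowest validation loss (assumed file name)
callback_file_path = 'weights_best.hdf5'
checkpoint = callbacks.ModelCheckpoint(callback_file_path, monitor='val_loss',
                                       save_best_only=True, save_weights_only=True)

model.fit(X_train, y_train_cat, validation_split=0.2, epochs=250, batch_size=32,
          callbacks=[checkpoint], verbose=0)

# restore the best weights and evaluate on the held-out test set
model.load_weights(callback_file_path)
loss, acc = model.evaluate(X_test, y_test_cat, verbose=0)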
The neural networks yield:
Classification type | Accuracy |
---|---|
NN base 60 flat | 0.809524 |
NN base 60 flat 1 layer 2 hidden units | 0.880952 |
NN base 60 flat 1 layer 3 hidden units | 0.904762 |
NN base 60 flat 1 layer 6 hidden units | 0.809524 |
NN base 60 flat 1 layer 12 hidden units | 0.809524 |
NN base 60 flat 1 layer 24 hidden units | 0.809524 |
NN deep 3 layer of hidden units | 0.809524 |
Again, we end up within the performance range of the models from the original paper.
Further improvements
How can we improve this? The results are still not very satisfying.
Rescaling
Though the dataset is normalized to [0, 1], we can rescale it to [-1, 1] to improve convergence a bit and perhaps get slightly better results.
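One way to do this rescaling, assuming the scaler is fitted on the training data only, is a MinMaxScaler with a custom feature range:

from sklearn.preprocessing import MinMaxScaler

# fit on the training data only to avoid leaking test-set statistics
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
X_train_rescaled = scaler.transform(X_train)
X_test_rescaled = scaler.transform(X_test)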
Feature reduction
The second approach to obtain better results is to reduce the number of features we are using. Let us have a look at the feature importances obtained from the Random Forest classifier:
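The importances can be read off the best random-forest estimator found by the grid search; the variable name best_rf below is a placeholder for that fitted model:

import numpy as np

# feature importances of the fitted random forest (placeholder variable name)
importances = best_rf.feature_importances_
order = np.argsort(importances)  # ascending: least important features first

plt.figure(figsize=(12, 5))
plt.bar(range(len(importances)), importances[order], color='gray')
plt.xticks(range(len(importances)), X_df.columns[order], rotation=90, fontsize=7)
plt.title('Random Forest feature importances (ascending)')
plt.ylabel('Importance')
plt.tight_layout()
plt.show()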
Let us remove the 17 least important features and see what happens: 16 of them are not important at all according to the Random Forest classifier, and the 17th is the one with the lowest remaining importance.
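Dropping them could then look like this, reusing the ascending ranking from the sketch above:

# names of the 17 least important frequency bins
least_important = X_df.columns[order[:17]]

X_df_reduced = X_df.drop(columns=least_important)
X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(
    X_df_reduced.values, y, test_size=0.2, shuffle=True, random_state=42)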
Results
The problem with comparing these results to the results from the original paper is that the authors did not use a proper train/validation/test split.
The original test accuracy for the aspect-angle-independent features ranges from 77.1 to 84.7 %. Even with a comparatively simple classifier such as an SVM, we are able to achieve better results.
The neural networks show somewhat inconsistent behavior but perform reasonably well on 3 of the 4 dataset variants.
Classifier | Accuracy on original dataset | Accuracy on rescaled dataset | Accuracy on reduced dataset | Accuracy on rescaled and reduced dataset |
---|---|---|---|---|
Gaussian Naive Bayes Classification | 0.738095 | 0.738095 | 0.738095 | 0.738095 |
Decision Tree Classification | 0.761905 | 0.761905 | 0.761905 | 0.738095 |
SVM Classification | 0.880952 | 0.880952 | 0.880952 | 0.904762 |
Random Forest Classification | 0.809524 | 0.809524 | 0.809524 | 0.833333 |
AdaBoost Classification | 0.833333 | 0.833333 | 0.833333 | 0.833333 |
XGBoost Classification | 0.809524 | 0.809524 | 0.809524 | 0.833333 |
NN base 60 flat | 0.857143 | 0.857143 | 0.857143 | 0.785714 |
NN base 60 flat 1 layer 2 hidden units | 0.833333 | 0.809524 | 0.928571 | 0.785714 |
NN base 60 flat 1 layer 3 hidden units | 0.880952 | 0.809524 | 0.857143 | 0.785714 |
NN base 60 flat 1 layer 6 hidden units | 0.880952 | 0.833333 | 0.833333 | 0.785714 |
NN base 60 flat 1 layer 12 hidden units | 0.857143 | 0.880952 | 0.857143 | 0.785714 |
NN base 60 flat 1 layer 24 hidden units | 0.857143 | 0.833333 | 0.833333 | 0.785714 |
NN deep 3 layer 6 hidden units | 0.809524 | 0.833333 | 0.809524 | 0.904762 |
Discussion of results
Disclaimer: I am aware that this dataset is from the 1980s and computational power was extremely limited back then.
Judging by the validation accuracies (and, in a way, the test results), we could say that the models here generalize better than the models in the original publication, because we used a much bigger test set as well as proper cross-validation.
If we overfitted a bit more to the training set, and perhaps iterated over all features and left them out one by one, we might end up with better numbers.
However, we would end up with a model that is somewhat disconnected from the data we would actually receive if we deployed it. Geologic material on the sea floor can be messy.
For further improvement we should ask some more questions:
- What could be extracted from raw sonar data?
- What kind of rock was used (this leads to different signatures)?
- What would happen on different seafloor material (e.g. rocks, clays, silts)?
- What criteria were used to select the dataset from a bigger set of 1200 samples? This should ring an alarm bell.
References
[1] Gorman, R.P.; Sejnowski, T.J. (1988): Analysis of hidden units in a layered network trained to classify sonar targets. In: Neural Networks 1 (1), 75 - 89. doi: 10.1016/0893-6080(88)90023-8.
[2] Dua, D.; Taniskidou, K.E. (2018). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Acknowledgements
I would like to thank Gorman and Sejnowski for publishing the dataset.