Today, we will have a look at this dataset on “Sonar: Mines vs. Rocks” by Gorman and Sejnowski [1] as part of my “Exploring Less Known Datasets for Machine Learning” series.
The dataset is hosted on the UCI machine learning repository [2].
Contents
- Dataset exploration and preprocessing
- Applying classical machine learning algorithms
- Discussion of results
Dataset exploration and preprocessing
This dataset dates back to the Cold War and consists of sonar data. The name and setup suggest that the aim of this dataset is to distinguish between rocks and metal structures such as sea mines on the seafloor. The experimental setup was as follows:
- a metal cylinder and a cylindrical rock, both of length 5 ft, placed on sandy ocean floor
- sonar impulse: wide-band linear FM
- return sampling at a distance of 10 meters
- return sample aspect angle: spanning 90° (metal cylinder) and 180° (rock)
- the input data is a normalized spectral envelope covering 60 samples (hence it is already heavily pre-processed: not a time series itself, but the result of one)
- 208 samples selected from 1200 (by which criteria?); 111 metal samples, 97 rock samples
import pandas as pd

input_data = pd.read_csv('data/sonar.csv', header=None)
input_data.head(2)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.0200 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.018 | 0.0084 | 0.0090 | 0.0032 | R |
1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.014 | 0.0049 | 0.0052 | 0.0044 | R |
Let us rename the header to something more accessible and understandable. Next, we can have a look at some basic statistics:
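A minimal sketch of these two steps might look like this (the column names A0–A59 and target are my own choice, picked to match the summary table below):

# rename the 60 frequency bins to A0..A59 and the class column to 'target'
input_data.columns = [f'A{i}' for i in range(60)] + ['target']

# basic summary statistics of the numeric columns
input_data.describe()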
 | A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | ... | A50 | A51 | A52 | A53 | A54 | A55 | A56 | A57 | A58 | A59
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | ... | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 |
mean | 0.029164 | 0.038437 | 0.043832 | 0.053892 | 0.075202 | 0.104570 | 0.121747 | 0.134799 | 0.178003 | 0.208259 | ... | 0.016069 | 0.013420 | 0.010709 | 0.010941 | 0.009290 | 0.008222 | 0.007820 | 0.007949 | 0.007941 | 0.006507 |
std | 0.022991 | 0.032960 | 0.038428 | 0.046528 | 0.055552 | 0.059105 | 0.061788 | 0.085152 | 0.118387 | 0.134416 | ... | 0.012008 | 0.009634 | 0.007060 | 0.007301 | 0.007088 | 0.005736 | 0.005785 | 0.006470 | 0.006181 | 0.005031 |
min | 0.001500 | 0.000600 | 0.001500 | 0.005800 | 0.006700 | 0.010200 | 0.003300 | 0.005500 | 0.007500 | 0.011300 | ... | 0.000000 | 0.000800 | 0.000500 | 0.001000 | 0.000600 | 0.000400 | 0.000300 | 0.000300 | 0.000100 | 0.000600 |
25% | 0.013350 | 0.016450 | 0.018950 | 0.024375 | 0.038050 | 0.067025 | 0.080900 | 0.080425 | 0.097025 | 0.111275 | ... | 0.008425 | 0.007275 | 0.005075 | 0.005375 | 0.004150 | 0.004400 | 0.003700 | 0.003600 | 0.003675 | 0.003100 |
50% | 0.022800 | 0.030800 | 0.034300 | 0.044050 | 0.062500 | 0.092150 | 0.106950 | 0.112100 | 0.152250 | 0.182400 | ... | 0.013900 | 0.011400 | 0.009550 | 0.009300 | 0.007500 | 0.006850 | 0.005950 | 0.005800 | 0.006400 | 0.005300 |
75% | 0.035550 | 0.047950 | 0.057950 | 0.064500 | 0.100275 | 0.134125 | 0.154000 | 0.169600 | 0.233425 | 0.268700 | ... | 0.020825 | 0.016725 | 0.014900 | 0.014500 | 0.012100 | 0.010575 | 0.010425 | 0.010350 | 0.010325 | 0.008525 |
max | 0.137100 | 0.233900 | 0.305900 | 0.426400 | 0.401000 | 0.382300 | 0.372900 | 0.459000 | 0.682800 | 0.710600 | ... | 0.100400 | 0.070900 | 0.039000 | 0.035200 | 0.044700 | 0.039400 | 0.035500 | 0.044000 | 0.036400 | 0.043900 |
A brief look at boxplots of the frequency bins makes the dataset summary a bit more accessible and tells us that there are some outliers. Maybe those are exactly the values we are interested in:
import matplotlib.pyplot as plt

input_data.plot.box(figsize=(12,7), xticks=[])
plt.title('Boxplots of all frequency bins')
plt.xlabel('Frequency bin')
plt.ylabel('Power spectral density (normalized)')
plt.show()
Our classes are more or less evenly distributed:
input_data['target'].value_counts()
M 111
R 97
Name: target, dtype: int64
Let us have a look at our actual input data. The input is not raw sonar data or spectrograms but spectral envelopes. Unfortunately, we do not have further information on the frequencies (understandable, given that the paper dates from the Cold War era).
This is how an example spectral envelope looks for each of the classes:
plt.figure(figsize=(8,5))
# take the first sample of each class and drop the trailing 'target' label
plt.plot(input_data[input_data['target'] == 'R'].values[0][:-1], label='Rock', color='black')
plt.plot(input_data[input_data['target'] == 'M'].values[0][:-1], label='Metal', color='lightgray', linestyle='--')
plt.legend()
plt.title('Example of both classes')
plt.xlabel('Frequency bin')
plt.ylabel('Power spectral density (normalized)')
plt.tight_layout()
plt.show()
The dataset is normalized already. Therefore, the only preprocessing we have to do is to encode the string class labels as integers and split our data into training and testing sets. Using 20 % of the data for testing leaves us with 166 samples for training (and validation) and 42 samples for the final evaluation.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# encode the string labels ('M'/'R') as integers
y = input_data['target'].copy()
y = LabelEncoder().fit_transform(y)

# features only, without the target column
X_df = input_data.copy()
X_df.drop(['target'], inplace=True, axis=1)

X_train, X_test, y_train, y_test = train_test_split(X_df.values, y, test_size=0.2, shuffle=True, random_state=42)
Baseline results
Features | Average accuracy on training sets | Average accuracy on testing sets |
---|---|---|
Aspect angle independent | 89.4 - 99.8 % | 77.1 - 84.7 % |
Aspect angle dependent | 79.3 - 100 % | 73.1 - 90.4 % |
NB: We have to be careful with these results since the authors drew different training and testing sets from their full dataset, so these numbers resemble cross-validation accuracies rather than test accuracies. Their validation set size was 13.
Applying classical machine learning algorithms
Now we can apply some classification algorithms. Since our dataset is small and training times are therefore short, we can pass a fairly large grid of parameters to each algorithm for grid-search-based hyperparameter optimization.
An example of a support vector machine for classification looks like this:
import time

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report


def train_test_SVC_classification(X_train, X_test, y_train, y_test, scorer, dataset_id):
    SVC_classification = SVC()
    # libsvm is quite slow
    grid_parameters_SVC_classification = {'C': [1, 5, 7, 10, 30, 50, 75, 90, 100, 110],
                                          'kernel': ['rbf', 'linear'],
                                          'shrinking': [False, True],
                                          'tol': [0.001, 0.0001, 0.00001]}
    start_time = time.time()
    grid_obj = GridSearchCV(SVC_classification, param_grid=grid_parameters_SVC_classification,
                            cv=4, n_jobs=-1, scoring=scorer)
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    best_SVC_classification = grid_fit.best_estimator_
    prediction = best_SVC_classification.predict(X_test)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    return {'Classification type': 'SVM Classification', 'model': grid_fit, 'Predictions': prediction,
            'Accuracy': accuracy, 'Classification Report': classification_rep,
            'Training time': training_time, 'dataset': dataset_id}
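A call of this helper might then look as follows; the 'accuracy' scorer and the dataset label are illustrative choices, not taken from the original code:

svc_results = train_test_SVC_classification(X_train, X_test, y_train, y_test,
                                             scorer='accuracy', dataset_id='original')
print(svc_results['Accuracy'])
print(svc_results['Classification Report'])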
We end up with the following accuracies of different classifiers:
Classification type | Accuracy |
---|---|
Gaussian Naive Bayes Classification | 0.738095 |
Decision Tree Classification | 0.761905 |
SVM Classification | 0.880952 |
Random Forest Classification | 0.809524 |
AdaBoost Classification | 0.833333 |
XGBoost Classification | 0.809524 |
The SVM turns out to be the best classifier. The majority of the algorithms perform in the range of the baseline neural network for the aspect-angle-independent data.
The SVM model outperforms the baseline.
Applying neural networks
We can use Keras for the neural networks since this is a very simple architecture and there is therefore no need to optimize it to the fullest.
Next, let us have a look at all neural network architectures from the original publication. I cheated a bit with the optimizer ;).
The simplest one consists of an input layer with one unit per input feature, followed directly by the output layer:
from keras.models import Sequential
from keras.layers import Dense


def build_baseline_model_60_flat(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='sigmoid'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
Further, they tested various neural networks with one hidden layer consisting of 2, 3, 6, 12 and 24 units.
The NN from the original paper with a hidden layer of 3 units looks like this:
def build_baseline_model_60_1_layer_3_hidden_units(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='sigmoid'))
    model.add(Dense(3, activation='sigmoid'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
Since the datasets are relatively small, we can give it a shot, try another, deeper neural network without wasting much time, and see how it performs:
from keras.layers import BatchNormalization


def build_model_3_layers_deep(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(24, activation='relu'))
    model.add(Dense(12, activation='relu'))
    model.add(Dense(6, activation='relu'))
    model.add(Dense(3, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
All NNs are trained for 250 epochs with a batch size of 32. Further, the model with the lowest validation loss is selected for further analysis using callbacks.ModelCheckpoint(callback_file_path, monitor='val_loss', save_best_only=True, save_weights_only=True). Some initial testing showed that 250 epochs are more than enough; beyond that point, the models start to overfit aggressively to the training data.
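A minimal training sketch along these lines might look as follows. The 20 % validation split and the checkpoint file name are my own assumptions, and the labels are one-hot encoded because the models end in a two-unit softmax trained with categorical cross-entropy:

from keras.utils import to_categorical
from keras import callbacks

# one-hot encode the integer labels for the 2-unit softmax output
y_train_cat = to_categorical(y_train, num_classes=2)
y_test_cat = to_categorical(y_test, num_classes=2)

model = build_baseline_model_60_flat(input_dim=X_train.shape[1])

# keep only the weights with the lowest validation loss (assumed file name)
callback_file_path = 'weights_best.hdf5'
checkpoint = callbacks.ModelCheckpoint(callback_file_path, monitor='val_loss',
                                       save_best_only=True, save_weights_only=True)

model.fit(X_train, y_train_cat, validation_split=0.2, epochs=250, batch_size=32,
          callbacks=[checkpoint], verbose=0)

# restore the best weights and evaluate on the held-out test set
model.load_weights(callback_file_path)
loss, acc = model.evaluate(X_test, y_test_cat, verbose=0)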
The neural networks yield:
Classification type | Accuracy |
---|---|
NN base 60 flat | 0.809524 |
NN base 60 flat 1 layer 2 hidden units | 0.880952 |
NN base 60 flat 1 layer 3 hidden units | 0.904762 |
NN base 60 flat 1 layer 6 hidden units | 0.809524 |
NN base 60 flat 1 layer 12 hidden units | 0.809524 |
NN base 60 flat 1 layer 24 hidden units | 0.809524 |
NN deep 3 layer of hidden units | 0.809524 |
Again, we end up within the performance range of the models from the original paper.
Further improvements
How can we improve this? The results are still not very satisfying.
Rescaling
Though the dataset is normalized to [0, 1], we can rescale it to [-1, 1] to improve convergence a bit and perhaps get slightly better results.
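One way to do this rescaling, assuming the scaler is fitted on the training data only, is a MinMaxScaler with a custom feature range:

from sklearn.preprocessing import MinMaxScaler

# fit on the training data only to avoid leaking test-set statistics
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
X_train_rescaled = scaler.transform(X_train)
X_test_rescaled = scaler.transform(X_test)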
Feature reduction
The second approach to obtain better results is to reduce the number of features we are using. Let us have a look at the feature importances obtained from the Random Forest classifier:
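The importances can be read off the best random-forest estimator found by the grid search; the variable name best_rf below is a placeholder for that fitted model:

import numpy as np

# feature importances of the fitted random forest (placeholder variable name)
importances = best_rf.feature_importances_
order = np.argsort(importances)  # ascending: least important features first

plt.figure(figsize=(12, 5))
plt.bar(range(len(importances)), importances[order], color='gray')
plt.xticks(range(len(importances)), X_df.columns[order], rotation=90, fontsize=7)
plt.title('Random Forest feature importances (ascending)')
plt.ylabel('Importance')
plt.tight_layout()
plt.show()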
Let us remove the 17 least important features and see what happens: 16 of them are not important at all according to the Random Forest classifier, and the 17th is the one with the lowest remaining importance.
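Dropping them could then look like this, reusing the ascending ranking from the sketch above:

# names of the 17 least important frequency bins
least_important = X_df.columns[order[:17]]

X_df_reduced = X_df.drop(columns=least_important)
X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(
    X_df_reduced.values, y, test_size=0.2, shuffle=True, random_state=42)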
Results
The problem with comparing these results to the results from the original paper is that the authors did not use a proper train/validation/test split.
The original test accuracy for the aspect-angle-independent features ranges from 77.1 to 84.7 %. Even with a comparatively simple classifier such as an SVM, we are able to achieve better results.
The neural networks show somewhat inconsistent behavior but perform reasonably well on 3 of the 4 dataset variants.
Classifier | Accuracy on original dataset | Accuracy on rescaled dataset | Accuracy on reduced dataset | Accuracy on rescaled and reduced dataset |
---|---|---|---|---|
Gaussian Naive Bayes Classification | 0.738095 | 0.738095 | 0.738095 | 0.738095 |
Decision Tree Classification | 0.761905 | 0.761905 | 0.761905 | 0.738095 |
SVM Classification | 0.880952 | 0.880952 | 0.880952 | 0.904762 |
Random Forest Classification | 0.809524 | 0.809524 | 0.809524 | 0.833333 |
AdaBoost Classification | 0.833333 | 0.833333 | 0.833333 | 0.833333 |
XGBoost Classification | 0.809524 | 0.809524 | 0.809524 | 0.833333 |
NN base 60 flat | 0.857143 | 0.857143 | 0.857143 | 0.785714 |
NN base 60 flat 1 layer 2 hidden units | 0.833333 | 0.809524 | 0.928571 | 0.785714 |
NN base 60 flat 1 layer 3 hidden units | 0.880952 | 0.809524 | 0.857143 | 0.785714 |
NN base 60 flat 1 layer 6 hidden units | 0.880952 | 0.833333 | 0.833333 | 0.785714 |
NN base 60 flat 1 layer 12 hidden units | 0.857143 | 0.880952 | 0.857143 | 0.785714 |
NN base 60 flat 1 layer 24 hidden units | 0.857143 | 0.833333 | 0.833333 | 0.785714 |
NN deep 3 layer 6 hidden units | 0.809524 | 0.833333 | 0.809524 | 0.904762 |
Discussion of results
Disclaimer: I am aware that this dataset is from the 1980s and computational power was extremely limited back then.
Judging by the validation accuracies (and, in a way, the test results), we could say that the models here generalize better than the models in the original publication, because we used a much bigger test set as well as proper cross-validation.
If we overfitted a bit more to the training set, and perhaps iterated over all features and left them out one by one, we might end up with better numbers.
However, we would end up with a model that is somewhat disconnected from the data we would actually receive if we deployed it. Geologic material on the sea floor can be messy.
For further improvement we should ask some more questions:
- What could be extracted from raw sonar data?
- What kind of rock was used (this leads to different signatures)?
- What would happen on different seafloor material (e.g. rocks, clays, silts)?
- What criteria were used to select the dataset from a bigger set of 1200 samples? This should ring an alarm bell.
References
[1] Gorman, R.P.; Sejnowski, T.J. (1988): Analysis of hidden units in a layered network trained to classify sonar targets. In: Neural Networks 1 (1), 75 - 89. doi: 10.1016/0893-6080(88)90023-8.
[2] Dua, D.; Taniskidou, K.E. (2018). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Acknowledgements
I would like to thank Gorman and Sejnowski for publishing the dataset.