Today, we will have a look at this dataset on sensorless drive diagnosis as part of my “Exploring Less Known Datasets for Machine Learning” series. The dataset is hosted on the UCI Machine Learning Repository and originates from Martyna Bator and others [1].
Dataset Exploration
The idea behind this dataset is that dedicated sensors are too expensive to monitor every drive unit in a factory. Hence, the authors' idea is to use the phase currents (which are measured for drive control anyhow) to predict the state of a drive and enable predictive maintenance.
If I understand the dataset description correctly, the features are the result of an Empirical Mode Decomposition (EMD) of the raw current measurements. Furthermore, no information on training and test set sizes is available, and the original publication lacks clear performance-evaluation metrics. Let's dive in and have a look at the data:
import pandas as pd

# Load the whitespace-separated data and name the 48 features A0..A47
input_data = pd.read_csv("./data/Sensorless_drive_diagnosis.csv", delim_whitespace=True, header=None)
header_names = ['A' + str(i) for i in range(input_data.shape[1] - 1)]
header_names.append('status')
input_data.columns = header_names

display(input_data.head(3))
display(input_data.tail(3))
display(input_data.describe())
|   | A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | ... | A39 | A40 | A41 | A42 | A43 | A44 | A45 | A46 | A47 | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.014600e-07 | 0.000008 | -0.000012 | -0.000002 | -0.000001 | -0.000021 | 0.031718 | 0.03171 | 0.031721 | -0.032963 | ... | -0.63308 | 2.9646 | 8.1198 | -1.4961 | -1.4961 | -1.4961 | -1.4996 | -1.4996 | -1.4996 | 1 |
| 1 | 2.913200e-06 | -0.000005 | 0.000003 | -0.000006 | 0.000003 | -0.000004 | 0.030804 | 0.03081 | 0.030806 | -0.033520 | ... | -0.59314 | 7.6252 | 6.1690 | -1.4967 | -1.4967 | -1.4967 | -1.5005 | -1.5005 | -1.5005 | 1 |
| 2 | -2.951700e-06 | -0.000003 | -0.000016 | -0.000001 | -0.000002 | 0.000017 | 0.032877 | 0.03288 | 0.032896 | -0.029834 | ... | -0.63252 | 2.7784 | 5.3017 | -1.4983 | -1.4983 | -1.4982 | -1.4985 | -1.4985 | -1.4985 | 1 |
|   | A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | ... | A39 | A40 | A41 | A42 | A43 | A44 | A45 | A46 | A47 | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58506 | -0.000006 | 0.000019 | -0.000102 | -0.000003 | 0.000004 | 0.000117 | -0.081989 | -0.082008 | -0.081906 | -0.18614 | ... | -0.51103 | 20.9250 | 9.0437 | -1.5035 | -1.5035 | -1.5039 | -1.4911 | -1.4912 | -1.4910 | 11 |
| 58507 | -0.000004 | 0.000034 | -0.000442 | 0.000005 | 0.000007 | 0.000087 | -0.081500 | -0.081534 | -0.081093 | -0.18363 | ... | -0.52033 | 1.3890 | 10.7430 | -1.5029 | -1.5029 | -1.5030 | -1.4932 | -1.4932 | -1.4931 | 11 |
| 58508 | -0.000009 | 0.000052 | 0.000072 | 0.000010 | 0.000004 | -0.000032 | -0.083034 | -0.083086 | -0.083159 | -0.18589 | ... | -0.50974 | 1.6026 | 4.5773 | -1.5039 | -1.5040 | -1.5036 | -1.4945 | -1.4946 | -1.4943 | 11 |
|   | A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | ... | A39 | A40 | A41 | A42 | A43 | A44 | A45 | A46 | A47 | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 58509.000000 | 5.850900e+04 | 5.850900e+04 | 58509.000000 | 5.850900e+04 | 5.850900e+04 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | ... | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 |
| mean | -0.000003 | 1.439648e-06 | 1.412013e-06 | -0.000001 | 1.351239e-06 | -2.654483e-07 | 0.001915 | 0.001913 | 0.001912 | -0.011897 | ... | -0.397757 | 7.293781 | 8.273772 | -1.500887 | -1.500912 | -1.500805 | -1.497771 | -1.497794 | -1.497686 | 6.000000 |
| std | 0.000072 | 5.555429e-05 | 2.353009e-04 | 0.000063 | 5.660943e-05 | 2.261907e-04 | 0.036468 | 0.036465 | 0.036470 | 0.066482 | ... | 25.018728 | 12.451781 | 6.565952 | 0.003657 | 0.003668 | 0.003632 | 0.003163 | 0.003163 | 0.003175 | 3.162305 |
| min | -0.013721 | -5.414400e-03 | -1.358000e-02 | -0.012787 | -8.355900e-03 | -9.741300e-03 | -0.139890 | -0.135940 | -0.130860 | -0.218640 | ... | -0.902350 | -0.596830 | 0.320660 | -1.525500 | -1.526200 | -1.523700 | -1.521400 | -1.523200 | -1.521300 | 1.000000 |
| 25% | -0.000007 | -1.444400e-05 | -7.239600e-05 | -0.000005 | -1.475300e-05 | -7.379100e-05 | -0.019927 | -0.019951 | -0.019925 | -0.032144 | ... | -0.715470 | 1.450300 | 4.436300 | -1.503300 | -1.503400 | -1.503200 | -1.499600 | -1.499600 | -1.499500 | 3.000000 |
| 50% | -0.000003 | 8.804600e-07 | 5.137700e-07 | -0.000001 | 7.540200e-07 | -1.659300e-07 | 0.013226 | 0.013230 | 0.013247 | -0.015566 | ... | -0.661710 | 3.301300 | 6.479100 | -1.500300 | -1.500300 | -1.500300 | -1.498100 | -1.498100 | -1.498000 | 6.000000 |
| 75% | 0.000002 | 1.877700e-05 | 7.520000e-05 | 0.000004 | 1.906200e-05 | 7.138600e-05 | 0.024770 | 0.024776 | 0.024777 | 0.020614 | ... | -0.573980 | 8.288500 | 9.857500 | -1.498200 | -1.498200 | -1.498200 | -1.496200 | -1.496300 | -1.496200 | 9.000000 |
| max | 0.005784 | 4.525300e-03 | 5.237700e-03 | 0.001453 | 8.245100e-04 | 2.753600e-03 | 0.069125 | 0.069130 | 0.069131 | 0.352580 | ... | 3670.800000 | 889.930000 | 153.150000 | -1.457600 | -1.456100 | -1.455500 | -1.337200 | -1.337200 | -1.337100 | 11.000000 |
Let’s see how drive states are distributed:
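The distribution check itself isn't shown here; a minimal sketch could look like this (splitting off the target into `X_df` and `y` is my naming, not necessarily the original notebook's):

import pandas as pd

# Separate features and target (naming is illustrative)
y = input_data['status']
X_df = input_data.drop(columns=['status'])

# Count samples per drive state; the counts turn out to be equal across all 11 states
print(y.value_counts().sort_index())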
That looks good. We don't have to deal with an imbalanced dataset.
Let’s have a look at boxplots of the input features:
Perhaps it is better to rescale the data (MaxAbsScaler) and look again:
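A sketch of that rescaling step, assuming scikit-learn's MaxAbsScaler (the name `X_df_scaled` matches the train/test split further below):

from sklearn.preprocessing import MaxAbsScaler
import matplotlib.pyplot as plt

# Scale each feature to [-1, 1] by its maximum absolute value
scaler = MaxAbsScaler()
X_df_scaled = pd.DataFrame(scaler.fit_transform(X_df), columns=X_df.columns)

# Boxplots of the rescaled features
X_df_scaled.boxplot(figsize=(16, 6), rot=90)
plt.show()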
Well, that looks better. Let's see if we can find any characteristic input patterns for each drive state:
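One simple way to look for such signatures is to average the scaled features per drive state; this is my illustration, not code from the original analysis:

# Mean feature value per drive state; distinct row profiles hint at
# class-specific signatures in the EMD features
class_profiles = X_df_scaled.groupby(y).mean()
display(class_profiles)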
It is time to do a train/test split and throw some ML algorithms at the dataset:
from sklearn.model_selection import train_test_split

# Hold out 20 % of the data for final testing
X_train, X_test, y_train, y_test = train_test_split(X_df_scaled, y, test_size=0.2, shuffle=True, random_state=42)
Machine Learning Algorithms
# no parameter variation for Gaussian Naive Bayes
grid_parameters_decision_tree_classification = {'max_depth': [None, 3, 5, 7, 9, 10, 11]}
grid_parameters_random_forest_classification = {'n_estimators': [3, 5, 10, 15, 18],
                                                'max_depth': [None, 2, 3, 5, 7, 9]}
grid_parameters_adaboost_classifier = {'n_estimators': [3, 5, 10, 20, 50, 60, 80, 100, 200, 250, 300, 350, 400],
                                       'learning_rate': [0.001, 0.01, 0.1, 0.8, 1.0]}
grid_parameters_x_gradient_boosting_classification = {'n_estimators': [3, 5, 18, 20, 60, 80, 150],
                                                      'max_depth': [1, 2, 7, 9, 15],
                                                      'learning_rate': [0.001, 0.01, 0.1],
                                                      'booster': ['gbtree', 'dart']}
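For context, this is roughly how such a grid would be plugged into scikit-learn's GridSearchCV; the cv and scoring settings here are my assumptions, not values from the original run:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid search for one of the models above
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           grid_parameters_random_forest_classification,
                           cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)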
- The SVM classifier seems to have some internal errors; I didn't have the time to resolve them.
Furthermore, we can use a few simple NNs (just testing how they perform, with no tuning for this dataset):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def build_baseline_model_1(input_dim, output_dim):
    # three hidden layers with light (25 %) dropout
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
def build_baseline_model_2(input_dim, output_dim):
    # same architecture as model 1, but with 50 % dropout
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
def build_baseline_model_3(input_dim, output_dim):
    # one additional hidden block compared to model 2
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
def build_baseline_model_4(input_dim, output_dim):
    # two additional hidden blocks compared to model 2
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
def build_baseline_model_5(input_dim, output_dim):
    # like model 4, but with a widened (input_dim*3) third hidden layer
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*3, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
All neural networks are trained for 1500 epochs and use a batch size of 256. The best model (criterion: validation loss) is selected for final testing.
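In Keras terms, that procedure might look roughly like the sketch below; the checkpoint callback and the one-hot encoding of the 11 states (labelled 1 to 11 in the data) are my reading of the text, not the original training code:

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical

# One-hot encode the labels for categorical_crossentropy (states 1..11 -> 0..10)
y_train_cat = to_categorical(y_train - 1)
y_test_cat = to_categorical(y_test - 1)

model = build_baseline_model_1(X_train.shape[1], y_train_cat.shape[1])

# Keep only the weights with the lowest validation loss
checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss',
                             save_best_only=True)
model.fit(X_train, y_train_cat, validation_split=0.2,
          epochs=1500, batch_size=256, callbacks=[checkpoint])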
Results
Let’s see how the algorithms perform:
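For the scikit-learn models, the evaluation would follow the usual pattern (reusing the hypothetical grid_search object from the sketch above):

from sklearn.metrics import accuracy_score, classification_report

# Evaluate the tuned model on the held-out test set
y_pred = grid_search.best_estimator_.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))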
It seems like drive diagnosis is easy: neural networks, XGBoost, and random forests are almost free of false predictions. Decision trees perform quite well, too. Gaussian Naive Bayes is super fast but doesn't yield results of acceptable quality. Again, this seems to be another dataset where AdaBoost fails; I'm not sure whether I simply tried too few hyperparameter combinations.
Since the original publication doesn't contain any clear metrics, I can't compare the results directly. If the figures in the publication show something similar, then I would say that we outperform the fuzzy-logic approach even with plain decision trees.
References
[1] Lohweg, V.; Paschke, F.; Bayer, C.; Bator, M.; Mönks, W.; Dicks, A.; Enge-Rosenblatt, O. (2013): Sensorlose Zustandsüberwachung an Synchronmotoren [Sensorless condition monitoring of synchronous motors]. Proceedings 23. Workshop Computational Intelligence. doi: 10.5445/KSP/1000036887.
Acknowledgements
I would like to thank the authors of the original publication for making the dataset available.