Today, we will have a look at this dataset on sensorless drive diagnosis as part of my “Exploring Less Known Datasets for Machine Learning” series. The dataset is hosted on the UCI Machine Learning Repository and originates from Martyna Bator and others [1].


Dataset Exploration

The idea behind this dataset is that dedicated monitoring sensors are too expensive to install on every drive unit in a factory. Hence, the authors' idea is to use the phase currents (measured for drive control anyhow) to predict the state of a drive and thus enable predictive maintenance.
If I understand the dataset description correctly, the features are the result of an Empirical Mode Decomposition (EMD) of the raw current measurements. Furthermore, no information on training and test set sizes etc. is available, and the original publication lacks clear performance-evaluation metrics. Let's dive in and have a look at the data:

import pandas as pd

# the file is whitespace-separated and has no header row
input_data = pd.read_csv("./data/Sensorless_drive_diagnosis.csv", delim_whitespace=True, header=None)

# name the 48 feature columns A0..A47 and the last column 'status'
header_names = ['A' + str(i) for i in range(input_data.shape[1] - 1)]
header_names.append('status')
# set_axis(..., inplace=True) was removed in pandas 2.0; assign the columns directly
input_data.columns = header_names

display(input_data.head(3))
display(input_data.tail(3))
display(input_data.describe())
input_data.head(3):

A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 ... A39 A40 A41 A42 A43 A44 A45 A46 A47 status
0 -3.014600e-07 0.000008 -0.000012 -0.000002 -0.000001 -0.000021 0.031718 0.03171 0.031721 -0.032963 ... -0.63308 2.9646 8.1198 -1.4961 -1.4961 -1.4961 -1.4996 -1.4996 -1.4996 1
1 2.913200e-06 -0.000005 0.000003 -0.000006 0.000003 -0.000004 0.030804 0.03081 0.030806 -0.033520 ... -0.59314 7.6252 6.1690 -1.4967 -1.4967 -1.4967 -1.5005 -1.5005 -1.5005 1
2 -2.951700e-06 -0.000003 -0.000016 -0.000001 -0.000002 0.000017 0.032877 0.03288 0.032896 -0.029834 ... -0.63252 2.7784 5.3017 -1.4983 -1.4983 -1.4982 -1.4985 -1.4985 -1.4985 1

input_data.tail(3):

A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 ... A39 A40 A41 A42 A43 A44 A45 A46 A47 status
58506 -0.000006 0.000019 -0.000102 -0.000003 0.000004 0.000117 -0.081989 -0.082008 -0.081906 -0.18614 ... -0.51103 20.9250 9.0437 -1.5035 -1.5035 -1.5039 -1.4911 -1.4912 -1.4910 11
58507 -0.000004 0.000034 -0.000442 0.000005 0.000007 0.000087 -0.081500 -0.081534 -0.081093 -0.18363 ... -0.52033 1.3890 10.7430 -1.5029 -1.5029 -1.5030 -1.4932 -1.4932 -1.4931 11
58508 -0.000009 0.000052 0.000072 0.000010 0.000004 -0.000032 -0.083034 -0.083086 -0.083159 -0.18589 ... -0.50974 1.6026 4.5773 -1.5039 -1.5040 -1.5036 -1.4945 -1.4946 -1.4943 11

input_data.describe():

A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 ... A39 A40 A41 A42 A43 A44 A45 A46 A47 status
count 58509.000000 5.850900e+04 5.850900e+04 58509.000000 5.850900e+04 5.850900e+04 58509.000000 58509.000000 58509.000000 58509.000000 ... 58509.000000 58509.000000 58509.000000 58509.000000 58509.000000 58509.000000 58509.000000 58509.000000 58509.000000 58509.000000
mean -0.000003 1.439648e-06 1.412013e-06 -0.000001 1.351239e-06 -2.654483e-07 0.001915 0.001913 0.001912 -0.011897 ... -0.397757 7.293781 8.273772 -1.500887 -1.500912 -1.500805 -1.497771 -1.497794 -1.497686 6.000000
std 0.000072 5.555429e-05 2.353009e-04 0.000063 5.660943e-05 2.261907e-04 0.036468 0.036465 0.036470 0.066482 ... 25.018728 12.451781 6.565952 0.003657 0.003668 0.003632 0.003163 0.003163 0.003175 3.162305
min -0.013721 -5.414400e-03 -1.358000e-02 -0.012787 -8.355900e-03 -9.741300e-03 -0.139890 -0.135940 -0.130860 -0.218640 ... -0.902350 -0.596830 0.320660 -1.525500 -1.526200 -1.523700 -1.521400 -1.523200 -1.521300 1.000000
25% -0.000007 -1.444400e-05 -7.239600e-05 -0.000005 -1.475300e-05 -7.379100e-05 -0.019927 -0.019951 -0.019925 -0.032144 ... -0.715470 1.450300 4.436300 -1.503300 -1.503400 -1.503200 -1.499600 -1.499600 -1.499500 3.000000
50% -0.000003 8.804600e-07 5.137700e-07 -0.000001 7.540200e-07 -1.659300e-07 0.013226 0.013230 0.013247 -0.015566 ... -0.661710 3.301300 6.479100 -1.500300 -1.500300 -1.500300 -1.498100 -1.498100 -1.498000 6.000000
75% 0.000002 1.877700e-05 7.520000e-05 0.000004 1.906200e-05 7.138600e-05 0.024770 0.024776 0.024777 0.020614 ... -0.573980 8.288500 9.857500 -1.498200 -1.498200 -1.498200 -1.496200 -1.496300 -1.496200 9.000000
max 0.005784 4.525300e-03 5.237700e-03 0.001453 8.245100e-04 2.753600e-03 0.069125 0.069130 0.069131 0.352580 ... 3670.800000 889.930000 153.150000 -1.457600 -1.456100 -1.455500 -1.337200 -1.337200 -1.337100 11.000000


Let’s see how drive states are distributed:
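A minimal sketch of how to check this, based on the input_data frame from above (the plot styling is my own choice):

import matplotlib.pyplot as plt

# count samples per drive state; the dataset contains 11 classes (status 1..11)
state_counts = input_data['status'].value_counts().sort_index()
print(state_counts)

# bar plot of the class distribution
state_counts.plot(kind='bar')
plt.xlabel('drive state (status)')
plt.ylabel('number of samples')
plt.show()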

That looks good: all classes are represented equally, so we don't have to deal with an imbalanced dataset.

Let’s have a look at boxplots of the input features:
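One way to generate them, as a sketch (figure size and rotation are my own choices):

import matplotlib.pyplot as plt

# boxplots of all 48 raw input features; the very different scales
# (see describe() above) make this view hard to read
input_data.drop(columns='status').plot(kind='box', figsize=(16, 6), rot=90)
plt.show()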

Perhaps it is better to rescale the data (MaxAbsScaler) and look again:
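A minimal sketch of the scaling step; it also produces the X_df_scaled and y objects used for the train/test split below:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MaxAbsScaler

# separate features and target
X_df = input_data.drop(columns='status')
y = input_data['status']

# MaxAbsScaler divides each feature by its maximum absolute value,
# mapping everything into [-1, 1] without shifting the data
scaler = MaxAbsScaler()
X_df_scaled = pd.DataFrame(scaler.fit_transform(X_df), columns=X_df.columns)

X_df_scaled.plot(kind='box', figsize=(16, 6), rot=90)
plt.show()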

Well, that looks better. Let's see if we can find characteristic input patterns for each drive state:
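One simple way to look for such patterns is to compare per-class feature means; a sketch (the original post may have used a different visualization):

import matplotlib.pyplot as plt

# mean of each scaled feature, grouped by drive state
class_means = X_df_scaled.join(y).groupby('status').mean()

# matrix plot: rows are drive states, columns are features
plt.figure(figsize=(14, 5))
plt.imshow(class_means, aspect='auto', cmap='coolwarm')
plt.colorbar(label='mean scaled feature value')
plt.xlabel('feature index')
plt.ylabel('drive state')
plt.show()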

It is time to do a train/test split and throw some ML algorithms at the dataset:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_df_scaled, y, test_size=0.2, shuffle=True, random_state=42)

Machine Learning Algorithms

# no parameter variation for GaussianNaiveBayes
grid_parameters_decision_tree_classification = {'max_depth': [None, 3, 5, 7, 9, 10, 11]}
grid_parameters_random_forest_classification = {'n_estimators': [3, 5, 10, 15, 18],
                                                'max_depth': [None, 2, 3, 5, 7, 9]}
grid_parameters_adaboost_classifier = {'n_estimators': [3, 5, 10, 20, 50, 60, 80, 100, 200, 250, 300, 350, 400],
                                       'learning_rate': [0.001, 0.01, 0.1, 0.8, 1.0]}
grid_parameters_x_gradient_boosting_classification = {'n_estimators': [3, 5, 18, 20, 60, 80, 150],
                                                      'max_depth': [1, 2, 7, 9, 15],
                                                      'learning_rate': [0.001, 0.01, 0.1],
                                                      'booster': ['gbtree', 'dart']}
  • The SVM classifier seems to have some internal errors - I didn't have the time to resolve them.
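These parameter grids can be plugged into scikit-learn's GridSearchCV; a minimal sketch for the random forest case (the cross-validation settings are my own assumptions, and the other classifiers follow the same pattern):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# exhaustive search over the random forest grid defined above
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           grid_parameters_random_forest_classification,
                           cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.score(X_test, y_test))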

Furthermore, we can use a few simple NNs (just testing how they perform, without optimizing them for the dataset):

from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_baseline_model_1(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_2(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_3(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_4(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_5(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*3, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

All neural networks are trained for 1500 epochs and use a batch size of 256. The best model (criterion: validation loss) is selected for final testing.
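The training loop itself isn't shown here; with the builders above it could look roughly like this (checkpoint path, validation split, and label encoding are my assumptions):

import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.utils import to_categorical

# one-hot encode the 11 drive states (labels 1..11 -> columns 0..10)
y_train_cat = to_categorical(y_train - 1)

model = build_baseline_model_1(X_train.shape[1], y_train_cat.shape[1])

# keep only the weights with the lowest validation loss
checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True)

model.fit(np.asarray(X_train), y_train_cat,
          epochs=1500, batch_size=256,
          validation_split=0.2, callbacks=[checkpoint])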

Results

Let’s see how the algorithms perform:
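The comparison boils down to evaluating each fitted model on the held-out test set, e.g. (a sketch, using the grid search object from above):

from sklearn.metrics import accuracy_score, confusion_matrix

# predictions of the best estimator found by the grid search
y_pred = grid_search.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))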

It seems like drive diagnosis is easy. Neural networks, XGBoost, and random forests are almost free of false predictions. Decision trees perform quite well too. Gaussian Naive Bayes is super fast but doesn't yield results of acceptable quality. Again, this seems to be another dataset on which AdaBoost fails; I'm not sure whether I'm still searching too few hyperparameter combinations.

Since the original publication doesn't contain any clear metrics, I can't compare the results directly. If the figures in the publication show something similar, I would say that we outperform the fuzzy-logic approach even with plain decision trees.

References

[1] Lohweg, V.; Paschke, F.; Bayer, C.; Bator, M.; Mönks, W.; Dicks, A.; Enge-Rosenblatt, O. (2013): Sensorlose Zustandsüberwachung an Synchronmotoren (sensorless condition monitoring of synchronous motors). Proceedings 23. Workshop Computational Intelligence. doi: 10.5445/KSP/1000036887.

Acknowledgements

I would like to thank the authors of the original publication for making the dataset available.