Today, we will have a look at this dataset on sensorless drive diagnosis as part of my “Exploring Less Known Datasets for Machine Learning” series. The dataset is hosted on the UCI Machine Learning Repository and originates from Martyna Bator and others [1].
Dataset Exploration
The idea behind this dataset is that dedicated sensors are too expensive to monitor every drive unit in a factory. Hence, the authors' idea is to use the phase currents (which are measured for drive control anyhow) to predict the state of a drive and enable predictive maintenance.
If I understand the dataset description correctly, the features are the result of an Empirical Mode Decomposition (EMD) of the raw current measurements. Furthermore, no information on training and test set sizes is available, and the original publication lacks clear performance-evaluation metrics. Let's dive in and have a look at the data:
import pandas as pd

# Load the whitespace-separated data and name the 48 features A0..A47
input_data = pd.read_csv("./data/Sensorless_drive_diagnosis.csv", delim_whitespace=True, header=None)
header_names = ['A' + str(i) for i in range(input_data.shape[1] - 1)]
header_names.append('status')
input_data.columns = header_names

display(input_data.head(3))
display(input_data.tail(3))
display(input_data.describe())
|   | A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | ... | A39 | A40 | A41 | A42 | A43 | A44 | A45 | A46 | A47 | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.014600e-07 | 0.000008 | -0.000012 | -0.000002 | -0.000001 | -0.000021 | 0.031718 | 0.03171 | 0.031721 | -0.032963 | ... | -0.63308 | 2.9646 | 8.1198 | -1.4961 | -1.4961 | -1.4961 | -1.4996 | -1.4996 | -1.4996 | 1 |
| 1 | 2.913200e-06 | -0.000005 | 0.000003 | -0.000006 | 0.000003 | -0.000004 | 0.030804 | 0.03081 | 0.030806 | -0.033520 | ... | -0.59314 | 7.6252 | 6.1690 | -1.4967 | -1.4967 | -1.4967 | -1.5005 | -1.5005 | -1.5005 | 1 |
| 2 | -2.951700e-06 | -0.000003 | -0.000016 | -0.000001 | -0.000002 | 0.000017 | 0.032877 | 0.03288 | 0.032896 | -0.029834 | ... | -0.63252 | 2.7784 | 5.3017 | -1.4983 | -1.4983 | -1.4982 | -1.4985 | -1.4985 | -1.4985 | 1 |
|   | A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | ... | A39 | A40 | A41 | A42 | A43 | A44 | A45 | A46 | A47 | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58506 | -0.000006 | 0.000019 | -0.000102 | -0.000003 | 0.000004 | 0.000117 | -0.081989 | -0.082008 | -0.081906 | -0.18614 | ... | -0.51103 | 20.9250 | 9.0437 | -1.5035 | -1.5035 | -1.5039 | -1.4911 | -1.4912 | -1.4910 | 11 |
| 58507 | -0.000004 | 0.000034 | -0.000442 | 0.000005 | 0.000007 | 0.000087 | -0.081500 | -0.081534 | -0.081093 | -0.18363 | ... | -0.52033 | 1.3890 | 10.7430 | -1.5029 | -1.5029 | -1.5030 | -1.4932 | -1.4932 | -1.4931 | 11 |
| 58508 | -0.000009 | 0.000052 | 0.000072 | 0.000010 | 0.000004 | -0.000032 | -0.083034 | -0.083086 | -0.083159 | -0.18589 | ... | -0.50974 | 1.6026 | 4.5773 | -1.5039 | -1.5040 | -1.5036 | -1.4945 | -1.4946 | -1.4943 | 11 |
|   | A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | ... | A39 | A40 | A41 | A42 | A43 | A44 | A45 | A46 | A47 | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 58509.000000 | 5.850900e+04 | 5.850900e+04 | 58509.000000 | 5.850900e+04 | 5.850900e+04 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | ... | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 | 58509.000000 |
| mean | -0.000003 | 1.439648e-06 | 1.412013e-06 | -0.000001 | 1.351239e-06 | -2.654483e-07 | 0.001915 | 0.001913 | 0.001912 | -0.011897 | ... | -0.397757 | 7.293781 | 8.273772 | -1.500887 | -1.500912 | -1.500805 | -1.497771 | -1.497794 | -1.497686 | 6.000000 |
| std | 0.000072 | 5.555429e-05 | 2.353009e-04 | 0.000063 | 5.660943e-05 | 2.261907e-04 | 0.036468 | 0.036465 | 0.036470 | 0.066482 | ... | 25.018728 | 12.451781 | 6.565952 | 0.003657 | 0.003668 | 0.003632 | 0.003163 | 0.003163 | 0.003175 | 3.162305 |
| min | -0.013721 | -5.414400e-03 | -1.358000e-02 | -0.012787 | -8.355900e-03 | -9.741300e-03 | -0.139890 | -0.135940 | -0.130860 | -0.218640 | ... | -0.902350 | -0.596830 | 0.320660 | -1.525500 | -1.526200 | -1.523700 | -1.521400 | -1.523200 | -1.521300 | 1.000000 |
| 25% | -0.000007 | -1.444400e-05 | -7.239600e-05 | -0.000005 | -1.475300e-05 | -7.379100e-05 | -0.019927 | -0.019951 | -0.019925 | -0.032144 | ... | -0.715470 | 1.450300 | 4.436300 | -1.503300 | -1.503400 | -1.503200 | -1.499600 | -1.499600 | -1.499500 | 3.000000 |
| 50% | -0.000003 | 8.804600e-07 | 5.137700e-07 | -0.000001 | 7.540200e-07 | -1.659300e-07 | 0.013226 | 0.013230 | 0.013247 | -0.015566 | ... | -0.661710 | 3.301300 | 6.479100 | -1.500300 | -1.500300 | -1.500300 | -1.498100 | -1.498100 | -1.498000 | 6.000000 |
| 75% | 0.000002 | 1.877700e-05 | 7.520000e-05 | 0.000004 | 1.906200e-05 | 7.138600e-05 | 0.024770 | 0.024776 | 0.024777 | 0.020614 | ... | -0.573980 | 8.288500 | 9.857500 | -1.498200 | -1.498200 | -1.498200 | -1.496200 | -1.496300 | -1.496200 | 9.000000 |
| max | 0.005784 | 4.525300e-03 | 5.237700e-03 | 0.001453 | 8.245100e-04 | 2.753600e-03 | 0.069125 | 0.069130 | 0.069131 | 0.352580 | ... | 3670.800000 | 889.930000 | 153.150000 | -1.457600 | -1.456100 | -1.455500 | -1.337200 | -1.337200 | -1.337100 | 11.000000 |
Let’s see how drive states are distributed:
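The distribution check itself isn't shown here; a minimal sketch could look like this (splitting off the target into `X_df` and `y` is my naming, not necessarily the original notebook's):

import pandas as pd

# Separate features and target (naming is illustrative)
y = input_data['status']
X_df = input_data.drop(columns=['status'])

# Count samples per drive state; the counts turn out to be equal across all 11 states
print(y.value_counts().sort_index())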
That looks good. We don't have to deal with an imbalanced dataset.
Let’s have a look at boxplots of the input features:
Perhaps it is better to rescale the data (MaxAbsScaler) and look again:
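A sketch of that rescaling step, assuming scikit-learn's MaxAbsScaler (the name `X_df_scaled` matches the train/test split further below):

from sklearn.preprocessing import MaxAbsScaler
import matplotlib.pyplot as plt

# Scale each feature to [-1, 1] by its maximum absolute value
scaler = MaxAbsScaler()
X_df_scaled = pd.DataFrame(scaler.fit_transform(X_df), columns=X_df.columns)

# Boxplots of the rescaled features
X_df_scaled.boxplot(figsize=(16, 6), rot=90)
plt.show()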
Well, that looks better. Let's see if we can find any characteristic input patterns for each drive state:
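One simple way to look for such signatures is to average the scaled features per drive state; this is my illustration, not code from the original analysis:

# Mean feature value per drive state; distinct row profiles hint at
# class-specific signatures in the EMD features
class_profiles = X_df_scaled.groupby(y).mean()
display(class_profiles)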
It is time to do a train/test split and throw some ML algorithms at the dataset:
from sklearn.model_selection import train_test_split

# Hold out 20 % of the data for final testing
X_train, X_test, y_train, y_test = train_test_split(X_df_scaled, y, test_size=0.2, shuffle=True, random_state=42)
Machine Learning Algorithms
# no parameter variation for Gaussian Naive Bayes
grid_parameters_decision_tree_classification = {'max_depth': [None, 3, 5, 7, 9, 10, 11]}
grid_parameters_random_forest_classification = {'n_estimators': [3, 5, 10, 15, 18],
                                                'max_depth': [None, 2, 3, 5, 7, 9]}
grid_parameters_adaboost_classifier = {'n_estimators': [3, 5, 10, 20, 50, 60, 80, 100, 200, 250, 300, 350, 400],
                                       'learning_rate': [0.001, 0.01, 0.1, 0.8, 1.0]}
grid_parameters_x_gradient_boosting_classification = {'n_estimators': [3, 5, 18, 20, 60, 80, 150],
                                                      'max_depth': [1, 2, 7, 9, 15],
                                                      'learning_rate': [0.001, 0.01, 0.1],
                                                      'booster': ['gbtree', 'dart']}
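For context, this is roughly how such a grid would be plugged into scikit-learn's GridSearchCV; the cv and scoring settings here are my assumptions, not values from the original run:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative grid search for one of the models above
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           grid_parameters_random_forest_classification,
                           cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)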
- The SVM classifier seems to have some internal errors; I didn't have the time to resolve them.
Furthermore, we can use a few simple NNs (just testing how they perform, with no tuning for this dataset):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def build_baseline_model_1(input_dim, output_dim):
    # three hidden layers with light (25 %) dropout
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
def build_baseline_model_2(input_dim, output_dim):
    # same architecture as model 1, but with 50 % dropout
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
def build_baseline_model_3(input_dim, output_dim):
    # one additional hidden block compared to model 2
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
def build_baseline_model_4(input_dim, output_dim):
    # two additional hidden blocks compared to model 2
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
def build_baseline_model_5(input_dim, output_dim):
    # like model 4, but with a widened (input_dim*3) third hidden layer
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*3, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model
All neural networks are trained for 1500 epochs and use a batch size of 256. The best model (criterion: validation loss) is selected for final testing.
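In Keras terms, that procedure might look roughly like the sketch below; the checkpoint callback and the one-hot encoding of the 11 states (labelled 1 to 11 in the data) are my reading of the text, not the original training code:

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical

# One-hot encode the labels for categorical_crossentropy (states 1..11 -> 0..10)
y_train_cat = to_categorical(y_train - 1)
y_test_cat = to_categorical(y_test - 1)

model = build_baseline_model_1(X_train.shape[1], y_train_cat.shape[1])

# Keep only the weights with the lowest validation loss
checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss',
                             save_best_only=True)
model.fit(X_train, y_train_cat, validation_split=0.2,
          epochs=1500, batch_size=256, callbacks=[checkpoint])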
Results
Let’s see how the algorithms perform:
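For the scikit-learn models, the evaluation would follow the usual pattern (reusing the hypothetical grid_search object from the sketch above):

from sklearn.metrics import accuracy_score, classification_report

# Evaluate the tuned model on the held-out test set
y_pred = grid_search.best_estimator_.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))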
It seems like drive diagnosis is easy: neural networks, XGBoost, and random forests are almost free of false predictions. Decision trees perform quite well, too. Gaussian Naive Bayes is super fast but doesn't yield results of acceptable quality. Again, this seems to be another dataset where AdaBoost fails; I'm not sure whether I simply tried too few hyperparameter combinations.
Since the original publication doesn't contain any clear metrics, I can't compare the results directly. If the figures in the publication show something similar, then I would say that we outperform the fuzzy-logic approach even with plain decision trees.
References
[1] Lohweg, V.; Paschke, F.; Bayer, C.; Bator, M.; Mönks, W.; Dicks, A.; Enge-Rosenblatt, O. (2013): Sensorlose Zustandsüberwachung an Synchronmotoren [Sensorless condition monitoring of synchronous motors]. Proceedings 23. Workshop Computational Intelligence. doi: 10.5445/KSP/1000036887.
Acknowledgements
I would like to thank the authors of the original publication for making the dataset available.