Today, as part of my Exploring Less Known Datasets for Machine Learning series, we’ll have a look at the Steel Plate Faults dataset from the Semeion Research Center of Sciences of Communication.
The dataset deals with detecting surface defects in stainless steel plates [1].
Let’s see how classical ML algorithms and some simple DNNs compare to the ANNs used in the original publication.


Dataset exploration and preprocessing

The steel plate dataset is hosted on the UCI Machine Learning Repository. It is split into a data file and a separate file containing the column names, so we have to load both and combine them:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.preprocessing

# the header file contains one column name per line
InputDataHeader = pd.read_csv("./data/Faults27x7_var",
                              header=None)
display(InputDataHeader.values)
array([['X_Minimum'],
       ['X_Maximum'],
       ['Y_Minimum'],
       ['Y_Maximum'],
       ['Pixels_Areas'],
       ['X_Perimeter'],
       ['Y_Perimeter'],
       ['Sum_of_Luminosity'],
       ['Minimum_of_Luminosity'],
       ['Maximum_of_Luminosity'],
       ['Length_of_Conveyer'],
       ['TypeOfSteel_A300'],
       ['TypeOfSteel_A400'],
       ['Steel_Plate_Thickness'],
       ['Edges_Index'],
       ['Empty_Index'],
       ['Square_Index'],
       ['Outside_X_Index'],
       ['Edges_X_Index'],
       ['Edges_Y_Index'],
       ['Outside_Global_Index'],
       ['LogOfAreas'],
       ['Log_X_Index'],
       ['Log_Y_Index'],
       ['Orientation_Index'],
       ['Luminosity_Index'],
       ['SigmoidOfAreas'],
       ['Pastry'],
       ['Z_Scratch'],
       ['K_Scatch'],
       ['Stains'],
       ['Dirtiness'],
       ['Bumps'],
       ['Other_Faults']], dtype=object)
InputData = pd.read_csv("./data/Faults.NNA",
                        header=None, sep="\t")
InputData.columns = InputDataHeader.values.flatten()

Next, we can have a look at the dataset:

display(InputData.head(2))
display(InputData.tail(2))
display(InputData.describe())
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
0 42 50 270900 270944 267 17 44 24220 76 108 ... 0.8182 -0.2913 0.5822 1 0 0 0 0 0 0
1 645 651 2538079 2538108 108 10 30 11397 84 123 ... 0.7931 -0.1756 0.2984 1 0 0 0 0 0 0

X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
1939 137 170 422497 422528 419 97 47 52715 117 140 ... -0.0606 -0.0171 0.9919 0 0 0 0 0 0 1
1940 1261 1281 87951 87967 103 26 22 11682 101 133 ... -0.2000 -0.1139 0.5296 0 0 0 0 0 0 1

X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
count 1941.000000 1941.000000 1.941000e+03 1.941000e+03 1941.000000 1941.000000 1941.000000 1.941000e+03 1941.000000 1941.000000 ... 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000
mean 571.136012 617.964451 1.650685e+06 1.650739e+06 1893.878413 111.855229 82.965997 2.063121e+05 84.548686 130.193715 ... 0.083288 -0.131305 0.585420 0.081401 0.097888 0.201443 0.037094 0.028336 0.207110 0.346728
std 520.690671 497.627410 1.774578e+06 1.774590e+06 5168.459560 301.209187 426.482879 5.122936e+05 32.134276 18.690992 ... 0.500868 0.148767 0.339452 0.273521 0.297239 0.401181 0.189042 0.165973 0.405339 0.476051
min 0.000000 4.000000 6.712000e+03 6.724000e+03 2.000000 2.000000 1.000000 2.500000e+02 0.000000 37.000000 ... -0.991000 -0.998900 0.119000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 51.000000 192.000000 4.712530e+05 4.712810e+05 84.000000 15.000000 13.000000 9.522000e+03 63.000000 124.000000 ... -0.333300 -0.195000 0.248200 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 435.000000 467.000000 1.204128e+06 1.204136e+06 174.000000 26.000000 25.000000 1.920200e+04 90.000000 127.000000 ... 0.095200 -0.133000 0.506300 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 1053.000000 1072.000000 2.183073e+06 2.183084e+06 822.000000 84.000000 83.000000 8.301100e+04 106.000000 140.000000 ... 0.511600 -0.066600 0.999800 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
max 1705.000000 1713.000000 1.298766e+07 1.298769e+07 152655.000000 10449.000000 18152.000000 1.159141e+07 203.000000 253.000000 ... 0.991700 0.642100 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

One problem is apparent here: the target variables are already one-hot encoded, so we have to reverse this encoding to get the single label column that scikit-learn expects. It’s also helpful to gain a better, visual impression of the dataset:

# separate features (X) and one-hot encoded targets (y)
fault_columns = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
                 "Dirtiness", "Bumps", "Other_Faults"]
X_df = InputData.drop(fault_columns, axis=1)
y_df = InputData[fault_columns].copy()

# reverse the one-hot encoding: build a single class-label list for scikit-learn
y = []
for i in range(y_df.shape[0]):
    if y_df["Pastry"].values[i] == 1:
        y.append("Pastry")
    elif y_df["Z_Scratch"].values[i] == 1:
        y.append("Z_Scratch")
    elif y_df["K_Scatch"].values[i] == 1:
        y.append("K_Scatch")
    elif y_df["Stains"].values[i] == 1:
        y.append("Stains")
    elif y_df["Dirtiness"].values[i] == 1:
        y.append("Dirtiness")
    elif y_df["Bumps"].values[i] == 1:
        y.append("Bumps")
    else:
        y.append("Other_Faults")

# count samples per failure mode and verify the total over all classes
FailureModeDistribution = {}
for FailureMode in y_df:
    FailureModeDistribution[FailureMode] = np.bincount(y_df[FailureMode])[1]
FailureModeCheckSum = np.sum([FailureModeDistribution[FailureMode] for FailureMode in FailureModeDistribution])

The checksum equals 1941 - the number of samples in the dataset - which means that no sample carries more than one fault label. However, we are dealing with an uneven class distribution.
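A quick bar chart makes the imbalance visible:

plt.figure(figsize=(8, 4))
plt.bar(list(FailureModeDistribution.keys()),
        list(FailureModeDistribution.values()))
plt.title("Number of samples per failure mode")
plt.show()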

I don’t know what K_Scatch is; I assume it is meant to be a “Scratch” as well.

Let’s have a look at the input features and their distributions:
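Pandas can, for instance, plot histograms of all features in one go:

X_df.hist(figsize=(14, 12), bins=50)
plt.tight_layout()
plt.show()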

Some input features are distributed fairly evenly, while others contain many outliers that may be significant for predicting certain failure modes.

Let’s rescale the data and see if we can extract a bit more by visual assessment:

# rescale each feature to [-1, 1] by its maximum absolute value
scaler = sklearn.preprocessing.MaxAbsScaler()
X_scaled = scaler.fit_transform(X_df)
X_df_scaled = pd.DataFrame(X_scaled, columns=X_df.columns)
                    
# plot the first sample of each failure mode
plt.figure(figsize=(11,9))
for FailureMode in y_df:
    plt.plot(X_df_scaled[y_df[FailureMode] == 1].values[0], label=FailureMode)
plt.title("Examples for each failure mode (max abs scaled data)")
plt.legend()
plt.show()

There are indeed some visible differences between the failure modes.

Machine Learning Algorithms

We are going to use my standard “brute force routine”, which has performed quite well so far: a full grid search via GridSearchCV with the following hyperparameters:

# no parameter variation for GaussianNaiveBayes
grid_parameters_decision_tree_classification = {'max_depth' : [None, 3,5,7,9,10,11]}
grid_parameters_random_forest_classification = {'n_estimators' : [3,5,10,15,18], 'max_depth' : [None, 2,3,5,7,9]}
grid_parameters_adaboost_classifier = {'n_estimators' : [3,5,10,20,50,60,80,100,200,250,300,350,400],
                                       'learning_rate' : [0.001, 0.01, 0.1, 0.8, 1.0]}
grid_parameters_x_gradient_boosting_classification = {'n_estimators' : [3,5,10,15,18,20,25,50,60,80,100,120,150,200],
                                                    'max_depth' : [1,2,3,5,7,9,10,11,15],
                                                    'learning_rate' :[0.001, 0.01, 0.1],
                                                    'booster' : ['gbtree', 'dart']}
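As an illustration, one of these grids might be wired up as follows - a minimal sketch in which the label encoding, train-test split, and CV settings are assumptions, not the exact setup used here:

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# XGBoost expects numeric class labels
y_encoded = LabelEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(
    X_df_scaled, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

grid_search = GridSearchCV(XGBClassifier(),
                           grid_parameters_x_gradient_boosting_classification,
                           scoring="accuracy", cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)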
  • The SVM classifier seems to have some internal errors - I didn’t have the time to resolve them.

  • I didn’t manage to get CatBoost running. It accepted the target data neither one-hot encoded nor as class labels the way scikit-learn requires them, and CatBoost’s own one-hot encoding led to the same error messages as providing one-hot-encoded targets manually.

Furthermore, we can test a few simple NNs (just to see how they perform; no optimization for this dataset):

# Keras model builders (assuming the standard Sequential API)
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_baseline_model_1(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_2(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_3(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_4(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_5(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*3, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

All neural networks are trained for 1500 epochs and use a batch size of 32. The best model (criterion: validation loss) is selected for final testing.
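A training loop matching that description might look like the following sketch, where the validation split and the checkpoint file name are assumptions:

from keras.callbacks import ModelCheckpoint

# the one-hot encoded targets can be used directly as network outputs
model = build_baseline_model_1(X_df_scaled.shape[1], y_df.shape[1])
checkpoint = ModelCheckpoint("best_model.h5", monitor="val_loss",
                             save_best_only=True)
model.fit(X_df_scaled.values, y_df.values,
          epochs=1500, batch_size=32,
          validation_split=0.2,  # assumed split size
          callbacks=[checkpoint], verbose=0)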

Results

Before we have a look at the results, we should see what the baseline results from the original paper were:

“Class A” to “Class F”? It is not noted how these map to the fault classes in the dataset. Nevertheless, the results average around 89 % overall accuracy. At this point I ran into my usual problem: it is not documented whether a train-test split was used, what the set sizes were, or whether it was simple k-fold cross-validation. Moreover, the dataset was used as an example to demonstrate a patented neural network.

Okay, let’s look at the results:

It seems that AdaBoost performs worst, since it mainly predicts “Other_Faults”. Decision Tree and Gaussian Naive Bayes perform as expected. Random Forest and the neural networks achieve similar overall accuracies, although their per-class accuracies vary. XGBoost performs best.
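Per-class metrics like these can be obtained with scikit-learn, e.g. for the fitted grid search from the earlier sketch:

from sklearn.metrics import classification_report

y_pred = grid_search.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred))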

Discussion

I’m not happy with these results, even though, admittedly, we didn’t optimize much and didn’t do any feature engineering. It is also hard to judge how good these results actually are, because it is unclear how the baseline results were obtained.

Update

TPOT and Auto-Sklearn led to results similar to those of XGBoost.
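For reference, a minimal TPOT run might look like this; all settings here are illustrative assumptions, not the configuration actually used:

from tpot import TPOTClassifier

# reuses the encoded labels and split from the grid-search sketch above
tpot = TPOTClassifier(generations=10, population_size=50,
                      scoring="accuracy", cv=5,
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))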

References

[1] Buscema, M. (1998): MetaNet: The Theory of Independent Judges, in Substance Use & Misuse, 33(2), 439-461.