Today, as part of my Exploring Less Known Datasets for Machine Learning series, we’ll have a look at the Steel Plate Faults dataset from the Semeion Research Center of Sciences of Communication.
The dataset deals with detecting surface defects in stainless steel plates [1].
Let’s see how classical ML algorithms and some simple DNNs compare to the ANNs used in the original publication.


Dataset exploration and preprocessing

The steel plate dataset is hosted on the UCI Machine Learning Repository. It is split into a data file and a separate file containing the column names, so we have to load both and combine them:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.preprocessing

# the header file contains one column name per line
InputDataHeader = pd.read_csv("./data/Faults27x7_var",
                              header=None)
display(InputDataHeader.values)
array([['X_Minimum'],
       ['X_Maximum'],
       ['Y_Minimum'],
       ['Y_Maximum'],
       ['Pixels_Areas'],
       ['X_Perimeter'],
       ['Y_Perimeter'],
       ['Sum_of_Luminosity'],
       ['Minimum_of_Luminosity'],
       ['Maximum_of_Luminosity'],
       ['Length_of_Conveyer'],
       ['TypeOfSteel_A300'],
       ['TypeOfSteel_A400'],
       ['Steel_Plate_Thickness'],
       ['Edges_Index'],
       ['Empty_Index'],
       ['Square_Index'],
       ['Outside_X_Index'],
       ['Edges_X_Index'],
       ['Edges_Y_Index'],
       ['Outside_Global_Index'],
       ['LogOfAreas'],
       ['Log_X_Index'],
       ['Log_Y_Index'],
       ['Orientation_Index'],
       ['Luminosity_Index'],
       ['SigmoidOfAreas'],
       ['Pastry'],
       ['Z_Scratch'],
       ['K_Scatch'],
       ['Stains'],
       ['Dirtiness'],
       ['Bumps'],
       ['Other_Faults']], dtype=object)
InputData = pd.read_csv("./data/Faults.NNA",
                        header=None, sep="\t")
InputData.columns = InputDataHeader.values.flatten()

Next, we can have a look at the dataset:

display(InputData.head(2))
display(InputData.tail(2))
display(InputData.describe())
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
0 42 50 270900 270944 267 17 44 24220 76 108 ... 0.8182 -0.2913 0.5822 1 0 0 0 0 0 0
1 645 651 2538079 2538108 108 10 30 11397 84 123 ... 0.7931 -0.1756 0.2984 1 0 0 0 0 0 0

X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
1939 137 170 422497 422528 419 97 47 52715 117 140 ... -0.0606 -0.0171 0.9919 0 0 0 0 0 0 1
1940 1261 1281 87951 87967 103 26 22 11682 101 133 ... -0.2000 -0.1139 0.5296 0 0 0 0 0 0 1

X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
count 1941.000000 1941.000000 1.941000e+03 1.941000e+03 1941.000000 1941.000000 1941.000000 1.941000e+03 1941.000000 1941.000000 ... 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000
mean 571.136012 617.964451 1.650685e+06 1.650739e+06 1893.878413 111.855229 82.965997 2.063121e+05 84.548686 130.193715 ... 0.083288 -0.131305 0.585420 0.081401 0.097888 0.201443 0.037094 0.028336 0.207110 0.346728
std 520.690671 497.627410 1.774578e+06 1.774590e+06 5168.459560 301.209187 426.482879 5.122936e+05 32.134276 18.690992 ... 0.500868 0.148767 0.339452 0.273521 0.297239 0.401181 0.189042 0.165973 0.405339 0.476051
min 0.000000 4.000000 6.712000e+03 6.724000e+03 2.000000 2.000000 1.000000 2.500000e+02 0.000000 37.000000 ... -0.991000 -0.998900 0.119000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 51.000000 192.000000 4.712530e+05 4.712810e+05 84.000000 15.000000 13.000000 9.522000e+03 63.000000 124.000000 ... -0.333300 -0.195000 0.248200 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 435.000000 467.000000 1.204128e+06 1.204136e+06 174.000000 26.000000 25.000000 1.920200e+04 90.000000 127.000000 ... 0.095200 -0.133000 0.506300 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 1053.000000 1072.000000 2.183073e+06 2.183084e+06 822.000000 84.000000 83.000000 8.301100e+04 106.000000 140.000000 ... 0.511600 -0.066600 0.999800 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
max 1705.000000 1713.000000 1.298766e+07 1.298769e+07 152655.000000 10449.000000 18152.000000 1.159141e+07 203.000000 253.000000 ... 0.991700 0.642100 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

One problem is apparent here: the target variables are already one-hot encoded, so we have to reverse this encoding to get the single label column that scikit-learn expects. It’s also helpful to gain a better, visual impression of the dataset:

# separate features (X) and one-hot encoded targets (y)
fault_columns = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
                 "Dirtiness", "Bumps", "Other_Faults"]
X_df = InputData.drop(fault_columns, axis=1)
y_df = InputData[fault_columns].copy()

# reverse the one-hot encoding: build a single class-label list for scikit-learn
y = []
for i in range(y_df.shape[0]):
    if y_df["Pastry"].values[i] == 1:
        y.append("Pastry")
    elif y_df["Z_Scratch"].values[i] == 1:
        y.append("Z_Scratch")
    elif y_df["K_Scatch"].values[i] == 1:
        y.append("K_Scatch")
    elif y_df["Stains"].values[i] == 1:
        y.append("Stains")
    elif y_df["Dirtiness"].values[i] == 1:
        y.append("Dirtiness")
    elif y_df["Bumps"].values[i] == 1:
        y.append("Bumps")
    else:
        y.append("Other_Faults")

# count samples per failure mode and verify the total over all classes
FailureModeDistribution = {}
for FailureMode in y_df:
    FailureModeDistribution[FailureMode] = np.bincount(y_df[FailureMode])[1]
FailureModeCheckSum = np.sum([FailureModeDistribution[FailureMode] for FailureMode in FailureModeDistribution])

The checksum equals 1941 - the number of samples in the dataset - which means that no sample carries more than one fault label. However, we are dealing with an uneven class distribution.
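A quick bar chart makes the imbalance visible:

plt.figure(figsize=(8, 4))
plt.bar(list(FailureModeDistribution.keys()),
        list(FailureModeDistribution.values()))
plt.title("Number of samples per failure mode")
plt.show()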

I don’t know what K_Scatch is; I assume it is meant to be a “Scratch” as well.

Let’s have a look at the input features and their distributions:
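Pandas can, for instance, plot histograms of all features in one go:

X_df.hist(figsize=(14, 12), bins=50)
plt.tight_layout()
plt.show()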

Some input features are distributed fairly evenly, while others contain many outliers that may be significant for predicting certain failure modes.

Let’s rescale the data and see if we can extract a bit more by visual assessment:

# rescale each feature to [-1, 1] by its maximum absolute value
scaler = sklearn.preprocessing.MaxAbsScaler()
X_scaled = scaler.fit_transform(X_df)
X_df_scaled = pd.DataFrame(X_scaled, columns=X_df.columns)
                    
# plot the first sample of each failure mode
plt.figure(figsize=(11,9))
for FailureMode in y_df:
    plt.plot(X_df_scaled[y_df[FailureMode] == 1].values[0], label=FailureMode)
plt.title("Examples for each failure mode (max abs scaled data)")
plt.legend()
plt.show()

There are indeed some visible differences between the failure modes.

Machine Learning Algorithms

We are going to use my standard “brute force routine”, which has performed quite well so far: a full grid search via GridSearchCV with the following hyperparameters:

# no parameter variation for GaussianNaiveBayes
grid_parameters_decision_tree_classification = {'max_depth' : [None, 3,5,7,9,10,11]}
grid_parameters_random_forest_classification = {'n_estimators' : [3,5,10,15,18], 'max_depth' : [None, 2,3,5,7,9]}
grid_parameters_adaboost_classifier = {'n_estimators' : [3,5,10,20,50,60,80,100,200,250,300,350,400],
                                       'learning_rate' : [0.001, 0.01, 0.1, 0.8, 1.0]}
grid_parameters_x_gradient_boosting_classification = {'n_estimators' : [3,5,10,15,18,20,25,50,60,80,100,120,150,200],
                                                    'max_depth' : [1,2,3,5,7,9,10,11,15],
                                                    'learning_rate' :[0.001, 0.01, 0.1],
                                                    'booster' : ['gbtree', 'dart']}
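As an illustration, one of these grids might be wired up as follows - a minimal sketch in which the label encoding, train-test split, and CV settings are assumptions, not the exact setup used here:

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# XGBoost expects numeric class labels
y_encoded = LabelEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(
    X_df_scaled, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

grid_search = GridSearchCV(XGBClassifier(),
                           grid_parameters_x_gradient_boosting_classification,
                           scoring="accuracy", cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)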
  • The SVM classifier seems to have some internal errors - I didn’t have the time to resolve them.

  • I didn’t manage to get CatBoost running. It accepted the target data neither one-hot encoded nor as class labels the way scikit-learn requires them, and CatBoost’s own one-hot encoding led to the same error messages as providing one-hot-encoded targets manually.

Furthermore, we can test a few simple NNs (just to see how they perform; no optimization for this dataset):

# Keras model builders (assuming the standard Sequential API)
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_baseline_model_1(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_2(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_3(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_4(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def build_baseline_model_5(input_dim, output_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*3, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(input_dim//2, activation='relu'))
    model.add(Dense(output_dim, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

All neural networks are trained for 1500 epochs and use a batch size of 32. The best model (criterion: validation loss) is selected for final testing.
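A training loop matching that description might look like the following sketch, where the validation split and the checkpoint file name are assumptions:

from keras.callbacks import ModelCheckpoint

# the one-hot encoded targets can be used directly as network outputs
model = build_baseline_model_1(X_df_scaled.shape[1], y_df.shape[1])
checkpoint = ModelCheckpoint("best_model.h5", monitor="val_loss",
                             save_best_only=True)
model.fit(X_df_scaled.values, y_df.values,
          epochs=1500, batch_size=32,
          validation_split=0.2,  # assumed split size
          callbacks=[checkpoint], verbose=0)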

Results

Before we have a look at the results, we should see what the baseline results from the original paper were:

“Class A” to “Class F”? It is not noted how these map to the fault classes in the dataset. Nevertheless, the results average around 89 % overall accuracy. At this point I ran into my usual problem: it is not documented whether a train-test split was used, what the set sizes were, or whether it was simple k-fold cross-validation. Moreover, the dataset was used as an example to demonstrate a patented neural network.

Okay, let’s look at the results:

It seems that AdaBoost performs worst, since it mainly predicts “Other_Faults”. Decision Tree and Gaussian Naive Bayes perform as expected. Random Forest and the neural networks achieve similar overall accuracies, although their per-class accuracies vary. XGBoost performs best.
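Per-class metrics like these can be obtained with scikit-learn, e.g. for the fitted grid search from the earlier sketch:

from sklearn.metrics import classification_report

y_pred = grid_search.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred))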

Discussion

I’m not happy with these results, even though, admittedly, we didn’t optimize much and didn’t do any feature engineering. It is also hard to judge how good these results actually are, because it is unclear how the baseline results were obtained.

Update

TPOT and Auto-Sklearn led to results similar to those of XGBoost.
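For reference, a minimal TPOT run might look like this; all settings here are illustrative assumptions, not the configuration actually used:

from tpot import TPOTClassifier

# reuses the encoded labels and split from the grid-search sketch above
tpot = TPOTClassifier(generations=10, population_size=50,
                      scoring="accuracy", cv=5,
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))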

References

[1] Buscema, M. (1998): MetaNet: The Theory of Independent Judges, in Substance Use & Misuse, 33(2), 439-461.