It’s time to have another look at a lesser-known machine learning dataset. This one is closely related to the forest type classification dataset, since it was curated by the same researchers (Johnson (2013), Johnson & Xie (2013)). Again, the dataset is hosted on the UCI Machine Learning Repository.
Let’s load the dataset and see what we can come up with.
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# load the train/test splits that ship with the dataset
input_data_train = pd.read_csv("./data/training.csv")
input_data_test = pd.read_csv("./data/testing.csv")
# peek at the first rows of each split and the training summary statistics
display(input_data_train.head(2))
display(input_data_test.head(2))
display(input_data_train.describe())
| | class | BrdIndx | Area | Round | Bright | Compact | ShpIndx | Mean_G | Mean_R | Mean_NIR | ... | SD_NIR_140 | LW_140 | GLCM1_140 | Rect_140 | GLCM2_140 | Dens_140 | Assym_140 | NDVI_140 | BordLngth_140 | GLCM3_140 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | car | 1.27 | 91 | 0.97 | 231.38 | 1.39 | 1.47 | 207.92 | 241.74 | 244.48 | ... | 26.18 | 2.00 | 0.50 | 0.85 | 6.29 | 1.67 | 0.70 | -0.08 | 56 | 3806.36 |
| 1 | concrete | 2.36 | 241 | 1.56 | 216.15 | 2.46 | 2.51 | 187.85 | 229.39 | 231.20 | ... | 22.29 | 2.25 | 0.79 | 0.55 | 8.42 | 1.38 | 0.81 | -0.09 | 1746 | 1450.14 |
| | class | BrdIndx | Area | Round | Bright | Compact | ShpIndx | Mean_G | Mean_R | Mean_NIR | ... | SD_NIR_140 | LW_140 | GLCM1_140 | Rect_140 | GLCM2_140 | Dens_140 | Assym_140 | NDVI_140 | BordLngth_140 | GLCM3_140 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | concrete | 1.32 | 131 | 0.81 | 222.74 | 1.66 | 2.18 | 192.94 | 235.11 | 240.15 | ... | 31.15 | 5.04 | 0.80 | 0.58 | 8.56 | 0.82 | 0.98 | -0.10 | 1512 | 1287.52 |
| 1 | shadow | 1.59 | 864 | 0.94 | 47.56 | 1.41 | 1.87 | 36.82 | 48.78 | 57.09 | ... | 12.01 | 3.70 | 0.52 | 0.96 | 7.01 | 1.69 | 0.86 | -0.14 | 196 | 2659.74 |
| | BrdIndx | Area | Round | Bright | Compact | ShpIndx | Mean_G | Mean_R | Mean_NIR | SD_G | ... | SD_NIR_140 | LW_140 | GLCM1_140 | Rect_140 | GLCM2_140 | Dens_140 | Assym_140 | NDVI_140 | BordLngth_140 | GLCM3_140 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | ... | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 |
| mean | 2.008512 | 565.869048 | 1.132976 | 165.569821 | 2.077679 | 2.229881 | 161.577083 | 163.672440 | 171.459226 | 10.131369 | ... | 23.769881 | 3.098274 | 0.796488 | 0.665000 | 7.795536 | 1.594405 | 0.615357 | 0.014583 | 983.309524 | 1275.292917 |
| std | 0.634807 | 679.852886 | 0.489150 | 61.883993 | 0.699600 | 0.703572 | 63.407201 | 71.306748 | 67.973969 | 5.179409 | ... | 12.836522 | 6.101883 | 0.103930 | 0.179086 | 0.670491 | 0.460627 | 0.239900 | 0.153677 | 880.013745 | 603.658611 |
| min | 1.000000 | 10.000000 | 0.020000 | 37.670000 | 1.000000 | 1.060000 | 30.680000 | 32.210000 | 40.120000 | 4.330000 | ... | 4.020000 | 1.000000 | 0.330000 | 0.240000 | 6.290000 | 0.230000 | 0.070000 | -0.360000 | 56.000000 | 336.730000 |
| 25% | 1.537500 | 178.000000 | 0.787500 | 133.977500 | 1.547500 | 1.700000 | 91.040000 | 101.187500 | 120.165000 | 6.770000 | ... | 13.965000 | 1.395000 | 0.757500 | 0.560000 | 7.357500 | 1.325000 | 0.460000 | -0.080000 | 320.000000 | 817.405000 |
| 50% | 1.920000 | 315.000000 | 1.085000 | 164.485000 | 1.940000 | 2.130000 | 187.560000 | 160.615000 | 178.345000 | 8.010000 | ... | 21.135000 | 1.740000 | 0.810000 | 0.690000 | 7.790000 | 1.660000 | 0.620000 | -0.040000 | 776.000000 | 1187.025000 |
| 75% | 2.375000 | 667.000000 | 1.410000 | 221.895000 | 2.460000 | 2.680000 | 210.940000 | 234.815000 | 236.002500 | 11.500000 | ... | 29.957500 | 2.285000 | 0.870000 | 0.810000 | 8.260000 | 1.945000 | 0.810000 | 0.120000 | 1412.500000 | 1588.427500 |
| max | 4.190000 | 3659.000000 | 2.890000 | 244.740000 | 4.700000 | 4.300000 | 246.350000 | 253.080000 | 253.320000 | 36.400000 | ... | 60.020000 | 51.540000 | 0.950000 | 0.980000 | 9.340000 | 2.340000 | 1.000000 | 0.350000 | 6232.000000 | 3806.360000 |
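One thing describe() on the training split alone doesn't show is how the two splits compare in size, so here is a quick sanity check (my addition, not part of the original notebook):
# compare the number of rows/columns in each split
print("train:", input_data_train.shape)
print("test: ", input_data_test.shape)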
But what do they mean?!?
LEGEND
Class: Land cover class (nominal)
BrdIndx: Border Index (shape variable)
Area: Area in m2 (size variable)
Round: Roundness (shape variable)
Bright: Brightness (spectral variable)
Compact: Compactness (shape variable)
ShpIndx: Shape Index (shape variable)
Mean_G: Green (spectral variable)
Mean_R: Red (spectral variable)
Mean_NIR: Near Infrared (spectral variable)
SD_G: Standard deviation of Green (texture variable)
SD_R: Standard deviation of Red (texture variable)
SD_NIR: Standard deviation of Near Infrared (texture variable)
LW: Length/Width (shape variable)
GLCM1: A Gray-Level Co-occurrence Matrix attribute; the dataset description doesn't specify which GLCM statistic (texture variable)
Rect: Rectangularity (shape variable)
GLCM2: Another Gray-Level Co-occurrence Matrix attribute (texture variable)
Dens: Density (shape variable)
Assym: Asymmetry (shape variable)
NDVI: Normalized Difference Vegetation Index (spectral variable)
BordLngth: Border Length (shape variable)
GLCM3: Another Gray-Level Co-occurrence Matrix attribute (texture variable)
Note: These variables repeat for each coarser scale (i.e. variable_40, variable_60, … variable_140).
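To make that multi-scale naming concrete, here is a minimal sketch (my addition) that groups the feature names by their scale suffix, assuming the convention described in the legend:
# group feature names by scale suffix (base scale vs. _40 ... _140)
import re
scales = {}
for col in input_data_train.columns.drop("class"):
    m = re.search(r"_(\d+)$", col)
    scales.setdefault(m.group(1) if m else "base", []).append(col)
for scale, cols in scales.items():
    print(scale, len(cols), cols[:3])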
Let’s see if unscaled box plots are any help:
input_data_train.plot.box(figsize=(12,7), xticks=[])
plt.title('Boxplots of all features')
plt.xlabel('Feature')
plt.ylabel('')
plt.show()
Well, that doesn’t help much: the features live on completely different scales (Area alone reaches 3659, while NDVI never leaves ±0.36), so most boxes are squashed flat. Let’s have a look at the class distributions of both sets:
y_hist_train_dict = dict(input_data_train['class'].value_counts())
y_hist_test_dict = dict(input_data_test['class'].value_counts())
plt.figure(figsize=(7,5))
plt.bar(list(y_hist_train_dict.keys()), y_hist_train_dict.values(), color="black")
plt.title("Urban land classes histogram - train set")
plt.ylabel("Count")
plt.xlabel("Urban land class")
plt.tight_layout()
plt.show()
plt.figure(figsize=(7,5))
plt.bar(list(y_hist_test_dict.keys()), y_hist_test_dict.values(), color="black")
plt.title("Urban land classes histogram - test set")
plt.ylabel("Count")
plt.xlabel("Urban land class")
plt.tight_layout()
plt.show()
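Side note (my addition): the same comparison is easier to read as numbers, with the normalized class frequencies of both splits next to each other:
# relative class frequencies of train vs. test in a single table
class_dist = pd.concat(
    [
        input_data_train["class"].value_counts(normalize=True).rename("train"),
        input_data_test["class"].value_counts(normalize=True).rename("test"),
    ],
    axis=1,
)
print(class_dist.round(3))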
That’s not so good: the classes are clearly not balanced. Let’s scale the data and see what we can derive from it visually:
# categorical encoding: encoding train and test separately like this only
# yields matching codes because both splits contain the same set of class
# labels (inferred categories are sorted alphabetically)
input_data_train['class'] = pd.Categorical(input_data_train['class']).codes
input_data_test['class'] = pd.Categorical(input_data_test['class']).codes
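A more defensive variant (my suggestion, meant to replace the two lines above rather than follow them) pins down one explicit category list, so identical labels are guaranteed to map to identical codes in both splits:
# safer alternative: one shared, explicit category list for both splits;
# labels missing from it would become -1, which is easy to spot
shared_categories = sorted(input_data_train['class'].unique())
input_data_train['class'] = pd.Categorical(
    input_data_train['class'], categories=shared_categories
).codes
input_data_test['class'] = pd.Categorical(
    input_data_test['class'], categories=shared_categories
).codes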
# split data into X and y
y_train = input_data_train['class'].copy(deep=True)
X_train = input_data_train.copy(deep=True)
X_train.drop(['class'], inplace=True, axis=1)
y_test = input_data_test['class'].copy(deep=True)
X_test = input_data_test.copy(deep=True)
X_test.drop(['class'], inplace=True, axis=1)
from sklearn.preprocessing import MaxAbsScaler
X_train = X_train.astype('float64')
X_test = X_test.astype('float64')
# fit the scaler on the train split only, then apply it to both splits
scaler = MaxAbsScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
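MaxAbsScaler divides each column by the maximum absolute value it saw during fit, so the training features now live in [-1, 1]; a quick check (my addition):
# each scaled train column should peak at exactly |1.0|; test columns may
# exceed 1 where raw test values are larger than the train maximum
print(np.abs(X_train).max(axis=0).min(), np.abs(X_train).max(axis=0).max())
print(np.abs(X_test).max(axis=0).max())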
plt.figure(figsize=(13,8))
plt.boxplot(X_train, meanline=False, notch=True)
plt.title('Boxplots of all features - scaled')
plt.xlabel('Feature')
plt.ylabel('')
plt.show()
plt.figure(figsize=(11,9))
for cl in input_data_train["class"].unique():
    # plot the first sample of each class (feature columns only, so the
    # encoded class code doesn't sneak in as a fake first feature)
    example = input_data_train[input_data_train["class"] == cl].drop(columns=["class"]).values[0]
    plt.plot(example, label=cl)
plt.title("Examples for each urban land class - training set (unscaled)")
plt.legend()
plt.show()
plt.figure(figsize=(11,9))
for cl in input_data_train["class"].unique():
    # X_train is a NumPy array after scaling, so select rows with a boolean mask
    plt.plot(X_train[(input_data_train["class"] == cl).values][0], label=cl)
plt.title("Examples for each urban land class - training set (scaled)")
plt.legend()
plt.show()
Well, that looks messy. Let’s see what some ML algorithms can make of it.
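As a first baseline (a sketch of my own; the model choice and hyperparameters are assumptions, not the original author's), something like a random forest is a reasonable starting point:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# quick baseline: fit on the scaled train split, score on the test split
start = time.time()
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, y_pred):.3f} "
      f"({time.time() - start:.1f}s)")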