It’s time to have another look at a lesser-known machine learning dataset. This one is closely related to the forest type classification dataset, since both were curated by the same researchers (Johnson (2013), Johnson & Xie (2013)). Again, the dataset is hosted on the UCI Machine Learning Repository.

Let’s load the dataset and see what we can come up with.

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

input_data_train = pd.read_csv("./data/training.csv")
input_data_test = pd.read_csv("./data/testing.csv")

display(input_data_train.head(2))
display(input_data_test.head(2))
display(input_data_train.describe())
Training set, first two rows:
class BrdIndx Area Round Bright Compact ShpIndx Mean_G Mean_R Mean_NIR ... SD_NIR_140 LW_140 GLCM1_140 Rect_140 GLCM2_140 Dens_140 Assym_140 NDVI_140 BordLngth_140 GLCM3_140
0 car 1.27 91 0.97 231.38 1.39 1.47 207.92 241.74 244.48 ... 26.18 2.00 0.50 0.85 6.29 1.67 0.70 -0.08 56 3806.36
1 concrete 2.36 241 1.56 216.15 2.46 2.51 187.85 229.39 231.20 ... 22.29 2.25 0.79 0.55 8.42 1.38 0.81 -0.09 1746 1450.14

Test set, first two rows:
class BrdIndx Area Round Bright Compact ShpIndx Mean_G Mean_R Mean_NIR ... SD_NIR_140 LW_140 GLCM1_140 Rect_140 GLCM2_140 Dens_140 Assym_140 NDVI_140 BordLngth_140 GLCM3_140
0 concrete 1.32 131 0.81 222.74 1.66 2.18 192.94 235.11 240.15 ... 31.15 5.04 0.80 0.58 8.56 0.82 0.98 -0.10 1512 1287.52
1 shadow 1.59 864 0.94 47.56 1.41 1.87 36.82 48.78 57.09 ... 12.01 3.70 0.52 0.96 7.01 1.69 0.86 -0.14 196 2659.74

Training set summary statistics:
BrdIndx Area Round Bright Compact ShpIndx Mean_G Mean_R Mean_NIR SD_G ... SD_NIR_140 LW_140 GLCM1_140 Rect_140 GLCM2_140 Dens_140 Assym_140 NDVI_140 BordLngth_140 GLCM3_140
count 168.000000 168.000000 168.000000 168.000000 168.000000 168.000000 168.000000 168.000000 168.000000 168.000000 ... 168.000000 168.000000 168.000000 168.000000 168.000000 168.000000 168.000000 168.000000 168.000000 168.000000
mean 2.008512 565.869048 1.132976 165.569821 2.077679 2.229881 161.577083 163.672440 171.459226 10.131369 ... 23.769881 3.098274 0.796488 0.665000 7.795536 1.594405 0.615357 0.014583 983.309524 1275.292917
std 0.634807 679.852886 0.489150 61.883993 0.699600 0.703572 63.407201 71.306748 67.973969 5.179409 ... 12.836522 6.101883 0.103930 0.179086 0.670491 0.460627 0.239900 0.153677 880.013745 603.658611
min 1.000000 10.000000 0.020000 37.670000 1.000000 1.060000 30.680000 32.210000 40.120000 4.330000 ... 4.020000 1.000000 0.330000 0.240000 6.290000 0.230000 0.070000 -0.360000 56.000000 336.730000
25% 1.537500 178.000000 0.787500 133.977500 1.547500 1.700000 91.040000 101.187500 120.165000 6.770000 ... 13.965000 1.395000 0.757500 0.560000 7.357500 1.325000 0.460000 -0.080000 320.000000 817.405000
50% 1.920000 315.000000 1.085000 164.485000 1.940000 2.130000 187.560000 160.615000 178.345000 8.010000 ... 21.135000 1.740000 0.810000 0.690000 7.790000 1.660000 0.620000 -0.040000 776.000000 1187.025000
75% 2.375000 667.000000 1.410000 221.895000 2.460000 2.680000 210.940000 234.815000 236.002500 11.500000 ... 29.957500 2.285000 0.870000 0.810000 8.260000 1.945000 0.810000 0.120000 1412.500000 1588.427500
max 4.190000 3659.000000 2.890000 244.740000 4.700000 4.300000 246.350000 253.080000 253.320000 36.400000 ... 60.020000 51.540000 0.950000 0.980000 9.340000 2.340000 1.000000 0.350000 6232.000000 3806.360000

But what do they mean?!?

LEGEND

Class: Land cover class (nominal)

BrdIndx: Border Index (shape variable)

Area: Area in m2 (size variable)

Round: Roundness (shape variable)

Bright: Brightness (spectral variable)

Compact: Compactness (shape variable)

ShpIndx: Shape Index (shape variable)

Mean_G: Green (spectral variable)

Mean_R: Red (spectral variable)

Mean_NIR: Near Infrared (spectral variable)

SD_G: Standard deviation of Green (texture variable)

SD_R: Standard deviation of Red (texture variable)

SD_NIR: Standard deviation of Near Infrared (texture variable)

LW: Length/Width (shape variable)

GLCM1: Gray-Level Co-occurrence Matrix attribute (the dataset description doesn’t specify which GLCM metric this one is) (texture variable)

Rect: Rectangularity (shape variable)

GLCM2: Another Gray-Level Co-occurrence Matrix attribute (texture variable)

Dens: Density (shape variable)

Assym: Asymmetry (shape variable)

NDVI: Normalized Difference Vegetation Index (spectral variable)

BordLngth: Border Length (shape variable)

GLCM3: Another Gray-Level Co-occurrence Matrix attribute (texture variable)

Note: These variables repeat for each coarser scale (i.e. variable_40, variable_60, …variable_140).

(from the dataset description)
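
Since every shape, spectral, and texture variable repeats at each coarser scale, it can be handy to group the columns by that suffix before digging in. A minimal sketch of how that could look (the grouping logic below is my own convenience helper, not part of the dataset):

import re

# group the feature columns by their scale suffix; columns without a
# "_<number>" suffix belong to the base (finest) scale
scales = {}
for col in input_data_train.columns.drop("class"):
    match = re.search(r"_(\d+)$", col)
    key = match.group(1) if match else "base"
    scales.setdefault(key, []).append(col)

for key, cols in scales.items():
    print(f"scale {key}: {len(cols)} features, e.g. {cols[:3]}")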

Let’s see if unscaled box plots are any help:

input_data_train.plot.box(figsize=(12,7), xticks=[])
plt.title('Boxplots of all features')
plt.xlabel('Feature')
plt.ylabel('')
plt.show()

Well, that doesn’t help much; the features live on very different scales, so most of the boxes get squashed flat. Let’s have a look at the class distributions of both sets:

y_hist_train_dict = dict(input_data_train['class'].value_counts())
y_hist_test_dict = dict(input_data_test['class'].value_counts())

plt.figure(figsize=(7,5))
plt.bar(list(y_hist_train_dict.keys()), y_hist_train_dict.values(), color="black")
plt.title("Urban land classes histogram - train set")
plt.ylabel("Count")
plt.xlabel("Urban land class")
plt.tight_layout()
plt.show()

plt.figure(figsize=(7,5))
plt.bar(list(y_hist_test_dict.keys()), y_hist_test_dict.values(), color="black")
plt.title("Urban land classes histogram - test set")
plt.ylabel("Count")
plt.xlabel("Urban land class")
plt.tight_layout()
plt.show()
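
Before reading too much into two separate bar charts, it can help to put the relative class frequencies of both sets next to each other. This isn’t part of the original walkthrough, just a small convenience view:

# relative class frequencies of the train and test sets, side by side
class_props = pd.concat(
    {
        "train": input_data_train["class"].value_counts(normalize=True),
        "test": input_data_test["class"].value_counts(normalize=True),
    },
    axis=1,
)
display(class_props.round(3))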

That’s not so good: the classes aren’t exactly evenly represented. Let’s scale the data and see what we can derive from it visually:

# categorical encoding: pd.Categorical sorts the labels alphabetically, so the
# train and test codes match as long as both sets contain the same classes
input_data_train['class'] = pd.Categorical(input_data_train['class']).codes
input_data_test['class'] = pd.Categorical(input_data_test['class']).codes

# split data into X and y
y_train = input_data_train['class'].copy(deep=True)
X_train = input_data_train.copy(deep=True)
X_train.drop(['class'], inplace=True, axis=1)

y_test = input_data_test['class'].copy(deep=True)
X_test = input_data_test.copy(deep=True)
X_test.drop(['class'], inplace=True, axis=1)

from sklearn.preprocessing import MaxAbsScaler

X_train = X_train.astype('float64')
X_test = X_test.astype('float64')

# fit the scaler on the training data only, then apply it to both sets
scaler = MaxAbsScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
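
For context, MaxAbsScaler simply divides each column by the largest absolute value it saw in the training set, so every training feature ends up in [-1, 1] (the test features are scaled with the same training maxima and can fall slightly outside that range). A quick sanity check of that behaviour, purely as an illustration:

# the scaled training data times the per-feature maxima should reproduce the raw values
raw_train = input_data_train.drop(columns="class").to_numpy(dtype="float64")
print(np.allclose(X_train * scaler.max_abs_, raw_train))  # expect True
print(X_train.min(), X_train.max())  # everything within [-1, 1]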


plt.figure(figsize=(13,8))
plt.boxplot(X_train, meanline=False, notch=True)
plt.title('Boxplots of all features - scaled')
plt.xlabel('Feature')
plt.ylabel('')
plt.show()


plt.figure(figsize=(11,9))
for cl in input_data_train["class"].unique():
    # plot the first example of each class, without the (already encoded) class column;
    # the legend therefore shows the integer class codes
    plt.plot(input_data_train[input_data_train["class"] == cl].drop(columns="class").values[0], label=cl)
plt.title("Examples for each urban land class - training set (unscaled)")
plt.legend()
plt.show()


plt.figure(figsize=(11,9))
for cl in input_data_train["class"].unique():
    # X_train is a NumPy array after scaling, so select rows with a boolean mask
    plt.plot(X_train[(input_data_train["class"] == cl).to_numpy()][0], label=cl)
plt.title("Examples for each urban land class - training set (scaled)")
plt.legend()
plt.show()

Well, that looks messy. Let’s see what some ML algorithms make of it.