It’s time to have another look at a lesser-known machine learning dataset. This one is closely related to the forest type classification dataset, since it was curated by the same researchers (Johnson (2013), Johnson & Xie (2013)). Again, the dataset is hosted on the UCI Machine Learning Repository.
Let’s load the dataset and see what we can come up with.
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# load the train/test splits that ship with the dataset
input_data_train = pd.read_csv("./data/training.csv")
input_data_test = pd.read_csv("./data/testing.csv")
# peek at the first rows of each split and the training summary statistics
display(input_data_train.head(2))
display(input_data_test.head(2))
display(input_data_train.describe())
| | class | BrdIndx | Area | Round | Bright | Compact | ShpIndx | Mean_G | Mean_R | Mean_NIR | ... | SD_NIR_140 | LW_140 | GLCM1_140 | Rect_140 | GLCM2_140 | Dens_140 | Assym_140 | NDVI_140 | BordLngth_140 | GLCM3_140 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | car | 1.27 | 91 | 0.97 | 231.38 | 1.39 | 1.47 | 207.92 | 241.74 | 244.48 | ... | 26.18 | 2.00 | 0.50 | 0.85 | 6.29 | 1.67 | 0.70 | -0.08 | 56 | 3806.36 |
| 1 | concrete | 2.36 | 241 | 1.56 | 216.15 | 2.46 | 2.51 | 187.85 | 229.39 | 231.20 | ... | 22.29 | 2.25 | 0.79 | 0.55 | 8.42 | 1.38 | 0.81 | -0.09 | 1746 | 1450.14 |
| | class | BrdIndx | Area | Round | Bright | Compact | ShpIndx | Mean_G | Mean_R | Mean_NIR | ... | SD_NIR_140 | LW_140 | GLCM1_140 | Rect_140 | GLCM2_140 | Dens_140 | Assym_140 | NDVI_140 | BordLngth_140 | GLCM3_140 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | concrete | 1.32 | 131 | 0.81 | 222.74 | 1.66 | 2.18 | 192.94 | 235.11 | 240.15 | ... | 31.15 | 5.04 | 0.80 | 0.58 | 8.56 | 0.82 | 0.98 | -0.10 | 1512 | 1287.52 |
| 1 | shadow | 1.59 | 864 | 0.94 | 47.56 | 1.41 | 1.87 | 36.82 | 48.78 | 57.09 | ... | 12.01 | 3.70 | 0.52 | 0.96 | 7.01 | 1.69 | 0.86 | -0.14 | 196 | 2659.74 |
| | BrdIndx | Area | Round | Bright | Compact | ShpIndx | Mean_G | Mean_R | Mean_NIR | SD_G | ... | SD_NIR_140 | LW_140 | GLCM1_140 | Rect_140 | GLCM2_140 | Dens_140 | Assym_140 | NDVI_140 | BordLngth_140 | GLCM3_140 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | ... | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 | 168.000000 |
| mean | 2.008512 | 565.869048 | 1.132976 | 165.569821 | 2.077679 | 2.229881 | 161.577083 | 163.672440 | 171.459226 | 10.131369 | ... | 23.769881 | 3.098274 | 0.796488 | 0.665000 | 7.795536 | 1.594405 | 0.615357 | 0.014583 | 983.309524 | 1275.292917 |
| std | 0.634807 | 679.852886 | 0.489150 | 61.883993 | 0.699600 | 0.703572 | 63.407201 | 71.306748 | 67.973969 | 5.179409 | ... | 12.836522 | 6.101883 | 0.103930 | 0.179086 | 0.670491 | 0.460627 | 0.239900 | 0.153677 | 880.013745 | 603.658611 |
| min | 1.000000 | 10.000000 | 0.020000 | 37.670000 | 1.000000 | 1.060000 | 30.680000 | 32.210000 | 40.120000 | 4.330000 | ... | 4.020000 | 1.000000 | 0.330000 | 0.240000 | 6.290000 | 0.230000 | 0.070000 | -0.360000 | 56.000000 | 336.730000 |
| 25% | 1.537500 | 178.000000 | 0.787500 | 133.977500 | 1.547500 | 1.700000 | 91.040000 | 101.187500 | 120.165000 | 6.770000 | ... | 13.965000 | 1.395000 | 0.757500 | 0.560000 | 7.357500 | 1.325000 | 0.460000 | -0.080000 | 320.000000 | 817.405000 |
| 50% | 1.920000 | 315.000000 | 1.085000 | 164.485000 | 1.940000 | 2.130000 | 187.560000 | 160.615000 | 178.345000 | 8.010000 | ... | 21.135000 | 1.740000 | 0.810000 | 0.690000 | 7.790000 | 1.660000 | 0.620000 | -0.040000 | 776.000000 | 1187.025000 |
| 75% | 2.375000 | 667.000000 | 1.410000 | 221.895000 | 2.460000 | 2.680000 | 210.940000 | 234.815000 | 236.002500 | 11.500000 | ... | 29.957500 | 2.285000 | 0.870000 | 0.810000 | 8.260000 | 1.945000 | 0.810000 | 0.120000 | 1412.500000 | 1588.427500 |
| max | 4.190000 | 3659.000000 | 2.890000 | 244.740000 | 4.700000 | 4.300000 | 246.350000 | 253.080000 | 253.320000 | 36.400000 | ... | 60.020000 | 51.540000 | 0.950000 | 0.980000 | 9.340000 | 2.340000 | 1.000000 | 0.350000 | 6232.000000 | 3806.360000 |
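One thing describe() on the training split alone doesn't show is how the two splits compare in size, so here is a quick sanity check (my addition, not part of the original notebook):
# compare the number of rows/columns in each split
print("train:", input_data_train.shape)
print("test: ", input_data_test.shape)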
But what do they mean?!?
LEGEND
Class: Land cover class (nominal)
BrdIndx: Border Index (shape variable)
Area: Area in m2 (size variable)
Round: Roundness (shape variable)
Bright: Brightness (spectral variable)
Compact: Compactness (shape variable)
ShpIndx: Shape Index (shape variable)
Mean_G: Green (spectral variable)
Mean_R: Red (spectral variable)
Mean_NIR: Near Infrared (spectral variable)
SD_G: Standard deviation of Green (texture variable)
SD_R: Standard deviation of Red (texture variable)
SD_NIR: Standard deviation of Near Infrared (texture variable)
LW: Length/Width (shape variable)
GLCM1: A Gray-Level Co-occurrence Matrix attribute; the dataset description doesn't specify which GLCM statistic (texture variable)
Rect: Rectangularity (shape variable)
GLCM2: Another Gray-Level Co-occurrence Matrix attribute (texture variable)
Dens: Density (shape variable)
Assym: Asymmetry (shape variable)
NDVI: Normalized Difference Vegetation Index (spectral variable)
BordLngth: Border Length (shape variable)
GLCM3: Another Gray-Level Co-occurrence Matrix attribute (texture variable)
Note: These variables repeat for each coarser scale (i.e. variable_40, variable_60, … variable_140).
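To make that multi-scale naming concrete, here is a minimal sketch (my addition) that groups the feature names by their scale suffix, assuming the convention described in the legend:
# group feature names by scale suffix (base scale vs. _40 ... _140)
import re
scales = {}
for col in input_data_train.columns.drop("class"):
    m = re.search(r"_(\d+)$", col)
    scales.setdefault(m.group(1) if m else "base", []).append(col)
for scale, cols in scales.items():
    print(scale, len(cols), cols[:3])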
Let’s see if unscaled box plots are any help:
input_data_train.plot.box(figsize=(12,7), xticks=[])
plt.title('Boxplots of all features')
plt.xlabel('Feature')
plt.ylabel('')
plt.show()
Well, that doesn’t help much: the features live on completely different scales (Area alone reaches 3659, while NDVI never leaves ±0.36), so most boxes are squashed flat. Let’s have a look at the class distributions of both sets:
y_hist_train_dict = dict(input_data_train['class'].value_counts())
y_hist_test_dict = dict(input_data_test['class'].value_counts())
plt.figure(figsize=(7,5))
plt.bar(list(y_hist_train_dict.keys()), y_hist_train_dict.values(), color="black")
plt.title("Urban land classes histogram - train set")
plt.ylabel("Count")
plt.xlabel("Urban land class")
plt.tight_layout()
plt.show()
plt.figure(figsize=(7,5))
plt.bar(list(y_hist_test_dict.keys()), y_hist_test_dict.values(), color="black")
plt.title("Urban land classes histogram - test set")
plt.ylabel("Count")
plt.xlabel("Urban land class")
plt.tight_layout()
plt.show()
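Side note (my addition): the same comparison is easier to read as numbers, with the normalized class frequencies of both splits next to each other:
# relative class frequencies of train vs. test in a single table
class_dist = pd.concat(
    [
        input_data_train["class"].value_counts(normalize=True).rename("train"),
        input_data_test["class"].value_counts(normalize=True).rename("test"),
    ],
    axis=1,
)
print(class_dist.round(3))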
That’s not so good: the classes are clearly not balanced. Let’s scale the data and see what we can derive from it visually:
# categorical encoding: encoding train and test separately like this only
# yields matching codes because both splits contain the same set of class
# labels (inferred categories are sorted alphabetically)
input_data_train['class'] = pd.Categorical(input_data_train['class']).codes
input_data_test['class'] = pd.Categorical(input_data_test['class']).codes
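A more defensive variant (my suggestion, meant to replace the two lines above rather than follow them) pins down one explicit category list, so identical labels are guaranteed to map to identical codes in both splits:
# safer alternative: one shared, explicit category list for both splits;
# labels missing from it would become -1, which is easy to spot
shared_categories = sorted(input_data_train['class'].unique())
input_data_train['class'] = pd.Categorical(
    input_data_train['class'], categories=shared_categories
).codes
input_data_test['class'] = pd.Categorical(
    input_data_test['class'], categories=shared_categories
).codes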
# split data into X and y
y_train = input_data_train['class'].copy(deep=True)
X_train = input_data_train.copy(deep=True)
X_train.drop(['class'], inplace=True, axis=1)
y_test = input_data_test['class'].copy(deep=True)
X_test = input_data_test.copy(deep=True)
X_test.drop(['class'], inplace=True, axis=1)
from sklearn.preprocessing import MaxAbsScaler
X_train = X_train.astype('float64')
X_test = X_test.astype('float64')
# fit the scaler on the train split only, then apply it to both splits
scaler = MaxAbsScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
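MaxAbsScaler divides each column by the maximum absolute value it saw during fit, so the training features now live in [-1, 1]; a quick check (my addition):
# each scaled train column should peak at exactly |1.0|; test columns may
# exceed 1 where raw test values are larger than the train maximum
print(np.abs(X_train).max(axis=0).min(), np.abs(X_train).max(axis=0).max())
print(np.abs(X_test).max(axis=0).max())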
plt.figure(figsize=(13,8))
plt.boxplot(X_train, meanline=False, notch=True)
plt.title('Boxplots of all features - scaled')
plt.xlabel('Feature')
plt.ylabel('')
plt.show()
plt.figure(figsize=(11,9))
for cl in input_data_train["class"].unique():
    # plot the first sample of each class (feature columns only, so the
    # encoded class code doesn't sneak in as a fake first feature)
    example = input_data_train[input_data_train["class"] == cl].drop(columns=["class"]).values[0]
    plt.plot(example, label=cl)
plt.title("Examples for each urban land class - training set (unscaled)")
plt.legend()
plt.show()
plt.figure(figsize=(11,9))
for cl in input_data_train["class"].unique():
    # X_train is a NumPy array after scaling, so select rows with a boolean mask
    plt.plot(X_train[(input_data_train["class"] == cl).values][0], label=cl)
plt.title("Examples for each urban land class - training set (scaled)")
plt.legend()
plt.show()
Well, that looks messy. Let’s see what some ML algorithms can make of it.
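As a first baseline (a sketch of my own; the model choice and hyperparameters are assumptions, not the original author's), something like a random forest is a reasonable starting point:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# quick baseline: fit on the scaled train split, score on the test split
start = time.time()
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, y_pred):.3f} "
      f"({time.time() - start:.1f}s)")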