It’s time to look at another dataset. This time we turn to a more traditional land use classification task; in fact, the dataset is about forest type classification. It is hosted on the UCI Machine Learning Repository and originates from research by Johnson et al. (2012).
Let’s load the dataset and look at it:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# the UCI forest type dataset comes pre-split into training and testing CSVs
input_data_train = pd.read_csv("./data/training.csv")
input_data_test = pd.read_csv("./data/testing.csv")
display(input_data_train.head(2))
display(input_data_test.head(2))
display(input_data_train.describe())
 | class | b1 | b2 | b3 | b4 | b5 | b6 | b7 | b8 | b9 | ... | pred_minus_obs_H_b9 | pred_minus_obs_S_b1 | pred_minus_obs_S_b2 | pred_minus_obs_S_b3 | pred_minus_obs_S_b4 | pred_minus_obs_S_b5 | pred_minus_obs_S_b6 | pred_minus_obs_S_b7 | pred_minus_obs_S_b8 | pred_minus_obs_S_b9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | d | 39 | 36 | 57 | 91 | 59 | 101 | 93 | 27 | 60 | ... | -2.36 | -18.41 | -1.88 | -6.43 | -21.03 | -1.60 | -6.18 | -22.50 | -5.20 | -7.86 |
1 | h | 84 | 30 | 57 | 112 | 51 | 98 | 92 | 26 | 62 | ... | -2.26 | -16.27 | -1.95 | -6.25 | -18.79 | -1.99 | -6.18 | -23.41 | -8.87 | -10.83 |
 | class | b1 | b2 | b3 | b4 | b5 | b6 | b7 | b8 | b9 | ... | pred_minus_obs_H_b9 | pred_minus_obs_S_b1 | pred_minus_obs_S_b2 | pred_minus_obs_S_b3 | pred_minus_obs_S_b4 | pred_minus_obs_S_b5 | pred_minus_obs_S_b6 | pred_minus_obs_S_b7 | pred_minus_obs_S_b8 | pred_minus_obs_S_b9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | d | 67 | 51 | 68 | 115 | 69 | 111 | 136 | 31 | 67 | ... | -9.17 | -18.27 | -1.80 | -6.32 | -20.88 | -1.63 | -6.13 | -22.56 | -5.53 | -8.11 |
1 | s | 67 | 28 | 51 | 99 | 50 | 97 | 82 | 26 | 59 | ... | -2.25 | -20.13 | -2.11 | -6.35 | -21.94 | -1.22 | -6.13 | -22.20 | -3.41 | -6.57 |
 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | b8 | b9 | pred_minus_obs_H_b1 | ... | pred_minus_obs_H_b9 | pred_minus_obs_S_b1 | pred_minus_obs_S_b2 | pred_minus_obs_S_b3 | pred_minus_obs_S_b4 | pred_minus_obs_S_b5 | pred_minus_obs_S_b6 | pred_minus_obs_S_b7 | pred_minus_obs_S_b8 | pred_minus_obs_S_b9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | ... | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 | 198.000000 |
mean | 62.949495 | 41.020202 | 63.676768 | 101.409091 | 58.732323 | 100.651515 | 90.601010 | 28.691919 | 61.116162 | 50.818889 | ... | -5.594141 | -20.037576 | -1.007121 | -4.355657 | -20.996919 | -0.973737 | -4.597626 | -18.840000 | -1.570808 | -4.155859 |
std | 12.779563 | 17.832543 | 17.314545 | 14.804627 | 12.392648 | 11.190314 | 15.588861 | 8.977752 | 9.787158 | 12.842321 | ... | 9.769193 | 4.948562 | 1.783671 | 2.352311 | 6.490763 | 0.702619 | 1.736712 | 5.251095 | 1.807792 | 1.982423 |
min | 34.000000 | 25.000000 | 47.000000 | 54.000000 | 44.000000 | 84.000000 | 54.000000 | 21.000000 | 50.000000 | 7.660000 | ... | -53.530000 | -32.950000 | -8.800000 | -11.210000 | -40.370000 | -3.270000 | -8.730000 | -34.140000 | -8.870000 | -10.830000 |
25% | 54.000000 | 28.000000 | 52.000000 | 92.250000 | 49.000000 | 92.000000 | 80.000000 | 24.000000 | 55.000000 | 40.667500 | ... | -6.627500 | -23.325000 | -1.860000 | -5.790000 | -24.090000 | -1.290000 | -5.747500 | -22.237500 | -2.370000 | -5.122500 |
50% | 60.000000 | 31.500000 | 57.000000 | 99.500000 | 55.000000 | 98.000000 | 91.000000 | 25.000000 | 58.000000 | 53.030000 | ... | -2.255000 | -20.020000 | -0.970000 | -4.350000 | -20.465000 | -0.945000 | -4.540000 | -19.200000 | -1.420000 | -4.125000 |
75% | 70.750000 | 50.750000 | 69.000000 | 111.750000 | 65.000000 | 107.000000 | 101.000000 | 27.000000 | 63.000000 | 59.920000 | ... | 0.247500 | -17.787500 | -0.042500 | -2.882500 | -17.955000 | -0.642500 | -3.617500 | -16.227500 | -0.655000 | -3.105000 |
max | 105.000000 | 160.000000 | 196.000000 | 172.000000 | 98.000000 | 136.000000 | 139.000000 | 82.000000 | 109.000000 | 83.320000 | ... | 5.740000 | 5.130000 | 12.460000 | 7.370000 | 1.880000 | 3.440000 | 3.940000 | 3.670000 | 8.840000 | 7.790000 |
But what do these values mean? The dataset description tells us:
Attribute Information:
Class: ‘s’ (‘Sugi’ forest), ‘h’ (‘Hinoki’ forest), ‘d’ (‘Mixed deciduous’ forest), ‘o’ (‘Other’ non-forest land)
b1 - b9: ASTER image bands containing spectral information in the green, red, and near infrared wavelengths for three dates (Sept. 26, 2010; March 19, 2011; May 08, 2011).
pred_minus_obs_S_b1 - pred_minus_obs_S_b9: Predicted spectral values (based on spatial interpolation) minus actual spectral values for the ‘s’ class (b1-b9).
pred_minus_obs_H_b1 - pred_minus_obs_H_b9: Predicted spectral values (based on spatial interpolation) minus actual spectral values for the ‘h’ class (b1-b9).
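Before going further, it can help to make the single-letter class labels human-readable. A small sketch (the mapping comes straight from the attribute description above; the `class_names` helper is just illustrative):
# map the single-letter class labels to the names given in the dataset description
class_names = {'s': 'Sugi', 'h': 'Hinoki', 'd': 'Mixed deciduous', 'o': 'Other'}
# strip any stray whitespace before mapping, then count occurrences per class
print(input_data_train['class'].str.strip().map(class_names).value_counts())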
Let’s see if we can derive anything from box plots:
input_data_train.plot.box(figsize=(12,7), xticks=[])
plt.title('Boxplots of all features')
plt.xlabel('Feature')
plt.ylabel('')
plt.show()
Next, we’ll have a look at class distributions in both sets:
# build class-count dictionaries for the train and test sets (used by the bar plots below)
y_hist_train_dict = input_data_train['class'].value_counts().to_dict()
y_hist_test_dict = input_data_test['class'].value_counts().to_dict()
plt.figure(figsize=(7,5))
plt.bar(list(y_hist_train_dict.keys()), list(y_hist_train_dict.values()), color="black")
plt.title("Forest type histogram - train set")
plt.ylabel("Count")
plt.xlabel("Forest type")
plt.tight_layout()
plt.show()
That looks fairly evenly distributed. Let’s look at the test set:
plt.figure(figsize=(7,5))
plt.bar(list(y_hist_test_dict.keys()), list(y_hist_test_dict.values()), color="black")
plt.title("Forest type histogram - test set")
plt.ylabel("Count")
plt.xlabel("Forest type")
plt.tight_layout()
plt.show()
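For a numeric view of the same comparison, the relative class frequencies of both sets can be put side by side (a quick sketch using pandas value_counts, not part of the original walkthrough):
# relative class frequencies in train vs. test; large gaps mean the test set is distributed differently
class_dist = pd.concat([
    input_data_train['class'].value_counts(normalize=True).rename('train'),
    input_data_test['class'].value_counts(normalize=True).rename('test'),
], axis=1)
print(class_dist)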
Well, that doesn’t look so great. It’s time to scale the data and see if we can identify some distinct features visually.
# categorical encoding: turn the class letters into integer codes
input_data_train['class'] = pd.Categorical(input_data_train['class']).codes
input_data_test['class'] = pd.Categorical(input_data_test['class']).codes
# split data into X and y
y_train = input_data_train['class'].copy(deep=True)
X_train = input_data_train.copy(deep=True)
X_train.drop(['class'], inplace=True, axis=1)
y_test = input_data_test['class'].copy(deep=True)
X_test = input_data_test.copy(deep=True)
X_test.drop(['class'], inplace=True, axis=1)
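A quick shape check (a trivial sanity check, added here for illustration) confirms that both splits end up with the same 27 feature columns:
# both splits should have 27 feature columns (b1-b9 plus the two pred_minus_obs groups)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)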
from sklearn.preprocessing import MaxAbsScaler
X_train = X_train.astype('float64')
X_test = X_test.astype('float64')
scaler = MaxAbsScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
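MaxAbsScaler divides every feature by its maximum absolute value on the training set, so a quick check (added here for illustration) should show the training features inside [-1, 1]:
# training features are scaled into [-1, 1]; the test set can fall slightly outside
# because the scaler was fit on the training data only
print(X_train.min(), X_train.max())
print(X_test.min(), X_test.max())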
plt.figure(figsize=(13,8))
plt.boxplot(X_train, meanline=False, notch=True)
plt.title('Boxplots of all features - scaled')
plt.xlabel('Feature')
plt.ylabel('')
plt.show()
plt.figure(figsize=(11,9))
for cl in input_data_train["class"].unique():
    # plot the first row of each class; note that "class" now holds the integer codes
    plt.plot(input_data_train[input_data_train["class"] == cl].values[0], label=cl)
plt.title("Examples for each forest type - training set (unscaled)")
plt.legend()
plt.show()
plt.figure(figsize=(11,9))
for cl in input_data_train["class"].unique():
    plt.plot(X_train[input_data_train["class"] == cl][0], label=cl)
plt.title("Examples for each forest type - training set (scaled)")
plt.legend()
plt.show()
Let’s throw a few machine learning algorithms at it. (The neural networks are similar to the ones used for the concrete compressive strength dataset.)
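As a rough baseline sketch (these models and settings are illustrative assumptions, not necessarily the ones used in this post), a few off-the-shelf scikit-learn classifiers can be fit on the scaled features and compared by test accuracy; a Mondrian forest (e.g. from the scikit-garden package) could be dropped into the same loop.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# fit each classifier on the scaled training data and report accuracy on the test set
for name, model in [("Random forest", RandomForestClassifier(random_state=0)),
                    ("k-NN", KNeighborsClassifier()),
                    ("SVM (RBF kernel)", SVC())]:
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.3f}")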
It looks like Mondrian forests perform best.