Today, we’ll have a look at the heating technology dataset as part of my dataset exploration series.
The dataset contains data on the choice of heating systems in houses in California.
It originates from work by Kenneth Train (2003, *Discrete Choice Methods with Simulation*) and is available in several R packages (e.g. in mlogitBMA).
Let’s have a look at it.
**NB!**: There is a second dataset originating from the same source, containing 250 samples of a similar problem.
Dataset exploration
There is some information in the accompanying problem sets. However, it is hardly more than what is available elsewhere. So let's see what we've got.
import time
import os
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from skgarden import MondrianForestClassifier
import xgboost
inputFile = "./data/heatingDataset.csv"
inputData = pd.read_csv(inputFile)
display(inputData.sample(10))
display(inputData.describe())
| | IDCase | DepVar | IC_GC | IC_GR | IC_EC | IC_ER | IC_HP | OC_GC | OC_GR | OC_EC | OC_ER | OC_HP | Income | AgeHed | Rooms | Region | PB_GC | PB_GR | PB_EC | PB_ER | PB_HP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 489 | 490 | gr | 632.92 | 818.03 | 585.00 | 800.49 | 886.42 | 162.21 | 121.66 | 420.62 | 316.60 | 189.06 | 6 | 55 | 3 | ncostl | 3.901856 | 6.723903 | 1.390804 | 2.528395 | 4.688564 |
| 258 | 259 | gc | 732.12 | 948.08 | 813.66 | 988.02 | 997.33 | 159.63 | 143.01 | 451.70 | 434.16 | 225.52 | 7 | 35 | 2 | ncostl | 4.586356 | 6.629466 | 1.801328 | 2.275705 | 4.422357 |
| 16 | 17 | hp | 723.92 | 929.70 | 821.79 | 969.64 | 1058.00 | 160.40 | 147.08 | 432.58 | 446.38 | 209.57 | 6 | 55 | 2 | scostl | 4.513217 | 6.321050 | 1.899741 | 2.172230 | 5.048433 |
| 288 | 289 | gc | 846.21 | 1030.30 | 728.83 | 907.28 | 989.24 | 167.54 | 146.89 | 523.68 | 479.98 | 207.13 | 7 | 60 | 7 | scostl | 5.050794 | 7.014092 | 1.391747 | 1.890245 | 4.775938 |
| 438 | 439 | gc | 668.77 | 639.30 | 709.80 | 827.27 | 848.57 | 153.44 | 116.06 | 360.78 | 378.45 | 161.73 | 7 | 35 | 7 | ncostl | 4.358511 | 5.508358 | 1.967404 | 2.185943 | 5.246831 |
| 570 | 571 | gr | 780.76 | 811.08 | 925.92 | 861.54 | 1069.20 | 182.40 | 152.61 | 466.80 | 383.03 | 219.56 | 7 | 20 | 3 | ncostl | 4.280482 | 5.314724 | 1.983548 | 2.249276 | 4.869739 |
| 203 | 204 | gr | 744.10 | 808.98 | 741.52 | 1042.20 | 1058.70 | 175.45 | 158.48 | 507.67 | 462.53 | 230.67 | 3 | 35 | 6 | scostl | 4.241094 | 5.104619 | 1.460634 | 2.253259 | 4.589674 |
| 261 | 262 | gr | 676.59 | 676.19 | 713.75 | 966.32 | 864.39 | 157.46 | 109.57 | 414.15 | 404.63 | 177.20 | 2 | 65 | 4 | scostl | 4.296901 | 6.171306 | 1.723409 | 2.388157 | 4.878047 |
| 675 | 676 | ec | 693.74 | 995.11 | 642.63 | 731.88 | 764.26 | 154.09 | 122.40 | 412.34 | 326.78 | 165.63 | 5 | 55 | 5 | valley | 4.502174 | 8.129984 | 1.558495 | 2.239672 | 4.614261 |
| 802 | 803 | gr | 820.42 | 901.27 | 755.84 | 912.42 | 1123.60 | 160.34 | 145.76 | 483.70 | 424.58 | 222.98 | 7 | 35 | 7 | ncostl | 5.116752 | 6.183246 | 1.562621 | 2.148994 | 5.039017 |
| | IDCase | IC_GC | IC_GR | IC_EC | IC_ER | IC_HP | OC_GC | OC_GR | OC_EC | OC_ER | OC_HP | Income | AgeHed | Rooms | PB_GC | PB_GR | PB_EC | PB_ER | PB_HP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 | 900.000000 |
| mean | 450.500000 | 776.826600 | 921.770244 | 824.543478 | 983.928022 | 1046.481256 | 172.115837 | 154.471421 | 476.803011 | 429.729867 | 219.299300 | 4.641111 | 42.944444 | 4.424444 | 4.547567 | 6.006252 | 1.743155 | 2.307971 | 4.806496 |
| std | 259.951919 | 115.563049 | 138.086712 | 125.273885 | 147.155448 | 156.703000 | 25.939778 | 22.885742 | 73.153325 | 65.792627 | 32.969554 | 1.684543 | 14.094754 | 1.743847 | 0.553284 | 0.690552 | 0.216538 | 0.282903 | 0.571321 |
| min | 1.000000 | 431.830000 | 574.940000 | 469.610000 | 546.820000 | 532.320000 | 84.016000 | 77.863000 | 237.440000 | 179.950000 | 120.970000 | 2.000000 | 20.000000 | 2.000000 | 3.145842 | 4.217751 | 1.229730 | 1.481977 | 2.833901 |
| 25% | 225.750000 | 696.307500 | 825.612500 | 742.200000 | 884.572500 | 934.180000 | 155.755000 | 138.877500 | 425.035000 | 386.085000 | 197.437500 | 3.000000 | 30.000000 | 3.000000 | 4.193045 | 5.537963 | 1.595142 | 2.121147 | 4.411282 |
| 50% | 450.500000 | 778.505000 | 924.305000 | 824.840000 | 989.700000 | 1046.550000 | 172.105000 | 154.110000 | 480.055000 | 430.665000 | 220.845000 | 5.000000 | 45.000000 | 4.000000 | 4.503725 | 5.979367 | 1.729599 | 2.291828 | 4.767201 |
| 75% | 675.250000 | 855.280000 | 1013.425000 | 904.530000 | 1081.425000 | 1153.425000 | 189.175000 | 169.500000 | 525.545000 | 471.125000 | 239.740000 | 6.000000 | 55.000000 | 6.000000 | 4.854183 | 6.446233 | 1.871423 | 2.471767 | 5.125785 |
| max | 900.000000 | 1158.900000 | 1344.000000 | 1230.500000 | 1496.300000 | 1679.800000 | 248.430000 | 227.920000 | 705.360000 | 664.430000 | 318.580000 | 7.000000 | 65.000000 | 7.000000 | 8.646091 | 8.759873 | 2.632885 | 3.613393 | 7.529125 |
There is a clear naming pattern in this dataset. The target variable `DepVar` denotes the heating technology:
- gr: gas room heating system
- gc: gas central heating system
- ec: electric central heating system
- er: electric room heating system
- hp: heat pump
This is the easy part, but what do all the other features represent? The suffix is always the heating system. Therefore, we have five features containing installation costs (`IC_xx`) and five features containing operating costs (`OC_xx`). Moreover, we have four more features, namely `Income`, `AgeHed`, `Rooms`, and `Region`. The only feature that is not self-explanatory is `AgeHed`, which is the age of the head of the household.
Since I downloaded this dataset from Rdatasets, it also contains features with the `PB_` prefix. These are derived from `IC_` and `OC_`: the ratio IC/OC, i.e. the payback period in years.
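A quick sanity check on one technology confirms this relationship (a minimal sketch):
# PB_GC should equal IC_GC / OC_GC, i.e. the number of years of operation
# it takes to pay back the installation cost
payback = inputData["IC_GC"] / inputData["OC_GC"]
print((payback - inputData["PB_GC"]).abs().max())  # ~0 up to rounding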
Let’s encode the labels and see what we can derive up-front.
#OneHotEncoding would increase dimensionality
regionTmp = []
for i in inputData["Region"].values:
    if i == "ncostl":
        regionTmp.append(0)
    elif i == "scostl":
        regionTmp.append(1)
    elif i == "valley":
        regionTmp.append(2)
    elif i == "mountn":
        regionTmp.append(3)
    else:
        regionTmp.append(9999)
inputData["Region"] = regionTmp

target = []
for i in inputData["DepVar"].values:
    if i == "gc":
        target.append(0)
    elif i == "er":
        target.append(1)
    elif i == "gr":
        target.append(2)
    elif i == "hp":
        target.append(3)
    elif i == "ec":
        target.append(4)
inputData["target"] = target

inputData.drop(["IDCase", "DepVar"], axis=1, inplace=True)
plt.figure(figsize=(15,15))
sns.heatmap(inputData.corr(), annot=True, cmap="Blues")
plt.show()
bins = np.bincount(inputData["target"].values)
plt.figure()
plt.bar(["gc","er","gr","hp","ec"], bins, color='black')
plt.xticks(["gc","er","gr","hp","ec"])
plt.xlabel('Classes')
plt.ylabel('Count')
plt.title('Histogram of target classes')
plt.show()
plt.figure(figsize=(10,7))
for targetClass in inputData["target"].unique():
    plt.plot(inputData[inputData["target"] == targetClass].values[0],
             label=targetClass)
plt.xlabel("Feature")
plt.ylabel("Feature value")
plt.title("Per class example")
plt.legend()
plt.show()
Well, that looks bad. We have a highly imbalanced dataset, and not a single feature shows more than a negligible correlation with the target variable.
We’ll start using the full dataset and see if the brute-force approach works.
y = inputData["target"].values
X = inputData.copy(deep=True)
X.drop(["target"], axis=1, inplace=True)
scaler = MaxAbsScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=42)
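A minimal sketch of the brute-force comparison, using the classifiers imported at the top (the exact hyperparameters here are my assumptions):
# Fit each off-the-shelf classifier and compare test accuracy
classifiers = [
    GaussianNB(),
    KNeighborsClassifier(),
    LinearSVC(max_iter=10000),
    DecisionTreeClassifier(random_state=42),
    RandomForestClassifier(n_estimators=100, random_state=42),
    MondrianForestClassifier(),
    xgboost.XGBClassifier(),
]
for clf in classifiers:
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))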
Well, this looks absolutely unusable. I tried CatBoost as well - the proper way, with native handling of the categorical features - but it didn't help.
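The CatBoost code is not reproduced here; roughly, "the proper way" means handing it the raw frame and declaring Region as categorical. A hypothetical sketch, assuming catboost is installed:
from catboost import CatBoostClassifier

raw = pd.read_csv(inputFile)
Xcb = raw.drop(["IDCase", "DepVar"], axis=1)
Xtr, Xte, ytr, yte = train_test_split(Xcb, raw["DepVar"], test_size=0.25, random_state=42)
cb = CatBoostClassifier(verbose=0, random_state=42)
cb.fit(Xtr, ytr, cat_features=["Region"])  # no manual integer encoding needed
print(cb.score(Xte, yte))
So let's see if we can improve the rotten thing by reducing the sample size of the most prominent class.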
# Randomly remove 450 samples of the most frequent class (gc, target == 0)
dropouts = np.random.permutation(inputData[inputData["target"] == 0].index.values)[:450]
inputData.drop(dropouts, inplace=True)
The new class distribution is as follows:
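It can be checked directly:
# Class counts after undersampling
print(inputData["target"].value_counts().sort_index())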
**NB!**: Since I'm a bit lazy here, I didn't preserve the original test set!
Did it help? Well, perhaps a tiny bit, but not worth mentioning.
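As an aside, a leakage-free variant would split first and undersample only the training portion - a sketch under that assumption:
# Split first, then thin out the majority class only in the training part;
# the dropout count is scaled to the 75% training fraction
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
majorityIdx = np.where(y_train == 0)[0]
drop = np.random.permutation(majorityIdx)[:int(450 * 0.75)]
keep = np.setdiff1d(np.arange(len(y_train)), drop)
X_train, y_train = X_train[keep], y_train[keep]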
So what else could we do? Well, kick out all the cost features and see if region, age, income, and room count are enough.
inputData.drop(["IDCase","DepVar",
"IC_GC", "IC_GR", "IC_EC", "IC_ER", "IC_HP",
"OC_GC", "OC_GR", "OC_EC", "OC_ER", "OC_HP",
"PB_GC","PB_GR","PB_EC","PB_ER","PB_HP"], axis=1, inplace=True)
dropouts = np.random.permutation(np.array(inputData[inputData["target"] == 0]["target"].index.values))[:450]
inputData.drop(dropouts,inplace=True)
plt.figure(figsize=(5,5))
sns.heatmap(inputData.corr(), annot=True, cmap="Blues")
plt.show()
No clear correlation up-front. And the results are not nice either.
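The evaluation follows the same split-scale-fit recipe as before; a minimal sketch with one baseline (the classifier choice is an assumption):
# Re-evaluate a single baseline on the reduced feature set
y = inputData["target"].values
X = MaxAbsScaler().fit_transform(inputData.drop(["target"], axis=1))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))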
TPOT
So let’s see if TPOT can do some magic tricks:
import time
import os
import sys
import random
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
import matplotlib.pyplot as plt
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split, GridSearchCV
import sklearn.metrics as skmetrics
from tpot import TPOTClassifier
inputFile = "./data/heatingDataset.csv"
inputData = pd.read_csv(inputFile)
#OneHotEncoding would increase dimensionality
regionTmp = []
for i in inputData["Region"].values:
    if i == "ncostl":
        regionTmp.append(0)
    elif i == "scostl":
        regionTmp.append(1)
    elif i == "valley":
        regionTmp.append(2)
    elif i == "mountn":
        regionTmp.append(3)
    else:
        regionTmp.append(9999)
inputData["Region"] = regionTmp

target = []
for i in inputData["DepVar"].values:
    if i == "gc":
        target.append(0)
    elif i == "er":
        target.append(1)
    elif i == "gr":
        target.append(2)
    elif i == "hp":
        target.append(3)
    elif i == "ec":
        target.append(4)
inputData["target"] = target

inputData.drop(["DepVar"], axis=1, inplace=True)
y = inputData["target"].copy(deep=True)
X = inputData.copy(deep=True)
X.drop(["target","IDCase"], axis=1, inplace=True)
scaler = MaxAbsScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size=0.25,
shuffle=True,
random_state=42)
tpot = TPOTClassifier(max_time_mins=60,
verbosity=2,
n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_heating_technology.py')
Generation 1 - Current best internal CV score: 0.42960818294712
Generation 2 - Current best internal CV score: 0.43397223692357717
[...]
Generation 10 - Current best internal CV score: 0.4396282805641169
[...]
Generation 19 - Current best internal CV score: 0.44172625039347746
[...]
Generation 22 - Current best internal CV score: 0.4538688777504506
[...]
Generation 24 - Current best internal CV score: 0.458544600451764
Generation 25 - Current best internal CV score: 0.45880714935538885
[...]
Generation 29 - Current best internal CV score: 0.46641760841816876
[...]
Generation 78 - Current best internal CV score: 0.46641760841816876
60.031978683333335 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.
WARNING: TPOT may not provide a good pipeline if TPOT is stopped/interrupted in a early generation.
TPOT closed prematurely. Will use the current best pipeline.
Best pipeline: ExtraTreesClassifier(MultinomialNB(CombineDFs(VarianceThreshold(input_matrix, threshold=0.05), Binarizer(input_matrix, threshold=0.45)), alpha=0.001, fit_prior=False), bootstrap=False, criterion=gini, max_features=0.8, min_samples_leaf=6, min_samples_split=14, n_estimators=100)
0.3115942028985507
Not really more useful: the hold-out accuracy of roughly 0.31 even falls well short of the internal CV score. It looks equally shitty.
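For completeness, the best pipeline reported above corresponds roughly to the following sklearn code, similar to what tpot.export() writes into tpot_heating_technology.py (a sketch; the exported file itself may differ in details):
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import Binarizer
from tpot.builtins import StackingEstimator

# Union of a variance-filtered and a binarized copy of the features,
# stacked MultinomialNB predictions, then ExtraTrees on top
exported_pipeline = make_pipeline(
    make_union(VarianceThreshold(threshold=0.05), Binarizer(threshold=0.45)),
    StackingEstimator(estimator=MultinomialNB(alpha=0.001, fit_prior=False)),
    ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.8,
                         min_samples_leaf=6, min_samples_split=14,
                         n_estimators=100),
)
exported_pipeline.fit(X_train, y_train)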
What else could we do?
Well, so far everything has failed. Throwing neural nets at it? NO! I would really like to get some information about the data itself - and that is a problem.
I looked at some of the data grouped by region, and it didn't yield anything useful. In general, we could conclude that this dataset is too small to derive a useful model from. The most we could make of it would be some sort of statistical description. I found a few existing solutions that look nice at first glance, but I question the usefulness of such descriptions.
But let's move away from "data-driven whatever" and think about the problem for a moment. What have we got here?
- `AgeHed` (age of the head of the household) as a proxy for the household's life stage
- Income of the people living there as a proxy for what? High income == expensive house == high-quality building substance?
- `Region` as a proxy for climate and perhaps economics
- `Rooms` as a proxy for house size
and so on and so forth. So we could end up saying that heating system x is cheaper for households of age… and so on. Would that help with anything?
But even within a single city, we have a large variety of different heating solutions that fit certain building types. Depending on the local (micro)climate around a building, which can easily be regulated with trees etc., we may end up with something different. Further, correctly built houses (with proper application of computational fluid dynamics) may change a lot, and perhaps no heating tech is required at all.
Therefore, I’m not going to investigate this dataset further.
If you really ended up here because you were interested in heating system selection, then get a proper CFD-optimized building plan and build that house.