It is time to look at another lesser-known dataset. Today, we'll examine the energy efficiency dataset, which originates from Tsanas and Xifara (2012) and is hosted on the UCI Machine Learning Repository.
Introduction to the dataset
The dataset is the result of simulations performed by Tsanas and Xifara with the aim of predicting the energy performance of buildings (EPB). In particular, the goal is to estimate heating and cooling loads from eight design parameters. Let's have a look at what we have here (detailed feature explanations are in the original publication mentioned above).
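import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split

# Load the spreadsheet and keep an untouched copy with the original X1..X8, Y1, Y2 column names
inputData = pd.read_excel("./data/ENB2012_data.xlsx")
inputDataRaw = inputData.copy(deep=True)
# Human-readable names, in the same order as X1..X8, Y1, Y2
names = ["Relative Compactness",
         "Surface Area",
         "Wall Area",
         "Roof Area",
         "Overall Height",
         "Orientation",
         "Glazing Area",
         "Glazing Area Distribution",
         "Heating Load",
         "Cooling Load"]
inputData.columns = names
# `display` is provided by IPython/Jupyter
display(inputData.sample(10))
display(inputData.describe())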
Index | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Orientation | Glazing Area | Glazing Area Distribution | Heating Load | Cooling Load |
---|---|---|---|---|---|---|---|---|---|---|
513 | 0.69 | 735.0 | 294.0 | 220.50 | 3.5 | 3 | 0.25 | 5 | 12.12 | 14.97 |
684 | 0.82 | 612.5 | 318.5 | 147.00 | 7.0 | 2 | 0.40 | 4 | 28.93 | 28.20 |
134 | 0.66 | 759.5 | 318.5 | 220.50 | 3.5 | 4 | 0.10 | 2 | 11.33 | 15.00 |
525 | 0.62 | 808.5 | 367.5 | 220.50 | 3.5 | 3 | 0.25 | 5 | 13.99 | 14.61 |
244 | 0.90 | 563.5 | 318.5 | 122.50 | 7.0 | 2 | 0.10 | 5 | 29.83 | 29.82 |
608 | 0.69 | 735.0 | 294.0 | 220.50 | 3.5 | 2 | 0.40 | 2 | 14.75 | 16.44 |
506 | 0.74 | 686.0 | 245.0 | 220.50 | 3.5 | 4 | 0.25 | 5 | 11.64 | 14.81 |
562 | 0.69 | 735.0 | 294.0 | 220.50 | 3.5 | 4 | 0.40 | 1 | 14.42 | 16.87 |
642 | 0.79 | 637.0 | 343.0 | 147.00 | 7.0 | 4 | 0.40 | 3 | 42.49 | 38.81 |
673 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 3 | 0.40 | 4 | 32.67 | 33.06 |
Statistic | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Orientation | Glazing Area | Glazing Area Distribution | Heating Load | Cooling Load |
---|---|---|---|---|---|---|---|---|---|---|
count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.00000 | 768.000000 | 768.000000 | 768.00000 | 768.000000 | 768.000000 |
mean | 0.764167 | 671.708333 | 318.500000 | 176.604167 | 5.25000 | 3.500000 | 0.234375 | 2.81250 | 22.307195 | 24.587760 |
std | 0.105777 | 88.086116 | 43.626481 | 45.165950 | 1.75114 | 1.118763 | 0.133221 | 1.55096 | 10.090204 | 9.513306 |
min | 0.620000 | 514.500000 | 245.000000 | 110.250000 | 3.50000 | 2.000000 | 0.000000 | 0.00000 | 6.010000 | 10.900000 |
25% | 0.682500 | 606.375000 | 294.000000 | 140.875000 | 3.50000 | 2.750000 | 0.100000 | 1.75000 | 12.992500 | 15.620000 |
50% | 0.750000 | 673.750000 | 318.500000 | 183.750000 | 5.25000 | 3.500000 | 0.250000 | 3.00000 | 18.950000 | 22.080000 |
75% | 0.830000 | 741.125000 | 343.000000 | 220.500000 | 7.00000 | 4.250000 | 0.400000 | 4.00000 | 31.667500 | 33.132500 |
max | 0.980000 | 808.500000 | 416.500000 | 220.500000 | 7.00000 | 5.000000 | 0.400000 | 5.00000 | 43.100000 | 48.030000 |
plt.close('all')
plt.figure(figsize=(10,12))
# One boxplot per column, arranged in a two-column grid
n_rows = (inputData.shape[-1] // 2) + (inputData.shape[-1] % 2)
for idx, col in enumerate(inputData, start=1):
    plt.subplot(n_rows, 2, idx)
    plt.boxplot(inputData[col], meanline=False, notch=False, labels=[''])
    plt.title(col)
plt.tight_layout()
plt.show()
pd.plotting.scatter_matrix(inputData, figsize=(15,15));
plt.figure(figsize=(10,10))
sns.heatmap(inputData.corr(), annot=True, cmap="Blues")
plt.show()
This looks interesting: it is certainly one of the more unusual correlation matrices I've seen so far.
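To read the heatmap more systematically, one option is to list the most strongly correlated column pairs as a table. The following is only a minimal sketch: the helper name `rank_correlated_pairs` and the toy data are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd


def rank_correlated_pairs(df, top=5):
    """Return the `top` column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Boolean mask for the strict upper triangle, so each pair is listed once
    # and the diagonal (self-correlation) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by absolute value so strong negative correlations rank as high as positive ones.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)


# Hypothetical toy data, only to show the output format (not the ENB2012 values).
rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({"a": a, "b": -2.0 * a, "c": rng.normal(size=100)})
ranked = rank_correlated_pairs(demo, top=3)
print(ranked)
```

On the real data, calling `rank_correlated_pairs(inputData)` should surface the same strong pairs that stand out in the heatmap.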
Brute force approach
Let’s split the dataset and throw some machine learning algorithms at it.
# Targets: keep only Y1 (heating load) and Y2 (cooling load)
y = inputDataRaw.copy(deep=True)
y.drop(["X1","X2", "X3", "X4", "X5", "X6","X7","X8"], axis=1, inplace=True)
# Features: keep only X1..X8
X = inputDataRaw.copy(deep=True)
X.drop(["Y1","Y2"], axis=1, inplace=True)
# Note: the scaler is fitted on all rows before the split
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y, test_size=0.25, random_state=42)
# One entry per target: 0 = heating load (Y1), 1 = cooling load (Y2)
datasets = {}
datasets[0] = {'X_train': X_train,
               'X_test' : X_test,
               'y_train': y_train["Y1"].values,
               'y_test' : y_test["Y1"].values}
datasets[1] = {'X_train': X_train,
               'X_test' : X_test,
               'y_train': y_train["Y2"].values,
               'y_test' : y_test["Y2"].values}
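The model-fitting step itself is not shown here, so the following is only a sketch of what "throwing some algorithms at it" could look like, not the code that produced the results in this post. The choice of models, the helper name `evaluate_models`, and the synthetic data are assumptions; the dictionary layout mirrors one entry of `datasets`. Unlike the split above, this sketch puts `MaxAbsScaler` inside a pipeline so that it is fitted on the training rows only, and it leaves out xgboost to stay within scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with eight features (hypothetical data, not ENB2012).
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(200, 8))
y_demo = 10.0 * X_demo[:, 0] + 5.0 * X_demo[:, 4] ** 2 + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
demo_dataset = {"X_train": X_tr, "X_test": X_te, "y_train": y_tr, "y_test": y_te}

# Candidate models; this selection is an assumption made for the sketch.
models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=42),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=42),
}


def evaluate_models(models, data):
    """Fit each model on the training part and score it on the test part."""
    scores = {}
    for name, model in models.items():
        # The scaler is part of the pipeline, so it only ever sees training rows.
        pipeline = make_pipeline(MaxAbsScaler(), model)
        pipeline.fit(data["X_train"], data["y_train"])
        predictions = pipeline.predict(data["X_test"])
        scores[name] = {
            "MSE": mean_squared_error(data["y_test"], predictions),
            "MAE": mean_absolute_error(data["y_test"], predictions),
        }
    return scores


scores = evaluate_models(models, demo_dataset)
# Print the models from best to worst test MSE.
for name, s in sorted(scores.items(), key=lambda kv: kv[1]["MSE"]):
    print(f"{name}: MSE={s['MSE']:.4f}  MAE={s['MAE']:.4f}")
```

With the real data, the same function could be called once per target, for example `evaluate_models(models, datasets[0])` for the heating load, provided the features passed in are unscaled.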
How does this compare to the original publication? That is difficult to assess, since the original paper reports the (averaged?) results of 10-fold cross-validation with 100 repetitions using two algorithms, whereas we use a single hold-out split here.
Metric | Variable | Best result (this post) | Best result (original publication) |
---|---|---|---|
MSE | Heating (Y1) | 0.2035 | 1.03 ± 0.54 |
MAE | Heating (Y1) | 0.2905 | 0.51 ± 0.11 |
MSE | Cooling (Y2) | 0.9069 | 6.59 ± 1.56 |
MAE | Cooling (Y2) | 0.5345 | 1.42 ± 0.25 |
In the original publication, the best performance was achieved by random forest.
However, even if we compare our random forest or even decision tree results (rather than xgboost), we still outperform the original results by quite a margin. It is therefore difficult to understand how the original publication arrived at such poor results.
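For a comparison that is closer to the protocol described for the original paper (10-fold cross-validation with 100 repetitions), scikit-learn's `RepeatedKFold` could be combined with `cross_validate`. This is a sketch under assumptions: the data is synthetic, the model settings are arbitrary, and the number of repetitions is reduced to 3 so that the example runs quickly. Whether the paper's "±" values are standard deviations over folds is also an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic stand-in with eight features (hypothetical data, not ENB2012).
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(150, 8))
y_demo = 10.0 * X_demo[:, 0] + 5.0 * X_demo[:, 4] ** 2 + rng.normal(scale=0.1, size=150)

# 10 folds, repeated with different shuffles; set n_repeats=100 to mirror the paper's setup.
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=42)
results = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_demo,
    y_demo,
    cv=cv,
    scoring=("neg_mean_squared_error", "neg_mean_absolute_error"),
)

# scikit-learn reports negated errors so that higher is always better; flip the sign back.
mse = -results["test_neg_mean_squared_error"]
mae = -results["test_neg_mean_absolute_error"]
print(f"MSE: {mse.mean():.3f} ± {mse.std():.3f}")
print(f"MAE: {mae.mean():.3f} ± {mae.std():.3f}")
```

Reporting the mean and spread over all folds gives numbers in the same form as the right-hand column of the table, which a single hold-out split cannot provide.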
Results using TPOT
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error, mean_squared_error, mean_absolute_error
import sklearn.metrics
from tpot import TPOTRegressor
inputData = pd.read_excel("./data/ENB2012_data.xlsx")
y = inputData.copy(deep=True)
y.drop(["X1","X2", "X3", "X4", "X5", "X6","X7","X8"], axis=1, inplace=True)
X = inputData.copy(deep=True)
X.drop(["Y1","Y2"], axis=1, inplace=True)
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42)
# TPOT documents verbosity levels 0 to 3; 2 prints progress information
tpot = TPOTRegressor(max_time_mins=60,
                     verbosity=2,
                     n_jobs=-1)
tpot.fit(X_train,y_train["Y1"].values)
tpot.export('energy_efficieny_heating_load.py')
y_predictions = tpot.predict(X_test)
r2 = sklearn.metrics.r2_score(y_test["Y1"].values, y_predictions)
mae = sklearn.metrics.mean_absolute_error(y_test["Y1"].values, y_predictions)
mse = sklearn.metrics.mean_squared_error(y_test["Y1"].values, y_predictions)
rmse = np.sqrt(mse)
print("R2 score:", r2)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
R2 score: 0.9983789421178374
MAE: 0.2825495393872261
MSE: 0.17039042238066923
RMSE: 0.4127837477186683
# TPOT documents verbosity levels 0 to 3; 2 prints progress information
tpot = TPOTRegressor(max_time_mins=60,
                     verbosity=2,
                     n_jobs=-1)
tpot.fit(X_train,y_train["Y2"].values)
tpot.export('energy_efficieny_cooling_load.py')
y_predictions = tpot.predict(X_test)
r2 = sklearn.metrics.r2_score(y_test["Y2"].values, y_predictions)
mae = sklearn.metrics.mean_absolute_error(y_test["Y2"].values, y_predictions)
mse = sklearn.metrics.mean_squared_error(y_test["Y2"].values, y_predictions)
rmse = np.sqrt(mse)
print("R2 score:", r2)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
R2 score: 0.9889458605293449
MAE: 0.6541753176848094
MSE: 1.0320711651414836
RMSE: 1.0159090338910681
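The metric computation is identical for both targets, so it could be factored into a small helper. The sketch below assumes a hypothetical function name, `regression_report`, and uses made-up numbers to show the arithmetic; it is not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }


# Made-up values: every prediction is off by exactly 1.
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_demo = np.array([11.0, 19.0, 31.0, 39.0])
report = regression_report(y_true_demo, y_pred_demo)
for metric, value in report.items():
    print(f"{metric}: {value}")
```

In the TPOT runs above, this would be called as `regression_report(y_test["Y1"].values, y_predictions)` for the heating load and with `"Y2"` for the cooling load.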
The final pipelines are:
# heating load
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from tpot.builtins import StackingEstimator
from xgboost import XGBRegressor
from sklearn.preprocessing import FunctionTransformer
from copy import copy
# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'].values, random_state=None)
# Average CV score on the training set was: -0.14438142720136646
exported_pipeline = make_pipeline(
    make_union(
        make_pipeline(
            StackingEstimator(estimator=DecisionTreeRegressor(max_depth=1, min_samples_leaf=13, min_samples_split=8)),
            SelectFwe(score_func=f_regression, alpha=0.002)
        ),
        make_union(
            FunctionTransformer(copy),
            FunctionTransformer(copy)
        )
    ),
    StandardScaler(),
    XGBRegressor(learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, nthread=1, subsample=0.7000000000000001)
)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
# cooling load
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from xgboost import XGBRegressor
# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'].values, random_state=None)
# Average CV score on the training set was: -0.766588545649422
# Note: loss="ls" was renamed to "squared_error" in scikit-learn 1.0 and removed in 1.2
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GradientBoostingRegressor(
        alpha=0.9, learning_rate=0.001, loss="ls", max_depth=10,
        max_features=0.7000000000000001, min_samples_leaf=17,
        min_samples_split=20, n_estimators=100, subsample=0.1)),
    XGBRegressor(learning_rate=0.5, max_depth=9, min_child_weight=14, n_estimators=100, nthread=1, subsample=1.0)
)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
Final thoughts
The original publication is another example of the poor quality of many scientific publications, especially when numerical simulations and/or statistics and machine learning are involved. Furthermore, with almost no manual effort, TPOT produced a solution that is slightly better than the brute-force model for predicting heating loads (MSE 0.170 vs. 0.2035) and slightly worse for predicting cooling loads (MSE 1.032 vs. 0.9069).