It is time to have a look at another lesser-known dataset. Today, we’ll explore the energy efficiency dataset, which originates from Tsanas and Xifara (2012) and is hosted on the UCI Machine Learning Repository.


Introduction to the dataset

The dataset is the result of building simulations performed by Tsanas and Xifara to study the energy performance of buildings (EPB). In particular, the aim is to estimate heating and cooling loads from eight design parameters. Let’s have a look at what we’ve got here (detailed feature explanations can be found in the original publication mentioned above).

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

inputData = pd.read_excel("./data/ENB2012_data.xlsx")
inputDataRaw = inputData.copy(deep=True)  # keep a copy with the original X1-X8/Y1-Y2 column names
names = ["Relative Compactness",
         "Surface Area",
         "Wall Area",
         "Roof Area",
         "Overall Height",
         "Orientation",
         "Glazing Area",
         "Glazing Area Distribution",
         "Heating Load",
         "Cooling Load"]
inputData.columns = names
display(inputData.sample(10))
display(inputData.describe())
Relative Compactness Surface Area Wall Area Roof Area Overall Height Orientation Glazing Area Glazing Area Distribution Heating Load Cooling Load
513 0.69 735.0 294.0 220.50 3.5 3 0.25 5 12.12 14.97
684 0.82 612.5 318.5 147.00 7.0 2 0.40 4 28.93 28.20
134 0.66 759.5 318.5 220.50 3.5 4 0.10 2 11.33 15.00
525 0.62 808.5 367.5 220.50 3.5 3 0.25 5 13.99 14.61
244 0.90 563.5 318.5 122.50 7.0 2 0.10 5 29.83 29.82
608 0.69 735.0 294.0 220.50 3.5 2 0.40 2 14.75 16.44
506 0.74 686.0 245.0 220.50 3.5 4 0.25 5 11.64 14.81
562 0.69 735.0 294.0 220.50 3.5 4 0.40 1 14.42 16.87
642 0.79 637.0 343.0 147.00 7.0 4 0.40 3 42.49 38.81
673 0.98 514.5 294.0 110.25 7.0 3 0.40 4 32.67 33.06
Relative Compactness Surface Area Wall Area Roof Area Overall Height Orientation Glazing Area Glazing Area Distribution Heating Load Cooling Load
count 768.000000 768.000000 768.000000 768.000000 768.00000 768.000000 768.000000 768.00000 768.000000 768.000000
mean 0.764167 671.708333 318.500000 176.604167 5.25000 3.500000 0.234375 2.81250 22.307195 24.587760
std 0.105777 88.086116 43.626481 45.165950 1.75114 1.118763 0.133221 1.55096 10.090204 9.513306
min 0.620000 514.500000 245.000000 110.250000 3.50000 2.000000 0.000000 0.00000 6.010000 10.900000
25% 0.682500 606.375000 294.000000 140.875000 3.50000 2.750000 0.100000 1.75000 12.992500 15.620000
50% 0.750000 673.750000 318.500000 183.750000 5.25000 3.500000 0.250000 3.00000 18.950000 22.080000
75% 0.830000 741.125000 343.000000 220.500000 7.00000 4.250000 0.400000 4.00000 31.667500 33.132500
max 0.980000 808.500000 416.500000 220.500000 7.00000 5.000000 0.400000 5.00000 43.100000 48.030000
plt.close('all')
plt.figure(figsize=(10, 12))
idx = 1
for col in inputData:
    # arrange the boxplots in two columns with enough rows for all features
    plt.subplot((inputData.shape[-1] // 2) + (inputData.shape[-1] % 2), 2, idx)
    idx += 1
    plt.boxplot(inputData[col], meanline=False, notch=False, labels=[''])
    plt.title(col)
plt.tight_layout()
plt.show()

pd.plotting.scatter_matrix(inputData, figsize=(15,15));

plt.figure(figsize=(10,10))
sns.heatmap(inputData.corr(), annot=True, cmap="Blues")
plt.show()

This looks quite interesting. It is certainly one of the less common correlation matrices I’ve seen so far: several of the design parameters are strongly correlated with one another (the geometry features are not independent), and the two target loads are highly correlated with each other as well.
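To back up that impression with numbers, the following small sketch (it only assumes the inputData frame from above) ranks all column pairs by the absolute value of their correlation:

# rank all column pairs by absolute correlation (strongest first)
corr = inputData.corr()
pairs = corr.stack()
# keep each pair only once and drop the diagonal
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]
print(pairs.reindex(pairs.abs().sort_values(ascending=False).index).head(10))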

Brute force approach

Let’s split the dataset and throw some machine learning algorithms at it.

from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split

y = inputDataRaw.copy(deep=True)
y.drop(["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"], axis=1, inplace=True)
X = inputDataRaw.copy(deep=True)
X.drop(["Y1", "Y2"], axis=1, inplace=True)
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)

datasets = {}
datasets[0] = {'X_train': X_train, 
               'X_test' : X_test,
               'y_train': y_train["Y1"].values, 
               'y_test' : y_test["Y1"].values}
datasets[1] = {'X_train': X_train, 
               'X_test' : X_test,
               'y_train': y_train["Y2"].values, 
               'y_test' : y_test["Y2"].values}
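The actual model fitting is not reproduced here. As a rough sketch of the brute-force loop (assuming, in line with the comparison below, that decision trees, random forests and XGBoost were among the candidates), it could look something like this:

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from xgboost import XGBRegressor

models = {"decision tree": DecisionTreeRegressor(random_state=42),
          "random forest": RandomForestRegressor(n_estimators=100, random_state=42),
          "xgboost": XGBRegressor(n_estimators=100, random_state=42)}

# datasets[0] holds the heating load (Y1) target, datasets[1] the cooling load (Y2)
for target, data in datasets.items():
    for name, model in models.items():
        model.fit(data['X_train'], data['y_train'])
        predictions = model.predict(data['X_test'])
        print("Y{}".format(target + 1), name,
              "MSE:", mean_squared_error(data['y_test'], predictions),
              "MAE:", mean_absolute_error(data['y_test'], predictions))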


Well, how does it compare to the original publication? That is actually difficult to assess, since the original paper only provides the (presumably averaged) cross-validation results of 10 folds with 100 repetitions for two algorithms.

Metric  Variable      Best result here  Best result in original publication
MSE     Heating (Y1)  0.2035            1.03 ± 0.54
MAE     Heating (Y1)  0.2905            0.51 ± 0.11
MSE     Cooling (Y2)  0.9069            6.59 ± 1.56
MAE     Cooling (Y2)  0.5345            1.42 ± 0.25

The best performance in the original publication was achieved with random forests. However, even if we compare only our random forest or decision tree results here (rather than XGBoost), we still outperform the published results by quite a margin. It is therefore hard to understand how the original publication arrived at such weak results.

Results using TPOT

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error, mean_squared_error, mean_absolute_error
import sklearn.metrics
from tpot import TPOTRegressor

inputData = pd.read_excel("./data/ENB2012_data.xlsx")
y = inputData.copy(deep=True)
y.drop(["X1","X2", "X3", "X4", "X5", "X6","X7","X8"], axis=1, inplace=True)
X = inputData.copy(deep=True)
X.drop(["Y1","Y2"], axis=1, inplace=True)
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42)
tpot = TPOTRegressor(max_time_mins=60,
                     verbosity=2,
                     n_jobs=-1)
tpot.fit(X_train, y_train["Y1"].values)
tpot.export('energy_efficieny_heating_load.py')

y_predictions = tpot.predict(X_test)
r2 = sklearn.metrics.r2_score(y_test["Y1"].values, y_predictions)
mae = sklearn.metrics.mean_absolute_error(y_test["Y1"].values, y_predictions)
mse = sklearn.metrics.mean_squared_error(y_test["Y1"].values, y_predictions)
rmse = np.sqrt(mse)
print("R2 score:", r2)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
R2 score: 0.9983789421178374
MAE: 0.2825495393872261
MSE: 0.17039042238066923
RMSE: 0.4127837477186683
tpot = TPOTRegressor(max_time_mins=60,
                     verbosity=2,
                     n_jobs=-1)
tpot.fit(X_train, y_train["Y2"].values)
tpot.export('energy_efficieny_cooling_load.py')

y_predictions = tpot.predict(X_test)
r2 = sklearn.metrics.r2_score(y_test["Y2"].values, y_predictions)
mae = sklearn.metrics.mean_absolute_error(y_test["Y2"].values, y_predictions)
mse = sklearn.metrics.mean_squared_error(y_test["Y2"].values, y_predictions)
rmse = np.sqrt(mse)
print("R2 score:", r2)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
R2 score: 0.9889458605293449
MAE: 0.6541753176848094
MSE: 1.0320711651414836
RMSE: 1.0159090338910681

The final pipelines are:

# heating load

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from tpot.builtins import StackingEstimator
from xgboost import XGBRegressor
from sklearn.preprocessing import FunctionTransformer
from copy import copy

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# Average CV score on the training set was:-0.14438142720136646
exported_pipeline = make_pipeline(
    make_union(
        make_pipeline(
            StackingEstimator(estimator=DecisionTreeRegressor(max_depth=1, min_samples_leaf=13, min_samples_split=8)),
            SelectFwe(score_func=f_regression, alpha=0.002)
        ),
        make_union(
            FunctionTransformer(copy),
            FunctionTransformer(copy)
        )
    ),
    StandardScaler(),
    XGBRegressor(learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, nthread=1, subsample=0.7000000000000001)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
# cooling load

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from xgboost import XGBRegressor

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# Average CV score on the training set was:-0.766588545649422
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GradientBoostingRegressor(alpha=0.9, learning_rate=0.001, loss="ls", max_depth=10, max_features=0.7000000000000001, min_samples_leaf=17, min_samples_split=20, n_estimators=100, subsample=0.1)),
    XGBRegressor(learning_rate=0.5, max_depth=9, min_child_weight=14, n_estimators=100, nthread=1, subsample=1.0)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
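The exported scripts still contain TPOT’s generic placeholders ('PATH/TO/DATA/FILE', 'target'). To reuse, for example, the heating-load pipeline on the split created earlier, one can skip that loading boilerplate and refit the exported pipeline directly (a small usage sketch, not part of the export):

# refit the exported heating-load pipeline on our own train/test split
exported_pipeline.fit(X_train, y_train["Y1"].values)
heating_predictions = exported_pipeline.predict(X_test)
print("MAE:", mean_absolute_error(y_test["Y1"].values, heating_predictions))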

Final thoughts

The original publication is another example of the poor quality of many scientific publications, especially when numerical simulations and/or statistics/machine learning are involved. Furthermore, by doing basically nothing beyond running TPOT, we came up with a solution that is slightly better than the brute-force models at predicting heating loads and slightly worse at predicting cooling loads.