Revisiting Machine Learning Datasets - Energy Efficiency (HVAC heating and cooling loads)

It is time to look at another lesser-known dataset. Today, we’ll look at the energy efficiency dataset, which originates from Tsanas and Xifara (2012) and is hosted on the UCI Machine Learning Repository.

Contents

Introduction to the dataset
Brute force approach
Results using TPOT
Final thoughts

Introduction to the dataset

The dataset is the result of simulations performed by Tsanas and Xifara with the aim of predicting the energy performance of buildings (EPB). In particular, the goal is to estimate heating and cooling loads from eight design parameters. Let’s have a look at what we’ve got here (detailed feature explanations are in the original publication mentioned above).

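import pandas as pd
from IPython.display import display  # available by default in Jupyter notebooks

# Load the raw data; the file uses the generic column names X1..X8 and Y1, Y2
inputData = pd.read_excel("./data/ENB2012_data.xlsx")
# Keep an untouched copy with the original column names for modelling later
inputDataRaw = inputData.copy(deep=True)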
names = ["Relative Compactness",
         "Surface Area",
         "Wall Area",
         "Roof Area",
         "Overall Height",
         "Orientation",
         "Glazing Area",
         "Glazing Area Distribution",
         "Heating Load",
         "Cooling Load"]
inputData.columns = names
display(inputData.sample(10))
display(inputData.describe())


  
    
      

	Relative Compactness	Surface Area	Wall Area	Roof Area	Overall Height	Orientation	Glazing Area	Glazing Area Distribution	Heating Load	Cooling Load
513	0.69	735.0	294.0	220.50	3.5	3	0.25	5	12.12	14.97
684	0.82	612.5	318.5	147.00	7.0	2	0.40	4	28.93	28.20
134	0.66	759.5	318.5	220.50	3.5	4	0.10	2	11.33	15.00
525	0.62	808.5	367.5	220.50	3.5	3	0.25	5	13.99	14.61
244	0.90	563.5	318.5	122.50	7.0	2	0.10	5	29.83	29.82
608	0.69	735.0	294.0	220.50	3.5	2	0.40	2	14.75	16.44
506	0.74	686.0	245.0	220.50	3.5	4	0.25	5	11.64	14.81
562	0.69	735.0	294.0	220.50	3.5	4	0.40	1	14.42	16.87
642	0.79	637.0	343.0	147.00	7.0	4	0.40	3	42.49	38.81
673	0.98	514.5	294.0	110.25	7.0	3	0.40	4	32.67	33.06


  
    
      

	Relative Compactness	Surface Area	Wall Area	Roof Area	Overall Height	Orientation	Glazing Area	Glazing Area Distribution	Heating Load	Cooling Load
count	768.000000	768.000000	768.000000	768.000000	768.00000	768.000000	768.000000	768.00000	768.000000	768.000000
mean	0.764167	671.708333	318.500000	176.604167	5.25000	3.500000	0.234375	2.81250	22.307195	24.587760
std	0.105777	88.086116	43.626481	45.165950	1.75114	1.118763	0.133221	1.55096	10.090204	9.513306
min	0.620000	514.500000	245.000000	110.250000	3.50000	2.000000	0.000000	0.00000	6.010000	10.900000
25%	0.682500	606.375000	294.000000	140.875000	3.50000	2.750000	0.100000	1.75000	12.992500	15.620000
50%	0.750000	673.750000	318.500000	183.750000	5.25000	3.500000	0.250000	3.00000	18.950000	22.080000
75%	0.830000	741.125000	343.000000	220.500000	7.00000	4.250000	0.400000	4.00000	31.667500	33.132500
max	0.980000	808.500000	416.500000	220.500000	7.00000	5.000000	0.400000	5.00000	43.100000	48.030000

import matplotlib.pyplot as plt

# One boxplot per column, arranged in a two-column grid
plt.close('all')
plt.figure(figsize=(10,12))
idx = 1
for col in inputData:
    plt.subplot((inputData.shape[-1] // 2) + (inputData.shape[-1] % 2 ),2, idx)
    idx += 1
    plt.boxplot(inputData[col], meanline=False, notch=False, labels=[''])
    plt.title(col)
plt.tight_layout()
plt.show()

pd.plotting.scatter_matrix(inputData, figsize=(15,15));

import seaborn as sns

# Pairwise Pearson correlations of all features and both targets
plt.figure(figsize=(10,10))
sns.heatmap(inputData.corr(), annot=True, cmap="Blues")
plt.show()

This looks kind of interesting. It certainly is one of the less common correlation matrices I’ve seen so far.
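
To read such a matrix more systematically, one could list the column pairs with the largest absolute correlation. The following is only a minimal sketch on synthetic data; the helper name `strongest_pairs` and the demo columns `a`, `b`, `c` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def strongest_pairs(frame, top=3):
    """Return the `top` column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr().abs()
    cols = corr.columns
    # Indices of the upper triangle (without the diagonal), so each pair appears once
    rows, columns = np.triu_indices(len(cols), k=1)
    labels = [f"{cols[r]} / {cols[c]}" for r, c in zip(rows, columns)]
    pairs = pd.Series(corr.to_numpy()[rows, columns], index=labels)
    return pairs.sort_values(ascending=False).head(top)


# Synthetic stand-in data: "b" is an exact negative multiple of "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": -2.0 * a, "c": rng.normal(size=200)})

top_pairs = strongest_pairs(demo, top=1)
print(top_pairs)
```

Applied to the real data, the call would presumably be `strongest_pairs(inputData)`.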

Brute force approach

Let’s split the dataset and throw some machine learning algorithms at it.

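from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split

# Targets: keep only Y1 (heating load) and Y2 (cooling load)
y = inputDataRaw.copy(deep=True)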
y.drop(["X1","X2", "X3", "X4", "X5", "X6","X7","X8"], axis=1, inplace=True)
X = inputDataRaw.copy(deep=True)
X.drop(["Y1","Y2"], axis=1, inplace=True)
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y, test_size=0.25, random_state=42)

datasets = {}
datasets[0] = {'X_train': X_train, 
               'X_test' : X_test,
               'y_train': y_train["Y1"].values, 
               'y_test' : y_test["Y1"].values}
datasets[1] = {'X_train': X_train, 
               'X_test' : X_test,
               'y_train': y_train["Y2"].values, 
               'y_test' : y_test["Y2"].values}
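
The brute-force step itself could look roughly like the sketch below: loop over a few regressors, fit each one on the training part and score it on the test part. This is an assumption about the general shape of such a loop, not the exact code behind the results in this post. It uses synthetic data from `make_regression` as a stand-in for one entry of `datasets`, and it leaves out XGBoost to keep the dependencies minimal; the names `demo_dataset`, `candidates` and `scores` are made up for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with eight features, mimicking one entry of `datasets`
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=1.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
demo_dataset = {'X_train': X_tr, 'X_test': X_te, 'y_train': y_tr, 'y_test': y_te}

# Candidate models to "throw" at the data
candidates = {
    "decision tree": DecisionTreeRegressor(random_state=42),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in candidates.items():
    model.fit(demo_dataset['X_train'], demo_dataset['y_train'])
    predictions = model.predict(demo_dataset['X_test'])
    scores[name] = {
        "MSE": mean_squared_error(demo_dataset['y_test'], predictions),
        "MAE": mean_absolute_error(demo_dataset['y_test'], predictions),
    }
    print(name, scores[name])
```

With the real data, the same loop would run once for `datasets[0]` (heating load) and once for `datasets[1]` (cooling load).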

How does this compare to the original publication? That is difficult to assess, since the original paper reports the (averaged?) results of 10-fold cross-validation with 100 repetitions for two algorithms, whereas the split above is a single 75/25 train/test split.

Metric	Variable	Best result (this post)	Best result (original publication)
MSE	Heating (Y1)	0.2035	1.03 ± 0.54
MAE	Heating (Y1)	0.2905	0.51 ± 0.11
MSE	Cooling (Y2)	0.9069	6.59 ± 1.56
MAE	Cooling (Y2)	0.5345	1.42 ± 0.25

In the original publication, the best performance was achieved by random forests.
However, even if we compare our random forest or even our decision tree results (rather than XGBoost), we still outperform the original results by quite a margin. It is therefore difficult to understand how the original publication arrived at such poor results.
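
To get numbers in the same "mean ± standard deviation" form as the paper, one could evaluate with repeated k-fold cross-validation instead of a single split. The sketch below uses scikit-learn's `RepeatedKFold` on synthetic data; the choice of only 3 repetitions (instead of the 100 mentioned above) and of a small forest is purely an assumption to keep the example fast, and it does not claim to reproduce the paper's protocol.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic stand-in data with eight features
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=1.0, random_state=42)

# 10 folds, repeated 3 times with different shuffles -> 30 fitted models
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=42)
cv_results = cross_validate(
    RandomForestRegressor(n_estimators=20, random_state=42),
    X_demo,
    y_demo,
    cv=cv,
    scoring=("neg_mean_squared_error", "neg_mean_absolute_error"),
)

# scikit-learn reports negated errors, so flip the sign back
mse_folds = -cv_results["test_neg_mean_squared_error"]
mae_folds = -cv_results["test_neg_mean_absolute_error"]
print(f"MSE: {mse_folds.mean():.2f} ± {mse_folds.std():.2f}")
print(f"MAE: {mae_folds.mean():.2f} ± {mae_folds.std():.2f}")
```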

Results using TPOT

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error, mean_squared_error, mean_absolute_error
import sklearn.metrics
from tpot import TPOTRegressor

inputData = pd.read_excel("./data/ENB2012_data.xlsx")
y = inputData.copy(deep=True)
y.drop(["X1","X2", "X3", "X4", "X5", "X6","X7","X8"], axis=1, inplace=True)
X = inputData.copy(deep=True)
X.drop(["Y1","Y2"], axis=1, inplace=True)
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42)
tpot = TPOTRegressor(max_time_mins=60,
                     verbosity=21,
                     n_jobs=-1)
tpot.fit(X_train,y_train["Y1"].values)
tpot.export('energy_efficieny_heating_load.py')

y_predictions = tpot.predict(X_test)
r2 = sklearn.metrics.r2_score(y_test["Y1"].values, y_predictions)
mae = sklearn.metrics.mean_absolute_error(y_test["Y1"].values, y_predictions)
mse = sklearn.metrics.mean_squared_error(y_test["Y1"].values, y_predictions)
rmse = np.sqrt(mse)
print("R2 score:", r2)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)

R2 score: 0.9983789421178374
MAE: 0.2825495393872261
MSE: 0.17039042238066923
RMSE: 0.4127837477186683

tpot = TPOTRegressor(max_time_mins=60,
                     verbosity=21,
                     n_jobs=-1)
tpot.fit(X_train,y_train["Y2"].values)
tpot.export('energy_efficieny_cooling_load.py')

y_predictions = tpot.predict(X_test)
r2 = sklearn.metrics.r2_score(y_test["Y2"].values, y_predictions)
mae = sklearn.metrics.mean_absolute_error(y_test["Y2"].values, y_predictions)
mse = sklearn.metrics.mean_squared_error(y_test["Y2"].values, y_predictions)
rmse = np.sqrt(mse)
print("R2 score:", r2)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)

R2 score: 0.9889458605293449
MAE: 0.6541753176848094
MSE: 1.0320711651414836
RMSE: 1.0159090338910681
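
For reference, the four metrics printed above are related as in the following small worked example. The arrays are made-up toy values, not data from this post; the formulas are the standard definitions that `sklearn.metrics` implements.

```python
import numpy as np

# Toy values chosen so the arithmetic is easy to follow
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([11.0, 19.0, 32.0, 40.0])

residuals = y_true - y_pred            # [-1, 1, -2, 0]
mae = np.mean(np.abs(residuals))       # (1 + 1 + 2 + 0) / 4
mse = np.mean(residuals ** 2)          # (1 + 1 + 4 + 0) / 4
rmse = np.sqrt(mse)                    # same unit as the target

# R2 compares the squared residuals with the spread around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("R2 score:", r2)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

Since RMSE is just the square root of MSE, it is the easier one to compare against the range of the heating and cooling loads.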

The final pipelines are:

# heating load

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from tpot.builtins import StackingEstimator
from xgboost import XGBRegressor
from sklearn.preprocessing import FunctionTransformer
from copy import copy

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# Average CV score on the training set was:-0.14438142720136646
exported_pipeline = make_pipeline(
    make_union(
        make_pipeline(
            StackingEstimator(estimator=DecisionTreeRegressor(max_depth=1, min_samples_leaf=13, min_samples_split=8)),
            SelectFwe(score_func=f_regression, alpha=0.002)
        ),
        make_union(
            FunctionTransformer(copy),
            FunctionTransformer(copy)
        )
    ),
    StandardScaler(),
    XGBRegressor(learning_rate=0.1, max_depth=9, min_child_weight=1, n_estimators=100, nthread=1, subsample=0.7000000000000001)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

# cooling load

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from xgboost import XGBRegressor

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# Average CV score on the training set was:-0.766588545649422
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GradientBoostingRegressor(alpha=0.9, learning_rate=0.001, loss="ls", max_depth=10, max_features=0.7000000000000001, min_samples_leaf=17, min_samples_split=20, n_estimators=100, subsample=0.1)),
    XGBRegressor(learning_rate=0.5, max_depth=9, min_child_weight=14, n_estimators=100, nthread=1, subsample=1.0)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
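
The exported scripts still contain TPOT's placeholders ('PATH/TO/DATA/FILE', 'COLUMN_SEPARATOR') and expect the target column to be called 'target'. One possible way to bring the X1..X8 / Y1 / Y2 layout into that shape is sketched below. The helper `to_tpot_frame` is hypothetical, and random numbers stand in for the real spreadsheet.

```python
import numpy as np
import pandas as pd

# Random stand-in with the same column layout as the raw file (X1..X8, Y1, Y2)
rng = np.random.default_rng(0)
columns = [f"X{i}" for i in range(1, 9)] + ["Y1", "Y2"]
raw = pd.DataFrame(rng.random((20, 10)), columns=columns)


def to_tpot_frame(frame, target_column, other_targets):
    """Drop the unused targets and rename the chosen one to 'target'."""
    return frame.drop(columns=other_targets).rename(columns={target_column: "target"})


# Heating load (Y1) as the target; cooling load (Y2) must not leak into the features
tpot_data = to_tpot_frame(raw, target_column="Y1", other_targets=["Y2"])
features = tpot_data.drop('target', axis=1).values
print(tpot_data.columns.tolist())
print(features.shape)
```

For the cooling-load pipeline, the roles of Y1 and Y2 would simply be swapped.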

Final thoughts

The original publication is another example of the poor quality of many scientific publications, especially where numerical simulations and/or statistics and machine learning are involved. Furthermore, with almost no effort on our side, TPOT came up with a solution that is slightly better than the brute-force models at predicting heating loads (MSE 0.17 vs. 0.20) and slightly worse at predicting cooling loads (MSE 1.03 vs. 0.91).