Once in a while, a new dataset is added to the UCI Machine Learning Repository. Let’s have a look at the latest addition: the Metro Interstate Traffic Volume dataset.

The dataset combines traffic data from the Minnesota DoT with weather data from OpenWeatherMap.

Hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN. Hourly weather features and holidays included for impacts on traffic volume.

https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

Let’s load the dataset and have a look at it.

import time
sys_start = time.time()
import pandas as pd
import numpy as np
import seaborn as sns
import pickle
import matplotlib.pyplot as plt
%matplotlib inline

# not very gentlemanly, but it keeps the notebook output clean ;)
import warnings
warnings.simplefilter('ignore')


from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from skgarden import MondrianForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, make_scorer, mean_absolute_error, median_absolute_error
import xgboost


input_data = pd.read_csv("./data/Metro_Interstate_Traffic_Volume.csv")
display(input_data.sample(10))
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
4457 None 268.730 0.00 0.0 1 Clear sky is clear 2013-03-22 14:00:00 5807
31822 None 279.220 0.00 0.0 90 Clouds overcast clouds 2017-03-17 13:00:00 5857
46859 None 293.550 0.00 0.0 90 Rain heavy intensity rain 2018-08-20 08:00:00 5349
31249 None 272.950 0.00 0.0 90 Clouds overcast clouds 2017-02-23 17:00:00 6257
12900 None 278.970 0.00 0.0 20 Clouds few clouds 2014-03-14 00:00:00 673
31078 None 269.979 0.00 0.0 0 Clear Sky is Clear 2017-02-17 05:00:00 2577
26327 None 292.150 10.16 0.0 90 Rain heavy intensity rain 2016-09-05 04:00:00 278
33307 None 283.260 0.00 0.0 75 Clouds broken clouds 2017-05-02 12:00:00 4693
32796 None 281.890 0.00 0.0 1 Clear sky is clear 2017-04-18 00:00:00 596
20656 None 255.510 0.00 0.0 90 Haze haze 2016-01-16 12:00:00 4601

We need to encode the categorical columns as integer codes and extract calendar features from the timestamp:

# categorical encoding: map each string category to an integer code
input_data['holiday'] = pd.Categorical(input_data['holiday']).codes
input_data['weather_main'] = pd.Categorical(input_data['weather_main']).codes
input_data['weather_description'] = pd.Categorical(input_data['weather_description']).codes

# split the 'YYYY-MM-DD HH:MM:SS' timestamp into calendar features
date_time = pd.to_datetime(input_data['date_time'])
input_data['year'] = date_time.dt.year
input_data['month'] = date_time.dt.month
input_data['day'] = date_time.dt.day
input_data['hour'] = date_time.dt.hour
input_data.drop(['date_time'], axis=1, inplace=True)

display(input_data.sample(10))
holiday temp rain_1h snow_1h clouds_all weather_main weather_description traffic_volume year month day hour
10428 7 262.96 0.0 0.0 64 8 16 508 2013 12 5 0
34836 7 288.89 0.0 0.0 40 1 24 2042 2017 6 25 22
35513 7 292.28 0.0 0.0 20 3 5 6446 2017 7 20 7
30541 7 272.65 0.0 0.0 1 0 27 6054 2017 1 27 17
42067 7 274.79 0.0 0.0 90 8 15 3586 2018 3 5 10
19578 7 276.49 0.0 0.0 90 2 3 538 2015 11 19 0
26579 7 288.13 0.0 0.0 1 0 27 4405 2016 9 13 11
19624 7 265.84 0.0 0.0 40 1 24 299 2015 11 22 4
5113 7 271.15 0.0 0.0 90 8 10 459 2013 4 14 5
13139 7 265.16 0.0 0.0 20 1 4 903 2014 3 23 23

It’s time to visualize the distribution of the features.
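A quick sketch of what that could look like (the original plots are not reproduced here; plain pandas/seaborn calls stand in):

# Histograms for every column; after the encoding step all columns are numeric.
input_data.hist(bins=50, figsize=(16, 12))
plt.tight_layout()
plt.show()

# Box plots of the target per hour, to check how traffic volume varies with the time of day.
plt.figure(figsize=(14, 5))
sns.boxplot(x='hour', y='traffic_volume', data=input_data)
plt.show()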

Let’s throw a bunch of algorithms at it and see what happens ;).
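A minimal sketch of that brute-force pipeline, using the models imported above (default hyperparameters stand in for the actual grid search; SVR is skipped here since it scales poorly to ~48k rows):

# Hold out a test set, fit each regressor, and compare R^2 / MAE on held-out data.
X = input_data.drop(['traffic_volume'], axis=1)
y = input_data['traffic_volume']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tree-based models don't need scaling, but the linear ones benefit from it.
scaler = MaxAbsScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    'LinearRegression': LinearRegression(),
    'BayesianRidge': BayesianRidge(),
    'DecisionTree': DecisionTreeRegressor(),
    'RandomForest': RandomForestRegressor(n_estimators=100),
    'MondrianForest': MondrianForestRegressor(n_estimators=100),
    'XGBoost': xgboost.XGBRegressor(),
}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    print(f"{name:16s} R2={r2_score(y_test, pred):.3f} "
          f"MAE={mean_absolute_error(y_test, pred):.1f}")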

So far we predicted traffic volume based on measurements/predictions of various environmental conditions. However, we could treat this as a time series as well.
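One way to do that (a sketch, not a full treatment: the hourly index has gaps and duplicates that a serious attempt would have to handle) is to add lagged copies of the target as features and keep the train/test split chronological:

# Lag features turn the problem into autoregression.
# Assumes the rows are in chronological order, as in the raw CSV.
ts_data = input_data.copy()
for lag in (1, 2, 3, 24):           # previous hours and the same hour yesterday
    ts_data[f'lag_{lag}'] = ts_data['traffic_volume'].shift(lag)
ts_data = ts_data.dropna()

# No shuffling for time series: train on the past, test on the future.
split = int(len(ts_data) * 0.8)
X_ts = ts_data.drop(['traffic_volume'], axis=1)
y_ts = ts_data['traffic_volume']
model = RandomForestRegressor(n_estimators=100)
model.fit(X_ts[:split], y_ts[:split])
print("R2:", r2_score(y_ts[split:], model.predict(X_ts[split:])))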

These results are not satisfactory, but I have no intention of digging deeper: I basically wanted a quick look at the dataset and ended up throwing my brute-force pipeline at it ;).