Once in a while, a new dataset is added to the UCI Machine Learning Repository. Let's have a look at the latest one there: the Metro Interstate Traffic Volume dataset. The traffic data originates from the Minnesota Department of Transportation (MnDOT), and the weather data from OpenWeatherMap.
Hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN. Hourly weather features and holidays included for impacts on traffic volume.
– https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume
Let’s load the dataset and have a look at it.
import time
sys_start = time.time()
import pandas as pd
import numpy as np
import seaborn as sns
import pickle
import matplotlib.pyplot as plt
%matplotlib inline
# not gentlemanly, but it helps to keep the notebook clean ;)
import warnings
warnings.simplefilter('ignore')
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from skgarden import MondrianForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, make_scorer, mean_absolute_error, median_absolute_error
import xgboost
input_data = pd.read_csv("./data/Metro_Interstate_Traffic_Volume.csv")
display(input_data.sample(10))
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 4457 | None | 268.730 | 0.00 | 0.0 | 1 | Clear | sky is clear | 2013-03-22 14:00:00 | 5807 |
| 31822 | None | 279.220 | 0.00 | 0.0 | 90 | Clouds | overcast clouds | 2017-03-17 13:00:00 | 5857 |
| 46859 | None | 293.550 | 0.00 | 0.0 | 90 | Rain | heavy intensity rain | 2018-08-20 08:00:00 | 5349 |
| 31249 | None | 272.950 | 0.00 | 0.0 | 90 | Clouds | overcast clouds | 2017-02-23 17:00:00 | 6257 |
| 12900 | None | 278.970 | 0.00 | 0.0 | 20 | Clouds | few clouds | 2014-03-14 00:00:00 | 673 |
| 31078 | None | 269.979 | 0.00 | 0.0 | 0 | Clear | Sky is Clear | 2017-02-17 05:00:00 | 2577 |
| 26327 | None | 292.150 | 10.16 | 0.0 | 90 | Rain | heavy intensity rain | 2016-09-05 04:00:00 | 278 |
| 33307 | None | 283.260 | 0.00 | 0.0 | 75 | Clouds | broken clouds | 2017-05-02 12:00:00 | 4693 |
| 32796 | None | 281.890 | 0.00 | 0.0 | 1 | Clear | sky is clear | 2017-04-18 00:00:00 | 596 |
| 20656 | None | 255.510 | 0.00 | 0.0 | 90 | Haze | haze | 2016-01-16 12:00:00 | 4601 |
Next, we encode the categorical columns as integer codes and split the timestamp into separate date/time features:
# encode the categorical string columns as integer codes
input_data['holiday'] = pd.Categorical(input_data['holiday']).codes
input_data['weather_main'] = pd.Categorical(input_data['weather_main']).codes
input_data['weather_description'] = pd.Categorical(input_data['weather_description']).codes
# split the timestamp into separate year/month/day/hour features
date_time = pd.to_datetime(input_data['date_time'])
input_data['year'] = date_time.dt.year
input_data['month'] = date_time.dt.month
input_data['day'] = date_time.dt.day
input_data['hour'] = date_time.dt.hour
input_data.drop(['date_time'], axis=1, inplace=True)
display(input_data.sample(10))
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | traffic_volume | year | month | day | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10428 | 7 | 262.96 | 0.0 | 0.0 | 64 | 8 | 16 | 508 | 2013 | 12 | 5 | 0 |
| 34836 | 7 | 288.89 | 0.0 | 0.0 | 40 | 1 | 24 | 2042 | 2017 | 6 | 25 | 22 |
| 35513 | 7 | 292.28 | 0.0 | 0.0 | 20 | 3 | 5 | 6446 | 2017 | 7 | 20 | 7 |
| 30541 | 7 | 272.65 | 0.0 | 0.0 | 1 | 0 | 27 | 6054 | 2017 | 1 | 27 | 17 |
| 42067 | 7 | 274.79 | 0.0 | 0.0 | 90 | 8 | 15 | 3586 | 2018 | 3 | 5 | 10 |
| 19578 | 7 | 276.49 | 0.0 | 0.0 | 90 | 2 | 3 | 538 | 2015 | 11 | 19 | 0 |
| 26579 | 7 | 288.13 | 0.0 | 0.0 | 1 | 0 | 27 | 4405 | 2016 | 9 | 13 | 11 |
| 19624 | 7 | 265.84 | 0.0 | 0.0 | 40 | 1 | 24 | 299 | 2015 | 11 | 22 | 4 |
| 5113 | 7 | 271.15 | 0.0 | 0.0 | 90 | 8 | 10 | 459 | 2013 | 4 | 14 | 5 |
| 13139 | 7 | 265.16 | 0.0 | 0.0 | 20 | 1 | 4 | 903 | 2014 | 3 | 23 | 23 |
It's time to visualize the distributions of the features.
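A quick grid of histograms is enough for a first impression. A minimal sketch (sns.histplot needs seaborn 0.11+; on older versions sns.distplot would be the rough equivalent):

# one histogram per column to eyeball the distributions
fig, axes = plt.subplots(4, 3, figsize=(15, 12))
for ax, col in zip(axes.ravel(), input_data.columns):
    sns.histplot(input_data[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()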
Let’s throw a bunch of algorithms at it and see what happens ;).
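The full pipeline involves GridSearchCV and scaling, but a minimal sketch of the bake-off (reusing the imports from the first cell) could look like the following. The model selection, default hyperparameters, and the single 80/20 hold-out split are my illustrative choices, not tuned settings:

# split off the target and hold out a test set
X = input_data.drop('traffic_volume', axis=1)
y = input_data['traffic_volume']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# fit each regressor with default settings and compare hold-out metrics
models = {
    'LinearRegression': LinearRegression(),
    'BayesianRidge': BayesianRidge(),
    'DecisionTree': DecisionTreeRegressor(random_state=42),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'XGBoost': xgboost.XGBRegressor(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: R^2={r2_score(y_test, pred):.3f}  "
          f"MAE={mean_absolute_error(y_test, pred):.1f}")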
So far we have predicted traffic volume from measurements/predictions of various environmental conditions at the same hour. However, we could also treat this as a time series.
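For example, lagged values of the target could be fed back in as features. A rough sketch of the idea (the lag choices are arbitrary, and the raw data has gaps and duplicate hours, so a serious attempt would first resample to a clean hourly index):

# illustrative sketch: add lagged copies of the target as features
# (assumes rows are in chronological order, as in the raw CSV)
ts_data = input_data.copy()
for lag in (1, 2, 3, 24):  # previous hours, plus the same hour a day earlier
    ts_data[f'volume_lag_{lag}'] = ts_data['traffic_volume'].shift(lag)
ts_data.dropna(inplace=True)  # drop the first rows, which have no lag history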
These results are not satisfactory. I don't intend to dig any deeper, since I basically wanted a quick look at the dataset and ended up throwing my brute-force pipeline at it ;).