Let’s have a look at the CT Slice Localization dataset by Graf et al. (2011) as part of my series on exploring lesser-known datasets.



Dataset exploration

Let’s load the dataset to understand the basic idea behind it.

import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# not gentlemen-like but it helps to keep the notebook clean ;)
import warnings
warnings.simplefilter('ignore')

filepath_input_data = "./data/slice_localization_data.csv"
input_data = pd.read_csv(filepath_input_data)
display(input_data.head(5))
display(input_data.tail(5))
display(input_data.describe())
patientId value0 value1 value2 value3 value4 value5 value6 value7 value8 ... value375 value376 value377 value378 value379 value380 value381 value382 value383 reference
0 0 0.0 0.0 0.0 0.0 0.0 0.0 -0.25 -0.25 -0.25 ... -0.25 0.980381 0.0 0.0 0.0 0.0 0.0 -0.25 -0.25 21.803851
1 0 0.0 0.0 0.0 0.0 0.0 0.0 -0.25 -0.25 -0.25 ... -0.25 0.977008 0.0 0.0 0.0 0.0 0.0 -0.25 -0.25 21.745726
2 0 0.0 0.0 0.0 0.0 0.0 0.0 -0.25 -0.25 -0.25 ... -0.25 0.977008 0.0 0.0 0.0 0.0 0.0 -0.25 -0.25 21.687600
3 0 0.0 0.0 0.0 0.0 0.0 0.0 -0.25 -0.25 -0.25 ... -0.25 0.977008 0.0 0.0 0.0 0.0 0.0 -0.25 -0.25 21.629474
4 0 0.0 0.0 0.0 0.0 0.0 0.0 -0.25 -0.25 -0.25 ... -0.25 0.976833 0.0 0.0 0.0 0.0 0.0 -0.25 -0.25 21.571348
patientId value0 value1 value2 value3 value4 value5 value6 value7 value8 ... value375 value376 value377 value378 value379 value380 value381 value382 value383 reference
53495 96 0.591906 0.357764 0.000000 0.000000 0.552321 0.795304 0.946697 0.952227 0.84395 ... 0.00 0.0 0.0 0.000000 0.000000 0.0 0.0 0.00 0.00 29.290398
53496 96 0.612313 0.000000 0.000000 0.000000 0.864160 0.820531 0.000000 0.938813 0.94374 ... 0.00 0.0 0.0 0.000000 0.000000 0.0 0.0 0.00 0.00 27.945721
53497 96 0.612313 0.000000 0.000000 0.000000 0.864160 0.820531 0.000000 0.938813 0.94374 ... 0.00 0.0 0.0 0.000000 0.000000 0.0 0.0 0.00 0.00 27.945721
53498 96 0.634921 0.904555 0.956087 0.980208 0.157664 0.000000 -0.250000 -0.250000 -0.25000 ... -0.25 0.0 0.0 0.994967 0.806688 0.0 0.0 -0.25 -0.25 14.582997
53499 96 0.654321 0.891021 0.882244 0.979282 0.000000 0.000000 -0.250000 -0.250000 -0.25000 ... -0.25 0.0 0.0 0.994671 0.000000 0.0 0.0 -0.25 -0.25 14.498955
patientId value0 value1 value2 value3 value4 value5 value6 value7 value8 ... value375 value376 value377 value378 value379 value380 value381 value382 value383 reference
count 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000 ... 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000 53500.000000
mean 47.075701 0.059627 0.071558 0.145819 0.218728 0.274762 0.276189 0.204531 0.062281 -0.042025 ... -0.029404 0.182913 0.320112 0.359373 0.342889 0.266091 0.083049 -0.031146 -0.154524 47.028039
std 27.414240 0.174243 0.196921 0.300270 0.359163 0.378862 0.369605 0.351294 0.292232 0.268391 ... 0.085817 0.383333 0.463517 0.478188 0.471811 0.437633 0.279734 0.098738 0.122491 22.347042
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.250000 -0.250000 -0.250000 -0.250000 ... -0.250000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.250000 -0.250000 -0.250000 1.738733
25% 23.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.250000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.250000 29.891607
50% 46.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.250000 43.987893
75% 70.000000 0.000000 0.000000 0.000000 0.446429 0.684477 0.662382 0.441412 0.000000 0.000000 ... 0.000000 0.000000 0.996286 0.999677 0.999560 0.949478 0.000000 0.000000 0.000000 63.735059
max 96.000000 1.000000 1.000000 1.000000 1.000000 0.998790 0.996468 0.999334 1.000000 1.000000 ... 0.961279 1.000000 1.000000 1.000000 1.000000 1.000000 0.999857 0.996839 0.942851 97.489115

The data comes without a description of its columns. Hence, we have to read the documentation on the UCI ML repository. The first column is the patient ID; the last one, reference, is the relative location of the slice, ranging from 0 to 180, where 0 denotes a slice at the top of the head and 180 the soles of the feet. The remaining 384 columns represent two histograms: value0 to value239 form the histogram of the bone structure of a slice, and value240 to value383 the histogram of detected air inclusions. Let’s visualize both histograms. To do so, we compute the mean and the standard deviation of every histogram bin:

# columns value0 ... value239 hold the bone structure histogram
inputrange_0 = ["value"+str(i) for i in range(240)]

# mean and standard deviation of every histogram bin
bone_structure_histogram_mean = input_data[inputrange_0].mean().values
bone_structure_histogram_std = input_data[inputrange_0].std().values

x = np.arange(len(bone_structure_histogram_mean))
bones_lower_bound = bone_structure_histogram_mean - bone_structure_histogram_std
bones_upper_bound = bone_structure_histogram_mean + bone_structure_histogram_std

plt.figure(figsize=(10,7))
plt.plot(x, bone_structure_histogram_mean, color="black")
# shaded band: mean +/- one standard deviation
plt.fill_between(x,bones_lower_bound,bones_upper_bound,color="gray")
plt.xlabel("histogram bin")
plt.ylabel("value")
plt.title("Bone structure histogram - mean values and std")
plt.tight_layout()
plt.show()


# columns value240 ... value383 hold the air inclusions histogram
inputrange_1 = ["value"+str(i) for i in range(240,384)]

# mean and standard deviation of every histogram bin
air_inclusions_histogram_mean = input_data[inputrange_1].mean().values
air_inclusions_histogram_std = input_data[inputrange_1].std().values

x = np.arange(len(air_inclusions_histogram_mean))
air_lower_bound = air_inclusions_histogram_mean - air_inclusions_histogram_std
air_upper_bound = air_inclusions_histogram_mean + air_inclusions_histogram_std

plt.figure(figsize=(10,7))
plt.plot(x, air_inclusions_histogram_mean, color="black")
# shaded band: mean +/- one standard deviation
plt.fill_between(x,air_lower_bound,air_upper_bound,color="gray")
plt.xlabel("histogram bin")
plt.ylabel("value")
plt.title("Air inclusions histogram - mean values and std")
plt.tight_layout()
plt.show()


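In both histograms, the mean and the standard deviation vary considerably from bin to bin, while the individual values in the summary above stay between -0.25 and 1.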
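
The per-bin statistics behind the two plots can also be computed with one small helper. The following is only a sketch: it assumes the column layout from the UCI documentation (value0 to value239 for bone structure, value240 to value383 for air inclusions), and both the helper name `histogram_stats` and the random toy frame are made up for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

N_BONE_BINS = 240  # value0 ... value239
N_AIR_BINS = 144   # value240 ... value383


def histogram_stats(df, start, stop):
    """Per-bin mean and standard deviation for columns value<start> ... value<stop-1>."""
    columns = [f"value{i}" for i in range(start, stop)]
    block = df[columns]
    # pandas uses the sample standard deviation (ddof=1), same as describe()
    return block.mean().to_numpy(), block.std().to_numpy()


# Toy frame with the same column names as the dataset (random numbers, not real data)
rng = np.random.default_rng(0)
n_bins = N_BONE_BINS + N_AIR_BINS
toy = pd.DataFrame(
    rng.uniform(-0.25, 1.0, size=(10, n_bins)),
    columns=[f"value{i}" for i in range(n_bins)],
)

bone_mean, bone_std = histogram_stats(toy, 0, N_BONE_BINS)
air_mean, air_std = histogram_stats(toy, N_BONE_BINS, n_bins)
print(bone_mean.shape, air_mean.shape)
```

With the real data, passing `input_data` instead of `toy` should give the same arrays as the two code cells above.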

Preprocessing and ML algorithms

Let’s scale all columns, including the target, by their maximum absolute value to aim for fast convergence:

from sklearn.preprocessing import MaxAbsScaler
input_data.drop(["patientId"],axis=1, inplace=True)
input_data_scaled_df = input_data.copy()
scaler = MaxAbsScaler()
input_data_scaled = scaler.fit_transform(input_data)
input_data_scaled_df.loc[:,:] = input_data_scaled
display(input_data_scaled_df.head(3))

# The predictions have a physical/real-world meaning, hence we need the unscaled values later on.
# Inverse-transforming a row of ones yields the scaling factor of every column.
extract_scaling_function = np.ones((1,input_data_scaled_df.shape[1]))
extract_scaling_function = scaler.inverse_transform(extract_scaling_function)


# split data into X and y
y_df = input_data_scaled_df['reference'].copy()
X_df = input_data_scaled_df.copy()
X_df.drop('reference', axis=1, inplace=True)

from sklearn.model_selection import GridSearchCV, train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_df.values, y_df.values,test_size=0.2, random_state=42, shuffle=True)
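
Since the target column is scaled together with the features, predictions come out in scaled units and have to be multiplied by the target's scaling factor to get slice positions again. Here is a minimal sketch of that step on made-up toy data. The array `data` and the "predictions" are assumptions for illustration; only `MaxAbsScaler`, its `max_abs_` attribute and `inverse_transform` are actual scikit-learn API.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Toy data: two feature columns and the target in the last column
data = np.array([
    [0.0, -0.25, 20.0],
    [0.5,  1.00, 40.0],
    [1.0,  0.50, 80.0],
])

scaler = MaxAbsScaler()
scaled = scaler.fit_transform(data)

# max_abs_ holds the per-column maximum absolute value used as divisor
target_scale = scaler.max_abs_[-1]

# Inverse-transforming a row of ones gives the same per-column factors
factors_from_ones = scaler.inverse_transform(np.ones((1, data.shape[1])))

# Pretend the scaled target values are model predictions
y_pred_scaled = scaled[:, -1]
y_pred = y_pred_scaled * target_scale
print(y_pred)
```

In the notebook above, `extract_scaling_function[0, -1]` should hold this factor for the reference column, because patientId was dropped before scaling and reference is the last remaining column.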

As machine learning algorithms, we can reuse exactly the same ones that I used for the regression problem of predicting concrete compressive strength.

Results

After a few hours of training, we will end up with these results:

It looks like the deepest neural network predicts the CT slice position best.

References

F. Graf, H.-P. Kriegel, M. Schubert, S. Poelsterl, A. Cavallaro: 2D Image Registration in CT Images using Radial Image Descriptors. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), Toronto, Canada, 2011.