Let’s have a look at another dataset with physical meaning as part of my series exploring lesser-known datasets. This time we’ll look at a dataset that originates from RDatasets.

This dataset contains rock permeabilities obtained from cross-sections of cores.

Twelve core samples from petroleum reservoirs were sampled by 4 cross-sections. Each core sample was measured for permeability, and each cross-section has total area of pores, total perimeter of pores, and shape.

Source: Data from BP Research, image analysis by Ronit Katz, U. Oxford.

https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/rock.html

Unfortunately, that is all the information I could find on this dataset (without spending too much time). Therefore, we don’t know in which direction the permeability was measured. Certainly, no full permeability tensor (3x3, i.e. 9 components) was derived.
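For context, this is what a full permeability tensor would look like. The block below is the standard textbook form of Darcy’s law for an anisotropic medium, not something taken from the dataset documentation; the symbols (q for the Darcy flux, μ for the fluid viscosity, p for the pressure) are assumptions introduced here for illustration.

```latex
\mathbf{q} = -\frac{1}{\mu}\,\mathbf{K}\,\nabla p,
\qquad
\mathbf{K} =
\begin{pmatrix}
k_{xx} & k_{xy} & k_{xz} \\
k_{yx} & k_{yy} & k_{yz} \\
k_{zx} & k_{zy} & k_{zz}
\end{pmatrix}
```

A single Perm value per core, as in this dataset, corresponds to one scalar measurement along an unknown direction rather than to the nine components of K.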

Let’s load the dataset and have a look at it:

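import pandas as pd

# display() is available by default in Jupyter/IPython notebooks
input_data = pd.read_csv("./data/rock.csv")
display(input_data.sample(10))  # 10 random rows
display(input_data.describe())  # summary statistics per column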
     Area     Peri     Shape   Perm
11   8624  3986.24  0.148141  119.0
39   5267  1644.96  0.253832  100.0
27   5246  1585.42  0.133083  740.0
7    8209  4344.75  0.164127   17.1
0    4990  2791.90  0.090330    6.3
21  11876  4353.14  0.291029  142.0
36   3469  1376.70  0.176969  100.0
5    7979  4010.15  0.167045   17.1
35   7894  1461.06  0.276016  950.0
2    7558  3930.66  0.183312    6.3

               Area         Peri      Shape         Perm
count     48.000000    48.000000  48.000000    48.000000
mean    7187.729167  2682.211938   0.218110   415.450000
std     2683.848862  1431.661164   0.083496   437.818226
min     1016.000000   308.642000   0.090330     6.300000
25%     5305.250000  1414.907500   0.162262    76.450000
50%     7487.000000  2536.195000   0.198862   130.500000
75%     8869.500000  3989.522500   0.262670   777.500000
max    12212.000000  4864.220000   0.464125  1300.000000

Since the dataset is really small, we are going to train the models using 4-fold cross-validation and score them on the whole dataset. A proper train-validation-test split wouldn’t make much sense here because we only have 48 rows: 4 cross-sections for each of the 12 core samples. Let’s throw some algorithms at it and see if we end up with something useful.
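To make the idea concrete, here is a minimal, dependency-free sketch of 4-fold cross-validation with a k-nearest-neighbours regressor. It is not the code used for this post: it runs only on the ten sample rows printed above, the helper names (`standardize`, `knn_predict`, `cross_val_predict_knn`), the interleaved fold assignment and the choice of k=2 are all assumptions made for illustration, and "score it on the whole dataset" is read here as "collect one out-of-fold prediction per row and score all of them together". In practice one would more likely reach for scikit-learn's `KNeighborsRegressor` together with `cross_val_predict`.

```python
import math
import statistics

# Ten rows copied from the sample printed above: (Area, Peri, Shape, Perm).
ROWS = [
    (8624, 3986.24, 0.148141, 119.0),
    (5267, 1644.96, 0.253832, 100.0),
    (5246, 1585.42, 0.133083, 740.0),
    (8209, 4344.75, 0.164127, 17.1),
    (4990, 2791.90, 0.090330, 6.3),
    (11876, 4353.14, 0.291029, 142.0),
    (3469, 1376.70, 0.176969, 100.0),
    (7979, 4010.15, 0.167045, 17.1),
    (7894, 1461.06, 0.276016, 950.0),
    (7558, 3930.66, 0.183312, 6.3),
]


def standardize(train_X, X):
    """Scale X column-wise with the mean and std computed on train_X only."""
    cols = list(zip(*train_X))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against zero std
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return statistics.fmean(train_y[i] for i in order[:k])


def cross_val_predict_knn(X, y, n_folds=4, k=2):
    """Return one out-of-fold prediction per row.

    Hypothetical fold assignment: fold f holds rows f, f + n_folds, ...
    """
    preds = [None] * len(X)
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        # Fit the scaling on the training fold only, then apply it to both parts.
        scaled_train = standardize(train_X, train_X)
        scaled_test = standardize(train_X, [X[i] for i in test_idx])
        for i, query in zip(test_idx, scaled_test):
            preds[i] = knn_predict(scaled_train, train_y, query, k)
    return preds


X = [list(row[:3]) for row in ROWS]  # features: Area, Peri, Shape
y = [row[3] for row in ROWS]         # target: Perm
preds = cross_val_predict_knn(X, y)
mae = statistics.fmean(abs(p - t) for p, t in zip(preds, y))
print(f"out-of-fold MAE on the 10 sample rows: {mae:.1f}")
```

The features are standardized inside each fold because Area and Peri are in the thousands while Shape is below 1, so unscaled Euclidean distances would be dominated by Area. With only ten rows the resulting number says little about the real dataset; the point is only the mechanics of the procedure.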

Well, it looks like KNN leads to something useful.