Kuzushiji-MNIST dataset (KMNIST)

Contents

Introduction to the Kuzushiji dataset
Benchmark models
- Benchmark models - Kuzushiji-MNIST
- Benchmark models - Kuzushiji-49

Introduction to the Kuzushiji dataset

Kuzushiji is a family of MNIST-like datasets released in 2018. Unlike most dataset walk-throughs, this one is done in Julia. If you like MNIST-like datasets, have a look at CMNIST as well.

The Kuzushiji dataset [1] is an MNIST-like dataset that contains 10 (Kuzushiji-MNIST) and 49 (Kuzushiji-49) phonetic characters of hiragana, which is a component of the Japanese writing system. The intention of the dataset is to link hiragana from classical literature to their modern counterparts (UTF-8 encoded). In addition, there is Kuzushiji-Kanji, a set of samples of kanji, which are adopted Chinese characters. All three datasets are based on the Kuzushiji dataset by the National Institute of Japanese Literature. The dataset is licensed under CC BY-SA 4.0.

Since I can’t read the example images of hiragana and kanji, here are my own dataset visualizations. I suspect that even a classifier with 5 % accuracy would outperform me and many other people who have never had any contact with classical Japanese literature. It is therefore difficult to judge by eye whether the samples in X_train and X_test are labeled correctly.

Since I’m not satisfied with existing packages for data exploration with Julia, we’ll have to write some basic helper functions first:

import CSV
import NPZ
import PyPlot
import Gadfly
import Images

function getCategoricalCount(InputArray)
    Count = Dict{Any,Int64}()
    for (Index, Category) in enumerate(InputArray)
        if haskey(Count, Category)
            Count[Category] +=1
        else
            Count[Category] = 1
        end
    end
    return sort(Count)
end

function getCategoricalIndices(InputArray)
    CategoricalIndices = Dict{Any,Array{Int64,1}}()
    for (Index, Category) in enumerate(InputArray)
        if haskey(CategoricalIndices, Category)
            push!(CategoricalIndices[Category], Index)
        else
            CategoricalIndices[Category] = [Index]
        end
    end
    return CategoricalIndices
end

function CollectAllImagesPerCategory(InputArray,CategoricalIndices,NumberOfExamples)
    # Never request more examples than the category actually contains
    NumberOfExamples = min(NumberOfExamples, length(CategoricalIndices))
    # Side length of the (square) grid of example tiles
    ItemsXY = ceil(Int64, sqrt(NumberOfExamples))
    ImageSizeY = size(InputArray, 2)
    ImageSizeX = size(InputArray, 3)
    # Black canvas: every image tile is surrounded by blank tiles of the same size
    OutputArray = zeros(ImageSizeY * (2 * ItemsXY + 1), ImageSizeX * (2 * ItemsXY + 1))
    for i in 1:NumberOfExamples
        # Zero-based grid position of the i-th example
        Row, Col = divrem(i - 1, ItemsXY)
        Y = (2 * Row + 1) * ImageSizeY
        X = (2 * Col + 1) * ImageSizeX
        OutputArray[Y+1:Y+ImageSizeY, X+1:X+ImageSizeX] = InputArray[CategoricalIndices[i],:,:]
    end
    return OutputArray
end

function PlotExampleImagesPerCategory(InputArray, CategoricalIndices, NumberOfExamples,DatasetName)
    for i in collect(keys(CategoricalIndices))
        Examples = CollectAllImagesPerCategory(InputArray,CategoricalIndices[i],NumberOfExamples)
        PyPlot.figure(figsize=(12,12))
        PyPlot.imshow(Examples, cmap="gray")
        PyPlot.title("Examples of class "*string(i)*" of "*DatasetName)
        PyPlot.xticks([])
        PyPlot.yticks([])
        PyPlot.tight_layout()
        PyPlot.savefig("./graphics/"*DatasetName*"_examples_class_"*string(i)*".png")
        PyPlot.close()
    end
end

function PlotHistogram(Categories, Title, XLabel="Class", YLabel="Frequency")
    Plot = Gadfly.plot(x=[string(i) for i in collect(keys(Categories))],
        y=collect(values(Categories)),
        Gadfly.Geom.bar,
        color=[string(i) for i in collect(keys(Categories))],
        Gadfly.Guide.xlabel(XLabel),
        Gadfly.Guide.ylabel(YLabel),
        Gadfly.Guide.title(Title * " Histogram"),
        Gadfly.Guide.colorkey(title=XLabel));
    ImageParams = Gadfly.SVG("./graphics/"*Title*"_histogram.svg", 14Gadfly.cm, 10Gadfly.cm);
    Gadfly.draw(ImageParams, Plot);    
end

Kuzushiji-MNIST

The Kuzushiji-MNIST or KMNIST dataset contains 10 classes of hiragana characters with a resolution of 28x28 (grayscale) similar to MNIST. In total it contains 70k images, 60k for training and 10k for testing.

Let’s have a look at which characters are contained in KMNIST:

ClassLabels_Kuzushiji_MNIST = CSV.File("./Kuzushiji-MNIST/data/kmnist_classmap.csv") |> CSV.DataFrame

Note: Julia indexes from 1, not 0, so the row index in the table is one higher than the 0-based target value.

index	target value	codepoint	char
1	0	U+304A	お
2	1	U+304D	き
3	2	U+3059	す
4	3	U+3064	つ
5	4	U+306A	な
6	5	U+306F	は
7	6	U+307E	ま
8	7	U+3084	や
9	8	U+308C	れ
10	9	U+3092	を
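The codepoint column is what links each target value to its modern character. As a small illustration (in Python, outside the Julia workflow of this post), the conversion from a "U+XXXX" string to the character might look like the sketch below. The function name codepoint_to_char and the class_map list are made up for this example; the pairs are copied from the table above.

```python
def codepoint_to_char(codepoint: str) -> str:
    """Convert a 'U+XXXX' string into the corresponding Unicode character."""
    if not codepoint.upper().startswith("U+"):
        raise ValueError(f"expected 'U+XXXX', got {codepoint!r}")
    # The part after 'U+' is the hexadecimal code point
    return chr(int(codepoint[2:], 16))


# A few (target value, codepoint) pairs copied from the class map above
class_map = [(0, "U+304A"), (1, "U+304D"), (8, "U+308C"), (9, "U+3092")]
labels = {target: codepoint_to_char(cp) for target, cp in class_map}
print(labels)
```

With such a mapping, a predicted target value can be shown as a character instead of a number.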

Kuzushiji_MNIST_X_train = NPZ.npzread("./Kuzushiji-MNIST/data/kmnist-train-imgs.npz")["arr_0"];
Kuzushiji_MNIST_X_test = NPZ.npzread("./Kuzushiji-MNIST/data/kmnist-test-imgs.npz")["arr_0"];
Kuzushiji_MNIST_y_train = NPZ.npzread("./Kuzushiji-MNIST/data/kmnist-train-labels.npz")["arr_0"];
Kuzushiji_MNIST_y_test = NPZ.npzread("./Kuzushiji-MNIST/data/kmnist-test-labels.npz")["arr_0"];

The first entry in the training set is supposed to be classified as “8” (れ) and looks like this as an array:

28×28 Array{UInt8,2}:
 0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
    ⋮                             ⋮  ⋱                       ⋮            
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00

Note: Julia displays UInt8 values in hexadecimal (base 16); the stored values are ordinary 8-bit unsigned integers. To see them in base 10, convert the array to a signed integer type such as Int16: convert(Array{Int16,3},NPZ.npzread("./Kuzushiji-MNIST/data/kmnist-train-imgs.npz")["arr_0"]);

We can plot this character using PyPlot:

PyPlot.figure(figsize=(1,1));
PyPlot.imshow(Kuzushiji_MNIST_X_train[1,:,:], cmap="gray");
PyPlot.show();

I have no idea how this can be the same as this modern symbol: れ. But again, this is a modern representation and I don’t know a single thing about Japanese.

Kuzushiji_MNIST_CategoricalCount = getCategoricalCount(Kuzushiji_MNIST_y_train);
Kuzushiji_MNIST_CategoricalIndices = getCategoricalIndices(Kuzushiji_MNIST_y_train);
PlotExampleImagesPerCategory(Kuzushiji_MNIST_X_train, Kuzushiji_MNIST_CategoricalIndices,144, "Kuzushiji-MNIST")
PlotHistogram(Kuzushiji_MNIST_CategoricalCount, "Kuzushiji-MNIST") # PlotHistogram appends " Histogram" to the title itself

The classes are distributed uniformly. Let’s have a look at a few examples per class:

From my point of view, there are some images that could be misclassified. However, beyond basic shapes, they all look alike to me.

Update: the dataset contains some black images. A bug report has already been filed: https://github.com/rois-codh/kmnist/issues/1.

Kuzushiji-49

Kuzushiji-49 is similar to KMNIST. However, it contains 270912 images of 49 classes which are not distributed evenly. It contains images of the following characters:

ClassLabels_Kuzushiji_49 = CSV.File("./Kuzushiji-49/data/k49_classmap.csv") |> CSV.DataFrame

index	target value	codepoint	char
1	0	U+3042	あ
2	1	U+3044	い
3	2	U+3046	う
4	3	U+3048	え
5	4	U+304A	お
6	5	U+304B	か
7	6	U+304D	き
8	7	U+304F	く
9	8	U+3051	け
10	9	U+3053	こ
11	10	U+3055	さ
12	11	U+3057	し
13	12	U+3059	す
14	13	U+305B	せ
15	14	U+305D	そ
16	15	U+305F	た
17	16	U+3061	ち
18	17	U+3064	つ
19	18	U+3066	て
20	19	U+3068	と
21	20	U+306A	な
22	21	U+306B	に
23	22	U+306C	ぬ
24	23	U+306D	ね
25	24	U+306E	の
26	25	U+306F	は
27	26	U+3072	ひ
28	27	U+3075	ふ
29	28	U+3078	へ
30	29	U+307B	ほ
31	30	U+307E	ま
32	31	U+307F	み
33	32	U+3080	む
34	33	U+3081	め
35	34	U+3082	も
36	35	U+3084	や
37	36	U+3086	ゆ
38	37	U+3088	よ
39	38	U+3089	ら
40	39	U+308A	り
41	40	U+308B	る
42	41	U+308C	れ
43	42	U+308D	ろ
44	43	U+308F	わ
45	44	U+3090	ゐ
46	45	U+3091	ゑ
47	46	U+3092	を
48	47	U+3093	ん
49	48	U+309D	ゝ

Kuzushiji_49_X_train = NPZ.npzread("./Kuzushiji-49/data/k49-train-imgs.npz")["arr_0"];
Kuzushiji_49_X_test = NPZ.npzread("./Kuzushiji-49/data/k49-test-imgs.npz")["arr_0"];
Kuzushiji_49_y_train = NPZ.npzread("./Kuzushiji-49/data/k49-train-labels.npz")["arr_0"];
Kuzushiji_49_y_test = NPZ.npzread("./Kuzushiji-49/data/k49-test-labels.npz")["arr_0"];

Kuzushiji_49_CategoricalCount = getCategoricalCount(Kuzushiji_49_y_train);
Kuzushiji_49_CategoricalIndices = getCategoricalIndices(Kuzushiji_49_y_train);
PlotExampleImagesPerCategory(Kuzushiji_49_X_train, Kuzushiji_49_CategoricalIndices,144, "Kuzushiji-49")
PlotHistogram(Kuzushiji_49_CategoricalCount, "Kuzushiji-49") # PlotHistogram appends " Histogram" to the title itself

The examples of each class look like this:

From my point of view, there are some images that could be misclassified. However, beyond basic shapes, they all look alike to me.

Kuzushiji-Kanji

The third benchmark dataset provided is Kuzushiji-Kanji. Its images are no longer 28x28 but 64x64 grayscale images. It contains 140426 images of 3832 classes in total. There is no train-test split available. The unpacked dataset looks like this:

tree .

.
├── U+241C6
│   ├── 689fa55040ec4f03.png
│   └── c0d603c6ce4a4538.png
├── U+24FA3
│   ├── 4190e728bfc948e0.png
│   └── 80582798ed70ce7c.png
├── U+25DA1
│   └── 512d7fcacddd25fd.png

...

└── U+FA5C
    ├── 15e2060396eba2b3.png
    └── 679e4b2f026f6297.png

3832 directories, 140424 files

We can see that most characters have only a few examples per class:

function ReadFilePaths(PathToDataFolder)
    # Map each class folder (e.g. "U+FA5C") to the list of image files it contains
    OutputDict = Dict{String,Array{String,1}}()
    for SymbolFolder in readdir(PathToDataFolder)
        # joinpath works with or without a trailing separator in PathToDataFolder
        OutputDict[SymbolFolder] = readdir(joinpath(PathToDataFolder, SymbolFolder))
    end
    return OutputDict
end

function getClassCountKanji(FilePaths)
    # Count the images per class from the dictionary passed in (not from a global)
    FrequencyCountDict = Dict{String,Int64}()
    for (Class, Files) in FilePaths
        FrequencyCountDict[Class] = length(Files)
    end
    return FrequencyCountDict
end

Kuzushiji_Kanji_SymbolsWithFilePaths = ReadFilePaths(PathToDataFolder);
collect(values(getClassCountKanji(Kuzushiji_Kanji_SymbolsWithFilePaths)))
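A raw list of counts is hard to read for 3832 classes. One way to summarize it is to count how many classes have exactly one example, two examples, and so on. The following Python sketch shows the idea. The function names are made up for this example, and the toy counts only cover the folders visible in the tree excerpt above, not the full dataset.

```python
import os
from collections import Counter


def count_files_per_class(root):
    """Map each class folder under `root` to the number of files it contains."""
    return {
        entry.name: sum(1 for f in os.scandir(entry.path) if f.is_file())
        for entry in os.scandir(root)
        if entry.is_dir()
    }


def classes_by_size(counts):
    """Histogram: number of examples -> number of classes with that many."""
    return Counter(counts.values())


# Toy counts read off the visible part of the tree listing (not the full data)
toy_counts = {"U+241C6": 2, "U+24FA3": 2, "U+25DA1": 1, "U+FA5C": 2}
print(sorted(classes_by_size(toy_counts).items()))  # [(1, 1), (2, 3)]
```

On the real data, count_files_per_class would be called on the unpacked dataset folder and its result passed to classes_by_size.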

Let’s have a look at an example image per class. For this, we have to write a new function that uses PyPlot again:

function KanjiExamplePlots(InputDict, NumberofXSubPlots)
    DictLength = length(InputDict);
    PlotXDim = NumberofXSubPlots*2 + 1;
    PlotYDim = 2 * div(DictLength,NumberofXSubPlots);
    PyPlot.figure(figsize=(PlotXDim, PlotYDim));
    PositionCounter = 1;
    for Symbol in collect(keys(InputDict))
        PyPlot.subplot(convert(Int64,ceil(div(DictLength,NumberofXSubPlots))+1), NumberofXSubPlots, PositionCounter)
        PyPlot.imshow(convert(Array{Float64,2},Images.load(PathToDataFolder*Symbol*"/"*InputDict[Symbol][1])),
            cmap="gray")
        PyPlot.title("Example of "*Symbol)
        PyPlot.xticks([])
        PyPlot.yticks([])
        PositionCounter += 1
    end
    PyPlot.tight_layout()
    PyPlot.savefig("./graphics/KanjiExamples.png")
    PyPlot.close()
end


KanjiExamplePlots(Kuzushiji_Kanji_SymbolsWithFilePaths,15)

A version with a train/test split should be released in the near future. The dataset also contains a few black images. I’ll provide some baseline results once train-test splits are released.

Benchmark models

The dataset comes with two benchmark models. The first is a 4-nearest-neighbors classifier and the second a very simple convolutional neural network:

# kNN with neighbors=4 benchmark for Kuzushiji-MNIST
# Achieves 97.4% test accuracy

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

def load(f):
    return np.load(f)['arr_0']

# Load the data
x_train = load('kuzushiji10-train-imgs.npz')
x_test = load('kuzushiji10-test-imgs.npz')
y_train = load('kuzushiji10-train-labels.npz')
y_test = load('kuzushiji10-test-labels.npz')

# Flatten images
x_train = x_train.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)

clf = KNeighborsClassifier(n_neighbors=4, weights='distance', n_jobs=-1)
print('Fitting', clf)
clf.fit(x_train, y_train)
print('Evaluating', clf)

test_score = clf.score(x_test, y_test)
print('Test accuracy:', test_score)

# Based on MNIST CNN from Keras' examples: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py (MIT License)

from __future__ import print_function
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import numpy as np

batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28

def load(f):
    return np.load(f)['arr_0']

# Load the data
x_train = load('kuzushiji10-train-imgs.npz')
x_test = load('kuzushiji10-test-imgs.npz')
y_train = load('kuzushiji10-train-labels.npz')
y_test = load('kuzushiji10-test-labels.npz')

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('{} train samples, {} test samples'.format(len(x_train), len(x_test)))

# Convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))

train_score = model.evaluate(x_train, y_train, verbose=0)
test_score = model.evaluate(x_test, y_test, verbose=0)
print('Train loss:', train_score[0])
print('Train accuracy:', train_score[1])
print('Test loss:', test_score[0])
print('Test accuracy:', test_score[1])

I’m not sure what is going wrong nowadays, but it seems that many people are neglecting proper train-valid-test splitting again, even though cross-validation became a bit more popular with the recent machine learning hype. The CNN benchmark above, for example, passes the test set as validation_data. This dataset is far from big enough to neglect a validation set.

It’s time to run some fast and simple models but with proper train-valid-test splitting.
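To make the idea concrete: the validation data is taken out of the training data, and the test set is only touched once for the final evaluation. Below is a minimal Python sketch of a stratified hold-out split on label indices. The function stratified_split, its parameters and the toy labels are assumptions for illustration and are not part of the benchmark code.

```python
import random
from collections import defaultdict


def stratified_split(labels, valid_fraction=0.1, seed=0):
    """Split sample indices into train and validation sets, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, valid = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one validation sample per class
        n_valid = max(1, round(len(indices) * valid_fraction))
        valid.extend(indices[:n_valid])
        train.extend(indices[n_valid:])
    return sorted(train), sorted(valid)


labels = [i % 10 for i in range(1000)]  # toy labels: 10 classes, 100 samples each
train_idx, valid_idx = stratified_split(labels, valid_fraction=0.1)
print(len(train_idx), len(valid_idx))  # 900 100
```

Model selection would then use only the train and validation indices. k-fold cross-validation, as used in the grid search of this post, repeats the same idea with several different validation folds.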

Benchmark models - Kuzushiji-MNIST

Their baseline results on the testing set are:

Model	MNIST	Kuzushiji-MNIST	Kuzushiji-49
4-Nearest Neighbour Baseline	97.14%	91.56%	86.01%
Keras Simple CNN Benchmark	99.06%	95.12%	89.25%

I’m not going to compare it to MNIST!

import ScikitLearn
import Flux

4-NN on Kuzushiji-MNIST

Well, let’s be a tiny bit more creative here and not limit k-NN to 4 ;).

# dataset in 1D for ScikitLearn
Kuzushiji_MNIST_X_train_flat = reshape(Kuzushiji_MNIST_X_train, size(Kuzushiji_MNIST_X_train, 1), :);
Kuzushiji_MNIST_X_test_flat = reshape(Kuzushiji_MNIST_X_test, size(Kuzushiji_MNIST_X_test, 1), :);

ScikitLearn.@sk_import neighbors: KNeighborsClassifier;
GridSCV = ScikitLearn.GridSearch.GridSearchCV;


KNNClassifier = KNeighborsClassifier(n_jobs=-1)
GridParameters = Dict{String, Array}(
    "n_neighbors" => [3 4 5 6 7 8 9],
    "weights" => ["uniform", "distance"],
    "algorithm" => ["ball_tree", "kd_tree", "brute"],
    "leaf_size" => [15 30 45 60],
    "p" => [1 2 3 4 5] # using Manhattan distance (1), Euclidean distance (2), arbitrary Minkowski distances
    );
GridObject = GridSCV(
    KNNClassifier,
    GridParameters,
    cv=5,
    verbose=2
    # n_jobs=-1 is not implemented yet :(
    );
GridFit = ScikitLearn.fit!(
    GridObject,
    Kuzushiji_MNIST_X_train_flat,
    Kuzushiji_MNIST_y_train
    )

That is what I intended to run. However, it turned out to be really slow on a slightly old notebook. I’ll post the proper results once I’m back home and have access to my workstation again. For now, I ran this reduced version:

KNNClassifier = KNeighborsClassifier(n_jobs=-1)
GridParameters = Dict{String, Array}(
    "n_neighbors" => [3 4],#[3 4 5 6 7 8 9],
    #"weights" => #["uniform", "distance"],
    #"algorithm" => #["ball_tree", "kd_tree", "brute"],
    #"leaf_size" => #[15 30 45 60],
    "p" => [1 2]#[1 2 3 4 5] # using Manhattan distance (1), Euclidean distance (2), arbitrary Minkowski distances
    );
GridObject = GridSCV(
    KNNClassifier,
    GridParameters,
    cv=5,
    verbose=2,
    # n_jobs=-1 is not implemented yet :(
    );
GridFit = ScikitLearn.fit!(
    GridObject,
    Kuzushiji_MNIST_X_train_flat,
    Kuzushiji_MNIST_y_train
    )

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] n_neighbors=3, p=1
[CV] n_neighbors=3, p=1  -  617.4s
[CV] n_neighbors=3, p=1
[CV] n_neighbors=3, p=1  -  568.3s
[CV] n_neighbors=3, p=1
[CV] n_neighbors=3, p=1  -  527.4s
[CV] n_neighbors=3, p=1
[CV] n_neighbors=3, p=1  -  546.8s
[CV] n_neighbors=3, p=1
[CV] n_neighbors=3, p=1  -  467.9s
[CV] n_neighbors=3, p=2
[CV] n_neighbors=3, p=2  -  628.5s
[CV] n_neighbors=3, p=2
[CV] n_neighbors=3, p=2  -  627.2s
[CV] n_neighbors=3, p=2
[CV] n_neighbors=3, p=2  -  614.4s
[CV] n_neighbors=3, p=2
[CV] n_neighbors=3, p=2  -  612.5s
[CV] n_neighbors=3, p=2
[CV] n_neighbors=3, p=2  -  631.3s
[CV] n_neighbors=4, p=1
[CV] n_neighbors=4, p=1  -  496.8s
[CV] n_neighbors=4, p=1
[CV] n_neighbors=4, p=1  -  609.4s
[CV] n_neighbors=4, p=1
[CV] n_neighbors=4, p=1  -  620.5s
[CV] n_neighbors=4, p=1
[CV] n_neighbors=4, p=1  -  555.7s
[CV] n_neighbors=4, p=1
[CV] n_neighbors=4, p=1  -  489.0s
[CV] n_neighbors=4, p=2
[CV] n_neighbors=4, p=2  -  575.9s
[CV] n_neighbors=4, p=2
[CV] n_neighbors=4, p=2  -  500.7s
[CV] n_neighbors=4, p=2
[CV] n_neighbors=4, p=2  -  480.4s
[CV] n_neighbors=4, p=2
[CV] n_neighbors=4, p=2  -  521.8s
[CV] n_neighbors=4, p=2
[CV] n_neighbors=4, p=2  -  561.0s

ScikitLearn.Skcore.GridSearchCV
  estimator: PyCall.PyObject
  param_grid: Dict{String,Array}
  scoring: Nothing nothing
  loss_func: Nothing nothing
  score_func: Nothing nothing
  fit_params: Dict{Any,Any}
  n_jobs: Int64 1
  iid: Bool true
  refit: Bool true
  cv: Int64 5
  verbose: Int64 2
  error_score: String "raise"
  scorer_: ScikitLearnBase.score (function of type typeof(ScikitLearnBase.score))
  best_params_: Dict{Symbol,Any}
  best_score_: Float64 0.9625166666666667
  grid_scores_: Array{ScikitLearn.Skcore.CVScoreTuple}((4,))
  best_estimator_: PyCall.PyObject

That was slow…
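To put the runtime in perspective, here is a rough back-of-the-envelope estimate in Python. It counts the candidates in the reduced grid and in the full grid that was originally planned, then scales by the mean time per fold from the log above. The assumption that every candidate costs about the same per fold is mine and is probably optimistic, since other algorithms and higher values of p may behave differently.

```python
from itertools import product

full_grid = {
    "n_neighbors": [3, 4, 5, 6, 7, 8, 9],
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
    "leaf_size": [15, 30, 45, 60],
    "p": [1, 2, 3, 4, 5],
}
reduced_grid = {"n_neighbors": [3, 4], "p": [1, 2]}


def count_fits(grid, folds=5):
    """Return (number of candidates, number of fits) for a parameter grid."""
    n_candidates = sum(1 for _ in product(*grid.values()))
    return n_candidates, n_candidates * folds


# Seconds per fold, copied from the grid search log above
fold_seconds = [
    617.4, 568.3, 527.4, 546.8, 467.9, 628.5, 627.2, 614.4, 612.5, 631.3,
    496.8, 609.4, 620.5, 555.7, 489.0, 575.9, 500.7, 480.4, 521.8, 561.0,
]
mean_seconds = sum(fold_seconds) / len(fold_seconds)

print(count_fits(reduced_grid))  # (4, 20)
print(count_fits(full_grid))     # (840, 4200)

full_hours = count_fits(full_grid)[1] * mean_seconds / 3600
print(f"reduced grid: {sum(fold_seconds) / 3600:.1f} h measured")
print(f"full grid: roughly {full_hours:.0f} h under the equal-cost assumption")
```

Under this assumption the full grid would take several hundred hours on the same machine, which is why only the reduced grid was run here.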

GridFit.best_params_
Dict{Symbol,Any} with 2 entries:
  :n_neighbors => 3
  :p           => 2

Let’s predict the test set and evaluate it:

Kuzushiji_MNIST_y_test_pred = ScikitLearn.predict(GridFit,Kuzushiji_MNIST_X_test_flat);
Kuzushiji_MNIST_accuracy = sum(Kuzushiji_MNIST_y_test_pred .== Kuzushiji_MNIST_y_test)/length(Kuzushiji_MNIST_y_test_pred)

This yields a test accuracy of 0.912, which is pretty close to the reference result of 91.56 %. The cross-validated score on the training data was higher (best_score_ of about 0.9625). Since our model here is properly cross-validated, we can assume that it is a bit more robust.

Brute-force collection on Kuzushiji-MNIST

This one is still in Python since I haven’t had time to convert it to Julia so far. I guess I should rewrite some things properly as well, since it has grown organically for quite some time now.

Since the original brute-force pipeline required >40 h!!!, I’m going to run a rather reduced version of this here.

Benchmark models - Kuzushiji-49

4-NN on Kuzushiji-49

Will be delivered in a few days since my notebook is too slow for this.

Brute-force collection on Kuzushiji-49

Since the original brute-force pipeline required >40 h!!!, I’m going to run a rather reduced version of this here.

References

[1] Clanuwat, T.; Bober-Irizar, M.; Kitamoto, A.; Lamb, A.; Yamamoto, K.; Ha, D. (2018): Deep Learning for Classical Japanese Literature. Neural Information Processing Systems 2018 Workshop on Machine Learning for Creativity and Design preprint: https://arxiv.org/abs/1812.01718