How to implement k-NN (K Nearest Neighbor) with sklearn

Introduction

This article walks through how to implement k-NN (K Nearest Neighbor) with sklearn's k-NN library, using classification of the iris dataset as the example.

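What is k-NN?

k-NN classifies an unknown input by selecting the K training samples most similar to it (in sklearn, the nearest ones by Euclidean distance by default) and predicting the class that appears most often among those K samples.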
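The majority-vote idea described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not how sklearn implements it: the function name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed as the similarity measure.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, x_query, k=3):
    """Predict the class of one query point by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(x_train - x_query, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Most frequent class label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical, for illustration only): two clusters in 2D
x_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(x_toy, y_toy, np.array([1.1, 1.0]), k=3))  # prints 0
print(knn_predict(x_toy, y_toy, np.array([4.9, 5.0]), k=3))  # prints 1
```

With an even k, or with more than two classes, votes can tie; this sketch does not handle ties in any principled way.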

Implementation

1. Loading the data

# Preprocessing: load the dataset

%matplotlib inline

from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split

# Use the iris dataset
iris = datasets.load_iris()
# Keys of the dataset: dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

# As an example, use 2D data made of the 1st and 2nd features
# Of the iris features ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
# columns 0 and 1 (sepal length, sepal width) are used here
#x = iris.data[:, [2,3]]  # alternative: 3rd and 4th features (petal length, petal width)
x = iris.data[:, [0,1]]
#x = iris.data  # alternative: all four features

# Get the class labels
# target_names : ['setosa' 'versicolor' 'virginica']
y = iris.target

# Split the data into training (70%) and test (30%) sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

2. Evaluating accuracy on the training data with cross-validation

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors = 6)

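# Fit on the whole training split. cross_val_score below clones the estimator and
# refits it on each fold, so this fitted model is only used for the plot in step 3.
knn.fit(x_train, y_train)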

scores = cross_val_score(knn, x_train, y_train, cv = 5)
print(scores)

print("Mean cross-validation accuracy of k-NN: {}".format(np.mean(scores)))

Output

[0.66666667 0.76190476 0.76190476 0.80952381 0.80952381]
Mean cross-validation accuracy of k-NN: 0.7619047619047619
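To make clearer what the `cross_val_score` call above computes, the following sketch reproduces it with an explicit loop. It assumes the same two sepal features and the same split as in step 1; the names `manual_scores` and `library_scores` are introduced here for illustration. For a classifier with an integer `cv`, sklearn uses stratified folds without shuffling, so the two results should agree.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split as in step 1 (sepal length and sepal width only)
iris = datasets.load_iris()
x = iris.data[:, [0, 1]]
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=6)

manual_scores = []
for fit_idx, val_idx in StratifiedKFold(n_splits=5).split(x_train, y_train):
    # Train a fresh copy on four folds, then score accuracy on the held-out fold
    model = clone(knn)
    model.fit(x_train[fit_idx], y_train[fit_idx])
    manual_scores.append(model.score(x_train[val_idx], y_train[val_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(knn, x_train, y_train, cv=5)

print(manual_scores)
print(np.allclose(manual_scores, library_scores))
```

Each score is the accuracy on one held-out fold of the training split, so the test split is never touched during cross-validation.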

3. Visualizing the classification

# Plot the data

import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
from sklearn.model_selection import train_test_split

# Same split as in step 1 (identical because random_state is fixed)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Plot style
plt.style.use('ggplot')

# Concatenate the training and test data (no standardization is applied)
x_combined = np.vstack((x_train, x_test))
y_combined = np.hstack((y_train, y_test))

fig = plt.figure(figsize=(13, 8))
# Note: recent mlxtend versions document `res` as deprecated; drop it if a warning appears
plot_decision_regions(x_combined, y_combined, clf=knn, res=0.02)
plt.show()

Output

[Figure: visualization of the k-NN classification]
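For readers without mlxtend, a plot similar to the one `plot_decision_regions` produces can be approximated by predicting the class at every point of a dense grid and filling the result with `contourf`. This is a rough sketch under assumptions, not mlxtend's actual implementation: the grid step of 0.02, the margin of 1, and the variable names are arbitrary choices made here.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data, split, and model settings as in the steps above
iris = datasets.load_iris()
x = iris.data[:, [0, 1]]  # sepal length, sepal width
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=6).fit(x_train, y_train)

# Grid covering the data range with a margin (step and margin are arbitrary)
step = 0.02
x0_min, x0_max = x[:, 0].min() - 1, x[:, 0].max() + 1
x1_min, x1_max = x[:, 1].min() - 1, x[:, 1].max() + 1
xx0, xx1 = np.meshgrid(np.arange(x0_min, x0_max, step),
                       np.arange(x1_min, x1_max, step))

# Predict the class at every grid point, then reshape back to the grid
grid_points = np.c_[xx0.ravel(), xx1.ravel()]
zz = knn.predict(grid_points).reshape(xx0.shape)

# Filled regions per predicted class, with the samples drawn on top
fig, ax = plt.subplots(figsize=(13, 8))
ax.contourf(xx0, xx1, zz, levels=[-0.5, 0.5, 1.5, 2.5], alpha=0.3)
ax.scatter(x[:, 0], x[:, 1], c=y, edgecolor="k")
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
```

In a notebook with `%matplotlib inline` the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.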