[Machine Learning] A Thorough Guide to Validation Methods, Part 2

December 26, 2023

Introduction

In Part 2, we walk through the commonly used validation methods introduced in Part 1, this time with code.

Part 1: https://blog.since2020.jp/data_analysis/%e6%a9%9f%e6%a2%b0%e5%ad%a6%e7%bf%92%e3%83%90%e3%83%aa%e3%83%87%e3%83%bc%e3%82%b7%e3%83%a7%e3%83%b3%e6%89%8b%e6%b3%95-%e5%be%b9%e5%ba%95%e8%a7%a3%e8%aa%ac-part-1/

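Why do this?

If you are new to machine learning, documentation such as scikit-learn's is fun to read, but the sheer amount of information can make it hard to digest.

Showing code alongside the written explanation makes each method's concept clearer and its concrete implementation easier to picture.

My goal is to help readers not just understand these methods, but be able to implement them in code themselves.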

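Syntax notes

test_size: the proportion or number of samples assigned to the test set. A float is read as a proportion and an integer as an absolute number of samples. For example, test_size=0.2 uses 20% of the data as the test set and the remaining 80% as the training set.

random_state: the random seed used when splitting the data. Fixing it to an integer gives the same split on every run.

n_splits: the number of folds in cross-validation. For example, n_splits=5 means 5-fold cross-validation.

shuffle: whether to shuffle the data before splitting. For example, with shuffle=True the rows are randomly shuffled and then split.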

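.split(X): splits the dataset as a whole, row by row; it does not break apart strings or individual features. The method divides the rows into the specified number of folds and, for each fold, yields the row indices (integer positions) of the training set and the test set. It does not return the data itself, so you use the indices to select rows, for example with X.iloc[...].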
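
To make the notes above concrete, here is a minimal sketch on a tiny toy array (the data and variable names are made up for illustration, not taken from Part 1). It shows that .split(X) yields pairs of index arrays rather than the data, and that test_size accepts either a fraction or a count.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy data for illustration only: 6 samples, 2 features.
X = np.arange(12).reshape(6, 2)

# Without shuffling, KFold cuts the rows into consecutive blocks.
kf = KFold(n_splits=3)
folds = [(train_idx.tolist(), test_idx.tolist()) for train_idx, test_idx in kf.split(X)]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]

# A float test_size is a fraction; an int test_size is an absolute number of samples.
_, X_test_frac = train_test_split(X, test_size=0.5, random_state=28)
_, X_test_count = train_test_split(X, test_size=2, random_state=28)
print(len(X_test_frac), len(X_test_count))  # 3 2
```

Every row appears in exactly one test fold, which is what lets cross-validation use all of the data for evaluation.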

Implementation

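Holdout Method:

# Holdout method
# X (features) and y (target) are assumed to be loaded already, e.g. as a pandas DataFrame and Series
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows as the test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=28)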
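
For an end-to-end picture, the following sketch runs the holdout split on real data and scores a model on the held-out part. The iris dataset and LogisticRegression are my own choices for illustration; any dataset and estimator would work the same way.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Example data (an assumption for this sketch): iris as a DataFrame and a Series.
X, y = load_iris(return_X_y=True, as_frame=True)

# Hold out 20% of the 150 rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=28
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

# Fit on the training set only, then score on the unseen test set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"holdout accuracy: {accuracy:.3f}")
```

Because the score comes from a single split, it can change noticeably with a different random_state, which is one motivation for the cross-validation methods that follow.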

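K-Fold Cross-Validation:

# K-fold cross-validation
# X and y are assumed to be a pandas DataFrame and Series, since .iloc is used to select rows
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=28)
# Each iteration yields the row positions of the training and test sets for one fold
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Train and evaluate the model for this fold here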
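
The loop above only produces the splits, so here is a sketch of what typically goes inside it: fit a fresh model per fold, collect the scores, and average them. The iris data and LogisticRegression are assumptions for illustration. The last lines use scikit-learn's cross_val_score helper, which should give the same per-fold scores with less code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Example data and model (assumptions for this sketch).
X, y = load_iris(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=28)

scores = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    fold_model = clone(model)  # unfitted copy, so folds do not share state
    fold_model.fit(X_train, y_train)
    scores.append(fold_model.score(X_test, y_test))

print("fold scores:", np.round(scores, 3))
print("mean score:", round(float(np.mean(scores)), 3))

# Shortcut: the same evaluation in a single call, reusing the same splitter.
shortcut_scores = cross_val_score(model, X, y, cv=kf)
print("cross_val_score:", np.round(shortcut_scores, 3))
```

Writing the loop by hand is still useful when you need per-fold access to the fitted models or predictions.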

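Group K-Fold Cross-Validation:

# Group K-fold cross-validation
# groups is assumed to hold one group label per row (e.g. a user ID); X and y are a pandas DataFrame and Series
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
# Rows that share a group label are never split between the training and test sets
for train_index, test_index in gkf.split(X, y, groups=groups):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]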
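
Since groups is not defined in the snippet above, here is a small sketch with a made-up groups array (four hypothetical users with three rows each). It checks the property that matters: no group ever appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data for illustration only: 12 rows from 4 hypothetical users.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)  # one group label per row

gkf = GroupKFold(n_splits=4)  # n_splits cannot exceed the number of groups
overlaps = []
tested_groups = set()
for train_index, test_index in gkf.split(X, y, groups=groups):
    train_groups = set(groups[train_index].tolist())
    test_groups = set(groups[test_index].tolist())
    overlaps.append(train_groups & test_groups)
    tested_groups |= test_groups
    print("train groups:", sorted(train_groups), "test groups:", sorted(test_groups))

# Every intersection is empty: the model is always tested on unseen groups.
print(all(len(overlap) == 0 for overlap in overlaps))  # True
```

This is the situation where a plain K-fold split would leak information, for example when several rows come from the same user or patient.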

Stratified K-Fold Cross-Validation:

# Stratified K-fold cross-validation
# X and y are assumed to be a pandas DataFrame and Series; y holds the class labels
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=28)
# y is passed to split() so that each fold keeps roughly the same class proportions as the full data
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
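
To see what stratification buys you, this sketch uses made-up imbalanced labels (80 negatives, 20 positives) and compares the share of positives in each test fold under StratifiedKFold and plain KFold. The data is a placeholder; only the labels matter here.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels for illustration only: 20% positives.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # placeholder features; the splitters only need the row count

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=28)
stratified_ratios = [float(y[test_index].mean()) for _, test_index in skf.split(X, y)]
print("stratified:", stratified_ratios)  # stratified: [0.2, 0.2, 0.2, 0.2, 0.2]

# Plain KFold ignores the labels, so the positive share can drift from fold to fold.
kf = KFold(n_splits=5, shuffle=True, random_state=28)
plain_ratios = [float(y[test_index].mean()) for _, test_index in kf.split(X)]
print("plain:", plain_ratios)
```

With rare classes, an unstratified fold can end up with very few positives (or none), which makes per-fold metrics unstable.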

Leave-One-Out Cross-Validation:

# Leave-one-out cross-validation
# X and y are assumed to be a pandas DataFrame and Series
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
# Each iteration holds out exactly one row as the test set, so the loop runs once per row
for train_index, test_index in loo.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
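
As a quick check of how leave-one-out behaves, the sketch below runs it on a five-row toy array (made up for illustration) and prints every split. The number of splits equals the number of rows, and each test set holds a single row.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Toy data for illustration only: 5 samples, 2 features.
X = np.arange(10).reshape(5, 2)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)
print(n_splits)  # 5

splits = [(train_idx.tolist(), test_idx.tolist()) for train_idx, test_idx in loo.split(X)]
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
# train: [1, 2, 3, 4] test: [0]
# train: [0, 2, 3, 4] test: [1]
# train: [0, 1, 3, 4] test: [2]
# train: [0, 1, 2, 4] test: [3]
# train: [0, 1, 2, 3] test: [4]
```

Because the model is refit once per row, this method gets expensive quickly on large datasets and is mostly practical when data is scarce.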