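You can use `StratifiedKFold` from sklearn to implement K-fold cross-validation. It splits the data according to the proportion of each class in the labels, so every fold stays representative even when the classes are imbalanced.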
- #!/usr/bin/python3
- # -*- coding:utf-8 -*-
- """
- @author: xcd
- @file: StratifiedKFold-test.py
- @time: 2021/1/26 10:14
- @desc:
- """
-
- import numpy as np
- from sklearn.model_selection import KFold, StratifiedKFold
-
- X = np.array([
- [1, 2, 3, 4],
- [11, 12, 13, 14],
- [21, 22, 23, 24],
- [31, 32, 33, 34],
- [41, 42, 43, 44],
- [51, 52, 53, 54],
- [61, 62, 63, 64],
- [71, 72, 73, 74]
- ])
-
- y = np.array([1, 1, 1, 1, 1, 1, 0, 0])
-
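- # Stratified split, shuffled within each class and seeded for reproducibility.
- # Class 0 has only 2 samples, fewer than n_splits=4, so scikit-learn emits a warning here.
- sfolder = StratifiedKFold(n_splits=4, random_state=0, shuffle=True)
- # Plain KFold for comparison. random_state is left unset because shuffle=False;
- # recent scikit-learn versions raise a ValueError if both are given.
- folder = KFold(n_splits=4, shuffle=False)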
-
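- # Stratified folds: each class is spread as evenly as possible across the test folds.
- for train, test in sfolder.split(X, y):
-     print(train, test)
-
- print("-------------------------------")
- # Plain KFold ignores y and cuts consecutive blocks, so here every test fold holds a single class.
- for train, test in folder.split(X, y):
-     print(train, test)
-
- # Use the fold indices to slice out the training and validation sets.
- for fold, (train_idx, val_idx) in enumerate(sfolder.split(X, y)):
-     train_set, val_set = X[train_idx], X[val_idx]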
-
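The contrast with KFold is clear. StratifiedKFold is used in the same way as KFold, but it performs stratified sampling, so the class proportions in each training set and test set match those of the original dataset as closely as possible.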
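To make the difference concrete, the following is a minimal sketch that measures the share of class 1 in every test fold for both splitters. The dataset (12 samples, 8 of class 1 and 4 of class 0) and the helper name `positive_ratios` are assumptions chosen for illustration; they are not part of the script above.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Illustrative data (an assumption): 12 samples, 8 of class 1 and 4 of class 0.
X = np.arange(24).reshape(12, 2)
y = np.array([1] * 8 + [0] * 4)


def positive_ratios(splitter, X, y):
    """Return the fraction of class-1 labels in each test fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]


# Both splitters are used without shuffling, so the result is deterministic.
stratified_ratios = positive_ratios(StratifiedKFold(n_splits=4), X, y)
plain_ratios = positive_ratios(KFold(n_splits=4), X, y)

print("overall   :", float(y.mean()))
print("stratified:", stratified_ratios)
print("plain     :", plain_ratios)
```

With this data every stratified test fold keeps the overall 2:1 ratio, while the plain KFold folds range from all class 1 to all class 0, because the samples happen to be ordered by label.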
- ###
- Parameters
-
- n_splits : int, default=5 (3 in scikit-learn versions before 0.22)
- Number of folds. Must be at least 2.
-
- shuffle : boolean, optional
- Whether to shuffle each stratification of the data before splitting into batches.
-
- random_state : int, RandomState instance or None, optional, default=None
-
- If int, random_state is the seed used by the random number generator.
-
- If RandomState instance, random_state is the random number generator.
-
- If None, the random number generator is the RandomState instance used by `np.random`.
-
- Used when ``shuffle`` == True.
- ###
-
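The interaction between `shuffle` and `random_state` can be checked with a small sketch. The data and the helper name `collect_test_folds` are assumptions for illustration only. It relies on the documented behaviour that an integer seed makes shuffled splits repeatable, and that unshuffled splits are deterministic without any seed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative data (an assumption): 12 samples, 8 of class 1 and 4 of class 0.
X = np.arange(24).reshape(12, 2)
y = np.array([1] * 8 + [0] * 4)


def collect_test_folds(splitter):
    """Collect the test indices of every fold as plain Python lists."""
    return [test_idx.tolist() for _, test_idx in splitter.split(X, y)]


# Same integer seed with shuffle=True: the folds are identical on every run.
folds_a = collect_test_folds(StratifiedKFold(n_splits=4, shuffle=True, random_state=0))
folds_b = collect_test_folds(StratifiedKFold(n_splits=4, shuffle=True, random_state=0))
print(folds_a == folds_b)

# shuffle=False (the default): the split is deterministic, so random_state stays unset.
folds_c = collect_test_folds(StratifiedKFold(n_splits=4))
folds_d = collect_test_folds(StratifiedKFold(n_splits=4))
print(folds_c == folds_d)
```

Fixing the seed is mainly useful when results from different models or runs need to be compared on exactly the same folds.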