大家好,本文中,我将和大家一起学习如何训练 LightGBM 模型来估计电子商务广告的点击率的推荐系统的例子。将在Criteo数据集上训练一个基于LightGBM的模型。
LightGBM是一个基于树的梯度提升学习算法框架。是基于分布式框架设计的,因而非常高效,具有以下优点:
LightGBM梯度提升树算法,推荐原理是基于内容的过滤,用于在基于内容的问题中进行快速训练和低内存使用。
接下来我们一起看个具体的案例,更加深刻的理解LightGBM是如何被用于推荐系统中的。
- import sys
- import os
- import numpy as np
- import lightgbm as lgb
- import papermill as pm
- import scrapbook as sb
- import pandas as pd
- import category_encoders as ce
- from tempfile import TemporaryDirectory
- from sklearn.metrics import roc_auc_score, log_loss
-
- import recommenders.models.lightgbm.lightgbm_utils as lgb_utils
- import recommenders.datasets.criteo as criteo
-
- print("System version: {}".format(sys.version))
- print("LightGBM version: {}".format(lgb.__version__))
-
- System version: 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 19:16:44)
- [GCC 7.3.0]
- LightGBM version: 2.2.1
-
现在设置 LightGBM 的主要相关参数。基本上,该任务是一个二分类 (预测点击或不点击),因此目标函数设置为二分类 logloss,并使用 AUC 指标作为数据集类中不平衡的影响较小的指标。
通常,我们可以调整模型中的
以获得更好的模型性能。
可以在中找到一些关于如何调优这些参数的建议。
- MAX_LEAF = 64
- MIN_DATA = 20
- NUM_OF_TREES = 100
- TREE_LEARNING_RATE = 0.15
- EARLY_STOPPING_ROUNDS = 20
- METRIC = "auc"
- SIZE = "sample"
-
- params = {
- 'task': 'train',
- 'boosting_type': 'gbdt',
- 'num_class': 1,
- 'objective': "binary",
- 'metric': METRIC,
- 'num_leaves': MAX_LEAF,
- 'min_data': MIN_DATA,
- 'boost_from_average': True,
- #set it according to your cpu cores.
- 'num_threads': 20,
- 'feature_fraction': 0.8,
- 'learning_rate': TREE_LEARNING_RATE,
- }
-
这里使用 CSV 作为示例数据输入。我们的示例数据是来自 Criteo数据集。Criteo数据集是一个著名的行业基准数据集,用于开发CTR预测模型,它经常被研究论文采用作为评估数据集。原始数据集对于轻量级演示来说太大了,因此我们从其中抽取一小部分作为演示数据集。
Criteo中有39列特征,其中13列是数值特征(I1-I13),其他26列是类别特征(C1-C26)。
数据传送门:https://www.kaggle.com/c/criteo-display-ad-challenge
- nume_cols = ["I" + str(i) for i in range(1, 14)]
- cate_cols = ["C" + str(i) for i in range(1, 27)]
- label_col = "Label"
-
- header = [label_col] + nume_cols + cate_cols
- with TemporaryDirectory() as tmp:
- all_data = criteo.load_pandas_df(size=SIZE, local_cache_path=tmp, header=header)
- display(all_data.head())
-
首先,从原始的所有数据中切割三个集合
值得注意的是,考虑到Criteo是一种时间序列流数据,这在推荐场景中也很常见,我们按其顺序对数据进行了拆分。
- # split data to 3 sets
- length = len(all_data)
- train_data = all_data.loc[:0.8*length-1]
- valid_data = all_data.loc[0.8*length:0.9*length-1]
- test_data = all_data.loc[0.9*length:]
-
考虑到 LightGBM 可以自行处理低频特征和缺失值,在基本使用中,我们只使用序号编码器对类字符串的分类特征进行编码。
- ord_encoder = ce.ordinal.OrdinalEncoder(cols=cate_cols)
-
- def encode_csv(df, encoder, label_col, typ='fit'):
- if typ == 'fit':
- df = encoder.fit_transform(df)
- else:
- df = encoder.transform(df)
- y = df[label_col].values
- del df[label_col]
- return df, y
-
- train_x, train_y = encode_csv(train_data, ord_encoder, label_col)
- valid_x, valid_y = encode_csv(valid_data, ord_encoder, label_col, 'transform')
- test_x, test_y = encode_csv(test_data, ord_encoder, label_col, 'transform')
-
- print('Train Data Shape: X: {trn_x_shape}; Y: {trn_y_shape}.\nValid Data Shape: X: {vld_x_shape}; Y: {vld_y_shape}.\nTest Data Shape: X: {tst_x_shape}; Y: {tst_y_shape}.\n'
- .format(trn_x_shape=train_x.shape,
- trn_y_shape=train_y.shape,
- vld_x_shape=valid_x.shape,
- vld_y_shape=valid_y.shape,
- tst_x_shape=test_x.shape,
- tst_y_shape=test_y.shape,))
- train_x.head()
-
- Train Data Shape: X: (80000, 39); Y: (80000,).
- Valid Data Shape: X: (10000, 39); Y: (10000,).
- Test Data Shape: X: (10000, 39); Y: (10000,).
-
当超参数和数据都准备就绪时,我们可以创建一个模型:
- lgb_train = lgb.Dataset(train_x, train_y.reshape(-1), params=params, categorical_feature=cate_cols)
- lgb_valid = lgb.Dataset(valid_x, valid_y.reshape(-1), reference=lgb_train, categorical_feature=cate_cols)
- lgb_test = lgb.Dataset(test_x, test_y.reshape(-1), reference=lgb_train, categorical_feature=cate_cols)
- lgb_model = lgb.train(params,
- lgb_train,
- num_boost_round=NUM_OF_TREES,
- early_stopping_rounds=EARLY_STOPPING_ROUNDS,
- valid_sets=lgb_valid,
- categorical_feature=cate_cols)
-
- [1] valid_0's auc: 0.728695
- Training until validation scores don't improve for 20 rounds.
- [2] valid_0's auc: 0.742373
- [3] valid_0's auc: 0.747298
- [4] valid_0's auc: 0.747969
- ...
- [39] valid_0's auc: 0.756966
- Early stopping, best iteration is:
- [19] valid_0's auc: 0.763092
-
现在看看模型的性能如何:
- test_preds = lgb_model.predict(test_x)
- auc = roc_auc_score(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))
- logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds), eps=1e-12)
- res_basic = {"auc": auc, "logloss": logloss}
- print(res_basic)
- sb.glue("res_basic", res_basic)
-
- {'auc': 0.7674356153037237,
- 'logloss': 0.466876775528735}
-
接下来,由于 LightGBM 对密集的数值特征有较好的有效处理能力,我们尝试将原始数据中的所有分类特征转换为数值特征,通过标签编码和二分类编码。同样由于Criteo的序列特性,我们所采用的标签编码是一个一个执行的,也就是说我们是根据每个样本之前的前一个样本的信息(sequence label-encoding和sequence count-encoding)对样本进行有序编码。此外,我们还对低频分类特征进行了筛选,并用数值特征对应列的均值来填补缺失的值。
在 lgb_utils.NumEncoder,主要步骤如下。
注意,上述过程中使用的统计数据只在拟合训练集时更新,而在转换测试集时保持静态,因为测试数据的标签应该是未知的。
- label_col = 'Label'
- num_encoder = lgb_utils.NumEncoder(cate_cols, nume_cols, label_col)
- train_x, train_y = num_encoder.fit_transform(train_data)
- valid_x, valid_y = num_encoder.transform(valid_data)
- test_x, test_y = num_encoder.transform(test_data)
- del num_encoder
- print('Train Data Shape: X: {trn_x_shape}; Y: {trn_y_shape}.\nValid Data Shape: X: {vld_x_shape}; Y: {vld_y_shape}.\nTest Data Shape: X: {tst_x_shape}; Y: {tst_y_shape}.\n'
- .format(trn_x_shape=train_x.shape,
- trn_y_shape=train_y.shape,
- vld_x_shape=valid_x.shape,
- vld_y_shape=valid_y.shape,
- tst_x_shape=test_x.shape,
- tst_y_shape=test_y.shape,))
-
- 2019-04-29 11:26:21,158 [INFO] Filtering and fillna features
- 100%|██████████| 26/26 [00:02<00:00, 12.36it/s]
- 100%|██████████| 13/13 [00:00<00:00, 711.59it/s]
- 2019-04-29 11:26:23,286 [INFO] Ordinal encoding cate features
- 2019-04-29 11:26:24,680 [INFO] Target encoding cate features
- 100%|██████████| 26/26 [00:03<00:00, 6.72it/s]
- 2019-04-29 11:26:28,554 [INFO] Start manual binary encoding
- 100%|██████████| 65/65 [00:04<00:00, 15.87it/s]
- 100%|██████████| 26/26 [00:02<00:00, 8.17it/s]
- 2019-04-29 11:26:35,518 [INFO] Filtering and fillna features
- 100%|██████████| 26/26 [00:00<00:00, 171.81it/s]
- 100%|██████████| 13/13 [00:00<00:00, 2174.25it/s]
- 2019-04-29 11:26:35,690 [INFO] Ordinal encoding cate features
- 2019-04-29 11:26:35,854 [INFO] Target encoding cate features
- 100%|██████████| 26/26 [00:00<00:00, 53.42it/s]
- 2019-04-29 11:26:36,344 [INFO] Start manual binary encoding
- 100%|██████████| 65/65 [00:03<00:00, 20.02it/s]
- 100%|██████████| 26/26 [00:01<00:00, 17.67it/s]
- 2019-04-29 11:26:41,081 [INFO] Filtering and fillna features
- 100%|██████████| 26/26 [00:00<00:00, 158.08it/s]
- 100%|██████████| 13/13 [00:00<00:00, 2203.78it/s]
- 2019-04-29 11:26:41,267 [INFO] Ordinal encoding cate features
- 2019-04-29 11:26:41,429 [INFO] Target encoding cate features
- 100%|██████████| 26/26 [00:00<00:00, 53.08it/s]
- 2019-04-29 11:26:41,922 [INFO] Start manual binary encoding
- 100%|██████████| 65/65 [00:03<00:00, 20.10it/s]
- 100%|██████████| 26/26 [00:01<00:00, 18.37it/s]
-
- Train Data Shape: X: (80000, 268); Y: (80000, 1).
- Valid Data Shape: X: (10000, 268); Y: (10000, 1).
- Test Data Shape: X: (10000, 268); Y: (10000, 1).
-
- lgb_train = lgb.Dataset(train_x, train_y.reshape(-1), params=params)
- lgb_valid = lgb.Dataset(valid_x, valid_y.reshape(-1), reference=lgb_train)
- lgb_model = lgb.train(params,
- lgb_train,
- num_boost_round=NUM_OF_TREES,
- early_stopping_rounds=EARLY_STOPPING_ROUNDS,
- valid_sets=lgb_valid)
-
- [1] valid_0's auc: 0.731759
- Training until validation scores don't improve for 20 rounds.
- [2] valid_0's auc: 0.747705
- [3] valid_0's auc: 0.751667
- ...
- [58] valid_0's auc: 0.770408
- [59] valid_0's auc: 0.770489
- Early stopping, best iteration is:
- [39] valid_0's auc: 0.772136
-
- test_preds = lgb_model.predict(test_x)
- auc = roc_auc_score(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))
- logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds), eps=1e-12)
- res_optim = {"auc": auc, "logloss": logloss}
- print(res_optim)
- sb.glue("res_optim", res_optim)
-
- {'auc': 0.7757371640011422,
- 'logloss': 0.4606505068849181}
-
现在我们完成了LightGBM的基本训练和测试,接下来保存并重新加载模型,然后再次评估它。
- with TemporaryDirectory() as tmp:
- save_file = os.path.join(tmp, r'finished.model')
- lgb_model.save_model(save_file)
- loaded_model = lgb.Booster(model_file=save_file)
-
- # eval the performance again
- test_preds = loaded_model.predict(test_x)
-
- auc = roc_auc_score(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))
- logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds), eps=1e-12)
- print({"auc": auc, "logloss": logloss})
-
- {'auc': 0.7757371640011422,
- 'logloss': 0.4606505068849181}
-