Simplifying the Modeling Process with sklearn's Pipeline

Date: 2020-05-17


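Many frameworks provide a pipeline mechanism: a series of operations is wrapped into a single workflow, and when it is invoked the steps run in the planned order. For example, netty has ChannelPipeline, and TensorFlow's computation graph follows a similar idea.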
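The general idea can be sketched in a few lines of plain Python. This is only a toy illustration, not how sklearn, netty, or TensorFlow implement it; the class name `SimplePipeline` and its `run` method are hypothetical names made up for this example.

```python
class SimplePipeline:
    """Toy pipeline: keeps an ordered list of (name, function) steps."""

    def __init__(self, steps):
        self.steps = list(steps)

    def run(self, value):
        # Feed the output of each step into the next one, in order.
        for _name, func in self.steps:
            value = func(value)
        return value


# Three named steps, executed in the order they were declared.
pipeline = SimplePipeline([
    ("strip", str.strip),   # remove surrounding whitespace
    ("lower", str.lower),   # normalize case
    ("split", str.split),   # split on whitespace into tokens
])

result = pipeline.run("  Hello Pipeline World  ")
print(result)
```

The caller only declares the steps once and then calls `run`; the ordering and the hand-off between steps are handled by the pipeline object.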

The following briefly shows how to use Pipeline in sklearn:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X is assumed to be a pandas DataFrame that contains the 'Age' and
# 'Embarked' columns, and y the target labels; both are loaded elsewhere.

# Preprocessor for categorical features: impute, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Preprocessor for numerical features
numerical_transformer = SimpleImputer(strategy='constant')

# Apply the numerical and categorical preprocessors to their respective columns
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, ['Age']),
    ('cat', categorical_transformer, ['Embarked']),
])

# Define the Pipeline from the preprocessor and the chosen model
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Use the pipeline
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train; passing copies is a precaution that keeps the original data untouched
my_pipeline.fit(X_train.copy(), y_train.copy())

# Predict
preds = my_pipeline.predict(X_valid)
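The snippet above does not show where X and y come from. As a rough, self-contained sketch, the same pipeline can be run on a small synthetic table. The data below is randomly generated and purely hypothetical (it is not the author's dataset), so the resulting scores carry no meaning. The sketch also shows that, because preprocessing and model live in one object, the whole pipeline can be passed to `cross_val_score`, which refits the preprocessing on each training fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with some missing values in both columns.
rng = np.random.default_rng(0)
n = 100
age = rng.integers(1, 80, size=n).astype(float)
age[::10] = np.nan
embarked = rng.choice(["S", "C", "Q"], size=n).astype(object)
embarked[::15] = np.nan
X = pd.DataFrame({"Age": age, "Embarked": embarked})
y = pd.Series(rng.integers(0, 2, size=n))

# Same structure as the pipeline above.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
numerical_transformer = SimpleImputer(strategy="constant")
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, ["Age"]),
    ("cat", categorical_transformer, ["Embarked"]),
])
my_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Hold-out evaluation: fit on the training split, score on the validation split.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)
my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)
accuracy = my_pipeline.score(X_valid, y_valid)

# Cross-validation: the full pipeline is cloned and refit on every fold.
scores = cross_val_score(my_pipeline, X, y, cv=5)
print("hold-out accuracy:", accuracy)
print("cross-validation accuracy per fold:", scores)
```

Since the labels here are random, the accuracy should hover around chance level; the point is only that imputation, encoding, and the model are fitted and applied together with a single `fit` / `predict` call.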