When doing NLP with tf-idf, sklearn provides two classes for the job: CountVectorizer and TfidfTransformer. This post explains and tests both.
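Both CountVectorizer and TfidfTransformer use the fit_transform method on the training data and the transform method on the test set. "fit" means training: the objects are fitted on the training set first and then applied to the test set. If you call fit_transform on the test set as well, the vocabulary and IDF weights are re-learned from the test data, so the test features no longer match the training features and the results are wrong.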
# Variables: content_train is the training set, content_test is the test set
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
vectorizer = CountVectorizer()
tfidftransformer = TfidfTransformer()
# Training: use fit_transform
count_train = vectorizer.fit_transform(content_train)
tfidf = tfidftransformer.fit_transform(count_train)
# Testing: use transform only
count_test = vectorizer.transform(content_test)
test_tfidf = tfidftransformer.transform(count_test)
# tf-idf weights of the test set, as a dense array
test_weight = test_tfidf.toarray()
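To make the fit_transform versus transform point concrete, here is a minimal sketch. The two toy corpora are made up for illustration and are not from the original post. It compares the number of feature columns produced by the correct flow with the number produced when a vectorizer is refitted on the test set.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpora, invented for this illustration only
content_train = ["the cat sat on the mat", "the dog chased the cat"]
content_test = ["a bird sat on the fence"]

vectorizer = CountVectorizer()
tfidftransformer = TfidfTransformer()

# Correct: fit on the training set, then reuse the fitted objects on the test set
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(content_train))
test_tfidf = tfidftransformer.transform(vectorizer.transform(content_test))

# Wrong: refitting on the test set learns a different vocabulary
wrong_vectorizer = CountVectorizer()
wrong_test_counts = wrong_vectorizer.fit_transform(content_test)

print("train columns:", tfidf.shape[1])
print("test columns (transform):", test_tfidf.shape[1])
print("test columns (refit):", wrong_test_counts.shape[1])
```

With transform, the test matrix has the same columns as the training matrix, so a model trained on one can score the other. After a refit, the columns refer to a different vocabulary, and even a matching column count would not mean matching features.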
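We often need to save the tf-idf vocabulary so that the tf-idf of the test set can be computed later. Note that there are two ways to save sklearn objects: pickle and joblib. Here we use pickle.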
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# segmentWord is the author's word-segmentation helper; it is not defined in this post
train_content = segmentWord(X_train)
test_content = segmentWord(X_test)

# decode_error="replace" must be set; it is needed when saving the training-set features
vectorizer = CountVectorizer(decode_error="replace")
tfidftransformer = TfidfTransformer()

# Note: training must use vectorizer.fit_transform and tfidftransformer.fit_transform;
# prediction must use vectorizer.transform and tfidftransformer.transform
vec_train = vectorizer.fit_transform(train_content)
tfidf = tfidftransformer.fit_transform(vec_train)

# Save the fitted vectorizer's vocabulary and the fitted tfidftransformer for use at prediction time
feature_path = 'models/feature.pkl'
with open(feature_path, 'wb') as fw:
    pickle.dump(vectorizer.vocabulary_, fw)

tfidftransformer_path = 'models/tfidftransformer.pkl'
with open(tfidftransformer_path, 'wb') as fw:
    pickle.dump(tfidftransformer, fw)
Note: both the vectorizer and the tfidftransformer must be saved, and only after fit_transform has been called, since that is what indicates they have been fitted on the training set.
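For comparison, the joblib route mentioned earlier might look like the sketch below. This is an illustration under assumptions, not the author's code: the toy corpus, the temporary directory, and the file names are made up, and the whole fitted vectorizer is saved here rather than only its vocabulary.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy training corpus, invented for this illustration only
train_content = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = CountVectorizer(decode_error="replace")
tfidftransformer = TfidfTransformer()
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(train_content))

# Hypothetical file locations inside a temporary directory
model_dir = tempfile.mkdtemp()
vec_path = os.path.join(model_dir, "vectorizer.joblib")
tfidf_path = os.path.join(model_dir, "tfidftransformer.joblib")

# Save both fitted objects, only after they have been fitted
joblib.dump(vectorizer, vec_path)
joblib.dump(tfidftransformer, tfidf_path)

# Load them back for prediction
loaded_vec = joblib.load(vec_path)
loaded_tfidf = joblib.load(tfidf_path)
print(sorted(loaded_vec.vocabulary_))
```

Saving the whole vectorizer keeps its constructor settings together with the vocabulary, at the cost of a file that is tied more closely to the sklearn version that wrote it.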
# Load the features (the saved vocabulary)
feature_path = 'models/feature.pkl'
with open(feature_path, 'rb') as fr:
    loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(fr))

# Load the TfidfTransformer
tfidftransformer_path = 'models/tfidftransformer.pkl'
with open(tfidftransformer_path, 'rb') as fr:
    tfidftransformer = pickle.load(fr)

# Use transform for testing; test_content is the test data, as a list
test_tfidf = tfidftransformer.transform(loaded_vec.transform(test_content))
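One way to check that the reload step reproduces the original features is a small round trip. The sketch below is an illustration under assumptions: it uses made-up toy corpora and serializes to in-memory bytes with pickle.dumps and pickle.loads instead of the files under models/, so it can run anywhere.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpora, invented for this illustration only
train_content = ["the cat sat on the mat", "the dog chased the cat"]
test_content = ["the dog sat on the mat", "a bird chased the cat"]

# Fit on the training set and compute the reference test features
vectorizer = CountVectorizer(decode_error="replace")
tfidftransformer = TfidfTransformer()
tfidftransformer.fit(vectorizer.fit_transform(train_content))
expected = tfidftransformer.transform(vectorizer.transform(test_content))

# Serialize the vocabulary and the fitted transformer (bytes stand in for the .pkl files)
vocab_bytes = pickle.dumps(vectorizer.vocabulary_)
tfidf_bytes = pickle.dumps(tfidftransformer)

# Rebuild the vectorizer from the vocabulary and reload the transformer
loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.loads(vocab_bytes))
loaded_tfidf = pickle.loads(tfidf_bytes)
restored = loaded_tfidf.transform(loaded_vec.transform(test_content))

# The restored features should match the reference features
same = np.allclose(expected.toarray(), restored.toarray())
print("round trip matches:", same)
```

Rebuilding from the vocabulary alone restores only the word-to-column mapping. Any non-default CountVectorizer settings used during training (for example a custom tokenizer or ngram_range) have to be passed again when constructing loaded_vec.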