In the previous section we used the tools provided by Scikit-learn to fetch raw text data from the web and from disk, and then extracted text features with the tf-idf method. You can review the whole process in the example below.
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Select the text categories to include in the analysis
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

# Load the raw data from disk
twenty_train = load_files("/mnt/vol0/sklearn/20news-bydate-train",
                          categories=categories,
                          load_content=True,
                          encoding="latin1",
                          decode_error="strict",
                          shuffle=True, random_state=42)

# Count word occurrences
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

# Extract text features with the tf-idf method
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Print the dimensions of the feature matrix
print(X_train_tfidf.shape)
In this section, building on that feature-extraction work, we move on to the next step of text mining: training and tuning a classifier.
There are many machine learning algorithms suitable for text classification, and naive Bayes (Naïve Bayes) is an excellent representative. Scikit-learn includes several refined variants of the naive Bayes algorithm; the one best suited to word-count statistics is called multinomial naive Bayes (Multinomial Naïve Bayes), and it can be invoked as follows.
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

categories = ["alt.atheism", "soc.religion.christian",
              "comp.graphics", "sci.med"]
twenty_train = load_files("/mnt/vol0/sklearn/20news-bydate-train",
                          categories=categories,
                          load_content=True,
                          encoding="latin1",
                          decode_error="strict",
                          shuffle=True, random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Train a multinomial naive Bayes classifier on the tf-idf features
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
print("Classifier details:")
print(clf)
This completes the training of a classifier. To have the classifier predict the category of a new document, we must process the new document with exactly the same data-processing steps. In the example below we use a set of custom strings and ask the classifier to predict their categories. The strings must be processed with the transform method before prediction (not fit_transform, because the vocabulary and idf weights learned from the training data must be reused).
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

categories = ["alt.atheism", "soc.religion.christian",
              "comp.graphics", "sci.med"]
twenty_train = load_files("/mnt/vol0/sklearn/20news-bydate-train",
                          categories=categories,
                          load_content=True,
                          encoding="latin1",
                          decode_error="strict",
                          shuffle=True, random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

# New strings to classify; replace them with any English sentences you like
docs_new = ["Nvidia is awesome!"]

# Process the strings with the transformers fitted on the training data
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

# Make predictions
predicted = clf.predict(X_new_tfidf)

# Print the predictions
for doc, category in zip(docs_new, predicted):
    print("%r => %s" % (doc, twenty_train.target_names[category]))
Naive Bayes is one of the foundational machine learning algorithms. Although it is simple, and at times genuinely naive, it runs remarkably fast and performs well enough to stand alongside many far more complex algorithms.
To streamline the cleaning of raw data, feature extraction, and classification, Scikit-learn provides the Pipeline class, which wraps the whole classifier-building process into a single object. A classifier can be created by constructing a Pipeline, with every feature-extraction and classification component specified directly at construction time, which greatly improves coding and debugging efficiency, as shown below:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

categories = ["alt.atheism", "soc.religion.christian",
              "comp.graphics", "sci.med"]
twenty_train = load_files("/mnt/vol0/sklearn/20news-bydate-train",
                          categories=categories,
                          load_content=True,
                          encoding="latin1",
                          decode_error="strict",
                          shuffle=True, random_state=42)

# Build the Pipeline
text_clf = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf", MultinomialNB()),
                     ])

# Train the classifier
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

# Print the classifier details
print(text_clf)
We can use the classifier built this way to predict on the test dataset, and then use functions provided by Numpy to score the results:
import numpy as np
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

if "text_clf" not in dir():
    categories = ["alt.atheism", "soc.religion.christian",
                  "comp.graphics", "sci.med"]
    twenty_train = load_files("/mnt/vol0/sklearn/20news-bydate-train",
                              categories=categories, load_content=True,
                              encoding="latin1", decode_error="strict",
                              shuffle=True, random_state=42)
    text_clf = Pipeline([("vect", CountVectorizer()),
                         ("tfidf", TfidfTransformer()),
                         ("clf", MultinomialNB()),
                         ])
    text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

# Load the test data
twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                         categories=categories,
                         load_content=True,
                         encoding='latin1',
                         decode_error='strict',
                         shuffle=True, random_state=42)
docs_test = twenty_test.data

# Predict on the test data
predicted = text_clf.predict(docs_test)

# Compute the accuracy of the predictions
print("Accuracy:")
print(np.mean(predicted == twenty_test.target))
If the code above runs correctly, we should obtain an accuracy of 83.4%. There are many ways to improve this score; a very promising one is to switch to support vector machines (SVM, Support Vector Machine), widely regarded as one of the best algorithms for text classification, even though it is a bit slower than naive Bayes. We can make the switch simply by plugging a different classifier object into the Pipeline:
import numpy as np
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
if 'twenty_train' not in dir():
    twenty_train = load_files('/mnt/vol0/sklearn/20news-bydate-train',
                              categories=categories, load_content=True,
                              encoding='latin1', decode_error='strict',
                              shuffle=True, random_state=42)
if 'twenty_test' not in dir():
    twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                             categories=categories, load_content=True,
                             encoding='latin1', decode_error='strict',
                             shuffle=True, random_state=42)
docs_test = twenty_test.data

# A linear SVM trained with stochastic gradient descent; note that on
# scikit-learn >= 0.19 the n_iter argument is replaced by max_iter
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge',
                                           penalty='l2',
                                           alpha=1e-3,
                                           n_iter=5,
                                           random_state=42)),
                     ])
_ = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
print("Accuracy:")
print(np.mean(predicted == twenty_test.target))
As we can see, the SVM approach improves considerably on the accuracy of naive Bayes.
Scikit-learn offers further evaluation utilities for analyzing classifier performance in more detail. As shown below, we can obtain the precision, recall, and F-score of the predictions for every category, together with their confusion matrix.
from sklearn import metrics
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
if 'predicted' not in dir():
    twenty_train = load_files('/mnt/vol0/sklearn/20news-bydate-train',
                              categories=categories, load_content=True,
                              encoding='latin1', decode_error='strict',
                              shuffle=True, random_state=42)
    twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                             categories=categories, load_content=True,
                             encoding='latin1', decode_error='strict',
                             shuffle=True, random_state=42)
    docs_test = twenty_test.data
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5,
                                               random_state=42)),
                         ])
    _ = text_clf.fit(twenty_train.data, twenty_train.target)
    predicted = text_clf.predict(docs_test)

print("Classification report:")
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))
print("Confusion matrix:")
print(metrics.confusion_matrix(twenty_test.target, predicted))
As expected, the confusion matrix reveals that posts from the atheism (alt.atheism) and Christianity (soc.religion.christian) groups are harder to tell apart from each other than either is from computer graphics (comp.graphics).
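The raw confusion matrix is just an array of counts, which makes it easy to lose track of which row is which class. As a minimal sketch (assuming the twenty_test and predicted objects from the example above, and that pandas is installed), we can label the rows and columns with the category names:

import pandas as pd
from sklearn import metrics

# Rows are true categories, columns are predicted categories
cm = metrics.confusion_matrix(twenty_test.target, predicted)
cm_df = pd.DataFrame(cm,
                     index=twenty_test.target_names,
                     columns=twenty_test.target_names)
print(cm_df)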
We have already encountered quite a few parameters in the machine learning workflow, such as use_idf in TfidfTransformer. Classifiers tend to have many parameters of their own: naive Bayes has the smoothing parameter alpha, and the SVM has the penalty parameter alpha along with a number of other configurable options. Inside a Pipeline, every such parameter is reached through the name of its step, as the short sketch below shows.
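This is a minimal sketch of that naming convention, assuming the text_clf Pipeline built earlier (with step names vect, tfidf, and clf). Each parameter is addressed as step__parameter, and the grid search below uses exactly the same names:

# Pipeline parameters are addressed as <step>__<parameter>
text_clf.set_params(tfidf__use_idf=False, clf__alpha=1e-2)

# get_params exposes the same flattened names
print(text_clf.get_params()['clf__alpha'])  # -> 0.01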
To spare ourselves the tedious work of adjusting this whole series of parameters by hand, we can use grid search to find the best value of each one. As the example below shows, when building the SVM classifier we can try the following settings: single words versus word pairs, idf weighting on or off, and a penalty parameter of 0.01 versus 0.001.
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
if 'text_clf' not in dir():
    twenty_train = load_files('/mnt/vol0/sklearn/20news-bydate-train',
                              categories=categories, load_content=True,
                              encoding='latin1', decode_error='strict',
                              shuffle=True, random_state=42)
    twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                             categories=categories, load_content=True,
                             encoding='latin1', decode_error='strict',
                             shuffle=True, random_state=42)
    docs_test = twenty_test.data
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5,
                                               random_state=42)),
                         ])

from sklearn.grid_search import GridSearchCV
# On sklearn 0.18.1, import the grid search library this way instead:
# from sklearn.model_selection import GridSearchCV

# Parameters to include in the search
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
              }

# Build the grid-search classifier
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
print(gs_clf)
Clearly, carrying out such a search one combination at a time consumes considerable computing resources. If we have a multi-core CPU, we can run the 8 tasks in parallel (each of the three parameters takes two values, giving 2 × 2 × 2 = 8 parameter combinations), which requires changing the n_jobs parameter. If we set it to -1, the grid search will automatically detect how many CPU cores the environment has and use all of them.
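To see where the count of 8 comes from, here is a small sketch that enumerates the grid explicitly. It assumes a scikit-learn version with the model_selection module (on older versions, ParameterGrid lives in sklearn.grid_search instead):

from sklearn.model_selection import ParameterGrid

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
              }

# One dict per parameter combination, one grid-search task each
print(len(list(ParameterGrid(parameters))))  # -> 8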
A concrete grid-search model behaves just like an ordinary classifier, and we can use a smaller slice of the data to speed up its training. Calling the fit method on the GridSearchCV object yields a classifier like the ones in the earlier examples, which we can then use for prediction.
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
              }
if 'gs_clf' not in dir():
    twenty_train = load_files('/mnt/vol0/sklearn/20news-bydate-train',
                              categories=categories, load_content=True,
                              encoding='latin1', decode_error='strict',
                              shuffle=True, random_state=42)
    twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                             categories=categories, load_content=True,
                             encoding='latin1', decode_error='strict',
                             shuffle=True, random_state=42)
    docs_test = twenty_test.data
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5,
                                               random_state=42)),
                         ])
    gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

# Train the grid-search classifier on part of the training data
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

# Inspect the prediction for a new text; change the string to see other results
print(twenty_train.target_names[
    gs_clf.predict(['An apple a day keeps doctor away'])[0]])
The fitted classifier also carries two attributes, best_score_ and best_params_, which record the best score achieved and the parameter configuration that achieved it. We can additionally browse gs_clf.cv_results_ for more detailed search results (a feature added in sklearn 0.18.1); this attribute can easily be imported into pandas for deeper study, as sketched after the following example.
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
              }
if 'gs_clf' not in dir():
    twenty_train = load_files('/mnt/vol0/sklearn/20news-bydate-train',
                              categories=categories, load_content=True,
                              encoding='latin1', decode_error='strict',
                              shuffle=True, random_state=42)
    twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                             categories=categories, load_content=True,
                             encoding='latin1', decode_error='strict',
                             shuffle=True, random_state=42)
    docs_test = twenty_test.data
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, n_iter=5,
                                               random_state=42)),
                         ])
    gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
print("Best score: %r" % (gs_clf.best_score_))
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
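As a minimal sketch of working with cv_results_ (assuming scikit-learn 0.18.1 or later, that gs_clf has been fitted as above, and that pandas is installed), the whole search history can be loaded into a DataFrame and ranked:

import pandas as pd

# Every row is one parameter combination; columns include the mean
# cross-validated test score and the time taken to fit
results = pd.DataFrame(gs_clf.cv_results_)
print(results[['params', 'mean_test_score']].sort_values(
    'mean_test_score', ascending=False))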
At this point we have worked through the full process of text classification with machine learning: fetching and reading data from the web, cleaning the raw data and extracting feature vectors, building classifiers with different algorithms, and tuning their parameters with grid search. These are among the most common topics in supervised machine learning. More involved problems, such as Chinese text processing and text clustering, will be discussed in later articles.
(The content of this lesson is based on Scikit-Learn - Working With Text Data. Please credit the source when republishing.)