Feature Extraction: Vectorizing Feature Dicts and Feature Hashing

Date: 2019-11-11

Note: these are study notes based on material from 人工智能研究网 (an AI research site).

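The sklearn.feature_extraction module provides tools for extracting, from raw data such as text and images, feature vectors that machine learning algorithms can process directly.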

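Feature extraction and feature selection are not the same thing. The former transforms arbitrary data into numerical features that machine learning algorithms can use; the latter is a machine learning technique applied to the feature space, i.e. a further transformation of features that already exist.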
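To make the distinction concrete, here is a minimal sketch that runs both steps in sequence. The sample dicts, the labels, and the choice of SelectKBest with chi2 as the selection step are assumptions made purely for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical raw samples and labels, invented for this sketch.
raw = [
    {'color': 'red', 'size': 3},
    {'color': 'blue', 'size': 1},
    {'color': 'red', 'size': 2},
    {'color': 'blue', 'size': 4},
]
y = [1, 0, 1, 0]

# Extraction: dicts -> numeric matrix (color=blue, color=red, size).
X = DictVectorizer(sparse=False).fit_transform(raw)
print(X.shape)

# Selection: keep the 2 columns that score highest under the chi2 test.
X_sel = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_sel.shape)
```

Extraction changes the kind of data (dicts become numbers), while selection only decides which of the resulting columns to keep.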

Loading Features From Dicts
Features hashing
Text Feature Extraction
Image Feature Extraction

Loading Features From Dicts

The DictVectorizer class converts feature arrays represented as lists of standard Python dict objects into the NumPy/SciPy representation used by scikit-learn estimators.

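Although not particularly fast to process, Python's dict is convenient to use and is naturally sparse (absent features need not be stored). The dict form also keeps each feature value paired with its name.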

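DictVectorizer implements one-of-K, also called "one-hot", encoding for categorical (nominal) features. A categorical feature is an "attribute-value" pair in which the value is drawn from a finite list of discrete possibilities with no ordering (e.g. gender, topic category).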

In the following example, 'city' is a categorical attribute while 'temperature' is a typical numerical feature.

from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.0},
    {'city': 'London', 'temperature': 12.0},
    {'city': 'San Francisco', 'temperature': 18.0},
]

vec = DictVectorizer()

# One column per city value (one-hot) plus one numeric 'temperature' column.
print(vec.fit_transform(measurements).toarray())
# scikit-learn >= 1.0; older releases used vec.get_feature_names().
print(vec.get_feature_names_out())
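The point about sparse storage can be illustrated with dicts that do not all carry the same keys. This is only a sketch: the samples, including the 'humidity' feature, are made up for the example.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical samples: each dict stores only the features it actually has.
samples = [
    {'city': 'Dubai', 'temperature': 33.0},
    {'city': 'London'},                      # no temperature recorded
    {'temperature': 18.0, 'humidity': 0.6},  # no city recorded
]

vec = DictVectorizer()          # sparse output by default
X = vec.fit_transform(samples)  # scipy.sparse matrix

print(X.shape)
print(sorted(vec.vocabulary_))  # feature names such as 'city=Dubai'
print(X.toarray())              # absent features show up as zeros

# inverse_transform maps each row back to a dict of its non-zero features.
print(vec.inverse_transform(X))
```

Missing keys simply become zeros in the matrix, so samples do not have to share a common set of features.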

Features hashing

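The FeatureHasher class is a high-speed, low-memory vectorizer that implements feature hashing.
An ordinary vectorizer builds a hash table of the features encountered during training, whereas FeatureHasher applies a hash function directly to each feature to determine its column index in the sample matrix. This is faster and saves memory, but the hasher does not remember what the input features looked like, so it has no inverse_transform method.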
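A minimal usage sketch follows. The sample dicts and the small n_features value are assumptions chosen so that the output is easy to print.

```python
from sklearn.feature_extraction import FeatureHasher

# n_features fixes the number of output columns up front; 8 is an arbitrary
# small value chosen for display.
hasher = FeatureHasher(n_features=8, input_type='dict')

# Hypothetical samples, invented for this sketch.
samples = [
    {'dog': 1, 'cat': 2, 'elephant': 4},
    {'dog': 2, 'run': 5},
]

# The hasher is stateless, so transform can be called without fitting.
X = hasher.transform(samples)
print(X.shape)
print(X.toarray())

# No vocabulary is stored, so there is no way back from columns to names.
print(hasattr(hasher, 'inverse_transform'))
```

Because the column count is fixed in advance, the output shape does not depend on how many distinct feature names appear in the data.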

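Because the hash function can cause collisions between otherwise unrelated features, a signed hash function is used: for each feature, the sign of its hash value determines the sign of the value stored in the output matrix. This way collisions tend to cancel out rather than accumulate error, and the expected mean of any output value is zero.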
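The idea can be sketched in pure Python. This is a toy illustration, not scikit-learn's implementation: md5 is used here as a stand-in hash function, and the helper names signed_hash and hash_features are hypothetical.

```python
import hashlib

def signed_hash(name):
    """Return a signed 32-bit integer derived from md5 (stand-in hash)."""
    digest = hashlib.md5(name.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'big', signed=True)

def hash_features(pairs, n_features=8):
    """Map (feature, value) pairs to a fixed-length row using a signed hash."""
    row = [0.0] * n_features
    for name, value in pairs:
        h = signed_hash(name)
        column = abs(h) % n_features     # magnitude picks the column
        sign = 1.0 if h >= 0 else -1.0   # sign of the hash picks the sign
        row[column] += sign * value      # colliding features add up here
    return row

print(hash_features([('feat1', 1), ('feat2', 1), ('feat3', 1)]))
```

When two different features land in the same column, their contributions carry independently chosen signs, so on average they offset each other instead of piling up.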

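If non_negative=True is passed to the constructor, the absolute value is taken. This undoes part of the collision handling, but it allows the hashed output to be passed to estimators that only accept non-negative features, such as the sklearn.naive_bayes.MultinomialNB classifier and the sklearn.feature_selection.chi2 feature selector. Newer scikit-learn releases removed non_negative; passing alternate_sign=False, which disables the sign flipping altogether, serves the same purpose.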
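As a sketch of how hashed features can feed a learner that requires non-negative input, the following uses alternate_sign=False, the switch available in current scikit-learn releases. The documents and labels are invented for the example.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count dicts and labels, invented for this sketch.
docs = [
    {'free': 2, 'offer': 1},
    {'meeting': 1, 'agenda': 2},
    {'free': 1, 'prize': 3},
    {'agenda': 1, 'minutes': 1},
]
labels = [1, 0, 1, 0]

# alternate_sign=False keeps every stored value non-negative.
hasher = FeatureHasher(n_features=16, input_type='dict', alternate_sign=False)
X = hasher.transform(docs)
print(X.toarray().min() >= 0)

# MultinomialNB requires non-negative features, which X now satisfies.
clf = MultinomialNB().fit(X, labels)
prediction = clf.predict(hasher.transform([{'free': 1}]))
print(prediction)
```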

FeatureHasher accepts mappings (dicts or dict-like containers), (feature, value) pairs, or strings as input, depending on the constructor parameter input_type.

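A mapping is treated as a list of (feature, value) pairs, while a single string carries an implicit value of 1, so ['feat1', 'feat2', 'feat3'] is interpreted as the (feature, value) list [('feat1', 1), ('feat2', 1), ('feat3', 1)]. If a feature occurs several times in one sample, the associated values are summed (for example, ('feat', 2) and ('feat', 3.5) become ('feat', 5.5)).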
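The two rules above (strings count as 1, repeated features are summed) can be checked with a short sketch. The n_features value is arbitrary, and alternate_sign=False is an assumption made here so that the sums appear as plain non-negative numbers.

```python
from sklearn.feature_extraction import FeatureHasher

# input_type='string': every string contributes an implicit value of 1.
string_hasher = FeatureHasher(n_features=16, input_type='string',
                              alternate_sign=False)
X_str = string_hasher.transform([['feat1', 'feat2', 'feat3'],
                                 ['feat1', 'feat1']])
print(X_str.toarray().sum(axis=1))  # total weight per sample

# input_type='pair': repeated features within a sample are accumulated.
pair_hasher = FeatureHasher(n_features=16, input_type='pair',
                            alternate_sign=False)
X_pair = pair_hasher.transform([[('feat', 2), ('feat', 3.5)]])
print(X_pair.toarray().max())       # 2 + 3.5 stored in a single column
```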

The output of FeatureHasher is a scipy.sparse matrix in CSR format.

Feature hashing can be used for document classification, but unlike text.CountVectorizer, FeatureHasher does no word splitting or any other preprocessing apart from Unicode-to-UTF-8 encoding.
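Since FeatureHasher does not tokenize, the caller has to split documents into tokens first. Below is a minimal sketch under that assumption; the tokens helper and its regular expression are hypothetical and not part of scikit-learn.

```python
import re
from sklearn.feature_extraction import FeatureHasher

def tokens(doc):
    """Hypothetical tokenizer: lowercase words of two or more characters."""
    return re.findall(r'\b\w\w+\b', doc.lower())

# Example documents invented for this sketch.
raw_docs = ['This is the first document.', 'And the third one.']

hasher = FeatureHasher(n_features=2**10, input_type='string')
X = hasher.transform([tokens(d) for d in raw_docs])
print(X.shape)
```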

Text Feature Extraction

The Bag of Words representation
Sparsity
Common Vectorizer usage
TF-idf term weighting
Decoding text files
Limitations of the Bag of Words representation
Vectorizing a large text corpus with the hashing trick
Performing out-of-core scaling with HashingVectorizer
Customizing the vectorizer classes

The Bag of Words representation

Sparsity

Common Vectorizer usage

from sklearn.feature_extraction.text import CountVectorizer

# min_df=1 keeps every term that appears in at least one document.
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)

corpus = [
    'This is the first documents.',
    'This is the second documents.',
    'And the third document.',
    'Is this the first documents?',
]

# Learn the vocabulary and build the sparse document-term count matrix.
X = vectorizer.fit_transform(corpus)
print(X)

With the default settings, extracted tokens must be at least two characters long; anything shorter, such as 'a', is ignored.

analyze = vectorizer.build_analyzer()
# The single-character token 'a' is dropped and everything is lowercased.
print(analyze('This is a text document to analyze.'))
# ['this', 'is', 'text', 'document', 'to', 'analyze']

Each term found by the analyzer during fit is assigned a unique integer index that corresponds to one column of the resulting feature matrix. All terms are lowercased.

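The column that a given term occupies in the matrix can be looked up through the vectorizer's vocabulary_ attribute, which maps each term to its column index.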
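A short sketch of that lookup follows. The corpus here is an assumption made for the example and differs slightly from the one used earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Example corpus, invented for this sketch.
corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ is a dict mapping each term to its column index.
print(vectorizer.vocabulary_.get('document'))  # 1
# Terms that never made it into the vocabulary return None.
print(vectorizer.vocabulary_.get('a'))         # None
```

Columns are numbered in alphabetical order of the terms, which is why 'document' follows 'and' at index 1.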

Hence, words that were not seen in the training corpus are completely ignored by later calls to the transform method.
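This behaviour is easy to confirm with a sketch in which none of the words of the new sentence occur in the (assumed) training corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Example training corpus, invented for this sketch.
corpus = [
    'This is the first document.',
    'And the third one.',
]
vectorizer = CountVectorizer().fit(corpus)

# None of these words were seen during fit, so every count stays at zero.
unseen = vectorizer.transform(['Something completely new.']).toarray()
print(unseen)
```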