NLTK Part 1: Introduction

NLTK (Natural Language Toolkit) is a powerful natural language processing toolkit. It provides a set of natural language algorithms, such as tokenization, part-of-speech tagging, stemming, named entity recognition, and classification.

Installing and importing NLTK:

pip install nltk

import nltk

I. Tokenization

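A text consists of paragraphs, a paragraph consists of sentences, and a sentence consists of words. Tokenization is the first step of text analysis: it breaks a text into smaller units such as words or sentences. Each unit is called a token; a token is a word within a sentence, or a sentence within a paragraph. NLTK supports both sentence tokenization and word tokenization.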

1. Sentence tokenization

Sentence tokenization splits a paragraph into sentences:

from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=sent_tokenize(text)
print(tokenized_text)

Result of sentence tokenization:

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 
'The sky is pinkish-blue.', "You shouldn't eat cardboard"]

2. Word tokenization

Word tokenization splits a sentence into words:

from nltk.tokenize import word_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=word_tokenize(text)
print(tokenized_text)

Result of word tokenization:

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 
'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.',
'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']

Note that after tokenization, punctuation marks are also included in the result as separate tokens.
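If punctuation tokens are not needed at all, one option is to leave them out at tokenization time. The following is a minimal sketch using NLTK's RegexpTokenizer, which this article does not otherwise use. The pattern \w+ is an assumption that a token is any run of letters, digits, or underscores, so it also strips the period from abbreviations and splits contractions and hyphenated words.

```python
from nltk.tokenize import RegexpTokenizer

# Assumption: a token is any run of word characters (letters, digits, underscore).
tokenizer = RegexpTokenizer(r"\w+")

text = "Hello Mr. Smith, how are you doing today?"
tokens = tokenizer.tokenize(text)
print(tokens)
```

Because this tokenizer is purely regex-based, it does not depend on any downloaded NLTK data.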

II. Processing tokens

Processing tokens involves removing punctuation, removing stop words, and lexicon normalization.

1. Removing punctuation

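Apply the following call to each token to strip punctuation from the string. string.punctuation contains all ASCII punctuation characters, and the translation table replaces each of them in the token with a space.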

import string

s='abc.'
s.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
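Replacing punctuation with spaces and deleting it outright give different results, which matters when tokens are compared later. The sketch below uses only the standard library to show both; the helper name strip_punctuation and the sample token list are made up for illustration.

```python
import string

# Map every punctuation character to a space (the approach described above).
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
# Delete every punctuation character outright (third argument of maketrans).
to_nothing = str.maketrans("", "", string.punctuation)

def strip_punctuation(tokens):
    """Hypothetical helper: delete punctuation and drop tokens that become empty."""
    cleaned = (tok.translate(to_nothing) for tok in tokens)
    return [tok for tok in cleaned if tok]

spaced = "abc.".translate(to_space)      # trailing space remains
deleted = "abc.".translate(to_nothing)   # punctuation is gone entirely
print(repr(spaced), repr(deleted))

result = strip_punctuation(["Hello", "Mr.", "Smith", ",", "today", "?"])
print(result)
```

Deleting is usually the more convenient choice for a token list, since pure-punctuation tokens become empty strings that can be filtered out in the same pass.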

2. Removing stop words

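Stop words are noise words in a text that carry little meaning on their own. Common English stop words include is, am, are, this, a, an, and the. NLTK's corpus collection includes a stop word list, and stop words have to be filtered out of the token list explicitly.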

from nltk.corpus import stopwords

stop_words = stopwords.words("english")

word_tokens = nltk.tokenize.word_tokenize(text.strip())
filtered_sentence = [w for w in word_tokens if w not in stop_words]
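One detail worth noting is that NLTK's English stop word list is lowercase, so a capitalized token such as "The" is only matched if it is lowercased before the comparison. The sketch below illustrates this with a small hand-picked set, SAMPLE_STOP_WORDS, standing in for stopwords.words("english") so that it runs without the corpus; the set and the helper name are assumptions for illustration.

```python
# Assumption: a tiny stand-in for stopwords.words("english").
SAMPLE_STOP_WORDS = {"is", "am", "are", "this", "a", "an", "the", "and"}

def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Hypothetical helper: drop stop words, comparing case-insensitively."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "weather", "is", "great", "and", "city", "is", "awesome"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Holding the stop words in a set rather than a list also keeps each membership test constant-time, which helps on longer texts.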

III. Lexicon Normalization

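Lexicon normalization reduces the various derived forms of a word to a common base form. NLTK provides two approaches: stemming with the Porter stemmer, and lemmatization with WordNet.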

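Lemmatization uses context and part of speech to determine a word's inflected form, and returns the base form for that part of speech, known as the lemma. Stemming reduces a word to its stem by stripping affixes, so the result is not always a dictionary word.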

from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "flying"
print("Lemmatized Word:",lem.lemmatize(word,"v"))
print("Stemmed Word:",stem.stem(word))
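To see how stemming collapses related forms onto one stem, the sketch below runs PorterStemmer over a few words. The word list is chosen purely for illustration. Unlike WordNetLemmatizer, which needs the WordNet corpus to be downloaded, the Porter stemmer is rule-based and needs no extra data.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words: three derived forms that share one stem.
words = ["connected", "connecting", "connection"]
stems = [stemmer.stem(w) for w in words]
print(stems)

# A stem is produced by suffix rules, so it need not be a dictionary word.
print(stemmer.stem("flying"))
```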

IV. Part-of-speech tagging

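The main goal of part-of-speech (POS) tagging is to identify the grammatical group of a given word. POS tagging looks at the relationships within a sentence and assigns each word a corresponding tag.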

sent = "Albert Einstein was born in Ulm, Germany in 1879."
tokens=nltk.word_tokenize(sent)
print(nltk.pos_tag(tokens))
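nltk.pos_tag returns a list of (token, tag) pairs, using Penn Treebank tags by default. A small sketch of post-processing such output follows. The tagged list here is a hand-written sample in that format rather than captured tagger output, and the helper name is hypothetical.

```python
# Hand-written sample in the (token, tag) format; not captured tagger output.
tagged = [("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD"),
          ("born", "VBN"), ("in", "IN"), ("Ulm", "NNP")]

def tokens_with_tag_prefix(tagged_tokens, prefix):
    """Hypothetical helper: keep tokens whose tag starts with the given prefix."""
    return [tok for tok, tag in tagged_tokens if tag.startswith(prefix)]

nouns = tokens_with_tag_prefix(tagged, "NN")  # nouns, including proper nouns
verbs = tokens_with_tag_prefix(tagged, "VB")  # verbs in any form
print(nouns)
print(verbs)
```

Matching on a tag prefix works because Penn Treebank tags group related categories under a shared prefix, for example NN, NNS, NNP, and NNPS for nouns.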

 

V. Classification

 

