NLTK Reading Notes and Practical Problem Log

Date: 2019-11-11



Python version: 3.4.2

1. The example in the book is:

from nltk.corpus import wordnet as wn

wn.synset('car.n.01').lemma_names # get the lemma names of the synset

wn.synset('car.n.01').definition # get the definition

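Running this under 3.4.2 prints:

<bound method Synset.lemma_names of Synset('car.n.01')> and

<bound method Synset.definition of Synset('car.n.01')>

This is most likely a version issue: the book's examples treat lemma_names and definition as attributes, while the installed NLTK exposes them as methods. Adding () after each expression fixes it:

wn.synset('car.n.01').lemma_names()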

wn.synset('car.n.01').definition()
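To make the difference concrete, here is a minimal sketch that does not need NLTK at all. FakeSynset is a hypothetical stand-in (not an NLTK class), and value_of is an illustrative helper that calls its argument when it is callable and returns it unchanged otherwise, which is one possible way to write code that tolerates both styles.

```python
class FakeSynset:
    """Hypothetical stand-in for a synset; not part of NLTK."""

    def definition(self):
        return "a motor vehicle with four wheels"


def value_of(attribute):
    """Return attribute() if it is callable, otherwise the attribute itself."""
    return attribute() if callable(attribute) else attribute


synset = FakeSynset()
print(synset.definition)            # a bound method object, not the text
print(synset.definition())          # the text itself
print(value_of(synset.definition))  # same text, obtained via the helper
print(value_of("plain attribute"))  # non-callables pass through unchanged
```

Printing the method without parentheses shows the same kind of "bound method" line as above, because the function was referenced but never called.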

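2. The book uses from urllib import urlopen, which fails with: ImportError: cannot import name 'urlopen'. The actual cause is that the standard library is laid out differently in Python 3 than in Python 2; here the import should be:

from urllib.request import urlopen

While on the subject, a note on the difference between from ... import ... and import: with a plain import, you must write the full module path every time you use something from the module; with from ... import, you use the imported name directly (the from part simply says where the imported name comes from).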
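A small sketch of the two import styles, using urllib.request itself so that nothing is downloaded. The variable names full_path_name and short_name are only illustrative.

```python
import urllib.request
from urllib.request import urlopen

# Plain import: the function is reached through its full module path.
full_path_name = urllib.request.urlopen

# from ... import: the name is bound directly in the current namespace.
short_name = urlopen

# Both names refer to the same function object.
print(full_path_name is short_name)
print(urlopen.__module__)
```

Either style works; the from form is shorter to type, while the plain import keeps it obvious at each call site which module a name belongs to.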

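3. In Python IDLE, pressing Backspace often seems to delete only half a character, leaving a white box behind. Holding Alt while deleting removes one character at a time, and holding Ctrl removes one word at a time.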

4. Probably also because of Python 3, urlopen(url).read() returns bytes rather than str. Converting between str and bytes in Python is fairly simple: for bytes to str, use a.decode(encoding="utf-8"); for str to bytes, use a.encode(encoding="utf-8").
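The round trip can be sketched without any network access. The byte string below is made-up sample data standing in for what urlopen(url).read() would return, and UTF-8 is assumed as the encoding; a real page may use a different one.

```python
# Sample bytes standing in for a downloaded response body (assumption).
raw = b"caf\xc3\xa9"

text = raw.decode("utf-8")   # bytes -> str
back = text.encode("utf-8")  # str -> bytes

print(type(raw).__name__, type(text).__name__)
print(len(raw), len(text))   # byte count vs. character count
print(back == raw)
```

The lengths differ because the last character takes two bytes in UTF-8, which is why bytes and str cannot be treated as interchangeable.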

5. In natural language processing, the text must first be tokenized, separating punctuation from words, before any further processing.
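As a rough illustration of what separating punctuation from words means, here is a standard-library sketch. It is not NLTK's tokenizer (NLTK provides nltk.word_tokenize for real work); simple_tokenize is a hypothetical helper, and the regular expression is a deliberately naive assumption that splits contractions and ignores many edge cases.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one character
    # that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Hello, world!"))
```

Each punctuation mark becomes its own token, so later steps such as counting words do not treat "world!" and "world" as different items.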

6. The address of Crime and Punishment has changed: http://www.gutenberg.org/cache/epub/2554/pg2554.txt

7. Calling nltk.clean_html(htmltext) fails with: builtins.NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function. It turns out that NLTK no longer provides the clean_html and clean_url functions. The Beautiful Soup project can be used to process HTML instead.
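A minimal sketch of a replacement, assuming the goal is simply to drop the tags and keep the text. strip_html is a hypothetical helper name. It uses BeautifulSoup's get_text() when bs4 is installed; the fallback built on the standard library's html.parser is a swapped-in approximation for machines without bs4, not something Beautiful Soup or NLTK provides.

```python
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup
except ImportError:  # bs4 not installed; use the stdlib fallback below
    BeautifulSoup = None


class _TextCollector(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of an HTML string (hypothetical helper)."""
    if BeautifulSoup is not None:
        return BeautifulSoup(markup, "html.parser").get_text()
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)


print(strip_html("<p>Hello <b>world</b></p>"))
```

Neither path tries to reproduce the old clean_html behaviour exactly; for real pages, script and style contents usually need extra handling.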

8. Installation:

Run easy_install packageName from the command line (easy_install is a command, not a module to import), or:

curl http://www.crummy.com/software/BeautifulSoup/bs4/download/4.1/beautifulsoup4-4.1.2.tar.gz > beautifulsoup4-4.1.2.tar.gz

tar zxvf beautifulsoup4-4.1.2.tar.gz

cd beautifulsoup4-4.1.2

python setup.py install

9. Since BeautifulSoup 4, the package to import is bs4: what used to be import BeautifulSoup is now import bs4. For usage details, see:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

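10. Because there is no reliable way to detect where the content of a text starts and ends, the file has to be inspected by hand before extracting the content from the raw text, in order to find the specific strings that mark its beginning and end (using find and rfind, the latter searching from the end).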
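The find/rfind step might look like the sketch below. extract_between is a hypothetical helper, and the <<BEGIN>> and <<END>> markers and the sample text are invented for illustration; for a real file the marker strings are whatever the manual inspection turns up.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the first start marker and the last end marker."""
    start = text.find(start_marker)   # first occurrence, searching forward
    end = text.rfind(end_marker)      # last occurrence, searching backward
    if start == -1 or end == -1 or end < start + len(start_marker):
        raise ValueError("markers not found in the expected order")
    return text[start + len(start_marker):end]


# Invented sample: header and license text surrounding the real content.
sample = "HEADER\n<<BEGIN>>\nPART I\nChapter text\n<<END>>\nLICENSE"
print(extract_between(sample, "<<BEGIN>>", "<<END>>").strip())
```

Using rfind for the end marker means that an earlier accidental occurrence of the same string inside the content does not cut the text short.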