NLTK: the regular-expression tokenizer (nltk.regexp_tokenize)

Date: 2019-12-11


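On p. 121 of 《Python自然语言处理》 (Natural Language Processing with Python) there is a snippet that uses NLTK's built-in regular-expression tokenizer, nltk.regexp_tokenize. The book's code is: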

import nltk

text = 'That U.S.A. poster-print ex-costs-ed $12.40 ... 8% ?  _'
pattern = r'''(?x)    # set flag to allow verbose regexps
    ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
   |\w+(-\w+)*        # words with optional internal hyphens
   |\$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   |\.\.\.            # ellipsis
   |(?:[.,;"'?():-_`])  # these are separate tokens; includes ], [
'''
# Tokenize the sample text with the pattern above
print(nltk.regexp_tokenize(text, pattern))

The "8%" and "_" at the end of the text variable are my own additions.

The expected output is:

['That', 'U.S.A.', 'poster-print', 'ex-costs-ed', '$12.40', '...', '8%', '?', '_']

But the actual output is:

[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '-ed', ''), ('', '', '.40'), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
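
The tuples look like what Python's own `re.findall` returns when a pattern contains capturing groups: one tuple per match, holding only the text each group captured. The sketch below reproduces this with the standard `re` module alone. It assumes, as far as we can tell from NLTK 3.1 and later, that `regexp_tokenize` ends up calling `findall` on the compiled pattern; the shortened patterns and the names `sample`, `capturing` and `non_capturing` are ours, for illustration only.

```python
import re

sample = "U.S.A. costs $12.40"

# Capturing groups: findall returns one tuple per match, holding only
# what each group captured (an empty string if the group did not take part).
capturing = r"([A-Z]\.)+|\$?\d+(\.\d+)?"

# Same pattern with (?:...) groups: findall returns the whole match.
non_capturing = r"(?:[A-Z]\.)+|\$?\d+(?:\.\d+)?"

with_groups = re.findall(capturing, sample)
without_groups = re.findall(non_capturing, sample)

print(with_groups)     # [('A.', ''), ('', '.40')]
print(without_groups)  # ['U.S.A.', '$12.40']
```

Note that a repeated group such as `([A-Z]\.)+` keeps only its last repetition, which is why `U.S.A.` shows up as `'A.'`.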

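This happens because nltk.internals.compile_regexp_to_noncapturing() was abandoned in NLTK 3.1 (it still works in earlier versions), so the capturing groups in the pattern are no longer converted to non-capturing ones for us. We therefore modify the pattern slightly, writing every group as a non-capturing (?:...) group (reference: http://www.javashuo.com/article/p-mzwaeppw-dz.html):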

pattern = r'''(?x)    # set flag to allow verbose regexps
    (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
   |\w+(?:-\w+)*        # words with optional internal hyphens
   |\$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   |\.\.\.            # ellipsis
   |(?:[.,;"'?():-_`])  # these are separate tokens; includes ], [
'''

The actual output is now:

['That', 'U.S.A.', 'poster-print', 'ex-costs-ed', '$12.40', '...', '8', '?', '_']

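We notice that '8' should have come out as '8%'. It turns out that removing the '*' on the third line (the word alternative), or swapping the third and fourth lines so that the currency/percentage alternative comes before the word alternative, makes '8%' appear. The modified code, with the two alternatives swapped, is: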

pattern = r'''(?x)    # set flag to allow verbose regexps
    (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
   |\$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   |\w+(?:-\w+)*        # words with optional internal hyphens (moved below the number alternative)
   |\.\.\.            # ellipsis
   |(?:[.,;"'?():-_`])  # these are separate tokens; includes ], [
'''

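With this ordering the output is as expected. The '*' itself is not the real culprit, though. The alternatives of a regular expression are tried from left to right, and the first one that matches at the current position wins. Because \w also matches digits, the word alternative \w+(?:-\w+)* consumes the '8' before the currency/percentage alternative is ever tried, and the leftover '%' matches none of the alternatives, so it is silently dropped. Placing the number alternative ahead of the word alternative fixes this. Removing the '*' only appears to help: the word alternative then requires a hyphenated part, so it no longer grabs the '8', but it also stops matching plain, unhyphenated words such as 'That' as whole tokens.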
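
The effect of the order of the alternatives can be checked with the standard `re` module alone. The following is a minimal sketch of ours, not code from the book: the shortened sample string and the names `word`, `number` and `word_no_star` are illustrative, and only the two relevant alternatives are kept.

```python
import re

sample = "costs $12.40 or 8%"

word = r"\w+(?:-\w+)*"           # words with optional internal hyphens
number = r"\$?\d+(?:\.\d+)?%?"   # currency and percentages
word_no_star = r"\w+(?:-\w+)"    # the hyphenated part is now mandatory

# Alternatives are tried left to right at each position.
word_first = re.findall(f"{word}|{number}", sample)
number_first = re.findall(f"{number}|{word}", sample)
no_star = re.findall(f"{word_no_star}|{number}", sample)

print(word_first)    # ['costs', '$12.40', 'or', '8']
print(number_first)  # ['costs', '$12.40', 'or', '8%']
print(no_star)       # ['$12.40', '8%']
```

In the word-first order the `8` is taken by `\w+` and the `%` is left unmatched. Without the `*`, `8%` survives, but the unhyphenated words `costs` and `or` disappear, so swapping the two alternatives is the safer of the two changes.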