English Stemming Algorithms - Lovins, Porter

Date 2019-12-04


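English stemming can be done in many ways; in practice it may involve machine learning, data mining and other fields.
This article covers a few of the original algorithms that are easy to implement: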

Lovins (1968)
Porter (1980)
Porter2 (2000)

1. Lovins

Lovins is the earliest of these implementations.

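1.1. Introduction

The algorithm involves the following components:

ending: a word suffix; there are 294 in total, listed in full at the end
condition: the condition under which an ending may be removed; each ending has exactly one condition, and there are 29 in total, listed in full at the end
transformation: a rule for rewriting the end of the remaining stem; there are 35 in total, listed in full at the end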

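The algorithm has two steps:

Scan the ending list from longest to shortest and pick the first ending that matches the word and satisfies its condition
Apply a transformation to the remaining stem so that its end takes the appropriate form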

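1.2. Example

Step 1

Take the word nationally. Scanning the ending list from longest to shortest, the first match is ationally B in group .09.,
whose condition is B Minimum stem length = 3: the part left after removing the ending must be at least 3 letters long.
Removing ationally from nationally leaves only n, which does not satisfy the condition.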

Scanning continues and finds ionally A in group .07., whose condition is A No restrictions on stem.
So ionally is selected as the ending.

Step 2

The stem of nationally is nat. No transformation matches it, so it is output unchanged.
As another example, for the word sitting step 1 yields the stem sitt; step 2 then applies the first transformation, and the final output is sit.
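
The two steps can be sketched in Python roughly as follows. This is a minimal sketch rather than a full Lovins stemmer: it assumes a hand-picked subset of the endings in Appendix A, implements only conditions A, B and N and transformation 1, and the names `ENDINGS`, `CONDITIONS`, `transform` and `lovins_stem` are made up for illustration. Reading `s**` in condition N as "an s followed by two more letters at the end of the stem" is also an assumption, as is running the transformation on the whole word when no ending qualifies.

```python
# Hypothetical subset of Appendix A: ending -> condition code.
ENDINGS = {
    "ationally": "B",
    "ionally": "A",
    "ational": "B",
    "ation": "B",
    "ing": "N",
    "ly": "B",
}


def cond_n(stem: str) -> bool:
    # N: minimum stem length = 4 after s**, elsewhere = 3
    # (assumed reading: the stem ends in "s" plus two more letters).
    if len(stem) >= 3 and stem[-3] == "s":
        return len(stem) >= 4
    return len(stem) >= 3


# Only the conditions needed by the subset above.
CONDITIONS = {
    "A": lambda stem: True,            # no restrictions on stem
    "B": lambda stem: len(stem) >= 3,  # minimum stem length = 3
    "N": cond_n,
}

DOUBLES = "bdglmnprst"


def transform(stem: str) -> str:
    # Transformation 1 only: remove one of a double b, d, g, l, m, n, p, r, s, t.
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in DOUBLES:
        return stem[:-1]
    return stem


def lovins_stem(word: str) -> str:
    # Step 1: scan endings from longest to shortest, take the first
    # one whose condition accepts the remaining stem.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            stem = word[: -len(ending)]
            if CONDITIONS[ENDINGS[ending]](stem):
                # Step 2: apply the transformation to the stem.
                return transform(stem)
    # No ending qualified (assumption: still run the transformation).
    return transform(word)


print(lovins_stem("nationally"))  # nat
print(lovins_stem("sitting"))     # sit
```

Keeping the endings in a dictionary and sorting by length keeps the scan order explicit; a real implementation would more likely group the endings by length up front.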

1.Appendix.A List of endings

.11.
alistically B   arizability A   izationally B

.10.
antialness A    arisations A    arizations A    entialness A

.09.
allically C     antaneous A     antiality A     arisation A
arization A     ationally B     ativeness A     eableness E
entations A     entiality A     entialize A     entiation A
ionalness A     istically A     itousness A     izability A
izational A

.08.
ableness A      arizable A      entation A      entially A
eousness A      ibleness A      icalness A      ionalism A
ionality A      ionalize A      iousness A      izations A
lessness A

.07.
ability A       aically A       alistic B       alities A
ariness E       aristic A       arizing A       ateness A
atingly A       ational B       atively A       ativism A
elihood E       encible A       entally A       entials A
entiate A       entness A       fulness A       ibility A
icalism A       icalist A       icality A       icalize A
ication G       icianry A       ination A       ingness A
ionally A       isation A       ishness A       istical A
iteness A       iveness A       ivistic A       ivities A
ization F       izement A       oidally A       ousness A

.06.
aceous A        acious B        action G        alness A
ancial A        ancies A        ancing B        ariser A
arized A        arizer A        atable A        ations B
atives A        eature Z        efully A        encies A
encing A        ential A        enting C        entist A
eously A        ialist A        iality A        ialize A
ically A        icance A        icians A        icists A
ifully A        ionals A        ionate D        ioning A
ionist A        iously A        istics A        izable E
lessly A        nesses A        oidism A

.05.
acies A         acity A         aging B         aical A
alist A         alism B         ality A         alize A
allic BB        anced B         ances B         antic C
arial A         aries A         arily A         arity B
arize A         aroid A         ately A         ating I
ation B         ative A         ators A         atory A
ature E         early Y         ehood A         eless A
elity A         ement A         enced A         ences A
eness E         ening E         ental A         ented C
ently A         fully A         ially A         icant A
ician A         icide A         icism A         icist A
icity A         idine I         iedly A         ihood A
inate A         iness A         ingly B         inism J
inity CC        ional A         ioned A         ished A
istic A         ities A         itous A         ively A
ivity A         izers F         izing F         oidal A
oides A         otide A         ously A

.04.
able A          ably A          ages B          ally B
ance B          ancy B          ants B          aric A
arly K          ated I          ates A          atic B
ator A          ealy Y          edly E          eful A
eity A          ence A          ency A          ened E
enly E          eous A          hood A          ials A
ians A          ible A          ibly A          ical A
ides L          iers A          iful A          ines M
ings N          ions B          ious A          isms B
ists A          itic H          ized F          izer F
less A          lily A          ness A          ogen A
ward A          wise A          ying B          yish A

.03.
acy A           age B           aic A           als BB
ant B           ars O           ary F           ata A
ate A           eal Y           ear Y           ely E
ene E           ent C           ery E           ese A
ful A           ial A           ian A           ics A
ide L           ied A           ier A           ies P
ily A           ine M           ing N           ion Q
ish C           ism B           ist A           ite AA
ity A           ium A           ive A           ize F
oid A           one R           ous A

.02.
ae A            al BB           ar X            as B
ed E            en F            es E            ia A
ic A            is A            ly B            on S
or T            um U            us V            yl R
s' A            's A

.01.
a A             e A             i A             o A
s W             y B

1.Appendix.B List of conditions

A   No restrictions on stem
B   Minimum stem length = 3
C   Minimum stem length = 4
D   Minimum stem length = 5
E   Do not remove ending after e
F   Minimum stem length = 3 and do not remove ending after e
G   Minimum stem length = 3 and remove ending only after f
H   Remove ending only after t or ll
I   Do not remove ending after o or e
J   Do not remove ending after a or e
K   Minimum stem length = 3 and remove ending only after l, i or u*e
L   Do not remove ending after u, x or s, unless s follows o
M   Do not remove ending after a, c, e or m
N   Minimum stem length = 4 after s**, elsewhere = 3
O   Remove ending only after l or i
P   Do not remove ending after c
Q   Minimum stem length = 3 and do not remove ending after l or n
R   Remove ending only after n or r
S   Remove ending only after dr or t, unless t follows t
T   Remove ending only after s or t, unless t follows o
U   Remove ending only after l, m, n or r
V   Remove ending only after c
W   Do not remove ending after s or u
X   Remove ending only after l, i or u*e
Y   Remove ending only after in
Z   Do not remove ending after f
AA  Remove ending only after d, f, ph, th, l, er, or, es or t
BB  Minimum stem length = 3 and do not remove ending after met or ryst
CC  Remove ending only after l

1.Appendix.C List of transformations

1   remove one of double b, d, g, l, m, n, p, r, s, t
2   iev   ->   ief
3   uct   ->   uc
4   umpt  ->   um
5   rpt   ->   rb
6   urs   ->   ur
7   istr  ->   ister
7a  metr  ->   meter
8   olv   ->   olut
9   ul    ->   l except following a, o, i
10  bex   ->   bic
11  dex   ->   dic
12  pex   ->   pic
13  tex   ->   tic
14  ax    ->   ac
15  ex    ->   ec
16  ix    ->   ic
17  lux   ->   luc
18  uad   ->   uas
19  vad   ->   vas
20  cid   ->   cis
21  lid   ->   lis
22  erid  ->   eris
23  pand  ->   pans
24  end   ->   ens except following s
25  ond   ->   ons
26  lud   ->   lus
27  rud   ->   rus
28  her   ->   hes except following p, t
29  mit   ->   mis
30  ent   ->   ens except following m
31  ert   ->   ers
32  et    ->   es except following n
33  yt    ->   ys
34  yz    ->   ys

2. Porter

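2.1. Introduction

Vowels and consonants

The definitions of vowel and consonant differ slightly from the usual ones:

Vowel - A, E, I, O, U, and Y when it follows a consonant
Consonant - any letter other than A, E, I, O, U and other than a Y that follows a consonant (so a Y after a vowel is a consonant)

Grouping a word

A run of consecutive vowels is treated as a vowel group V and a run of consecutive consonants as a consonant group C, so any word can be written as alternating V and C groups, for example: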

segmentfault -> s/e/gm/e/ntf/au/lt -> CVCVCVC
porter -> p/o/rt/e/r -> CVCVC
application -> a/ppl/i/c/a/t/io/n -> VCVCVCVC
apple -> a/ppl/e -> VCV

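Taken together, every word can be written in terms of VC groups as: $$ [C](VC)^m[V] $$
Square brackets mark optional parts. The parameter m plays a role similar to the stem length in the Lovins conditions and is used in the later tests.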
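
The grouping and the value of m can be computed with a few lines of Python. This is a minimal sketch that assumes lowercase input; the function names `is_consonant`, `vc_form` and `measure` are illustrative and not part of any library.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: y is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y at the start of a word, or after a vowel, is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def vc_form(word: str) -> str:
    """Collapse runs of vowels/consonants into an alternating V/C string."""
    groups = []
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        if not groups or groups[-1] != kind:
            groups.append(kind)
    return "".join(groups)


def measure(word: str) -> int:
    """m in [C](VC)^m[V]: the number of VC pairs in the form."""
    return vc_form(word).count("VC")


for w in ("segmentfault", "porter", "application", "apple"):
    print(w, vc_form(w), measure(w))
# segmentfault CVCVCVC 3
# porter CVCVC 2
# application VCVCVCVC 4
# apple VCV 1
```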

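Rules

The Porter algorithm is built around rules of the form:

(condition) S1 -> S2

The condition is tested against the stem that remains after S1 is removed. Besides m, it can use other features: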

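m - the number of VC groups
* - stands for any characters; used together with a substring, v, d or o
uppercase letter - a literal substring, e.g. *S means the stem ends with S
v - a vowel, e.g. *v* means the stem contains a vowel
d - a double consonant (two identical consonants), e.g. *d means the stem ends with one
o - the sequence cvc, where the second c is not W, X or Y; *o means the stem ends this way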
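
As a rough illustration, the three pattern conditions above map onto small predicates over the stem. The names `contains_vowel`, `ends_double_consonant` and `ends_cvc` are hypothetical, and the sketch assumes lowercase input.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: y is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem: str) -> bool:
    """*v*: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem: str) -> bool:
    """*d: the stem ends with two identical consonants."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem: str) -> bool:
    """*o: the stem ends consonant-vowel-consonant, last one not w, x or y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


print(contains_vowel("hopp"), contains_vowel("tr"))               # True False
print(ends_double_consonant("hopp"), ends_double_consonant("hop"))  # True False
print(ends_cvc("hop"), ends_cvc("how"))                           # True False
```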

S1 is the suffix of the word, and S2 is the suffix that replaces it.

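Unlike Lovins, a word passes through several rules in sequence before the final stem is produced (Lovins produces its output in a single pass).
For example, hopping first matches the rule (*v*) ING ->, which gives hopp.
Then the rule (*d and not (*L or *S or *Z)) -> single letter turns hopp into hop.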
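
The hopping example corresponds to Step 1b in the appendix, which can be sketched as below. This is one possible reading of the rule table, assuming lowercase input; `step_1b` and `cleanup` are illustrative names, not an established API.

```python
def is_consonant(word: str, i: int) -> bool:
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """m: count the vowel-to-consonant transitions (VC groups)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m


def contains_vowel(stem: str) -> bool:
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem: str) -> bool:
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem: str) -> bool:
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def cleanup(stem: str) -> str:
    """Extra rules run after ED or ING has been removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]                      # -> single letter
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem


def step_1b(word: str) -> str:
    if word.endswith("eed"):                  # (m>0) EED -> EE
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):              # (*v*) ED ->, (*v*) ING ->
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return cleanup(stem) if contains_vowel(stem) else word
    return word


print(step_1b("hopping"))    # hop
print(step_1b("filing"))     # file
print(step_1b("conflated"))  # conflate
print(step_1b("sing"))       # sing
```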

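Procedure

The algorithm applies the rules from top to bottom. Some rules are special: when they fire, additional rules must be processed.
Because there are many rules, they are grouped into steps. The grouping is mainly a logical division (an implementation can also use it for optimization), and the algorithm simply runs the steps from first to last: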

do Step_1a
do Step_1b (if the second or third rule of Step 1b fires, do some extra work)
do Step_1c
do Step_2
do Step_3
do Step_4
do Step_5a
do Step_5b

The details of each step are given in the appendix.

2.2. Example

2.Appendix Step 1a

SSES  ->   SS
IES   ->   I
SS    ->   SS
S     ->
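
Step 1a has no conditions, so it reduces to a chain of suffix checks tried from longest to shortest. A minimal sketch, assuming lowercase input (`step_1a` is an illustrative name):

```python
def step_1a(word: str) -> str:
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES -> I
    if word.endswith("ss"):
        return word        # SS -> SS (blocks the plain S rule)
    if word.endswith("s"):
        return word[:-1]   # S ->
    return word


print(step_1a("caresses"))  # caress
print(step_1a("ponies"))    # poni
print(step_1a("caress"))    # caress
print(step_1a("cats"))      # cat
```

The SS -> SS rule changes nothing by itself; it exists so that words ending in ss are matched before the shorter S rule can strip a letter.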

2.Appendix Step 1b

(m>0) EED     ->   EE
(*v*) ED      ->
(*v*) ING     ->

If the second or third of the rules in Step 1b is successful, the following is done:

      AT      ->   ATE
      BL      ->   BLE
      IZ      ->   IZE
      (*d and not (*L or *S or *Z)) -> single letter
      (m=1 and *o)  ->   E

2.Appendix Step 1c

(*v*) Y       ->   I

2.Appendix Step 2

(m>0) ATIONAL ->   ATE
(m>0) TIONAL  ->   TION
(m>0) ENCI    ->   ENCE
(m>0) ANCI    ->   ANCE
(m>0) IZER    ->   IZE
(m>0) ABLI    ->   ABLE
(m>0) ALLI    ->   AL
(m>0) ENTLI   ->   ENT
(m>0) ELI     ->   E
(m>0) OUSLI   ->   OUS
(m>0) IZATION ->   IZE
(m>0) ATION   ->   ATE
(m>0) ATOR    ->   ATE
(m>0) ALISM   ->   AL
(m>0) IVENESS ->   IVE
(m>0) FULNESS ->   FUL
(m>0) OUSNESS ->   OUS
(m>0) ALITI   ->   AL
(m>0) IVITI   ->   IVE
(m>0) BILITI  ->   BLE
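
Since every rule in Step 2 shares the condition m>0, the step lends itself to a table-driven implementation. The sketch below assumes lowercase input and assumes that only the longest matching S1 is considered, so that when its condition fails no shorter suffix is tried; `STEP_2_RULES` and `step_2` are illustrative names.

```python
STEP_2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"),
    ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"),
    ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"),
    ("fulness", "ful"), ("ousness", "ous"), ("aliti", "al"),
    ("iviti", "ive"), ("biliti", "ble"),
]


def is_consonant(word: str, i: int) -> bool:
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """m: count the vowel-to-consonant transitions (VC groups)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m


def step_2(word: str) -> str:
    # Try longer suffixes first so ATIONAL wins over TIONAL,
    # and IZATION wins over ATION.
    for s1, s2 in sorted(STEP_2_RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(s1):
            stem = word[: -len(s1)]
            # (m>0) is tested on the stem left after removing S1.
            return stem + s2 if measure(stem) > 0 else word
    return word


print(step_2("relational"))   # relate
print(step_2("conditional"))  # condition
print(step_2("rational"))     # rational
print(step_2("digitizer"))    # digitize
```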

2.Appendix Step 3

(m>0) ICATE   ->   IC
(m>0) ATIVE   ->
(m>0) ALIZE   ->   AL
(m>0) ICITI   ->   IC
(m>0) ICAL    ->   IC
(m>0) FUL     ->
(m>0) NESS    ->

2.Appendix Step 4

(m>1) AL      ->
(m>1) ANCE    ->
(m>1) ENCE    ->
(m>1) ER      ->
(m>1) IC      ->
(m>1) ABLE    ->
(m>1) IBLE    ->
(m>1) ANT     ->
(m>1) EMENT   ->
(m>1) MENT    ->
(m>1) ENT     ->
(m>1 and (*S or *T)) ION   ->
(m>1) OU      ->
(m>1) ISM     ->
(m>1) ATE     ->
(m>1) ITI     ->
(m>1) OUS     ->
(m>1) IVE     ->
(m>1) IZE     ->

2.Appendix Step 5a

(m>1) E   ->
(m=1 and not *o) E   ->

2.Appendix Step 5b

(m > 1 and *d and *L)   ->   single letter
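
Steps 5a and 5b tidy up a trailing e and a trailing double l. A minimal sketch under the same assumptions as before (lowercase input, illustrative function names):

```python
def is_consonant(word: str, i: int) -> bool:
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """m: count the vowel-to-consonant transitions (VC groups)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m


def ends_cvc(stem: str) -> bool:
    """*o: ends consonant-vowel-consonant, last one not w, x or y."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def step_5a(word: str) -> str:
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        # (m>1) E ->   or   (m=1 and not *o) E ->
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word


def step_5b(word: str) -> str:
    # (m > 1 and *d and *L) -> single letter; S1 is empty here,
    # so the condition is tested on the whole word.
    if measure(word) > 1 and word.endswith("ll"):
        return word[:-1]
    return word


print(step_5a("probate"))   # probat
print(step_5a("rate"))      # rate
print(step_5a("cease"))     # ceas
print(step_5b("controll"))  # control
print(step_5b("roll"))      # roll
```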