What is Jcseg?
Jcseg is a lightweight Chinese word tokenizer based on the mmseg algorithm. It also integrates keyword extraction, keyphrase extraction, key-sentence extraction, and automatic document summarization; it ships a Jetty-based web server so that any language can call it directly over HTTP, and it provides tokenizer interfaces for the latest versions of Lucene, Solr, and Elasticsearch. Jcseg comes with a jcseg.properties file for quickly configuring a tokenizer suited to different scenarios, e.g. the maximum match length, whether to enable Chinese person-name recognition, whether to append pinyin, whether to append synonyms, and so on.
Six tokenizer modes:
+--------Jcseg chinese word tokenizer demo---------------+
|- @Author chenxin<chenxin619315@gmail.com> |
|- :seg_mode : switch to specified tokenizer mode. |
|- (:complex,:simple,:search,:detect,:delimiter,:NLP) |
|- :keywords : switch to keywords extract mode. |
|- :keyphrase : switch to keyphrase extract mode. |
|- :sentence : switch to sentence extract mode. |
|- :summary : switch to summary extract mode. |
|- :help : print this help menu. |
|- :quit : to exit the program. |
+--------------------------------------------------------+
jcseg~tokenizer:complex>>
歧义和同义词:研究生命起源,混合词:作B超检查身体,x射线本质是什么,今天去奇都ktv唱卡拉ok去,哆啦a梦是一个动漫中的主角,单位和全角: 2009年8月6日开始大学之旅,岳阳今天的气温为38.6℃,也就是101.48℉,中文数字/分数:你分三十分之二,小陈拿三十分之五,剩下的三十分之二十三所有是个人,那是一九九八年前的事了,四川麻辣烫很好吃,五四运动留下的五四精神。笔记本五折包邮亏本大甩卖。人名识别:我是陈鑫,也是jcseg的做者,三国时期的诸葛亮是个天才,咱们一块儿给刘翔加油,罗志高兴奋极了由于老吴送了他一台笔记本。外文名识别:冰岛时间7月1日,正在当地拍片的汤姆·克鲁斯经过发言人认可,他与第三任妻子凯蒂·赫尔墨斯(第一二任妻子分别为咪咪·罗杰斯、妮可·基德曼)的婚姻即将结束。配对标点:本次『畅想杯』黑客技术大赛的得主为电信09-2BF的张三,奖励C++程序设计语言一书和【畅想网络】的『PHP教程』一套。特殊字母:【Ⅰ】(Ⅱ),英文数字: bug report chenxin619315@gmail.com or visit http://code.google.com/p/jcseg, we all admire the hacker spirit!特殊数字: ① ⑩⑽㈩.
歧义/n和/o同义词/n :/w研究/vn琢磨/vn研讨/vn钻研/vn生命/n起源/n,/w混合词:/w作/v b超/n检查/vn身体/n,/w x射线/n x光线/n本质/n是/a什么/n,/w今天/t去/q奇都ktv/nz唱/n卡拉ok/nz去/q,/w哆啦a梦/nz是/a一个/q动漫/n中/q的/u主角/n,/w单位/n和/o全角/nz :/w 2009年/m 8月/m 6日/m开始/n大学/n之旅,/w岳阳/ns今天/t的/u气温/n为/u 38.6℃/m ,/w也就是/v 101.48℉/m ,/w中文/n国语/n数字/n //w分数/n :/w你/r分/h三十分之二/m ,/w小陈/nr拿/nh三十分之五/m ,/w剩下/v的/u三十分之二十三/m所有/a是/a个人/nt,/w那是/c一九九八年/m 1998年/m前/v的/u事/i了/i,/w四川/ns麻辣烫/n很/m好吃/v,/w五四运动/nz留下/v的/u五四/m 54/m精神/n。/w笔记本/n五折/m 5折/m包邮亏本/v大甩卖甩卖。/w人名/n识别/v :/w我/r是/a陈鑫/nr,/w也/e是/a jcseg/en的/u做者/n,/w三国/mq时期/n的/u诸葛亮/nr是个天才/n,/w咱们/r一块儿/d给/v刘翔/nr加油/v,/w罗志高/nr兴奋/v极了/u由于/c老吴/nr送了他/r一台笔记本/n。/w外文/n名/j识别/v:/w冰岛/ns时间/n 7月/m 1日/m,/w正在/u当地/s拍片/vi的/u汤姆·克鲁斯/nr阿汤哥/nr经过/v发言人/n认可/v,/w他/r与/u第三/m任/q妻子/n凯蒂·赫尔墨斯/nr(/w第一/a二/j任/q妻子/n分别为咪咪·罗杰斯/nr、/w妮可·基德曼/nr)/w的/u婚姻/n即将/d结束/v。/w配对/v标点/n :/w本次/r『/w畅想杯/nz』/w黑客/n技术/n大赛/vn的/u得主/n为/u电信/nt 09/en -/w bf/en 2bf/en的/u张三/nr,/w奖励/vn c++/en程序设计/gi语言/n一书/ns和/o【/w畅想网络/nz】/w的/u『/w PHP教程/nz』/w一套/m。/w特殊/a字母/n :/w【/wⅠ/nz】/w(/wⅡ/m)/w,/w英文/n英语/n数字/n :/w bug/en report/en chenxin/en 619315/en gmail/en com/en chenxin619315@gmail.com/en or/en visit/en http/en :/w //w //w code/en google/en com/en code.google.com/en //w p/en //w jcseg/en ,/w we/en all/en admire/en appreciate/en like/en love/en enjoy/en the/en hacker/en spirit/en mind/en !/w特殊/a数字/n :/w ①/m ⑩/m⑽/m㈩/m ./w
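Jcseg's modes build on dictionary-based matching (mmseg). As a rough illustration of the underlying idea — not Jcseg's actual implementation — here is a minimal forward maximum-matching segmenter over a toy dictionary; mmseg's complex mode adds ambiguity-resolution rules on top of this:

```java
import java.util.*;

// Minimal forward maximum-matching sketch with a toy dictionary.
// NOT Jcseg's real mmseg implementation, which also applies
// chunk-based ambiguity-resolution rules.
public class MaxMatchDemo {
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
        "研究", "研究生", "生命", "起源"));
    static final int MAX_LEN = 3; // maximum match length in characters

    public static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + MAX_LEN, text.length());
            String token = text.substring(i, i + 1); // fall back to a single char
            for (int j = end; j > i + 1; j--) {      // try the longest match first
                String cand = text.substring(i, j);
                if (DICT.contains(cand)) { token = cand; break; }
            }
            tokens.add(token);
            i += token.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(segment("研究生命起源")); // → [研究生, 命, 起源]
    }
}
```

Note that plain forward maximum matching segments 研究生命起源 as 研究生/命/起源; resolving such ambiguities to 研究/生命/起源 is exactly what mmseg's chunk-filtering rules (used by Jcseg's complex mode) are for.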
Jcseg has been published to the Maven repository since version 1.9.8:
<dependency>
<groupId>org.lionsoul</groupId>
<artifactId>jcseg-core</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.lionsoul</groupId>
<artifactId>jcseg-analyzer</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.lionsoul</groupId>
<artifactId>jcseg-elasticsearch</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.lionsoul</groupId>
<artifactId>jcseg-server</artifactId>
<version>2.2.0</version>
</dependency>
//lucene 5.x
//Analyzer analyzer = new JcsegAnalyzer5X(JcsegTaskConfig.COMPLEX_MODE);
//available constructor: since 1.9.8
//1, JcsegAnalyzer5X(int mode)
//2, JcsegAnalyzer5X(int mode, String proFile)
//3, JcsegAnalyzer5X(int mode, JcsegTaskConfig config)
//4, JcsegAnalyzer5X(int mode, JcsegTaskConfig config, ADictionary dic)
//lucene 4.x
//Analyzer analyzer = new JcsegAnalyzer4X(JcsegTaskConfig.COMPLEX_MODE);
//lucene 6.3.0 and above
Analyzer analyzer = new JcsegAnalyzer(JcsegTaskConfig.COMPLEX_MODE);
//available constructor:
//1, JcsegAnalyzer(int mode)
//2, JcsegAnalyzer(int mode, String proFile)
//3, JcsegAnalyzer(int mode, JcsegTaskConfig config)
//4, JcsegAnalyzer(int mode, JcsegTaskConfig config, ADictionary dic)
//optional (to change the default settings): get the tokenizer task config instance
JcsegAnalyzer jcseg = (JcsegAnalyzer) analyzer;
JcsegTaskConfig config = jcseg.getTaskConfig();
//append synonyms; requires jcseg.loadsyn=1 in jcseg.properties
config.setAppendCJKSyn(true);
//append pinyin; requires jcseg.loadpinyin=1 in jcseg.properties
config.setAppendCJKPinyin(true);
//for more settings, see org.lionsoul.jcseg.tokenizer.core.JcsegTaskConfig
<!-- complex mode: -->
<fieldtype name="textComplex" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="complex"/>
</analyzer>
</fieldtype>
<!-- simple mode: -->
<fieldtype name="textSimple" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="simple"/>
</analyzer>
</fieldtype>
<!-- detect mode: -->
<fieldtype name="textDetect" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="detect"/>
</analyzer>
</fieldtype>
<!-- search mode: -->
<fieldtype name="textSearch" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="search"/>
</analyzer>
</fieldtype>
<!-- NLP mode: -->
<fieldtype name="textNLP" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="nlp"/>
</analyzer>
</fieldtype>
<!-- delimiter (whitespace) mode: -->
<fieldtype name="textDelimiter" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="delimiter"/>
</analyzer>
</fieldtype>
Notes:
Available analyzer names:
jcseg: Jcseg's search-mode segmentation algorithm
jcseg_complex: Jcseg's complex-mode segmentation algorithm
jcseg_simple: Jcseg's simple-mode segmentation algorithm
jcseg_detect: Jcseg's detect-mode segmentation algorithm
jcseg_search: Jcseg's search-mode segmentation algorithm
jcseg_nlp: Jcseg's NLP-mode segmentation algorithm
jcseg_delimiter: Jcseg's delimiter-mode segmentation algorithm
Test URL:
http://localhost:9200/_analyze?analyzer=jcseg_search&text=一百美圆等于多少人民币
Corresponding test results:
GET _analyze?pretty
{
"analyzer": "jcseg_complex",
"text": "中达广场浦发银行信用卡中心"
}
{
"tokens": [
{
"token": "中",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "达",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "广场",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "浦",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "发",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 4
},
{
"token": "银行",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "信用卡",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 6
},
{
"token": "中心",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 7
}
]
}
GET _analyze?pretty
{
"analyzer": "jcseg_simple",
"text": "中达广场浦发银行信用卡中心"
}
{
"tokens": [
{
"token": "中",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "达",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "广场",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "浦",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "发",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 4
},
{
"token": "银行",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "信用卡",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 6
},
{
"token": "中心",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 7
}
]
}
GET _analyze?pretty
{
"analyzer": "jcseg_detect",
"text": "中达广场浦发银行信用卡中心"
}
{
"tokens": [
{
"token": "中",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "达",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "广场",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "浦",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "发",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 4
},
{
"token": "银行",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "信用卡",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 6
},
{
"token": "中心",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 7
}
]
}
GET _analyze?pretty
{
"analyzer": "jcseg_search",
"text": "中达广场浦发银行信用卡中心"
}
{
"tokens": [
{
"token": "中",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "达",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "广",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 2
},
{
"token": "广场",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 3
},
{
"token": "场",
"start_offset": 3,
"end_offset": 4,
"type": "word",
"position": 4
},
{
"token": "浦",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 5
},
{
"token": "发",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 6
},
{
"token": "银",
"start_offset": 6,
"end_offset": 7,
"type": "word",
"position": 7
},
{
"token": "银行",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 8
},
{
"token": "行",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 9
},
{
"token": "信",
"start_offset": 8,
"end_offset": 9,
"type": "word",
"position": 10
},
{
"token": "信用",
"start_offset": 8,
"end_offset": 10,
"type": "word",
"position": 11
},
{
"token": "信用卡",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 12
},
{
"token": "用",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 13
},
{
"token": "卡",
"start_offset": 10,
"end_offset": 11,
"type": "word",
"position": 14
},
{
"token": "中",
"start_offset": 11,
"end_offset": 12,
"type": "word",
"position": 15
},
{
"token": "中心",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 16
},
{
"token": "心",
"start_offset": 12,
"end_offset": 13,
"type": "word",
"position": 17
}
]
}
GET _analyze?pretty
{
"analyzer": "jcseg_nlp",
"text": "中达广场浦发银行信用卡中心"
}
{
"tokens": [
{
"token": "中",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "达",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "广场",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "浦",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "发",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 4
},
{
"token": "银行",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "信用卡",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 6
},
{
"token": "中心",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 7
}
]
}
GET _analyze?pretty
{
"analyzer": "jcseg_delimiter",
"text": "中达广场浦发银行信用卡中心"
}
{
"tokens": [
{
"token": "中达广场浦发银行信用卡中心",
"start_offset": 0,
"end_offset": 13,
"type": "word",
"position": 0
}
]
}
You can also run the Elasticsearch distribution that bundles jcseg — elasticsearch-jcseg — which works out of the box.
The jcseg-server module embeds Jetty to implement a high-performance server, exposes all of Jcseg's APIs through RESTful interfaces, and standardizes the JSON output format of the API results, so any language can call it with a plain HTTP client.
# pass the path to the jcseg-server.properties configuration file as the last argument
java -jar jcseg-server-{version}.jar ./jcseg-server.properties
The configuration file is documented inline; read it through a few times and it will make sense:
# jcseg server configuration file with standard json syntax
{
# jcseg server configuration
"server_config": {
# server port
"port": 1990,
# default communication charset
"charset": "utf-8",
# http idle timeout in ms
"http_connection_idle_timeout": 60000,
# jetty maximum thread pool size
"max_thread_pool_size": 200,
# thread idle timeout in ms
"thread_idle_timeout": 30000,
# http output buffer size
"http_output_buffer_size": 32768,
# request header size
"http_request_header_size": 8192,
# response header size
"http_response_header_size": 8192
},
# global setting for jcseg, yet another copy of the old
# configuration file jcseg.properties
"jcseg_global_config": {
# maximum match length. (5-7)
"jcseg_maxlen": 7,
# recognize Chinese names.
# (true to open it and false to close it)
"jcseg_icnname": true,
# maximum length for pair punctuation text.
# set it to 0 to close this function
"jcseg_pptmaxlen": 7,
# maximum length for chinese last name andron.
"jcseg_cnmaxlnadron": 1,
# Whether to clear the stopwords.
# (set true to clear stopwords and false to close it)
"jcseg_clearstopword": false,
# Whether to convert Chinese numerics to Arabic numbers.
# (set to true to open it and false to close it), e.g. '\u4E09\u4E07' to 30000.
"jcseg_cnnumtoarabic": true,
# Whether to convert the chinese fraction to arabic fraction.
# @Note: close it for lucene, solr, elasticsearch, etc.
"jcseg_cnfratoarabic": false,
# Whether to keep the unrecognized word.
# (set true to keep unrecognized word and false to clear it)
"jcseg_keepunregword": true,
# Whether to start the secondary segmentation for the complex english words.
"jcseg_ensencondseg": true,
# min length of the secondary simple token.
# (better larger than 1)
"jcseg_stokenminlen": 2,
# threshold for Chinese name recognition.
# better not change it unless you know what you are doing.
"jcseg_nsthreshold": 1000000,
# The punctuations that will be kept in a token
# (not at the end of the token).
"jcseg_keeppunctuations": "@#%.&+"
},
# dictionary instance setting.
# add yours here with standard json syntax
"jcseg_dict": {
"master": {
"path": [
"{jar.dir}/lexicon"
# absolute path here
#"/java/JavaSE/jcseg/lexicon"
],
# Whether to load the part of speech of the words
"loadpos": true,
# Whether to load the pinyin of the words.
"loadpinyin": true,
# Whether to load the synonyms of the words.
"loadsyn": true,
# whether to load the entity of the words.
"loadentity": true,
# Whether to auto-reload modified lexicon files.
"autoload": true,
# Poll time for auto load. (in seconds)
"polltime": 300
}
# add more of yours here
# ,"name" : {
# "path": [
# "absolute jcseg standard lexicon path 1",
# "absolute jcseg standard lexicon path 2"
# ...
# ],
# "autoload": 0,
# "polltime": 300
# }
},
# JcsegTaskConfig instance setting.
# @Note:
# all the config instances here extend the global_setting above;
# leave one empty to inherit all the settings from global_setting
"jcseg_config": {
"master": {
# extend and override the global setting
"jcseg_pptmaxlen": 0,
"jcseg_cnfratoarabic": true,
"jcseg_keepunregword": false
}
# this one is for keywords, keyphrase, sentence and summary extraction
# @Note: do not delete this instance if you want jcseg to
# offer you the extractor service
,"extractor": {
"jcseg_pptmaxlen": 0,
"jcseg_clearstopword": true,
"jcseg_cnnumtoarabic": false,
"jcseg_cnfratoarabic": false,
"jcseg_keepunregword": false,
"jcseg_ensencondseg": false
}
# well, this one is for NLP only
,"nlp" : {
"jcseg_ensencondseg": false,
"jcseg_cnfratoarabic": true,
"jcseg_cnnumtoarabic": true
}
# add more of yours here
# ,"name": {
# ...
# }
},
# jcseg tokenizer instance setting.
# You can have an instance serve you by accessing:
# http://jcseg_server_host:port/tokenizer/instance_name
# where instance_name is the name of the instance you define here.
"jcseg_tokenizer": {
"master": {
# jcseg tokenizer algorithm, could be:
# 1: SIMPLE_MODE
# 2: COMPLEX_MODE
# 3: DETECT_MODE
# 4: SEARCH_MODE
# 5: DELIMITER_MODE
# 6: NLP_MODE
# see org.lionsoul.jcseg.tokenizer.core.JcsegTaskConfig for more info
"algorithm": 2,
# dictionary instance name
# choose one of your defines above in the dict scope
"dict": "master",
# JcsegTaskConfig instance name
# choose one of your defines above in the config scope
"config": "master"
}
# this tokenizer instance is for the extractor service
# do not delete it if you want jcseg to offer you the extractor service
,"extractor": {
"algorithm": 2,
"dict": "master",
"config": "extractor"
}
# this tokenizer instance is for NLP analysis
# keep it for your NLP project
,"nlp" : {
"algorithm": 6,
"dict": "master",
"config": "nlp"
}
# add more of yours here
# ,"name": {
# ...
# }
}
}
API endpoint: http://jcseg_server_host:port/extractor/keywords?text=&number=&autoFilter=true|false
API parameters:
text: the document text, via POST or GET
number: the number of keywords to extract
autoFilter: whether to automatically filter out low-score keywords
API response:
{
//API error code: 0 = OK, 1 = invalid arguments, -1 = internal error
"code": 0,
//response data
"data": {
//keywords array
"keywords": [],
//time taken by the operation
"took": 0.001
}
}
For more settings, see: org.lionsoul.jcseg.server.controller.KeywordsController
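Calling the endpoint from a client only requires URL-encoding the text and issuing an HTTP GET. A small sketch of building the request URL (host and port are placeholders for your own jcseg-server instance; the actual HTTP call is omitted):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Build the keywords-extractor request URL for a jcseg-server instance.
// Host/port are placeholders; only URL construction is shown here.
public class KeywordsRequest {
    public static String buildUrl(String host, int port,
                                  String text, int number, boolean autoFilter) {
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8);
        return "http://" + host + ":" + port
            + "/extractor/keywords?text=" + encoded
            + "&number=" + number
            + "&autoFilter=" + autoFilter;
    }

    public static void main(String[] args) {
        // e.g. a server started with the default port 1990 from the config above
        System.out.println(buildUrl("localhost", 1990, "中文分词", 10, true));
    }
}
```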
API endpoint: http://jcseg_server_host:port/extractor/keyphrase?text=&number=
API parameters:
text: the document text, via POST or GET
number: the number of keyphrases to extract
API response:
{
"code": 0,
"data": {
"took": 0.0277,
//keyphrase array
"keyphrase": []
}
}
For more settings, see: org.lionsoul.jcseg.server.controller.KeyphraseController
API endpoint: http://jcseg_server_host:port/extractor/sentence?text=&number=
API parameters:
text: the document text, via POST or GET
number: the number of key sentences to extract
API response:
{
"code": 0,
"data": {
"took": 0.0277,
//key sentence array
"sentence": []
}
}
For more settings, see: org.lionsoul.jcseg.server.controller.SentenceController
API endpoint: http://jcseg_server_host:port/extractor/summary?text=&length=
API parameters:
text: the document text, via POST or GET
length: the length of the summary to extract
API response:
{
"code": 0,
"data": {
"took": 0.0277,
//document summary
"summary": ""
}
}
For more settings, see: org.lionsoul.jcseg.server.controller.SummaryController
API endpoint: http://jcseg_server_host:port/tokenizer/tokenizer_instance?text=&ret_pinyin=&ret_pos=...
API parameters:
tokenizer_instance: the name of a tokenizer instance defined in jcseg-server.properties
text: the document text, via POST or GET
ret_pinyin: whether to return the pinyin of each token (removed since version 2.0.1)
ret_pos: whether to return the part of speech of each token (removed since version 2.0.1)
API response:
{
"code": 0,
"data": {
"took": 0.00885,
//array of token objects
"list": [
{
word: "哆啦a梦", //token text
position: 0, //index of the token in the source text
length: 4, //token length in characters (not bytes)
pinyin: "duo la a meng", //pinyin of the token
pos: "nz", //part-of-speech tag of the token
entity: null //entity tag of the token
}
]
}
}
For more settings, see: org.lionsoul.jcseg.server.controller.TokenizerController
jcseg.properties lookup procedure:
By default, you can put a copy of jcseg.properties in the same directory as jcseg-core-{version}.jar to customize the configuration.
The JcsegTaskConfig constructors:
JcsegTaskConfig(); //initialize without any configuration file lookup
JcsegTaskConfig(boolean autoLoad); //autoLoad=true automatically finds a configuration file to initialize from
JcsegTaskConfig(java.lang.String proFile); //initialize the config object from the given configuration file
JcsegTaskConfig(InputStream is); //initialize the config object from the given input stream
Demo code:
//create a JcsegTaskConfig with default settings, without any configuration file lookup
JcsegTaskConfig config = new JcsegTaskConfig();
//this constructor follows the "jcseg.properties lookup procedure" above to find and load jcseg.properties:
JcsegTaskConfig config = new JcsegTaskConfig(true);
//create and initialize a JcsegTaskConfig from the given jcseg.properties file
JcsegTaskConfig config = new JcsegTaskConfig("absolute or relative jcseg.properties path");
//call JcsegTaskConfig#load(String proFile) to load settings from the given configuration file
config.load("absolute or relative jcseg.properties path");
The ADictionary constructor:
ADictionary(JcsegTaskConfig config, java.lang.Boolean sync)
//config: the JcsegTaskConfig instance described above
//sync: whether to create a thread-safe dictionary; pass true if you need to modify the dictionary at runtime.
// if autoload=1 is set in jcseg.properties, a synchronized dictionary is created automatically
Demo code:
//Jcseg provides org.lionsoul.jcseg.tokenizer.core.DictionaryFactory for easy dictionary creation and future compatibility.
//usually a dictionary is created, and its lexicon files loaded, through either of:
// DictionaryFactory#createDefaultDictionary(JcsegTaskConfig)
// DictionaryFactory.createSingletonDictionary(JcsegTaskConfig)
//createSingletonDictionary is recommended, for a singleton dictionary.
//config is the JcsegTaskConfig object created above.
//if the lexicon path inside the given JcsegTaskConfig is correct,
//the ADictionary loads all valid lexicons according to the config;
//the factory also uses config.isAutoload() to decide the synchronization of the dictionary:
//true creates a synchronized dictionary, false a non-synchronized one;
//config.isAutoload() corresponds to lexicon.autoload in jcseg.properties;
//if config.getLexiconPath() == null, DictionaryFactory auto-loads the lexicons from the classpath;
//to prevent it from auto-loading the bundled lexicons,
//call DictionaryFactory.createSingletonDictionary(config, false) to create the ADictionary instead;
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
//create a non-synchronized ADictionary that loads lexicons from config.lexPath.
ADictionary dic = DictionaryFactory.createDefaultDictionary(config, false);
//create a synchronized ADictionary that loads lexicons from config.lexPath.
ADictionary dic = DictionaryFactory.createDefaultDictionary(config, true);
//decide the synchronization from config.isAutoload(); load lexicons from config.lexPath by default
ADictionary dic = DictionaryFactory.createDefaultDictionary(config, config.isAutoload());
//make the ADictionary load all lexicon files under the given directories.
//config.getLexiconPath() is the array of valid lexicon directories.
for ( String path : config.getLexiconPath() ) {
dic.loadDirectory(path);
}
//make the ADictionary load the entries of the given lexicon file.
dic.load("/java/lex-main.lex");
dic.load(new File("/java/lex-main.lex"));
//make the ADictionary load the entries from the given input stream
dic.load(new FileInputStream("/java/lex-main.lex"));
//read "customizing lexicons" below for more information
Core segmentation method of the ISegment interface:
public IWord next();
//returns the next segmented token
Demo code:
//create an ISegment from the given ADictionary and JcsegTaskConfig.
//usually an ISegment is created through SegmentFactory#createJcseg:
//pass config and dic as an Object array to SegmentFactory.createJcseg.
//JcsegTaskConfig.COMPLEX_MODE creates a ComplexSeg ISegment tokenizer.
//JcsegTaskConfig.SIMPLE_MODE creates a SimpleSeg ISegment tokenizer.
//JcsegTaskConfig.DETECT_MODE creates a DetectSeg ISegment tokenizer.
//JcsegTaskConfig.SEARCH_MODE creates a SearchSeg ISegment tokenizer.
//JcsegTaskConfig.DELIMITER_MODE creates a DelimiterSeg ISegment tokenizer.
//JcsegTaskConfig.NLP_MODE creates an NLPSeg ISegment tokenizer.
ISegment seg = SegmentFactory.createJcseg(
JcsegTaskConfig.COMPLEX_MODE,
new Object[]{config, dic}
);
//set the text to be segmented
String str = "研究生命起源。";
seg.reset(new StringReader(str));
//fetch the segmentation result
IWord word = null;
while ( (word = seg.next()) != null ) {
System.out.println(word.getValue());
}
//create a JcsegTaskConfig, auto-finding and loading jcseg.properties to initialize
JcsegTaskConfig config = new JcsegTaskConfig(true);
//create the default singleton dictionary and load the lexicons per config
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
//create an ISegment from the given ADictionary and JcsegTaskConfig;
//for API forward compatibility, create ISegment objects through SegmentFactory
String str = "研究生命起源。";
ISegment seg = SegmentFactory.createJcseg(
JcsegTaskConfig.COMPLEX_MODE,
new Object[]{new StringReader(str), config, dic}
);
//note: the code below can be called repeatedly; seg is NOT thread safe
//set the text to be segmented
seg.reset(new StringReader(str));
//fetch the segmentation result
IWord word = null;
while ( (word = seg.next()) != null ) {
System.out.println(word.getValue());
}
Since version 1.9.9, Jcseg has bundled jcseg.properties and the full lexicon inside jcseg-core-{version}.jar. If the JcsegTaskConfig was constructed with JcsegTaskConfig(true), or JcsegTaskConfig#autoLoad() was called, Jcseg automatically falls back to the configuration file on the classpath when no custom configuration file can be found; and if config.getLexiconPath() == null, DictionaryFactory automatically loads the lexicons from the classpath.
//1, construct a default JcsegTaskConfig without any configuration file lookup
JcsegTaskConfig config = new JcsegTaskConfig();
//2, set the custom lexicon path array
config.setLexiconPath(new String[]{
"relative or absolute lexicon path1",
"relative or absolute lexicon path2"
//add more here
});
//3, build the dictionary from config; DictionaryFactory auto-loads all lexicons from the paths set above
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
//1, construct a default JcsegTaskConfig without any configuration file lookup
JcsegTaskConfig config = new JcsegTaskConfig();
//2, construct the ADictionary instance;
//note the second argument is false, preventing DictionaryFactory from auto-loading lexicons from config.getLexiconPath()
ADictionary dic = DictionaryFactory.createSingletonDictionary(config, false);
//3, load the lexicons manually
dic.load(new File("absolute or relative lexicon file path")); //load all entries from the given lexicon file
dic.load("absolute or relative lexicon file path"); //load all entries from the given lexicon file
dic.load(new FileInputStream("absolute or relative lexicon file path")); //load all entries from the given InputStream
dic.loadDirectory("absolute or relative lexicon directory"); //load all entries of all lexicon files under the given directory
dic.loadClassPath(); //load all entries of all lexicon files on the classpath (default path: /lexicon)
TextRankKeywordsExtractor(ISegment seg);
//seg: a Jcseg ISegment tokenizer object
//1, create the Jcseg ISegment tokenizer
JcsegTaskConfig config = new JcsegTaskConfig(true);
config.setClearStopwords(true); //enable stopword filtering
config.setAppendCJKSyn(false); //disable synonym appending
config.setKeepUnregWords(false); //drop unrecognized words
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
ISegment seg = SegmentFactory.createJcseg(
JcsegTaskConfig.COMPLEX_MODE,
new Object[]{config, dic}
);
//2, build the TextRankKeywordsExtractor keywords extractor
TextRankKeywordsExtractor extractor = new TextRankKeywordsExtractor(seg);
extractor.setMaxIterateNum(100); //max pagerank iterations; optional, the default is fine
extractor.setWindowSize(5); //textRank window size; optional, the default is fine
extractor.setKeywordsNum(10); //max number of keywords to return, default 10
//3, extract the keywords from a Reader input stream
String str = "现有的分词算法可分为三大类:基于字符串匹配的分词方法、基于理解的分词方法和基于统计的分词方法。按照是否与词性标注过程相结合,又能够分为单纯分词方法和分词与标注相结合的一体化方法。";
List<String> keywords = extractor.getKeywords(new StringReader(str));
//4, output:
//"分词","方法","分为","标注","相结合","字符串","匹配","过程","大类","单纯"
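The scoring idea behind the extractor can be sketched as plain PageRank over a word co-occurrence graph. The toy implementation below only illustrates the TextRank principle; the real TextRankKeywordsExtractor's graph construction, filtering, and tuning differ:

```java
import java.util.*;

// Toy TextRank sketch: build an undirected co-occurrence graph from
// adjacent word pairs, then iterate the PageRank-style score
//   S(v) = (1 - d) + d * sum over neighbors u of S(u) / deg(u).
// Illustration only; NOT Jcseg's actual extractor.
public class TextRankSketch {
    public static Map<String, Double> rank(List<String> words,
                                           int iterations, double damping) {
        // adjacency from a co-occurrence window of 2 (adjacent words)
        Map<String, Set<String>> adj = new HashMap<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            String a = words.get(i), b = words.get(i + 1);
            if (a.equals(b)) continue;
            adj.computeIfAbsent(a, k -> new HashSet<>()).add(b);
            adj.computeIfAbsent(b, k -> new HashSet<>()).add(a);
        }
        Map<String, Double> score = new HashMap<>();
        for (String w : adj.keySet()) score.put(w, 1.0);
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String w : adj.keySet()) {
                double sum = 0;
                for (String u : adj.get(w)) sum += score.get(u) / adj.get(u).size();
                next.put(w, (1 - damping) + damping * sum);
            }
            score = next;
        }
        return score;
    }
}
```

Words that co-occur with many distinct neighbors accumulate the highest scores, which is why "分词" dominates the example output above.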
TextRankSummaryExtractor(ISegment seg, SentenceSeg sentenceSeg);
//seg: a Jcseg ISegment tokenizer object
//sentenceSeg: a Jcseg SentenceSeg sentence splitter object
//1, create the Jcseg ISegment tokenizer
JcsegTaskConfig config = new JcsegTaskConfig(true);
config.setClearStopwords(true); //enable stopword filtering
config.setAppendCJKSyn(false); //disable synonym appending
config.setKeepUnregWords(false); //drop unrecognized words
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
ISegment seg = SegmentFactory.createJcseg(
JcsegTaskConfig.COMPLEX_MODE,
new Object[]{config, dic}
);
//2, construct the TextRankSummaryExtractor automatic summary extractor
SummaryExtractor extractor = new TextRankSummaryExtractor(seg, new SentenceSeg());
//3, extract a summary of the given length from a Reader input stream
String str = "Jcseg是基于mmseg算法的一个轻量级开源中文分词器,同时集成了关键字提取,关键短语提取,关键句子提取和文章自动摘要等功能,而且提供了最新版本的lucene, solr, elasticsearch的分词接口。Jcseg自带了一个 jcseg.properties文件用于快速配置而获得适合不一样场合的分词应用。例如:最大匹配词长,是否开启中文人名识别,是否追加拼音,是否追加同义词等!";
String summary = extractor.getSummary(new StringReader(str), 64);
//4, output:
//Jcseg是基于mmseg算法的一个轻量级开源中文分词器,同时集成了关键字提取,关键短语提取,关键句子提取和文章自动摘要等功能,而且提供了最新版本的lucene, solr, elasticsearch的分词接口。
//-----------------------------------------------------------------
//5, extract n key sentences from a Reader input stream
String str = "your source string here";
extractor.setSentenceNum(6); //number of key sentences to return
List<String> keySentences = extractor.getKeySentence(new StringReader(str));
TextRankKeyphraseExtractor(ISegment seg);
//seg: a Jcseg ISegment tokenizer object
//1, create the Jcseg ISegment tokenizer
JcsegTaskConfig config = new JcsegTaskConfig(true);
config.setClearStopwords(false); //do not filter stopwords
config.setAppendCJKSyn(false); //disable synonym appending
config.setKeepUnregWords(false); //drop unrecognized words
config.setEnSecondSeg(false); //disable secondary segmentation for English
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
ISegment seg = SegmentFactory.createJcseg(
JcsegTaskConfig.COMPLEX_MODE,
new Object[]{config, dic}
);
//2, build the TextRankKeyphraseExtractor keyphrase extractor
TextRankKeyphraseExtractor extractor = new TextRankKeyphraseExtractor(seg);
extractor.setMaxIterateNum(100); //max pagerank iterations; optional, the default is fine
extractor.setWindowSize(5); //textRank window size; optional, the default is fine
extractor.setKeywordsNum(15); //max number of keywords to return, default 10
extractor.setMaxWordsNum(4); //max words per keyphrase, default 5
//3, extract the keyphrases from a Reader input stream
String str = "支持向量机普遍应用于文本挖掘,例如,基于支持向量机的文本自动分类技术研究一文中很详细的介绍支持向量机的算法细节,文本自动分类是文本挖掘技术中的一种!";
List<String> keyphrases = extractor.getKeyphrase(new StringReader(str));
//4, output:
//支持向量机, 自动分类
Besides nouns (n), time words (t), locative words (s), direction words (f), numerals (m), classifiers (q), distinguishing words (b), pronouns (r), verbs (v), adjectives (a), state words (z), adverbs (d), prepositions (p), conjunctions (c), particles (u), modal words (y), interjections (e), onomatopoeia (o), idioms (i), common expressions (l), abbreviations (j), prefix components (h), suffix components (k), morphemes (g), non-morpheme characters (x), and punctuation (w), proper nouns are added from the perspective of corpus applications: person names (nr), place names (ns), organization names (nt), and other proper nouns (nz).
Format:
root,synonym1[/optional pinyin],synonym2[/optional pinyin],...,synonymN[/optional pinyin]
Example:
Single-line definition:
研究,研讨,钻研,研磨/yan mo,研发
Multi-line definition (as long as the root word is the same, all the defined synonyms belong to the same set):
中央一台,央视一台,中央第一台
中央一台,中央第一频道,央视第一台,央视第一频道
1. The first word is the root of the synonym set; it must already exist in the CJK_WORD lexicon, otherwise the definition line is ignored.
2. The root word distinguishes the synonym sets across lines; two lines with the same root word are automatically merged into one synonym set.
3. Jcseg manages synonym sets with org.lionsoul.jcseg.tokenizer.core.SynonymsEntry; every IWord entry holds a SynonymsEntry attribute pointing to its own synonym set.
4. SynonymsEntry.rootWord stores the root of the synonym set; when merging synonyms, it is recommended to uniformly replace them with the root word.
5. For synonyms other than the root, Jcseg automatically detects and creates the corresponding IWord entries and adds them to the CJK_WORD lexicon, so they do not have to pre-exist in the CJK_WORD lexicon.
6. The other synonyms automatically inherit the part of speech and entity definition of the root, as well as the pinyin definition of the entry in the CJK_WORD lexicon (if it exists); pinyin can also be defined individually by appending "/pinyin" to an entry.
7. All IWord entries of one synonym definition set point to the same SynonymsEntry object, i.e. synonyms automatically reference each other.
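A minimal sketch of parsing one line of this synonym format (illustrative only; Jcseg's internal lexicon loader is more involved):

```java
import java.util.*;

// Parse one line of the synonyms lexicon format described above:
//   root,syn1[/pinyin],syn2[/pinyin],...
// Illustrative sketch; NOT Jcseg's internal loader.
public class SynonymsLine {
    public final String root;
    // synonym word -> optional pinyin (null when no "/pinyin" suffix is given)
    public final Map<String, String> synonyms = new LinkedHashMap<>();

    public SynonymsLine(String line) {
        String[] parts = line.split(",");
        root = parts[0].trim();
        for (int i = 1; i < parts.length; i++) {
            String[] wp = parts[i].trim().split("/", 2); // split off the pinyin
            synonyms.put(wp[0], wp.length > 1 ? wp[1] : null);
        }
    }
}
```

For example, parsing the single-line definition above yields root 研究 with synonym 研磨 carrying the pinyin "yan mo" and the others carrying none.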
Copy it to the {ESHOME}/plugins/jcseg directory and restart Elasticsearch.
In Kibana, first fetch the mapping to be modified:
GET _template/logstash
Copy the fetched mapping and set the analyzer to jcseg_search:
PUT _template/logstash
{
"order": 0,
"version": 50001,
"template": "logstash-*",
"settings": {
"index": {
"refresh_interval": "5s"
}
},
"mappings": {
"_default_": {
"dynamic_templates": [
{
"message_field": {
"path_match": "message",
"mapping": {
"norms": false,
"analyzer": "jcseg_search",
"search_analyzer": "jcseg_search",
"type": "text"
},
"match_mapping_type": "string"
}
},
{
"string_fields": {
"mapping": {
"norms": false,
"analyzer": "jcseg_search",
"search_analyzer": "jcseg_search",
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"match_mapping_type": "string",
"match": "*"
}
}
],
"_all": {
"norms": false,
"analyzer": "jcseg_search",
"search_analyzer": "jcseg_search",
"enabled": true
},
"properties": {
"@timestamp": {
"include_in_all": false,
"type": "date"
},
"geoip": {
"dynamic": true,
"properties": {
"ip": {
"type": "ip"
},
"latitude": {
"type": "half_float"
},
"location": {
"type": "geo_point"
},
"longitude": {
"type": "half_float"
}
}
},
"@version": {
"include_in_all": false,
"type": "keyword"
}
}
}
},
"aliases": {}
}
This change only takes effect for newly created indices.