拼音搜索 -- 自动解析拼音汉子组合（包含多音字，拼音缩写）

时间 2019-12-24

标签拼音搜索自动解析汉子组合包含多音字缩写繁體版

原文原文链接

最近在工做中用到拼音搜索，目前参考靠网上的例子作出一套，在这跟你们分享一下。java

这套代码能够识别包快拼音缩写在内的拼音与汉字混合的字符串（例如：xiug手机h --> 修改手机号）git

话很少说，直接开始：正则表达式

1. 首先有一张中文词语对应拼音的表，而后建一张词语点击量的表（用于记录词语的经常使用度）

PinyinWord tablesql

CREATE TABLE "public"."pinyinword" (
    "id" text COLLATE "default" NOT NULL,
    "word" text COLLATE "default" NOT NULL,
    "whole" text COLLATE "default" NOT NULL,
    "acronym" text COLLATE "default" NOT NULL,
    "wordlength" int4 NOT NULL,
    "wholelength" int4 NOT NULL,
    "acronymlength" int4 NOT NULL
)

WordClick tablemybatis

CREATE TABLE "public"."wordclick" (
    "wordcontent" text COLLATE "default",
    "id" text COLLATE "default" NOT NULL
)

表中数据自行初始化app

2. 接下来介绍两个数据类型，在分析input时起到很重要的做用

/**
* 词元
*/
public class Lexeme {
    private String content; //词元内容
    private LexemeType lexemeType; //词元类型
}

public enum LexemeType {
    CHINESE, //中文
    WHOLE, //全拼
    ACRONYM //拼音首字母缩写
}

/**
* 中文句子（处理用户输入的类）
*/
public class ChineseSentence {
    private String content; // 用户输入内容
    private List<Lexeme> sentenceUnits; // content中包含的词元
    private SentenceType sentenceType; // 句子最低级类型（不能set，赋值请看initSentenceType()）

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public List<Lexeme> getSentenceUnits() {
        return sentenceUnits;
    }

    public SentenceType getSentenceType() {
        return sentenceType;
    }

    public void setSentenceUnits(List<Lexeme> sentenceUnits) {
        this.sentenceUnits = sentenceUnits;
        initSentenceType();
    }

    private void initSentenceType() {
        sentenceType = SentenceType.CHINESE_SENTENCE;
        for (Lexeme lexeme : sentenceUnits) {
            if (lexeme.getLexemeType() == LexemeType.ACRONYM) {
                sentenceType = SentenceType.ACRONYM_SENTENCE;
                break;
            } else if (lexeme.getLexemeType() == LexemeType.WHOLE
                    && sentenceType == SentenceType.CHINESE_SENTENCE) {
                sentenceType = SentenceType.WHOLE_SENTENCE;
            }
        }
    }
}

3. 接下来就是处理用户输入（xiug手机h），使用正则表达式将目标分解成词元（Lexeme）并生成句子

//正则表达式（从网上copy下来作了一些修改，识别中文和疑似的拼音）
private static final String SUSPECTED_PINYIN_REGEX  
    = "[\\u4e00-\\u9fa5]|(sh|ch|zh|[^aoeiuv])?[iuv]?(ai|ei|ao|ou|er|ang?|eng?|ong|a|o|e|i|u|ng|n)?";

使用这个正则表达式可能会截取出不存在的拼音组合，好比说jvaoelasticsearch

这种直接成 j,v,a,o（找一个拼音组合的库，看看截出来的拼音属不属于库里便可）测试

通过截取并给每一个词元附一个lexemeType，获得下边的结果优化

4. 接下来就是对句子中词元逐个进行分析

首先简要说明一下分析原理this

先看一下查询条件

解释一下查询参数，首先是lexemeType 这个字段是指定搜索的词级，必须按照句子的最低词级进行搜索

例如： '修改' --> LexemeType.CHINESE

'xiu改' --> LexemeType.WHOLE

'修g' --> LexemeType.ACRONYM

Search结尾的三个参数是用来作搜索的，他们在SQL中用来作like操做, 这样能够击中索引

and pinyinword.acronym like #{acronymSearch} || '%'

因为用户输入的句子中可能含有中文或者拼音，这两种类型里须要进行过滤

好比说用户输入 ‘修g’ 咱们用最低词级进行搜索就是 like 'xg' % 这样可能搜到 '鞋柜' 因此我用了chineseFilter 和 pinyinFilter 来进行过滤（把 '%修%' append到chineseFilter中），这样查询条件就变成了

and pinyinword.acronym like #{acronymSearch} || '%'
and pinyinword.word like #{chineseFilter}

这样就不会搜到 '鞋柜' 了

来看一下mybatis下的SQL，这里join了wordclick表，取得了每一个词语的点击量，用来排序

<select id="searchByClickCount" resultType="model.value.WordClickCount" parameterType="model.options.PinyinWordAnalyzeSearchOptions">
        select
        pw.word word, count(wc.id) clickCount
        from
        PinyinWord pw left join wordclick wc on wc.wordcontent = pw.word
        where 1=1
        <choose>
            <when test="lexemeType.equals('CHINESE')">
                <if test="chineseSearch!=null">
                    and pw.word like #{chineseSearch} || '%'
                </if>
                group by pw.word
                order by clickCount desc
                <if test="paging">
                    limit 5 offset 0
                </if>
            </when>
            <when test="lexemeType.equals('WHOLE')">
                <if test="wholeSearch!=null">
                    and pw.whole like #{wholeSearch} || '%'
                </if>
                <if test="chineseFilter!=null">
                    and pw.word like #{chineseFilter}
                </if>
                group by pw.word
                order by clickCount desc
                <if test="paging">
                    limit 5 offset 0
                </if>
            </when>
            <otherwise>
                <if test="acronymSearch!=null">
                    and pw.acronym like #{acronymSearch} || '%'
                </if>
                <if test="chineseFilter!=null">
                    and pw.word like #{chineseFilter}
                </if>
                <if test="pinyinFilter!=null">
                    and pw.whole like #{pinyinFilter}
                </if>
                group by pw.word
                order by clickCount desc
                <if test="paging">
                    limit 5 offset 0
                </if>
            </otherwise>
        </choose>
    </select>

而后是分析用户的输入，把查询条件生成出来

这是部分代码，足以明了查询条件生成原则了

LexemeType currentLexemeType; //当前词元类型
        LexemeType lastLexemeType = null; //以前词元最低级
        List<Lexeme> lexemes = sentence.getSentenceUnits(); //累积词元最低级
        for (int i = 0; i < lexemes.size(); i++) {
            Lexeme lexeme = lexemes.get(i);
            currentLexemeType = lexeme.getLexemeType();

            String content = lexeme.getContent();
            switch (currentLexemeType) { 
                case CHINESE: //若当前词元为中文
                    String pinyin = convertSmartAll(content); //转成拼音（pinyin4j）
                    chineseSearch.append(content); //append到chineseSearch字段
                    wholeSearch.append(pinyin); //append到wholeSearch字段
                    acronymSearch.append(pinyin.charAt(0)); //append到acronymSearch字段
                    chineseFilter.append(content).append("%"); //append到chineseFilter字段
                    break;
                case WHOLE: //若为拼音 同理中文
                    wholeSearch.append(content);
                    acronymSearch.append(content.charAt(0));
                    pinyinFilter.append(content).append("%");
                    break;
                case ACRONYM: //同理
                    acronymSearch.append(content);
                    break;
            }
            //将lastLexemeType 转换成当前词元和当前lastLexemeType中的第一级别的LexeType，由于搜索时须要词元最低级
            lastLexemeType = LexemeType.changeDown(lastLexemeType, currentLexemeType); 
            //new searchOptions
            PinyinWordAnalyzeSearchOptions options = new PinyinWordAnalyzeSearchOptions(
                    chineseSearch.toString(), wholeSearch.toString(), acronymSearch.toString(),
                    chineseFilter.toString(), pinyinFilter.toString(), lastLexemeType
            );
            // 结果出来啦。。。
            List<WordClickCount> wordClickCounts = mapper.searchByClickCount(options);

测试一下：

@Test
    public void analyzeAndSearchTest() throws Exception {
        List<List<WordClickCount>> results = pinyinWordService.analyzeSearch("xiugaishoujihao"); //为了初始化 pinyin4j
        long start1 = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            long start = System.currentTimeMillis();
            List<List<WordClickCount>> results1 = pinyinWordService.analyzeSearch("xiugais机haoqyxgai修改");
            long end = System.currentTimeMillis();
            System.out.println(end - start + " ms");
        }
        long end1 = System.currentTimeMillis();
        System.out.println(end1 - start1 + " ms");
    }

测试结果

测试分解100条 11个词元的用户输入，话费22.1秒，平均每一个221ms，效果还行

将来优化：

在sql中，使用了表关联和count操做，当数据量比较大的时候，能够考虑将pinyinword 加一个字段，天天跑定时把count update到pinyinword表中，这样能够对pinyinword进行单表查询了

参考资料：

谈谈我作拼音搜索的一点经验

LC Analysis for Elasticsearch