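1. When converting between Simplified and Traditional Chinese, the simplest approach is to convert character by character. However, where one Simplified character maps to several Traditional characters, or one Traditional character to several Simplified ones, the conversion has to take semantics and the surrounding word into account. That raises a hard problem: how to pull individual words out of a long string, which is the problem of Chinese word segmentation.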
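2. Chinese word segmentation means splitting a sequence of Chinese characters into individual words; in other words, recombining a continuous character sequence into a word sequence according to some specification. In English text, spaces act as natural delimiters between words. In Chinese, only characters, sentences, and paragraphs can be separated by obvious delimiters; words alone have no formal delimiter. English also has the problem of dividing phrases, but at the word level Chinese is far more complex and difficult than English.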
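To make the contrast concrete, here is a minimal sketch in C that counts whitespace-delimited tokens. The helper name `count_space_delimited_tokens` is made up for this illustration and is not part of any library; the input is assumed to be UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts maximal runs of bytes that are not
   ASCII whitespace, i.e. the tokens you get by splitting on spaces. */
size_t count_space_delimited_tokens(const char *text) {
    assert(text != NULL);
    size_t count = 0;
    int in_token = 0;
    for (const char *p = text; *p != '\0'; ++p) {
        int is_space = (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r');
        if (is_space) {
            in_token = 0;          /* a delimiter ends the current token */
        } else if (!in_token) {
            in_token = 1;          /* first byte of a new token */
            ++count;
        }
    }
    return count;
}
```

For an English sentence such as "it is raining today" this finds four tokens, one per word. For "今天下雨了" it finds a single token, because the sentence contains no delimiter between its words; splitting on whitespace gives no help with Chinese.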
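3. Existing segmentation algorithms fall into three broad categories: methods based on string matching, methods based on understanding, and methods based on statistics.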
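4. The most commonly used method is string matching.
This method, also called mechanical segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy. If a string is found in the dictionary, the match succeeds (a word has been recognized). By scan direction, string-matching segmentation divides into forward matching and reverse matching; by which length takes priority, it divides into maximum (longest) matching and minimum (shortest) matching. Several commonly used mechanical segmentation methods are:
1) Forward maximum matching (scanning left to right);
2) Reverse maximum matching (scanning right to left);
3) Minimum segmentation (producing the fewest words per sentence);
4) Bidirectional maximum matching (two scans, one left to right and one right to left).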
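As an illustration of method 1), here is a minimal sketch of forward maximum matching in C. It is not how Apple's tokenizer works internally. The function names, the tiny dictionary in the usage note, and the window size are all assumptions made for this example, and the input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_WINDOW 16 /* hard cap on the match window, in characters */

/* Byte length of the UTF-8 sequence starting with lead byte c. */
static size_t utf8_char_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1; /* stray continuation byte: step over it */
}

/* Returns 1 if the first len bytes of s are exactly a dictionary entry. */
static int in_dictionary(const char *s, size_t len,
                         const char *const *dict, size_t dict_count) {
    for (size_t i = 0; i < dict_count; ++i) {
        if (strlen(dict[i]) == len && memcmp(dict[i], s, len) == 0) return 1;
    }
    return 0;
}

/* Forward maximum matching: at each position take the longest dictionary
   word of at most max_word_chars characters; if nothing matches, emit a
   single character. Stores each token's byte length in token_lens and
   returns the number of tokens written (at most max_tokens). */
size_t fmm_segment(const char *text, const char *const *dict, size_t dict_count,
                   size_t max_word_chars, size_t *token_lens, size_t max_tokens) {
    assert(text != NULL && dict != NULL && token_lens != NULL);
    assert(max_word_chars >= 1);
    size_t text_len = strlen(text);
    size_t pos = 0, count = 0;
    while (pos < text_len && count < max_tokens) {
        /* Byte offsets where each of the next few characters ends. */
        size_t ends[MAX_WINDOW];
        size_t n = 0, end = pos;
        while (end < text_len && n < max_word_chars && n < MAX_WINDOW) {
            end += utf8_char_len((unsigned char)text[end]);
            if (end > text_len) end = text_len; /* truncated final sequence */
            ends[n++] = end;
        }
        /* Default to one character, then try the longest window first,
           shrinking by one character per attempt. */
        size_t chosen = ends[0] - pos;
        for (size_t k = n; k > 1; --k) {
            size_t len = ends[k - 1] - pos;
            if (in_dictionary(text + pos, len, dict, dict_count)) {
                chosen = len;
                break;
            }
        }
        token_lens[count++] = chosen;
        pos += chosen;
    }
    return count;
}
```

With a toy dictionary containing 今天 and 下雨 and a window of four characters, the text 今天下雨了 is cut into 今天 / 下雨 / 了. The greedy strategy can also go wrong: if the dictionary contains 研究, 研究生, 生命 and 起源, the text 研究生命起源 is cut into 研究生 / 命 / 起源. That is one reason the reverse and bidirectional variants exist.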
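Apple has supported Chinese word segmentation for a long time, and almost all of us use it every day. Think about it: when you long-press a piece of text on your phone, the word under your finger is usually selected. That is an excellent use case for word segmentation.
Apple provides a complete API; for full details, go straight to the documentation: CFStringTokenizer Reference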
1. Relevant system framework: <CoreFoundation.framework>
2. Relevant header file: <CoreFoundation/CFStringTokenizer.h>
CFStringTokenizerRef CFStringTokenizerCreate(CFAllocatorRef alloc, CFStringRef string, CFRange range, CFOptionFlags options, CFLocaleRef locale) API_AVAILABLE(macos(10.5), ios(3.0), watchos(2.0), tvos(9.0));
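First parameter, alloc: usually just pass NULL (the current default CFAllocator is used).
Second parameter, string: the string to segment, passed as (__bridge CFStringRef)string.
Third parameter, range: the range of string to segment, usually the whole string.
Fourth parameter, options: sets the tokenization unit; a practical choice is kCFStringTokenizerUnitWordBoundary. The available CFOptionFlags values are:
    kCFStringTokenizerUnitWord = 0,
    kCFStringTokenizerUnitSentence = 1,
    kCFStringTokenizerUnitParagraph = 2,
    kCFStringTokenizerUnitLineBreak = 3,
    kCFStringTokenizerUnitWordBoundary = 4,
    kCFStringTokenizerAttributeLatinTranscription = 1UL << 16,
    kCFStringTokenizerAttributeLanguage = 1UL << 17,
Fifth parameter, locale: the locale, which selects language-specific or region-specific behavior; pass NULL to have the tokenizer identify it automatically.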
/*!
    @function CFStringTokenizerCreate
    @abstract Creates a tokenizer instance.
    @param alloc The CFAllocator which should be used to allocate memory for the tokenizer and its storage for values. This parameter may be NULL in which case the current default CFAllocator is used.
    @param string The string to tokenize.
    @param range The range of characters within the string to be tokenized. The specified range must not exceed the length of the string.
    @param options Use one of the Tokenization Unit options to specify how the string should be tokenized. Optionally specify one or more attribute specifiers to tell the tokenizer to prepare specified attributes when it tokenizes the string.
    @param locale The locale to specify language or region specific behavior. Pass NULL if you want tokenizer to identify the locale automatically.
    @result A reference to the new CFStringTokenizer.
*/
CFStringTokenizerTokenType CFStringTokenizerAdvanceToNextToken(CFStringTokenizerRef tokenizer) API_AVAILABLE(macos(10.5), ios(3.0), watchos(2.0), tvos(9.0));
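Pass in the tokenizer you created. Each call advances to the next token in string order and returns its type, or kCFStringTokenizerTokenNone when no further token is found.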
/*!
    @function CFStringTokenizerAdvanceToNextToken
    @abstract Token enumerator.
    @param tokenizer The reference to CFStringTokenizer returned by CFStringTokenizerCreate.
    @result Type of the token if succeeded in finding a token and setting it as current token. kCFStringTokenizerTokenNone if failed in finding a token.
    @discussion If there is no preceding call to CFStringTokenizerGoToTokenAtIndex or CFStringTokenizerAdvanceToNextToken, it finds the first token in the range specified to CFStringTokenizerCreate. If there is a current token after successful call to CFStringTokenizerGoToTokenAtIndex or CFStringTokenizerAdvanceToNextToken, it proceeds to the next token. If succeeded in finding a token, set it as current token and return its token type. Otherwise invalidate current token and return kCFStringTokenizerTokenNone. The range and attribute of the token can be obtained by calling CFStringTokenizerGetCurrentTokenRange and CFStringTokenizerCopyCurrentTokenAttribute. If the token is a compound (with type kCFStringTokenizerTokenHasSubTokensMask or kCFStringTokenizerTokenHasDerivedSubTokensMask), its subtokens and (or) derived subtokens can be obtained by calling CFStringTokenizerGetCurrentSubTokens.
*/
CFRange CFStringTokenizerGetCurrentTokenRange(CFStringTokenizerRef tokenizer) API_AVAILABLE(macos(10.5), ios(3.0), watchos(2.0), tvos(9.0));
Returns the range, within the string, of the token found by the most recent call to CFStringTokenizerAdvanceToNextToken.
/*!
    @function CFStringTokenizerGetCurrentTokenRange
    @abstract Returns the range of current token.
    @param tokenizer The reference to CFStringTokenizer returned by CFStringTokenizerCreate.
    @result Range of current token, or {kCFNotFound,0} if there is no current token.
*/
// The string to segment
NSString *string = @"今天下雨了嗎?小明說下雨了,小紅說沒下雨。那麼,小明和小紅誰在說謊呢?";
NSMutableArray *keywords = [[NSMutableArray alloc] init];

// Create the tokenizer over the whole string; a NULL locale lets it detect the language
CFStringTokenizerRef ref = CFStringTokenizerCreate(NULL,
                                                   (__bridge CFStringRef)string,
                                                   CFRangeMake(0, string.length),
                                                   kCFStringTokenizerUnitWordBoundary,
                                                   NULL);

// Advance token by token until the tokenizer reports that none is left,
// recording each token in the array
while (CFStringTokenizerAdvanceToNextToken(ref) != kCFStringTokenizerTokenNone) {
    // Range of the current token within the string
    CFRange range = CFStringTokenizerGetCurrentTokenRange(ref);
    NSString *keyWord = [string substringWithRange:NSMakeRange(range.location, range.length)];
    [keywords addObject:keyWord];
    NSLog(@"%@", keyWord);
}
NSLog(@"keywords = %@", keywords);

// The tokenizer came from a Create function, so release it
CFRelease(ref);
The output is as follows:
2017-10-20 14:09:23.569459+0800 TokenizerDemo[7220:227855] 今天
2017-10-20 14:09:23.569608+0800 TokenizerDemo[7220:227855] 下
2017-10-20 14:09:23.569742+0800 TokenizerDemo[7220:227855] 雨
2017-10-20 14:09:23.569844+0800 TokenizerDemo[7220:227855] 了
2017-10-20 14:09:23.570082+0800 TokenizerDemo[7220:227855] 嗎
2017-10-20 14:09:23.570207+0800 TokenizerDemo[7220:227855] ?
2017-10-20 14:09:23.570313+0800 TokenizerDemo[7220:227855] 小明
2017-10-20 14:09:23.570431+0800 TokenizerDemo[7220:227855] 說
2017-10-20 14:09:23.570522+0800 TokenizerDemo[7220:227855] 下雨
2017-10-20 14:09:23.570615+0800 TokenizerDemo[7220:227855] 了
2017-10-20 14:09:23.570695+0800 TokenizerDemo[7220:227855] ,
2017-10-20 14:09:23.570764+0800 TokenizerDemo[7220:227855] 小紅
2017-10-20 14:09:23.570860+0800 TokenizerDemo[7220:227855] 說
2017-10-20 14:09:23.570936+0800 TokenizerDemo[7220:227855] 沒
2017-10-20 14:09:23.571007+0800 TokenizerDemo[7220:227855] 下雨
2017-10-20 14:09:23.571117+0800 TokenizerDemo[7220:227855] 。
2017-10-20 14:09:23.571373+0800 TokenizerDemo[7220:227855] 那麼
2017-10-20 14:09:23.571529+0800 TokenizerDemo[7220:227855] ,
2017-10-20 14:09:23.571773+0800 TokenizerDemo[7220:227855] 小明
2017-10-20 14:09:23.572000+0800 TokenizerDemo[7220:227855] 和
2017-10-20 14:09:23.572235+0800 TokenizerDemo[7220:227855] 小紅
2017-10-20 14:09:23.572559+0800 TokenizerDemo[7220:227855] 誰
2017-10-20 14:09:23.573009+0800 TokenizerDemo[7220:227855] 在
2017-10-20 14:09:23.573432+0800 TokenizerDemo[7220:227855] 說謊
2017-10-20 14:09:23.573892+0800 TokenizerDemo[7220:227855] 呢
2017-10-20 14:09:23.574219+0800 TokenizerDemo[7220:227855] ?