This post walks through the basic process of text classification with libSVM. It follows the earlier article 使用libsvm实现文本分类 ("Text classification with libsvm"), verifying both the data preparation and the subsequent classification tests. Two changes were made to the original author's pipeline: the tokenizer is replaced with HanLP segmentation, and numbers are filtered out so that only tokens longer than one character are kept.
The classification workflow, reposted from the original author:
In the text preprocessing stage, HanLP-based segmentation was added. The code is as follows:
import com.google.common.collect.Maps;
import com.hankcs.hanlp.HanLP;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// AbstractDocumentAnalyzer, DocumentAnalyzer, ConfigReadable, Configuration,
// Term and TermImpl come from the text-classification project of the referenced post.

/**
 * Tokenizes documents with HanLP.
 * Created by zhouyh on 2018/5/30.
 */
public class HanLPDocumentAnalyzer extends AbstractDocumentAnalyzer implements DocumentAnalyzer {

    private static final Log LOG = LogFactory.getLog(HanLPDocumentAnalyzer.class);

    public HanLPDocumentAnalyzer(ConfigReadable configuration) {
        super(configuration);
    }

    @Override
    public Map<String, Term> analyze(File file) {
        String doc = file.getAbsolutePath();
        LOG.debug("Process document: file=" + doc);
        Map<String, Term> terms = Maps.newHashMap();
        BufferedReader br = null;
        try {
            br = new BufferedReader(new InputStreamReader(new FileInputStream(file), charSet));
            String line = null;
            while ((line = br.readLine()) != null) {
                LOG.debug("Process line: " + line);
                List<com.hankcs.hanlp.seg.common.Term> termList = HanLP.segment(line);
                if (termList != null && termList.size() > 0) {
                    for (com.hankcs.hanlp.seg.common.Term hanLPTerm : termList) {
                        String word = hanLPTerm.word;
                        if (!word.isEmpty() && !super.isStopword(word)) {
                            // keep only tokens longer than one character
                            if (word.trim().length() > 1) {
                                // drop pure numbers: decimals, integers and full-width digits
                                Pattern compile = Pattern.compile("(\\d+\\.\\d+)|(\\d+)|([\\uFF10-\\uFF19]+)");
                                Matcher matcher = compile.matcher(word);
                                if (!matcher.find()) {
                                    Term term = terms.get(word);
                                    if (term == null) {
                                        term = new TermImpl(word);
                                        terms.put(word, term);
                                    }
                                    term.incrFreq();
                                }
                            }
                        } else {
                            LOG.debug("Filter out stop word: file=" + file + ", word=" + word);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("", e);
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
            } catch (IOException e) {
                LOG.warn(e);
            }
            LOG.debug("Done: file=" + file + ", termCount=" + terms.size());
        }
        return terms;
    }

    public static void main(String[] args) {
        String filePath = "/Users/zhouyh/work/yanfa/xunlianji/UTF8/train/ClassFile/C000008/0.txt";
        HanLPDocumentAnalyzer hanLPDocumentAnalyzer = new HanLPDocumentAnalyzer(new Configuration());
        hanLPDocumentAnalyzer.analyze(new File(filePath));
        String str = "测试hanLP分词";
        System.out.println(str);
        // Pattern compile = Pattern.compile("(\\d+\\.\\d+)|(\\d+)|([\\uFF10-\\uFF19]+)");
        // Matcher matcher = compile.matcher("9402");
        // if (matcher.find()) {
        //     System.out.println(matcher.group());
        // }
    }
}
The training resources provided by the original author were merged and expanded to 10 categories. Of the 8000 documents in each category, the first 6000 serve as the training set and the remaining 2000 as the test set. The directory layout is shown in the figure below (category directories such as C000008, each holding numbered .txt files):
The test set has the same structure.
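The original post doesn't show how the split was made; below is a minimal sketch in Java, assuming each category directory contains files numbered 0.txt through 7999.txt (the .../all source directory is a hypothetical stand-in for wherever the merged corpus lives):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SplitDataset {
    public static void main(String[] args) throws Exception {
        String base = "/Users/zhouyh/work/yanfa/xunlianji/UTF8";
        // one directory per category, e.g. C000008; list the real ten categories here
        String[] categories = {"C000008", "C000010"};
        for (String cat : categories) {
            Path src = Paths.get(base, "all", cat);               // hypothetical merged source
            Path train = Paths.get(base, "train/ClassFile", cat);
            Path test = Paths.get(base, "test/ClassFile", cat);
            Files.createDirectories(train);
            Files.createDirectories(test);
            for (int i = 0; i < 8000; i++) {
                String name = i + ".txt";
                // first 6000 files form the training set, the remaining 2000 the test set
                Path dest = (i < 6000 ? train : test).resolve(name);
                Files.copy(src.resolve(name), dest, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}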
The generated feature vectors, in the training-set format libsvm requires, look like the following:
The libsvm training-file format:
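Each line of the training file is one document: the class label first, then index:value pairs for the document's nonzero features, with feature indices in ascending order. For example (labels, indices and weights here are purely illustrative):

1 2:0.1825 45:0.0923 1376:0.2451
1 17:0.3012 45:0.1190 982:0.0457
3 5:0.2210 88:0.1244 4051:0.0733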
The test set is processed in the same way.
Training a text classifier with libSVM
Convert (scale) the feature files; the -l 0 -u 1 options map every feature value into [0, 1]:
./svm-scale -l 0 -u 1 /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/train.txt > /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/train-scale.txt
Apply the same conversion to the test set:
./svm-scale -l 0 -u 1 /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/test.txt > /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/test-scale.txt
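One caveat worth noting: scaling the training and test files independently, as above, can map the same raw value to different scaled values. svm-scale can instead save the scaling parameters learned from the training data (-s) and reapply them to the test data (-r); a variant of the commands above, with paths shortened for readability:

./svm-scale -l 0 -u 1 -s scale-params.txt train.txt > train-scale.txt
./svm-scale -r scale-params.txt test.txt > test-scale.txt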
Train the model; this step takes quite a while (-t 0 selects a linear kernel, and -h 0 disables the shrinking heuristic):
./svm-train -h 0 -t 0 /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/train-scale.txt /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/model.txt
The training process is shown in the figure below:
When training finishes, a model file is produced.
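The model file is plain text: a short header recording the SVM type, kernel, class labels and support-vector counts, followed by the support vectors themselves. Roughly (every number below is illustrative):

svm_type c_svc
kernel_type linear
nr_class 10
total_sv 15000
rho 0.42 -0.13 ...
label 0 1 2 3 4 5 6 7 8 9
nr_sv 1500 1480 ...
SV
1.0 0.0 ... 2:0.1825 45:0.0923 ...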
Classify the preprocessed test texts:
./svm-predict /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/test-scale.txt /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/model.txt /Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/predict.txt
The result: Accuracy = 81.6568% (16333/20002) (classification).
With the whole pipeline done, the files produced (train.txt, train-scale.txt, test.txt, test-scale.txt, model.txt, predict.txt) are listed in the figure below:
That completes a full pass through the libsvm classification workflow, following the original author's approach.
Testing from Java code
Create a Java project and pull in the libsvm jar. I use Maven, so the dependency is:
<!-- https://mvnrepository.com/artifact/tw.edu.ntu.csie/libsvm -->
<!-- libsvm jar -->
<dependency>
    <groupId>tw.edu.ntu.csie</groupId>
    <artifactId>libsvm</artifactId>
    <version>3.17</version>
</dependency>
You also need to copy svm_predict.java and svm_train.java from the libsvm distribution into the project, and make a small change to the svm_predict class so that it returns the prediction accuracy.
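The post doesn't show the exact modification; here is a minimal sketch of one way to do it, based on the svm_predict.java that ships with libsvm 3.17 (unchanged code elided as comments):

private static double predict(BufferedReader input, DataOutputStream output,
                               svm_model model, int predict_probability) throws IOException {
    // ... original prediction loop unchanged; it accumulates `correct` and `total` ...
    double accuracy = (double) correct / total * 100;
    System.out.print("Accuracy = " + accuracy + "% (" + correct + "/" + total + ") (classification)\n");
    return accuracy; // changed: method used to return void
}

public static Double main(String[] argv) throws IOException {
    // ... original option parsing and model loading unchanged ...
    return predict(input, output, model, predict_probability); // changed: propagate the accuracy
}

Note that once main returns Double it is no longer a standard JVM entry point; it is only meant to be called programmatically, as in the test class below. With that change in place, the test code is: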
import java.io.IOException;

public class LibSvmAlgorithm {

    public static void main(String[] args) {
        String[] testArgs = {
                "/Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/test-scale.txt",
                "/Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/model.txt",
                "/Users/zhouyh/work/yanfa/xunlianji/UTF8/heji/predict1.txt"
        };
        try {
            // the modified svm_predict.main returns the classification accuracy
            Double accuracy = svm_predict.main(testArgs);
            System.out.println(accuracy);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}