#coding4fun# Word-Frequency Counting: Optimization Ideas

Date: 2020-02-15

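For this round of coding4fun I went with a HashMap-based implementation. The overall approach and flow are probably much the same for everyone. The C++ folks have already written good summaries that cover the logic-level optimizations, so here I will describe some optimizations specific to the Java implementation.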

Use ByteString instead of String

At first I read the file, converted it into a String object, and did all the work on that String. Code written this way is convenient.

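There is a problem, though: converting the byte[] read from the file into a String is very slow. Allocating memory for a 1 GB String already takes a long time. On top of that, String stores its characters in a char[] internally (at least up to Java 8), so constructing one from a byte[] means walking the entire byte[] and decoding it according to the charset. This step is expensive, and it can certainly be optimized.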

So I replaced String with a ByteString class:

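class ByteString {
    byte[] bs;  // shared buffer holding the file contents; never copied
    int start;  // where this word begins in bs
    int end;    // where this word ends in bs

    ByteString(byte[] bs, int start, int end) {
        this.bs = bs; this.start = start; this.end = end;
    }
}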

The hashCode() and equals() methods are modeled on String's implementation.
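To make the idea concrete, here is a minimal sketch of what such a class could look like. It is not the original code: treating `end` as exclusive, caching the hash, and the `toString()` helper are all assumptions made for illustration. The hash uses the same `31 * h + x` polynomial as `String.hashCode()`, applied to bytes instead of chars.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** A view over bytes [start, end) of a shared buffer; no bytes are copied. */
final class ByteString {
    final byte[] bs;
    final int start;  // inclusive (assumption)
    final int end;    // exclusive (assumption)
    private int hash; // cached like String's hash; 0 means "not computed yet"

    ByteString(byte[] bs, int start, int end) {
        this.bs = bs;
        this.start = start;
        this.end = end;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            for (int i = start; i < end; i++) {
                h = 31 * h + bs[i]; // same polynomial as String.hashCode, over bytes
            }
            hash = h;
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ByteString)) return false;
        ByteString other = (ByteString) o;
        int len = end - start;
        if (len != other.end - other.start) return false;
        for (int i = 0; i < len; i++) {
            if (bs[start + i] != other.bs[other.start + i]) return false;
        }
        return true;
    }

    @Override
    public String toString() {
        // Decoding happens only when a word is actually printed.
        return new String(bs, start, end - start, StandardCharsets.UTF_8);
    }
}

class ByteStringDemo {
    public static void main(String[] args) {
        byte[] buf = "foo bar foo".getBytes(StandardCharsets.UTF_8);
        Map<ByteString, Integer> counts = new HashMap<>();
        counts.merge(new ByteString(buf, 0, 3), 1, Integer::sum);  // "foo"
        counts.merge(new ByteString(buf, 4, 7), 1, Integer::sum);  // "bar"
        counts.merge(new ByteString(buf, 8, 11), 1, Integer::sum); // "foo" again
        System.out.println(counts.get(new ByteString(buf, 0, 3))); // prints 2
    }
}
```

Because every instance only points into the shared buffer, creating one costs a small fixed-size allocation regardless of how large the file is.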

I tested the following code on the 16-core coding4fun machine:

Code 1:

byte[] bs = new byte[1024*1024*1024];
long st = System.currentTimeMillis();
new String(bs);
System.out.println(System.currentTimeMillis() - st);  // 2619ms

Code 2:

byte[] bs = new byte[1024*1024*1024];
long st = System.currentTimeMillis();
int count = 100000;
for (int i = 0; i < count; i++)
    new ByteString(bs, 0, 100);
System.out.println(System.currentTimeMillis() - st);  //10ms

Keep the code inside loops lean

With a HashMap implementation, counting words inevitably involves code like this:

ByteString str = new ByteString(bs, start, end);  // allocated for every word occurrence
Count count = map.get(str);
if (count == null) {
    count = new Count(str, 1);
    map.put(str, count);
} else {
    count.add(1);
}

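Normally there is nothing wrong with this code, but once the number of words gets large enough (the final file was 1.1 GB and contained more than 200 million words), it becomes worth optimizing. The object created on the first line is only needed the first time a word appears. On every later occurrence it does not have to be created at all.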

So I created a Pmap class that extends HashMap and adds a get(byte[] bs, int start, int end) method. The code above becomes:

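Count count = map.get(bs, start, end);  // lookup by byte range, no allocation
if (count == null) {
    // First occurrence of this word: only now is a ByteString created.
    ByteString str = new ByteString(bs, start, end);
    count = new Count(str, 1);
    map.put(str, count);
} else {
    count.add(1);
}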
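The post does not show how Pmap is implemented. A class outside `java.util` cannot reach `java.util.HashMap`'s internal bucket array, so the sketch below swaps in a small standalone chained hash table to illustrate the same idea: hash and compare the raw byte range directly, and allocate only when a word is seen for the first time. `RangeCountMap`, `Entry`, and the fixed table size with no resizing are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Fixed-size chained hash table keyed by byte ranges (no resizing, for brevity). */
final class RangeCountMap {
    /** One distinct word plus its count; allocated once per distinct word. */
    static final class Entry {
        final byte[] bs;
        final int start, end, hash;
        final Entry next;
        int count = 1;

        Entry(byte[] bs, int start, int end, int hash, Entry next) {
            this.bs = bs;
            this.start = start;
            this.end = end;
            this.hash = hash;
            this.next = next;
        }

        boolean matches(byte[] other, int oStart, int oEnd) {
            int len = end - start;
            if (len != oEnd - oStart) return false;
            for (int i = 0; i < len; i++) {
                if (bs[start + i] != other[oStart + i]) return false;
            }
            return true;
        }
    }

    private final Entry[] table = new Entry[1 << 16]; // power of two, so masking works

    private static int hash(byte[] bs, int start, int end) {
        int h = 0;
        for (int i = start; i < end; i++) h = 31 * h + bs[i];
        return h;
    }

    private int indexFor(int h) {
        return (h ^ (h >>> 16)) & (table.length - 1);
    }

    private Entry find(byte[] bs, int start, int end, int h) {
        for (Entry e = table[indexFor(h)]; e != null; e = e.next) {
            if (e.hash == h && e.matches(bs, start, end)) return e;
        }
        return null;
    }

    /** Looks up bs[start, end) without allocating; returns null if the word is new. */
    Entry get(byte[] bs, int start, int end) {
        return find(bs, start, end, hash(bs, start, end));
    }

    /** Counts one occurrence; an Entry is created only the first time a word is seen. */
    void add(byte[] bs, int start, int end) {
        int h = hash(bs, start, end);
        Entry e = find(bs, start, end, h);
        if (e != null) {
            e.count++;
            return;
        }
        int idx = indexFor(h);
        table[idx] = new Entry(bs, start, end, h, table[idx]);
    }
}

class RangeCountMapDemo {
    public static void main(String[] args) {
        byte[] buf = "foo bar foo".getBytes(StandardCharsets.UTF_8);
        RangeCountMap map = new RangeCountMap();
        map.add(buf, 0, 3);  // "foo": allocates an Entry
        map.add(buf, 4, 7);  // "bar": allocates an Entry
        map.add(buf, 8, 11); // "foo" again: no allocation, just count++
        System.out.println(map.get(buf, 0, 3).count); // prints 2
    }
}
```

With more than 200 million words but far fewer distinct ones, almost every call takes the no-allocation path.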

Avoid locks when you can; when you can't, shrink their scope

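ConcurrentHashMap's implementation is certainly ingenious, but if you can get by without locks, do so, and when you really need them, keep their scope as small as possible. CAS is cheaper than locking, but it still has a cost.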

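We count with multiple threads, so the objects holding the results need to be thread-safe. At first I used AtomicInteger, but it is still far slower than a plain count++, and the gap grows with the number of words.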

I also tried the volatile keyword, but the results were unsatisfactory as well; it still could not match a plain count++.

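In the end I solved this with two fields. While a thread is counting words on its own, it uses a plain count++. At the merge stage, where the number of words left to process is already small, the totals are accumulated with AtomicInteger, which has almost no effect on overall efficiency.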

Reducing both the scope of the locking and the number of times it happens is what delivers the efficiency gain.
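The two-phase scheme described above can be sketched roughly as follows. This is an illustration under assumptions, not the original code: it uses String keys for brevity, keeps the plain counter and the atomic counter in separate maps rather than as two fields of one class, and the names `TwoPhaseCountSketch`, `LocalCount`, `countSlice`, and `merge` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class TwoPhaseCountSketch {
    /** Plain, non-thread-safe counter: only ever touched by the thread that owns it. */
    static final class LocalCount {
        int n;
    }

    /** Phase 1: count one slice of the input with plain n++, no atomics involved. */
    static Map<String, LocalCount> countSlice(String[] words, int from, int to) {
        Map<String, LocalCount> local = new HashMap<>();
        for (int i = from; i < to; i++) {
            LocalCount c = local.get(words[i]);
            if (c == null) {
                c = new LocalCount();
                local.put(words[i], c);
            }
            c.n++;
        }
        return local;
    }

    /** Phase 2: fold one thread's totals into the shared map, one atomic add per distinct word. */
    static void merge(Map<String, LocalCount> local,
                      ConcurrentHashMap<String, AtomicInteger> total) {
        for (Map.Entry<String, LocalCount> e : local.entrySet()) {
            total.computeIfAbsent(e.getKey(), k -> new AtomicInteger())
                 .addAndGet(e.getValue().n);
        }
    }

    /** Splits the words across the given number of threads (assumed >= 1) and counts them. */
    static ConcurrentHashMap<String, AtomicInteger> count(String[] words, int threads) {
        ConcurrentHashMap<String, AtomicInteger> total = new ConcurrentHashMap<>();
        List<Thread> workers = new ArrayList<>();
        int chunk = (words.length + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int from = Math.min(words.length, t * chunk);
            final int to = Math.min(words.length, from + chunk);
            Thread w = new Thread(() -> merge(countSlice(words, from, to), total));
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(ex);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] words = "a b a c a b".split(" ");
        System.out.println(count(words, 3));
    }
}
```

The hot loop runs once per word occurrence and touches only thread-private state. The atomic operations run once per distinct word per thread, which is a much smaller number when words repeat heavily.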