Environment: CDH 5.13.0 (for setup details, see 5. Word Count)
For example, suppose we have the following two documents:

Doc1: "Bob likes to play basketball, Jim likes too."
Doc2: "Bob also likes to play football games."
From these two text documents we construct a dictionary:
Dictionary = {1:"Bob", 2:"like", 3:"to", 4:"play", 5:"basketball", 6:"also", 7:"football", 8:"games", 9:"Jim", 10:"too"}
This dictionary contains 10 distinct words. Using the dictionary indices, each of the two documents above can be represented as a 10-dimensional vector, where each entry is a non-negative integer giving the number of times that word occurs in the document:

Doc1: [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
Doc2: [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
Each element of the vector is the count of the corresponding dictionary word in that document (below we refer to this vector as the word histogram).
Note, however, that in constructing these document vectors we lose the order in which the words appeared in the original sentences.
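To make the construction concrete, here is a minimal sketch (not part of the original post) that builds the word histogram for Doc1, assuming simple punctuation-stripping tokenization plus a crude plural-stripping rule so that "likes" matches the dictionary entry "like":

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BowHistogram {
    public static void main(String[] args) {
        // The example dictionary, in index order (slot i holds word i+1).
        String[] dict = {"bob", "like", "to", "play", "basketball",
                         "also", "football", "games", "jim", "too"};
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < dict.length; i++) {
            index.put(dict[i], i);
        }

        String doc = "Bob likes to play basketball, Jim likes too.";
        int[] histogram = new int[dict.length];
        for (String token : doc.toLowerCase().split("\\W+")) {
            // Crude normalization: drop a trailing "s" if that yields a dictionary word.
            String stem = token.endsWith("s") ? token.substring(0, token.length() - 1) : token;
            String word = index.containsKey(stem) ? stem : token;
            Integer slot = index.get(word);
            if (slot != null) {
                histogram[slot]++;
            }
        }
        System.out.println(Arrays.toString(histogram));
        // Prints: [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
    }
}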
Based on the BOW model and a sentiment lexicon, each word can be classified as positive, negative, or neutral. By aggregating these word-level labels over the whole document, we can judge the document's overall sentiment polarity.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SentimentAnalysis {

    public static class Map extends Mapper<Object, Text, Text, Text> {

        private HashMap<String, String> emotionDict = new HashMap<String, String>();

        // Load the emotion dictionary (one tab-separated word-emotion pair per line)
        // once per mapper.
        @Override
        public void setup(Context context) throws IOException {
            Configuration configuration = context.getConfiguration();
            String dict = configuration.get("dict", "");
            BufferedReader reader = new BufferedReader(new FileReader(dict));
            String line = reader.readLine();
            while (line != null) {
                String[] word_emotion = line.split("\t");
                emotionDict.put(word_emotion[0].trim().toLowerCase(), word_emotion[1].trim());
                line = reader.readLine();
            }
            reader.close();
        }

        // For every word found in the dictionary, emit (filename, emotion).
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            String line = value.toString().trim();
            StringTokenizer tokenizer = new StringTokenizer(line);
            Text filename = new Text(fileName);
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken().trim().toLowerCase();
                if (emotionDict.containsKey(word)) {
                    context.write(filename, new Text(emotionDict.get(word)));
                }
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        // Tally the emotion labels per file and emit one
        // (filename, emotion<TAB>count) record per emotion.
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            HashMap<String, Integer> map = new HashMap<String, Integer>();
            for (Text value : values) {
                String emotion = value.toString();
                int count = map.containsKey(emotion) ? map.get(emotion) : 0;
                map.put(emotion, count + 1);
            }
            StringBuilder builder = new StringBuilder();
            Iterator<String> iterator = map.keySet().iterator();
            while (iterator.hasNext()) {
                String emotion = iterator.next();
                int count = map.get(emotion);
                builder.append(emotion).append("\t").append(count);
                context.write(key, new Text(builder.toString()));
                builder.setLength(0);
            }
        }
    }

    // args[0] : input path
    // args[1] : output path
    // args[2] : path of emotion dictionary
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // args[2] : path of emotion dictionary
        configuration.set("dict", args[2]);

        Job job = Job.getInstance(configuration);
        job.setJarByClass(SentimentAnalysis.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // args[0] : input path
        // args[1] : output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
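The Mapper's setup() reads the lexicon as plain text, one tab-separated word-emotion pair per line. The contents of emotionDict.txt are not shown in the original post; an excerpt might look like this (entries illustrative only):

good	positive
happy	positive
bad	negative
terrible	negative
okay	neutral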
Download the four sample documents and put them into the input/ folder. Copy SentimentAnalysis.java, the input/ folder, and emotionDict.txt into the same directory. Then, on the command line, compile and package the program, upload the input to HDFS, and run:
hdfs dfs -put input /                    # upload input/ to HDFS
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop com.sun.tools.javac.Main *.java   # compile the Java source
jar cf SentimentAnalysis.jar *.class     # package the classes into a jar
chmod 777 emotionDict.txt                # open permissions on the dictionary file

# hadoop runs the jar program on the cluster:
#   jar                      tells hadoop it is running a jar program
#   SentimentAnalysis.jar    is the program archive
#   SentimentAnalysis        is the entry class
#   /input                   is the HDFS input path, i.e. args[0]
#   /output                  is the HDFS output path, i.e. args[1]
#   ${PWD}/emotionDict.txt   is the local emotion dictionary file, i.e. args[2]
hadoop jar SentimentAnalysis.jar SentimentAnalysis /input /output ${PWD}/emotionDict.txt

hdfs dfs -cat /output/*                  # inspect the output; _SUCCESS marks a finished job
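Each output line has the form filename<TAB>emotion<TAB>count, one line per (file, emotion) pair emitted by the reducer. For example (counts illustrative only):

sample1.txt	positive	12
sample1.txt	negative	4
sample1.txt	neutral	7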
The sentiment score is then computed as:

sentiment = (positive - negative) / (positive + negative)
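The MapReduce job itself only emits the per-file counts; turning them into a score is simple post-processing. A minimal sketch (a hypothetical helper, not part of the original code):

// Hypothetical post-processing helper: computes the sentiment score from
// the positive/negative counts emitted by the reducer. Returns 0.0 when the
// document contains no sentiment-bearing words, to avoid division by zero.
static double sentimentScore(int positive, int negative) {
    int total = positive + negative;
    return total == 0 ? 0.0 : (double) (positive - negative) / total;
}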
The code above is lightly modified from the original, which comes from Cloudera's official example, "Example: Sentiment Analysis Using MapReduce Custom Counters".