Day 20: Stanford CoreNLP - Performing Sentiment Analysis of Twitter Using Java

Date: 2019-12-26

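Editor's note: We came across an interesting series of articles, "Learning 30 Technologies in 30 Days", and are translating it, one article per day, as a year-end gift. Below is Day 20.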

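Today we learn how to use the Stanford CoreNLP Java API to perform sentiment analysis. A few days ago I also wrote an article on doing sentiment analysis in Python with the TextBlob API. I have built an application that reports the sentiment of tweets matching given keywords; let's see what it can do.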

Application

The demo application runs on OpenShift at http://sentiments-t20.rhcloud.com/ and has two features:

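The first feature: given a list of Twitter search terms, it shows the sentiment of the 20 most recent tweets for each term. You must tick the checkbox on the page to enable this feature. Positive tweets are shown in green and negative tweets in red.
The second feature runs sentiment analysis on a piece of text that you enter.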

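What is Stanford CoreNLP?

Stanford CoreNLP is a Java natural language analysis library. It integrates all of Stanford's natural language processing tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, and the sentiment analysis tools, and it provides model files for analyzing English.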

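Prerequisites

Basic Java knowledge is required. Install the latest Java Development Kit (JDK); either OpenJDK 7 or Oracle JDK 7 will do.
Download the Stanford CoreNLP package from the official website.
Sign up for an OpenShift account. It is completely free, and each user is allocated 1.5 GB of memory and 3 GB of disk space.
Install the RHC client tool. It requires Ruby 1.8.7 or newer. If you already have RubyGems, run sudo gem install rhc and make sure it is the latest version. To update RHC, run sudo gem update rhc. For further help installing the RHC command-line tool, see https://www.openshift.com/developers/rhc-client-tools-install
Set up your OpenShift account with the rhc setup command. It helps you create a namespace and uploads your SSH keys to the OpenShift server.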

GitHub Repository

The code for today's demo application is available on GitHub: day20-stanford-sentiment-analysis-demo

Get SentimentsApp Up and Running in Under Two Minutes

Start by creating an application named sentimentsapp.

$ rhc create-app sentimentsapp jbosseap --from-code=https://github.com/shekhargulati/day20-stanford-sentiment-analysis-demo.git

Alternatively, to request a medium gear (-g medium), use the following command:

$ rhc create-app sentimentsapp jbosseap -g medium --from-code=https://github.com/shekhargulati/day20-stanford-sentiment-analysis-demo.git

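This creates an application container and sets up all the required SELinux policies and cgroup configuration. OpenShift also sets up a private git repository and clones it to your local system. Finally, OpenShift propagates the DNS to the outside world, and the application becomes accessible at http://sentimentsapp-{domain-name}.rhcloud.com/ (replace domain-name with your own domain name).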

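The application also needs four environment variables that correspond to a Twitter application. Create a new Twitter application at https://dev.twitter.com/apps/new, then set the four environment variables as shown below.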

$ rhc env set TWITTER_OAUTH_ACCESS_TOKEN=<please enter value> -a sentimentsapp

$ rhc env set TWITTER_OAUTH_ACCESS_TOKEN_SECRET=<please enter value> -a sentimentsapp

$ rhc env set TWITTER_OAUTH_CONSUMER_KEY=<please enter value> -a sentimentsapp

$ rhc env set TWITTER_OAUTH_CONSUMER_SECRET=<please enter value> -a sentimentsapp

Restart the application so that the server picks up the environment variables.

$ rhc restart-app --app sentimentsapp
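
Because the application reads its Twitter credentials from the environment, a variable that was never set tends to show up only later as an authentication failure. The following is a minimal sketch of a start-up check, assuming a hypothetical helper class named TwitterEnvCheck that is not part of the demo project. It only uses the four variable names given above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the demo project: reports which of the
// four Twitter variables named in this article are missing or blank.
final class TwitterEnvCheck {

    static final List<String> REQUIRED = Arrays.asList(
            "TWITTER_OAUTH_CONSUMER_KEY",
            "TWITTER_OAUTH_CONSUMER_SECRET",
            "TWITTER_OAUTH_ACCESS_TOKEN",
            "TWITTER_OAUTH_ACCESS_TOKEN_SECRET");

    // The environment is passed in as a map so the check can be exercised
    // without touching the real process environment.
    static List<String> missing(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

In the application you would call TwitterEnvCheck.missing(System.getenv()) once at start-up and log or reject a non-empty result.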

Start by adding Maven dependencies for stanford-corenlp and twitter4j to pom.xml. Use version 3.3.0 of stanford-corenlp for the sentiment analysis API.

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.3.0</version>
</dependency>

<dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-core</artifactId>
    <version>[3.0,)</version>
</dependency>

The twitter4j dependency is needed for searching Twitter.

Update the Maven project to Java 7 by changing these properties in pom.xml:

<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>

Now update the Maven project (right-click > Maven > Update Project).
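
The listings in this article use the diamond operator (new ArrayList<>()), which needs a source level of 1.7 or later. As a small illustration, here is a sketch with a hypothetical class name that uses two Java 7 language features; it would not compile at source level 1.6.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class, not part of the demo project.
final class Java7Features {

    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>(); // diamond operator (Java 7)
        // try-with-resources (Java 7): the reader is closed automatically.
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new IllegalStateException("could not read text", e);
        }
        return lines;
    }
}
```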

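Enable CDI

We use CDI for dependency injection. CDI (Contexts and Dependency Injection) is a Java EE 6 specification that enables dependency injection in Java EE 6 projects.

To enable CDI, create a new XML file named beans.xml in the src/main/webapp/WEB-INF folder: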

<beans xmlns="http://java.sun.com/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/beans_1_0.xsd">

</beans>

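Searching Twitter for Keywords

We create a new class, TwitterSearch, that uses the Twitter4J API to search Twitter for a keyword. The API needs the Twitter application's configuration parameters; these are read from environment variables rather than hard-coded.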

import java.util.Collections;
import java.util.List;

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TwitterSearch {

    public List<Status> search(String keyword) {
        ConfigurationBuilder cb = new ConfigurationBuilder();
        cb.setDebugEnabled(true).setOAuthConsumerKey(System.getenv("TWITTER_OAUTH_CONSUMER_KEY"))
                .setOAuthConsumerSecret(System.getenv("TWITTER_OAUTH_CONSUMER_SECRET"))
                .setOAuthAccessToken(System.getenv("TWITTER_OAUTH_ACCESS_TOKEN"))
                .setOAuthAccessTokenSecret(System.getenv("TWITTER_OAUTH_ACCESS_TOKEN_SECRET"));
        TwitterFactory tf = new TwitterFactory(cb.build());
        Twitter twitter = tf.getInstance();
        Query query = new Query(keyword + " -filter:retweets -filter:links -filter:replies -filter:images");
        query.setCount(20);
        query.setLocale("en");
        query.setLang("en");
        try {
            QueryResult queryResult = twitter.search(query);
            return queryResult.getTweets();
        } catch (TwitterException e) {
            // Log the failure and fall through to return an empty list
            e.printStackTrace();
        }
        return Collections.emptyList();

    }


}

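The code above filters the Twitter search results so that they contain no retweets, replies, tweets with links, or tweets with images. The goal is to make sure we get text-only tweets.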
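
To make the role of each search operator explicit, here is a minimal sketch that assembles the same query text outside of Twitter4J. The class TextOnlyQuery and its method are hypothetical and not part of the demo project; the operators are the ones used in the listing above.

```java
// Hypothetical helper, not part of the demo project. It only assembles the
// search text and makes no call to Twitter.
final class TextOnlyQuery {

    private static final String[] EXCLUDED = {
        "retweets", // tweets that repeat someone else's tweet
        "links",    // tweets that contain a URL
        "replies",  // tweets written in reply to another tweet
        "images"    // tweets with an attached picture
    };

    static String forKeyword(String keyword) {
        if (keyword == null || keyword.trim().isEmpty()) {
            throw new IllegalArgumentException("keyword must not be empty");
        }
        StringBuilder text = new StringBuilder(keyword.trim());
        for (String kind : EXCLUDED) {
            text.append(" -filter:").append(kind);
        }
        return text.toString();
    }
}
```

The resulting string could be passed to new Query(...) in place of the inline concatenation.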

SentimentAnalyzer

Next we create a class called SentimentAnalyzer, which runs sentiment analysis on a single tweet.

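import java.util.Properties;

// Package names as in CoreNLP 3.3.0, the version declared in pom.xml.
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;

public class SentimentAnalyzer {

    public TweetWithSentiment findSentiment(String line) {

        // The sentiment annotator needs tokenized, sentence-split, parsed input.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        int mainSentiment = 0;
        if (line != null && line.length() > 0) {
            int longest = 0;
            Annotation annotation = pipeline.process(line);
            // The sentiment of the longest sentence is taken as the tweet's sentiment.
            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                Tree tree = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);
                int sentiment = RNNCoreAnnotations.getPredictedClass(tree);
                String partText = sentence.toString();
                if (partText.length() > longest) {
                    mainSentiment = sentiment;
                    longest = partText.length();
                }

            }
        }
        // Classes run from 0 (very negative) to 4 (very positive); neutral (2)
        // and out-of-range values are dropped.
        if (mainSentiment == 2 || mainSentiment > 4 || mainSentiment < 0) {
            return null;
        }
        // toCss(...) is a helper whose body is not shown in this article.
        TweetWithSentiment tweetWithSentiment = new TweetWithSentiment(line, toCss(mainSentiment));
        return tweetWithSentiment;

    }
}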
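
The selection rule in findSentiment can be tried without loading any CoreNLP models. The following is a minimal, library-free sketch under stated assumptions: the class LongestSentenceRule and its methods are hypothetical, the per-sentence classes are supplied by the caller instead of being predicted, and the label strings stand in for whatever toCss returns in the real project.

```java
import java.util.List;

// Hypothetical sketch of the "longest sentence wins" rule; it does not call CoreNLP.
final class LongestSentenceRule {

    // Labels for the five classes of the CoreNLP sentiment model (0 to 4).
    private static final String[] LABELS = {
        "very negative", "negative", "neutral", "positive", "very positive"
    };

    // sentences.get(i) is assumed to have predicted class classes.get(i).
    static int mainSentiment(List<String> sentences, List<Integer> classes) {
        int mainSentiment = 0; // same starting value as in findSentiment
        int longest = 0;
        for (int i = 0; i < sentences.size(); i++) {
            int length = sentences.get(i).length();
            if (length > longest) {
                mainSentiment = classes.get(i);
                longest = length;
            }
        }
        return mainSentiment;
    }

    // Returns null for neutral (2) and out-of-range values, the same cases
    // in which findSentiment returns null.
    static String label(int sentimentClass) {
        if (sentimentClass == 2 || sentimentClass > 4 || sentimentClass < 0) {
            return null;
        }
        return LABELS[sentimentClass];
    }
}
```

One consequence of the starting value is worth noting: because mainSentiment begins at 0, an empty input is reported as class 0 rather than being dropped.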

Copy the englishPCFG.ser.gz and sentiment.ser.gz models into the src/main/resources/edu/stanford/nlp/models/lexparser and src/main/resources/edu/stanford/nlp/models/sentiment folders, respectively.

Creating SentimentsResource

Finally, we create the JAX-RS resource class.

public class SentimentsResource {

    @Inject
    private SentimentAnalyzer sentimentAnalyzer;

    @Inject
    private TwitterSearch twitterSearch;

    @GET
    @Produces(value = MediaType.APPLICATION_JSON)
    public List<Result> sentiments(@QueryParam("searchKeywords") String searchKeywords) {
        List<Result> results = new ArrayList<>();
        if (searchKeywords == null || searchKeywords.length() == 0) {
            return results;
        }

        Set<String> keywords = new HashSet<>();
        for (String keyword : searchKeywords.split(",")) {
            keywords.add(keyword.trim().toLowerCase());
        }
        if (keywords.size() > 3) {
            keywords = new HashSet<>(new ArrayList<>(keywords).subList(0, 3));
        }
        for (String keyword : keywords) {
            List<Status> statuses = twitterSearch.search(keyword);
            System.out.println("Found statuses ... " + statuses.size());
            List<TweetWithSentiment> sentiments = new ArrayList<>();
            for (Status status : statuses) {
                TweetWithSentiment tweetWithSentiment = sentimentAnalyzer.findSentiment(status.getText());
                if (tweetWithSentiment != null) {
                    sentiments.add(tweetWithSentiment);
                }
            }

            Result result = new Result(keyword, sentiments);
            results.add(result);
        }
        return results;
    }
}

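The code above does the following:

It checks that the search keywords are neither null nor empty, then splits them on commas into a set; at most three search terms are considered.
For each search term, it finds the matching tweets and runs sentiment analysis on them.
Finally, it returns the list of results to the user.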
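
The keyword handling in the first step can be sketched on its own. This is a hypothetical helper, not the author's code, and it differs from the listing in two labeled ways: it uses a LinkedHashSet so that the first three distinct keywords are kept in input order (with a plain HashSet, which three survive is not defined), and it skips blank entries such as the one produced by "a,,b".

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the demo project.
final class KeywordParser {

    static final int MAX_KEYWORDS = 3;

    static List<String> parse(String searchKeywords) {
        List<String> result = new ArrayList<>();
        if (searchKeywords == null || searchKeywords.length() == 0) {
            return result;
        }
        // LinkedHashSet removes duplicates while preserving input order.
        Set<String> unique = new LinkedHashSet<>();
        for (String keyword : searchKeywords.split(",")) {
            String normalized = keyword.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty()) {
                unique.add(normalized);
            }
        }
        // Keep at most MAX_KEYWORDS entries.
        for (String keyword : unique) {
            if (result.size() == MAX_KEYWORDS) {
                break;
            }
            result.add(keyword);
        }
        return result;
    }
}
```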

That's all for today. Feedback is welcome.

Original article: Day 20: Stanford CoreNLP--Performing Sentiment Analysis of Twitter using Java
Translated and compiled by SegmentFault