API Crawlers: Twitter in Practice

Date: 2020-07-07

This article works through a practical example to show how to crawl Twitter data through its API.

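1. Create an app

Go to https://apps.twitter.com/ and create your own app. Only with an app can you access Twitter's API and fetch data. The simplest possible app is enough: fill in the fields however you like, and no further verification is required. All we need are the app's Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret. Because each app is limited in how many requests it can make, you can register many apps to raise the total.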

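2. Decide which APIs to use

Twitter offers several types of API, of which the REST API and the Streaming API are the most commonly used. The former is the familiar request/response kind; the latter lets you track and monitor a user or a topic.

The REST API contains many endpoints. The ones worth crawling are:

GET statuses/user_timeline: returns the tweets posted by a user. Note that on Twitter a reply also counts as a tweet.
GET friends/ids: returns a user's followees.
GET followers/ids: returns a user's followers.
GET users/show: returns a user's profile information.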
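
To make the four endpoints concrete, here is a minimal sketch of how their request URLs are formed under the v1.1 REST API. The class RestEndpoints and its url helper are hypothetical names introduced for illustration; a client library builds these URLs internally, and a real request also needs the OAuth headers covered later in this article.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds v1.1 REST URLs for the four endpoints listed above.
final class RestEndpoints {
    static final String BASE = "https://api.twitter.com/1.1/";

    // Joins the endpoint path and query parameters into a full request URL.
    // Assumption: parameter values are already URL-safe (no encoding is done here).
    static String url(String endpoint, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        params.forEach((k, v) -> query.add(k + "=" + v));
        return BASE + endpoint + ".json" + (params.isEmpty() ? "" : "?" + query);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("user_id", "12345"); // example id, not a real account
        params.put("count", "200");
        System.out.println(url("statuses/user_timeline", params));
        System.out.println(url("followers/ids", Map.of("user_id", "12345")));
    }
}
```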

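3. Official libraries

Download a Twitter client library. Frankly, how easy an API crawler is to write depends entirely on how capable the library is. Twitter lists libraries for many languages; this article uses the Java one (the class names that appear in the code in this article, such as ConfigurationBuilder and TwitterFactory, are those of Twitter4J).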

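4. Authentication and authorization

Every API call requires authorization, that is, OAuth. The general flow is: call the authorization API with the app's id and key plus the user's username and password as parameters; it returns a token (a string), and authorization is then complete. From then on you only need to pass this token when calling other APIs.

Of course, the authorization flow differs from site to site. Some are more tedious: Renren, for example, first redirects to a callback page and returns the token only after the user has logged in. Twitter's flow is not simple either (it takes several HTTP requests), but fortunately the library already implements it for us.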

For example, Twitter's OAuth 1.0a user authorization, where the four parameters to set can all be found in the app management page:

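import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.auth.OAuthAuthorization;
import twitter4j.conf.ConfigurationBuilder;

// The four credentials below are copied from the app management page.
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setOAuthAccessToken(accessToken);
cb.setOAuthAccessTokenSecret(accessTokenSecret);
cb.setOAuthConsumerKey(consumerKey);
cb.setOAuthConsumerSecret(consumerSecret);
OAuthAuthorization auth = new OAuthAuthorization(cb.build());
Twitter twitter = new TwitterFactory().getInstance(auth);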

Twitter also offers an option that needs no user authorization (only app authorization). For some APIs it allows more requests than user authorization does:

ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setApplicationOnlyAuthEnabled(true);
Twitter twitter = new TwitterFactory(cb.build()).getInstance();
twitter.setOAuthConsumer(consumerKey, consumerSecret);
try {
    twitter.getOAuth2Token();
} catch (TwitterException e) {
    e.printStackTrace();
}
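
For context, the application-only flow behind getOAuth2Token() is, as far as Twitter's documentation describes it, a POST to https://api.twitter.com/oauth2/token carrying an HTTP Basic credential built from the consumer key and secret. The sketch below only shows how that credential is formed. BearerCredential and the sample key and secret are made up for illustration, and no network request is made.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper illustrating the Basic credential used by application-only auth.
final class BearerCredential {
    // URL-encode key and secret, join them with ':', then Base64-encode the pair.
    // Note: URLEncoder does form encoding, which is equivalent for plain alphanumeric keys.
    static String basicHeader(String consumerKey, String consumerSecret) {
        String pair = URLEncoder.encode(consumerKey, StandardCharsets.UTF_8)
                + ":" + URLEncoder.encode(consumerSecret, StandardCharsets.UTF_8);
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Placeholder values, not real credentials.
        System.out.println(basicHeader("key", "secret"));
    }
}
```

The value returned here would go into the Authorization header of the token request; the bearer token in the response is what the library then attaches to later API calls.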

5. Calling the API

Once authorized, we can really start crawling data.

REST API

Crawling a user's followers: the getFollowersIDs method returns at most 5000 followers per call, and cursor marks where to start from:

IDs iDs = twitter.getFollowersIDs(uid, cursor);
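
Cursoring works as a loop: start with cursor -1, request a page, then continue from the next cursor returned with that page until it is 0. The sketch below shows only that loop. IdPage, PageFetcher, and CursorLoop are hypothetical stand-ins so the example runs without network access; with the library, the fetch step would be the getFollowersIDs(uid, cursor) call above, reading the ids and next cursor from the returned IDs object (getIDs() and getNextCursor() in Twitter4J).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one page of results: the ids plus the cursor of the next page.
final class IdPage {
    final long[] ids;
    final long nextCursor;

    IdPage(long[] ids, long nextCursor) {
        this.ids = ids;
        this.nextCursor = nextCursor;
    }
}

// Hypothetical stand-in for the API call, e.g. twitter.getFollowersIDs(uid, cursor).
interface PageFetcher {
    IdPage fetch(long cursor);
}

final class CursorLoop {
    // Follows next cursors from -1 (first page) until the API reports 0 (no more pages).
    static List<Long> collectAll(PageFetcher fetcher) {
        List<Long> all = new ArrayList<>();
        long cursor = -1;
        do {
            IdPage page = fetcher.fetch(cursor);
            for (long id : page.ids) {
                all.add(id);
            }
            cursor = page.nextCursor;
        } while (cursor != 0);
        return all;
    }

    public static void main(String[] args) {
        // Fake two-page result set, used only for demonstration.
        PageFetcher fake = c -> c == -1
                ? new IdPage(new long[] {1, 2, 3}, 42)
                : new IdPage(new long[] {4, 5}, 0);
        System.out.println(collectAll(fake));
    }
}
```

In a real crawler each fetch counts against the rate limit, so the loop would also need to pause or switch apps when the quota runs out.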

Crawling a user's tweets:

ResponseList<Status> status = twitter.getUserTimeline(uid, page);

Streaming API

Monitoring all of a user's activity. The UserStreamListener is too long, so only part of it is shown:

TwitterStream twitterStream;
twitterStream = new TwitterStreamFactory(cb.build()).getInstance();
twitterStream.addListener(listener);
twitterStream.user();

private static final UserStreamListener listener = new UserStreamListener() {
    @Override
    public void onStatus(Status status) {
        System.out.println("onStatus @" + status.getUser().getScreenName() + " - " + status.getText() + status.getCreatedAt());
    }

    @Override
    public void onDeletionNotice(StatusDeletionNotice statusDeletionNotice) {
        System.out.println("Got a status deletion notice id:" + statusDeletionNotice.getStatusId());
    }

    @Override
    public void onDeletionNotice(long directMessageId, long userId) {
        System.out.println("Got a direct message deletion notice id:" + directMessageId);
    }

    @Override
    public void onTrackLimitationNotice(int numberOfLimitedStatuses) {
        System.out.println("Got a track limitation notice:" + numberOfLimitedStatuses);
    }

    @Override
    public void onScrubGeo(long userId, long upToStatusId) {
        System.out.println("Got scrub_geo event userId:" + userId + " upToStatusId:" + upToStatusId);
    }
    ...

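6. How to speed things up

Every API has rate limits, and on Twitter many of them are counted per 15-minute window. To reach a reasonable crawl speed I registered 50 apps; as long as the PC can keep up, that gives me 50 times the throughput of a single app.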
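
One way to picture the multi-app approach is a small pool that hands out whichever app still has quota left in its current 15-minute window. This is a sketch under assumptions: AppPool, its method names, and the per-window quota passed to it are hypothetical, the clock is passed in explicitly to keep the example deterministic, and real limits differ per endpoint.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pool that tracks the remaining calls of each app within a fixed window.
final class AppPool {
    static final long WINDOW_MILLIS = 15 * 60 * 1000L;

    private static final class Slot {
        final String appName;
        int remaining;
        long windowStart;

        Slot(String appName, int remaining, long windowStart) {
            this.appName = appName;
            this.remaining = remaining;
            this.windowStart = windowStart;
        }
    }

    private final List<Slot> slots = new ArrayList<>();
    private final int quotaPerWindow;

    AppPool(List<String> appNames, int quotaPerWindow, long now) {
        this.quotaPerWindow = quotaPerWindow;
        for (String name : appNames) {
            slots.add(new Slot(name, quotaPerWindow, now));
        }
    }

    // Returns the name of an app with quota left at time `now` (milliseconds),
    // or null if every app is exhausted for its current window.
    synchronized String acquire(long now) {
        for (Slot s : slots) {
            if (now - s.windowStart >= WINDOW_MILLIS) {
                // The window has elapsed, so this app's quota is available again.
                s.windowStart = now;
                s.remaining = quotaPerWindow;
            }
            if (s.remaining > 0) {
                s.remaining--;
                return s.appName;
            }
        }
        return null;
    }
}
```

A crawler thread would call acquire(System.currentTimeMillis()) before each request, use the credentials of the returned app, and sleep when it gets null.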

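So should a single thread cycle through the 50 apps, or should 50 threads run at once? Fifty threads is clearly not workable: beyond roughly 20 threads the time spent on thread synchronization usually becomes significant, and since this is a crawler, 50 threads writing to the database at the same time would slow things down badly. What about a single thread? Each app needs some time to use up its quota, and on a poor network that can take several minutes, so one 15-minute window is clearly not enough for every app to get its turn at the API, and quota goes to waste.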

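So I chose a thread pool. For IO-bound tasks, the thread count is usually set to twice the number of CPU cores. I also set up two queues, one that the threads read data from and one that they write data to. n threads run the crawler at the same time, and one more thread is set aside to maintain those two queues. The framework is as follows: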
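
The original diagram is not reproduced here. As a rough stand-in, the sketch below is one possible reading of the arrangement described in the paragraph above, not the author's actual framework: a pool of 2 x cores worker threads takes user ids from an input queue and puts results on an output queue, and one extra thread drains the output queue (where database writes would happen). All names are hypothetical, fetch is a placeholder for the real API call, and the input queue is simply pre-filled rather than refilled while running.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class CrawlerPool {
    private static final long STOP = Long.MIN_VALUE;  // marker telling a worker to exit
    private static final String DONE = "\u0000DONE"; // marker telling the writer to exit

    // Placeholder for the real API call (assumption: one request per user id).
    static String fetch(long uid) {
        return "user-" + uid;
    }

    static List<String> crawl(List<Long> uids) throws InterruptedException {
        int workers = Runtime.getRuntime().availableProcessors() * 2;
        BlockingQueue<Long> input = new LinkedBlockingQueue<>(uids);
        BlockingQueue<String> output = new LinkedBlockingQueue<>();
        List<String> stored = new ArrayList<>(); // stand-in for the database

        // Single writer thread: the only thread that touches `stored`.
        Thread writer = new Thread(() -> {
            try {
                for (String r = output.take(); !r.equals(DONE); r = output.take()) {
                    stored.add(r);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        // One stop marker per worker, queued behind all the real work.
        for (int i = 0; i < workers; i++) {
            input.add(STOP);
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    for (long uid = input.take(); uid != STOP; uid = input.take()) {
                        output.put(fetch(uid));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        output.put(DONE); // all workers are finished, so the writer may stop
        writer.join();
        return stored;
    }
}
```

Funnelling all writes through one thread is what avoids the contention of many threads writing to the database at once; results arrive in whatever order the workers finish.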

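That's it. At this point you should be able to write a Twitter API crawler. What remains is reading the tedious documentation of each API and fighting all kinds of bugs ╥﹏╥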