Tweepy 1: Scraping Twitter Data

Date: 2019-11-18

Tags: tweepy1, tweepy, data scraping

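I had long wanted to use a crawler to log in to Twitter and scrape its data. I tried packages such as scrapy and requests, but none of them worked, most likely because I am not yet familiar enough with them.

Today, however, I came across tweepy, a package built specifically for working with the Twitter API from Python. I started with the first example in its tutorial and made a few small modifications of my own.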

The code is as follows:

Scraping Twitter data with Tweepy (1)
   
import re  
import tweepy  
  
auth = tweepy.OAuthHandler("xxxxx",  
                           "xxxxx")  
auth.set_access_token("xxxxx",  
                      "xxxxx")  
  
api = tweepy.API(auth)  
  
  
highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')  
public_tweets = api.home_timeline()  
num = 0  
for tweet in public_tweets:  
    print num  
    num += 1  
    text_noem = highpoints.sub('--emoji--', tweet.text)  
    text_noem = text_noem.encode('utf8')
    print text_noem

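Code explanation:

Lines 3-4: import the tweepy and re modules. Such simple code needs re because the tweets I extracted contained emoji, and emoji Unicode characters cannot be encoded as gbk, so a regular expression is used to replace every emoji.

Lines 6-9: set the API key and access token. These are obtained by registering and then creating a new application at apps.twitter.com.

Line 11: build the API object from auth. This object is what actually issues the calls and returns the responses.

Line 14: define the regular expression for emoji, used to filter out every emoji. This follows the Stack Overflow post listed in the references below.

Line 15: fetch the tweets on the user's home timeline.

Line 16: set up a counter variable.

Line 17: iterate over all the tweets.

Inside the loop:

Lines 18-22: print the index and the tweet text, replacing every emoji with '--emoji--' and encoding the Unicode text as utf8 so that it can be printed without errors.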
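
The listing above appears to target Python 2 (it uses the print statement), where on narrow builds an emoji is stored as a surrogate pair, which is what the pattern on line 14 matches. As a minimal sketch of the same replacement under Python 3, where a string is a sequence of code points and an emoji is a single character, the pattern can match the code point range instead. The names NON_BMP and replace_emoji are made up for this sketch, and treating "outside the Basic Multilingual Plane" as "emoji" is a simplifying assumption.

```python
import re

# Matches any character outside the Basic Multilingual Plane (U+10000 and up).
# Assumption of this sketch: most emoji live there. It is not a full emoji
# definition; symbols inside the BMP (such as U+2764) are left untouched.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def replace_emoji(text, placeholder='--emoji--'):
    """Return text with every non-BMP character replaced by placeholder."""
    return NON_BMP.sub(placeholder, text)


sample = 'good morning \U0001F600 world'
cleaned = replace_emoji(sample)
print(cleaned)
```

With the emoji replaced, the remaining text can be encoded as gbk only if every other character is also representable in gbk, so this handles the emoji case described above and nothing more.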

The key point when scraping Twitter data is that Twitter requires every request to be authenticated through OAuth, and tweepy makes this authentication very convenient.
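
Since the four OAuth values are secrets, one option is to keep them out of the source file. The following is a minimal sketch, assuming the values are stored in environment variables. The variable names, the Credentials tuple and the load_credentials helper are all hypothetical and not part of tweepy.

```python
import os
from collections import namedtuple

# Hypothetical environment variable names chosen for this sketch.
ENV_NAMES = (
    'TWITTER_CONSUMER_KEY',
    'TWITTER_CONSUMER_SECRET',
    'TWITTER_ACCESS_TOKEN',
    'TWITTER_ACCESS_TOKEN_SECRET',
)

Credentials = namedtuple(
    'Credentials',
    ['consumer_key', 'consumer_secret', 'access_token', 'access_token_secret'],
)


def load_credentials(env=None):
    """Read the four OAuth values from env (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Treat unset and empty values alike as missing.
    missing = [name for name in ENV_NAMES if not env.get(name)]
    if missing:
        raise KeyError('missing credentials: ' + ', '.join(missing))
    return Credentials(*(env[name] for name in ENV_NAMES))


# Demonstration with a plain dict standing in for the real environment.
demo = load_credentials({
    'TWITTER_CONSUMER_KEY': 'ck',
    'TWITTER_CONSUMER_SECRET': 'cs',
    'TWITTER_ACCESS_TOKEN': 'at',
    'TWITTER_ACCESS_TOKEN_SECRET': 'ats',
})
print(demo.consumer_key)
```

The result would then feed the two calls from the listing: tweepy.OAuthHandler(creds.consumer_key, creds.consumer_secret), followed by auth.set_access_token(creds.access_token, creds.access_token_secret).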

References:

http://stackoverflow.com/questions/13729638/how-can-i-filter-emoji-characters-from-my-input-so-i-can-save-in-mysql-5-5

http://tweepy.readthedocs.io/en/v3.5.0/getting_started.html

Tweepy 3.5.0 Doc (1) Getting started

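Getting started

Introduction

If this is your first time with Tweepy, start here. The goal of this tutorial is to give you the information you need to learn Tweepy, so that you can use it comfortably once you have worked through it. We mainly cover the important basics here and do not go into much detail.

Hello Tweepy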

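import tweepy

# consumer_key, consumer_secret, access_token and access_token_secret
# must hold the values issued for your own Twitter application.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Fetch the tweets on the home timeline and print the text of each one.
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)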

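This example downloads the tweets on your Twitter home timeline and prints the text of each one to the console. Twitter requires all requests to be authorized (authenticated) with the OAuth protocol. The Authentication Tutorial covers authorization in detail.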

API

The API class provides access to all of Twitter's REST API methods. Each method accepts different parameters, but all of them return a response. For more, see the API Reference.

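Models

When we call an API method, in most cases we get back an instance of a Tweepy model class. It contains the data returned from Twitter, which we can then use in our application. For example, the following line returns a User model: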

# Get the User object for twitter...
user = api.get_user('twitter')

Models contain the data along with some helpful methods:

print(user.screen_name)
print(user.followers_count)
# friends() returns the accounts this user follows.
for friend in user.friends():
    print(friend.screen_name)

For more, see the Models Reference.
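
Because a model exposes its data as plain attributes, code that consumes it only needs those attribute names. Here is a minimal sketch of copying the two fields used above into a dict, for example before storing them. The summarize_user helper is hypothetical, and a SimpleNamespace with made-up values stands in for a real User model so that the sketch runs without network access.

```python
from types import SimpleNamespace


def summarize_user(user):
    """Collect screen_name and followers_count into a plain dict."""
    return {
        'screen_name': user.screen_name,
        'followers_count': user.followers_count,
    }


# Stand-in object with made-up values. In real use this would be the
# object returned by api.get_user(...).
fake_user = SimpleNamespace(screen_name='example', followers_count=42)
summary = summarize_user(fake_user)
print(summary)
```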
