Because I recently needed to do some data analysis, I wrote a crawler that scrapes Sina Weibo: the comments under a post, the commenters' profile information, repost activity, and the likes those reposts received.
My rule for scraping Sina Weibo: always scrape the mobile site, never the PC site. The mobile endpoints are far simpler to analyze and understand, without all the extra parameters the PC site makes you deal with. The mobile comment URL has this format:
'https://m.weibo.cn/api/comments/show?id={}&page={}'.format(weibo_id, i)  # weibo_id: the id of the post being scraped
The post id is easy to find from the PC-site URL.
Here we take the lottery post Wang Sicong published after IG won the championship as our example: https://weibo.com/1826792401/H1rMeFWa2?from=page_1003061826792401_profile&wvr=6&mod=weibotime&type=comment#_rnd1543159215181
Here weibo_id = 'H1rMeFWa2'. Substitute this id into the URL format above and open the result in a browser to see the returned data clearly.
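As a quick sketch of that substitution (the regex and variable names here are my own illustration, not from the original post), you can pull the id out of the PC URL and build the mobile comment URL like this:

```python
import re

# PC-side address of the example post (query string trimmed for readability)
pc_url = 'https://weibo.com/1826792401/H1rMeFWa2?from=page_1003061826792401_profile'

# The post id is the path segment that follows the numeric user id
weibo_id = re.search(r'weibo\.com/\d+/(\w+)', pc_url).group(1)

# Plug it into the mobile comments endpoint, page 1
comment_url = 'https://m.weibo.cn/api/comments/show?id={}&page={}'.format(weibo_id, 1)
print(comment_url)
```

Opening `comment_url` in a browser should show the JSON response described above.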
I won't include screenshots or walk through the response analysis in detail here; it is all routine work with dictionaries, lists, and other basic data structures.
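For a flavor of that analysis, here is an offline sketch. The payload below is a hand-made stand-in whose shape mirrors the fields the crawler reads (`text`, `like_counts`, `user`, and so on); the exact shape is my assumption, so check the real JSON in your browser:

```python
import json

# Fabricated response body shaped like the mobile comments API (assumed shape)
raw = json.dumps({
    'data': {
        'data': [
            {'text': 'congrats!', 'like_counts': 3, 'created_at': '11-26',
             'user': {'id': 123456, 'screen_name': 'demo_user'}},
        ]
    }
})

resjson = json.loads(raw)
# Two levels of 'data', then a list of comment dicts
comments = resjson.get('data', {}).get('data', [])
for c in comments:
    print(c.get('user', {}).get('screen_name'), c.get('text'), c.get('like_counts'))
```

Using `.get` with defaults keeps the loop from raising when a field is missing from a particular comment.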
The final results can be written to an Excel file. One thing to watch out for is character encoding: mind the difference between utf-8 and other character sets, and set the encoding appropriately when writing files.
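A minimal illustration of the encoding point, assuming Python 3 where `open` accepts an `encoding` argument; writing and reading with an explicit utf-8 round-trips Chinese text cleanly:

```python
import os
import tempfile

text = '恭喜IG夺冠'  # sample comment text containing Chinese characters

path = os.path.join(tempfile.gettempdir(), 'weibo_demo.txt')
# Pass the encoding explicitly instead of relying on the platform default
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with open(path, 'r', encoding='utf-8') as f:
    restored = f.read()
print(restored == text)  # prints True
```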
The code is attached below for reference.
# -*- coding:utf-8 -*-
__author__ = 'TengYu'
import requests
import xlwt
import re
import json
import time

headers = {
    'User-agent': 'Your-agent',
    'Cookie': 'Your-cookie'
}


# Utility class: strips links, tags and other noise from the scraped comment text
class Tool:
    deleteImg = re.compile('<img.*?>')                       # emoji / image tags
    newLine = re.compile('<tr>|<div>|</tr>|</div>')          # layout tags
    deleteAite = re.compile('//.*?:')                        # "//@user:" repost chains
    deleteAddr = re.compile('<a.*?>.*?</a>|<a href=' + '\'https:')  # links
    deleteTag = re.compile('<.*?>')                          # any remaining tags
    deleteWord = re.compile('回复@|回覆@|回覆|回复')             # "reply to @" prefixes

    @classmethod
    def replace(cls, x):
        x = re.sub(cls.deleteWord, '', x)
        x = re.sub(cls.deleteImg, '', x)
        x = re.sub(cls.deleteAite, '', x)
        x = re.sub(cls.deleteAddr, '', x)
        x = re.sub(cls.newLine, '', x)
        x = re.sub(cls.deleteTag, '', x)
        return x.strip()


# comment class: fetches the comments and the commenters' profile info
class comment(object):
    def get_comment(self):
        count = 0
        i = 0
        File = open('filename', 'w')
        excel = xlwt.Workbook(encoding='utf-8')
        sheet = excel.add_sheet('sheet1')
        sheet.write(0, 0, 'id')
        sheet.write(0, 1, 'sex')
        sheet.write(0, 2, 'name')
        sheet.write(0, 3, 'time')
        sheet.write(0, 4, 'loc')
        sheet.write(0, 5, 'text')
        sheet.write(0, 6, 'likes')
        while count < 400 and i < 101:
            i += 1
            # url of the next page of comments
            url = 'https://m.weibo.cn/api/comments/show?id=H1rMeFWa2&page=' + str(i)
            print(url)
            try:
                response = requests.get(url, headers=headers)
                resjson = json.loads(response.text)
                data = resjson.get('data')
                datanext = data.get('data')
                for j in range(0, len(datanext)):
                    count += 1
                    temp = datanext[j]
                    text = Tool.replace(temp.get('text'))
                    File.write(str(text) + "\n")
                    like_counts = temp.get('like_counts')
                    created_at = temp.get('created_at')
                    user = temp.get('user')
                    screen_name = user.get('screen_name')
                    userid = user.get('id')
                    # url of the commenter's profile info
                    info_url = "https://m.weibo.cn/api/container/getIndex?containerid=230283" + str(userid) + "_-_INFO"
                    r = requests.get(info_url)
                    infojson = json.loads(r.text)
                    infodata = infojson.get('data')
                    cards = infodata.get('cards')
                    sex = ''
                    loc = ''
                    for l in range(0, len(cards)):
                        card_group = cards[l].get('card_group')
                        for m in range(0, len(card_group)):
                            s = card_group[m]
                            if s.get('item_name') == '性别':    # gender
                                sex = s.get('item_content')
                            if s.get('item_name') == '所在地':  # location
                                loc = s.get('item_content')
                    if sex is None:
                        sex = '未知'  # unknown
                    if loc is None:
                        loc = '未知'
                    sheet.write(count, 0, userid)
                    sheet.write(count, 1, str(sex))
                    sheet.write(count, 2, str(screen_name))
                    sheet.write(count, 3, created_at)
                    sheet.write(count, 4, text)
                    sheet.write(count, 5, str(loc))
                    sheet.write(count, 6, like_counts)
                    print("已经抓取" + str(count) + "条数据")  # "scraped N records so far"
                    time.sleep(20)  # throttle requests to avoid getting blocked
            except Exception as e:
                print(e)
        File.close()
        excel.save('filename.xls')


if __name__ == '__main__':
    Comment = comment()
    Comment.get_comment()
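To sanity-check the cleaning regexes without touching the network, you can run Tool.replace on a fabricated comment fragment (the sample HTML below is mine, not real API output; the class is repeated so the snippet runs standalone):

```python
import re

# Same patterns as the Tool class in the crawler above
class Tool:
    deleteImg = re.compile('<img.*?>')
    newLine = re.compile('<tr>|<div>|</tr>|</div>')
    deleteAite = re.compile('//.*?:')
    deleteAddr = re.compile('<a.*?>.*?</a>|<a href=' + '\'https:')
    deleteTag = re.compile('<.*?>')
    deleteWord = re.compile('回复@|回覆@|回覆|回复')

    @classmethod
    def replace(cls, x):
        x = re.sub(cls.deleteWord, '', x)
        x = re.sub(cls.deleteImg, '', x)
        x = re.sub(cls.deleteAite, '', x)
        x = re.sub(cls.deleteAddr, '', x)
        x = re.sub(cls.newLine, '', x)
        x = re.sub(cls.deleteTag, '', x)
        return x.strip()

# Made-up comment fragment: a reply prefix, a user link, and an emoji image tag
sample = '回复@Alice:<a href="https://weibo.com/u/1">@Alice</a> 恭喜<img alt="[笑]" src="emoji.png">'
print(Tool.replace(sample))  # prints: Alice: 恭喜
```

Note that `deleteAite` only fires on `//@user:` repost chains, so the plain `Alice:` prefix survives the cleaning.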
Please respect the original author's work. When reposting, credit the source: https://blog.csdn.net/kr2563