I stumbled onto Tieba's new "drift bottle" feature by accident, and flipping through a few bottles I found a surprising number of pretty-girl photos. With nothing better to do, I figured I'd write a crawler to grab all the images.

Here is the Tieba drift bottle address:

http://tieba.baidu.com/bottle...
First, fire up the packet-capture tool Fiddler, then open the drift bottle homepage and load a few pages. After filtering out image traffic and anything that isn't an HTTP 200 response in Fiddler, it's clear that each page of data is fetched in a very regular pattern, which makes scraping easy. The URL for fetching one page looks like this:
http://tieba.baidu.com/bottle/bottles?page_number=1&page_size=30
The parameters are easy to read: page_number is the current page number, and page_size is the number of bottles on that page.
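As a quick sanity check (a minimal sketch; the endpoint and parameters are just the ones captured above), you can fetch a single page like this:

import urllib2

# Dump the raw response for page 1; page_size=30 matches what the site itself requests.
url = "http://tieba.baidu.com/bottle/bottles?page_number=1&page_size=30"
print urllib2.urlopen(url).read()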
The response is JSON, with roughly this structure:
{ "error_code": 0, "error_msg": "success", "data": { "has_more": 1, "bottles": [ { "thread_id": "5057974188", "title": "美得不可一世", "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg" }, { "thread_id": "5057974188", "title": "美得不可一世", "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg" }, ... } }
The content is self-explanatory: the entries under bottles are exactly what we want (thread_id is the bottle's id, title is the girl's caption, and img_url is the real address of the photo), and iterating over bottles gets you every drift bottle on the current page. (Actually, what you get here is only the cover image; opening the bottle itself holds a surprise. I'm lazy, so I didn't bother writing that part at first, but I did analyze the inner data. The URL is: http://tieba.baidu.com/bottle/photopbPage?thread_id=<the bottle's thread_id>.) There is one more field, has_more, which I'm guessing indicates whether another page exists.
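To make the structure concrete, here is a minimal sketch of pulling those fields out of the sample above (the title is elided; the standard json module is enough for this well-formed sample, though the real crawler below uses demjson):

import json

sample = """{"error_code": 0, "error_msg": "success",
             "data": {"has_more": 1, "bottles": [
               {"thread_id": "5057974188",
                "title": "...",
                "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"}]}}"""

resp = json.loads(sample)
if resp["error_code"] == 0:
    for bottle in resp["data"]["bottles"]:
        print bottle["thread_id"], bottle["img_url"]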
At this point the collection strategy is settled: start from page one and keep fetching page after page until has_more is no longer 1.
I used python2.7 + urllib2 + demjson for the job. urllib2 ships with Python 2.7; demjson you have to install yourself. (Normally Python's built-in json library can handle JSON parsing, but a lot of sites these days serve JSON that isn't strictly valid, and the built-in library is helpless against it.)
To install demjson (on Windows, drop the sudo):
sudo pip install demjson
or
sudo easy_install demjson
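To see why demjson is worth the install, here is a contrived example (not actual Tieba data): single-quoted keys are invalid JSON and trip up the stdlib parser, but demjson's lenient mode takes them in stride:

import json
import demjson

sloppy = "{'has_more': 1, 'bottles': []}"  # single quotes: not valid JSON

try:
    json.loads(sloppy)
except ValueError as e:
    print "stdlib json chokes:", e

print demjson.decode(sloppy)  # parses fine in demjson's lenient mode

With that sorted, here is the generator that walks the pages bottle by bottle: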
import urllib2
import demjson
import time

def bottlegen():
    page_number = 1
    while True:
        try:
            data = urllib2.urlopen(
                "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30" % page_number).read()
            json = demjson.decode(data)
            if json["error_code"] == 0:
                data = json["data"]
                has_more = data["has_more"]
                bottles = data["bottles"]
                for bottle in bottles:
                    thread_id = bottle["thread_id"]
                    title = bottle["title"]
                    img_url = bottle["img_url"]
                    yield (thread_id, title, img_url)
                if has_more != 1:
                    break
                page_number += 1
        except Exception:
            # On a network or parse error, wait a bit and retry the same page.
            print("bottlegen exception")
            time.sleep(5)
A Python generator is used here so that the parsed results stream out continuously.
import os

for thread_id, title, img_url in bottlegen():
    filename = os.path.basename(img_url)
    # Assumes the tieba/bottles directory already exists;
    # the full script below creates it first.
    pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
    print filename
    with open(pathname, "wb") as f:
        f.write(urllib2.urlopen(img_url).read())
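One nice property of driving everything through a generator is that pages are fetched lazily, so the consumer decides how much work happens. A small sketch (reusing bottlegen from above) that samples just the first few bottles without crawling every page:

from itertools import islice

# Only the first page is ever requested, because bottlegen yields lazily.
for thread_id, title, img_url in islice(bottlegen(), 5):
    print thread_id, img_url

Putting it all together, including the per-bottle photo pages: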
# -*- encoding: utf-8 -*-
import urllib2
import demjson
import time
import re
import os


def bottlegen():
    # Yield (thread_id, title, img_url) for every bottle, page by page.
    page_number = 1
    while True:
        try:
            data = urllib2.urlopen(
                "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30" % page_number).read()
            json = demjson.decode(data)
            if json["error_code"] == 0:
                data = json["data"]
                has_more = data["has_more"]
                bottles = data["bottles"]
                for bottle in bottles:
                    thread_id = bottle["thread_id"]
                    title = bottle["title"]
                    img_url = bottle["img_url"]
                    yield (thread_id, title, img_url)
                if has_more != 1:
                    break
                page_number += 1
        except Exception:
            # On a network or parse error, wait a bit and retry the same page.
            print("bottlegen exception")
            time.sleep(5)


def imggen(thread_id):
    # Yield (thread_id, text, img_url) for every photo inside one bottle.
    try:
        data = urllib2.urlopen(
            "http://tieba.baidu.com/bottle/photopbPage?thread_id=%s" % thread_id).read()
        # The photo data is embedded in a JS call; grab its argument list.
        match = re.search(
            r"\_\.Module\.use\(\'encourage\/widget\/bottle\',(.*?),function\(\)\{\}\);", data)
        data = match.group(1)
        json = demjson.decode(data)
        # The second argument is itself a JSON string, so decode again.
        json = demjson.decode(json[1].replace("\r\n", ""))
        for i in json:
            thread_id = i["thread_id"]
            text = i["text"]
            img_url = i["img_url"]
            yield (thread_id, text, img_url)
    except Exception:
        print("imggen exception")


try:
    os.makedirs("tieba/bottles")
except OSError:
    pass  # directory already exists

for thread_id, _, _ in bottlegen():
    for _, title, img_url in imggen(thread_id):
        filename = os.path.basename(img_url)
        pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
        print filename
        with open(pathname, "wb") as f:
            f.write(urllib2.urlopen(img_url).read())
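The trickiest part is imggen: the photo page isn't JSON but HTML, with the data buried in a _.Module.use(...) call, and the second element of that argument list is itself a JSON string, which is why it gets decoded twice. Here is a toy illustration of that two-stage decode with a made-up payload (hypothetical data, just to show the mechanics):

import re
import demjson

# Hypothetical page fragment, mimicking the structure of the real photo page.
page = ("_.Module.use('encourage/widget/bottle',"
        "['cfg', '[{\"thread_id\": \"123\", \"text\": \"hi\", "
        "\"img_url\": \"http://example.com/a.jpg\"}]'],"
        "function(){});")

match = re.search(
    r"\_\.Module\.use\(\'encourage\/widget\/bottle\',(.*?),function\(\)\{\}\);",
    page)
args = demjson.decode(match.group(1))   # JS argument list -> Python list
photos = demjson.decode(args[1])        # second element is itself JSON text
print photos[0]["img_url"]              # -> http://example.com/a.jpg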
When run, it first collects every bottle on each page, then fetches all the photos inside each specific bottle and writes them out to tieba/bottles/xxxxx.jpg. (I was too lazy to add proper error handling, bear with me ^_^,,,)
Conclusion:,,, it's all a lie, only the front page has a few good-looking ones - -,,, dammit,,,