Scraping Girls' Photos from Baidu Tieba's Drift Bottle with Python

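I happened to notice that Tieba has launched a drift bottle feature too. Flipping through it, I found it was full of photos of girls, so with nothing better to do I decided to write a crawler to download all of the images.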

The Tieba drift bottle lives here:
http://tieba.baidu.com/bottle...

1. Analysis

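First, open the packet-capture tool Fiddler, then open the drift bottle home page and load a few pages. After filtering out image requests and any responses whose status is not HTTP 200 in Fiddler, it becomes clear that every page of data is fetched in the same regular way, which makes scraping easy. The URL for fetching one page of content is: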

http://tieba.baidu.com/bottle...

The parameters are self-explanatory: page_number is the current page number, and page_size is the number of bottles on that page.
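To make the two parameters concrete, here is a minimal sketch that builds the URL for a given page. The endpoint path is the one used in the code later in this post; the helper name build_page_url is made up for illustration, and the sketch assumes Python 3's urllib.parse (on Python 2.7, urlencode lives in urllib instead).

```python
from urllib.parse import urlencode

# Endpoint as used in the code later in this post.
BASE_URL = "http://tieba.baidu.com/bottle/bottles"

def build_page_url(page_number, page_size=30):
    """Hypothetical helper: return the URL for one page of bottles."""
    # A list of pairs keeps the parameters in a fixed order.
    query = urlencode([("page_number", page_number), ("page_size", page_size)])
    return "%s?%s" % (BASE_URL, query)

print(build_page_url(1))
print(build_page_url(2, page_size=10))
```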

The response is JSON, structured roughly as follows:

{
    "error_code": 0,
    "error_msg": "success",
    "data": {
        "has_more": 1,
        "bottles": [
            {
                "thread_id": "5057974188",
                "title": "美得不可一世",
                "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
            },
            {
                "thread_id": "5057974188",
                "title": "美得不可一世",
                "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
            },
            ...
        ]
    }
}

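The content is straightforward: the data in bottles is what we want (thread_id is the bottle's ID, title is the text the girl posted, and img_url is the real address of the photo), so iterating over bottles yields every bottle on the current page. (What we get here is only the cover image; opening an individual bottle reveals more. I was too lazy to write that part up, but I did analyze the inner data as well. The URL is: http://tieba.baidu.com/bottle...bottle thread_id>)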
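As a quick illustration of walking through bottles, here is a minimal sketch that parses a well-formed sample shaped like the response above, using the standard json module. The helper name extract_bottles is an assumption of mine, and the sample holds just one bottle copied from the listing.

```python
import json

# A well-formed sample shaped like the response shown above (values copied from it).
sample = """
{
    "error_code": 0,
    "error_msg": "success",
    "data": {
        "has_more": 1,
        "bottles": [
            {
                "thread_id": "5057974188",
                "title": "美得不可一世",
                "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
            }
        ]
    }
}
"""

def extract_bottles(text):
    """Hypothetical helper: return (thread_id, title, img_url) tuples from one page."""
    payload = json.loads(text)
    if payload["error_code"] != 0:
        # The API reported a failure, so there is nothing to extract.
        return []
    return [(b["thread_id"], b["title"], b["img_url"])
            for b in payload["data"]["bottles"]]

for thread_id, title, img_url in extract_bottles(sample):
    print(thread_id, img_url)
```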

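There is one more parameter, has_more, which I guess indicates whether a next page exists.
At this point the scraping approach is settled: start from the first page and keep looping forward until has_more is no longer 1.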
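The loop just described can be sketched without touching the network. In this minimal example, fetch_page is a hypothetical callable standing in for the real request, and the three fake pages are invented purely to show where the loop stops.

```python
def iter_bottles(fetch_page):
    """Yield bottles page by page until has_more is no longer 1.

    fetch_page is a hypothetical callable: page_number -> decoded "data" dict.
    """
    page_number = 1
    while True:
        data = fetch_page(page_number)
        for bottle in data["bottles"]:
            yield bottle
        if data["has_more"] != 1:
            break
        page_number += 1

# Stand-in for the network: three fake pages, the last one with has_more == 0.
FAKE_PAGES = {
    1: {"has_more": 1, "bottles": [{"thread_id": "a"}, {"thread_id": "b"}]},
    2: {"has_more": 1, "bottles": [{"thread_id": "c"}]},
    3: {"has_more": 0, "bottles": [{"thread_id": "d"}]},
}

ids = [b["thread_id"] for b in iter_bottles(FAKE_PAGES.__getitem__)]
print(ids)
```

Passing the fetch function in as an argument keeps the paging logic testable on its own; the real crawler below simply inlines the request.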

2. Coding

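This job is done with python2.7 + urllib2 + demjson. urllib2 ships with Python 2.7, while demjson has to be installed separately. (Normally Python's built-in json library is enough for parsing JSON, but many sites now serve JSON that does not follow the standard, and the built-in library cannot handle that.)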
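To show what "non-standard JSON" means in practice, here is a small sketch using only the standard library. The sloppy string is a made-up example (unquoted key, single-quoted string), not something captured from Tieba, and the helper name stdlib_can_parse is hypothetical.

```python
import json

def stdlib_can_parse(text):
    """Return True if the standard json module accepts the text."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

strict = '{"title": "ok"}'
sloppy = "{title: 'ok'}"  # made-up example: unquoted key, single-quoted string

print(stdlib_can_parse(strict))
print(stdlib_can_parse(sloppy))
```

As far as I know, demjson's default non-strict decode is meant to tolerate this kind of JavaScript-style input, which is why it is used here instead.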

Installing demjson (sudo is not needed on Windows):

sudo pip install demjson

or

sudo easy_install demjson

2.1 Fetching one page of content

import time
import urllib2
import demjson

def bottlegen():
    """Yield (thread_id, title, img_url) for every bottle, page after page."""
    page_number = 1
    while True:
        try:
            data = urllib2.urlopen(
                "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30" % page_number).read()
            result = demjson.decode(data)
            if result["error_code"] != 0:
                raise ValueError(result["error_msg"])
        except Exception as e:
            # Fetch, parse, or API error: wait 5 seconds, then retry the same page.
            print("bottlegen exception: %s" % e)
            time.sleep(5)
            continue
        data = result["data"]
        for bottle in data["bottles"]:
            yield (bottle["thread_id"], bottle["title"], bottle["img_url"])
        # has_more is assumed to signal that another page exists.
        if data["has_more"] != 1:
            break
        page_number += 1

A Python generator is used here so that parsed results stream out continuously as they are found.
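One nice side effect of a generator is that the consumer decides how much work gets done. The sketch below uses a stand-in generator (hypothetical, with no network access) together with itertools.islice to take only the first three items.

```python
from itertools import islice

def numbered_bottles():
    """Stand-in generator (hypothetical): pretends to produce bottles forever."""
    n = 0
    while True:
        n += 1
        print("producing bottle %d" % n)
        yield ("thread-%d" % n, "title %d" % n, "http://example.invalid/%d.jpg" % n)

# Only three items are ever produced, because the consumer stops asking.
first_three = list(islice(numbered_bottles(), 3))
print(len(first_three))
```

The same trick should work for a quick trial run of the real crawler, for example islice(bottlegen(), 10).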

2.2 Saving image data from its URL

# Assumes the tieba/bottles directory already exists.
for thread_id, title, img_url in bottlegen():
    filename = os.path.basename(img_url)
    pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
    print(filename)
    # The with block closes the file automatically.
    with open(pathname, "wb") as f:
        f.write(urllib2.urlopen(img_url).read())
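The naming scheme can be checked offline. This is a minimal sketch, assuming Python 3's urllib.parse; build_pathname is a made-up helper, and going through urlparse is my own addition so that a query string, if a URL ever had one, would not end up in the file name.

```python
import os
from urllib.parse import urlparse

def build_pathname(thread_id, img_url, out_dir="tieba/bottles"):
    """Hypothetical helper: local path for an image, named <thread_id>_<file name>."""
    # Use only the path part of the URL so a query string never leaks into the name.
    filename = os.path.basename(urlparse(img_url).path)
    return "%s/%s_%s" % (out_dir, thread_id, filename)

# Image URL copied from the sample response earlier in this post.
url = "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
print(build_pathname("5057974188", url))
```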

2.3 The full code

# -*- encoding: utf-8 -*-
import urllib2
import demjson
import time
import re
import os

def bottlegen():
    """Yield (thread_id, title, img_url) for every bottle, page after page."""
    page_number = 1
    while True:
        try:
            data = urllib2.urlopen(
                "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30" % page_number).read()
            result = demjson.decode(data)
            if result["error_code"] != 0:
                raise ValueError(result["error_msg"])
        except Exception as e:
            # Fetch, parse, or API error: wait 5 seconds, then retry the same page.
            print("bottlegen exception: %s" % e)
            time.sleep(5)
            continue
        data = result["data"]
        for bottle in data["bottles"]:
            yield (bottle["thread_id"], bottle["title"], bottle["img_url"])
        # has_more is assumed to signal that another page exists.
        if data["has_more"] != 1:
            break
        page_number += 1

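def imggen(thread_id):
    """Yield (thread_id, text, img_url) for every photo inside one bottle."""
    try:
        data = urllib2.urlopen(
            "http://tieba.baidu.com/bottle/photopbPage?thread_id=%s" % thread_id).read()
        # The photo list is embedded in the page as an argument to _.Module.use(...).
        match = re.search(r"\_\.Module\.use\(\'encourage\/widget\/bottle\',(.*?),function\(\)\{\}\);", data)
        args = demjson.decode(match.group(1))
        # The second argument is itself a JSON string; strip line breaks, then decode it.
        photos = demjson.decode(args[1].replace("\r\n", ""))
    except Exception as e:
        # Fetch or parse failure (including no regex match): skip this bottle.
        print("imggen exception: %s" % e)
        return
    for photo in photos:
        yield (photo["thread_id"], photo["text"], photo["img_url"])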

# Create the output directory if it does not already exist.
if not os.path.isdir("tieba/bottles"):
    os.makedirs("tieba/bottles")

for thread_id, _, _ in bottlegen():
    for _, title, img_url in imggen(thread_id):
        filename = os.path.basename(img_url)
        pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
        print(filename)
        # The with block closes the file automatically.
        with open(pathname, "wb") as f:
            f.write(urllib2.urlopen(img_url).read())

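When run, the script first gets all the bottles on each page, then gets all the images inside each bottle, and writes them to tieba/bottles/xxxxx.jpg. (I was lazy and skipped robust error handling, sorry ^_^)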
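The extraction step inside imggen can be tried offline too. In the sketch below the regular expression is the same one imggen uses, but the page fragment is entirely made up, shaped only to match what the code expects (a list whose second element is a JSON string); the real page is not shown in this post and may well differ. The standard json module stands in for demjson because this sample is strict JSON.

```python
import json
import re

# Same pattern as in imggen above.
PATTERN = r"\_\.Module\.use\(\'encourage\/widget\/bottle\',(.*?),function\(\)\{\}\);"

# Made-up photo list, serialized to a JSON string (the inner layer).
inner = json.dumps([
    {"thread_id": "1", "text": "hello", "img_url": "http://example.invalid/1.jpg"},
])
# Made-up page fragment: the inner string is the second element of the argument list.
page = ("<script>_.Module.use('encourage/widget/bottle',%s,function(){});</script>"
        % json.dumps(["placeholder", inner]))

match = re.search(PATTERN, page)
args = json.loads(match.group(1))   # outer layer: a list
photos = json.loads(args[1])        # inner layer: the list of photos
print(photos[0]["img_url"])
```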

Conclusion

The conclusion is... it is all a lie. Only the home page has a few good-looking photos - -... Damn it...

Finally, here are the scraped results.
