Python Crawlers: Writing the Simplest Web Crawler

Date: 2019-12-07

Knowledge is like scraps of cloth: remember to stitch them together, and you can make a grand entrance.

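I have recently become very interested in Python crawlers, so I am sharing my own learning path here. Suggestions are welcome; let's learn from each other and improve together.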

1. Development tools

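I use Sublime Text 3; being small yet capable (a phrase men may not be fond of), it has me hooked. I recommend it, though if your computer is reasonably powerful, PyCharm may suit you better.
For setting up a Python development environment in Sublime Text 3, I recommend this blog post:
[Setting up a Python development environment in Sublime](http://www.cnblogs.com/codefish/p/4806849.html)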

2. Introduction to crawlers

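A crawler, as the name suggests, crawls like a bug across the great web that is the Internet, which lets us fetch whatever we want from it.
Since we are crawling the Internet, we need to understand the URL: formally the "Uniform Resource Locator", informally a "link". It consists of three main parts:
(1) Protocol: for example, the HTTP protocol commonly seen in web addresses.
(2) Domain name or IP address: a domain name such as www.baidu.com, or the IP address that the domain name resolves to.
(3) Path: a directory, a file, and so on.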
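
The three parts listed above can be pulled out of a URL with the standard library's urllib.parse.urlparse. This is only a minimal sketch; the sample URL and its path are made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical URL, used only to illustrate the three parts.
url = "http://www.baidu.com/some/path/index.html"

parts = urlparse(url)
scheme = parts.scheme   # (1) protocol
host = parts.netloc     # (2) domain name or IP address
path = parts.path       # (3) directory or file

print(scheme, host, path)
```

Real URLs can also carry a port, a query string, and a fragment; urlparse exposes those as further attributes on the same result.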

3. Building the simplest crawler with urllib

(1) About urllib

Module	Description
urllib.error	Exception classes raised by urllib.request.
urllib.parse	Parse URLs into or assemble them from components.
urllib.request	Extensible library for opening URLs.
urllib.response	Response classes used by urllib.
urllib.robotparser	Load a robots.txt file and answer questions about fetchability of other URLs.
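
As a small illustration of the last module in the table, the sketch below asks urllib.robotparser whether two URLs may be fetched. The robots.txt rules and the example.com URLs are assumptions made up for the demo; the rules are passed in as lines, so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as lines so no network access is needed.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether any crawler ("*") may fetch each URL.
allowed_public = parser.can_fetch("*", "http://example.com/index.html")
allowed_private = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed_public, allowed_private)
```

Against a real site, one would probably call set_url() and read() on the parser instead of parse(), so that the site's own robots.txt is loaded.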

(2) Building the simplest crawler

The Baidu home page is clean and simple, which makes it a good target for our crawler.
The crawler code is as follows:

from urllib import request

def visit_baidu():
    URL = "http://www.baidu.com"
    # open the URL and get a response object
    req = request.urlopen(URL)
    # read the response body as bytes
    html = req.read()
    # decode the bytes as UTF-8 text
    html = html.decode("utf-8")
    print(html)

if __name__ == '__main__':
    visit_baidu()

Running it prints the HTML source of the Baidu home page.

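You can right-click a blank area of the Baidu home page and choose Inspect Element to compare the page source with the program's output.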
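
The crawler above hard-codes UTF-8 and never closes the response. The sketch below is one possible variation, not the only way to do it: fetch_text is a hypothetical helper, and its timeout and fallback charset are assumed defaults. To keep the demo runnable offline, it fetches a data: URL (which urllib can open without a network) instead of Baidu.

```python
from urllib import request
from urllib.parse import quote


def fetch_text(url, timeout=10, fallback_charset="utf-8"):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    # The with-statement closes the response even if reading fails.
    with request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when there is one.
        charset = response.headers.get_content_charset() or fallback_charset
    return raw.decode(charset)


# A data: URL embeds the document in the URL itself, so this demo runs offline.
sample_url = "data:text/html;charset=utf-8," + quote("<title>你好</title>")
page = fetch_text(sample_url)
print(page)
```

Calling fetch_text("http://www.baidu.com") should behave like the crawler above, provided the network is reachable.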
Of course, request can also create a Request object, which can then be opened with urlopen.
The code is as follows:

from urllib import request

def visit_baidu():
    # create a Request object
    req = request.Request('http://www.baidu.com')
    # open the Request object
    response = request.urlopen(req)
    # read the response body as bytes
    html = response.read()
    # decode the bytes as UTF-8 text
    html = html.decode('utf-8')
    print(html)

if __name__ == '__main__':
    visit_baidu()

The output is the same as before.
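
One reason to build a Request object first is that it can carry extra information, such as request headers. The sketch below is only an illustration: build_request is a hypothetical helper and the User-Agent string is an assumed value. The request is inspected rather than sent, so no network is needed.

```python
from urllib import request


def build_request(url, user_agent="Mozilla/5.0 (compatible; demo-crawler)"):
    """Build a Request carrying a custom User-Agent (the value is an assumption)."""
    return request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.baidu.com")

# Inspect the request without sending it.
print(req.full_url)
print(req.get_method())
# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))

# To actually send it: response = request.urlopen(req)
```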

(3) Error handling

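Errors are handled through the urllib.error module. The two main exceptions are URLError and HTTPError; HTTPError is a subclass of URLError, so an HTTPError can also be caught as a URLError.
The status code of an HTTPError is available through its code attribute.
The code for handling HTTPError is as follows: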

from urllib import request
from urllib import error

def Err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        # e.code is the HTTP status code returned by the server
        print(e.code)

if __name__ == '__main__':
    Err()

Running it prints 404, the HTTP status code of the error. You can look up the details of HTTP status codes yourself.
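
The meaning of a status code can also be looked up locally with the standard library's http.HTTPStatus. A minimal sketch follows; describe_status is a hypothetical helper, and the "Unknown" fallback text is an assumption.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code with its standard phrase (hypothetical helper)."""
    try:
        return "{} {}".format(code, HTTPStatus(code).phrase)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know.
        return "{} Unknown".format(code)


known = describe_status(404)
unknown = describe_status(999)
print(known)
print(unknown)
```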

The cause of a URLError is available through its reason attribute.
The code for handling URLError is as follows:

from urllib import request
from urllib import error

def Err():
    url = "https://segmentf.com/"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.URLError as e:
        # e.reason describes why the request failed
        print(e.reason)

if __name__ == '__main__':
    Err()

Running it prints the reason for the failure.
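
URLError is raised when no HTTP response is obtained at all. To see the reason attribute without depending on the network, the sketch below uses a URL with a made-up scheme, "foo", for which urllib has no handler. failure_reason is a hypothetical helper and the timeout is an assumed value.

```python
from urllib import error, request


def failure_reason(url):
    """Return None on success, otherwise the URLError reason (hypothetical helper)."""
    try:
        with request.urlopen(url, timeout=10):
            return None
    except error.URLError as e:
        return e.reason


# "foo" is a made-up scheme; urllib has no handler for it, so no network is used.
reason = failure_reason("foo://example.invalid/")
print(reason)
```

For a domain that fails to resolve, as in the code above, reason is typically an exception object describing the lookup failure rather than a plain string.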

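Since the goal is to handle errors, it is best to handle both exceptions in the code; the more specific the handling, the clearer the output. Note that HTTPError is a subclass of URLError, so the HTTPError clause must come before the URLError clause. Otherwise every error is caught by the URLError clause, and a 404, for example, is printed as Not Found.
The code is as follows: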

from urllib import request
from urllib import error

# Approach 1: catch HTTPError first, then URLError
def Err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        # must come first: HTTPError is a subclass of URLError
        print(e.code)
    except error.URLError as e:
        print(e.reason)

if __name__ == '__main__':
    Err()

You can change the url to see how the different errors are printed.
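
The effect of the clause order can be checked without any network access. In the sketch below the two exceptions are built by hand with made-up values, and describe and describe_wrong_order are hypothetical helpers that only report which except clause handled the exception.

```python
from email.message import Message
from urllib import error


def describe(exc):
    """Re-raise exc with the subclass clause first (the recommended order)."""
    try:
        raise exc
    except error.HTTPError as e:
        return "HTTPError code: {}".format(e.code)
    except error.URLError as e:
        return "URLError reason: {}".format(e.reason)


def describe_wrong_order(exc):
    """Re-raise exc with the base-class clause first."""
    try:
        raise exc
    except error.URLError as e:
        return "URLError reason: {}".format(e.reason)
    except error.HTTPError as e:
        # Never reached: the URLError clause above already matches HTTPError.
        return "HTTPError code: {}".format(e.code)


# Hand-built exceptions with hypothetical values, so the demo needs no network.
not_found = error.HTTPError("https://segmentfault.com/zzz", 404, "Not Found", Message(), None)
unreachable = error.URLError("host not found")

print(issubclass(error.HTTPError, error.URLError))
print(describe(not_found))
print(describe(unreachable))
print(describe_wrong_order(not_found))
```

With the base class first, the 404 is reported through reason as Not Found instead of through code, which is the behavior described above.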

It is not easy being a newcomer, so if you found this even a little bit helpful, please don't be stingy with your likes!