Study notes - first web crawler

Date: 2019-11-08


Open Jupyter.

First, import the request module from Python's urllib library:

from urllib.request import urlopen

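urlopen opens a remote object over the network and lets you read it. It is a general-purpose function that can read HTML files, image files, and any other file stream.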

html = urlopen("http://www.naver.com")
print(html.read())

Crawl result: the raw HTML of the page, printed as a bytes object.
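Because read() returns bytes rather than text, it is usually worth decoding the body before working with it. The following is a minimal sketch, not a definitive implementation: fetch_text is a hypothetical helper name, UTF-8 is an assumed encoding, and a data: URL stands in for a real site so that the example runs without a network connection.

```python
from urllib.request import urlopen

# Assumption: a data: URL replaces a real page so the example runs offline.
SAMPLE_URL = "data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E"


def fetch_text(url, encoding="utf-8"):
    """Open url and return its body decoded to str (encoding is an assumption)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes, exactly what print(html.read()) shows
    return raw.decode(encoding)


page = fetch_text(SAMPLE_URL)
print(type(page).__name__)
print(page)
```

Swapping SAMPLE_URL for an http:// address works the same way, provided the page really is encoded as UTF-8.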

BeautifulSoup

The BeautifulSoup library formats and organizes messy web content by locating HTML tags, and exposes the XML structure as easy-to-use Python objects.

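Because BeautifulSoup is not part of the Python standard library, it has to be installed separately. A Jupyter environment that already bundles it (an Anaconda installation, for example) lets you import it directly, which saves a lot of time. (The most commonly used object in the BeautifulSoup library is, fittingly, the BeautifulSoup object.)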
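One way to find out whether the current environment already has the library is to ask Python if the bs4 module can be located. This is only a sketch; is_installed is a hypothetical helper name, not part of BeautifulSoup or Jupyter.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    print("BeautifulSoup is available")
else:
    # The package is published as beautifulsoup4; the import name is bs4.
    print("Install it with: pip install beautifulsoup4")
```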

Adjusting the example from the start of this post:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Name the parser explicitly; html.parser ships with Python
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)

We import urlopen, then call html.read() to get the page's HTML content. That content is passed to the BeautifulSoup object, which parses it into a tree of nested tag objects.

To extract a tag, for example the h1, any of these paths works:

bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
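The equivalence of these paths can be checked without downloading anything. The sketch below parses a small inline document (SAMPLE_HTML is made up for illustration and is not the real page1.html) and prints the h1 reached through each path.

```python
from bs4 import BeautifulSoup

# Assumption: a small inline document instead of a downloaded page.
SAMPLE_HTML = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Some text.</div></body></html>"
)

bsObj = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Each attribute access searches downward for the first tag with that name,
# so skipping intermediate levels still lands on the same h1.
paths = [bsObj.html.body.h1, bsObj.body.h1, bsObj.html.h1, bsObj.h1]
for tag in paths:
    print(tag)
```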

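Now that the basic skeleton is in place, it is time to think about what can go wrong.

Problems

When the page does not exist on the server (or an error occurs while fetching it):

The request comes back as an HTTP error, such as "404 Page Not Found" or "500 Internal Server Error". In all such cases, the urlopen function raises an HTTPError exception.

Solution: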

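from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # Return None, stop the program, or fall back to another plan
else:
    # The program continues. Note: if the except block above already returns
    # or breaks, the else clause is not needed, and this code will not run.
    pass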
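Which fallback to choose often depends on the status code carried by the exception. The sketch below shows how to read it; describe_http_error is a hypothetical helper, the rule "retry only on 5xx" is an assumption made for illustration, and the two exceptions are built by hand so that the example runs offline.

```python
import io
from email.message import Message
from urllib.error import HTTPError


def describe_http_error(error):
    """Summarize an HTTPError and say whether a retry looks worthwhile."""
    # Assumption for this sketch: 5xx responses may be temporary, 4xx are not.
    retry = 500 <= error.code < 600
    return f"{error.code} {error.reason} (retry: {retry})"


# Constructed manually for the demo; urlopen raises this same exception type.
not_found = HTTPError(
    "http://example.com/missing", 404, "Not Found", Message(), io.BytesIO()
)
server_error = HTTPError(
    "http://example.com/broken", 500, "Internal Server Error", Message(), io.BytesIO()
)

print(describe_http_error(not_found))
print(describe_http_error(server_error))
```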

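When the server does not exist:

If the server cannot be reached at all (the site is down or the URL is mistyped), urlopen does not return None; it raises a URLError. HTTPError is a subclass of URLError, so an except URLError clause also covers HTTP errors:

from urllib.error import URLError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except URLError as e:
    print("URL is not found")
else:
    # The program continues
    pass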
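If a None check is preferred at the call site, one option is a small wrapper that catches the exceptions urlopen raises and returns None instead. This is a sketch under stated assumptions: safe_open is a hypothetical helper name, and the deliberately mistyped scheme "htp" is used only because it fails without any network access.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def safe_open(url):
    """Hypothetical helper: return the response, or None if the URL cannot be opened."""
    try:
        return urlopen(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable response at all: unknown host, refused connection, bad scheme.
        print("URL error:", e.reason)
    return None


# "htp" is a deliberate typo, so this fails locally without touching the network.
html = safe_open("htp://www.pythonscraping.com/pages/page1.html")
if html is None:
    print("URL is not found")
```

HTTPError is caught first because it is the more specific class; with the order reversed, the URLError clause would swallow HTTP errors as well.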

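When the tag you ask for does not exist:

If the tag you try to access does not exist, BeautifulSoup returns None. If you then try to access a child tag on that None object, an AttributeError is raised.

Solution: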

try:
    badContent = bsObj.body.li
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
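An alternative to wrapping every chained access in try/except is to walk the chain one level at a time and stop at the first missing tag. The sketch below assumes a tiny inline document with no list tags; get_nested is a hypothetical helper, not a BeautifulSoup API.

```python
from bs4 import BeautifulSoup

# Assumption: inline HTML that contains no <ul> or <li> tags.
bsObj = BeautifulSoup("<html><body><h1>Title</h1></body></html>", "html.parser")


def get_nested(tag, *names):
    """Hypothetical helper: follow names level by level, returning None at the first gap."""
    for name in names:
        if tag is None:
            return None
        tag = getattr(tag, name)  # same lookup as tag.<name>
    return tag


print(get_nested(bsObj, "body", "h1"))
print(get_nested(bsObj, "body", "ul", "li"))
```

Written directly, bsObj.body.ul.li would raise AttributeError here, because bsObj.body.ul is already None.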

The code, reorganized:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # HTTPError: the server returned an error page.
        # URLError: the server could not be reached at all.
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body was None, so there is nothing to search
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Page not found")
else:
    print(title)

Result: [screenshots P1 and P2]

A good book to recommend:

Web Scraping with Python (Chinese edition: Python网络数据采集)

www.amazon.cn/dp/B01DU8CX…