Problems parsing non-standard HTML with BeautifulSoup

Date: 2019-11-17

The problem:

BeautifulSoup version: 4.3.2

While searching HTML with BeautifulSoup.find_all(), I ran into the following markup:

<a href="/shipin/donghuapian/2012-07-25/23404.html"title="谦谦君子" target="_blank">温润如玉</a>

Note that there is no space between the href attribute and the title attribute of the a tag.

Analysis:

I ran the page through BeautifulSoup's diagnostic tool, diagnose (available from version 4.2 onward):

from bs4.diagnose import diagnose

# Read the page and print a report of how the available parsers handle it.
with open('test.html') as f:
    html_doc = f.read()
diagnose(html_doc)

It turned out that the line had been parsed as:

<a href="/shipin/donghuapian/2012-07-25/23404.html"> title="谦谦君子" target="_blank"&gt;温润如玉</a>

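See the problem? This is a broken a tag: the tag is closed right after href, and title and target end up outside it as plain text. As a result, when BeautifulSoup.find_all() reaches this line, matching on title fails.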

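The cause is that BeautifulSoup was using Python's built-in HTML parser, which it falls back to when no parser is specified and neither lxml nor html5lib is installed. That parser does not cope well with malformed pages.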
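
To see how the built-in parser on a given interpreter tokenizes the tag, it can be driven directly, without BeautifulSoup. The following is a minimal sketch, assuming Python 3 (the post itself was written against Python 2); StartTagCollector and collect_start_tags are hypothetical helper names introduced here for illustration. The result depends on the Python version: the Beautiful Soup documentation describes the built-in parser as not very good before Python 2.7.3 and 3.2.2, so a recent interpreter may parse title correctly even without lxml.

```python
from html.parser import HTMLParser

# The problematic markup from the post: no space between href and title.
SNIPPET = ('<a href="/shipin/donghuapian/2012-07-25/23404.html"'
           'title="谦谦君子" target="_blank">温润如玉</a>')


class StartTagCollector(HTMLParser):
    """Record every start tag together with the attributes the parser saw."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # list of (tag_name, {attribute: value})

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, dict(attrs)))


def collect_start_tags(markup):
    """Feed markup to the built-in parser and return the recorded start tags."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


if __name__ == '__main__':
    # If 'title' is missing from the printed attributes, the built-in
    # parser on this interpreter mishandles the tag.
    for tag, attrs in collect_start_tags(SNIPPET):
        print(tag, attrs)
```

If title does not show up among the printed attributes, the interpreter's built-in parser is the culprit and switching parsers is the way out.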

The fix:

Tell BeautifulSoup to use a different HTML parser (the original post linked to the details at this point). I chose lxml:

sudo pip install lxml

When creating the BeautifulSoup object, pass one extra argument:

#coding=utf-8
import re
from bs4 import BeautifulSoup

html_doc = open('test.html').read()
soup = BeautifulSoup(html_doc, 'lxml')  # use lxml as the HTML parser
tags = soup.find_all('a', {'title': re.compile(u'君子')})

And that's it.
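
If the script also has to run on machines where lxml may be missing, the parser name can be chosen at run time instead of being hard-coded. This is only a sketch, assuming Python 3; pick_parser and find_links_by_title are hypothetical helpers introduced here, not part of BeautifulSoup. The order (lxml, then html5lib, then html.parser) mirrors the ranking the Beautiful Soup documentation gives for its own default choice.

```python
import re
from importlib.util import find_spec


def pick_parser():
    """Return a parser name for BeautifulSoup, preferring lxml if installed."""
    # For these two parsers the importable module name is also the
    # parser name that BeautifulSoup accepts.
    for name in ('lxml', 'html5lib'):
        if find_spec(name) is not None:
            return name
    return 'html.parser'  # ships with Python, so it is always available


def find_links_by_title(markup, pattern):
    """Return the <a> tags whose title attribute matches the regex pattern."""
    # Imported lazily so pick_parser() stays usable without beautifulsoup4.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(markup, pick_parser())
    return soup.find_all('a', title=re.compile(pattern))
```

With the page from the post, the call would be find_links_by_title(html_doc, '君子'). Different parsers can build different trees from badly broken markup, so it is worth checking the result when the fallback is used.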