BeautifulSoup: Parsing class Attributes That Contain Spaces

Date: 2019-11-26

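Today I was scraping a website whose class attribute contains spaces, and for a long time I could not get BeautifulSoup to find the tags I wanted. After reading the documentation, I learned that attributes like this are called multi-valued attributes:

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them but defines a few more. The most common multi-valued attribute is class (a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value of a multi-valued attribute as a list: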

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
# ['body', 'strikeout']

css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']
# ['body']

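If an attribute looks like it has more than one value, but no version of the HTML standard defines it as multi-valued, Beautiful Soup leaves it alone and returns it as a single string: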

id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
# 'my id'
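As an aside to the excerpt: when code should not have to care whether an attribute is multi-valued, Tag.get_attribute_list returns a list in both cases. This is a minimal sketch, assuming a Beautiful Soup 4 release recent enough to provide that method and the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="my id" class="body strikeout"></p>', 'html.parser')

# id is not a multi-valued attribute, so indexing gives a plain string ...
id_value = soup.p['id']
# ... while get_attribute_list wraps that string in a one-element list.
id_list = soup.p.get_attribute_list('id')
# class is multi-valued, so it is already split into a list.
class_list = soup.p.get_attribute_list('class')

print(id_value, id_list, class_list)
```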

When a tag is turned back into a string, the values of a multi-valued attribute are merged into a single value:

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

If the document is parsed as XML, there are no multi-valued attributes:

# The 'xml' parser requires lxml to be installed.
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# 'body strikeout'

That is how the documentation explains multi-valued attributes.
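The string-only behaviour of the XML parser can apparently also be requested for HTML, by passing multi_valued_attributes=None to the constructor. The sketch below assumes a Beautiful Soup release that supports that keyword argument (older 4.x versions do not have it):

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout"></p>'

# Default HTML behaviour: class is split into a list of class names.
as_list = BeautifulSoup(markup, 'html.parser').p['class']

# With multi_valued_attributes=None no attribute is split,
# so class comes back as the raw string from the markup.
as_string = BeautifulSoup(
    markup, 'html.parser', multi_valued_attributes=None
).p['class']

print(as_list)
print(as_string)
```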

So this needs to be kept in mind when using BeautifulSoup.find or BeautifulSoup.find_all.

Here is an example.

Suppose the HTML looks like this:

>>> html = '<div class="l_post j_l_post l_post_bright  "></div>'

First parse it with html.parser, then look at what its class attribute contains:

>>> from bs4 import BeautifulSoup
>>> Soup = BeautifulSoup(html, 'html.parser')
>>> Soup.div['class']
['l_post', 'j_l_post', 'l_post_bright', '']

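Interesting: if the attribute value ends with whitespace, the list gets an extra '' at the end. find and find_all can still locate the tag we want, and for class we can pass just the first class name or the whole list; either works.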
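The exact list seems to depend on the Beautiful Soup version (newer releases may not produce the empty string at all), so it can be safer to drop empty entries before comparing class lists. A small sketch; clean_classes is a hypothetical helper name, not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup


def clean_classes(tag):
    """Return the tag's class names without empty strings (hypothetical helper)."""
    # tag.get returns None when the attribute is missing, hence the "or []".
    return [name for name in (tag.get('class') or []) if name]


html = '<div class="l_post j_l_post l_post_bright  "></div>'
soup = BeautifulSoup(html, 'html.parser')
classes = clean_classes(soup.div)
print(classes)
```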

Note: I ran into this while scraping a website. If you pass a list, a tag matches as long as it has any one of the classes in the list.

>>> Soup.find('div', attrs={'class': '1_post'})  # '1_post' (digit one) matches nothing, so find returns None
>>> Soup.find('div', attrs={'class': 'l_post'})
<div class="l_post j_l_post l_post_bright "></div>
>>> Soup.find('div', attrs={'class': ['l_post', 'j_l_post']})
<div class="l_post j_l_post l_post_bright "></div>
>>> Soup.find('div', attrs={'class': ['l_post', 'j_l_post', 'l_post_bright']})
<div class="l_post j_l_post l_post_bright "></div>
>>> Soup.find('div', attrs={'class': ['l_post', 'j_l_post', 'l_post_bright', '']})
<div class="l_post j_l_post l_post_bright "></div>
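The list form therefore behaves like an OR. If a tag must carry all of the listed classes, one option is to pass a function as the filter. A minimal sketch with made-up markup; has_all_classes is an illustrative name, not a Beautiful Soup API:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="l_post"></div>'
    '<div class="l_post j_l_post l_post_bright"></div>'
)
soup = BeautifulSoup(html, 'html.parser')

# A list matches tags that have ANY of the listed classes.
any_match = soup.find_all('div', attrs={'class': ['l_post', 'j_l_post']})


def has_all_classes(tag):
    """True for div tags whose class list contains every required name."""
    required = {'l_post', 'j_l_post'}
    return tag.name == 'div' and required <= set(tag.get('class') or [])


# A function filter is called with each tag and decides the match itself.
all_match = soup.find_all(has_all_classes)

print(len(any_match), len(all_match))
```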

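One more thing about find and find_all, continuing with the same example. (By the way, because find_all is used so often, the method name can be omitted: calling Soup(...) directly is the same as calling Soup.find_all(...).)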

>>> Soup.find(attrs={'class': ['l_post', 'j_l_post', 'l_post_bright', '']})
<div class="l_post j_l_post l_post_bright "></div>
>>> Soup.find_all(attrs={'class': ['l_post', 'j_l_post', 'l_post_bright', '']})[0]
<div class="l_post j_l_post l_post_bright "></div>
>>> Soup.find_all('div', attrs={'class': ['l_post', 'j_l_post', 'l_post_bright', '']})[0]
<div class="l_post j_l_post l_post_bright "></div>
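The relationship between the three spellings can be checked directly. A short sketch on similar markup (without the trailing spaces, so the result does not depend on the Beautiful Soup version):

```python
from bs4 import BeautifulSoup

html = '<div class="l_post j_l_post l_post_bright"></div><div class="other"></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list-like ResultSet of every match.
all_divs = soup.find_all('div')
# Calling the soup object is shorthand for find_all.
called = soup('div')
# find returns only the first match, or None when nothing matches.
first = soup.find('div')
missing = soup.find('span')

print(len(all_divs), len(called), first is all_divs[0], missing)
```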

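After reading the BeautifulSoup documentation more closely, I think the following part is important for parsing HTML with multi-valued attributes:

Searching for tags by CSS class is very useful, but the name of the CSS attribute, class, is a reserved word in Python, and using class as a keyword argument causes a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_: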

# soup is the "three sisters" example document from the Beautiful Soup documentation
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Like any keyword argument, class_ accepts different kinds of filters: a string, a regular expression, a function, or True:

import re

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    # css_class is None for tags that have no class attribute
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Remember that the class attribute of a tag is multi-valued. When you search for a tag by CSS class, you are matching against any one of its CSS classes:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

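When matching the exact string value of class, the search finds nothing if the class names are given in a different order from the markup: for example, css_soup.find_all("p", class_="strikeout body") returns [].

A class search can also be written with the attrs dictionary instead of class_:

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]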
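The order sensitivity of an exact class string is easy to reproduce. If the order of class names should not matter, a CSS selector via select is one alternative; the sketch below assumes the built-in html.parser and a Beautiful Soup version whose select supports chained class selectors:

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

# Exact string match: the order must be the same as in the markup.
same_order = css_soup.find_all('p', class_='body strikeout')
other_order = css_soup.find_all('p', class_='strikeout body')

# A CSS selector requires both classes but ignores their order.
by_selector = css_soup.select('p.strikeout.body')

print(len(same_order), len(other_order), len(by_selector))
```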

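That is how the documentation explains class_. However, I found that on some websites this approach still did not work, and when it failed I fell back on a regular expression like this:

InfoList = Soup.find_all(class_=re.compile('l_post j_l_post l_post_bright'))

Take this multi-valued attribute as an example: the tag's class is "l_post j_l_post l_post_bright " (note the trailing space). None of the earlier approaches worked for me, and only the regular expression solved my problem.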
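One plausible explanation, not verified against the library source, is the trailing whitespace: if it ends up in the value Beautiful Soup compares against, an exact string no longer matches, while a regular expression is applied as a substring search and still does. The same property makes the pattern loose, so anchoring it may be worthwhile. A sketch with made-up markup:

```python
import re

from bs4 import BeautifulSoup

html = (
    '<div class="l_post j_l_post l_post_bright  ">wanted</div>'
    '<div class="xl_post j_l_post l_post_bright_2">other</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# Unanchored pattern: matches anywhere inside the joined class string,
# so the second div is found as well.
loose = soup.find_all(class_=re.compile('l_post j_l_post l_post_bright'))

# Anchored pattern: the whole class string must match, apart from
# optional trailing whitespace.
strict = soup.find_all(class_=re.compile(r'^l_post j_l_post l_post_bright\s*$'))

print([tag.get_text() for tag in loose])
print([tag.get_text() for tag in strict])
```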