[Summary] BeautifulSoup Quick Reference

Date: 2019-11-07


1. Download the page HTML:

import urllib.request  # Python 3 (the original used Python 2's urllib2)

response = urllib.request.urlopen(page_link, timeout=time_out)
web_page = response.read()

2. Decode the page (fixes garbled Chinese text):

decode_s = web_page.decode("utf-8")
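Steps 1 and 2 can be combined into one helper. The following is a minimal sketch, assuming Python 3 (where the functionality of urllib2 lives in urllib.request). The names fetch_text and fallback_encoding are hypothetical and not from the original. Instead of always assuming UTF-8, it first tries the charset declared in the Content-Type header, which helps with pages served as GBK or similar:

```python
import urllib.request


def fetch_text(page_link, time_out=10, fallback_encoding="utf-8"):
    """Download a page and return it as str (hypothetical helper)."""
    with urllib.request.urlopen(page_link, timeout=time_out) as response:
        raw = response.read()  # bytes
        # Charset from the Content-Type header, or None if not declared.
        declared = response.headers.get_content_charset()
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError.
    return raw.decode(declared or fallback_encoding, errors="replace")
```

If the server declares no charset, the fallback encoding is used, which matches the plain decode("utf-8") call above.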

3. Parse the string into a tree structure:

from bs4 import BeautifulSoup  # the "lxml" parser also requires the lxml package

soup = BeautifulSoup(decode_s, "lxml")

Now for the main topic (suppose we are searching the following HTML fragment):

<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>GHIJK</h1>OPQRST</div>

Suppose we want to extract ABCDEFGHIJK (without the <a> tag markup):

4. Iterate over the tree to find specific elements:

for tag in soup.find_all("div"):

5. Check whether the element has a specific attribute:

if tag.attrs is not None and 'class' in tag.attrs.keys():

6. Check whether the attribute's value equals a specific value (a space-separated value is returned as a list, so check the list length too):

if (len(tag.attrs['class']) == 2 and
        tag.attrs['class'][0] == 'cell' and
        tag.attrs['class'][1] == 'maket'):
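The list-length check in step 6 exists because Beautiful Soup treats class as a multi-valued attribute. A short sketch of what the attribute lookups return for the sample HTML, using the built-in "html.parser" rather than "lxml" (an assumption made here so that the sketch does not depend on lxml being installed):

```python
from bs4 import BeautifulSoup

html = ('<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>'
        'GHIJK</h1>OPQRST</div>')
soup = BeautifulSoup(html, "html.parser")

div_classes = soup.div.get("class")  # multi-valued attribute: a list
link_href = soup.a.get("href")       # single-valued attribute: a str
h1_classes = soup.h1.get("class")    # missing attribute: None

# Comparing against a list checks length and order in one step,
# equivalent to the three-part condition in step 6.
is_target = div_classes == ["cell", "maket"]

print(div_classes, link_href, h1_classes, is_target)
```

Because tag.get() returns None for a missing attribute, it can also stand in for the membership test in step 5.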

7. Extract the text of a child tag under this tag:

tag.h1.get_text()
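Steps 3 to 7 can be assembled into one runnable sketch against the sample HTML. The results list is a hypothetical name added for illustration, and "html.parser" is used in place of "lxml" so that no extra package is assumed:

```python
from bs4 import BeautifulSoup

decode_s = ('<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>'
            'GHIJK</h1>OPQRST</div>')
soup = BeautifulSoup(decode_s, "html.parser")

results = []
for tag in soup.find_all("div"):
    # Steps 5 and 6: the div must carry exactly class="cell maket".
    if tag.get("class") != ["cell", "maket"]:
        continue
    # Step 7: get_text() joins all text under <h1>, including the text
    # inside the nested <a>, and drops the markup itself.
    if tag.h1 is not None:
        results.append(tag.h1.get_text())

print(results)  # ['ABCDEFGHIJK']
```

The text LMN and OPQRST belongs to the div rather than to h1, so it is not included.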

A note on using children: children returns an iterable, but not every item it yields is a tag; some are very likely bs4.element.NavigableString objects. So when iterating, check each item's type first:

# Requires: from bs4 import element
for span in tag.parent.children:
    if (isinstance(span, element.Tag) and
            span.attrs is not None and
            'class' in span.attrs.keys()):
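To see why the type check matters, here is a small sketch that lists the direct children of the sample div. It iterates over the div's own children rather than tag.parent.children, and uses "html.parser"; both choices are assumptions made to keep the example self-contained:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = ('<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>'
        'GHIJK</h1>OPQRST</div>')
soup = BeautifulSoup(html, "html.parser")

div = soup.div
# Only Tag objects have a meaningful name and an attrs dict.
child_tags = [c.name for c in div.children if isinstance(c, Tag)]
# Bare text between tags arrives as NavigableString.
child_strings = [str(c) for c in div.children
                 if isinstance(c, NavigableString)]

print(child_tags)     # ['h1']
print(child_strings)  # ['LMN', 'OPQRST']
```

Only one of the three children is a Tag, so accessing attrs on each child without the isinstance check would fail on the two text nodes.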