1. Download the page's HTML:
response = urllib2.urlopen(page_link, timeout=time_out)
web_page = response.read()
2. Decode the page (this fixes garbled Chinese text):
decode_s = web_page.decode("utf-8")
3. Parse the string into a tree structure:
soup = BeautifulSoup(decode_s, "lxml")
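Steps 2 and 3 can be tried without network access. The following is a minimal sketch, not the original author's code: the byte string stands in for `response.read()`, `errors="replace"` is an added assumption so that a stray bad byte does not raise, and the built-in `"html.parser"` is swapped in for `"lxml"` only so the example runs without installing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes returned by response.read().
web_page = '<div class="cell maket"><h1>中文<a href="a.html">DEF</a></h1></div>'.encode("utf-8")

# Step 2: decode bytes to text. errors="replace" is an assumption added here;
# the original call simply uses decode("utf-8").
decode_s = web_page.decode("utf-8", errors="replace")

# Step 3: build the tree. "html.parser" ships with Python; the article uses "lxml".
soup = BeautifulSoup(decode_s, "html.parser")

h1_text = soup.h1.get_text()
print(h1_text)
```

If the page is not actually UTF-8, the decoded text will still be wrong, so the encoding passed to `decode` has to match the page.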
Now for the main topic. Suppose we are searching the following HTML snippet:
<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>GHIJK</h1>OPQRST</div>
Suppose we want to extract ABCDEFGHIJK (the text without the <a> tag):
4. Iterate to find the specific element:
for tag in soup.find_all("div"):
5. Check whether a specific attribute is present on the element:
if tag.attrs is not None and 'class' in tag.attrs.keys():
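6. Check whether the attribute's value equals a specific value (a space-separated value is parsed into a list, so the list length must be checked as well):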
if len(tag.attrs['class']) == 2 and tag.attrs['class'][0] == 'cell' and tag.attrs['class'][1] == 'maket':
7. Extract the text of the child tag under this tag:
tag.h1.get_text()
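Steps 4 to 7 can be combined into one runnable sketch. The helper name `h1_texts` and its `wanted` parameter are hypothetical, introduced only for illustration, and `"html.parser"` is again used in place of `"lxml"` to avoid the extra dependency. The last line shows what appears to be an equivalent shortcut: passing the exact class string to `find_all` through `class_`.

```python
from bs4 import BeautifulSoup

html = '<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>GHIJK</h1>OPQRST</div>'
soup = BeautifulSoup(html, "html.parser")

def h1_texts(soup, wanted=("cell", "maket")):
    """Collect the <h1> text of every <div> whose class list equals `wanted`."""
    results = []
    for tag in soup.find_all("div"):          # step 4: iterate over candidates
        classes = tag.attrs.get("class")      # step 5: None if the attribute is absent
        # step 6: class="cell maket" is parsed into the list ['cell', 'maket']
        if classes is not None and list(classes) == list(wanted):
            if tag.h1 is not None:            # guard: the div may have no <h1>
                results.append(tag.h1.get_text())  # step 7
    return results

texts = h1_texts(soup)

# Shortcut: match the exact string value of the class attribute.
shortcut = [d.h1.get_text() for d in soup.find_all("div", class_="cell maket")]
print(texts, shortcut)
```

Note that `get_text()` joins the text of all descendants, which is why the `<a>` tag disappears while its text DEF is kept.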
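A note on using children: `children` returns an iterable, but not every item in it is a tag; many are likely to be bs4.element.NavigableString objects. So when iterating, check each item's type first (here `element` refers to the `bs4.element` module):
for span in tag.parent.children:
    if isinstance(span, element.Tag) and span.attrs is not None and 'class' in span.attrs.keys():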
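To see why the type check matters, the sketch below walks the children of the sample `<div>` and separates tags from bare text nodes. The variable names are illustrative, and `"html.parser"` is used instead of `"lxml"` so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<div class="cell maket">LMN<h1>ABC<a href="a.html">DEF</a>GHIJK</h1>OPQRST</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

child_tags = []     # names of children that are real tags
child_strings = []  # bare text nodes sitting between the tags

for child in div.children:
    if isinstance(child, Tag):
        # Only Tag objects carry .attrs, .name, and so on.
        child_tags.append(child.name)
    elif isinstance(child, NavigableString):
        child_strings.append(str(child))

print(child_tags, child_strings)
```

Accessing `.attrs` on one of the text nodes would raise an `AttributeError`, which is the failure the `isinstance` check prevents.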