The example below comes from the official BeautifulSoup documentation:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
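Parse it with soup = BeautifulSoup(html_doc, "html.parser"). Calling soup.prettify() then returns the tree that bs4 built, as an indented string: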
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
1. A few simple ways to navigate the parsed data:
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.title.string
The Dormouse's story
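The same navigation can be run as a small script. This is a minimal sketch on a shortened copy of the document. The "html.parser" argument and the find_all() call do not appear in the snippets above; both are standard bs4 features, used here to name the parser explicitly and to collect every link rather than only the first.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document, kept small for illustration
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string   # text inside <title>
first_link = soup.a              # attribute access returns the FIRST matching tag
link_ids = [a["id"] for a in soup.find_all("a")]  # find_all returns every match

print(title_text)
print(first_link["href"])
print(link_ids)
```

Attribute-style access such as soup.a is convenient for a quick look, while find_all() is the usual choice once more than one matching tag matters.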
2. Passing a document to bs4
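from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")  # from an open file handle
soup = BeautifulSoup("<html>data</html>", "html.parser")  # from a string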
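Both input styles can be tried without an existing index.html. The sketch below writes a throwaway file first; the temporary path and the sample markup are assumptions made only so the example is self-contained.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"

# Create a throwaway index.html in a temporary directory (illustrative only)
path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "w", encoding="utf-8") as fp:
    fp.write(markup)

# From a file handle: the with-block closes the file once parsing is done
with open(path, encoding="utf-8") as fp:
    soup_from_file = BeautifulSoup(fp, "html.parser")

# From a string
soup_from_string = BeautifulSoup(markup, "html.parser")

print(soup_from_file.p.string)
print(soup_from_string.p.string)
```

Opening the file in a with-block is a small design choice: the handle is released as soon as the tree is built, instead of staying open until garbage collection.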
BeautifulSoup parses a complex HTML document into a tree of Python objects. bs4 has four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
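To see the four object types side by side, the following sketch parses a tiny fragment and inspects each piece. The fragment is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up fragment containing a tag, a string, and a comment
soup = BeautifulSoup("<p><b>bold</b><!--a note--></p>", "html.parser")

tag = soup.b                   # a Tag
text = soup.b.string           # a NavigableString
comment = soup.p.contents[1]   # a Comment (second child of <p>)

type_names = [type(obj).__name__ for obj in (soup, tag, text, comment)]
print(type_names)

# Comment is a special kind of NavigableString
print(isinstance(comment, NavigableString))
```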
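2.1 Tag
A Tag object corresponds to a tag in the HTML document. Its two most important features are its name and its attributes: tag.name is the name of the tag, and tag.attrs holds its attributes.
A tag can have many attributes. For example, the tag <b class="boldest"> has an attribute class whose value is "boldest".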
>>> soup.a.attrs
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
(The attributes are returned as a dictionary.)
Tag attributes are read and written exactly like dictionary entries.
>>> soup.a['href']
http://example.com/elsie
>>> soup.a['class']
['sister']
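As a runnable sketch of the dictionary-style access, the example below parses a single link. The .get() call and the "in" test are ordinary dictionary-like operations that Tag supports; they are added here to show how a missing attribute can be handled without raising KeyError.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(markup, "html.parser")
link = soup.a

name = link.name                 # the tag's name
href = link["href"]              # dictionary-style lookup
missing = link.get("title")      # .get() returns None for an absent attribute
has_id = "id" in link.attrs      # membership test on the attrs dictionary

print(name, href, missing, has_id)
```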
Tag attributes can also be added, modified, and deleted:
>>> tag['class'] = 'verybold'
>>> del tag['class']
>>> print(tag.get('class'))
# None
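Some attributes can hold several values at once; these are called multi-valued attributes. class is the most common one, which is why its value above is a list.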
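A short sketch of how a multi-valued attribute behaves, using a made-up <p> tag: class comes back as a list, while id (which is not multi-valued) stays a single string even when it contains a space.

```python
from bs4 import BeautifulSoup

# Made-up tag with two classes and an id containing a space
soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

classes = soup.p["class"]     # multi-valued: parsed into a list
element_id = soup.p["id"]     # not multi-valued: kept as one string

# Assigning a list writes the values back as one space-separated attribute
soup.p["class"] = ["body", "highlight"]
rendered = str(soup.p)

print(classes)
print(element_id)
print(rendered)
```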
2.2 Navigable strings (NavigableString)
>>> tag.string
# u'Extremely bold'
>>> type(tag.string)
# <class 'bs4.element.NavigableString'>
To convert a NavigableString into a plain Unicode string, call unicode() on it (this is Python 2; under Python 3 the equivalent is str()):
>>> unicode_string = unicode(tag.string)
>>> unicode_string
# u'Extremely bold'
>>> type(unicode_string)
# <type 'unicode'>
The string inside a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method:
>>> tag.string.replace_with("No longer bold")
>>> tag
# <blockquote>No longer bold</blockquote>
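The replacement can be reproduced end to end with the sketch below. It starts from a plain <b> tag rather than the <blockquote> shown in the output above, since the earlier steps that produced that tag are not part of this post.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
tag = soup.b

old_string = tag.string                     # keep a handle on the original string
old_string.replace_with("No longer bold")   # swap it out inside the tree

print(tag)          # the tag now holds the new text
print(old_string)   # the old string object itself is unchanged
```

The old NavigableString is detached from the tree rather than mutated, which matches the point that strings cannot be edited in place.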
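Because a string cannot contain anything else, a NavigableString does not support the .contents or .string attributes or the find() method.
If you want to use a NavigableString outside of Beautiful Soup, call unicode() on it (str() in Python 3) to turn it into an ordinary Unicode string. Otherwise the string keeps a reference to the entire parse tree even after you have finished with Beautiful Soup, which wastes memory.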
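A minimal Python 3 sketch of that conversion, using str() in place of unicode(). The .parent check illustrates the link back into the tree that the plain string no longer carries.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

nav = soup.b.string   # NavigableString, still attached to the parse tree
plain = str(nav)      # ordinary Python string with the same text

print(type(nav) is NavigableString)   # the original keeps its bs4 type
print(nav.parent is soup.b)           # and points back to its parent tag
print(type(plain) is str)             # the copy is a plain str
print(hasattr(plain, "parent"))       # with no link into the tree
```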
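Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to run into is Comment, which represents a comment in the document.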
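A brief sketch of what a Comment looks like in practice. When a comment is the only child of a tag, .string returns it as a Comment object; the markup is a small sample chosen for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string                  # the only child of <b> is the comment
is_comment = isinstance(comment, Comment)

print(type(comment).__name__)
print(comment)                           # the text without the <!-- --> markers
```

Checking isinstance(..., Comment) is a simple way to tell comment text apart from ordinary page text when both come back from .string.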
More Beautiful Soup features will be covered in later updates.