BeautifulSoup Usage Summary

The examples below use the sample document from the official BeautifulSoup documentation:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Calling soup.prettify() prints the DOM tree that bs4 built from the document:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
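
The listing above uses a soup object without showing how it is built. A minimal sketch, assuming Python 3 and the bs4 package; the "html.parser" argument is an assumption, since the original does not name a parser:

```python
from bs4 import BeautifulSoup

# The sample document from the official documentation.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Parse the markup with Python's built-in parser (assumed choice).
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns the tree as an indented string, one tag or text node per line.
pretty = soup.prettify()
print(pretty)
```

Attribute order in the printed tags may differ from the listing above depending on the bs4 version and parser.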

1. A few simple ways to navigate the structured data:

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.title.string
The Dormouse's story
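
Accessing a tag by name, as in soup.a, returns only the first match. The sketch below is not part of the original notes: it uses find_all() and find() from the bs4 API to reach the other links, on a shortened copy of the sample document and with "html.parser" as an assumed parser.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (illustrative only).
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

title_name = soup.title.name           # name of the first <title> tag
parent_name = soup.title.parent.name   # the tag that contains <title>

# find_all() returns every matching tag, not just the first one.
links = [a.get("href") for a in soup.find_all("a")]

# find() accepts attribute filters such as id.
third = soup.find(id="link3")

print(title_name, parent_name)
print(links)
print(third.string)
```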

2. Passing a document to bs4

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"), "html.parser")
soup = BeautifulSoup("<html>data</html>", "html.parser")
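
Both input forms can be tried without an existing index.html. The following is a sketch under assumptions: the temporary file stands in for index.html, and "html.parser" is an assumed parser choice.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html>data</html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for the index.html file mentioned above.
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as out:
        out.write(markup)

    # Form 1: pass an open file handle; the with-block closes it afterwards.
    with open(path, encoding="utf-8") as handle:
        soup_from_file = BeautifulSoup(handle, "html.parser")

# Form 2: pass the markup directly as a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

print(soup_from_file.html.string)
print(str(soup_from_file) == str(soup_from_string))
```

Opening the file in a with-block avoids leaving the handle open, which the bare open() call would do.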

BeautifulSoup parses a complex HTML document into a DOM tree. Every node in that tree is one of four object types: Tag, NavigableString, BeautifulSoup, or Comment.
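
The four types can be observed on a very small document. A minimal sketch, assuming Python 3 and "html.parser"; the markup is made up for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")

p = soup.p               # a Tag
text, note = p.contents  # the <p> holds a string followed by a comment

for node in (soup, p, text, note):
    print(type(node).__name__)
```

Comment is a subclass of NavigableString, so code that checks isinstance(node, NavigableString) also matches comments.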

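2.1 Tag

A Tag object corresponds to a tag in the HTML source. Its two most important features are its name and its attributes: tag.name is the name of the tag, and tag.attrs holds its attributes.

A tag can have any number of attributes. For example, the tag <b class="boldest"> has one attribute, class, whose value is 'boldest'.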

>>> soup.a.attrs
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

(The attributes are returned as a dictionary.)

Tag attributes are accessed exactly like dictionary entries.

>>> soup.a['href']
http://example.com/elsie
>>> soup.a['class']
['sister']

Tag attributes can also be added, removed, and modified:

>>> tag['class'] = 'verybold'
>>> del tag['class']
>>> print(tag.get('class'))
# None
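
The attribute operations above can be run end to end as follows. This is a sketch that assumes tag is the <b class="boldest"> element mentioned earlier, parsed with "html.parser" (an assumed parser choice):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

original = dict(tag.attrs)   # snapshot of the attributes before any change

tag["class"] = "verybold"    # modify an existing attribute
tag["id"] = "1"              # add a new attribute
modified = dict(tag.attrs)

del tag["class"]             # remove attributes one at a time
del tag["id"]
stripped = str(tag)

# get() returns None for a missing attribute, whereas tag["class"]
# would raise KeyError, just as with a dictionary.
missing = tag.get("class")

print(original, modified, stripped, missing)
```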

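Some attributes can hold more than one value; these are called multi-valued attributes. class is the most common one, which is why its value is returned as a list in the output above.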
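
A short sketch of the difference between a multi-valued attribute and an ordinary one, assuming "html.parser" and made-up markup:

```python
from bs4 import BeautifulSoup

# class is multi-valued, so its value is split on whitespace into a list.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
classes = css_soup.p["class"]

# id is not multi-valued, so its value stays a single string,
# even when it contains a space.
id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")
ident = id_soup.p["id"]

print(classes)
print(ident)
```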

2.2 NavigableString

>>> tag.string
# u'Extremely bold'
>>> type(tag.string)
# <class 'bs4.element.NavigableString'>

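To convert a NavigableString to a plain Unicode string, call unicode() on it (this listing is Python 2; in Python 3 the equivalent call is str()):

>>> unicode_string = unicode(tag.string)
>>> unicode_string
# u'Extremely bold'
>>> type(unicode_string)
# <type 'unicode'>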

The string inside a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method:

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

Because a string cannot contain other nodes, NavigableString does not support the .contents or .string attributes or the find() method.
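
The replacement shown above can be reproduced from scratch. A minimal sketch, assuming tag is a <blockquote> element and "html.parser" is the parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote

old = tag.string                          # the NavigableString currently in the tree
tag.string.replace_with("No longer bold") # swap a new string into its place

print(tag)          # the tag now wraps the new string
print(old)          # the old string still exists, detached from the tree
print(old.parent)   # it no longer has a parent
```

replace_with() changes the tree rather than the string object itself, which is why the old string keeps its original text.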

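If you want to use a NavigableString outside of Beautiful Soup, call unicode() on it (str() in Python 3) to turn it into an ordinary Unicode string. Otherwise the string keeps a reference to the whole Beautiful Soup parse tree even after you have finished with Beautiful Soup, which wastes memory.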
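
The conversion and the reference it drops can be checked directly. A sketch assuming Python 3, where str() takes the place of unicode(), and "html.parser" as the parser:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

nav = soup.b.string   # NavigableString: still linked to the parse tree
plain = str(nav)      # plain str: an independent copy of the text

print(type(nav).__name__, nav.parent.name)
print(type(plain).__name__, hasattr(plain, "parent"))
```

Keeping only plain lets the parse tree be garbage-collected once soup goes out of scope.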

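Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to have to deal with is Comment, which represents a comment in the document.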
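
One way to locate comments is to walk the tree and filter by type. This is a sketch with made-up markup, assuming "html.parser"; it is not taken from the original notes:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>Visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# .descendants yields every node under the root; keep only the comments.
comments = [node for node in soup.descendants if isinstance(node, Comment)]

# A Comment behaves like a string holding the text between <!-- and -->.
comment_texts = [str(comment).strip() for comment in comments]

print(comment_texts)
```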

More of Beautiful Soup's capabilities will be covered in later updates.
