Learn Beautiful Soup (7): Output in BeautifulSoup

Date: 2019-11-13


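BeautifulSoup can do more than search, navigate, and modify document content: it can also output that content in a readable format. BeautifulSoup supports two kinds of output:

Formatted output
Unformatted output

Formatted output

BeautifulSoup has a built-in method, prettify(), for formatted output. For example: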

from bs4 import BeautifulSoup

html_markup = """<p class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>"""
soup = BeautifulSoup(html_markup,"lxml")
print(soup.prettify())

Output:

prettify() can be called on a BeautifulSoup object as well as on any Tag object. For example:

producer_entry = soup.ul
print(producer_entry.prettify())
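
To make the effect of prettify() concrete, here is a minimal sketch on a much smaller fragment. The markup and variable names are made up for illustration, and html.parser is used instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment; html.parser is an assumption made here
# so the sketch runs without lxml installed.
soup = BeautifulSoup("<ul><li>plants</li></ul>", "html.parser")

# prettify() puts each tag and each string on its own line and indents
# nested content by one space per level.
pretty = soup.ul.prettify()
print(pretty)
```

Each nesting level adds one space of indentation, which is why the text inside the li element ends up two spaces in.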

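Unformatted output

Use str() for unformatted output (on Python 2, unicode() is available as well).

If we call str() on a BeautifulSoup object or a Tag object, we get the markup as an ordinary string, with no added indentation or line breaks.

We can also use the encode() method covered in the previous article to get the output in a specific encoding.

Call decode() on a BeautifulSoup object or a Tag object to get a Unicode string.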
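
The three unformatted output calls can be compared side by side. This is a minimal sketch assuming Python 3; the sample markup is invented, and html.parser stands in for lxml only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing one non-ASCII character.
soup = BeautifulSoup('<div class="name">caf\u00e9</div>', "html.parser")
tag = soup.div

as_text = str(tag)                  # markup as a plain str, no reformatting
as_unicode = tag.decode()           # same markup, also a Unicode str
as_bytes = tag.encode("utf-8")      # bytes in the requested encoding
as_latin1 = tag.encode("latin-1")   # same markup, different byte encoding

print(as_text)
print(as_bytes)
print(as_latin1)
```

On Python 3, str() and decode() give the same text; encode() is the call to reach for when bytes are needed, for example when writing to a file opened in binary mode.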

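Output formatters in BeautifulSoup

HTML entities can be placed in an HTML document to represent special characters and symbols. Many of these symbols are not on the keyboard, and the entity codes only appear as the actual characters once a browser renders the page.

On output, only three characters are treated specially: >, < and &, which are written back as the entities &gt;, &lt; and &amp;. Every other entity is converted to its Unicode character when the BeautifulSoup object is created, and prettify() and the other output methods emit those characters as they are (UTF-8 encoded when bytes are requested) rather than as entities.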

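# each character appears twice: as a named entity and as a numeric reference
html_markup = """<html>
<body>&amp; &#38; ampersand
&cent; &#162; cent
&copy; &#169; copyright
&divide; &#247; divide
&gt; &#62; greater than
</body>
</html>
"""
soup = BeautifulSoup(html_markup,"lxml")
print(soup.prettify())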

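Output:

As you can see, two of them, & and >, are still written as entities rather than as characters. BeautifulSoup uses output formatters to control this behavior. The following formatter types are available: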

minimal
html
None
function

Any of these formatters can be passed to the output methods, such as prettify(), encode(), and decode().
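
The three built-in formatter values can be compared on a single tag. This is a minimal sketch with an invented one-line fragment; html.parser is used instead of lxml only to avoid an extra dependency, and decode() is used instead of prettify() so each result stays on one line.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with four entities.
markup = "<p>&amp; &cent; &copy; &gt;</p>"
soup = BeautifulSoup(markup, "html.parser")

# The same tag rendered with each built-in formatter.
minimal_out = soup.p.decode(formatter="minimal")  # escapes only &, < and >
html_out = soup.p.decode(formatter="html")        # uses named entities where possible
none_out = soup.p.decode(formatter=None)          # no escaping at all

print(minimal_out)
print(html_out)
print(none_out)
```

Comparing the three printed lines shows which characters each formatter turns back into entities.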

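minimal formatter

In this mode strings are processed only as much as needed to produce valid HTML. This is the default formatter, so the output is the same as above: &, > and < stay as entities and are not converted to characters.

html formatter

In this mode BeautifulSoup converts Unicode characters to HTML entities wherever a named entity exists.

print(soup.prettify(formatter="html"))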

Output:

None formatter

In this case BeautifulSoup does not modify strings at all on output. This can produce invalid HTML.

print(soup.prettify(formatter=None))

Output:

Function formatter

We can also define a function to process the strings, for example one that removes the character a.

def remove_chara(markup):
    return markup.replace("a","")
                 
soup = BeautifulSoup(html_markup,"lxml")
print(soup.prettify(formatter=remove_chara))

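Output:

Note that the character a has been removed, and also that &, > and < are no longer escaped: a formatter function replaces the default entity handling, so these characters are written out literally.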
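
If the escaping of &, > and < should be kept while still applying a custom transformation, one option is to call the entity substitution helper inside the function. This is a minimal sketch: the function name and sample markup are made up, html.parser is used only to avoid the lxml dependency, and it relies on EntitySubstitution.substitute_xml from bs4.dammit.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution

def remove_a_keep_escaping(text):
    # Hypothetical formatter: drop every "a", then escape &, < and >
    # so the result is still valid markup.
    return EntitySubstitution.substitute_xml(text.replace("a", ""))

# Invented fragment for illustration.
soup = BeautifulSoup("<p>a &amp; b &gt; banana</p>", "html.parser")
result = soup.p.decode(formatter=remove_a_keep_escaping)
print(result)
```

The function receives each string after entities have been decoded, so the escaping has to be the last step it performs.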

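Using get_text()

Extracting the text from a web page is a common task, and BeautifulSoup provides the get_text() method for it.

If we only want the text content of a BeautifulSoup object or a Tag object, we can call get_text(). For example: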

html_markup = """<p class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>"""
soup = BeautifulSoup(html_markup,"lxml")
print(soup.get_text())

Output:

plants
100000

algae
100000

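get_text() returns the text content of a BeautifulSoup object or a Tag object as a Unicode string. One problem with get_text() is that it can also return JavaScript code. (Since Beautiful Soup 4.9.0 the contents of script tags are generally no longer treated as text when the lxml or html.parser builder is used, so this mainly affects older versions.)

JavaScript code can be removed as follows: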

[x.extract() for x in soup_packtpage.find_all('script')]

This removes all script elements.
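
The removal step and get_text() can be put together as follows. This is a minimal sketch with an invented page; html.parser is used only to avoid the lxml dependency, and a plain loop replaces the list comprehension because the returned elements are not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical page with one script element.
page = """<html><head><script>var x = 1;</script></head>
<body><p>plants</p><p>algae</p></body></html>"""
soup = BeautifulSoup(page, "html.parser")

# Detach every script element from the tree before extracting text.
for script in soup.find_all("script"):
    script.extract()

# separator and strip tidy up the whitespace between text fragments.
text = soup.get_text(separator=" ", strip=True)
print(text)
```

Removing the script elements explicitly keeps the result the same across Beautiful Soup versions, whether or not the installed version already skips script content.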