python-68: Getting the text of multiple tags with BS4

Date: 2019-11-15


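In the previous section we covered how to fetch a page's source and extract the article title, using soup.title.string. After looking through the page source, I found that most of the article's content sits inside <p>...</p> tags, like the ones shown here, so now I want to pull all of that content out and see what comes back.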

<p>If you're using a recent version of Debian or Ubuntu, you can install Beautiful Soup with the system package manager:</p>
<p><code class="docutils literal"><span class="pre">$</span> <span class="pre">apt-get</span> <span class="pre">install</span> <span class="pre">python-bs4</span></code></p>
<p>Beautiful Soup 4 is published through PyPI, so if you can't install it with the system package manager, you can install it with <code class="docutils literal"><span class="pre">easy_install</span></code> or <code class="docutils literal"><span class="pre">pip</span></code>. The package name is <code class="docutils literal"><span class="pre">beautifulsoup4</span></code>, and the same package works on Python 2 and Python 3.</p>
<p><code class="docutils literal"><span class="pre">$</span> <span class="pre">easy_install</span> <span class="pre">beautifulsoup4</span></code></p>
<p><code class="docutils literal"><span class="pre">$</span> <span class="pre">pip</span> <span class="pre">install</span> <span class="pre">beautifulsoup4</span></code></p>
<p>(There is also a package named <code class="docutils literal"><span class="pre">BeautifulSoup</span></code> on PyPI, but that's probably not what you want. It is the <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Beautiful Soup 3</a> release. Many projects still use BS3, so the <code class="docutils literal"><span class="pre">BeautifulSoup</span></code> package remains available. But if you're writing a new project, you should install <code class="docutils literal"><span class="pre">beautifulsoup4</span></code>.)</p>
<p>If you don't have <code class="docutils literal"><span class="pre">easy_install</span></code> or <code class="docutils literal"><span class="pre">pip</span></code> installed, you can <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/download/4.x/">download the BS4 source code</a> and install it with setup.py.</p>
<p><code class="docutils literal"><span class="pre">$</span> <span class="pre">python</span> <span class="pre">setup.py</span> <span class="pre">install</span></code></p>
<p>If none of the installation methods above work, Beautiful Soup's license allows you to package the BS4 code with your project, so you can use it without installing it.</p>
<p>The author develops Beautiful Soup under Python 2.7 and Python 3.2; in theory Beautiful Soup should work with all current Python versions.</p>

Following the earlier approach, I figured the code should look like this:

print(soup.p.string)

But the result was None.

That surprised me, so I changed the code to print just the tag itself:

print(soup.p)

However I looked at it, the output seemed to contain only a single tag, which left me puzzled. Why?

Let's tackle these one at a time. First: why does soup.p return only one tag? The Beautiful Soup documentation explains:

Using a tag name as an attribute gives you only the first tag by that name:
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need all the <a> tags, or anything more than the first tag with a given name, you'll need one of the methods described in Searching the tree, such as find_all():
soup.find_all('a')
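To make the difference concrete, here is a minimal sketch. The markup is a made-up sample (an assumption for illustration, not the page from this article), and I use Python's built-in "html.parser" so nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the page scraped in this article.
html = """
<p>first paragraph</p>
<p>second paragraph</p>
<p>third paragraph</p>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.p               # attribute access: only the first <p>
all_p = soup.find_all("p")   # every <p>, in document order

print(first)        # <p>first paragraph</p>
print(len(all_p))   # 3
```

Attribute access is handy when you know there is exactly one tag of interest (like title), while find_all() is the one to reach for when the tag repeats.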

In that case, our code should be changed to:

print(soup.find_all('p'))

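This time the output contains all of the <p> tags, and the brackets enclosing it suggest that the return value is a list. We'll leave the details of that aside for now and turn to the second question: why is the result None when we use soup.p.string?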
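A quick way to check the "it's a list" guess is to ask Python directly. The sketch below again uses a made-up two-paragraph snippet (an assumption, not the real page). As far as I know, what find_all() returns is a bs4 ResultSet, which behaves like a regular list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = "<div><p>one</p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")

print(isinstance(paragraphs, list))  # True
print(paragraphs)                    # [<p>one</p>, <p>two</p>]

# Because it behaves like a list, it can be indexed and looped over.
for p in paragraphs:
    print(p.string)
```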

If a tag has only one child, and that child is a NavigableString, the child is available as .string:
title_tag.string
# 'The Dormouse's story'

If a tag's only child is another tag, and that child tag has a .string, then the parent tag is considered to have the same .string as its child:
head_tag.contents
# [<title>The Dormouse's story</title>]
head_tag.string
# 'The Dormouse's story'

If a tag contains more than one child, it is not clear which child .string should refer to, so .string is None:
print(soup.html.string)
# None
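The three cases can be tried side by side. This is only a sketch on a made-up document (an assumption, not the real page). The second <p> mixes text with a <code> tag, which is one plausible way a paragraph ends up with .string equal to None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """<html><head><title>The Dormouse's story</title></head>
<body><p>plain text only</p>
<p>text with <code>pip</code> inside</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

plain, mixed = soup.find_all("p")

print(soup.title.string)  # one string child      -> The Dormouse's story
print(soup.head.string)   # one tag child (title) -> The Dormouse's story
print(plain.string)       # one string child      -> plain text only
print(mixed.string)       # several children      -> None
```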

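The documentation is already detailed enough that anything I add would be redundant, so here is a one-sentence summary:

soup.p gets only the first p tag, and soup.p.string cannot get the text inside multiple tags.

Clearly this doesn't do what we need. We want the text from multiple tags, so we have to look for another approach. Fortunately the hint has already been given above: find_all(). So let's take a look at what it actually is.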