Python Web Scraping: Modifying Web Page Content with BeautifulSoup

Date: 2019-11-11

Tags: python, web scraping, beautifulsoup, modifying web page content

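Besides searching and navigating a web page, BeautifulSoup can also modify it. Modifying means adding or removing tags, renaming tags, changing tag attributes, changing text content, and so on.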

Modifying tags with BeautifulSoup

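In BeautifulSoup every tag is represented as a Tag object, and this object supports the following operations:

Changing the tag name
Changing the tag attributes
Adding new tags
Deleting existing tags
Changing the tag's text content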

Changing a tag's name

To rename a tag, just assign a new value to its .name attribute.

producer_entries.name = "div"
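To see the effect of renaming, here is a minimal, self-contained sketch. The markup is a made-up stand-in for the article's html_markup, and the built-in html.parser is used instead of lxml only to avoid an extra dependency; both are assumptions of this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the article's original html_markup.
html = '<ul id="producers"><li>plants</li></ul>'
soup = BeautifulSoup(html, "html.parser")

producer_entries = soup.ul
producer_entries.name = "div"  # the <ul> element becomes a <div>

renamed = str(soup)
print(renamed)  # <div id="producers"><li>plants</li></div>
```

Attributes and children are untouched; only the element name changes, so soup.ul no longer finds anything afterwards.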

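Changing a tag's attributes

This covers attributes such as class, id, and style. Attributes are stored in dictionary form, so changing them works just like manipulating a Python dictionary.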

Updating an existing attribute

For example:

producer_entries['id']="producers_new_value"

Adding a new attribute to a tag

If a tag has no class attribute, for instance, one can be added like this:

producer_entries['class']='newclass'

Deleting a tag attribute

Use the del operator, for example:

del producer_entries['class']
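The three attribute operations above can be tried together in one short sketch. The one-line markup and the choice of html.parser are assumptions made for this example, not part of the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, for illustration only.
soup = BeautifulSoup('<ul id="producers"></ul>', "html.parser")
producer_entries = soup.ul

producer_entries["id"] = "producers_new_value"  # update an existing attribute
producer_entries["class"] = "newclass"          # add a new attribute
after_add = dict(producer_entries.attrs)

del producer_entries["class"]                   # delete an attribute
after_delete = dict(producer_entries.attrs)

print(after_add)     # {'id': 'producers_new_value', 'class': 'newclass'}
print(after_delete)  # {'id': 'producers_new_value'}
```

Because .attrs is an ordinary dictionary, producer_entries.get("class") can be used to read an attribute that might be missing without raising KeyError.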

Adding a new tag

BeautifulSoup provides the new_tag() method for creating a new tag. The new tag can then be placed in the tree with append(), insert(), insert_after(), or insert_before().

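Adding a new producer with new_tag() and append()

Continuing the earlier example, besides plants and algae we now add a third producer, phytoplankton. First we need to create an li tag.

Creating a new tag with new_tag()

new_tag() can only be called on a BeautifulSoup object. Now create an li tag: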

soup = BeautifulSoup(html_markup,"lxml")
new_li_tag = soup.new_tag("li")

The only required argument of new_tag() is the tag name; attribute values and any other arguments are optional. For example:

new_atag=soup.new_tag("a",href="www.example.com")

new_li_tag.attrs={'class':'producerlist'}

Adding the new tag with append()

append() adds the new tag at the end of .contents, just like the append() method of a Python list.

producer_entries = soup.ul
producer_entries.append(new_li_tag)

The li tag is now a child of the ul tag. The output after adding the new tag:

<ul id="producers">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
</li>
</ul>
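The new_tag() and append() steps can be reproduced end to end with a small sketch. The shortened markup below is a hypothetical stand-in for html_markup, and html.parser is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the producers list.
html = (
    '<ul id="producers">'
    '<li class="producerlist">plants</li>'
    '<li class="producerlist">algae</li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

# new_tag() is called on the BeautifulSoup object, not on a Tag.
new_li_tag = soup.new_tag("li")
new_li_tag["class"] = "producerlist"

# append() places the new tag after the existing children of <ul>.
soup.ul.append(new_li_tag)

li_count = len(soup.find_all("li"))
last_child = soup.ul.contents[-1]
print(li_count)         # 3
print(str(last_child))  # <li class="producerlist"></li>
```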

Adding new div tags to the li tag with insert()

append() always adds the new tag at the end of .contents, whereas insert() requires us to specify the position at which to insert, just like the insert() method of a Python list.

new_div_name_tag=soup.new_tag("div")
new_div_name_tag["class"]="name"
new_div_number_tag=soup.new_tag("div")
new_div_number_tag["class"]="number"

This first creates the two div tags.

new_li_tag.insert(0,new_div_name_tag)
new_li_tag.insert(1,new_div_number_tag)
print(new_li_tag.prettify())

They are then inserted at positions 0 and 1 of the li tag, and the result is printed.
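To make the role of the position argument visible, the following sketch inserts both div tags at index 0, in reverse order, and still ends up with the name div first. The starting markup and html.parser are assumptions of this example.

```python
from bs4 import BeautifulSoup

# Hypothetical starting point: an empty <li>.
soup = BeautifulSoup('<li class="producerlist"></li>', "html.parser")
new_li_tag = soup.li

new_div_name_tag = soup.new_tag("div")
new_div_name_tag["class"] = "name"
new_div_number_tag = soup.new_tag("div")
new_div_number_tag["class"] = "number"

# Each insert at index 0 pushes earlier children one position back.
new_li_tag.insert(0, new_div_number_tag)
new_li_tag.insert(0, new_div_name_tag)

result = str(new_li_tag)
print(result)
# <li class="producerlist"><div class="name"></div><div class="number"></div></li>
```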

Changing string content

In the example above we only added tags, and those tags have no content. BeautifulSoup can add content as well.

Changing string content with .string

For example:

new_div_name_tag.string="phytoplankton"
print(producer_entries.prettify())

In the printed output, the new name div now contains the text phytoplankton.

Adding strings with append(), insert(), and new_string()

append() and insert() work for strings just as they do for new tags. For example:

new_div_name_tag.append("producer")
print(soup.prettify())

Output:

<html>
<body>
<div class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
<div class="number">
100000
</div>
</li>
<li class="producerlist">
<div class="name">
phytoplankton
producer
</div>
<div class="number">
</div>
</li>
</ul>
</div>
</body>
</html>

There is also the new_string() method:

new_string_toappend = soup.new_string("producer")
new_div_name_tag.append(new_string_toappend)
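The difference between assigning .string and appending strings is easiest to see side by side. This is a small sketch on a hypothetical one-tag document, parsed with html.parser as an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical empty <div>, for illustration only.
soup = BeautifulSoup('<div class="name"></div>', "html.parser")
div = soup.div

div.string = "phytoplankton"           # sets the tag's only content
div.append(" producer")                # a plain str is added as text
div.append(soup.new_string(" (new)"))  # explicit NavigableString
appended_text = div.get_text()
print(appended_text)  # phytoplankton producer (new)

# Assigning .string again discards everything that was inside the tag.
div.string = "algae"
reset_markup = str(div)
print(reset_markup)   # <div class="name">algae</div>
```

In short, append() adds to what is already there, while .string replaces it.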

Removing a tag from the page

Tags can be removed with the decompose() and extract() methods.

Removing a producer with decompose()

We now remove the div tag with class="name" using the decompose() method.

third_producer = soup.find_all("li")[2]
div_name = third_producer.div
div_name.decompose()
print(third_producer.prettify())

The printed li tag no longer contains the name div. decompose() removes a tag together with all of its children.
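A compact way to confirm this behavior is sketched below. The markup and the number 100 are invented for the example, and html.parser is assumed instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical producer entry; the values are made up.
html = (
    '<li class="producerlist">'
    '<div class="name">phytoplankton</div>'
    '<div class="number">100</div>'
    '</li>'
)
soup = BeautifulSoup(html, "html.parser")

third_producer = soup.li
div_name = third_producer.div  # first <div> child, i.e. class="name"
div_name.decompose()           # removes the tag and its text child

remaining = str(third_producer)
print(remaining)  # <li class="producerlist"><div class="number">100</div></li>
```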

Removing a producer with extract()

extract() removes a tag or string from an HTML document and also returns a handle to the removed tag or string. Unlike decompose(), extract() can also be used on strings.

third_producer_removed=third_producer.extract()
print(soup.prettify())
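The returned handle is what distinguishes extract() from decompose(). The following sketch, on hypothetical markup parsed with html.parser, extracts first a tag and then a string.

```python
from bs4 import BeautifulSoup

# Hypothetical two-item list.
soup = BeautifulSoup('<ul><li>plants</li><li>algae</li></ul>', "html.parser")

# Extract a tag: it leaves the tree but stays usable on its own.
removed = soup.find_all("li")[1].extract()
removed_markup = str(removed)
after_tag_removal = str(soup)
print(removed_markup)     # <li>algae</li>
print(after_tag_removal)  # <ul><li>plants</li></ul>

# Extract a string: the text node of the remaining <li>.
text_node = soup.li.string.extract()
after_text_removal = str(soup)
print(text_node)           # plants
print(after_text_removal)  # <ul><li></li></ul>
```

Since the extracted tag is still a valid Tag object, it can be appended somewhere else in the tree later.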

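Deleting a tag's contents with BeautifulSoup

A tag can have NavigableString objects or Tag objects as children. These children can be removed with clear().

For example, we can remove the div tag containing plants and the corresponding div tag with class="number".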

li_plants=soup.li

li_plants.clear()

After this call, all of the content inside the first li tag has been removed.
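The sketch below shows that clear() empties a tag but keeps the tag itself, attributes included. The markup is a shortened, hypothetical version of the producers list, and html.parser is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened producers list.
html = (
    '<ul>'
    '<li class="producerlist">'
    '<div class="name">plants</div><div class="number">100000</div>'
    '</li>'
    '<li>algae</li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

li_plants = soup.li  # the first <li>
li_plants.clear()    # drop all children, keep the <li> and its attributes

cleared = str(soup)
print(cleared)  # <ul><li class="producerlist"></li><li>algae</li></ul>
```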

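Special functions for modifying content

Besides the methods we have seen so far, BeautifulSoup has other methods for modifying content.

The insert_after() and insert_before() methods:

These two methods insert tags or strings before or after a tag or string. Their arguments are the NavigableString or Tag objects to insert.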

soup = BeautifulSoup(html_markup,"lxml")
div_number = soup.find("div",class_="number")
div_ecosystem = soup.new_tag("div")
div_ecosystem['class'] = "ecosystem"
div_ecosystem.append("soil")
div_number.insert_after(div_ecosystem)
print(soup.prettify())

Output (excerpt):

<html>
<body>
<div class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">
plants
</div>
<div class="number">
100000
</div>
<div class="ecosystem">
soil
</div>
</li>
<li class="producerlist">
<div class="name">
algae
</div>
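insert_before() works the same way on the other side of the element. The sketch below uses both methods around one div; the markup is hypothetical and html.parser is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical entry that only has a number so far.
soup = BeautifulSoup('<li><div class="number">100000</div></li>', "html.parser")
div_number = soup.find("div", class_="number")

# New sibling placed after the number div.
div_ecosystem = soup.new_tag("div")
div_ecosystem["class"] = "ecosystem"
div_ecosystem.string = "soil"
div_number.insert_after(div_ecosystem)

# New sibling placed before the number div.
div_name = soup.new_tag("div")
div_name["class"] = "name"
div_name.string = "plants"
div_number.insert_before(div_name)

siblings = str(soup)
print(siblings)
# <li><div class="name">plants</div><div class="number">100000</div><div class="ecosystem">soil</div></li>
```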

The replace_with() method:

This method replaces an existing tag or string with a new tag or string. It takes a Tag object or a string as input and returns a handle to the tag or string that was replaced.

soup = BeautifulSoup(html_markup,"lxml")
div_name = soup.find("div", class_="name")  # soup.div would be the outer ecopyramid div, whose .string is None
div_name.string.replace_with("phytoplankton")
print(soup.prettify())

replace_with() can likewise be used to replace a tag entirely.
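Both uses, replacing a string and replacing a whole tag, are sketched here. The markup, the span replacement, and its "n/a" text are made up for illustration, and html.parser is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical producer entry.
html = '<li><div class="name">plants</div><div class="number">100000</div></li>'
soup = BeautifulSoup(html, "html.parser")

# Replace a string; the old string is returned.
div_name = soup.find("div", class_="name")
old_text = div_name.string.replace_with("phytoplankton")
print(old_text)  # plants

# Replace a whole tag; the old tag is returned.
placeholder = soup.new_tag("span")
placeholder.string = "n/a"
old_div = soup.find("div", class_="number").replace_with(placeholder)
old_div_markup = str(old_div)
print(old_div_markup)  # <div class="number">100000</div>

replaced = str(soup)
print(replaced)  # <li><div class="name">phytoplankton</div><span>n/a</span></li>
```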

The wrap() and unwrap() methods:

wrap() wraps a tag or string in another tag. For example, each li tag can be wrapped in a new div tag.

li_tags = soup.find_all("li")
for li in li_tags:
    new_divtag = soup.new_tag("div")
    li.wrap(new_divtag)
print(soup.prettify())

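unwrap() does the opposite of wrap(): it removes a tag but keeps its contents in place. Like replace_with(), unwrap() returns a handle to the tag that was removed.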
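A round trip makes the relationship between the two methods concrete: wrapping every li and then unwrapping every div should restore the original markup. The markup is hypothetical and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-item list.
original = '<ul><li>plants</li><li>algae</li></ul>'
soup = BeautifulSoup(original, "html.parser")

# Wrap each <li> in its own new <div>.
for li in soup.find_all("li"):
    li.wrap(soup.new_tag("div"))
wrapped = str(soup)
print(wrapped)  # <ul><div><li>plants</li></div><div><li>algae</li></div></ul>

# Unwrap each <div>: the tag goes away, its children stay in place.
for div in soup.find_all("div"):
    div.unwrap()
unwrapped = str(soup)
print(unwrapped)  # <ul><li>plants</li><li>algae</li></ul>
```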