## Example 1: Removing script tags
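
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Python 2 with BeautifulSoup 3
    from BeautifulSoup import BeautifulSoup

    html = '''
    <script>a</script>
    baba
    <script>b</script>
    <h1>hi, world</h1>
    '''

    # Note: the markup is parsed from this literal string; `html` above is not used
    soup = BeautifulSoup('<script>a</script>baba<script>b</script><h1>')
    # extract() detaches each <script> element, together with its contents, from the tree
    [s.extract() for s in soup('script')]
    print soup
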
Output:

    baba<h1></h1>

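The same approach removes any other tag together with its contents. Calling `soup(...)` is shorthand for `soup.findAll(...)`, so the line

    [s.extract() for s in soup('script')]

can equivalently be written as:

    [s.extract() for s in soup.findAll('script')]
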
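For readers on Python 3 without BeautifulSoup installed, the same idea (drop an element and everything inside it) can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the BeautifulSoup API: `TagStripper` and `strip_tags` are hypothetical names chosen for illustration, the sketch assumes every dropped element has a closing tag, and unlike BeautifulSoup it does not repair unclosed tags or preserve comments.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Copy HTML to the output, skipping chosen elements and their contents.

    Hypothetical helper for illustration; it is not a BeautifulSoup API.
    """

    def __init__(self, drop=("script",)):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.drop = set(drop)
        self.depth = 0   # > 0 while inside an element being dropped
        self.parts = []  # fragments of the cleaned document

    def handle_starttag(self, tag, attrs):
        if tag in self.drop:
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no contents to skip
        if tag not in self.drop and self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.drop:
            self.depth = max(self.depth - 1, 0)
        elif self.depth == 0:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append("&#%s;" % name)


def strip_tags(html, drop=("script",)):
    """Return `html` with every element named in `drop` removed."""
    parser = TagStripper(drop)
    parser.feed(html)
    parser.close()  # process any input still buffered
    return "".join(parser.parts)


cleaned = strip_tags("<script>a</script>baba<script>b</script><h1>hi, world</h1>")
print(cleaned)  # prints: baba<h1>hi, world</h1>
```

Passing a different `drop` tuple, for example `drop=("style",)`, removes other elements the same way.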
## Example 2: Removing comments

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Python 2 with BeautifulSoup 3
    from BeautifulSoup import BeautifulSoup, Comment

    data = """<div class="foo">
    cat dog sheep goat
    <!--
    <p>test</p>
    -->
    </div>"""

    soup = BeautifulSoup(data)
    # Select every text node that is a Comment and detach it from the tree
    for element in soup(text=lambda text: isinstance(text, Comment)):
        element.extract()
    print soup.prettify()

Output:

    <div class="foo"> cat dog sheep goat </div>
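
The filter in this example works because BeautifulSoup represents a comment as a `Comment` object, which is a subclass of its ordinary text-node class `NavigableString`, so `isinstance` can tell the two apart. The sketch below mimics that relationship in Python 3 with simplified stand-in classes. They are assumptions for illustration, not the library's own code, and the `nodes` list is a hand-written approximation of the text nodes inside the `<div>`.

```python
class NavigableString(str):
    """Simplified stand-in for BeautifulSoup's text-node class."""


class Comment(NavigableString):
    """Simplified stand-in for BeautifulSoup's Comment class."""


# Hand-written approximation of the text nodes inside <div class="foo">
nodes = [
    NavigableString("\ncat dog sheep goat\n"),
    Comment("\n<p>test</p>\n"),
    NavigableString("\n"),
]


def is_comment(text):
    """Same predicate as the lambda passed to soup(text=...)."""
    return isinstance(text, Comment)


comments = [n for n in nodes if is_comment(n)]
kept = [n for n in nodes if not is_comment(n)]

print(len(comments))          # prints: 1
print("".join(kept).split())  # prints: ['cat', 'dog', 'sheep', 'goat']
```

Because a comment is still a kind of text node, a plain text search would also return it; the `isinstance` check is what narrows the match to comments only.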