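Beautiful Soup is a Python library for extracting data from HTML and XML files. Its extraction capabilities are strong enough that this author (知识追寻者) gave up locating HTML nodes with regular expressions. Beautiful Soup can reach a node directly through its HTML tag name or through search functions, which greatly improves developer productivity. If you still cannot use Beautiful Soup after reading this article, not even all the gods and Buddhas in heaven can save you. If you find 知识追寻者's articles interesting, a follow and a like are appreciated.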
Beautiful Soup supports the following parsers:
Parser | Usage example |
---|---|
Python standard library | BeautifulSoup(markup, "html.parser") |
lxml HTML parser | BeautifulSoup(markup, "lxml") |
lxml XML parser | BeautifulSoup(markup, "xml") |
html5lib | BeautifulSoup(markup, "html5lib") |
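For this article, either the Python standard library parser or the lxml HTML parser will do. Keep in mind that whenever the text below says it "gets a tag", it is actually getting a tag object.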
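Since either parser works for this article, one option is to prefer lxml and fall back to the built-in parser when lxml is not installed. The following is only a minimal sketch: `make_soup` is a hypothetical helper name, not part of Beautiful Soup, and it assumes that a missing parser surfaces as `bs4.FeatureNotFound`.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper for illustration only.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

The two parsers can build slightly different trees for broken markup, so it is worth sticking to one parser within a project once it is chosen.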
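A brief summary of the attributes:
Attribute | Meaning |
---|---|
soup.tag.name | name of the tag |
soup.tag.string | text content of the tag |
soup.tag | the first matching tag object |
soup.tag.attrs | all attributes of the tag, as a dict |
soup.tag.attrs['class'] | value of the tag's class attribute |
soup.tag1.tag2 | the first tag2 inside tag1 |
soup.tag.contents | all direct children of the tag, as a list |
soup.tag.children | direct children of the tag, as an iterator |
soup.tag.descendants | all descendants of the tag, as a generator |
soup.tag.parent | direct parent node |
soup.tag.parents | all ancestor nodes, as a generator |
soup.tag.next_sibling | the next sibling node |
soup.tag.previous_sibling | the previous sibling node |
soup.tag.next_siblings | all following sibling nodes, as a generator |
soup.tag.previous_siblings | all preceding sibling nodes, as a generator |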
The prettify() method formats an HTML document:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
```
The output is below. It is neatly indented, the structure is easy to read, and the missing closing tags </form> and </div> have been filled in:

```
<div class="filter-box d-flex align-items-center">
 <form action="" id="seeOriginal">
  <dl class="filter-sort-box d-flex align-items-center">
   <dt>
    排序:
   </dt>
   <dd>
    <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">
     默认
    </a>
   </dd>
   <dd>
    <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
     <svg aria-hidden="true" class="icon">
      <use xlink:href="#csdnc-rss">
      </use>
     </svg>
     RSS订阅
    </a>
   </dd>
  </dl>
 </form>
</div>
```
soup.dt returns the first matching dt tag:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints the node <dt>排序:</dt>
print(soup.dt)
```
soup.dt.string returns the text contained in the dt tag:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints the text content: 排序:
print(soup.dt.string)
```
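One thing worth knowing about .string is that it only returns text when the tag has a single child. For a tag with mixed content, such as the RSS link in the sample document, it returns None, and get_text() is the usual alternative. A minimal sketch, using a trimmed-down copy of the sample HTML (the trimming is mine, for brevity):

```python
from bs4 import BeautifulSoup

html = """
<dl>
<dt>排序:</dt>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true"><use xlink:href="#csdnc-rss"></use></svg>RSS订阅</a>
</dd>
</dl>"""

soup = BeautifulSoup(html, "html.parser")

# <dt> has exactly one text child, so .string returns that text.
dt_text = soup.dt.string
print(dt_text)

# <a> has several children (whitespace, <svg>, text), so .string is None.
a_string = soup.a.string
print(a_string)

# get_text() joins all nested text; strip=True trims surrounding whitespace.
a_text = soup.a.get_text(strip=True)
print(a_text)
```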
soup.dt.name returns the name of the dt tag:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints dt
print(soup.dt.name)
```
After getting a tag directly, the built-in type() function shows that it is of type <class 'bs4.element.Tag'>:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
dt = soup.dt
# Prints <class 'bs4.element.Tag'>
print(type(dt))
```
soup.a.attrs returns all attributes of the first matching a tag:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.attrs)
```

The output is a dict holding every attribute of the first matching a tag:

```
{'href': 'javascript:void(0);', 'data-report-query': '', 'class': ['btn-filter-sort', 'active'], 'target': '_self'}
```
soup.a.attrs['href'] returns the value of the href attribute of the first matching a tag:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints javascript:void(0);
print(soup.a.attrs['href'])
```
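Going through .attrs is not the only way to read an attribute. As a small sketch of the alternatives, a tag can also be subscripted like a dict, and .get() avoids a KeyError when the attribute may be absent. The single-tag HTML string below is a shortened copy of the first link in the sample document:

```python
from bs4 import BeautifulSoup

html = '<a href="javascript:void(0);" class="btn-filter-sort active" target="_self">默认</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# Subscripting the tag is shorthand for going through .attrs.
href = link["href"]
same_value = href == link.attrs["href"]

# class is a multi-valued attribute, so it comes back as a list.
classes = link["class"]

# .get() returns None (or a supplied default) instead of raising KeyError.
missing = link.get("id")
fallback = link.get("id", "no-id")

print(href, classes, missing, fallback)
```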
soup.form.dd returns the first dd tag under the form tag:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.form.dd)
```

Output:

```
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
```
soup.form.contents returns all direct children of form as a list:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.form.contents)
```

Output:

```
['\n', <dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>]
```
soup.svg.children returns an iterator over all direct children of the svg tag:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
for index, child in enumerate(soup.svg.children):
    print(index, child)
```

Output (entries 0 and 2 are whitespace-only text nodes):

```
0 

1 <use xlink:href="#csdnc-rss"></use>
2 

```
soup.dl.descendants returns every descendant of the dl tag, not just its direct children:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
for index, child in enumerate(soup.dl.descendants):
    print(index, child)
```

Output (entries that show only an index are whitespace-only text nodes):

```
0 

1 <dt>排序:</dt>
2 排序:
3 

4 <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
5 <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a>
6 默认
7 

8 <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
9 <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
10 

11 <svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>
12 

13 <use xlink:href="#csdnc-rss"></use>
14 

15 RSS订阅
16 

17 

```
soup.a.parent returns the parent tag object of the first matching a tag:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.parent)
```

Output:

```
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
```
soup.a.parents returns a generator over all ancestors of the first matching a tag, starting with its direct parent:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
for node in soup.a.parents:
    if node is None:
        print(node)
    else:
        print(node.name)
```

Output:

```
dd
dl
form
div
[document]
```
Sibling nodes come with a gotcha: the result is usually whitespace, so they are not covered in depth here.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.dt.next_sibling)
```

The output is blank, because the next sibling of dt is the whitespace text node between </dt> and the following <dd>. The other sibling attributes are not demonstrated; on this document they return either whitespace or None.
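The whitespace gotcha can be sidestepped. As a hedged sketch (the small `<dl>` below is a made-up stand-in for the sample document), the find_next_sibling family filters by tag name and therefore skips text nodes, and the raw .next_siblings generator can be filtered with isinstance:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<dl>
<dt>排序:</dt>
<dd>first</dd>
<dd>second</dd>
</dl>"""

soup = BeautifulSoup(html, "html.parser")

# .next_sibling returns the whitespace text node between <dt> and <dd>.
raw = soup.dt.next_sibling
print(repr(raw))

# find_next_sibling("dd") matches tags by name, so whitespace is skipped.
first_dd = soup.dt.find_next_sibling("dd")
print(first_dd)

# find_next_siblings() collects every following sibling tag that matches.
all_dd = soup.dt.find_next_siblings("dd")
print([dd.string for dd in all_dd])

# Alternatively, keep only Tag objects from the raw sibling generator.
tag_siblings = [s for s in soup.dt.next_siblings if isinstance(s, Tag)]
print(len(tag_siblings))
```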
Everything up to this point only lays the groundwork; this section is the important one.
Function | Meaning |
---|---|
find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) | finds all matching nodes |
find(name=None, attrs={}, recursive=True, text=None, **kwargs) | finds the first matching node |
find_parent(name=None, attrs={}, **kwargs) | returns the nearest matching ancestor of the current node |
find_parents(name=None, attrs={}, **kwargs) | returns all matching ancestors of the current node |
find_next_sibling(name=None, attrs={}, text=None, **kwargs) | returns the first matching sibling after the current node |
find_next_siblings(name=None, attrs={}, text=None, **kwargs) | returns all matching siblings after the current node |
find_previous_sibling(name=None, attrs={}, text=None, **kwargs) | returns the first matching sibling before the current node |
find_previous_siblings(name=None, attrs={}, text=None, **kwargs) | returns all matching siblings before the current node |
find_next(name=None, attrs={}, text=None, **kwargs) | returns the first matching node after the current node |
find_all_next(name=None, attrs={}, text=None, limit=None, **kwargs) | returns all matching nodes after the current node |
find_previous(name=None, attrs={}, text=None, **kwargs) | returns the first matching node before the current node |
find_all_previous(name=None, attrs={}, text=None, limit=None, **kwargs) | returns all matching nodes before the current node |
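This section concentrates on find_all. find takes the same filters and differs only in returning the first match, so once you know one you can use the other.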
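The upward and backward search functions in the table follow the same calling convention as find_all. A minimal sketch, using a compact one-line version of the sample structure (the `href="#"` value is a placeholder of mine):

```python
from bs4 import BeautifulSoup

html = '<div><form id="seeOriginal"><dl><dt>排序:</dt><dd><a href="#">默认</a></dd></dl></form></div>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# find_parent() walks upward and returns the nearest matching ancestor.
form = link.find_parent("form")

# find_parents() returns every matching ancestor, nearest first.
ancestors = [tag.name for tag in link.find_parents(["dl", "form"])]

# find_previous() searches backward through the document.
previous_dt = link.find_previous("dt")

print(form["id"], ancestors, previous_dt.string)
```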
soup.find_all(name='dd') finds every dd tag object and returns them as a list:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(name='dd'))
```

Output:

```
[<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>, <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>]
```

Note: soup.find_all(name='dd') is equivalent to soup.find_all('dd').
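To make the relationship between find and find_all concrete, here is a small sketch on a made-up `<dl>` fragment. The point it illustrates is the return type: find_all always gives back a list (possibly empty), while find gives back a single tag or None.

```python
from bs4 import BeautifulSoup

html = """
<dl>
<dt>排序:</dt>
<dd>first</dd>
<dd>second</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

# find_all() always returns a list-like result, possibly empty.
all_dd = soup.find_all("dd")
no_table = soup.find_all("table")

# find() returns the first match, or None when nothing matches.
first_dd = soup.find("dd")
missing = soup.find("table")

print(len(all_dd), len(no_table), first_dd.string, missing)
```

Because find can return None, chaining an attribute access directly onto it (for example `soup.find("table").string`) raises AttributeError when nothing matches, so a None check is usually worthwhile.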
soup.find_all(attrs={'id':'seeOriginal'}) finds every tag object whose id attribute equals seeOriginal:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(attrs={'id': 'seeOriginal'}))
```

Output:

```
[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl></form>]
```
soup.find_all('dl', recursive=False) searches only the direct children of soup for dl tags. Because dl is nested inside div and form, nothing is found once recursive is set to False:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('dl', recursive=False))
```

The output is an empty list: []
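The empty result depends on where the search starts. A short sketch on a compact copy of the sample structure shows that recursive=False does find a tag when it is a direct child of the node being searched:

```python
from bs4 import BeautifulSoup

html = '<div><form id="seeOriginal"><dl><dt>排序:</dt></dl></form></div>'
soup = BeautifulSoup(html, "html.parser")

# <dl> is nested three levels down, so a non-recursive search from soup misses it.
top_level_dl = soup.find_all("dl", recursive=False)

# <div> is a direct child of the document, so it is found.
top_level_div = soup.find_all("div", recursive=False)

# Starting from <form>, <dl> is a direct child and is found.
form_dl = soup.form.find_all("dl", recursive=False)

print(len(top_level_dl), len(top_level_div), len(form_dl))
```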
soup.find_all('dd', limit=1) limits the result to a single match:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('dd', limit=1))
```

Output:

```
[<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>]
```
soup.find_all(id='seeOriginal') searches by passing the id attribute directly as a keyword argument:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(id='seeOriginal'))
```

Output:

```
[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl></form>]
```
soup.find_all(href=re.compile("java.*?")) finds tags whose href attribute matches the regular expression. Beautiful Soup applies the pattern with search(), so this pattern matches any href that contains java; anchor it with ^ if the value must start with java.

```python
# -*- coding: utf-8 -*-
import re

from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(href=re.compile("java.*?")))
```

Output:

```
[<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a>]
```
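The difference between an unanchored and an anchored pattern only shows up when "java" appears somewhere other than the start. The sketch below adds a second, made-up link (the example.com URL is hypothetical) to make that visible:

```python
import re

from bs4 import BeautifulSoup

html = """
<a href="javascript:void(0);">默认</a>
<a href="https://example.com/java-guide">guide</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Unanchored: the pattern is applied with search(), so "java" may appear anywhere.
anywhere = soup.find_all(href=re.compile("java"))

# Anchored with ^: only href values that start with "java" match.
prefix_only = soup.find_all(href=re.compile(r"^java"))

print(len(anywhere), len(prefix_only))
```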
soup.find_all("a", class_="btn") finds a tags whose class attribute includes the class btn:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all("a", class_="btn"))
```

Output:

```
[<a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>]
```
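Note why only the RSS link is returned above: class_ compares against each whole class name, not against substrings, so "btn" does not match "btn-filter-sort". A small sketch on shortened copies of the two links, which also shows one way to require several classes at once through a CSS selector:

```python
from bs4 import BeautifulSoup

html = """
<a class="btn-filter-sort active" href="javascript:void(0);">默认</a>
<a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">RSS订阅</a>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class name, not a substring:
# "btn" matches the second link but not "btn-filter-sort".
btn_links = soup.find_all("a", class_="btn")

# A CSS selector can require several classes at once, in any order.
rss_links = soup.select("a.rss.btn")

print(len(btn_links), len(rss_links))
```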
Beautiful Soup also supports searching with CSS selectors directly. The example below shows the most commonly used forms:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')

# Select dt tags nested under dl
lt = soup.select('dl dt')
print(lt)

dd = soup.select('dl dd')
print(dd[0])

# Search with an id selector
by_id = soup.select('#seeOriginal')
print(by_id)

# Search with a class selector
cla = soup.select('.btn-filter-sort')
print(cla[0])
```

The outputs, in order, are as follows.

soup.select('dl dt')

```
[<dt>排序:</dt>]
```

soup.select('dl dd')[0]

```
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
```

soup.select('#seeOriginal')

```
[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS订阅</a>
</dd>
</dl></form>]
```

soup.select('.btn-filter-sort')[0]

```
<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默认</a>
```
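A few more selector forms tend to come up in practice. The sketch below, on a trimmed copy of the sample form, uses select_one for a single result, an attribute-prefix selector, and the child combinator. It assumes a Beautiful Soup version whose select() supports these selectors, which is the case for current releases:

```python
from bs4 import BeautifulSoup

html = """
<form id="seeOriginal">
<dl>
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" href="javascript:void(0);">默认</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">RSS订阅</a></dd>
</dl>
</form>"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match (or None) instead of a list.
first_link = soup.select_one("#seeOriginal a")

# Attribute selector: links whose href starts with "https".
https_links = soup.select('a[href^="https"]')

# Child combinator: <dd> elements that are direct children of <dl>.
dd_children = soup.select("dl > dd")

print(first_link.get_text(), https_links[0]["href"], len(dd_children))
```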
Original source: https://www.cnblogs.com/zszxz/p/12208673.html