2. Searching a document for tags with BeautifulSoup

Date: 2019-11-17


1. Searching for tags with find_all() (or findAll())

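In BeautifulSoup, find() and findAll() (spelled find_all() in current versions) form a pair of functions whose behavior is controlled by the arguments passed to them. The vaguer details from the book are skipped here; the specifics are all in the code comments.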

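Note that Python comments come in two forms: # for a single-line comment, and ''' multi-line comment ''' for a block of several lines. In the author's experiment, using several multi-line blocks together failed. A likely cause is that a triple-quoted block is really a string literal rather than a comment, so it must sit where a statement is allowed and follow the surrounding indentation.

Another scheme found online is to define a multi-line string with """ """ and use it as a comment.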
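
The comment styles described above can be sketched as follows. This is a minimal illustration, not taken from the original post: the function name describe is hypothetical, and the point is only that several triple-quoted blocks can coexist when each one is placed where a statement is legal and is indented like the code around it.

```python
# Single-line comment: everything after '#' on this line is ignored.

'''
First triple-quoted block. It is a string literal that is evaluated
and then discarded, so in practice it behaves like a comment.
'''

"""
Second triple-quoted block, this time written with double quotes.
"""


def describe():  # hypothetical function, used only for illustration
    """A triple-quoted string placed first in a function becomes its docstring."""
    '''Later blocks must follow the indentation of the function body.'''
    return "ok"


print(describe())
print(describe.__doc__)
```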

# Locate tags by a specific tag name and attribute
from urllib.request import urlopen
from bs4 import BeautifulSoup
html=urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj=BeautifulSoup(html,"html.parser")
nameList=bsObj.find_all("span",{"class":"green"})  # find every <span> whose class is "green",
                # i.e. find_all(tag, attrs)
for name in nameList:
    print(name.get_text())  # print the text inside each matching tag

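# Notes on find() and findAll()
# findAll(tag, attrs, recursive, text, limit, keywords)
# find(tag, attrs, recursive, text, keywords)
# 1. tag: a tag name, or several names: bsObj.findAll({"h1","h2","h3"}) matches any of them ("or")
# 2. attrs: a dictionary of attribute names and values: {"name":"value"}
#    several values for one attribute: bsObj.findAll("span", {"class":{"green","red"}})
# 3. recursive: True searches all descendants; False only looks at direct children. Default: True
# 4. text: searches by text content instead of tag properties, e.g. to count occurrences of "prince":
#        nameList=bsObj.findAll(text="prince")
#        print(len(nameList))
#    (a plain string is compared against a whole text node; newer versions of BeautifulSoup call this argument string)
# 5. limit: caps the number of results; find() is findAll() with limit=1, returning the element itself instead of a list
# 6. keywords: findAll() also accepts keyword arguments that select on attributes; several keywords combine as "and",
#    e.g. allText=bsObj.findAll(id="text")
#         print(allText[0].get_text())
# Note: anything done with a keyword can also be achieved in other ways. Also, class is a reserved word in Python, so the following is a syntax error:
#    bsObj.findAll(class="green")
# These work instead: bsObj.findAll(class_="green") or bsObj.findAll("", {"class":"green"})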
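
The arguments listed above can be tried without network access on a small inline document. The sketch below is an illustration under assumptions: SAMPLE_HTML and all variable names are made up for this example, and it uses the current spelling find_all() and the string argument (the newer name for text).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
SAMPLE_HTML = """
<html><body>
<h1>Title</h1>
<h2>Subtitle</h2>
<div id="text">
  <span class="green">Anna</span>
  <span class="red">the prince</span>
  <span class="green">Pierre</span>
</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 1. Several tag names at once ("or" relationship).
headings = [tag.name for tag in soup.find_all(["h1", "h2", "h3"])]

# 2. An attribute dictionary; a list of values for one attribute also means "or".
green = [s.get_text() for s in soup.find_all("span", {"class": "green"})]
green_or_red = soup.find_all("span", {"class": ["green", "red"]})

# 3. recursive=False only inspects direct children of the object searched,
#    so no <span> is found directly under the document root.
top_level_spans = soup.find_all("span", recursive=False)

# 4. string matches text nodes whose whole content equals the given value.
prince = soup.find_all(string="the prince")

# 5. limit caps the number of results; find() returns the first match itself.
first_two = soup.find_all("span", limit=2)
first_span = soup.find("span")

# 6. Keyword arguments filter on attributes; class needs a trailing underscore.
by_id = soup.find_all(id="text")
by_class = soup.find_all(class_="green")

print(headings, green, len(green_or_red), len(top_level_spans))
print(len(prince), len(first_two), first_span.get_text(), by_id[0].name, len(by_class))
```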

2. Using children/descendants

# Object types in BeautifulSoup: BeautifulSoup object, Tag object, NavigableString object,
#   Comment object (<!-- comment -->)
#   Direct tag access: bsObj.div.findAll("img") looks for <img> tags inside the first <div> of the document
from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj=BeautifulSoup(html,"html.parser")

for child in bsObj.find("table",{"id":"giftList"}).children:  # children yields the nodes directly under the table tag
    # children covers only the first level; to walk every level below it, use descendants
    print(child)
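
The difference between children and descendants is easier to see on a tiny table. The following is a minimal sketch, assuming a made-up SAMPLE_HTML that only imitates the shape of the giftList table; text nodes are filtered out so that only tags are listed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical HTML that imitates the shape of the giftList table.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td></tr>"
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# children: only the first level below the table.
child_names = [node.name for node in table.children if isinstance(node, Tag)]

# descendants: every level below the table, in document order.
descendant_names = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```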

3. Using sibling/siblings

# This example demonstrates sibling tags
from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj=BeautifulSoup(html,"html.parser")

for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    # This prints several sibling nodes. Note: 1. the siblings do not include the tr itself
    #   2. next_siblings walks forward; for the siblings before a node, use previous_siblings
    #   3. next_sibling and previous_sibling return a single node
    print(sibling)
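
The three notes in the comments above can be checked on a small inline table. This is a sketch under assumptions: SAMPLE_HTML and the id values on the rows are invented for the example, and the markup contains no whitespace between rows so that every sibling is a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; the row ids exist only to make the output readable.
SAMPLE_HTML = (
    '<table id="giftList">'
    '<tr id="header"><th>Item</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td></tr>'
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header = soup.find("table", {"id": "giftList"}).tr

# next_siblings walks forward and does not include the starting row.
after_header = [row["id"] for row in header.next_siblings]

# previous_siblings walks backward from the starting row.
last = soup.find("tr", {"id": "gift2"})
before_last = [row["id"] for row in last.previous_siblings]

# The singular forms return one node (or None when there is none).
single = header.next_sibling["id"]
nothing_before = header.previous_sibling

print(after_header, before_last, single, nothing_before)
```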

4. Using parent

# Find the parent of a tag
from urllib.request import urlopen
from bs4 import BeautifulSoup

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj=BeautifulSoup(html,"html.parser")

print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).
      parent.previous_sibling.get_text())
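
The navigation chain used above (image, then its parent cell, then the cell before it) can be followed step by step on an inline row. This is a minimal sketch: SAMPLE_HTML is a made-up row that assumes the price sits in the cell just before the image cell.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the price is assumed to be in the cell before the image cell.
SAMPLE_HTML = (
    "<table><tr>"
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td>'
    "</tr></table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
cell = img.parent                    # the <td> that contains the image
price_cell = cell.previous_sibling   # the <td> just before it
price = price_cell.get_text()

print(cell.name, price)
```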

5. Regular expressions

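# This program uses regular expressions
# regular expression = what to match + how many times
# Online tester: RegexPal
# A regular expression for email addresses: [A-Za-z0-9_+]+@[A-Za-z0-9]+\.(com|org|edu|net)
'''
    The 12 symbols used in regular expressions:
    1. *: zero or more occurrences
    2. +: one or more occurrences
    3. []: any character in the given range, e.g. [0-9]
    4. (): groups a subexpression, e.g. (a*b)*
    5. {m,n}: at least m and at most n occurrences
    6. [^]: any character not in the range, e.g. [^A-Za-z] matches a non-letter
    7. |: alternation, e.g. a|b|c matches a, or b, or c
    8. .: any single character
    9. ^: anchors the match at the start of the string, e.g. ^a means the string starts with a
    10. $: anchors the match at the end of the string, e.g. a$ means the last character is a
    11. \: escape character
    12. ?!: "does not contain" (negative lookahead), e.g. ^((?![A-Z]).)*$ matches strings with no uppercase letters
'''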
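
Two of the patterns quoted above, the email expression and the "no uppercase letters" expression, can be tried directly with Python's re module. The sketch below is illustrative only; the test strings are invented for the example.

```python
import re

# The email pattern quoted above, written as a raw string.
EMAIL = re.compile(r"[A-Za-z0-9_+]+@[A-Za-z0-9]+\.(com|org|edu|net)")

# The negative-lookahead pattern quoted above: no uppercase letter anywhere.
NO_UPPERCASE = re.compile(r"^((?![A-Z]).)*$")

# fullmatch requires the whole string to fit the pattern.
email_ok = EMAIL.fullmatch("some_user+tag@example.com") is not None
email_bad_tld = EMAIL.fullmatch("user@example.io") is not None
# A dot in the local part is rejected: '.' is not in the first character class.
email_with_dot = EMAIL.fullmatch("first.last@example.com") is not None

lower_ok = NO_UPPERCASE.match("no capitals here") is not None
upper_bad = NO_UPPERCASE.match("One Capital") is not None

print(email_ok, email_bad_tld, email_with_dot, lower_ok, upper_bad)
```

The third check shows that this email pattern is deliberately narrow: it is a teaching example, not a general-purpose address validator.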

# The program below uses a regular expression to describe image paths
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html=urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj=BeautifulSoup(html,"html.parser")

images=bsObj.findAll("img",{"src":re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
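
The same filter can be exercised offline. This is a sketch under assumptions: the four img tags in SAMPLE_HTML are invented to show which paths the pattern keeps and which it rejects.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical image tags: two product images, a logo and a banner.
SAMPLE_HTML = (
    '<img src="../img/gifts/logo.jpg">'
    '<img src="../img/gifts/img1.jpg">'
    '<img src="../img/gifts/img2.jpg">'
    '<img src="../img/banner.png">'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A raw string keeps the backslashes intact; '/' needs no escaping in Python.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# A compiled pattern can be used as an attribute value in the attrs dictionary.
sources = [img["src"] for img in soup.find_all("img", {"src": pattern})]

print(sources)
```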

6. Lambda expressions and other notes

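# A tag's attributes can be read directly through tag.attrs, e.g. an image's source: myImgTag.attrs['src']
# lambda expression: a function that is passed into another function as an argument, as in g(f(x),y)
# findAll() accepts a lambda, provided that the function takes a single tag argument
#                                  and returns a boolean
# BeautifulSoup passes every tag it encounters to the function and keeps the tags for which it returns True
# Example: soup.findAll(lambda tag:len(tag.attrs) == 2)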
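'''
Given the two tags: <div class="body" id="content"></div>
            <span style="color:red" class="title"></span>
            both are returned, because each has exactly two attributes.
The following libraries provide functionality similar to BeautifulSoup:
lxml
HTML parser
'''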
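
The lambda filter described above can be run on the two tags from the note. This is a minimal sketch: the third tag, with a single attribute, is an assumption added to show a tag that the filter rejects.

```python
from bs4 import BeautifulSoup

# The two tags from the note, plus a hypothetical <p> with only one attribute.
SAMPLE_HTML = (
    '<div class="body" id="content"></div>'
    '<span style="color:red" class="title"></span>'
    '<p class="only-one"></p>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The function receives each tag and returns True for the tags to keep.
two_attrs = soup.find_all(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in two_attrs]

# Attributes are available as a dictionary through tag.attrs.
span_style = soup.find("span").attrs["style"]

print(names, span_style)
```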