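BeautifulSoup is a class.
b = BeautifulSoup(html)
The b object has various methods and attributes related to the structure of the HTML.
a = b.findAll('a') returns a list of tag objects, one for each <a> tag.
Each of those tag objects in turn has its own methods and properties for working with the tag's attributes.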
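As a minimal sketch of the objects described above, the following parses a small inline HTML string and reads attributes from the resulting tags. The sample markup and the variable names are made up for illustration; the sketch assumes the bs4 package is installed and uses the standard-library html.parser backend.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real page
sample_html = '<div><a href="/feed.html">Feed</a><a name="top">Top</a></div>'

b = BeautifulSoup(sample_html, 'html.parser')
tags = b.find_all('a')  # find_all is the current spelling of findAll

first = tags[0]
first_href = first.attrs['href']   # attrs is a dict of the tag's attributes
first_text = first.get_text()      # text content inside the tag
second_href = tags[1].get('href')  # .get returns None when the attribute is missing

print(len(tags), first_href, first_text, second_href)
```

Using .get('href') instead of attrs['href'] avoids a KeyError for tags that have no href, which is one way to sidestep the exception handling otherwise needed when looping over links.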
Getting all the links on a web page:
from bs4 import BeautifulSoup
import urllib.request

url = 'http://news.163.com/'

# Fetch the page HTML
html = urllib.request.urlopen(url).read()
html = html.decode('gbk')

# Extract the href values with BeautifulSoup
a = BeautifulSoup(html, 'html.parser').findAll('a')
count = 0
err_a_list = []
for i in a:
    try:
        if i and i.attrs['href'][0] != 'j':  # skip href="javascript:..."
            print(i.attrs['href'])
    except Exception as e:
        # Raised when the tag has no href attribute or its value is empty;
        # caught so that the loop is not interrupted
        print(e)
        err_a_list.append(i)
        count += 1

print("\n" * 8)
for i in err_a_list:
    print(i)
    print()
print(count)
Handling hrefs that have no domain, hrefs that are only an anchor, and similar cases:
http://blog.csdn.net/huangxiongbiao/article/details/45584407
# Expand anchors such as #comment-text against the page URL http://www.ruanyifeng.com/blog/2015/05/co.html,
# and expand paths such as /feed.html to http://www.ruanyifeng.com/feed.html
alist = map(
    lambda i: proto + '://' + domain + i if i[0] == '/'
    else url + i if i[0] == '#'
    else i,
    alist,
)
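The same completion can also be sketched with urljoin from the standard library, which resolves an href against the URL of the page it came from. This is an alternative to the lambda above, not the original author's method; the helper name absolutize and the sample href list are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(page_url, hrefs):
    """Resolve each href against page_url, skipping empty and javascript: values."""
    result = []
    for href in hrefs:
        # Empty strings would break an i[0] check, so filter them out first
        if not href or href.lower().startswith('javascript:'):
            continue
        result.append(urljoin(page_url, href))
    return result


page_url = 'http://www.ruanyifeng.com/blog/2015/05/co.html'
# Hypothetical hrefs of the kinds discussed above
sample_hrefs = ['#comment-text', '/feed.html', 'http://news.163.com/', '', 'javascript:void(0)']
links = absolutize(page_url, sample_hrefs)

for link in links:
    print(link)
```

Unlike the lambda, this version needs only the page URL rather than separate proto and domain values, and it also handles relative paths that do not start with a slash.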