BeautifulSoup parsers
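| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(text, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(text, "lxml") | Fast; tolerant of malformed documents | Requires installing a C library |
| lxml XML parser | BeautifulSoup(text, "xml") | Fast; the only parser that supports XML | Requires installing a C library |
| html5lib | BeautifulSoup(text, "html5lib") | Generates documents in HTML5 format | Slow; a separate package to install (pure Python, no C extension needed) |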
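To see how the parsers in the table differ in practice, the sketch below feeds the same malformed snippet to each one and prints what comes back. The helper name `parse_with` and the sample markup are illustrative assumptions, not part of the original lesson; parsers that are not installed are simply reported as missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Return the markup as re-serialized by `parser`, or None if it is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is unavailable
        return None


broken = "<a></p>"  # a stray closing tag with no matching opening tag
for name in ("html.parser", "lxml", "html5lib"):
    result = parse_with(broken, name)
    print(name, "->", result if result is not None else "(not installed)")
```

Each parser repairs the broken markup differently, so it is usually worth naming the parser explicitly, as both exercises here do, rather than relying on whichever one happens to be installed.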
Exercise 1: scrape the articles and save them locally (one HTML file per article)
wordpress-edu-3autumn.localprod.forc.work
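```python
import requests
from bs4 import BeautifulSoup

# Fetch the index page and parse it with the built-in parser
index = BeautifulSoup(requests.get('https://wordpress-edu-3autumn.localprod.forc.work/').text, 'html.parser')
# Every article title is a link inside <h2 class="entry-title">
for heading in index.find_all('h2', class_='entry-title'):
    link = heading.find('a')
    print(link.text)
    # One HTML file per article, named after the article title
    with open('{}.html'.format(link.text), 'w', encoding='utf8') as file:
        article = BeautifulSoup(requests.get(link['href']).text, 'lxml')
        file.write(str(article.find('div', class_='entry-content')))
```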
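The Exercise 1 code uses each article title directly as a file name, which would fail if a title contained a character such as `/` or `?`. A minimal sketch of one way to guard against that follows; the helper name `safe_filename` and the choice of `_` as the replacement are assumptions for illustration.

```python
import re


def safe_filename(title, replacement="_"):
    """Turn an article title into a string that is safe to use as a file name."""
    # Replace path separators and characters that Windows forbids in file names
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title)
    # Collapse runs of whitespace, then trim spaces and dots from both ends
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Fall back to a fixed name if nothing usable is left
    return cleaned or "untitled"


print(safe_filename('What/Why: a "test"?') + ".html")
```

With this helper, the `open(...)` call in the exercise could be given `safe_filename(title) + '.html'` instead of the raw title.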
Exercise 2: scrape the book titles and prices under each category and save them to books.txt
books.toscrape.com
Final result...
```python
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://books.toscrape.com/').text, 'html.parser')
with open('books.txt', 'w', encoding='utf8') as file:
    # Each <li> in the nested sidebar list is one category
    for category in soup.find('ul', class_='nav nav-list').find('ul').find_all('li'):
        file.write(category.text.strip() + '\n')
        res = requests.get('http://books.toscrape.com/' + category.find('a')['href'])
        res.encoding = 'utf8'  # force UTF-8 decoding so the £ sign is read correctly
        page = BeautifulSoup(res.text, 'html.parser')
        # Only the first page of each category is fetched
        for book in page.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3'):
            title = book.find('h3').find('a')['title']
            print(title)
            file.write('\t"{}" {}\n'.format(title, book.find('p', class_='price_color').text))
```
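The Exercise 2 code builds each category URL by concatenating the site root with the link's `href`. That works for links found on the home page, but relative links found on deeper pages need to be resolved against the page they came from. A small sketch using the standard library's `urljoin` follows; the helper name `absolute_url` and the sample paths are illustrative assumptions.

```python
from urllib.parse import urljoin

BASE_URL = "http://books.toscrape.com/"


def absolute_url(page_url, href):
    """Resolve a link found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# A link found on the home page resolves against the site root
category = absolute_url(BASE_URL, "catalogue/category/books/mystery_3/index.html")
print(category)

# A relative link found on that category page resolves against its own directory
print(absolute_url(category, "page-2.html"))
```

Resolving against the current page rather than the site root matters once the scraper follows links beyond the first page of a category.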
```
Travel
    "It's Only the Himalayas" £45.17
    "Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond" £49.43
    "See America: A Celebration of Our National Parks & Treasured Sites" £48.87
    "Vagabonding: An Uncommon Guide to the Art of Long-Term World Travel" £36.94
    "Under the Tuscan Sun" £37.33
    "A Summer In Europe" £44.34
    "The Great Railway Bazaar" £30.54
    "A Year in Provence (Provence #1)" £56.88
    "The Road to Little Dribbling: Adventures of an American in Britain (Notes From a Small Island #2)" £23.21
    "Neither Here nor There: Travels in Europe" £38.95
    "1,000 Places to See Before You Die" £26.08
Mystery
    "Sharp Objects" £47.82
    "In a Dark, Dark Wood" £19.63
    "The Past Never Ends" £56.50
    "A Murder in Time" £16.64
    "The Murder of Roger Ackroyd (Hercule Poirot #4)" £44.10
    "The Last Mile (Amos Decker #2)" £54.21
    "That Darkness (Gardiner and Renner #1)" £13.92
    "Tastes Like Fear (DI Marnie Rome #3)" £10.69
    "A Time of Torment (Charlie Parker #14)" £48.35
    "A Study in Scarlet (Sherlock Holmes #1)" £16.73
    "Poisonous (Max Revere Novels #3)" £26.80
    "Murder at the 42nd Street Library (Raymond Ambler #1)" £54.36
    "Most Wanted" £35.28
    "Hide Away (Eve Duncan #20)" £11.84
    "Boar Island (Anna Pigeon #19)" £59.48
    "The Widow" £27.26
    "Playing with Fire" £13.71
    "What Happened on Beale Street (Secrets of the South Mysteries #2)" £25.37
    "The Bachelor Girl's Guide to Murder (Herringford and Watts Mysteries #1)" £52.30
    "Delivering the Truth (Quaker Midwife Mystery #1)" £20.89
Historical Fiction
    "Tipping the Velvet" £53.74
    "Forever and Forever: The Courtship of Henry Longfellow and Fanny Appleton" £29.69
    "A Flight of Arrows (The Pathfinders #2)" £55.53
    "The House by the Lake" £36.95
    "Mrs. Houdini" £30.25
    "The Marriage of Opposites" £28.08
    "Glory over Everything: Beyond The Kitchen House" £45.84
    "Love, Lies and Spies" £20.55
    "A Paris Apartment" £39.01
    "Lilac Girls" £17.28
    "The Constant Princess (The Tudor Court #1)" £16.62
    "The Invention of Wings" £37.34
    "World Without End (The Pillars of the Earth #2)" £32.97
    "The Passion of Dolssa" £28.32
    "Girl With a Pearl Earring" £26.77
    "Voyager (Outlander #3)" £21.07
    "The Red Tent" £35.66
    "The Last Painting of Sara de Vos" £55.55
    "The Guernsey Literary and Potato Peel Pie Society" £49.53
    "Girl in the Blue Coat" £46.83
......
```
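The Exercise 2 code writes each category name on its own line and each book on a tab-indented line of the form `"title" price`. As a rough sketch, assuming that layout holds for every line, the file can be read back into a dictionary like this; the function name `parse_books` and the inline sample are illustrative assumptions.

```python
def parse_books(text):
    """Map each category name to a list of (title, price) pairs."""
    books = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if not line[0].isspace():
            # Unindented line: the start of a new category
            current = line.strip()
            books[current] = []
        elif current is not None:
            # Indented line: split at the last closing quote, since titles may contain quotes
            title, price = line.strip().rsplit('" ', 1)
            books[current].append((title[1:], price))
    return books


sample = 'Travel\n\t"It\'s Only the Himalayas" £45.17\n\t"Under the Tuscan Sun" £37.33\n'
print(parse_books(sample))
```

Prices are kept as strings here, including the currency sign; converting them to numbers would be a separate step.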