Haha, this is actually very simple: you can scrape a whole novel from the web in just a few lines of code. No teasing, let's get started.
First, install the two packages we need: requests and BeautifulSoup4.
Run the following in the console:
pip install requests
pip install BeautifulSoup4
If they do not install correctly, check your environment variables. Configuring environment variables is not covered here; there are plenty of articles on that already.
Once both install commands have finished, run pip list.
You can see that both packages were installed successfully.
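If you prefer to double-check from inside Python, a quick import test works too. This is just a sanity check, not part of the scraper, and the printed versions will depend on what pip installed:

# quick sanity check: both imports should succeed without errors
import requests
import bs4

print(requests.__version__)
print(bs4.__version__)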
Alright, let's get straight to the code.
Our goal is to scrape every chapter of the novel listed at this link: https://book.qidian.com/info/1013646681#Catalog
Open the page and inspect the elements with Chrome DevTools to see the HTML attributes of each chapter. We find that all chapters share the parent element <ul class="cf">, and each chapter's link and title sit in an <a> tag inside a child <li>.
So the first thing to do is extract the links of all the chapters.
# used for making HTTP requests
import requests

chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")
print(chapter.text)
The page is fetched without trouble; next we pull the elements we need out of it.
# used for making HTTP requests
import requests
# used for parsing HTML
from bs4 import BeautifulSoup

chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")
ul_bs = BeautifulSoup(chapter.text, "html.parser")
# extract the <ul> tags whose class is cf
ul = ul_bs.find_all("ul", class_="cf")
print(ul)
The <ul> is extracted successfully as well. Next we iterate over the <a> tags under the <ul> to get every chapter's title and link.
# used for making HTTP requests
import requests
# used for parsing HTML
from bs4 import BeautifulSoup

chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")
ul_bs = BeautifulSoup(chapter.text, "html.parser")
# extract the <ul> tags whose class is cf
ul = ul_bs.find_all("ul", class_="cf")
ul_bs = BeautifulSoup(str(ul[0]), "html.parser")
# find the <a> tags under the <ul>
a_bs = ul_bs.find_all("a")
# iterate over each <a>'s href attribute and text
for a in a_bs:
    href = a.get("href")
    text = a.get_text()
    print(href)
    print(text)
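As a side note, BeautifulSoup also supports CSS selectors, so the same chapter links can be pulled out in a single call. This is only a minimal alternative sketch, assuming the ul.cf / li / a structure described above and a reasonably recent bs4:

import requests
from bs4 import BeautifulSoup

chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")
soup = BeautifulSoup(chapter.text, "html.parser")
# one CSS selector: every <a> inside an <li> under <ul class="cf">
for a in soup.select("ul.cf li a"):
    print(a.get("href"), a.get_text())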
OK, all the chapter links are sorted. Now let's see what a chapter detail page looks like and work out a concrete plan for scraping it.
Open a chapter and inspect it with Chrome DevTools. The chapter title is stored in <h3 class="j_chapterName">, and the body text is stored in <div class="read-content j_readContent">.
We need to extract the content from these two tags.
# used for making HTTP requests
import requests
# used for parsing HTML
from bs4 import BeautifulSoup

chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")
ul_bs = BeautifulSoup(chapter.text, "html.parser")
# extract the <ul> tags whose class is cf
ul = ul_bs.find_all("ul", class_="cf")
ul_bs = BeautifulSoup(str(ul[0]), "html.parser")
# find the <a> tags under the <ul>
a_bs = ul_bs.find_all("a")
# the hrefs are protocol-relative, so prefix them with https:
detail = requests.get("https:" + a_bs[0].get("href"))
text_bs = BeautifulSoup(detail.text, "html.parser")
text = text_bs.find_all("div", class_="read-content j_readContent")
print(text)
The chapter page is scraped without any trouble. The code above only uses the first chapter as a demonstration, but since debugging shows a single chapter can be scraped successfully, the next step is simply to loop over all the links and extract each one in turn.
# used for making HTTP requests
import requests
# used for parsing HTML
from bs4 import BeautifulSoup

chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")
ul_bs = BeautifulSoup(chapter.text, "html.parser")
# extract the <ul> tags whose class is cf
ul = ul_bs.find_all("ul", class_="cf")
ul_bs = BeautifulSoup(str(ul[0]), "html.parser")
# find the <a> tags under the <ul>
a_bs = ul_bs.find_all("a")
# iterate over every href and scrape each chapter
for a in a_bs:
    detail = requests.get("https:" + a.get("href"))
    d_bs = BeautifulSoup(detail.text, "html.parser")
    # body text
    content = d_bs.find_all("div", class_="read-content j_readContent")
    # title
    name = d_bs.find_all("h3", class_="j_chapterName")[0].get_text()
As you can see in the output above, every <p> tag in the body is one paragraph, so the extracted text is littered with <p> tags, which is not what we want. Next we strip the <p> tags out.
But once the <p> tags are removed the text loses its paragraph layout, which makes for a poor reading experience. The fix is simple: append a line break at the end of every paragraph.
# used for making HTTP requests
import requests
# used for parsing HTML
from bs4 import BeautifulSoup

chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")
ul_bs = BeautifulSoup(chapter.text, "html.parser")
# extract the <ul> tags whose class is cf
ul = ul_bs.find_all("ul", class_="cf")
ul_bs = BeautifulSoup(str(ul[0]), "html.parser")
# find the <a> tags under the <ul>
a_bs = ul_bs.find_all("a")
# iterate over every href and scrape each chapter
for a in a_bs:
    detail = requests.get("https:" + a.get("href"))
    d_bs = BeautifulSoup(detail.text, "html.parser")
    # body text
    content = d_bs.find_all("div", class_="read-content j_readContent")
    # title
    name = d_bs.find_all("h3", class_="j_chapterName")[0].get_text()
    txt = ""
    p_bs = BeautifulSoup(str(content), "html.parser")
    # extract the text of each <p> tag and append a line break
    for p in p_bs.find_all("p"):
        txt = txt + p.get_text() + "\r\n"
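As a small aside on style, the paragraph loop can also be written as a single join; a minimal equivalent sketch, assuming content holds the div list found above:

from bs4 import BeautifulSoup

# assumes `content` is the list returned by find_all above
p_bs = BeautifulSoup(str(content), "html.parser")
txt = "\r\n".join(p.get_text() for p in p_bs.find_all("p"))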
With the <p> tags gone, all the work is done. Now we just need to save each chapter as a txt file, named after the chapter title.
# used for making HTTP requests
import requests
# used for parsing HTML
from bs4 import BeautifulSoup

def create_txt(path, txt):
    fd = None
    try:
        fd = open(path, 'w+', encoding='utf-8')
        fd.write(txt)
    except:
        print("error")
    finally:
        if fd is not None:
            fd.close()

chapter = requests.get("https://book.qidian.com/info/1013646681#Catalog")
ul_bs = BeautifulSoup(chapter.text, "html.parser")
# extract the <ul> tags whose class is cf
ul = ul_bs.find_all("ul", class_="cf")
ul_bs = BeautifulSoup(str(ul[0]), "html.parser")
# find the <a> tags under the <ul>
a_bs = ul_bs.find_all("a")
# iterate over every href and scrape each chapter
for a in a_bs:
    detail = requests.get("https:" + a.get("href"))
    d_bs = BeautifulSoup(detail.text, "html.parser")
    # body text
    content = d_bs.find_all("div", class_="read-content j_readContent")
    # title
    name = d_bs.find_all("h3", class_="j_chapterName")[0].get_text()
    path = 'F:\\test\\'
    path = path + name + ".txt"
    txt = ""
    p_bs = BeautifulSoup(str(content), "html.parser")
    # extract the text of each <p> tag and append a line break
    for p in p_bs.find_all("p"):
        txt = txt + p.get_text() + "\r\n"
    # save the chapter as a txt file named after the chapter title
    create_txt(path, txt)
    print(path + " saved successfully")
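One caveat: chapter titles can contain characters that Windows does not allow in file names (for example ? or :), which would make the open call fail. A hypothetical helper along these lines (not part of the original script) could clean the name before building the path:

import re

def safe_filename(name):
    # hypothetical helper: replace characters Windows forbids in file names
    return re.sub(r'[\\/:*?"<>|]', "_", name)

# usage inside the loop above: path = 'F:\\test\\' + safe_filename(name) + ".txt"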
The chapters are scraped, the files are saved, and we're done. That's all it takes: just these few lines of code.