First, let's take a look at the site we want to scrape: http://xiaohua.zol.com.cn/html. Before starting, install the required third-party libraries:
pip install requests
pip install beautifulsoup4
pip install lxml
from bs4 import BeautifulSoup
import os
import requests
Import the required libraries; the os library will be used later on to store the scraped content.
Then open the "最新笑话" (Latest Jokes) section and notice the "所有笑话" (All Jokes) tab, which lets us scrape the entire joke archive with maximum efficiency!
Let's use the requests library to look at this page's source code:
from bs4 import BeautifulSoup
import os
import requests

all_url = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

# Fetch the listing page and print its raw HTML
all_html = requests.get(all_url, headers=headers)
print(all_html.text)
headers is the HTTP request header; without it, many sites reject the request and the scrape fails.
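To see what the header changes, you can compare a bare request with a disguised one (a minimal sketch; whether this particular site answers a header-less request with an error code or a stub page is an assumption you should verify yourself):

import requests

url = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

bare = requests.get(url)                        # default python-requests User-Agent; may be rejected
disguised = requests.get(url, headers=headers)  # looks like an ordinary browser
print(bare.status_code, len(bare.text))
print(disguised.status_code, len(disguised.text))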
Part of the output looks like this:
Analyzing the source shows that we still cannot get the information for every joke directly from this page, so we look for an indirect route.
Click a joke to view its full text, and the URL becomes http://xiaohua.zol.com.cn/detail58/57681.html; clicking other jokes shows that the URLs all take the form http://xiaohua.zol.com.cn/detail?/?.html. We will use this pattern as our entry point to scrape all the content.
Our goal is to find every URL of the form http://xiaohua.zol.com.cn/detail?/?.html and then scrape its content.
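As a quick sanity check on that pattern, a regular expression can pull every matching suffix straight out of the listing page's source (a minimal sketch; the regex below is our own guess at the pattern, while the article itself extracts the links with BeautifulSoup in the next step):

import re
import requests

headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
html = requests.get('http://xiaohua.zol.com.cn/new/', headers=headers).text

# Match suffixes like /detail58/57681.html anywhere in the raw HTML
suffixes = set(re.findall(r'/detail\d+/\d+\.html', html))
for s in sorted(suffixes):
    print(s)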
Flip to an arbitrary page under "所有笑话", e.g. http://xiaohua.zol.com.cn/new/5.html, press F12 to inspect the source, and the layout reveals that:
each joke corresponds to one <li class="article-summary"> tag, and the URL of the joke's full text is hidden in an href attribute inside it; we only need to extract that href to get the joke's address.
from bs4 import BeautifulSoup
import os
import requests

RootUrl = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
RootCode = requests.get(RootUrl, headers=headers)
#print(RootCode.text)

Soup = BeautifulSoup(RootCode.text, 'lxml')
# Each joke sits in an <li class="article-summary"> element
SoupList = Soup.find_all('li', class_='article-summary')
for i in SoupList:
    #print(i)
    SubSoup = BeautifulSoup(i.prettify(), 'lxml')
    # The "read all" link carries the joke's detail-page href
    list2 = SubSoup.find_all('a', target='_blank', class_='all-read')
    for b in list2:
        href = b['href']
        print(href)
With the code above, we successfully obtain the URL suffix of every joke on the first page:
In other words, we only need to loop over every page number to collect every joke, as sketched below.
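Since the listing pages follow the pattern http://xiaohua.zol.com.cn/new/N.html (page 5 was shown above), the page loop might look like this (a minimal sketch; the upper bound of 100 is a placeholder, and GetPageHrefs is a hypothetical helper built from the same find_all call as before):

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

def GetPageHrefs(page):
    # Fetch listing page N and return the detail-page hrefs found on it
    url = 'http://xiaohua.zol.com.cn/new/%d.html' % page
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
    return [a['href'] for a in soup.find_all('a', target='_blank', class_='all-read')]

for page in range(1, 101):  # placeholder bound; read the real last page number off the site
    for href in GetPageHrefs(page):
        print(href)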
Refactoring the extraction code from earlier gives:
from bs4 import BeautifulSoup
import requests
import os

RootUrl = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
RootCode = requests.get(RootUrl, headers=headers)

def GetJokeUrl():
    JokeUrlList = []
    Soup = BeautifulSoup(RootCode.text, 'lxml')
    SoupList = Soup.find_all('span', class_='article-title')
    for i in SoupList:
        SubSoup = BeautifulSoup(i.prettify(), 'lxml')
        JokeUrlList.append("http://xiaohua.zol.com.cn/" + str(SubSoup.a['href']))
    return JokeUrlList
After a quick look at a joke page's HTML, we next fetch the content of every joke on a page:
from bs4 import BeautifulSoup
import requests
import os

RootUrl = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
RootCode = requests.get(RootUrl, headers=headers)

def GetJokeUrl():
    # Collect the full URL of every joke on the listing page
    JokeUrlList = []
    Soup = BeautifulSoup(RootCode.text, 'lxml')
    SoupList = Soup.find_all('span', class_='article-title')
    for i in SoupList:
        SubSoup = BeautifulSoup(i.prettify(), 'lxml')
        JokeUrlList.append("http://xiaohua.zol.com.cn/" + str(SubSoup.a['href']))
    return JokeUrlList

def GetJokeText(url):
    HtmlCode = requests.get(url, headers=headers)  # don't forget the headers
    Soup = BeautifulSoup(HtmlCode.text, 'lxml')
    # The joke body is held in <p> tags on the detail page
    Content = Soup.find_all('p')
    for p in Content:
        print(p.text)

def main():
    JokeUrlList = GetJokeUrl()
    for url in JokeUrlList:
        GetJokeText(url)

if __name__ == "__main__":
    main()
The result looks like this:
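One last loose end: we imported os at the start because the scraped content is meant to be stored later. A minimal sketch of that step, assuming a jokes/ output directory and one text file per joke (both our own choices, not from the original code):

import os

def SaveJoke(index, text):
    # Create the output directory on first use, then write each joke to its own file
    os.makedirs('jokes', exist_ok=True)
    path = os.path.join('jokes', '%d.txt' % index)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

Inside main(), you would call SaveJoke(i, joke_text) instead of, or in addition to, printing.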