Let's start with the Zen of Python (hehe).
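For anyone who has not seen it, the Zen of Python ships with the interpreter as an easter egg. A one-line sketch that prints it:

```python
# Importing the standard-library module "this" prints the Zen of Python (PEP 20).
import this
```

The text is printed only the first time the module is imported in a session.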
Next, analyze the 文章精选 (Featured Articles) column of the official 青年文摘 (Youth Digest) site: http://www.qnwz.cn/html/221/list_1.html
Page source:
<strong>当前位置:</strong><a href='http://www.qnwz.cn/'>主页</a>><a href='/html/239/'>《青年文摘·快点》</a>><a href='/html/221/'>文章精选</a>>
</div>
<div class="listbox">
<ul class="e2">
<li>
<a href='/html/221/201603/618083.html' class='preview'><img src='http://www.qnwz.cn///uploads/allimg/160315/1-160315105620961-lp.jpg'/></a>
<a href="/html/221/201603/618083.html" class="title"><b>视野|歪果仁找工作也拼爹?</b></a>
<span class="info">
<small>日期:</small>2016-03-15 10:54:49
<small>好评:</small>0
<small>得分:</small>0
</span>
It turns out that every article title and article URL sits inside the div with class=listbox, and the column has 68 pages.
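That observation can be checked offline against a trimmed copy of the markup before writing the crawler. The sketch below deliberately uses only the standard-library html.parser module (not BeautifulSoup, which the rest of the post uses) so that it runs without third-party packages. The names SAMPLE, TitleLinkParser and extract_title_links are made up for illustration, and the image path in SAMPLE is shortened.

```python
from html.parser import HTMLParser

# Trimmed copy of the list-page markup shown above (image path shortened).
SAMPLE = """
<div class="listbox">
<ul class="e2">
<li>
<a href='/html/221/201603/618083.html' class='preview'><img src='x.jpg'/></a>
<a href="/html/221/201603/618083.html" class="title"><b>视野|歪果仁找工作也拼爹?</b></a>
</li>
</ul>
</div>
"""


class TitleLinkParser(HTMLParser):
    """Collect (title, href) pairs from <a class="title"> inside div.listbox."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # > 0 while we are inside div.listbox
        self.current_href = None  # href of the title link being read, if any
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at div.listbox and keep counting nested divs.
            if self.div_depth or attrs.get("class") == "listbox":
                self.div_depth += 1
        elif tag == "a" and self.div_depth and attrs.get("class") == "title":
            self.current_href = attrs.get("href")
            self.current_text = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.links.append((title, self.current_href))
            self.current_href = None

    def handle_data(self, data):
        # Only keep text that appears inside a title link.
        if self.current_href is not None:
            self.current_text.append(data)


def extract_title_links(html):
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links


print(extract_title_links(SAMPLE))
```

The preview link (class='preview') points to the same article but is skipped, so each article is counted once.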
1. So, import requests and BeautifulSoup, two commonly used scraping libraries.
#!/usr/bin/python3
# coding: utf8
import requests
from bs4 import BeautifulSoup
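2. Build the addresses of all the list pages (pages 1 to 68).
def geturl(self):
    # range() excludes its end value, so 69 is needed to include page 68
    for i in range(1, 69):
        root_url = 'http://www.qnwz.cn/html/221/list_'
        root_url += str(i) + '.html'
        self.l.append(root_url)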
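Because range() stops one short of its second argument, the page count is easy to get wrong by one. A standalone sketch of the same URL-building step, where BASE and build_page_urls are names invented for this example:

```python
# URL pattern of the list pages; {} is replaced by the page number.
BASE = "http://www.qnwz.cn/html/221/list_{}.html"


def build_page_urls(last_page=68):
    """Return the list-page URLs for pages 1..last_page, inclusive."""
    return [BASE.format(i) for i in range(1, last_page + 1)]


urls = build_page_urls()
print(len(urls))   # 68
print(urls[0])     # http://www.qnwz.cn/html/221/list_1.html
print(urls[-1])    # http://www.qnwz.cn/html/221/list_68.html
```

Keeping the last page number as a parameter makes it easy to adjust if the column grows.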
3. Download every page collected above (pages 1 to 68).
text = self.req.get(url=url)
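The call above has no timeout and does not check the HTTP status, so one stalled or failed page can hang or silently skew the run. A slightly more defensive sketch follows; fetch_page and its timeout and delay defaults are assumptions for illustration, not part of the original script. The function only relies on the session having a get() method, as requests.Session does.

```python
import time


def fetch_page(session, url, timeout=10.0, delay=0.0):
    """Download one page with a timeout and fail loudly on HTTP errors.

    session is expected to behave like requests.Session.
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises on 4xx/5xx responses
    if delay:
        time.sleep(delay)        # optional pause between requests
    return response


# Usage with the real library (needs network access):
#   import requests
#   page = fetch_page(requests.Session(),
#                     'http://www.qnwz.cn/html/221/list_1.html', delay=1.0)
```

A short delay between the 68 requests is a courtesy to the site rather than a requirement of the code.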
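4. Extract the titles and article URLs from the downloaded pages.
def parser(self, r):
    soup = BeautifulSoup(r.content, 'html.parser')
    # Narrow the search to the listing container first
    ur = soup.find_all('div', class_='listbox')
    soup = BeautifulSoup(str(ur), 'html.parser')
    # Each article link is an <a class="title"> element
    titleurl = soup.find_all('a', class_='title')
    for i in titleurl:
        self.n = self.n + 1
        # get_text() avoids the None that .string can return for mixed content
        s = 'title=' + i.get_text(strip=True) + ',url=http://www.qnwz.cn' + i['href'] + '\n'
        print(s)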
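Prepending the host by hand works here because every href on the page starts with a slash. As a more general alternative, the standard library's urljoin resolves an href against the URL of the page it came from. In this sketch PAGE_URL and absolute_url are names chosen for illustration:

```python
from urllib.parse import urljoin

# URL of the list page the links were found on.
PAGE_URL = 'http://www.qnwz.cn/html/221/list_1.html'


def absolute_url(href, page_url=PAGE_URL):
    """Resolve a possibly relative href against the page it appeared on."""
    return urljoin(page_url, href)


print(absolute_url('/html/221/201603/618083.html'))
# http://www.qnwz.cn/html/221/201603/618083.html
```

Unlike string concatenation, this also handles hrefs that are already absolute or that are relative to the current directory.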
Run result:
Full source:
#!/usr/bin/python3
# coding: utf8
import requests
from bs4 import BeautifulSoup


class main(object):
    def __init__(self):
        self.l = list()                # list-page URLs
        self.req = requests.Session()  # one session reused for all requests
        self.T = []
        self.n = 0                     # number of articles found
        self.geturl()
        for i in self.l:
            self.gethtml(i)
        print('Total: ' + str(self.n) + ' articles')

    def geturl(self):
        # range() excludes its end value, so 69 is needed to include page 68
        for i in range(1, 69):
            root_url = 'http://www.qnwz.cn/html/221/list_'
            root_url += str(i) + '.html'
            self.l.append(root_url)

    def parser(self, r):
        soup = BeautifulSoup(r.content, 'html.parser')
        # Narrow the search to the listing container first
        ur = soup.find_all('div', class_='listbox')
        soup = BeautifulSoup(str(ur), 'html.parser')
        # Each article link is an <a class="title"> element
        titleurl = soup.find_all('a', class_='title')
        for i in titleurl:
            self.n = self.n + 1
            # get_text() avoids the None that .string can return for mixed content
            s = 'title=' + i.get_text(strip=True) + ',url=http://www.qnwz.cn' + i['href'] + '\n'
            print(s)

    def gethtml(self, url):
        text = self.req.get(url=url)
        self.parser(text)


if __name__ == '__main__':
    main()
My writing isn't great and the code is simple, so this write-up is fairly basic too. If you spot any mistakes, corrections are welcome.