Scraping 999 selected articles from Qingnian Wenzhai (Youth Digest) with Python 3

First, a bit of the Zen of Python (hehe).
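
If you want to read it yourself, the interpreter ships it as a built-in Easter egg:

# Importing the built-in "this" module prints the Zen of Python (PEP 20)
import this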

Let's analyze the "Selected Articles" section of the Qingnian Wenzhai official site: http://www.qnwz.cn/html/221/list_1.html

Page source:

<strong>当前位置:</strong><a href='http://www.qnwz.cn/'>主页</a>><a href='/html/239/'>《青年文摘·快点》</a>><a href='/html/221/'>文章精选</a>>
  </div>
  <div class="listbox">
  <ul class="e2">
  <li>
  <a href='/html/221/201603/618083.html' class='preview'><img src='http://www.qnwz.cn///uploads/allimg/160315/1-160315105620961-lp.jpg'/></a>
  <a href="/html/221/201603/618083.html" class="title"><b>视野|歪果仁找工做也拼爹?</b></a>
  <span class="info">
  <small>日期:</small>2016-03-15 10:54:49
  <small>好评:</small>0
  <small>得分:</small>0
 

</span>


All the article titles and article URLs sit inside the div with class=listbox, and the section has 68 pages in total.
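
Before writing the crawler, a quick one-off check (a small sketch; the selectors simply mirror the structure shown above) confirms that the title links really live under that div:

import requests
from bs4 import BeautifulSoup

# Fetch the first list page and count the <a class="title"> links inside <div class="listbox">
r = requests.get('http://www.qnwz.cn/html/221/list_1.html')
soup = BeautifulSoup(r.content, 'html.parser')
box = soup.find('div', class_='listbox')
print(len(box.find_all('a', class_='title')))  # number of articles on this page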

1. So, import requests and BeautifulSoup, two libraries commonly used for crawling.

#!/usr/bin/python3
#coding:utf8
import requests
from bs4 import BeautifulSoup

2. Build the addresses of all the list pages (pages 1 to 68).

def geturl(self):
    # range() excludes the upper bound, so use 69 to cover all 68 list pages
    for i in range(1, 69):
        root_url = 'http://www.qnwz.cn/html/221/list_' + str(i) + '.html'
        self.l.append(root_url)
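
A quick standalone sanity check of the generated addresses (a throwaway sketch, not part of the class):

# Confirm the first and last URLs and the total count cover all 68 list pages
urls = ['http://www.qnwz.cn/html/221/list_{}.html'.format(i) for i in range(1, 69)]
print(urls[0])    # http://www.qnwz.cn/html/221/list_1.html
print(urls[-1])   # http://www.qnwz.cn/html/221/list_68.html
print(len(urls))  # 68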

3. Download each of the collected pages (1 to 68) with requests.

text = self.req.get(url=url)
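
The one-liner above assumes every request succeeds. A slightly more defensive variant (just a sketch; the timeout and the status check are my additions, not part of the original code):

def gethtml(self, url):
    # Fetch one list page; give up after 10 seconds and raise on HTTP 4xx/5xx
    # instead of silently handing an error page to the parser.
    resp = self.req.get(url=url, timeout=10)
    resp.raise_for_status()
    self.parser(resp)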

4. Extract the titles and article URLs from the downloaded pages.

def parser(self, r):
    # Parse the page, then narrow down to the listbox div that holds the article list
    soup = BeautifulSoup(r.content, 'html.parser')
    box = soup.find('div', class_='listbox')
    # Every article is an <a class="title"> link inside that div
    titleurl = box.find_all('a', class_='title')
    for i in titleurl:
        self.n = self.n + 1
        s = 'title=' + i.get_text() + ',url=http://www.qnwz.cn' + i['href'] + '\n'
        print(s)

Run output:
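
One line is printed per article, followed by the total count. For the sample item in the HTML snippet above, the line would be:

title=视野|歪果仁找工做也拼爹?,url=http://www.qnwz.cn/html/221/201603/618083.html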

Full source:

#!/usr/bin/python3
# coding: utf8
import requests
from bs4 import BeautifulSoup


class main(object):
    def __init__(self):
        self.l = list()                 # list-page URLs
        self.req = requests.Session()   # reuse one session for all requests
        self.n = 0                      # running article count
        self.geturl()
        for i in self.l:
            self.gethtml(i)
        print('Total: ' + str(self.n) + ' articles')

    def geturl(self):
        # range() excludes the upper bound, so use 69 to cover all 68 list pages
        for i in range(1, 69):
            root_url = 'http://www.qnwz.cn/html/221/list_' + str(i) + '.html'
            self.l.append(root_url)

    def parser(self, r):
        # Narrow down to the listbox div, then walk its <a class="title"> links
        soup = BeautifulSoup(r.content, 'html.parser')
        box = soup.find('div', class_='listbox')
        titleurl = box.find_all('a', class_='title')
        for i in titleurl:
            self.n = self.n + 1
            s = 'title=' + i.get_text() + ',url=http://www.qnwz.cn' + i['href'] + '\n'
            print(s)

    def gethtml(self, url):
        text = self.req.get(url=url)
        self.parser(text)


if __name__ == '__main__':
    main()

My writing isn't great and the code is simple, written rather plainly. If you spot any mistakes, corrections are welcome.
