A Batch Web Crawler in Python

Date: 2020-06-14


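On most websites, older historical weather data generally has to be paid for, so to get complete data at a lower cost we often end up scraping a site ourselves. This post records what I learned from writing my first crawler. Because it was my first attempt, the Python program takes a fairly long time to run; if you spot any mistakes, please point them out.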

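The goal is to scrape the monthly average weather data for Kunming from https://en.tutiempo.net/climate/ws-567780.html. Take July 1942 as an example: in https://en.tutiempo.net/climate/07-1942/ws-567780.html, the part highlighted in green in the original post (07) is the month and the part highlighted in blue (1942) is the year. We need the data for every month from 1942 through 2019, that is, the content inside the red box in Figure 1 on every page from https://en.tutiempo.net/climate/01-1942/ws-567780.html to https://en.tutiempo.net/climate/12-2019/ws-567780.html.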
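The URL scheme described above can be sketched as a small generator of page addresses. This is only an illustration: the names `BASE` and `monthly_urls` are my own and do not appear in the original program, and the year range is passed in as parameters on the assumption that it may need adjusting.

```python
from itertools import product

# URL template: two-digit month, four-digit year, fixed station id.
BASE = "https://en.tutiempo.net/climate/{month:02d}-{year}/ws-567780.html"


def monthly_urls(first_year=1942, last_year=2019):
    """Return one URL per month, in chronological order."""
    return [
        BASE.format(month=month, year=year)
        for year, month in product(range(first_year, last_year + 1), range(1, 13))
    ]


urls = monthly_urls()
print(len(urls))  # 936 = 78 years x 12 months
print(urls[0])
print(urls[-1])
```

The `{month:02d}` format specifier does the zero-padding for January through September, so no separate branch is needed for single-digit months.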

Figure 1: The website

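Press F12 to inspect the page structure, as shown in Figure 2, and find the code that corresponds to the red box. (If you are new to HTML, hover the mouse over a piece of code; the blue box that appears is the part of the page that code produces.)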

Figure 2

The HTML behind the red box is shown in Figure 3:

Figure 3

From this, construct the Python regular expression used for matching:

'<td class="tc2">(.*)</td><td class="tc3">(.*)</td><td class="tc4">(.*)</td><td class="tc5">(.*)</td><td class="tc6">(.*)</td><td class="tc7">(.*)</td><td class="tc8">(.*)</td><td class="tc9">(.*)</td><td class="tc10">(.*)</td><td>&nbsp;</td><td>(.*)</td><td>(.*)</td><td>(.*)</td><td>(.*)</td>'
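To see what this pattern captures, here is a minimal sketch that runs it against a single table row. The row is invented for illustration (the cell values and the leading `tc1` cell are assumptions, not copied from the real page), and `ROW_PATTERN`, `sample_cells`, `sample_flags` and `sample_row` are hypothetical names.

```python
import re

# Same pattern as above, assembled piece by piece for readability.
ROW_PATTERN = re.compile(
    "".join('<td class="tc{}">(.*)</td>'.format(n) for n in range(2, 11))
    + "<td>&nbsp;</td>"
    + "<td>(.*)</td>" * 4
)

# Hypothetical row, invented for illustration; the real page may differ.
sample_cells = ["19.8", "24.1", "16.9", "-", "84", "92.2", "-", "7.4", "-"]
sample_flags = ["o", "", "", "o"]
sample_row = (
    '<tr><td class="tc1">7</td>'
    + "".join(
        '<td class="tc{}">{}</td>'.format(n, value)
        for n, value in zip(range(2, 11), sample_cells)
    )
    + "<td>&nbsp;</td>"
    + "".join("<td>{}</td>".format(value) for value in sample_flags)
    + "</tr>"
)

matches = ROW_PATTERN.findall(sample_row)
print(len(matches[0]))  # 13 captured groups per matched row
print(matches[0])
```

With several groups in the pattern, `re.findall` returns a list of tuples, one tuple per match. Note that `(.*)` is greedy: if several table rows sit on the same line of HTML, a single match can swallow more than one row. `(.*?)` is the usual non-greedy alternative if that turns out to be a problem.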

The complete Python code is as follows:

import re

import requests
from xlwt import Workbook

# Regular expression that captures the 13 cells of the row we want
ROW_PATTERN = re.compile('<td class="tc2">(.*)</td><td class="tc3">(.*)</td><td class="tc4">(.*)</td><td class="tc5">(.*)</td><td class="tc6">(.*)</td><td class="tc7">(.*)</td><td class="tc8">(.*)</td><td class="tc9">(.*)</td><td class="tc10">(.*)</td><td>&nbsp;</td><td>(.*)</td><td>(.*)</td><td>(.*)</td><td>(.*)</td>')

book = Workbook(encoding='utf-8')
sheet = book.add_sheet('Sheet1')  # create a worksheet
for j in range(78):
    # 78 years in total: 1942 through 2019
    for k in range(12):
        # 12 months per year
        print(j, k)
        # {:02d} puts a leading 0 in front of months 1 to 9
        url = "https://en.tutiempo.net/climate/{:02d}-{}/ws-567780.html".format(k + 1, j + 1942)
        try:
            response = requests.get(url, timeout=30)  # GET the page to obtain its HTML
            page = response.content.decode()  # named `page` to avoid shadowing the built-in str
            # return all matches found in the page
            rows = ROW_PATTERN.findall(page)
            row_index = j * 12 + k  # one spreadsheet row per month
            for i in range(13):
                # store each captured cell in the workbook
                print(rows[0][i])
                sheet.write(row_index, i, label=rows[0][i])
        except (requests.RequestException, UnicodeDecodeError, IndexError) as err:
            # skip months whose page cannot be fetched or does not match the pattern
            print("skipped", url, err)
# save the workbook to a spreadsheet file
book.save("weather.xls")

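The Excel sheet produced by the run is shown in Figure 5. After a Ctrl+F find-and-replace and Excel's Data > Text to Columns > Finish, the sheet looks like Figure 6; after some further tidying it becomes the table in Figure 7.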
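The same clean-up could also be done in Python before the values are written to the workbook. The sketch below is only one possible approach and rests on assumptions: the raw cells are not shown in the text, so I assume they may contain leftover HTML tags or entities and that `-` marks a missing value. `clean_cell` is a hypothetical helper, not part of the original program.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")  # any HTML tag


def clean_cell(raw):
    """Strip markup from one captured cell; return a float, a str, or None."""
    text = html.unescape(TAG.sub("", raw)).strip()
    if text in ("", "-"):
        return None  # assumed marker for a missing value
    try:
        return float(text)
    except ValueError:
        return text  # keep non-numeric text unchanged


print(clean_cell("<b>19.8</b>"))
print(clean_cell("-"))
print(clean_cell("o"))
```

Writing numbers rather than strings into the sheet would mean Excel treats the cells as numeric from the start, which may remove the need for the Text to Columns step.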

Figure 5

Figure 6

Figure 7

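Finally, this article is the author's original work. Commercial use of its content is prohibited; if you repost it, please credit the source.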

Original source: https://www.cnblogs.com/nzsll/p/10959261.html