Python Data Science (5): Data Processing and Data Collection

Date: 2019-11-17


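Series links:

Python Data Science (1) - Python and Data Science Applications (I)

Python Data Science (2) - Python and Data Science Applications (II)

Python Data Science (3) - Python and Data Science Applications (III)

Python Data Science (4) - Data Gathering Series

Python Data Science (5) - Data Processing and Data Collection

Python Data Science (6) - Data Cleaning (I)

Python Data Science (7) - Data Cleaning (II)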

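I have been busy with work lately and have had a lot to learn, so updates have lagged. From here on I will try to post once a day; if you are studying along, stay close and don't get lost ヽ(ˋ▽ˊ)ノ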

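Search engines and websites collect vast amounts of information around the clock; whatever is not original content has been collected from somewhere. The programs that do the collecting are usually called web crawlers or web spiders: they first "crawl" to the target page and then "scrape" the needed information off it. Put simply, a web data collection program is like a hard-working honeybee: it flies to a flower (the target page), gathers pollen (the needed information), and through processing (data cleaning and storage) turns it into honey (usable data).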

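1. Processing data in different formats

Web data collection has great potential. In an era when big data has become part of everyday thinking, web data collection sits at the intersection of networking, databases, and machine learning, and has become the best practice for meeting individual data needs. Search engines satisfy the needs people have in common, that is, "what you see is what you get", whereas web data collection goes a step further and refines the data: it aggregates the web's disorganized data into a consistent, well-structured form that is easy to analyze and mine, so that analysis can truly be driven by data. At work you may often struggle to find data, or stare at several hundred pages of it with no practical way to get it out, or face a messy site full of trap-laden forms and infuriating CAPTCHAs, or find that the data you need is locked inside web-hosted PDFs and images. And if you are an anti-crawler engineer, you also need to understand the common collection techniques and the common form-security measures in order to make your site safer. As the saying goes, every defense is met by a stronger attack... (so a crawler engineer's days are an endless battle of wits with the other side's anti-crawler engineers; more on that another time...)

That was a bit of a digression, so back to the point: before collecting data from the web, let's first look at how to process data in different formats...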

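1. Processing CSV data

1. Download the data

Data source: http://data.stats.gov.cn/easyquery.htm?cn=C01

2. Process the data

Note: Excel and JSON data are handled in much the same way, using the pandas read_excel() and read_json() functions respectively.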
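As a minimal sketch of the pandas readers mentioned above, the snippet below loads a small CSV and a small JSON document from in-memory strings. The sample data (column names and figures) is made up for illustration and is not the dataset from data.stats.gov.cn; with a real download you would pass the file path instead. read_excel() is called the same way with a path to an .xlsx file, but it needs an Excel engine such as openpyxl installed, so it appears only as a comment here.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a downloaded file.
csv_text = "year,population\n2016,100\n2017,110\n2018,121\n"
json_text = '[{"year": 2016, "population": 100}, {"year": 2017, "population": 110}]'

# read_csv accepts a file path or any file-like object.
csv_df = pd.read_csv(io.StringIO(csv_text))

# read_json is used the same way for a list of JSON records.
json_df = pd.read_json(io.StringIO(json_text))

# For an Excel workbook: pd.read_excel("data.xlsx")  (needs e.g. openpyxl)

print(csv_df.shape)
print(json_df["population"].sum())
```

All three readers return a DataFrame, so the downstream cleaning and analysis code does not need to care which format the data arrived in.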

3. Processing XML data
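No code survives for this step, so here is a minimal sketch using the standard library's xml.etree.ElementTree. The XML layout (a records element containing record elements with year and value children) is an assumed example, not the format of any particular download.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML sample; real data will have its own tag names.
xml_text = """
<records>
  <record><year>2017</year><value>110</value></record>
  <record><year>2018</year><value>121</value></record>
</records>
"""


def parse_records(text):
    """Turn each <record> element into a dict mapping child tag to text."""
    root = ET.fromstring(text.strip())
    return [{child.tag: child.text for child in record}
            for record in root.findall("record")]


rows = parse_records(xml_text)
print(rows)
```

The resulting list of dicts can be passed straight to pandas.DataFrame(rows), which is the same way the scraping examples in section 3 build their tables.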

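2. Web crawlers

I have written about this before, so I won't go into detail again here; please refer to the earlier articles.

Python Web Crawler (1) - Getting Started

Python Web Crawler (2) - urllib Crawler Examples

Python Web Crawler (3) - Advanced Crawling

Python Web Crawler (4) - XPath

Python Web Crawler (5) - Requests and Beautiful Soup

Python Web Crawler (6) - The Scrapy Framework

Python Web Crawler (7) - Deep Crawling with CrawlSpider

Python Web Crawler (8) - A Simple Translation Program Using Youdao Dictionary

Generating a Word Cloud from Jianshu Homepage Article Titles

Combining Spider with OpenPyXL

Scraping Lagou Job Listings and Saving Them to Excel with xlwt

Fun Things You Can Do with Python: Automated Vote Casting

Selenium and PhantomJS

Scraping Friends' Posts from QQ Zone with Selenium

Using Selenium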

3. Hands-on practice

Enough theory; now let's get down to business.

1. Fetch the headlines and links from the Tencent News homepage and save them as an Excel file

import requests
import pandas
from bs4 import BeautifulSoup

res = requests.get('https://news.qq.com/')  # target page to collect from
soup = BeautifulSoup(res.text, 'html.parser')  # parse the page

newsary = []   # collected headline records
for news in soup.select('.Q-tpWrap .text'):
    # take the link text and its href attribute (the address)
    newsary.append({'title': news.select('a')[0].text,
                    'url': news.select('a')[0]['href']})

newsdf = pandas.DataFrame(newsary)  # build a DataFrame
newsdf.to_excel('news.xlsx')   # write to an Excel workbook
print(newsary[0])
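Because the markup of a live page changes over time, it can help to try the selector logic on a fixed snippet first. The sketch below runs the same kind of select calls against a small hand-written HTML string. The snippet only imitates the .Q-tpWrap .text structure that the code above assumes, the URLs in it are placeholders, and extract_news is a helper name introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating the assumed structure; not real page content.
html = """
<div class="Q-tpWrap"><div class="text"><a href="https://example.com/a">First headline</a></div></div>
<div class="Q-tpWrap"><div class="text"><a href="https://example.com/b">Second headline</a></div></div>
"""


def extract_news(markup):
    """Return a list of {'title', 'url'} dicts, one per matching link."""
    soup = BeautifulSoup(markup, "html.parser")
    items = []
    for block in soup.select(".Q-tpWrap .text"):
        link = block.select_one("a")
        # Skip blocks without a usable link instead of raising IndexError.
        if link is None or not link.has_attr("href"):
            continue
        items.append({"title": link.get_text(strip=True), "url": link["href"]})
    return items


news = extract_news(html)
print(news)
```

If the live page returns an empty list, the class names have probably changed and the selector needs to be updated after inspecting the page in the browser.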

2. Scrape housing price listings from Fang.com (房天下) and save them

import requests
import pandas as pd
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

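# fake_useragent helper: ua_list.random returns a randomly chosen User-Agent string.
# Note that my_headers below evaluates it once, so every request reuses that value.
ua_list = UserAgent()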

my_headers = {
        'user-agent': ua_list.random
    }

# Collect the detail-page URLs from every listing page
def get_url():
    num = 1
    sum_url = []
    while num < 101:
        usual_url = 'http://esf.sh.fang.com/house/i3'
        home_url = usual_url + str(num)
        print(home_url)
        res = requests.get(url=home_url, headers=my_headers)
        num+=1
        soup = BeautifulSoup(res.text, 'html.parser')
        domain = 'http://esf.sh.fang.com'
        for house in soup.select('.houseList dl'):
            try:
                # title = house.select('.title')[0].text.strip()  # strip surrounding newlines
                url1 = domain + house.select('.title a')[0]['href']
                sum_url.append(url1)
            except Exception as e:
                print(e)
    print(len(sum_url))
    return sum_url


def houseary():
    houseary_url = get_url()
    houseary = []
    for url in houseary_url:
        print(url)
        content = requests.get(url=url, headers=my_headers)
        soup1 = BeautifulSoup(content.text, 'html.parser')
        try:
            info = {}  # keys are the Chinese column headers written to Excel
            info['标题'] = soup1.select('.title')[0].text.strip()  # title
            info['总价'] = soup1.select('.trl-item')[0].text  # total price
            info['户型'] = soup1.select('.tt')[0].text.strip()  # layout
            info['建筑面积'] = soup1.select('.tt')[1].text  # floor area
            info['单价'] = soup1.select('.tt')[2].text  # unit price
            info['朝向'] = soup1.select('.tt')[3].text  # orientation
            info['楼层'] = soup1.select('.tt')[4].text  # floor
            info['装修'] = soup1.select('.tt')[5].text  # decoration
            info['小区'] = soup1.select('.rcont')[0].text.strip().replace('\n', '')  # residential complex
            info['区域'] = soup1.select('.rcont')[1].text.replace('\n', '').replace('\r', '').replace(' ', '')  # district
            info['经纪人'] = soup1.select('.pn')[0].text  # agent
            info['联系电话'] = soup1.select('.pnum')[0].text  # contact phone
            houseary.append(info)
        except Exception as e:
            print(e)

    print(houseary)
    print(len(houseary))
    df = pd.DataFrame(houseary)
    df.to_excel('house.xlsx')


if __name__ == '__main__':
    houseary()

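I left the program running in the background, and after half an hour of effort it finally pulled the data down. At this speed, I think it's time to learn distributed crawling...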
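Most of that running time is spent waiting on one request after another. Short of a distributed crawler, one common option is to overlap the requests with a thread pool. The sketch below shows only the pattern and is not part of the original script: fetch_detail is a hypothetical stand-in that sleeps instead of calling requests.get, crawl is a name introduced here, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_detail(url):
    """Hypothetical stand-in for requests.get(url) plus parsing."""
    time.sleep(0.05)  # simulate network latency
    return {"url": url, "length": len(url)}


def crawl(urls, workers=8):
    # Executor.map returns results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_detail, urls))


urls = [f"http://example.com/house/{i}" for i in range(16)]
start = time.perf_counter()
results = crawl(urls)
elapsed = time.perf_counter() - start
print(len(results), f"{elapsed:.2f}s")
```

When pointing this pattern at a real site, a small worker count and a delay between requests keep the load on the server reasonable.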

With the data in hand, the next step is to clean it. The next installments cover data cleaning, data exploration, and data visualization...
