Digging into Sushi Takeout Data

Date: 2019-11-30

Tags: sushi, takeout, data

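A few months ago, in my spare time, I scraped some sushi takeout data (I will never admit that I was living on takeout back then), and now that I have a moment I am writing it up to share. This post is aimed at curious onlookers and anyone with a little programming background.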

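Tip: this crawler uses asynchronous coroutines to speed up crawling. If you only need a small amount of data, I do not recommend this approach, because sending many requests at once can get you banned. You can rewrite it as ordinary synchronous code instead.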
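
For a small crawl, a synchronous variant might look like the sketch below. It is not the author's code: `gethtml_sync`, `crawl_sync`, and the `opener` parameter are hypothetical names, and it uses the standard library's urllib rather than aiohttp, pausing after each request to stay gentle on the server.

```python
import json
import time
import urllib.request


def gethtml_sync(url, headers=None, delay=0.5, opener=urllib.request.urlopen):
    """Fetch one URL and return its parsed JSON body, or None on failure.

    `opener` is a hypothetical hook so the network call can be swapped out.
    Avoid sending 'Accept-Encoding: gzip' here: urllib does not decompress
    response bodies automatically.
    """
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener(request) as response:
            data = json.loads(response.read().decode('utf-8'))
    except Exception as e:
        print(e)
        data = None
    time.sleep(delay)  # pause after every request, successful or not
    return data


def crawl_sync(urls, **kwargs):
    """Fetch the URLs one after another, in order."""
    return [gethtml_sync(url, **kwargs) for url in urls]
```

Because the requests run strictly one at a time, the total crawl time grows linearly with the number of URLs, which is the trade-off for a lower risk of being blocked.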

Following the ETL (extract, transform, load) workflow of data analysis, this little crawler is explained step by step below:

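First, prepare the following Python packages (pandas, requests, aiohttp, pymysql, and sqlalchemy are third-party; the rest come from the standard library):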

import pandas as pd
import requests
import aiohttp
import asyncio
from multiprocessing.pool import Pool
from datetime import date
import pymysql
from sqlalchemy import create_engine
import collections

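Next, pick a takeout platform to analyze. I chose ele.me, and the reason is simply that it is easy! Easy! Easy! The ele.me data API can be found directly in the network panel of Chrome's developer tools, without a lot of anti-scraping tricks. Now for the main course~

2.1 First, define a function that requests data: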

async def gethtml(url):
    header = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'www.ele.me',
        'Referer': 'https://www.ele.me/place/wsbrgts6d1ry?latitude=28.111704&longitude=113.011304',
        'x-shard': 'loc=113.011304,28.111704',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    }
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url=url, headers=header) as r:
                # raise_for_status() returns None on success and raises on 4xx/5xx
                r.raise_for_status()
                return await r.json()
    except Exception as e:
        # Log the failure and return None so the caller can see the request failed
        print(e)
        return None

All subsequent data requests go through this function. Because it runs as an asynchronous coroutine, it is defined with async def.
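
Since firing every request at once is what risks a ban, one option is to cap how many requests are in flight. The sketch below is an assumption, not part of the original crawler: `fetch_all` is a hypothetical helper that wraps any coroutine function taking a URL (such as gethtml) in an `asyncio.Semaphore`.

```python
import asyncio


async def fetch_all(urls, fetch, limit=5):
    """Call `fetch(url)` for every URL with at most `limit` calls in flight.

    `fetch` is any coroutine function that takes a URL.
    Results come back in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # Only `limit` coroutines can hold the semaphore at the same time
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(bounded(url) for url in urls))
```

A call such as `await fetch_all(name, gethtml, limit=5)` could then stand in for a bare `asyncio.gather` over all URLs.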

2.2 Next come the data extraction functions:

def getshopid(html):
    shop_id = {i['restaurant']['id'] for i in html['restaurant_with_foods']}
    return shop_id


def geturl(ids):
    restaurant_url = {'https://www.ele.me/restapi/shopping/restaurant/%s?latitude=28.09515&longitude=113.012001&terminal=web' %
                      shop_id for shop_id in ids}
    foodurl = {'https://www.ele.me/restapi/shopping/v2/menu?restaurant_id=%s&terminal=web' %
               shop_id for shop_id in ids}
    return restaurant_url, foodurl

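The first function extracts the shop IDs, and the second builds the shop detail and menu URLs from them. Note that the extracted data must be deduplicated; here I use the simple, brute-force approach of a set.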
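
To illustrate the deduplication, here is a small sketch with two made-up result pages that share one shop. The page contents are hypothetical; only the field names `restaurant_with_foods`, `restaurant`, and `id` come from the function above.

```python
def getshopid(html):
    # Same logic as the function above: a set comprehension drops repeats
    return {i['restaurant']['id'] for i in html['restaurant_with_foods']}


# Two hypothetical result pages that overlap on shop 'B'
page1 = {'restaurant_with_foods': [{'restaurant': {'id': 'A'}},
                                   {'restaurant': {'id': 'B'}}]}
page2 = {'restaurant_with_foods': [{'restaurant': {'id': 'B'}},
                                   {'restaurant': {'id': 'C'}}]}

# Union the per-page sets so each shop is requested only once
all_ids = set().union(getshopid(page1), getshopid(page2))
print(sorted(all_ids))  # ['A', 'B', 'C']
```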

2.3 With extraction done, the next step reloads the data with pandas for the final analysis. The rows are prepared as follows:

def food_table(foodlists):
    today = date.today()
    day, weekday = today.strftime('%Y-%m-%d'), today.strftime('%A')
    # One row per dish: (restaurant_id, name, price, month_sales, date, weekday)
    foods = {(y['specfoods'][0]['restaurant_id'], y['name'], y['specfoods'][0]['price'],
              y['month_sales'], day, weekday)
             for foodlist in foodlists for x in foodlist for y in x['foods']}
    return foods


def shop_table(shoplist):
    # One row per shop; the set drops duplicate shops
    shop_detail = {(shop['id'], shop['name'], shop['distance'], shop['float_delivery_fee'],
                    shop['float_minimum_order_amount'], shop['rating'], shop['rating_count'])
                   for shop in shoplist}
    return shop_detail

These functions build the food detail table and the shop detail table respectively.
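
The nesting that food_table walks through can be hard to picture, so here is a sketch with a made-up menu response. The structure is inferred only from the fields the function reads (a list of categories, each with a `foods` list), not from any API documentation, and the dishes and numbers are hypothetical.

```python
from datetime import date

# Hypothetical menu response for one shop: a list of categories,
# each holding a list of foods (only the fields the parser reads)
menu = [
    {'foods': [
        {'name': 'salmon roll', 'month_sales': 120,
         'specfoods': [{'restaurant_id': 1, 'price': 18.0}]},
        {'name': 'tuna roll', 'month_sales': 80,
         'specfoods': [{'restaurant_id': 1, 'price': 22.0}]},
    ]},
    {'foods': [
        # The same dish listed again under a second category
        {'name': 'salmon roll', 'month_sales': 120,
         'specfoods': [{'restaurant_id': 1, 'price': 18.0}]},
    ]},
]

today = date.today().strftime('%Y-%m-%d')
rows = {(food['specfoods'][0]['restaurant_id'], food['name'],
         food['specfoods'][0]['price'], food['month_sales'], today)
        for category in menu for food in category['foods']}
print(len(rows))  # 2: the repeated dish collapses into one tuple
```

Building tuples rather than dicts is what makes the set work here, since tuples of plain values are hashable.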

2.4 The last step is the analysis, done with pandas. Here I use a simple metric, each shop's total monthly sales revenue:

def join_table(shoptable, foodtable):
    shoptable = pd.DataFrame(list(shoptable), columns=[ 'id', 'name', 'distance', 'delivery_fee', 'minimum_order_amount', 'rating', 'rating_count'])
    foodtable = pd.DataFrame(list(foodtable), columns=['id', 'fname', 'price', 'msale', 'date', 'weekday'])
    # print(foodtable.values)
    new = pd.merge(shoptable, foodtable, on='id')
    new['total'] = new['msale'] * new['price']
    group = new.groupby(['name', 'id'])
    # numeric_only=True keeps the text columns (fname, date, weekday) out of the sum
    return new, group.sum(numeric_only=True)
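
As a pandas-free cross-check of what the merge and group-by compute, the sketch below works the metric out by hand on made-up rows: each shop's total is the sum of price times monthly sales over its dishes, and dishes without a matching shop are dropped, as in an inner join. The shops, dishes, and numbers are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows in the shapes produced by shop_table and food_table
# (only the columns needed for the metric)
shops = {1: 'Sushi A', 2: 'Sushi B'}            # id -> name
foods = [(1, 'salmon roll', 18.0, 120),         # (id, fname, price, msale)
         (1, 'tuna roll', 22.0, 80),
         (2, 'egg roll', 10.0, 50),
         (3, 'orphan dish', 5.0, 10)]           # no matching shop

totals = defaultdict(float)
for shop_id, fname, price, msale in foods:
    if shop_id in shops:                        # inner join on id
        totals[(shops[shop_id], shop_id)] += price * msale

print(dict(totals))
# {('Sushi A', 1): 3920.0, ('Sushi B', 2): 500.0}
```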

This step uses pandas in place of SQL for the processing. You can also store the data in MySQL and process it there, with code like this:

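# Connection string: user root, password 12345678, database waimai on localhost
connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
# detail is the merged DataFrame and k the table name (the search keyword); both are set in the final loop
detail.to_sql(name=k, con=connect, if_exists='append')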
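
Once the merged rows sit in a database table, the per-shop totals can be computed in SQL. The sketch below is an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for MySQL, a hypothetical table named `sushi` with a subset of the merged columns, and made-up rows from two crawl dates (which is what appending on each run would accumulate).

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the MySQL database
conn.execute('CREATE TABLE sushi (id INTEGER, name TEXT, fname TEXT, '
             'price REAL, msale INTEGER, date TEXT, total REAL)')
conn.executemany(
    'INSERT INTO sushi VALUES (?, ?, ?, ?, ?, ?, ?)',
    [(1, 'Sushi A', 'salmon roll', 18.0, 120, '2019-11-29', 2160.0),
     (1, 'Sushi A', 'tuna roll', 22.0, 80, '2019-11-29', 1760.0),
     (2, 'Sushi B', 'egg roll', 10.0, 50, '2019-11-29', 500.0),
     (1, 'Sushi A', 'salmon roll', 18.0, 125, '2019-11-30', 2250.0)])

# Per-shop revenue for each crawl date, largest first within a date
query = '''
    SELECT date, name, SUM(total) AS shop_total
    FROM sushi
    GROUP BY date, id, name
    ORDER BY date, shop_total DESC
'''
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

Grouping by date as well as shop keeps repeated crawls from being added together into one inflated figure.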

With all the processing functions defined, we can write the main function:

async def main(name):
    # Fetch every search-result page concurrently
    htasks = [asyncio.ensure_future(gethtml(url)) for url in name]
    htmls = await asyncio.gather(*htasks)
    # Merge the shop ids from all pages; failed pages come back as None and are skipped
    ids = set().union(*(getshopid(html) for html in htmls if html))
    restaurant_url, food_url = geturl(ids)
    print('async crawl...')
    shoptasks = [asyncio.ensure_future(gethtml(url)) for url in restaurant_url]
    foodtasks = [asyncio.ensure_future(gethtml(url)) for url in food_url]
    fdone, fpending = await asyncio.wait(foodtasks)
    sdone, spending = await asyncio.wait(shoptasks)
    # Drop failed requests (None) before parsing
    shoplist = [task.result() for task in sdone if task.result()]
    foodlist = [task.result() for task in fdone if task.result()]
    print('distribute parse....')
    # Parse in worker processes; the with-block cleans up the pool afterwards
    with Pool(8) as pool:
        sparse_jobs = [pool.apply_async(shop_table, args=(shoplist,))]
        fparse_jobs = [pool.apply_async(food_table, args=(foodlist,))]
        shoptable = [x.get() for x in sparse_jobs][0]
        foodtable = [x.get() for x in fparse_jobs][0]
    new, result = join_table(shoptable, foodtable)

    return new, result
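
The main function uses both `asyncio.gather` and `asyncio.wait`, which return results differently. The sketch below, with a hypothetical `work` coroutine standing in for a request, shows that gather returns results in submission order, while wait returns sets of finished and pending tasks with no ordering. That is harmless here because the parsed rows end up in sets anyway.

```python
import asyncio


async def work(i):
    # Lower numbers sleep longer, so tasks finish in reverse order
    await asyncio.sleep(0.01 * (3 - i))
    return i


async def demo():
    # gather: a list of results in the order the awaitables were passed in
    ordered = await asyncio.gather(*(work(i) for i in range(3)))
    # wait: (done, pending) sets of tasks; read each value with .result()
    tasks = [asyncio.ensure_future(work(i)) for i in range(3)]
    done, pending = await asyncio.wait(tasks)
    unordered = {task.result() for task in done}
    return ordered, unordered, pending


ordered, unordered, pending = asyncio.run(demo())
print(ordered)  # [0, 1, 2]
```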

The final move is to run the main function:

# Create the engine once and reuse it for every table
connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')

while len(lists) > 0:
    for k, v in list(lists.items()):
        try:
            # asyncio.run (Python 3.7+) creates an event loop, runs main, and closes the loop
            detail, totals = asyncio.run(main(v))
        except KeyError:
            # A response lacked an expected field; k stays in lists and is retried on the next pass
            print('fail:{}'.format(k))
        else:
            lists.pop(k)
            print('done:{}'.format(k))
            detail.to_sql(name=k, con=connect, if_exists='append')
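
The loop above keeps retrying a failed keyword until it succeeds, which never ends if a keyword always fails. A bounded variant might look like the sketch below. It is an assumption rather than the author's code: `run_with_retries`, `run_one`, and `max_attempts` are hypothetical names, and `run_one` stands for whatever crawls a single keyword.

```python
def run_with_retries(jobs, run_one, max_attempts=3):
    """Run `run_one(value)` for each job, retrying failures a limited number of times.

    Returns (results, failed): results maps each key to its return value,
    failed maps each key to the last exception after `max_attempts` tries.
    """
    pending = dict(jobs)
    attempts = {key: 0 for key in pending}
    results, failed = {}, {}
    while pending:
        for key, value in list(pending.items()):
            attempts[key] += 1
            try:
                results[key] = run_one(value)
            except Exception as e:
                if attempts[key] >= max_attempts:
                    # Give up on this job and remember why
                    failed[key] = e
                    pending.pop(key)
            else:
                pending.pop(key)
    return results, failed
```

With the crawler, `run_one` could be something like `lambda urls: asyncio.run(main(urls))`, and the keys in `failed` show which keywords need a closer look.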

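Because the code is asynchronous, it has to run inside an event loop. The lists variable holds the search URLs for the areas and keywords you want to crawl. Here are a few example lists: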

wuyisquare=['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%92%96%E5%95%A1&latitude=28.19652&limit=100&longitude=112.977361&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
sushi = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%AF%BF%E5%8F%B8&latitude=28.111704&limit=100&longitude=113.011304&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
yangqi = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E8%8C%B6&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
tea = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%92%96%E5%95%A1&latitude=28.09515&limit=100&longitude=113.012001&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
fen = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%AD%92%E5%AD%90%E9%AA%A8%E7%B2%89&latitude=28.111704&limit=100&longitude=113.011304&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
gaosheng = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%B2%89&latitude=28.09515&limit=100&longitude=113.012001&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
fangcun = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E6%96%B9%E5%AF%B8%E5%AF%BF%E5%8F%B8&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
luoyide = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%BD%97%E4%B9%89%E5%BE%B7&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
lists={'sushi':sushi,'tea':tea,'fen':fen,'gaosheng':gaosheng,'luoyide':luoyide,'fangcun':fangcun}

Just replace keyword, latitude, and longitude in the URL to search the area you want. The coordinates can be obtained from any map API; I will not advertise one here.
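
Rather than pasting percent-encoded keywords by hand, the URLs can be generated. In the sketch below, `search_urls` and its parameter names are hypothetical; the endpoint and query parameters are copied from the example lists above.

```python
from urllib.parse import urlencode

SEARCH_API = 'https://www.ele.me/restapi/shopping/restaurants/search'


def search_urls(keyword, latitude, longitude, pages=5, step=24, limit=100):
    """Build one search URL per page, percent-encoding the keyword."""
    urls = []
    for offset in range(0, pages * step, step):
        params = [('extras[]', 'activity'), ('keyword', keyword),
                  ('latitude', latitude), ('limit', limit),
                  ('longitude', longitude), ('offset', offset),
                  ('terminal', 'web')]
        urls.append(SEARCH_API + '?' + urlencode(params))
    return urls


sushi = search_urls('寿司', '28.111704', '113.011304')
print(sushi[0])
```

With the defaults this produces the same five offsets (0, 24, 48, 72, 96) as the hand-written `range(0, 120, 24)` lists.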

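This crawler uses asynchronous requests, set-based deduplication, and pandas writes to a database, all basic techniques, so it makes a good practice project. As for the value of the data, dig into it yourself; it is fairly interesting. For example, you can look at how monthly sales relate to other dimensions using scatter plots, bar charts, or calendar heatmaps.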
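
One such view, monthly revenue against distance, could be prepared for a bar chart as in the sketch below. Everything in it is hypothetical: the numbers are made up, `bucket_totals` is an illustrative helper, and treating distance as metres is an assumption about the API's units.

```python
from collections import defaultdict

# Hypothetical (distance, monthly revenue) pairs, one per shop
shops = [(350, 3920.0), (800, 500.0), (1200, 2100.0), (1900, 640.0), (2600, 90.0)]


def bucket_totals(rows, width=1000):
    """Sum revenue per distance bucket; keys are each bucket's lower bound."""
    totals = defaultdict(float)
    for distance, revenue in rows:
        lower = (distance // width) * width
        totals[lower] += revenue
    return dict(sorted(totals.items()))


print(bucket_totals(shops))
# {0: 4420.0, 1000: 2740.0, 2000: 90.0}
```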

Next time I will play with WeChat and QQ bots. Stay tuned~

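I write these posts as a reference for friends who are just getting started. Stars and likes are very welcome. And a quick plug for my looks-score calculator mini program and its source code.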