爬虫与swift

时间 2019-11-16

原文原文链接

分析

使用爬虫爬取网站page，并按事先的要求将须要的项目保存到数据库中，而后再使用python flask框架编写一个web 服务器讲数据库中的数据读出来，最后用swift编写一个应用将数据显示出来。我这里选区的所要爬取的网站是豆瓣电影网。css

技术选用

爬虫：使用python的scrapy爬虫
数据库：使用mongoDB，存储网页只须要key和value形式进行存储就行了，因此在这里选择mongoDB这种NOSQL数据库进行存储
服务器：使用python的flask框架，用了你就知道几行代码就能完成不少事情，固然特别是flask能够根据须要组装空间，超轻量级。html

实现：

scrapy爬虫实现python

上图是scrapy的文档结构，下面主要介绍几个文件。ios

a. items.pyweb

from scrapy.item import Item, Field
import scrapy
class TopitmeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    dataSrc = Field()
    dataId = Field()
    filmReview = Field()
    startCount = Field()
这里能够把items.py看做是mvc中的model，在items里咱们定义了本身须要的模型。

b. pipelines.pymongodb

import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log
class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings[‘MONGODB_COLLECTION’]]
    def process_item(self, item, spider):
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Beauty added to MongoDB database!",
                    level=log.DEBUG, spider=spider)
        return item

俗称管道，这个文件主要用来把咱们获取的item类型存入mongodb

c. settings.py数据库

BOT_NAME = 'topitme'
SPIDER_MODULES = ['topitme.spiders']
NEWSPIDER_MODULE = 'topitme.spiders'
BOT_NAME = 'topitme'
ITEM_PIPELINES = ['topitme.pipelines.MongoDBPipeline',]
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "topitme"
MONGODB_COLLECTION = "beauty"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'topitme (+http://www.yourdomain.com)'

这里须要设置一些常量，例如mongodb的数据库名，数据库地址和数据库端口号等等

d. topitme_scrapy.pyjson

from scrapy import Spider
from scrapy.selector import Selector
from topitme.items import TopitmeItem

import sys
reload(sys)
sys.setdefaultencoding(‘utf8’)#设置默认编码格式

class topitmeSpider(Spider):
    name = "topitmeSpider"
    allowed_domin =["movie.douban.com"]
    start_urls = [
        "http://movie.douban.com/review/latest/",
    ]
    def parse(self, response):
        results = Selector(response).xpath('//ul[@class="tlst clearfix"]')
        for result in results:
            item = TopitmeItem()
            # item['title'] = result.xpath('li[@class="ilst"]/a/@src').extract()[0]
            item['title'] = result.xpath('li[@class="ilst"]/a/@title').extract()[0].encode('utf-8')
            item['dataSrc'] = result.xpath('li[@class="ilst"]/a/img/@src').extract()[0]
            item['filmReview'] = result.xpath('li[@class="clst report-link"]/div[@class="review-short"]/span/text()').extract()[0].encode('utf-8')
            item['dataId'] = result.xpath('li[@class="clst report-link"]/div[@class="review-short"]/@id').extract()[0]
            item['dataId'] = result.xpath('li[@class="nlst"]/h3/a/@title').extract()[0]
            item['startCount'] = 0
            yield item

# ul[@class="tlst clearfix"]/li[3]/div[1]
# //ul[@class="tlst clearfix"]/li[@class="ilst"]/a/img/@src

这个文件是爬虫程序的主要代码，首先咱们定义了一个类名为topitmeSpider的类，继承自Spider类，而后这个类有3个基础的属性，name表示这个爬虫的名字，等一下咱们在命令行状态启动爬虫的时候，爬虫的名字就是name规定的。
allowed_domin意思就是指在movie.douban.com这个域名爬东西。
start_urls是一个数组，里面用来保存须要爬的页面，目前咱们只须要爬首页。因此只有一个地址。
而后def parse就是定义了一个parse方法（确定是override的，我以为父类里确定有一个同名方法），而后在这里进行解析工做，这个方法有一个response参数，你能够把response想象成，scrapy这个框架在把start_urls里的页面下载了，而后response里所有都是html代码和css代码。这之中最主要的是涉及一个xpath的东西，XPath即为XML路径语言，它是一种用来肯定XML（标准通用标记语言的子集）文档中某部分位置的语言。能够经过xpath定位到咱们想要获取的元素。

服务器

使用python的flask框架实现flask

from flask import Flask, request
import json
from bson import json_util
from bson.objectid import ObjectId
import pymongo

app = Flask(__name__)

client = pymongo.MongoClient()
db = client['topitme']
def toJson(data):
    return json.dumps(data, default=json_util.default)

@app.route('/FilmReview', methods=['GET'])

def findMovie():
    if request.method == 'GET':
        json_results = []
        for result in results:
            json_results.append(result)
        return toJson(json_results)

if __name__ == '__main__':
    app.run(debug=True)



首先能够看到代码，client，db两个参量是为了取得数据库链接。
findMovie函数响应http request，而后返回数据库数据，以JSON形式返回

swift

ios的实现就不详细介绍了，这里写这部分只是为了，验证结果。swift

运行：

起服务器：

起数据库：

运行爬虫：

访问服务器：http://localhost:5000/FileReview 能够看到数据已经存储到数据库中了

ios运行状况：

下面是原网站网页展现，能够看到所要的数据存储到数据库，而且正常显示出来

1. 爬虫与反爬虫
2. scrapy爬虫与反爬虫
3. 爬虫与反爬
4. websocket与爬虫
5. 网络爬虫与反爬虫实战
6. 聚焦爬虫与通用爬虫
7. 关于java爬虫与python爬虫
8. 反爬虫与爬虫技术整理
9. 爬虫与反爬虫技术分析
10. 爬虫爬虫爬虫（一）
更多相关文章...
• Swift 类 - Swift 教程
• XSL-FO 与 XSLT - XSL-FO 教程
• Composer 安装与使用
• Java Agent入门实战（一）-Instrumentation介绍与使用