In addition to the usual way of starting Scrapy with the scrapy crawl command, you can also use the API to run Scrapy from a script.
Note that Scrapy is built on top of the Twisted asynchronous networking library, so it must be run inside the Twisted reactor.
Also note that you must shut down the Twisted reactor yourself after the spider has finished. This can be done by adding a callback to the deferred returned by CrawlerRunner.crawl.
Example:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished

Running spiders outside a project is not much different. You have to create a generic Settings object and populate it as needed (see the built-in settings reference for the available settings), instead of using the configuration returned by get_project_settings.

Spiders can still be referenced by their name if SPIDER_MODULES is set with the modules where Scrapy should look for spiders. Otherwise, passing the spider class as the first argument to the CrawlerRunner.crawl method is enough.

from twisted.internet import reactor
from scrapy.spiders import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings

class MySpider(Spider):
    # Your spider definition
    ...

settings = Settings({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
runner = CrawlerRunner(settings)
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
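To make the SPIDER_MODULES approach mentioned above concrete, here is a minimal sketch (not part of the original example) that assumes your spiders live in a hypothetical package named myspiders; with that setting in place, crawl can be given the spider's name rather than its class:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings

# 'myspiders' is a hypothetical package containing your spider classes.
settings = Settings({'SPIDER_MODULES': ['myspiders']})
runner = CrawlerRunner(settings)

# With SPIDER_MODULES set, the spider can again be referenced by name.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished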
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through the internal API.
Example:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)

defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished
The same example, but running the spiders sequentially by chaining the deferreds:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
Scrapy does not provide any built-in facility for distributed (multi-server) crawling. There are still ways to distribute crawls, though, and they depend on how you plan to distribute them.
If you have many spiders, the simplest way to distribute the load is to launch multiple Scrapyd instances and spread the spider runs across different machines.
If you instead want to run a single spider across many machines, you can partition the URLs to crawl and send each partition to a separate spider run. For example:
First, prepare the list of URLs to crawl and split them into separate files/URLs:
http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
Then schedule a spider run on 3 different Scrapyd servers. The spider receives a (spider) argument part, which indicates the partition to crawl:
curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
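The commands above only show how the runs are scheduled; a minimal sketch of how such a spider might consume the part argument is given below. The file naming scheme and method names are assumptions for illustration, not part of the original example:

import scrapy

class Spider1(scrapy.Spider):
    # Hypothetical spider showing one way to use the 'part' argument.
    name = 'spider1'

    def __init__(self, part=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.part = part

    def start_requests(self):
        # Download the URL list for this partition first.
        list_url = 'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % self.part
        yield scrapy.Request(list_url, callback=self.parse_url_list)

    def parse_url_list(self, response):
        # Crawl every URL listed in the partition file.
        for url in response.text.splitlines():
            if url.strip():
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        # Actual page parsing would go here.
        pass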
Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure.
Here are some tips to keep in mind when dealing with these kinds of sites:
- Disable cookies (see the COOKIES_ENABLED setting), as some sites may use cookies to spot crawler behaviour.
- Use download delays (2 or higher); see the DOWNLOAD_DELAY setting.
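As an illustration of those two settings, a project's settings.py could be adjusted as follows; the concrete values are only examples:

# settings.py -- example values only
COOKIES_ENABLED = False  # stop sending/receiving cookies so the site cannot track the crawler that way
DOWNLOAD_DELAY = 2       # wait at least 2 seconds between requests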
For some applications, the structure of items is controlled by user input or by other changing conditions. In such cases you can create item classes dynamically:
from scrapy.item import DictItem, Field

def create_item_class(class_name, field_list):
    fields = {field_name: Field() for field_name in field_list}
    return type(class_name, (DictItem,), {'fields': fields})
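As a quick usage sketch, the factory above could be called like this; the class name and field names are made up for the example:

# Build an item class whose fields are decided at runtime.
ProductItem = create_item_class('ProductItem', ['name', 'price', 'stock'])

item = ProductItem(name='Widget', price='9.99')
item['stock'] = 42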