Talk abstract:
Web crawling is a hard problem and the web is messy. There is no shortage of semantic web standards -- basically, everyone has one. How do you make sense of the noise of our web of billions of pages?
This talk presents two key technologies that can be used: Scrapy, an open source & scalable web crawling framework, and Mr. Schemato, a new, open source semantic web validator and distiller.
The talk video is on Vimeo, and the slides can be viewed on Speaker Deck or opened directly here in the browser. The slides were made with reST and S5; the source is on GitHub.
The speaker is Andrew Montalenti, co-founder/CTO of Parse.ly.
My takeaways:
His take on the distinction between three crawling-related verbs: crawling, spidering, and scraping
Parse.ly keeps more than 1TB of production data in memory
Development and testing run on Scrapy Cloud; production runs on Rackspace Cloud
A live demo of building a custom crawler on top of Scrapy
A demo of how they use Scrapy Cloud
An introduction to their open source project: Schemato - the unified validator for the next generation of metadata
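The crawling/spidering/scraping distinction mentioned above can be sketched with nothing but the Python standard library. This is my own toy illustration, not code from the talk: the `SITE` dict stands in for real HTTP fetches, so the three verbs can be seen side by side without network access or Scrapy installed.

```python
# A toy illustration of three verbs (not from the talk):
#   crawling  = fetching pages
#   spidering = following the links a page contains
#   scraping  = extracting structured data (here, the <title>)
from html.parser import HTMLParser

# Hypothetical in-memory "site" keyed by URL; a real crawler fetches over HTTP.
SITE = {
    "/": '<html><head><title>Home</title></head>'
         '<body><a href="/a">A</a><a href="/b">B</a></body></html>',
    "/a": '<html><head><title>Page A</title></head>'
          '<body><a href="/">home</a></body></html>',
    "/b": '<html><head><title>Page B</title></head><body></body></html>',
}

class PageParser(HTMLParser):
    """Collects the <title> text (scraping) and outgoing hrefs (for spidering)."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = data

def crawl(start):
    """Breadth-first walk: fetch each page once, follow its links, keep titles."""
    seen, queue, titles = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        parser = PageParser()
        parser.feed(SITE[url])       # "fetch" the page   (crawling)
        titles[url] = parser.title   # extract the title  (scraping)
        queue.extend(parser.links)   # enqueue new links  (spidering)
    return titles

print(crawl("/"))  # {'/': 'Home', '/a': 'Page A', '/b': 'Page B'}
```

Scrapy bundles all three roles into one `Spider` class, where `parse()` both yields scraped items and yields follow-up requests; the loop above is the hand-rolled version of that idea.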
Author: czhang