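I recently needed to implement a proxy server for scraping H5 games, and while researching I came across this good article (http://blog.miguelgrinberg.com/post/easy-web-scraping-with-nodejs). Since I am also learning Node.js at the moment, I decided to translate it to reinforce what I learned.
Please credit the source when reposting: (www.cnblogs.com/xdxer)
In this article, I will show you how to write a web scraping script in JavaScript with Node.js.
In most cases, a web scraping script only needs a way to download a whole page and then search it for data. Every modern language provides a way to download web pages, or at least someone has implemented a library or some other extension package for it, so this part is not difficult. However, precisely locating and extracting the data inside the HTML is harder. An HTML page mixes content, layout, and style together, so interpreting it and identifying the parts we care about is not an easy job.
For example, consider the following HTML page: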
<html>
    <head>...</head>
    <body>
        <div id="content">
            <div id="sidebar">
            ...
            </div>
            <div id="main">
                <div class="breadcrumbs">
                ...
                </div>
                <table id="data">
                    <tr><th>Name</th><th>Address</th></tr>
                    <tr><td class="name">John</td><td class="address">Address of John</td></tr>
                    <tr><td class="name">Susan</td><td class="address">Address of Susan</td></tr>
                </table>
            </div>
        </div>
    </body>
</html>
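If we need to get the names that appear in the table with id="data", how should we do it?
Typically, the page is downloaded as a string, and then it is enough to search that string for the pieces of text that come after <td class="name"> and end with </td>.
But this approach can easily pick up the wrong data. The page may contain other tables, or worse, the original <td class="name"> may become <td align="left" class="name">, in which case the approach above finds nothing at all. Changes to a page can easily break a scraping script, but if the script relies on how the elements are organized in the HTML rather than on the exact text of the tags, we do not have to rewrite it every time the page changes.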
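To make the fragility concrete, here is a minimal sketch of the string-search approach described above. The function name namesByStringSearch and the two sample rows are made up for illustration; they are not from the original article.

```javascript
// Naive extraction: find the literal opening tag and read up to the next </td>.
function namesByStringSearch(html) {
  var open = '<td class="name">';
  var close = '</td>';
  var names = [];
  var pos = html.indexOf(open);
  while (pos !== -1) {
    var start = pos + open.length;
    var end = html.indexOf(close, start);
    if (end === -1) break;            // unterminated cell, stop searching
    names.push(html.slice(start, end));
    pos = html.indexOf(open, end);    // continue after this cell
  }
  return names;
}

// Hypothetical rows: the second differs only by an extra attribute.
var original = '<tr><td class="name">John</td><td class="address">Address of John</td></tr>';
var changed = '<tr><td align="left" class="name">John</td><td class="address">Address of John</td></tr>';

console.log(namesByStringSearch(original)); // [ 'John' ]
console.log(namesByStringSearch(changed));  // []
```

The second call returns nothing even though the name is still on the page, because the search depends on the exact spelling of the tag rather than on the structure of the document.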
If you have written front-end JavaScript and used jQuery, you know that selecting DOM elements with CSS selectors is very easy. For example, in the browser we can collect those names like this:
$('#data .name').each(function() {
    alert($(this).text());
});
http://nodejs.org (get it here!)
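JavaScript is the language embedded in web browsers. Thanks to the Node.js project, it can now also be used to write programs that run standalone, and that can even act as a web server.
There are many ready-made libraries, such as jQuery. That makes JavaScript plus Node.js very convenient for this task, because we can reuse the existing techniques for manipulating DOM elements, which are already mature in web browsers.
Node.js is modular and has many libraries. This example needs two of them: request and cheerio. request downloads the web pages; cheerio builds a DOM tree locally and provides a subset of jQuery to operate on it. Node.js modules are installed with npm, which is similar to Ruby's gem or Python's easy_install.
For some of the cheerio API, see this article from the CNode community (https://cnodejs.org/topic/5203a71844e76d216a727d2e)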
$ mkdir scraping
$ cd scraping
$ npm install request cheerio
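As the commands above show, we first create a directory named "scraping" and then install the request and cheerio modules inside it. Node.js modules can also be installed globally, but I prefer local installs. The result of the installation is shown in the figure below.
Next, let's see how to use cheerio to scrape the names from the example above. We create a file named example.js with the following code: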
var cheerio = require('cheerio');

// The page is built from concatenated pieces so the string literal stays valid JavaScript
var $ = cheerio.load('<html><head></head><body><div id="content">' +
    '<div id="sidebar"></div><div id="main">' +
    '<div id="breadcrumbs"></div><table id="data"><tr>' +
    '<th>Name</th><th>Address</th></tr><tr><td class="name">' +
    'John</td><td class="address">Address of John</td></tr>' +
    '<tr><td class="name">Susan</td><td class="address">' +
    'Address of Susan</td></tr></table></div></div></body></html>');

// Print the text of every .name cell inside #data
$('#data .name').each(function() {
    console.log($(this).text());
});
The output is as follows:
$ node example.js
John
Susan
Next, we scrape the schedule from this site: http://www.thprd.org/schedules/schedule.cfm?cs_id=15
The code is as follows:
var request = require('request');
var cheerio = require('cheerio');

var days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'];
// Pool name -> cs_id value used in the schedule URL
var pools = {
    'Aloha': 3,
    'Beaverton': 15,
    'Conestoga': 12,
    'Harman': 11,
    'Raleigh': 6,
    'Somerset': 22,
    'Sunset': 5,
    'Tualatin Hills': 2
};

for (var pool in pools) {
    var url = 'http://www.thprd.org/schedules/schedule.cfm?cs_id=' + pools[pool];
    // The immediately invoked function captures the current pool name for the asynchronous callback
    request(url, (function(pool) {
        return function(err, resp, body) {
            if (err) { console.error(pool + ': ' + err); return; }
            var $ = cheerio.load(body);
            // day is the index of the column, which maps to an entry in days
            $('#calendar .days td').each(function(day) {
                $(this).find('div').each(function() {
                    // Collapse runs of whitespace into commas, then split into fields
                    var event = $(this).text().trim().replace(/\s\s+/g, ',').split(',');
                    if (event.length >= 2 && (event[1].match(/open swim/i) || event[1].match(/family swim/i)))
                        console.log(pool + ',' + days[day] + ',' + event[0] + ',' + event[1]);
                });
            });
        };
    })(pool));
}
The output is as follows:
$ node thprd.js
Conestoga,Monday,4:15p-5:15p,Open Swim - M/L
Conestoga,Monday,7:45p-9:00p,Open Swim - M/L
Conestoga,Tuesday,7:30p-9:00p,Open Swim - M/L
Conestoga,Wednesday,4:15p-5:15p,Open Swim - M/L
Conestoga,Wednesday,7:45p-9:00p,Open Swim - M/L
Conestoga,Thursday,7:30p-9:00p,Open Swim - M/L
Conestoga,Friday,6:30p-8:30p,Open Swim - M/L
Conestoga,Saturday,1:00p-4:15p,Open Swim - M/L
Conestoga,Sunday,2:00p-4:15p,Open Swim - M/L
Aloha,Monday,1:05p-2:20p,Open Swim
Aloha,Monday,7:50p-8:25p,Open Swim
Aloha,Tuesday,1:05p-2:20p,Open Swim
Aloha,Tuesday,8:45p-9:30p,Open Swim
Aloha,Wednesday,1:05p-2:20p,Open Swim
Aloha,Wednesday,7:50p-8:25p,Open Swim
Aloha,Thursday,1:05p-2:20p,Open Swim
Aloha,Thursday,8:45p-9:30p,Open Swim
Aloha,Friday,1:05p-2:20p,Open Swim
Aloha,Friday,7:50p-8:25p,Open Swim
Aloha,Saturday,2:00p-3:30p,Open Swim
Aloha,Saturday,4:30p-6:00p,Open Swim
Aloha,Sunday,2:00p-3:30p,Open Swim
Aloha,Sunday,4:30p-6:00p,Open Swim
Harman,Monday,4:25p-5:30p,Open Swim*
Harman,Monday,7:30p-8:55p,Open Swim
Harman,Tuesday,4:25p-5:10p,Open Swim*
Harman,Wednesday,4:25p-5:30p,Open Swim*
Harman,Wednesday,7:30p-8:55p,Open Swim
Harman,Thursday,4:25p-5:10p,Open Swim*
Harman,Friday,2:00p-4:55p,Open Swim*
Harman,Saturday,1:30p-2:25p,Open Swim
Harman,Sunday,2:00p-2:55p,Open Swim
Beaverton,Tuesday,10:45a-12:55p,Open Swim (No Diving Well)
Beaverton,Tuesday,8:35p-9:30p,Open Swim No Diving Well
Beaverton,Thursday,10:45a-12:55p,Open Swim (No Diving Well)
Beaverton,Thursday,8:35p-9:30p,Open Swim No Diving Well
Beaverton,Saturday,2:30p-4:00p,Open Swim
Beaverton,Sunday,4:15p-6:00p,Open Swim
Sunset,Tuesday,1:00p-2:30p,Open Swim/One Lap Lane
Sunset,Thursday,1:00p-2:30p,Open Swim/One Lap Lane
Sunset,Sunday,1:30p-3:00p,Open Swim/One Lap Lane
Tualatin Hills,Monday,7:35p-9:00p,Open Swim-Diving area opens at 8pm
Tualatin Hills,Wednesday,7:35p-9:00p,Open Swim-Diving area opens at 8pm
Tualatin Hills,Sunday,1:30p-3:30p,Open Swim
Tualatin Hills,Sunday,4:00p-6:00p,Open Swim
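Each output line comes from the text of one schedule cell, which the script trims, collapses, and splits into fields. The sketch below isolates that step. The helper names parseEvent and isOpenOrFamilySwim and the sample cell texts are invented for illustration, and the whitespace on the real page may differ.

```javascript
// Turn the raw text of a schedule cell into fields: runs of two or more
// whitespace characters become a single comma, then the string is split.
function parseEvent(text) {
  return text.trim().replace(/\s\s+/g, ',').split(',');
}

// Keep only entries whose second field mentions open swim or family swim.
function isOpenOrFamilySwim(fields) {
  return fields.length >= 2 && /open swim|family swim/i.test(fields[1]);
}

// Hypothetical cell texts, as they might look after cheerio's .text()
var swimCell = '\n    4:15p-5:15p\n        Open Swim - M/L\n  ';
var lapCell = '  6:00a-8:00a\n   Lap Swim ';

console.log(parseEvent(swimCell));                     // [ '4:15p-5:15p', 'Open Swim - M/L' ]
console.log(isOpenOrFamilySwim(parseEvent(swimCell))); // true
console.log(isOpenOrFamilySwim(parseEvent(lapCell)));  // false
```

Single spaces inside a description survive because the pattern only matches two or more whitespace characters. A description that itself contains a comma would be split into extra fields, which is one limit of this approach.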
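A few points to note: the variable scoping issue in asynchronous JavaScript, and the analysis of the site's structure. I will cover these in other blog posts.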
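As a rough illustration of the scoping issue, the sketch below replaces request with a hypothetical fakeRequest that only stores its callback and runs it later, which is enough to show why the scraping script wraps its callback in a function that receives pool as an argument. All names here are made up for the demonstration.

```javascript
var pools = { Aloha: 3, Beaverton: 15 };

// Hypothetical stand-in for request(): it stores the callback and runs it
// later, the way a real network response arrives after the loop has ended.
var pending = [];
function fakeRequest(url, callback) {
  pending.push(callback);
}

// Without a wrapper, every callback shares the same loop variable.
var seenWithout = [];
for (var pool in pools) {
  fakeRequest('cs_id=' + pools[pool], function() {
    seenWithout.push(pool);
  });
}

// With an immediately invoked function, each callback gets its own copy.
var seenWith = [];
for (var name in pools) {
  fakeRequest('cs_id=' + pools[name], (function(captured) {
    return function() {
      seenWith.push(captured);
    };
  })(name));
}

// Simulate the responses arriving after both loops have finished.
pending.forEach(function(callback) { callback(); });

console.log(seenWithout); // [ 'Beaverton', 'Beaverton' ]
console.log(seenWith);    // [ 'Aloha', 'Beaverton' ]
```

In JavaScript versions that support it (ES2015 and later), declaring the loop variable with let instead of var also gives each iteration its own binding, so the wrapper function is no longer needed.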
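In fact, I have only translated a small part of the article. If you are interested, have a look at the original, which explains every step in detail.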