The zlib module and a crawler example with the http module

Date: 2019-11-10

The zlib module

zlib is the built-in compression module of Node.js: it packs a file and produces a compressed archive

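The stream concept
- In Node.js, data is transferred as a stream of chunks
Steps
1. Create a readable stream with fs
2. Create an empty gzip stream
3. Create a writable stream
4. Pass the data along through a pipe
pipe
- The pipe that connects one I/O stream to another; here it is called a pipe stream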

Code:

    const zlib = require('zlib') // zlib is the built-in compression module
    const fs = require('fs') // fs is the file system module

    const inp = fs.createReadStream('./dist/1.txt') // create the readable stream
    const out = fs.createWriteStream('1.txt.gz') // create the writable stream
    const gzip = zlib.createGzip() // create an empty gzip stream

    inp
      .pipe(gzip)
      .pipe(out)

Run it with:

    $ node <filename>
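The pipe chain above does not report errors from the individual streams. As a minimal sketch (not the original author's code), the same steps can be wrapped with `stream.pipeline`, which forwards an error from any stage to one callback, and mirrored with `zlib.createGunzip()` to unpack the archive again. The helper names `gzipFile` and `gunzipFile` and the temporary file paths are assumptions made for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const zlib = require('zlib');
const { pipeline } = require('stream');

// Hypothetical helper: compress src into dest.
// done(err) is called once, with an error if any stream in the chain fails.
function gzipFile(src, dest, done) {
  pipeline(fs.createReadStream(src), zlib.createGzip(), fs.createWriteStream(dest), done);
}

// Hypothetical helper: the reverse direction, unpacking a .gz file.
function gunzipFile(src, dest, done) {
  pipeline(fs.createReadStream(src), zlib.createGunzip(), fs.createWriteStream(dest), done);
}

// Demo on a throwaway file in the system temp directory (paths are illustrative).
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'zlib-demo-'));
const plain = path.join(dir, '1.txt');
const packed = path.join(dir, '1.txt.gz');
const restored = path.join(dir, '1.restored.txt');
fs.writeFileSync(plain, 'hello zlib\n');

gzipFile(plain, packed, (err) => {
  if (err) throw err;
  gunzipFile(packed, restored, (err2) => {
    if (err2) throw err2;
    // Logs whether the round trip preserved the content.
    console.log(fs.readFileSync(restored, 'utf8') === 'hello zlib\n');
  });
});
```

With plain `.pipe()` calls, a failure in the middle of the chain (for example a missing input file) would have to be handled on each stream separately.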

The console module

Under the hood, console.log writes to process.stdout
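To make that relationship concrete, here is a rough sketch of what console.log does: format the arguments, append a newline, and write the result to a stream. The function `log` is a hypothetical stand-in written for this illustration, not the actual Node.js implementation.

```javascript
const util = require('util');

// Hypothetical stand-in for console.log: format the arguments,
// add a newline, and write the result to the given stream.
function log(stream, ...args) {
  stream.write(util.format(...args) + '\n');
}

log(process.stdout, 'status: %d', 200); // writes "status: 200" and a newline
console.log('status: %d', 200);         // prints the same line
```

Taking the stream as a parameter also makes it easy to send the same output somewhere else, such as a file stream.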

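Crawler example

Crawlers

A crawler uses a back-end language to fetch data from a website, cleans that data with a dedicated module, and finally outputs it to the front end.
Not every website can be crawled.
Basic components
1. Program entry point
2. Request module
3. Data parsing
Program entry point
- The entry point can be a web page, which can also display the crawled data and the analysis results.
Request module
- Requests are sent over http(s), and there are two ways to send them: get and request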

This example uses get. The code is as follows:

    const http = require('http');
    const cheerio = require('cheerio');

    http.get('http://nodejs.org/dist/index.json', (res) => {
      const { statusCode } = res; // status code, 1xx - 5xx
      const contentType = res.headers['content-type']; // content type: text/json/html/xml

      let error; // report an error if the status code is not 200 or the type is not JSON
      if (statusCode !== 200) {
        error = new Error('Request Failed.\n' +
                          `Status Code: ${statusCode}`);
      } else if (!/^application\/json/.test(contentType)) {
        error = new Error('Invalid content-type.\n' +
                          `Expected application/json but received ${contentType}`);
      }
      if (error) {
        console.error(error.message);
        // consume response data to free up memory
        res.resume(); // discard the rest of the response body
        return;
      }

      res.setEncoding('utf8'); // character encoding
      // ... the callback continues in the data-parsing code further down

An options object, when one is passed, holds the details of the URL to crawl and the request headers.
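For comparison with get, the request style with an options object might look like the sketch below. The helpers `buildOptions` and `fetchText` and the User-Agent value are assumptions made for this illustration; only `http.request` and its option fields (`hostname`, `port`, `path`, `method`, `headers`) come from Node.js itself.

```javascript
const http = require('http');

// Hypothetical helper: split a URL into the fields http.request expects
// and attach the request headers.
function buildOptions(targetUrl, headers = {}) {
  const url = new URL(targetUrl);
  return {
    hostname: url.hostname,
    port: Number(url.port) || 80, // URL.port is '' for the default port
    path: url.pathname + url.search,
    method: 'GET',
    headers,
  };
}

// Hypothetical helper: send the request and collect the body as a string.
function fetchText(options) {
  return new Promise((resolve, reject) => {
    const req = http.request(options, (res) => {
      res.setEncoding('utf8');
      let rawData = '';
      res.on('data', (chunk) => { rawData += chunk; });
      res.on('end', () => resolve({ statusCode: res.statusCode, body: rawData }));
    });
    req.on('error', reject);
    req.end(); // unlike http.get, http.request must be ended explicitly
  });
}

const options = buildOptions('http://nodejs.org/dist/index.json', {
  'User-Agent': 'example-crawler', // assumed header value
});
console.log(options.hostname, options.path);
```

Calling `fetchText(options).then(({ statusCode, body }) => ...)` would then perform the actual request; it is left out here so that the sketch runs without network access.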

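If the target is in HTML format, the following code can be omitted (its content-type check would reject anything that is not JSON):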

    let error; // report an error if the status code is not 200 or the type is not JSON
    if (statusCode !== 200) {
      error = new Error('Request Failed.\n' +
                        `Status Code: ${statusCode}`);
    } else if (!/^application\/json/.test(contentType)) {
      error = new Error('Invalid content-type.\n' +
                        `Expected application/json but received ${contentType}`);
    }
    if (error) {
      console.error(error.message);
      // consume response data to free up memory
      res.resume(); // discard the rest of the response body
      return;
    }

Data parsing

Pass the crawled data to cheerio, then display or save the result

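      res.setEncoding('utf8'); // character encoding
      // core -- start
      let rawData = '';
      res.on('data', (chunk) => { rawData += chunk; }); // concatenate the chunks
      res.on('end', () => { // all data has been received
        try {
          const $ = cheerio.load(rawData)
          // each() passes (index, element); `this` is the current element
          $('td.student a').each(function (index) {
            console.log($(this).text())
          })
        } catch (e) {
          console.error(e.message);
        }
      });
      // core -- end
    }).on('error', (e) => {
      console.error(`Got error: ${e.message}`);
    });
    // http.get() ends the request automatically, so no req.end() call is needed
    // (and no variable named req exists here)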
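The URL used in the request returns JSON, while a selector such as `td.student a` only matches elements of an HTML page. When the body is JSON, the parsing step could be sketched with `JSON.parse` instead. The helper name `listLtsVersions`, the `version` and `lts` field names, and the sample data are assumptions made for this illustration and should be checked against the real response.

```javascript
// Hypothetical helper: pull the version strings of LTS releases
// out of the raw response body.
function listLtsVersions(rawData) {
  const releases = JSON.parse(rawData); // throws on malformed JSON
  return releases
    .filter((release) => release.lts) // assumed: false, or a codename string
    .map((release) => release.version);
}

// Made-up sample data standing in for the real response body.
const sample = JSON.stringify([
  { version: 'v2.0.0', lts: 'Example' },
  { version: 'v1.0.0', lts: false },
]);
console.log(listLtsVersions(sample)); // [ 'v2.0.0' ]
```

Inside the 'end' handler, `listLtsVersions(rawData)` would take the place of the cheerio calls, and the existing try/catch would still catch a parse failure.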

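Anti-crawling

Place an image inside the tag in place of its text content, so that a crawler reading the tag's text has nothing to extract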