Node.js Crawler in Practice - Scraping Images and Downloading Them Locally

Date: 2020-07-10


Preface

A crawler should obey the robots protocol (robots.txt).
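
As a rough illustration of what obeying robots.txt involves, the sketch below collects the Disallow rules that apply to the wildcard user agent and checks a path against them. It is deliberately simplified (plain-prefix Disallow rules only, no Allow rules or wildcards), and the function names and the sample rules are made up for this example.

```javascript
// Simplified robots.txt check: only "User-agent: *" groups and plain-prefix
// "Disallow" rules are handled. Real parsers also support Allow, wildcards, etc.
function parseDisallowRules(robotsTxt) {
  const rules = [];
  let appliesToUs = false; // does the current group apply to "*"?
  let inUaRun = false;     // are we inside consecutive User-agent lines?
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const idx = line.indexOf(':');
    if (!line || idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules
      appliesToUs = inUaRun ? (appliesToUs || value === '*') : value === '*';
      inUaRun = true;
    } else {
      inUaRun = false;
      if (field === 'disallow' && appliesToUs && value) rules.push(value);
    }
  }
  return rules;
}

function isAllowed(pathname, rules) {
  return !rules.some((prefix) => pathname.startsWith(prefix));
}

// Hypothetical robots.txt content, for illustration only
const sampleRobots = [
  'User-agent: *',
  'Disallow: /admin/',
  'Disallow: /private',
  '',
  'User-agent: SomeBot',
  'Disallow: /',
].join('\n');

const rules = parseDisallowRules(sampleRobots);
console.log(isAllowed('/news/1.html', rules)); // true
console.log(isAllowed('/admin/login', rules)); // false
```

In a real crawler the text would come from the site's /robots.txt, and a maintained parser library is a safer choice than this sketch.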

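What is a crawler

Quoting Baidu Baike:

A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm.

Put simply, a crawler lets a machine fetch the information you want automatically. Say you visit a website and find a lot of nice pictures, so you right-click and save them locally. After saving a few, you start to wonder: why not write a script that downloads these images automatically? And so the crawler was born...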

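Common crawler types

Server-side rendered pages (SSR): the server returns HTML that has already been rendered
Client-side rendered pages (CSR): typical single-page applications are rendered on the client

The second type has to be crawled by analyzing the API endpoints the page calls. This article covers the first type, using Node.js to fetch remote images and download them locally.

Final result: [image not preserved]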

Preparation

1 Directory layout

┌── cache
│   └── image    image directory
├── app.js
└── package.json

2 Install dependencies

axios: HTTP request library

npm i axios --save

cheerio: a server-side 'jQuery'

npm i cheerio --save

fs: file system module. It is built into Node.js (as is path), so there is nothing to install.

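Start crawling

We crawl an outdoor sports website: fetch the images recommended on its home page and download them locally.

1 Workflow analysis

Analyze the page structure and decide what to crawl
Send an HTTP request from Node to fetch the page content
Use cheerio to extract the array of images
Iterate over the image array and download each image locally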

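2 Write the code

Fetch the HTML with axios. Inspecting it shows that the images live in the 'newsimg' block. cheerio works almost exactly like jQuery, so use it to pull out each image's title and download link.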

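const axios = require('axios');
const cheerio = require('cheerio');

// The lines below run inside an async method of the crawler class
const res = await axios.get(target_url);
const html = res.data;
const $ = cheerio.load(html);
const result_list = [];
// cheerio's each() passes (index, element), like jQuery
$('.newscon').each((index, element) => {
  result_list.push({
    title: $(element).find('.newsintroduction').text(),
    // keep only the part of src before '!'
    down_loda_url: $(element).find('img').attr('src').split('!')[0],
  });
});
this.result_list.push(...result_list);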
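
The src value read from the page is not always directly downloadable. The split('!')[0] above cuts off whatever follows '!' (apparently a thumbnail suffix), and src values can also be relative or protocol-relative. The sketch below shows one way to normalize them with the built-in URL class; the helper name toDownloadUrl and the example.com URLs are hypothetical.

```javascript
// Normalize an <img> src into an absolute download URL.
// src: raw attribute value; pageUrl: URL of the page it was found on.
function toDownloadUrl(src, pageUrl) {
  const withoutSuffix = src.split('!')[0];   // drop the part after '!'
  return new URL(withoutSuffix, pageUrl).href; // resolve relative paths
}

console.log(toDownloadUrl('/upload/a.jpg!300x200', 'https://example.com/index.html'));
// https://example.com/upload/a.jpg
console.log(toDownloadUrl('//img.example.com/b.png', 'https://example.com/'));
// https://img.example.com/b.png
```

Absolute URLs pass through unchanged, so the helper is safe to apply to every src.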

We now have an array of download links. Next, iterate over it, request each link, and save the response to disk with fs.

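const { pipeline } = require('stream/promises'); // Node.js 15+
const target_path = path.resolve(__dirname, `./cache/image/${href.split('/').pop()}`);
const response = await axios.get(href, { responseType: 'stream' });
// pipe() returns the destination stream, not a promise, so await pipeline() instead
await pipeline(response.data, fs.createWriteStream(target_path));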
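
Saving a stream to disk has two details that are easy to get wrong: the write must actually finish before the next step runs, and the file name taken from the URL should not include a query string. The following is a minimal sketch using only built-in modules (stream/promises requires Node.js 15 or later). The helper names fileNameFromUrl and saveStream, the sample URL, and the temporary directory are assumptions for illustration, and an in-memory Readable stands in for response.data.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { Readable } = require('stream');
const { pipeline } = require('stream/promises');

// Derive a file name from the URL path, ignoring any query string.
function fileNameFromUrl(href) {
  return path.basename(new URL(href).pathname);
}

// Write a readable stream to disk. Resolves only after the write has
// finished; rejects if either side of the pipeline errors.
async function saveStream(readable, target_path) {
  await fs.promises.mkdir(path.dirname(target_path), { recursive: true });
  await pipeline(readable, fs.createWriteStream(target_path));
}

// Demo: an in-memory stream plays the role of response.data
async function demo() {
  const name = fileNameFromUrl('https://example.com/upload/a.jpg?v=2');
  const target = path.join(os.tmpdir(), 'crawler-demo', name);
  await saveStream(Readable.from([Buffer.from('fake image bytes')]), target);
  console.log(`saved ${name} to ${target}`);
}

demo().catch(console.error);
```

Creating the target directory up front also avoids a failure when cache/image does not exist yet.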

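3 Request optimization

Requesting too often can get your IP banned. A few simple countermeasures:

Avoid bursts of requests; wait for an interval between requests
Set the User-Agent in an axios interceptor, using a different one for each request
Use an IP pool, so that each request goes out from a different IP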
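
The first two ideas can be sketched as follows: a randomized delay with a lower bound, and a function that sets a randomly chosen User-Agent on a request config. The function names, the delay bounds, and the User-Agent strings are placeholders chosen for this example; withRandomUserAgent is shaped so that it can be passed to axios.interceptors.request.use.

```javascript
// Placeholder User-Agent strings; replace with real, current ones.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0',
  'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0',
];

// Random integer delay in [min, max) milliseconds, so that two requests
// are never sent back to back.
function randomDelay(min = 1000, max = 3000) {
  return min + Math.floor(Math.random() * (max - min));
}

// Resolve after ms milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Request interceptor: set a randomly chosen User-Agent on the config.
function withRandomUserAgent(config) {
  const ua = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
  config.headers = config.headers || {};
  config.headers['User-Agent'] = ua;
  return config;
}

// With axios installed, the interceptor would be registered like this:
// axios.interceptors.request.use(withRandomUserAgent);

const config = withRandomUserAgent({ url: 'https://example.com/', headers: {} });
console.log(config.headers['User-Agent']);
console.log(`next request in ${randomDelay()} ms`);
```

Between downloads, await sleep(randomDelay()) would then replace a fixed pause. Rotating IPs needs a proxy pool, which is outside the scope of this sketch.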

Complete code

const fs = require('fs');
const path = require('path');
const { pipeline } = require('stream/promises'); // Node.js 15+
const axios = require('axios');
const cheerio = require('cheerio');

class stealData {

  constructor(base_url = '') {
    this.base_url = base_url; // site to crawl
    this.current_page = 1;
    this.result_list = [];
  }

  async init() {
    try {
      await this.getPageData();
      await this.downLoadPictures();
    } catch (e) {
      console.log(e);
    }
  }

  // Resolve after `time` milliseconds
  sleep(time) {
    return new Promise((resolve) => {
      console.log(`Sleeping; next request in ${time / 1000} seconds......`)
      setTimeout(resolve, time);
    });
  }

  // Fetch the page and collect { title, down_loda_url } for each image
  async getPageData() {
    const target_url = this.base_url;
    try {
      const res = await axios.get(target_url);
      const html = res.data;
      const $ = cheerio.load(html);
      const result_list = [];
      $('.newscon').each((index, element) => {
        const src = $(element).find('img').attr('src');
        if (!src) return; // skip blocks without an image
        result_list.push({
          title: $(element).find('.newsintroduction').text(),
          down_loda_url: src.split('!')[0],
        });
      });
      this.result_list.push(...result_list);
      return result_list;
    } catch (e) {
      console.log('Failed to fetch page data');
      throw e;
    }
  }

  // Download the collected images one by one, pausing between requests
  async downLoadPictures() {
    const result_list = this.result_list;
    try {
      for (let i = 0, len = result_list.length; i < len; i++) {
        console.log(`Downloading image ${i + 1}!`);
        await this.downLoadPicture(result_list[i].down_loda_url);
        await this.sleep(3000 * Math.random());
        console.log(`Image ${i + 1} downloaded!`);
      }
    } catch (e) {
      console.log('Failed to download images');
      throw e;
    }
  }

  // Stream a single image to ./cache/image/<file name>
  async downLoadPicture(href) {
    try {
      const target_path = path.resolve(__dirname, `./cache/image/${href.split('/').pop()}`);
      const response = await axios.get(href, { responseType: 'stream' });
      // pipe() does not return a promise; pipeline() resolves once the write has finished
      await pipeline(response.data, fs.createWriteStream(target_path));
      console.log('File written');
    } catch (e) {
      console.log('Failed to write file');
      throw e;
    }
  }

}

const thief = new stealData('xxx_url');
thief.init();