Things that will come in handy
1 hashlib module -- hashing
update(string.encode('utf-8'))  m.hexdigest()
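A minimal sketch of the two calls above (the input string and its digest are just illustrative):

```python
import hashlib

m = hashlib.md5()
m.update('password'.encode('utf-8'))  # update() wants bytes, hence the encode
print(m.hexdigest())                  # 32-char hex digest: 5f4dcc3b5aa765d61d8327deb882cf99
```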
2 requests module
https://blog.csdn.net/shanzhizi/article/details/50903748
r = requests.get(url, params={}, headers={}, cookies=cookies, proxies=proxies)
cookies and proxies are both dicts
Search keywords go into the params argument as a dict
param = {'wd': '火影'}
r = requests.get('https://www.baidu.com/s', params=param)
print(r.status_code)
print(r.url)

Baidu has no anti-crawling measures (Sogou does), so Baidu makes for a simple demo. params holds the xx=xx pairs that come after the ? in a GET URL.
r = requests.post(url, data={}, headers={})
headers = {
'content-type':
'User-Agent':
'Referer':
'Cookie':
'Host':
}
r.encoding = ''  sets a custom encoding, used when decoding the text content. Works hand in hand with r.text.
r.text  the response body as a string
r.content  the response body as bytes
r.status_code
r.request  gives access to the request that was sent
import requests

url = 'https://www.cnblogs.com/654321cc/p/11013243.html'
headers = {
    'User-Agent': 'User-Agent',
}
r = requests.get(url=url, headers=headers)

# request headers
print(r.request.headers)
# response headers
print(r.headers)

# cookies sent with the request
print(r.request._cookies)
# cookies set by the response
print(r.cookies)
r.headers  stores the server's response headers as a dict-like object. Not sure yet when it will come in handy.
r.cookies
r.history
When is r.history useful? Sometimes with 302 redirects. A 302 Found status means the target resource has temporarily moved to another URI; because the redirect is temporary, the client should keep using the original URI in later requests. The server puts the new URI in the Location field of the response header, and the browser can use it to redirect automatically. r.history lets you see the status codes of the pages before the redirect, i.e. r.status_code alone can be misleading. The allow_redirects parameter disables automatic redirects:

r = requests.get('http://www.baidu.com/link?url=QeTRFOS7TuUQRppa0wlTJJr6FfIYI1DJprJukx4Qy0XnsDO_s9baoO8u1wvjxgqN', allow_redirects=False)
>>> r.status_code
302
r.headers  response headers
r.request.headers  request headers
Spoofing the request headers
headers = {'User-Agent': 'liquid'}
r = requests.get('http://www.zhidaow.com', headers=headers)
print(r.request.headers['User-Agent'])
Session objects
s = requests.Session()
s.get()
s.post()
A session object lets you persist certain parameters across requests. Most usefully, cookies persist across all requests made from the same Session instance, and this is all handled automatically. Very convenient.

import requests

headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Encoding': 'gzip, deflate, compress',
           'Accept-Language': 'en-us;q=0.5,en;q=0.3',
           'Cache-Control': 'max-age=0',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}

s = requests.Session()
s.headers.update(headers)
# s.auth = ('superuser', '123')
s.get('https://www.kuaipan.cn/account_login.htm')

_URL = 'http://www.kuaipan.cn/index.php'
s.post(_URL, params={'ac': 'account', 'op': 'login'},
       data={'username': '****@foxmail.com', 'userpwd': '********', 'isajax': 'yes'})
r = s.get(_URL, params={'ac': 'zone', 'op': 'taskdetail'})
print(r.json())
s.get(_URL, params={'ac': 'common', 'op': 'usersign'})
Timeouts and exceptions
The timeout parameter
r = requests.get('https://m.hcomic1.com',timeout = 1)
3 json module -- a lightweight data-interchange format
Files: dump / load
Strings: dumps / loads
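A quick sketch of all four calls (the data and the file name are made up):

```python
import json

data = {'name': 'foo', 'length': 3}

s = json.dumps(data)            # dict -> JSON string
assert json.loads(s) == data    # JSON string -> dict

with open('a.json', 'w', encoding='utf-8') as f:
    json.dump(data, f)          # dict -> file
with open('a.json', encoding='utf-8') as f:
    assert json.load(f) == data  # file -> dict
```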
4 re module
re.S extends the reach of "." (not counting the quotes) to the whole string, including "\n". By default, "." matches any character except a newline; in re.S mode, "." matches newlines too.
re.I makes matching case-insensitive.
In a regular expression, "." matches any character except "\n", which means matching happens within a single line; "lines" here are delimited by "\n", and each line ends with an invisible "\n". Without re.S, matching is attempted line by line: if a line has no match, matching restarts on the next line, and a match never crosses lines. With re.S, the regex treats the whole string as a single unit, folding "\n" in as an ordinary character, and matches across it.
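A tiny demonstration of the difference, with a made-up two-line string:

```python
import re

s = 'first line\nsecond line'

print(re.findall('first.*second', s))        # [] -- '.' will not cross the '\n'
print(re.findall('first.*second', s, re.S))  # ['first line\nsecond'] -- now it does
print(re.findall('FIRST', s, re.I))          # ['first'] -- re.I ignores case
```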
import re

s = 'agejaoigeaojdnghaw2379273589hahjhgoiaioeg87t98w825tgha9e89aye835yyaghe9857ahge878ahsohga9e9q30gja9eu73hga9w7ga8w73hgna9geuahge9aoi753uajghe9as' \
    '8837t5hga8u83758uaga98973gh8e'

res1 = re.findall(r'\d{2,3}[a-zA-Z]{1,}?\d{2,3}', s)
# [ ]    character set: matches any single character inside it
# {m,n}  quantifier: repeats the preceding token m to n times
# a ? after a quantifier makes it non-greedy
# print(res1)  ['589hahjhgoiaioeg87', '98w825', '89aye835', '857ahge878', '758uaga989']

res2 = re.search(r'(\d{2,3})[a-zA-Z]{1,}?(\d{2,3})', s)
# print(res2)  <re.Match object; span=(25, 43), match='589hahjhgoiaioeg87'>
print(res2.group())   # match and search only return the first match
print(res2.group(1))  # the first parenthesized subgroup; group(n) needs a matching ( ) in the pattern
print(res2.group(2))  # the second parenthesized subgroup
# res3 = re.finditer(r'\d{2,3}[a-zA-Z]{1,}?\d{2,3}', s)
# print(res3)  <callable_iterator object at 0x000001DE04A9E048> -- returns an iterator, saving memory
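For example (illustrative string), finditer lets you walk matches lazily instead of building a full list:

```python
import re

for m in re.finditer(r'\d+', 'ab12cd34ef56'):
    # each m is a Match object; no list of all matches is ever built
    print(m.group(), m.span())
```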
5 flask_httpauth module -- authentication?
from flask_httpauth import HTTPBasicAuth
from flask_httpauth import HTTPDigestAuth
6 beautifulsoup module
Beautiful Soup is a Python library for extracting data from HTML and XML files. It lets you navigate, search, and modify the document through your parser of choice, in the idioms you are used to, and can save you hours or even days of work.
https://www.cnblogs.com/linhaifeng/articles/7783586.html#_label2
# The usual beautifulsoup routine: use find_all to get a list of tags --
# this step is "searching the tree". Then pull what you want out of each
# tag (text, links, etc.) -- this step is "accessing a tag's attributes,
# name, and contents". This is usually all you need.
from bs4 import BeautifulSoup
import requests

URL = 'https://www.duodia.com/daxuexiaohua/'

def get_page(url):
    r = requests.get(url)
    if r.status_code == 200:
        return r.text

content = get_page(URL)
soup = BeautifulSoup(content, 'lxml')

# 1 search the tree: name is the tag name; class_ is the class name (class is a keyword)
a_s = soup(name='a', class_='thumbnail-container')
# print(type(a_s[0]))  note the type of a_s[0]: <class 'bs4.element.Tag'>

# 2 access the tag's attributes, name, and contents
for a in a_s:
    print(a.attrs['href'])
    print(a.text)
    print(a.name)
try:
    a = 1
    b = 's'
    c = a + b
except TypeError as e:  # "except ... as" is probably better: it tells you which error occurred
    print('TypeError %s' % e)
else:                   # runs only if no exception was raised
    print('else')
finally:                # runs whether or not an exception was raised
    print('finally')
import os
import time
from threading import Thread

def foo():
    print(os.getpid())
    print('foo')
    time.sleep(2)

def bar():
    print(os.getpid())
    print('bar')
    time.sleep(5)

if __name__ == '__main__':
    t = time.time()
    t1 = Thread(target=foo)
    t2 = Thread(target=bar)
    t1.start()
    t2.start()
    t1.join()  # join blocks the main thread: only after t1 and t2 finish does
    t2.join()  # control return to the main thread and reach the print below
    print('time cost {}'.format(time.time() - t))
join on processes works the same way.
from multiprocessing import Pool
import time
import os
import random

def foo(n):
    time.sleep(random.random())
    return {'name': 'foo', 'length': n}

def save(dic):
    f = open('a.txt', 'a', encoding='utf-8')
    f.write('name:{},length:{}\n'.format(dic['name'], dic['length']))
    f.close()

if __name__ == '__main__':
    n = os.cpu_count()
    pool = Pool(n)
    # print(pool)  <multiprocessing.pool.Pool object at 0x000001DDE9D3E0B8>
    task_list = []
    for i in range(20):
        task = pool.apply_async(foo, args=(i,), callback=save)
        # print(task)  <multiprocessing.pool.ApplyResult object at 0x0000026084D5AFD0>
        task_list.append(task)
    pool.close()
    pool.join()
    for task in task_list:
        print(task.get())

p = Pool()
task = p.apply_async(func=, args=, kwds=, callback=)  note what kind of object task is; tasks are added asynchronously. callback takes one and only one argument: the return value of func. Used well, it saves a lot of work.
p.close()
p.join()
task.get()
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import requests

def get(url):
    r = requests.get(url)
    return {'url': url, 'text': r.text}

def parse(future):
    dic = future.result()  # call result() on the Future object to get its value
    f = open('db.text', 'a')
    date = 'url:%s\n' % len(dic['text'])
    f.write(date)
    f.close()

if __name__ == '__main__':
    executor = ThreadPoolExecutor()
    url_l = ['http://cn.bing.com/',
             'http://www.cnblogs.com/wupeiqi/',
             'http://www.cnblogs.com/654321cc/',
             'https://www.cnblogs.com/',
             'http://society.people.com.cn/n1/2017/1012/c1008-29581930.html',
             'http://www.xilu.com/news/shaonianxinzangyou5gedong.html',
             ]
    for url in url_l:
        # Unlike Pool, whose callback receives the function's return value
        # (what ApplyResult.get() would give), the callback here, parse,
        # receives the Future object produced by submit.
        executor.submit(get, url).add_done_callback(parse)
    executor.shutdown()
    print('主')
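The same submit/add_done_callback pattern, with the network call swapped for a pure function so it runs anywhere and the contrast with Pool's callback is easier to see (all names here are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

results = []

def collect(future):
    # unlike Pool's callback (which gets the return value directly), this
    # callback receives the Future itself and must call result() on it
    results.append(future.result())

with ThreadPoolExecutor(max_workers=4) as executor:
    for i in range(5):
        executor.submit(square, i).add_done_callback(collect)

# leaving the with-block waits for all futures, so results is complete here
print(sorted(results))  # [0, 1, 4, 9, 16]
```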
7 scrapy

In the scrapy shell, response.css("title").getall() returns, for example:
['<title>Quotes to Scrape</title>']
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
//node[1] selects all the nodes occurring first under their respective parents.
(//node)[1] selects all the nodes in the document, and then gets only the first of them.

Example:

>>> from scrapy import Selector
>>> sel = Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()

This gets all first <li> elements under whatever it is its parent:

>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']

And this gets the first <li> element in the whole document:

>>> xp("(//li)[1]")
['<li>1</li>']

This gets all first <li> elements under an <ul> parent:

>>> xp("//ul/li[1]")
['<li>1</li>', '<li>4</li>']

And this gets the first <li> element under an <ul> parent in the whole document:

>>> xp("(//ul/li)[1]")
['<li>1</li>']
next_page = response.urljoin(next_page)
Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages. What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request. You can also pass a selector to response.follow instead of a string; this selector should extract necessary attributes:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
Here is another spider that illustrates callbacks and following links, this time for scraping author information:

import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

This spider will start from the main page; it will follow all the links to the author pages, calling the parse_author callback for each of them, and also the pagination links with the parse callback as we saw before. Here we're passing callbacks to response.follow as positional arguments to make the code shorter; this also works for scrapy.Request. The parse_author callback defines a helper function to extract and clean up the data from a CSS query and yields the Python dict with the author data.
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:

1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests. The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each URL in start_urls, with the parse method as the callback.

2. In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same one), will be downloaded by Scrapy, and their responses handled by the specified callback.

3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

4. Finally, the items returned from the spider will typically be persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
start_requests()
If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in using a POST request, you could do:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        for h3 in response.xpath('//h3').getall():
            yield {"title": h3}
        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)

Instead of start_urls you can use start_requests() directly; to give data more structure you can use Items:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        for h3 in response.xpath('//h3').getall():
            yield MyItem(title=h3)
        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)
scrapy genspider -t crawl xx xx.com
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').get()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').get()
        item['link_text'] = response.meta['link_text']
        return item
Selectors
Overview
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions. XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.
A node converted to a string, however, puts together the text of itself plus of all its descendants:

>>> sel.xpath("//a[1]").getall()  # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall()  # convert it to string
['Click here to go to the Next Page']

So, using the .//text() node-set won't select anything in this case:

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]

But using the . to mean the node, works:

>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
XPath allows you to reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world where you replace some arguments in your queries with placeholders like ?, which are then substituted with values passed with the query.
Here’s another example, to find the “id” attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'
Here’s an example to match an element based on its normalized string-value:

>>> str_to_match = "Name: My image 3"
>>> selector.xpath('//a[normalize-space(.)=$match]',
...                match=str_to_match).get()
'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>'
Here’s another example using a position range passed as two integers:

>>> start, stop = 2, 4
>>> selector.xpath('//a[position()>=$_from and position()<=$_to]',
...                _from=start, _to=stop).getall()
['<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>']
Named variables can be useful when strings need to be escaped for single or double quote characters. The example below would be a bit tricky to get right (or legible) without a variable reference:

>>> html = u'''<html>
... <body>
... <p>He said: "I don't know why, but I like mixing single and double quotes!"</p>
... </body>
... </html>'''
>>> selector = Selector(text=html)
>>>
>>> selector.xpath('//p[contains(., $mystring)]',
...                mystring='''He said: "I don't know''').get()
'<p>He said: "I don\'t know why, but I like mixing single and double quotes!"</p>'
Built-in Selectors reference
Selector objects
attrib: Return the attributes dictionary for the underlying element.
register_namespace ?
remove_namespaces ?
SelectorList objects
attrib: Return the attributes dictionary for the first element. If the list is empty, return an empty dict.
... ...
Selecting element attributes
https://docs.scrapy.org/en/latest/topics/selectors.html#selecting-attributes
Three ways
xpath('//title/@href').getall()
The main goal in scraping is to extract structured data from unstructured sources, typically, web pages. Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.
Field objects are used to specify metadata for each field. For example, the serializer function for the last_updated field illustrated in the example above.

You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects. For this same reason, there is no reference list of all available metadata keys. Each key defined in Field objects could be used by a different component, and only those components know about it. You can also define and use any other Field key in your project too, for your own needs.

The main goal of Field objects is to provide a way to define all field metadata in one place. Typically, those components whose behaviour depends on each field use certain field keys to configure that behaviour. You must refer to their documentation to see which metadata keys are used by each component.
class Myitem(scrapy.Item):
    name = scrapy.Field()
    age = scrapy.Field()
    salary = scrapy.Field()

item = Myitem({'name': 'z', 'age': 28, 'salary': 10000})
print(item.fields)
print(Myitem.fields)
===>
{'age': {}, 'name': {}, 'salary': {}}
{'age': {}, 'name': {}, 'salary': {}}

class Myitem(scrapy.Item):
    name = scrapy.Field()
    age = scrapy.Field()
    salary = scrapy.Field(dd='geagd')

item = Myitem({'name': 'z', 'age': 28, 'salary': 10000})
print(item.fields)
print(Myitem.fields)
===>
{'age': {}, 'name': {}, 'salary': {'dd': 'geagd'}}
{'age': {}, 'name': {}, 'salary': {'dd': 'geagd'}}
Items replicate the standard dict API, including its constructor. The only additional attribute provided by Items is: fields
The Field class is just an alias to the built-in dict class and doesn’t provide any extra functionality or attributes. In other words, Field objects are plain-old Python dicts. A separate class is used to support the item declaration syntax based on class attributes.
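A toy re-implementation (not Scrapy's actual code) of the idea: a metaclass can collect class-attribute Field declarations into a fields dict, which is what makes the declaration syntax above work:

```python
class Field(dict):
    """Plain dict; exists only so declarations can be recognized by type."""

class ItemMeta(type):
    def __new__(mcs, name, bases, attrs):
        # pull Field declarations out of the class body into a fields dict
        fields = {k: v for k, v in attrs.items() if isinstance(v, Field)}
        attrs = {k: v for k, v in attrs.items() if not isinstance(v, Field)}
        cls = super().__new__(mcs, name, bases, attrs)
        cls.fields = fields
        return cls

class Item(metaclass=ItemMeta):
    pass

class Product(Item):
    name = Field()
    price = Field(serializer=str)

print(Product.fields)  # {'name': {}, 'price': {'serializer': <class 'str'>}}
```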
Extending Items
You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item. For example:

class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()

You can also extend field metadata by using the previous field metadata and appending more values, or changing existing values, like this:

class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)

That adds (or replaces) the serializer metadata key for the name field, keeping all the previously existing metadata values.
Item Loader
Overview -- quickly populating items. This seems to be a newer Scrapy feature: no more manually extracting data and assigning each Item field yourself. Saves a ton of work.
In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container. Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding different field parsing rules, either by spider, or by source format (HTML, XML, etc) without becoming a nightmare to maintain.
A simple example
from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    # instantiate first; the ItemLoader.default_item_class attribute
    # controls which class is instantiated by default
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')   # three ways to collect values:
    l.add_xpath('name', '//div[@class="product_title"]')  # add_xpath, add_css, add_value
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()  # returns the item populated with the data previously extracted
To use an Item Loader, you must first instantiate it. You can either instantiate it with a dict-like object (e.g. Item or dict) or without one, in which case an Item is automatically instantiated in the Item Loader constructor using the Item class specified in the ItemLoader.default_item_class attribute.

Then, you start collecting values into the Item Loader, typically using Selectors. You can add more than one value to the same item field; the Item Loader will know how to “join” those values later using a proper processing function.

Here is a typical Item Loader usage in a Spider, using the Product item declared in the Items chapter:

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()

By quickly looking at that code, we can see the name field is being extracted from two different XPath locations in the page:

//div[@class="product_name"]
//div[@class="product_title"]

In other words, data is being collected by extracting it from two XPath locations, using the add_xpath() method. This is the data that will be assigned to the name field later.

Afterwards, similar calls are used for the price and stock fields (the latter using a CSS selector with the add_css() method), and finally the last_updated field is populated directly with a literal value (today) using a different method: add_value().

Finally, when all data is collected, the ItemLoader.load_item() method is called, which actually returns the item populated with the data previously extracted and collected with the add_xpath(), add_css(), and add_value() calls.
MapCompose

This processor provides a convenient way to compose functions that only work with single values (instead of iterables). For this reason the MapCompose processor is typically used as an input processor, since data is often extracted using the extract() method of selectors, which returns a list of unicode strings.
>>> def filter_world(x):
...     return None if x == 'world' else x
...
>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_world, str.upper)
>>> proc(['hello', 'world', 'this', 'is', 'scrapy'])
['HELLO', 'THIS', 'IS', 'SCRAPY']
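The semantics are easy to restate in plain Python (a sketch of the behaviour, not Scrapy's implementation; it skips MapCompose's flattening of list results): each value is passed through each function in turn, and None results are dropped.

```python
def map_compose(*functions):
    def processor(values):
        for fn in functions:
            # apply fn to each value, dropping values for which fn returns None
            values = [r for r in (fn(v) for v in values) if r is not None]
        return values
    return processor

proc = map_compose(lambda x: None if x == 'world' else x, str.upper)
print(proc(['hello', 'world', 'this', 'is', 'scrapy']))
# ['HELLO', 'THIS', 'IS', 'SCRAPY']
```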
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

    name_in = MapCompose(unicode.title)  # Python 2 docs example; on Python 3 use str.title / str.strip
    name_out = Join()

    price_in = MapCompose(unicode.strip)
import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_price(value):
    if value.isdigit():
        return value

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )

>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item=Product())
>>> il.add_value('name', [u'Welcome to my', u'<strong>website</strong>'])
>>> il.add_value('price', [u'€', u'<span>1000</span>'])
>>> il.load_item()
{'name': u'Welcome to my website', 'price': u'1000'}
loader = ItemLoader(item=Item())
# load stuff not in the footer
footer_loader = loader.nested_xpath('//footer')
footer_loader.add_xpath('social', 'a[@class = "social"]/@href')
footer_loader.add_xpath('email', 'a[@class = "email"]/@href')
# no need to call footer_loader.load_item()
loader.load_item()
scrapy shell
Setting the shell tool to ipython (or bpython)
Configure it in scrapy.cfg:
[settings]
shell = bpython
from scrapy.shell import inspect_response
Here’s an example of how you would call it from your spider:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.

When you run the spider, you will get something similar to this:

2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...

>>> response.url
'http://example.org'

Then, you can check if the extraction code is working:

>>> response.xpath('//h1[@class="fn"]')
[]

Nope, it doesn’t. So you can open the response in your web browser and see if it’s the response you were expecting:

>>> view(response)
True

Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling:

>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...
Typical uses of item pipelines are:
cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database
An item pipeline component is a Python class that implements a simple method. It receives an item and performs an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.
This method is called for every item pipeline component. process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.
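The contract can be sketched without Scrapy at all; DropItem and the driver loop below are stand-ins for what Scrapy does internally, and all names are made up:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class RequireNamePipeline:
    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem('missing name in %r' % item)
        return item  # whatever is returned feeds the next component

def run_pipelines(item, pipelines, spider=None):
    # mimic Scrapy: each component's output is the next component's input
    try:
        for p in pipelines:
            item = p.process_item(item, spider)
        return item
    except DropItem:
        return None  # dropped items go no further

print(run_pipelines({'name': 'z'}, [RequireNamePipeline()]))  # {'name': 'z'}
print(run_pipelines({}, [RequireNamePipeline()]))             # None
```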
If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. Crawler object provides access to all Scrapy core components like settings and signals; it is a way for pipeline to access them and hook its functionality into Scrapy.
from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.
(boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
A dictionary that contains arbitrary metadata for this request. Its contents will be passed to the Request’s callback as keyword arguments. It is empty for new Requests, which means by default callbacks only get a Response object as argument.
Passing additional data to callback functions

The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument. Example:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. The following example shows how to achieve this by using the Request.cb_kwargs attribute:

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )

Caution: Request.cb_kwargs was introduced in version 1.7. Prior to that, using Request.meta was recommended for passing information around callbacks. After 1.7, Request.cb_kwargs became the preferred way for handling user information, leaving Request.meta for communication with components like middlewares and extensions.
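Reduced to plain Python (a hypothetical structure, no Scrapy), the cb_kwargs idea is just extra keyword arguments stored with the request and applied when the callback fires:

```python
def parse_page2(response, main_url, foo):
    return {'main_url': main_url, 'other_url': response, 'foo': foo}

# a "request" records its callback plus the extra cb_kwargs
request = {'url': 'http://www.example.com/index.html',
           'callback': parse_page2,
           'cb_kwargs': {'main_url': 'http://www.example.com/'}}
request['cb_kwargs']['foo'] = 'bar'  # add more arguments for the callback

# when the "response" arrives, the engine applies the stored kwargs
result = request['callback'](request['url'], **request['cb_kwargs'])
print(result['foo'])  # bar
```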
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
The FormRequest class extends the base Request with functionality for dealing with HTML forms.
class scrapy.http.FormRequest(url[, formdata, ...])

The FormRequest class adds a new argument to the constructor. The remaining arguments are the same as for the Request class and are not documented here.

Parameters: formdata (dict or iterable of tuples) – is a dictionary (or iterable of (key, value) tuples) containing HTML Form data which will be url-encoded and assigned to the body of the request.
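To see what that url-encoding amounts to, here is a minimal pure-Python sketch using the stdlib's urlencode (the form values are made up; this mirrors what FormRequest does to formdata, it is not Scrapy's own code):

```python
from urllib.parse import urlencode

# Hypothetical form data; FormRequest url-encodes it into the request body
formdata = {'username': 'john', 'password': 'secret'}
body = urlencode(formdata)
print(body)  # username=john&password=secret
```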
The following example may be useful.
import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all Downloader Middlewares. In particular, this means that:

HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).

Response.request.url doesn't always equal Response.url.

This attribute is only available in the spider code, and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of the response_downloaded signal.
A shortcut to the Request.meta attribute of the Response.request object (ie. self.request.meta). Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.
link extractors
https://docs.scrapy.org/en/latest/topics/link-extractors.html
from scrapy.linkextractors import LinkExtractor
The only public method that every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.
Parameters
allow
deny
restrict_xpath
process_value
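A rough pure-Python illustration of how the allow and deny parameters filter candidate URLs (this is only a sketch of the idea, not the real LinkExtractor implementation; the URLs are made up):

```python
import re

def filter_links(urls, allow=None, deny=None):
    # keep a URL if it matches the allow pattern (when given)
    # and does not match the deny pattern (when given)
    result = []
    for url in urls:
        if allow and not re.search(allow, url):
            continue
        if deny and re.search(deny, url):
            continue
        result.append(url)
    return result

urls = [
    'http://example.com/category/page1.html',
    'http://example.com/login',
    'http://example.com/category/page2.html',
]
print(filter_links(urls, allow=r'/category/', deny=r'login'))
```

restrict_xpaths additionally limits where in the document links are collected from, before these regex filters are applied.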
When you use Scrapy, you have to tell it which settings you’re using. You can do this by using an environment variable, SCRAPY_SETTINGS_MODULE.
The value of SCRAPY_SETTINGS_MODULE should be in Python path syntax, e.g. myproject.settings. Note that the settings module should be on the Python import search path.
1 Command line options (most precedence)
2 Settings per-spider
3 Project settings module
4 Default settings per-command
5 Default global settings (least precedence)
Arguments provided by the command line are the ones that take most precedence, overriding any other options. You can explicitly override one (or more) settings using the -s (or --set) command line option. Example:

scrapy crawl myspider -s LOG_FILE=scrapy.log
Spiders (see the Spiders chapter for reference) can define their own settings that will take precedence and override the project ones. They can do so by setting their custom_settings attribute:

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }
In a spider, the settings are available through self.settings:

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())
Note: The settings attribute is set in the base Spider class after the spider is initialized. If you want to use the settings before the initialization (e.g., in your spider's __init__() method), you'll need to override the from_crawler() method.

Settings can be accessed through the scrapy.crawler.Crawler.settings attribute of the Crawler that is passed to the from_crawler method in extensions, middlewares and item pipelines:

class MyExtension(object):
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("log is enabled!")

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))

The settings object can be used like a dict (e.g., settings['LOG_ENABLED']), but it's usually preferred to extract the setting in the format you need it to avoid type errors, using one of the methods provided by the Settings API.
Default: 100 Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the Item Pipeline).
Default: 16
The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
Default: 8
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.
Default: 0 The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain. This setting also affects DOWNLOAD_DELAY and AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.
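Pulling the concurrency knobs above together, a settings-module sketch (the values simply echo the defaults described above; tune them for your crawl):

```python
# Illustrative concurrency settings for a Scrapy project's settings.py
CONCURRENT_REQUESTS = 16            # global cap across all downloads
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per domain
CONCURRENT_REQUESTS_PER_IP = 0      # leave 0 so the per-domain cap applies
```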
Default:

{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

The default headers used for Scrapy HTTP Requests. They're populated in the DefaultHeadersMiddleware.
Default: {} A dict containing the downloader middlewares enabled in your project, and their orders.
Default: 0

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example:

DOWNLOAD_DELAY = 0.25  # 250 ms of delay

This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By default, Scrapy doesn't wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.

When CONCURRENT_REQUESTS_PER_IP is non-zero, delays are enforced per IP address instead of per domain.

You can also change this setting per spider by setting the download_delay spider attribute.
Default: {} A dict containing the request downloader handlers enabled in your project.
Default: 180 The amount of time (in secs) that the downloader will wait before timing out. Note This timeout can be set per spider using download_timeout spider attribute and per-request using download_timeout Request.meta key.
Default: 'scrapy.dupefilters.RFPDupeFilter' The class used to detect and filter duplicate requests. The default (RFPDupeFilter) filters based on request fingerprint using the scrapy.utils.request.request_fingerprint function. In order to change the way duplicates are checked you could subclass RFPDupeFilter and override its request_fingerprint method. This method should accept scrapy Request object and return its fingerprint (a string). You can disable filtering of duplicate requests by setting DUPEFILTER_CLASS to 'scrapy.dupefilters.BaseDupeFilter'. Be very careful about this however, because you can get into crawling loops. It’s usually a better idea to set the dont_filter parameter to True on the specific Request that should not be filtered.
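The fingerprinting idea behind RFPDupeFilter can be sketched with the hashlib module from the start of these notes. This is a simplified stand-in, not the actual scrapy.utils.request.request_fingerprint algorithm: hash the request's canonical parts into a stable string, and two identical requests collapse to the same fingerprint.

```python
import hashlib

def request_fingerprint(method, url, body=b''):
    # Simplified stand-in for Scrapy's request fingerprinting:
    # hash the method, URL, and body into one stable hex digest
    m = hashlib.sha1()
    m.update(method.encode('utf-8'))
    m.update(url.encode('utf-8'))
    m.update(body)
    return m.hexdigest()

fp1 = request_fingerprint('GET', 'http://example.com/page?id=1')
fp2 = request_fingerprint('GET', 'http://example.com/page?id=1')
fp3 = request_fingerprint('GET', 'http://example.com/page?id=2')
print(fp1 == fp2, fp1 == fp3)  # identical requests share a fingerprint
```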
Default: {} A dict containing the extensions enabled in your project, and their orders.
Default: {}

A dict containing the item pipelines to use, and their orders. Order values are arbitrary, but it is customary to define them in the 0-1000 range. Lower orders process before higher orders. Example:

ITEM_PIPELINES = {
    'mybot.pipelines.validate.ValidateMyItem': 300,
    'mybot.pipelines.validate.StoreMyItem': 800,
}
Default: False
Whether to enable memory debugging.
Default: [] When memory debugging is enabled a memory report will be sent to the specified addresses if this setting is not empty, otherwise the report will be written to the log. Example: MEMDEBUG_NOTIFY = ['user@example.com']
Default: True If enabled, Scrapy will wait a random amount of time (between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY) while fetching requests from the same website. This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests. The randomization policy is the same used by wget --random-wait option. If DOWNLOAD_DELAY is zero (default) this option has no effect.
Default: {} A dict containing the spider middlewares enabled in your project, and their orders.
Default: "Scrapy/VERSION (+https://scrapy.org)" The default User-Agent to use when crawling, unless overridden.
exception scrapy.exceptions.DropItem
The exception that must be raised by item pipeline stages to stop processing an Item. For more information see Item Pipeline.
CloseSpider
exception scrapy.exceptions.CloseSpider(reason='cancelled')

This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:

Parameters: reason (str) – the reason for closing

For example:

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')
...
built-in service
Logging
import logging
logging.warning("This is a warning")
import logging
logging.log(logging.WARNING, "This is a warning")
logger
Scrapy provides a logger within each Spider instance, which can be accessed and used like this:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapinghub.com']

    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)
import logging
import scrapy

logger = logging.getLogger('mycustomlogger')

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapinghub.com']

    def parse(self, response):
        logger.info('Parse function called on %s', response.url)
Loggers on their own don’t manage how messages sent through them are displayed. For this task, different “handlers” can be attached to any logger instance and they will redirect those messages to appropriate destinations, such as the standard output, files, emails, etc.
The statement above applies equally to the stdlib logging module.
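For instance, a handler can be attached to the 'mycustomlogger' logger from the earlier example to control where its messages go. Here the handler writes to an in-memory stream so the result is easy to inspect; a StreamHandler(sys.stdout) or FileHandler works the same way:

```python
import io
import logging

logger = logging.getLogger('mycustomlogger')
logger.setLevel(logging.INFO)

# attach a handler that redirects this logger's messages to a destination
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s:%(name)s:%(message)s'))
logger.addHandler(handler)

logger.info('Parse function called')
print(stream.getvalue().strip())  # INFO:mycustomlogger:Parse function called
```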
Because Scrapy uses the stdlib logging module, you can customize logging using all features of stdlib logging.

For example, let's say you're scraping a website which returns many HTTP 404 and 500 responses, and you want to hide all messages like this:

2016-12-16 22:00:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://quotes.toscrape.com/page/1-34/>: HTTP status code is not handled or not allowed

The first thing to note is the logger name - it is in brackets: [scrapy.spidermiddlewares.httperror]. If you get just [scrapy] then LOG_SHORT_NAMES is likely set to True; set it to False and re-run the crawl.

Next, we can see that the message has INFO level. To hide it we should set the logging level for scrapy.spidermiddlewares.httperror higher than INFO; the next level after INFO is WARNING. It could be done e.g. in the spider's __init__ method:

import logging
import scrapy

class MySpider(scrapy.Spider):
    # ...
    def __init__(self, *args, **kwargs):
        logger = logging.getLogger('scrapy.spidermiddlewares.httperror')
        logger.setLevel(logging.WARNING)
        super().__init__(*args, **kwargs)

If you run this spider again then INFO messages from the scrapy.spidermiddlewares.httperror logger will be gone.
In other words, this raises the log level of the httperror logger to WARNING, so its INFO-level messages are no longer displayed.
class ExtensionThatAccessStats(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)
from scrapy.mail import MailSender
mailer = MailSender()
mailer = MailSender.from_settings(settings)
Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, like any regular web browser does.
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished
Scrapy default settings are optimized for focused crawls, not broad crawls. However, due to its asynchronous architecture, Scrapy is very well suited for performing fast broad crawls. This page summarizes some things you need to keep in mind when using Scrapy for doing broad crawls, along with concrete suggestions of Scrapy settings to tune in order to achieve an efficient broad crawl.
Scrapy's default scheduler priority queue is 'scrapy.pqueues.ScrapyPriorityQueue'. It works best during single-domain crawls, but does not work well when crawling many different domains in parallel. To apply the recommended priority queue use:

SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
Disable JavaScript while inspecting the DOM looking for XPaths to be used in Scrapy (in the Developer Tools settings click Disable JavaScript).
Never use full XPath paths; use relative and clever ones based on attributes (such as id, class, width, etc) or any identifying features like contains(@href, 'image').
Never include <tbody> elements in your XPath expressions unless you really know what you're doing.
Scrapy supports this functionality out of the box by providing the following facilities:
a scheduler that persists scheduled requests on disk
a duplicates filter that persists visited requests on disk
an extension that keeps some spider state (key/value pairs) persistent between batches
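These facilities are all switched on by passing a JOBDIR setting; the Scrapy docs' standard invocation looks like this (the spider name and directory are just placeholders):

```
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
```

Stopping the crawl (Ctrl-C once) and re-running the same command later resumes it from where it left off, as long as the same JOBDIR is given.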
The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied.
Here we will implement a simple extension to illustrate the concepts described in the previous section. This extension will log a message every time:

a spider is opened
a spider is closed
a specific number of items are scraped

The extension will be enabled through the MYEXT_ENABLED setting and the number of items will be specified through the MYEXT_ITEMCOUNT setting. Here is the code of such extension:

import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)

class SpiderOpenCloseLogging(object):

    def __init__(self, item_count):
        self.item_count = item_count
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # get the number of items from settings
        item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)

        # instantiate the extension object
        ext = cls(item_count)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("closed spider %s", spider.name)

    def item_scraped(self, item, spider):
        self.items_scraped += 1
        if self.items_scraped % self.item_count == 0:
            logger.info("scraped %d items", self.items_scraped)
Core API
Scrapy uses signals extensively to notify when certain events occur. You can catch some of those signals in your Scrapy project (using an extension, for example) to perform additional tasks or extend Scrapy to add functionality not provided out of the box.
from scrapy import signals
from scrapy import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass
import pymysql
from bs4 import BeautifulSoup
from urllib.request import urlopen

conn = pymysql.connect(host='localhost', user='root', passwd='123',
                       db='zuo', charset='utf8')
cur = conn.cursor()
cur.execute('SELECT title FROM pages WHERE id = 2')
print(cur.fetchone())

def store(title, content):
    cur.execute("INSERT INTO pages (title, content) VALUES (%s, %s)",
                (title, content))
    cur.connection.commit()

def getLinks(url):
    html = urlopen(url)  # the URL must include a scheme, e.g. http://
    bs = BeautifulSoup(html, 'lxml')
    title = bs.find('title').get_text()
    content = bs.find('body').get_text()
    store(title, content)
    # return candidate links to crawl next
    return bs.find_all('a', href=True)

try:
    links = getLinks('http://www.baidu.com')
    while len(links) > 0:
        newLink = links[0].attrs['href']
        print(newLink)
        links = getLinks(newLink)
finally:
    cur.close()
    conn.close()
Connection objects (conn) and cursor objects (cur). The connection/cursor pattern is very common in database programming. Besides connecting to the database, the connection object also transmits database information, handles rollbacks (when a query or batch of queries is interrupted, the database needs to return to its initial state, usually implemented with transaction-control mechanisms), creates cursor objects, and so on. One connection can have many cursors, and each cursor tracks one piece of state information, such as which database is currently in use. If you have multiple databases and need to write to all of them, you need multiple cursors. A cursor also holds the result of the last query it executed; by calling cursor methods such as cur.fetchone(), you can retrieve that result.
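The same connection/cursor pattern, including the transaction rollback described above, can be demonstrated with the stdlib's sqlite3 (used here only so the example is self-contained; pymysql works the same way):

```python
import sqlite3

# one connection, one cursor; an in-memory database for illustration
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE pages (title TEXT)')
conn.commit()

cur.execute("INSERT INTO pages (title) VALUES ('kept')")
conn.commit()

# an interrupted batch can be rolled back to the last commit
cur.execute("INSERT INTO pages (title) VALUES ('discarded')")
conn.rollback()

cur.execute('SELECT title FROM pages')
rows = cur.fetchall()
print(rows)  # [('kept',)]
conn.close()
```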
jQuery is a fast, concise JavaScript framework. Its guiding principle is "Write Less, Do More": write less code and accomplish more. It wraps commonly used JavaScript functionality, provides a simple JavaScript design pattern, and streamlines HTML document manipulation, event handling, animation, and Ajax interaction.
AJAX is a technique for creating fast, dynamic web pages.
By exchanging small amounts of data with the server in the background, AJAX lets a web page update asynchronously. This means parts of a page can be updated without reloading the whole page.
Traditional web pages (without AJAX) must reload the entire page whenever the content needs to change.
The div is used to display information from the server. When the button is clicked, it calls a function named loadXMLDoc():

<html>
<body>
<div id="myDiv"><h3>Let AJAX change this text</h3></div>
<button type="button" onclick="loadXMLDoc()">Change Content</button>
</body>
</html>

Next, add a <script> tag to the page's head section. This tag contains the loadXMLDoc() function:

<head>
<script type="text/javascript">
function loadXMLDoc()
{
.... AJAX script goes here ...
}
</script>
</head>
var xmlhttp;
if (window.XMLHttpRequest)
{
  // code for IE7+, Firefox, Chrome, Opera, Safari
  xmlhttp = new XMLHttpRequest();
}
else
{
  // code for IE6, IE5
  xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
}
xmlhttp.open("GET", "test1.txt", true);
xmlhttp.send();
document.getElementById("myDiv").innerHTML=xmlhttp.responseText;
xmlDoc = xmlhttp.responseXML;
txt = "";
x = xmlDoc.getElementsByTagName("ARTIST");
for (i = 0; i < x.length; i++)
{
  txt = txt + x[i].childNodes[0].nodeValue + "<br />";
}
document.getElementById("myDiv").innerHTML = txt;
xmlhttp.onreadystatechange = function()
{
  if (xmlhttp.readyState == 4 && xmlhttp.status == 200)
  {
    document.getElementById("myDiv").innerHTML = xmlhttp.responseText;
  }
}
As JavaScript frameworks became more and more widespread, many HTML-generation tasks moved from the server to the browser. The server may send the user's browser a hard-coded HTML template, but separate AJAX requests are still needed to load the content and place it into the right spots in that template. All of this happens in the browser/client.
Initially, this mechanism was a problem for web crawlers. In the past, when a crawler requested an HTML page, it got the page exactly as served, with all the content already in it; now the crawler receives an HTML template without any content.
selenium can solve this problem.
However, because content management has largely moved to the browser side, even the simplest website can balloon to several megabytes of content and a dozen or more HTTP requests.
Moreover, when using selenium, "extra" content the user doesn't need also gets loaded: sidebar ads, images, CSS, third-party fonts. That content may look nice, but when you are writing a crawler that needs to move fast, grab specific data, and put as little load as possible on the web server, it can mean loading hundreds of times more data than you actually need.
Still, for JavaScript, Ajax, and the modern Web there is a silver lining: because servers no longer render the data into HTML, they often act as thin wrappers around the database itself. The wrapper simply pulls data from the database and returns it to the page through an API.
Of course, these APIs were never intended to be used by anyone or anything other than the web page itself, so developers usually leave them undocumented.
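Once you find such an endpoint in the browser's Network tab, its response is typically plain JSON that can be parsed directly, skipping the HTML template entirely. The payload below is made up for illustration; in practice you would fetch it with requests.get(api_url).json():

```python
import json

# a made-up example of the kind of JSON such an undocumented API returns
payload = '{"items": [{"title": "post 1"}, {"title": "post 2"}], "total": 2}'
data = json.loads(payload)
titles = [item['title'] for item in data['items']]
print(titles)  # ['post 1', 'post 2']
```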
<form action="form_action.asp" method="get">
  <p>First name: <input type="text" name="fname" /></p>
  <p>Last name: <input type="text" name="lname" /></p>
  <input type="submit" value="Submit" />
</form>
You can simulate a form login with the requests module's post method.
The form's action attribute is the URL for the post method.
r = requests.post('url',data=params)
import requests
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth('zuo', 'password')
r = requests.post('xx', auth=auth)
print(r.text)
Although this looks like an ordinary post request, an HTTPBasicAuth object is passed into the request as the auth parameter. On success, the result shown is the page for a verified username and password; on failure, it is an access-denied page.
Other form topics
Web forms are a favorite entry point for malicious bots. You certainly don't want bots creating junk accounts, hogging expensive server resources, or posting spam comments on your blog. That is why modern sites often build security measures into their HTML so that forms cannot be blasted through quickly by machines.
CAPTCHAs
Honeypots (honey pot)
Hidden fields (hidden field)
else
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('url')

# note: find_elements (plural) is needed to get a list to iterate over
links = driver.find_elements_by_tag_name('a')
for link in links:
    if not link.is_displayed():
        print('The link {} is a trap'.format(link.get_attribute('href')))

fields = driver.find_elements_by_tag_name('input')
for field in fields:
    if not field.is_displayed():
        print('Do not change value of {}'.format(field.get_attribute('name')))
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
...
# page_source is an attribute of the driver instance, not the webdriver module
pageSource = driver.page_source
bs = BeautifulSoup(pageSource, 'html.parser')
print(bs.find(id='content').get_text())