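While scraping wallpapers with Scrapy today (url: http://pic.netbian.com/4kmein...) I ran into a few problems. I'm writing them down here for future reference.
Because the site is rendered dynamically, I chose to hook Scrapy up to Selenium. (Scrapy fetches pages much like the requests library does: both issue HTTP requests directly, so Scrapy on its own cannot scrape pages rendered dynamically by JavaScript.)
The downloader middleware therefore has to take the Request and return a Response. The problem turned out to be the Response. In the official documentation I found class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None]), so I imported Response with from scrapy.http import Response.
Running scrapy crawl girl produced the following error:

    results=response.xpath('//*[@id="main"]/div[3]/ul/li/a/img')
    raise NotSupported("Response content isn't text")
    scrapy.exceptions.NotSupported: Response content isn't text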
Checking the relevant code:

    # middlewares.py
    from scrapy import signals
    from scrapy.http import Response
    from scrapy.exceptions import IgnoreRequest
    import selenium
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC


    class Pic4KgirlDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.

        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.

            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            try:
                self.browser=selenium.webdriver.Chrome()
                self.wait=WebDriverWait(self.browser,10)
                self.browser.get(request.url)
                self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#main > div.page > a:nth-child(10)')))
                return Response(url=request.url,status=200,request=request,body=self.browser.page_source.encode('utf-8'))
            #except:
                #raise IgnoreRequest()
            finally:
                self.browser.close()
I suspected the problem was in this line:

    return Response(url=request.url,status=200,request=request,body=self.browser.page_source.encode('utf-8'))
Looking at the definition of the Response class, I found:

    @property
    def text(self):
        """For subclasses of TextResponse, this will return the body
        as text (unicode object in Python 2 and str in Python 3)
        """
        raise AttributeError("Response content isn't text")

    def css(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

    def xpath(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")
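This shows that the base Response class cannot be used directly for selector queries: its text, css() and xpath() members only raise, and they work only in subclasses that override them.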
Response subclasses:
**TextResponse objects**
class scrapy.http.TextResponse(url[, encoding[, ...]])
**HtmlResponse objects**
class scrapy.http.HtmlResponse(url[, ...])
**XmlResponse objects**
class scrapy.http.XmlResponse(url[, ...])
Take the definition of TextResponse as an example. Import it:

    from scrapy.http import TextResponse

and its definition turns out to be:

    class TextResponse(Response):

        _DEFAULT_ENCODING = 'ascii'

        def __init__(self, *args, **kwargs):
            self._encoding = kwargs.pop('encoding', None)
            self._cached_benc = None
            self._cached_ubody = None
            self._cached_selector = None
            super(TextResponse, self).__init__(*args, **kwargs)
In TextResponse, the xpath method has been overridden:

    @property
    def selector(self):
        from scrapy.selector import Selector
        if self._cached_selector is None:
            self._cached_selector = Selector(self)
        return self._cached_selector

    def xpath(self, query, **kwargs):
        return self.selector.xpath(query, **kwargs)

    def css(self, query):
        return self.selector.css(query)
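The pattern in this code (a base class whose xpath() raises, and a subclass that builds a parser lazily, caches it, and delegates to it) can be illustrated without Scrapy. The following is only a standard-library stand-in: BaseDoc, TextDoc and the local NotSupported are hypothetical names, and xml.etree.ElementTree plays the role of Scrapy's Selector while supporting only a subset of XPath.

```python
import xml.etree.ElementTree as ET


class NotSupported(Exception):
    """Stand-in for scrapy.exceptions.NotSupported."""


class BaseDoc:
    """Hypothetical class playing the role of Response."""

    def __init__(self, body=b""):
        self.body = body

    def xpath(self, query):
        raise NotSupported("Response content isn't text")


class TextDoc(BaseDoc):
    """Hypothetical class playing the role of TextResponse."""

    def __init__(self, body=b"", encoding="utf-8"):
        super().__init__(body)
        self._encoding = encoding
        self._cached_root = None

    @property
    def root(self):
        # Parse on first access only, then reuse the cached tree.
        if self._cached_root is None:
            self._cached_root = ET.fromstring(self.body.decode(self._encoding))
        return self._cached_root

    def xpath(self, query):
        # Delegate the query to the cached tree.
        return self.root.findall(query)


doc = TextDoc(b'<ul><li><a><img src="a.jpg"/></a></li></ul>')
sources = [img.get("src") for img in doc.xpath("./li/a/img")]
print(sources)
```

Caching the parsed tree means repeated xpath() calls on the same document do not parse the body again.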
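So if you want a response that can be queried with xpath() or css(), you cannot use the base Response class; you have to use one of its subclasses, which override these methods. In the middleware above, that means returning an HtmlResponse (or TextResponse) with an encoding instead of Response.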
Scrapy Crawler Beginner Tutorial 11: Request and Response
Scrapy documentation: https://doc.scrapy.org/en/lat...
Chinese translation of the documentation: https://blog.csdn.net/Inke88/...