前面曾经介绍过requests实现自动登陆的方法。这里介绍下使用scrapy如何实现自动登陆。仍是以csdn网站为例。
Scrapy使用FormRequest来登陆并递交数据给服务器。只是带有额外的formdata参数用来传送登陆的表单信息(用户名和密码),为了使用这个类,须要使用如下语句导入:from scrapy.http import FormRequest
那么关于登陆过程当中使用cookie值,scrapy会自动为咱们处理cookie,只要咱们登陆成功了,它就会像一个浏览器同样自动传送cookie
首先爬虫中定义start_requests
def start_requests(self):
return [Request("http://passport.csdn.net/account/login",meta={'cookiejar':1},callback=self.post_login,method="POST")]
其中采用Requests的方法首先访问登陆网站。meta属性是字典,字典格式即{‘key’:'value'},字典是一种可变容器模型,可存储任意类型对象。html
request中meta参数的做用是传递信息给下一个函数,这些信息能够是任意类型的,好比值、字符串、列表、字典......方法是把要传递的信息赋值给meta字典的键. 上面start_requests中键‘cookiejar’是一个特殊的键,scrapy在meta中见到此键后,会自动将cookie传递到要callback的函数中。既然是键(key),就须要有值(value)与之对应,例子中给了数字1,也能够是其余值,好比任意一个字符串。浏览器
Callback就是链接到了登陆网站后下一步须要调的函数。下面来看下post_login如何实现tomcat
def post_login(self,response):
html=BeautifulSoup(response.text,"html.parser")
for input in html.find_all('input'):
if 'name' in input.attrs and input.attrs['name'] == 'lt':
lt=input.attrs['value']
if 'name' in input.attrs and input.attrs['name'] == 'execution':
e1=input.attrs['value']
data={'username':'xxxx','password':'xxxxx','lt':lt,'execution':e1,'_eventId':'submit'}
return [FormRequest.from_response(response,
meta={'cookiejar':response.meta['cookiejar']},
headers=self.header,
formdata=data,
callback=self.after_login,)]
首先是获取lt,execution字段的值,具体在以前介绍requests的帖子中有解释。服务器
而后调用FormRequest.from_response。 这个方法的做用是从response中返回的网页中构造表单数据,所以第一个参数是response。这里response返回的网页也就是前面Requests中调用的http://passport.csdn.net/account/logincookie
接下来的参数是meta。Heasers,formadata以及callback。这里的callback有就是指向登陆后的函数。session
after_login的实现以下。dom
def after_login(self,response):
print 'after login'
print response.status
header={'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'}
return [Request("http://my.csdn.net/my/mycsdn",meta={'cookiejar':response.meta['cookiejar']},headers=header,callback=self.parse)]
def parse(self, response):
print response.text.decode('utf-8').encode(self.type)
运行后咱们来看下记录的log。从下面的红色标红部分能够看到。在向http://passport.csdn.net/account/login发起登陆请求后,scrapy紧接着
向http://passport.csdn.net/account/login;jsessionid=8B4A62EA09BBB5F1FBF4D921B64FECE6.tomcat2 发起创建请求。这也就是调用FormRequest.from_response触发的。在这里后面链接的jsessionid值也就是以前在访问登陆网站的时候获取的会话ID,在这里scrapy自动给添加上了
2017-10-16 22:17:34 [scrapy] INFO: Spider opened
2017-10-16 22:17:34 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-16 22:17:34 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-16 22:17:34 [scrapy] DEBUG: Crawled (404) <GET http://passport.csdn.net/robots.txt> (referer: None)
2017-10-16 22:17:34 [scrapy] DEBUG: Crawled (200) <POST http://passport.csdn.net/account/login> (referer: None)
2017-10-16 22:17:34 [scrapy] DEBUG: Crawled (200) <POST http://passport.csdn.net/account/login;jsessionid=8B4A62EA09BBB5F1FBF4D921B64FECE6.tomcat2> (referer: http://www.csdn.net/)
2017-10-16 22:17:35 [scrapy] DEBUG: Crawled (200) <GET http://my.csdn.net/robots.txt> (referer: None)
2017-10-16 22:17:35 [scrapy] DEBUG: Crawled (200) <GET http://my.csdn.net/my/mycsdn> (referer: http://passport.csdn.net/account/login;jsessionid=8B4A62EA09BBB5F1FBF4D921B64FECE6.tomcat2)
2017-10-16 22:17:35 [scrapy] INFO: Closing spider (finished)
2017-10-16 22:17:35 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2022,
从fiddler中抓取的数据能够看到jsessionid是在访问登陆网站后,网站返回的response 消息的header消息中。也就是网站设置的cookie值

完整的代码:
# -*- coding:UTF-8 -*- #from scrapy.spiders import Spider,CrawlSpider,Rulefrom scrapy.selector import Selectorfrom scrapy.http import Requestfrom scrapy import FormRequestfrom test2.items import Test2Itemfrom scrapy.utils.response import open_in_browserfrom scrapy.linkextractors import LinkExtractorfrom bs4 import BeautifulSoupimport sysclass testSpider(Spider): name="test2" allowd_domains=['http://www.csdn.net/'] header={'host':'passport.csdn.net','User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36','Referer':'http://www.csdn.net/'} start_urls=["http://www.csdn.net/"] reload(sys) sys.setdefaultencoding('utf-8') type = sys.getfilesystemencoding() def start_requests(self): return [Request("http://passport.csdn.net/account/login",meta={'cookiejar':1},callback=self.post_login,method="POST")] def post_login(self,response): html=BeautifulSoup(response.text,"html.parser") for input in html.find_all('input'): if 'name' in input.attrs and input.attrs['name'] == 'lt': lt=input.attrs['value'] if 'name' in input.attrs and input.attrs['name'] == 'execution': e1=input.attrs['value'] data={'username':'xxxx','password':'xxxxx','lt':lt,'execution':e1,'_eventId':'submit'} return [FormRequest.from_response(response, meta={'cookiejar':response.meta['cookiejar']}, headers=self.header, formdata=data, callback=self.after_login,)] def after_login(self,response): print 'after login' print response.status header={'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'} return [Request("http://my.csdn.net/my/mycsdn",meta={'cookiejar':response.meta['cookiejar']},headers=header,callback=self.parse)] def parse(self, response): print response.text.decode('utf-8').encode(self.type)