The previous article covered how to extract the full content of dynamically rendered pages. Now we return to the topic from article one: how to log in and keep the login state, so that later requests can carry the cookies.
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'https://www.zhihu.com/collection/146079773'
    res = requests.get(url, verify=False)
    resSoup = BeautifulSoup(res.content, 'lxml')
    items = resSoup.select("div > h2 > a")
    print(len(items))
verify=False disables SSL certificate verification. Running this code prints 0, and pasting the URL into a browser that is not logged in to Zhihu redirects to the login page, so the collection clearly requires login.
Verify this:
from bs4 import BeautifulSoup
from selenium import webdriver

if __name__ == '__main__':
    url = 'https://www.zhihu.com/collection/146079773'
    # res = requests.get(url, verify=False)
    driver = webdriver.Chrome()
    driver.get(url)
    driver.implicitly_wait(2)
    res = driver.page_source
    resSoup = BeautifulSoup(res, 'lxml')
    items = resSoup.select("div > h2 > a")
    print(len(items))
Running this code opens a browser showing the Zhihu login page, which confirms that accessing the collection requires login.
Login trick: open the login page with selenium, set a delay (say 60 seconds), type the account and password by hand to log in to Zhihu, and after the delay save the cookies to disk. Subsequent requests then carry the saved cookies to stay logged in. If the cookies expire, simply repeat this step. The detailed steps follow:
import pickle
import ssl
import time

from selenium import webdriver

if __name__ == '__main__':
    ssl._create_default_https_context = ssl._create_unverified_context
    # url = 'https://www.zhihu.com/collection/146079773'
    url = "https://www.zhihu.com/signin"
    # res = requests.get(url, verify=False)
    driver = webdriver.Chrome()
    driver.implicitly_wait(5)
    driver.get(url)
    time.sleep(40)  # log in by hand in the browser window during this pause
    cookies = driver.get_cookies()
    pickle.dump(cookies, open("cookies.pkl", "wb"))
    print("save suc")
Run this code and check that a cookies.pkl file has been created, which means the cookies were saved successfully.
Next, verify them with a second piece of code.
import pickle

from bs4 import BeautifulSoup
from selenium import webdriver

if __name__ == '__main__':
    cookies = pickle.load(open("cookies.pkl", "rb"))
    url = 'https://www.zhihu.com/collection/146079773'
    driver = webdriver.Chrome()
    # open a page on zhihu.com first so the cookies can be attached to the domain
    driver.get("https://www.zhihu.com/signin")
    for cookie in cookies:
        print(cookie)
        driver.add_cookie(cookie)
    driver.get(url)
    driver.implicitly_wait(2)
    res = driver.page_source
    resSoup = BeautifulSoup(res, 'lxml')
    items = resSoup.select("div > h2 > a")
    print(len(items))
This opens a browser, loads a page first (cookies can only be added once a page on the matching domain is open), then adds the saved cookies and opens the given URL. Run the code.
At this point the two hardest problems, dynamic pages and login, are solved. What remains is how to store the scraped data. My plan is to first extract all the questions and their links from the 10 pages that require login and save them to a JSON file for later processing (a sketch follows), then extract all the image links under each question; whether to save the links or download the images directly is a matter of preference.
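As a minimal sketch of that idea (the {title: link} structure and the file name are my own assumptions, not the spider's actual output format):

import json

def save_questions(questions, path="questions.json"):
    # questions: dict mapping question title -> question link (assumed structure)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(questions, f, ensure_ascii=False, indent=2)

# Placeholder data only; real titles and links would come from the spider below.
save_questions({"example question": "https://www.zhihu.com/question/000000"})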
The DOWNLOADER_MIDDLEWARES setting in the settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # 'zhihu.middlewares.PhantomJSMiddleware': 100,
}
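For reference, here is a minimal sketch of what the commented-out PhantomJSMiddleware could look like; the real implementation lives in the earlier article and is not reproduced here, so treat this version as an assumption:

from scrapy.http import HtmlResponse
from selenium import webdriver

class PhantomJSMiddleware(object):
    def __init__(self):
        # A headless browser shared by all requests routed through this middleware.
        self.driver = webdriver.PhantomJS()

    def process_request(self, request, spider):
        # Render the page in the browser instead of Scrapy's own downloader.
        self.driver.get(request.url)
        self.driver.implicitly_wait(2)
        # Returning an HtmlResponse makes Scrapy skip the normal download step.
        return HtmlResponse(url=request.url, body=self.driver.page_source,
                            encoding="utf-8", request=request)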
Anti-crawler countermeasures: if a site is hit too quickly it will usually block access or ban the IP outright. For a crawler that has to stay logged in, throttle the request rate, for example one request every 5 seconds, or cap the number of requests per IP per minute. For pages that do not require login, using proxy IPs is the best option (a sketch follows the settings excerpt below); simply lowering the request rate also works. The relevant settings in settings.py:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
These options all throttle the request rate. I usually just set DOWNLOAD_DELAY, which here means one request every two seconds.
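Since proxy IPs were mentioned above, here is a minimal sketch of attaching a proxy to a single Scrapy request via request.meta; the spider name and the proxy address are placeholders, not the article's own code:

import scrapy

class ProxyDemoSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate the proxy mechanism.
    name = "proxy_demo"

    def start_requests(self):
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta.
        # "http://127.0.0.1:8080" is a placeholder, not a real proxy endpoint.
        yield scrapy.Request("https://www.zhihu.com",
                             meta={"proxy": "http://127.0.0.1:8080"},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info("fetched %s via proxy", response.url)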
The spider code is as follows:
import pickle

import scrapy
from bs4 import BeautifulSoup
from scrapy import signals


class Zhihu(scrapy.Spider):
    name = "zhihu"
    cookies = pickle.load(open("cookies.pkl", "rb"))
    urls = []
    questions_url = set()
    for i in range(1, 11):
        temp_url = "https://www.zhihu.com/collection/146079773?page=" + str(i)
        urls.append(temp_url)

    def start_requests(self):
        for url in self.urls:
            request = scrapy.Request(url=url, callback=self.parse, cookies=self.cookies)
            yield request

    def parse(self, response):
        print(response.url)
        resSoup = BeautifulSoup(response.body, 'lxml')
        items = resSoup.select("div > h2 > a")
        print(len(items))
        for item in items:
            print(item['href'])
            self.questions_url.add(item['href'] + "\n")

    # Using Scrapy signals: save the collected urls when the spider closes.
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        print("from_crawler")
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_closed(self, spider):
        print("spider close, save urls")
        with open("urls.txt", "w") as f:
            for url in self.questions_url:
                f.write(url)
Run the spider from the command line and inspect the urls.txt file.
As you can see, 44 links were scraped successfully. After discarding the few invalid links such as the people and zhuanlan ones, the file can later be read back, the hrefs joined into full URLs, and selenium used as a downloader middleware to extract all the image links (a sketch of reading the file back follows).
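A minimal sketch of reading the file back and filtering out the invalid links; the filtering keywords and the scheme handling are assumptions based on the note above:

def load_question_urls(path="urls.txt"):
    urls = []
    with open(path) as f:
        for line in f:
            link = line.strip()
            # Skip empty lines and the invalid people/zhuanlan links mentioned above.
            if not link or "people" in link or "zhuanlan" in link:
                continue
            # Assumed: hrefs may be scheme-relative, e.g. //www.zhihu.com/question/...
            if link.startswith("//"):
                link = "https:" + link
            urls.append(link)
    return urls

if __name__ == "__main__":
    for url in load_question_urls():
        print(url)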
Summary: this article showed how to use selenium to log in to a site by hand, save the cookies, and reuse them for later visits (this works for almost any site; just throttle the request rate to avoid getting banned).
These three articles have covered how to use scrapy to scrape what you want; by now you could also design and implement a crawler of your own without a framework. Saving the images and using proxies will be covered briefly later. A follow-up article will show how to deploy the crawler on a server, using docker to set up the Python environment for running it.
WeChat: youquwen1226
github: feel free to reach out to discuss.