1. Modifying the Scrapy settings file
The configuration file is settings.py under the project root. Change the settings below.
a. Set robots.txt compliance to False; otherwise you can hardly crawl anything.
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
b. Set a User-Agent; without one, most sites cannot be crawled.
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0'
c. Enable the default request headers so the requests look more like a real browser; you can adjust these to match your own browser.
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
}
d. Set a download delay. This matters: overly frequent requests not only get you banned easily, they can also crash smaller sites. Scrapy is asynchronous and concurrent by nature, so skipping this harms both you and the target.
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
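The comment above also points at AutoThrottle. As an alternative to a fixed delay, Scrapy can adapt the delay to server load; a minimal sketch of enabling it in settings.py (the values here are illustrative, not recommendations):

# Optional: adapt the delay to load instead of using a fixed value
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3           # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10            # upper bound when the server responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server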
e. Log level. Set this to your own preference; I set it to WARNING, which keeps the output quiet after each run and suits the frequent re-runs of the learning stage.
LOG_LEVEL = 'WARNING'
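All of the settings above apply project-wide. If a single spider needs different values, Scrapy also supports per-spider overrides through the custom_settings class attribute; a small sketch (the spider name is hypothetical):

import scrapy

class SlowSpider(scrapy.Spider):
    name = 'slow_spider'  # hypothetical example spider
    # Overrides the project-wide settings for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'LOG_LEVEL': 'INFO',
    }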
2. Crawling and parsing the jokes
First open Qiushibaike in a browser and pick the "段子" (text jokes) section from the navigation bar; these are plain text, so we leave images and videos alone for now.
Paste the URL in and run a quick test; it succeeds.
Back in the browser, view the page source of a joke, collapse the elements, and inspect the HTML structure; the container holding the joke list is easy to spot.
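Before wiring any XPath into the spider, it can be handy to try it interactively with scrapy shell; a quick sketch, assuming the page is reachable from your machine:

$ scrapy shell 'https://www.qiushibaike.com/text/'
# Inside the shell, test the XPath against the live response:
>>> jokes = response.xpath("//div[contains(@class, 'article block')]")
>>> len(jokes)   # should match the number of jokes on the page
>>> jokes[0].xpath("./div/a[2]/h2/text()").extract_first()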
OK, write an XPath to grab the joke list, then use the same approach to locate the author and the content with XPath.
import scrapy

class Qiubai1Spider(scrapy.Spider):
    name = 'qiubai1'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Grab the joke list
        joke_list = response.xpath("//div[contains(@class, 'article block')]")
        for joke in joke_list:
            author = joke.xpath("./div/a[2]/h2/text()").extract_first()
            print(author)
            content = joke.xpath(".//div[@class='content']/span/text()").extract()
            print(content)
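A side note on the two extraction methods used above: extract_first() returns the first match as a string (or None when nothing matches), while extract() returns every match as a list of strings. Recent Scrapy versions offer get() and getall() as equivalent, shorter aliases; a self-contained illustration:

from scrapy.selector import Selector

sel = Selector(text="<div><span>one</span><span>two</span></div>")
print(sel.xpath("//span/text()").get())      # 'one'          -- same as extract_first()
print(sel.xpath("//span/text()").getall())   # ['one', 'two'] -- same as extract()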
Run it and check the output.
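In case you have not run a spider before: from the project root (the directory containing scrapy.cfg), a spider is started by the name it declares in its name attribute:

$ scrapy crawl qiubai1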
You can already see the authors and the list of jokes.
3. Receiving the data with an Item
Next, put the crawl results into an Item. As Scrapy's generic data container, an Item both makes the data easier to manage and structure, and makes it convenient to hand off to pipelines for storage and processing (a minimal pipeline sketch appears at the end of this section).
We directly edit the items.py that was auto-generated when the project was created:
import scrapy

class Scpy1Item(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()
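Items behave much like dicts, which is what makes them a convenient general-purpose container; a quick interactive illustration:

>>> from scpy1.items import Scpy1Item
>>> item = Scpy1Item(author='someone', content='a joke')
>>> item['author']
'someone'
>>> dict(item)   # easy to convert for JSON serialization or storage
{'author': 'someone', 'content': 'a joke'}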
Then modify the spider code to pass the data to the item.
import scrapy
import re

from scpy1.items import Scpy1Item

class Qiubai1Spider(scrapy.Spider):
    name = 'qiubai1'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Grab the joke list
        joke_list = response.xpath("//div[contains(@class, 'article block')]")
        for joke in joke_list:
            # Parse the author and the content
            author = joke.xpath("./div/a[2]/h2/text()").extract_first()
            content = joke.xpath(".//div[@class='content']/span/text()").extract()
            # Pack the data into the item
            item = Scpy1Item()
            item['author'] = re.sub("[\n]", "", author)
            item['content'] = re.sub("[\n]", "", ','.join(content))
            yield item
            # Print the item to check the result
            print(item)
Run it and check the output.
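As mentioned above, the yielded items can be handed to a pipeline for storage. A minimal sketch of such a pipeline in pipelines.py, writing each item as one JSON line (the class name and output filename are illustrative, following the pattern from the Scrapy docs):

import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.file = open('jokes.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

To activate it, register the class in settings.py:

ITEM_PIPELINES = {'scpy1.pipelines.JsonWriterPipeline': 300}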