A web scraping micro-framework based on asyncio.
aspider is a lightweight asynchronous crawler framework built on asyncio. Its goal is to make writing single-page crawlers quicker and more convenient, and to use async IO to speed up crawling by cutting the time spent waiting on IO.
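The IO-saving claim can be illustrated with a stdlib-only sketch (no aspider required, and the URLs and `fake_fetch` helper below are made up for the demo): ten simulated 0.1 s network waits run concurrently on the event loop, so the whole batch finishes in roughly the time of one request rather than the sum of all ten.

```python
import asyncio
import time

async def fake_fetch(url: str) -> str:
    # Simulate a network round-trip; a real crawler would await an HTTP client here.
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"

async def crawl(urls):
    # gather() runs all the coroutines concurrently, so their IO waits overlap.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

urls = [f"https://example.com/page/{i}" for i in range(10)]
start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print(f"fetched {len(pages)} pages in {elapsed:.2f}s")  # ~0.1 s total, not ~1.0 s
```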
```shell
pip install aspider
```
For a single page, implementing the `Item` defined by the framework is enough to scrape the target data:
```python
import asyncio

from aspider import Request

request = Request("https://news.ycombinator.com/")
response = asyncio.get_event_loop().run_until_complete(request.fetch())

# Output
# [2018-07-25 11:23:42,620]-Request-INFO <GET: https://news.ycombinator.com/>
# <Response url[text]: https://news.ycombinator.com/ status:200 metadata:{}>
```
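To show the idea behind the declarative `Item`/`TextField` style, here is a toy stdlib-only re-implementation for illustration (this is not aspider's actual code; the regex-based `TextField` and the `PageItem` class are invented for the sketch). Each field holds a pattern, and the item resolves every field against the page in one call:

```python
import re

class TextField:
    """Toy stand-in for aspider's TextField: extracts the first regex match."""
    def __init__(self, pattern: str):
        self.pattern = re.compile(pattern, re.S)

class Item:
    """Resolve every TextField declared on the subclass against the given HTML."""
    @classmethod
    def get_item(cls, html: str):
        instance = cls()
        for name, field in vars(cls).items():
            if isinstance(field, TextField):
                match = field.pattern.search(html)
                setattr(instance, name, match.group(1) if match else None)
        return instance

class PageItem(Item):
    title = TextField(r"<title>(.*?)</title>")
    heading = TextField(r"<h1>(.*?)</h1>")

html = "<html><title>Hacker News</title><body><h1>Top stories</h1></body></html>"
item = PageItem.get_item(html)
print(item.title, "|", item.heading)  # Hacker News | Top stories
```

aspider's real fields take CSS selectors and parse actual HTML, but the declarative shape — fields as class attributes, resolved in bulk by the item — is the same.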
When a page has many targets and deep crawling is required, `Spider` comes in handy:
```python
import aiofiles

from aspider import AttrField, TextField, Item, Spider


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/', 'https://news.ycombinator.com/news?p=2']

    async def parse(self, res):
        items = await HackerNewsItem.get_items(html=res.body)
        for item in items:
            async with aiofiles.open('./hacker_news.txt', 'a') as f:
                await f.write(item.title + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start()
```
Loading JavaScript is also supported. The `Request` class handles it well and returns the rendered content; the example below scrapes a page that can only be fetched after its JS has loaded:
```python
import asyncio

from aspider import Request

request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.get_event_loop().run_until_complete(request.fetch())
print(response.body)
```
If you like it, feel free to play with it. The project's GitHub repository: aspider