title: Scraping CSDN Articles
date: 2019-06-09 13:17:26
tags: html
Since I recently set up this blog, I wanted to migrate my CSDN posts over. The one-click migration feature did not work, so I decided to scrape the articles directly and republish them here.
Time budget: 3 hours
Expected result: blog posts saved locally
- Find the article list page and extract the URL of each article.
- Parse each article page and extract its content.
- Save the content locally.
- Try to preserve the article styling as well.
This is done in Python, using the pyquery library for scraping and parsing.
```python
article = doc('.blog-content-box')
# article title
title = article('.title-article').text()
# article content
content = article('.article_content')
```
```python
dir = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
with open(dir, 'a', encoding='utf-8') as file:
    file.write(title + '\n' + content.text())
```
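One caveat with using the article title directly as a file name: on Windows, titles containing characters such as `:`, `?`, or `/` will make `open()` fail. A small sanitizing helper could guard against that (`sanitize_filename` is my own name for illustration, not part of the original script):

```python
import re

def sanitize_filename(title):
    """Replace characters that are invalid in Windows file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

# e.g. a title containing a colon and a question mark
print(sanitize_filename('Python: how to scrape?'))  # -> Python_ how to scrape_
```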
```python
urls = doc('.article-list .content a')
return urls
```
```python
for i in range(3):
    print(i)
    main(offset=i + 1)
```
Putting the code together:
```python
#!/usr/bin/env python
# _*_ coding:utf-8 _*_
# @Time    : 2019/6/8 23:00
# @Author  : 喜欢二福的沧月君 (necydcy@gmail.com)
# @FileName: CSDN.py
# @Software: PyCharm
import requests
from pyquery import PyQuery as pq


def find_html_content(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/52.0.2743.116 Safari/537.36'
    }
    html = requests.get(url, headers=headers).text
    return html


def read_and_write_blog(html):
    doc = pq(html)
    article = doc('.blog-content-box')
    # article title
    title = article('.title-article').text()
    content = article('.article_content')
    try:
        dir = "F:/python-project/SpiderLearner/CSDNblogSpider/article/" + title + '.txt'
        with open(dir, 'a', encoding='utf-8') as file:
            file.write(title + '\n' + content.text())
    except Exception:
        print("Failed to save")


def geturls(url):
    content = find_html_content(url)
    doc = pq(content)
    urls = doc('.article-list .content a')
    return urls


def main(offset):
    url = '此处为博客地址' + str(offset)  # replace with your blog list URL
    urls = geturls(url)
    for a in urls.items():
        a_url = a.attr('href')
        print(a_url)
        html = find_html_content(a_url)
        read_and_write_blog(html)


if __name__ == '__main__':
    for i in range(3):
        print(i)
        main(offset=i + 1)
```
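The last goal above, preserving the article styling, is not covered by the script, which stores plain text only. One possible approach, sketched here as an untested assumption rather than a verified solution for CSDN, is to save the markup from `content.html()` wrapped in a minimal standalone page instead of writing `content.text()` (`wrap_article` and its template are hypothetical):

```python
def wrap_article(title, body_html):
    """Wrap extracted article HTML in a minimal standalone page.

    `body_html` would come from something like `content.html()` in pyquery;
    the template below is a bare-bones sketch, not CSDN's real styling.
    """
    return (
        '<!DOCTYPE html>\n'
        '<html><head><meta charset="utf-8">'
        '<title>{}</title></head>\n'
        '<body><h1>{}</h1>\n{}\n</body></html>'
    ).format(title, title, body_html)

page = wrap_article('Demo', '<p>Hello</p>')
# then save it as HTML instead of .txt:
# with open(title + '.html', 'w', encoding='utf-8') as f:
#     f.write(page)
```

To keep images and CSS as well, the page's stylesheet links and image files would also need to be downloaded, which is beyond this sketch.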