Author: 崩坏的芝麻
Our lab needs a corpus for research, specifically paper abstracts from CNKI. The current version of CNKI is rather troublesome to crawl, so I used one of CNKI's alternative search interfaces instead.
For example, this page:
http://search.cnki.net/Search.aspx?q=肉制品
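Judging from the crawler further below, this interface also takes an offset parameter `&p=`, with 15 results per page. A minimal sketch of how the paged search URLs are built (the `build_search_urls` helper is mine, not part of the original script):

```python
from urllib.parse import quote

# Build paged URLs for the old search interface. The &p= offset and the
# 15-results-per-page figure are taken from the crawler code further below;
# build_search_urls itself is a hypothetical helper, not part of the script.
def build_search_urls(keyword, pages=3):
    base = "http://search.cnki.net/search.aspx?q=%s&rank=relevant&cluster=zyk&val=" % quote(keyword)
    return [base + "&p=" + str(i * 15) for i in range(pages)]

for u in build_search_urls("肉制品"):
    print(u)
```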
The results it returns are almost identical to those on CNKI itself.
With that, I took a quick look at the page structure, and it was easy to write the crawling code (it is very basic and far from complete; you can add other features yourself).
The page structure is quite clear.
The abstract information is also clearly laid out.
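Concretely, the structure the crawler relies on is: each search result is a `div` with class `wz_tab` whose first `<a>` points to the detail page, and on the detail page the abstract sits in `<span id="ChDivSummary">` while the keywords are links inside `<span id="ChDivKeyWord">`. A minimal extraction sketch for a single detail page, using the example URL that is commented out in the full script (the site may require the full headers used there to avoid a redirect):

```python
import requests
from bs4 import BeautifulSoup

# The commented-out example URL from the full script below; any detail-page
# link collected from the result list would do.
url = 'http://epub.cnki.net/grid2008/brief/detailj.aspx?filename=RLGY201806014&dbname=CJFDLAST2018'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Abstract: <span id="ChDivSummary">
abstract_span = soup.find('span', id='ChDivSummary')
abstract = abstract_span.get_text() if abstract_span else ''

# Keywords: links inside <span id="ChDivKeyWord">; some papers have none
kw_span = soup.find('span', id='ChDivKeyWord')
keywords = [a.get_text() for a in kw_span.find_all('a', class_='KnowledgeNetLink')] if kw_span else []
```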
I used pymysql to connect to the database; the performance is acceptable.
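The script inserts six values into a table named `cnki` (a NULL placeholder plus five fields), so the table needs an auto-increment id followed by five text columns. A hypothetical schema that matches the INSERT; the column names are my guesses, only the count and order are fixed by the statement:

```python
import pymysql

# Same connection settings as the crawler; note charset is 'utf8', not 'utf-8'
connection = pymysql.connect(host='', user='', password='', db='',
                             port=3306, charset='utf8')
cursor = connection.cursor()

# Hypothetical table matching INSERT INTO cnki VALUES (NULL, %s, %s, %s, %s, %s);
# column names are assumptions, only the count and order are implied
cursor.execute("""
    CREATE TABLE IF NOT EXISTS cnki (
        id INT AUTO_INCREMENT PRIMARY KEY,
        wordType VARCHAR(255),
        title VARCHAR(255),
        author VARCHAR(255),
        abstract TEXT,
        keyword VARCHAR(255)
    ) DEFAULT CHARSET=utf8
""")
connection.commit()
```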
Here is the code:
```python
# -*- coding: utf-8 -*-
import time
import random
import requests
from bs4 import BeautifulSoup
import pymysql

# Note: the charset is 'utf8', not 'utf-8'
connection = pymysql.connect(host='', user='', password='', db='',
                             port=3306, charset='utf8')
# Get a cursor
cursor = connection.cursor()

#url = 'http://epub.cnki.net/grid2008/brief/detailj.aspx?filename=RLGY201806014&dbname=CJFDLAST2018'

# These headers must be included, otherwise the site redirects the request to another page
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'www.cnki.net',
    'Referer': 'http://search.cnki.net/search.aspx?q=%E4%BD%9C%E8%80%85%E5%8D%95%E4%BD%8D%3a%E6%AD%A6%E6%B1%89%E5%A4%A7%E5%AD%A6&rank=relevant&cluster=zyk&val=CDFDTOTAL',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
headers1 = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}


def get_url_list(start_url):
    """Collect detail-page links from the first `depth` result pages."""
    depth = 20
    url_list = []
    for i in range(depth):
        try:
            # Each result page holds 15 entries; &p= is the result offset
            url = start_url + "&p=" + str(i * 15)
            search = requests.get(url.replace('\n', ''), headers=headers1)
            soup = BeautifulSoup(search.text, 'html.parser')
            for art in soup.find_all('div', class_='wz_tab'):
                print(art.find('a')['href'])
                if art.find('a')['href'] not in url_list:
                    url_list.append(art.find('a')['href'])
            print("Crawled page " + str(i) + " successfully!")
            time.sleep(random.randint(1, 3))
        except Exception:
            print("Failed to crawl page " + str(i) + "!")
    return url_list


def get_data(url_list, wordType):
    try:
        # Visit every link collected from the result pages
        for i, url in enumerate(url_list, start=1):
            if not url:
                continue
            soup = None
            try:
                html = requests.get(url.replace('\n', ''), headers=headers)
                soup = BeautifulSoup(html.text, 'html.parser')
            except Exception:
                print("Failed to fetch the page")
            if soup is None:
                continue
            title = author = abstract = key = ''
            try:
                print(url)
                # Title: first segment of the <title> tag
                title = soup.find('title').get_text().split('-')[0]
                # Authors
                for a in soup.find('div', class_='summary pad10').find('p').find_all('a', class_='KnowledgeNetLink'):
                    author += (a.get_text() + ' ')
                # Abstract
                abstract = soup.find('span', id='ChDivSummary').get_text()
            except Exception:
                print("Some fields could not be extracted")
            # Keywords; some papers have none
            try:
                for k in soup.find('span', id='ChDivKeyWord').find_all('a', class_='KnowledgeNetLink'):
                    key += (k.get_text() + ' ')
            except Exception:
                pass
            print("URL no. " + str(i))
            print("[Title]: " + title)
            print("[author]: " + author)
            print("[abstract]: " + abstract)
            print("[key]: " + key)
            # Run the SQL statement
            cursor.execute('INSERT INTO cnki VALUES (NULL, %s, %s, %s, %s, %s)',
                           (wordType, title, author, abstract, key))
            # Commit to the database
            connection.commit()
        print()
        print("Crawling finished")
    finally:
        print()


if __name__ == '__main__':
    try:
        for wordType in {"大肠杆菌", "菌群总落", "胭脂红", "日落黄"}:
            wordType = "肉+" + wordType
            start_url = "http://search.cnki.net/search.aspx?q=%s&rank=relevant&cluster=zyk&val=" % wordType
            url_list = get_url_list(start_url)
            print("Start crawling")
            get_data(url_list, wordType)
            print("Finished one keyword type")
        print("All done")
    finally:
        connection.close()
```
Here I just picked a few keywords as a test. If you need to crawl many more, you can write them in a txt file and read them in directly, which is very convenient.
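If the keywords do live in a text file, the main loop only needs a small change; a sketch, assuming one keyword per line in a hypothetical keywords.txt:

```python
# Read the search keywords from a file, one per line
# (keywords.txt is a hypothetical file name)
with open('keywords.txt', encoding='utf-8') as f:
    word_types = [line.strip() for line in f if line.strip()]

for wordType in word_types:
    wordType = "肉+" + wordType
    # ... then proceed exactly as in the __main__ block above
```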