This post uses Python's BeautifulSoup to build a crawler that scrapes questions and their witty top answers (「神回复」) from a Zhihu collection, as raw material for natural language processing research. My first contact with Python crawlers came from a shared snippet, 《获取城市的PM2.5浓度和排名(含单线程和多线程)》 ("Get a city's PM2.5 concentration and ranking, single- and multi-threaded"):
#!/usr/bin/env python
# by keyven
# PM25.py
import urllib2
from bs4 import BeautifulSoup

def getPm25(cityname):
    site = 'http://www.pm25.com/' + cityname + '.html'
    html = urllib2.urlopen(site)
    soup = BeautifulSoup(html)
    city = soup.find(class_='bi_loaction_city')             # city name
    aqi = soup.find("a", {"class": "bi_aqiarea_num"})       # AQI value (attrs must be a dict, not a set)
    quality = soup.select(".bi_aqiarea_right span")         # air quality level (not used below)
    result = soup.find("div", class_='bi_aqiarea_bottom')   # air quality description
    s = city.text + '\nAQI:' + aqi.text + '\nAir Quality: ' + result.text
    return s
#!/usr/bin/env python
# by keyven
# main.py
from PM25 import getPm25

s = getPm25('xianggang')
print s
Sample output:
We start crawling from this URL (http://www.zhihu.com/collection/27109279?page=1). Open the page and read the source carefully, and a structural pattern emerges: each question and its answer sit as two sub-blocks inside one shared parent block. Now for the code:
#!/usr/bin/env python
# by keyven
# main.py
import urllib2
from bs4 import BeautifulSoup

for p in range(1, 5):
    url = "http://www.zhihu.com/collection/27109279?page=" + str(p)
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    allp = soup.findAll(class_='zm-item')
    for each in allp:
        answer = each.findNext(class_='zh-summary summary clearfix')
        if len(answer.text) > 100:
            continue  # answer too long: probably truncated behind a "show all" link, so skip it
        problem = each.findNext(class_='zm-item-title')
        print problem.text,
        print answer.text
Sample output:
Looking at the output, we spot a small bug. Going back to the original page:
It turns out the same question appears with two identical answers. We can handle this with a little application-level bookkeeping: within each page, keep a set and use it to check whether an answer has already been printed.
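A minimal sketch of that fix, reusing the loop from above; the set name `seen` and the choice to key on the answer text are my assumptions, not from the original post:

#!/usr/bin/env python
# main.py (per-page dedup sketch)
import urllib2
from bs4 import BeautifulSoup

for p in range(1, 5):
    url = "http://www.zhihu.com/collection/27109279?page=" + str(p)
    soup = BeautifulSoup(urllib2.urlopen(url))
    seen = set()  # answer texts already printed on this page (hypothetical helper, not in the original)
    for each in soup.findAll(class_='zm-item'):
        answer = each.findNext(class_='zh-summary summary clearfix')
        if len(answer.text) > 100:
            continue  # truncated "show all" answer, skip as before
        if answer.text in seen:
            continue  # identical answer already seen on this page: the duplicate the bug produced
        seen.add(answer.text)
        problem = each.findNext(class_='zm-item-title')
        print problem.text,
        print answer.text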