Scraping code snippets from CodeSnippet
<body>
  <div id="container">
    <div class="content bor round">
      <ul>
        <li class="con-logo bbor">
          <a href="http://www.codesnippet.cn/index.html" title="分享你的世界"></a>
        </li>
        <li class="con-code bbor">
          <pre class="brush:php;"> <!--代码块--> </pre>
        </li>
        <li class="con-btn bbor">
          <ul>
            <li><a href="http://www.codesnippet.cn/pcode.html" class="button">发布代码片断</a></li>
            <li><a href="http://www.codesnippet.cn/list.html" class="button">片断列表</a></li>
          </ul>
          <br class="clearfloat" />
        </li>
        <li class="con-motto bbor">
          <div>一个线程若是是我的英雄主义,那么多线程就是集体主义,你再也不是一个独行侠,而是一个指挥家。</div>
        </li>
        <li class="con-count bbor">
          <div> 共有<span> {15106} </span>个代码片断 </div>
        </li>
        <li class="con-copyright">
          <div>京ICP备13038605号
            <script src="http://s14.cnzz.com/stat.php?id=4720394&web_id=4720394" language="JavaScript"></script>
          </div>
        </li>
      </ul>
    </div>
  </div>
</body>
The content we want to scrape sits inside the <li class="con-code bbor"> element.
So we use BeautifulSoup's find() method to locate that tag, then take its text content.
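To make the "find the tag, then take its text" step concrete, here is a minimal sketch that uses only the Python 3 standard library instead of BeautifulSoup. It is not how BeautifulSoup works internally; the class `ClassTextExtractor`, the helper `extract_text`, and the `SAMPLE` markup are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first <li> whose class attribute matches exactly."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0       # > 0 while we are inside the target <li>
        self.done = False    # True once the target <li> has been closed
        self.chunks = []     # text fragments found inside the target

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "li":  # nested <li>: remember it so we close correctly
                self.depth += 1
        elif tag == "li" and dict(attrs).get("class") == self.wanted_class:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "li":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(markup, wanted_class):
    """Return the stripped text of the matching <li>, or "" if there is none."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical, shortened stand-in for the page shown above.
SAMPLE = """<ul><li class="con-logo bbor"><a href="#"></a></li><li class="con-code bbor"><pre class="brush:php;">echo 'hi';</pre></li></ul>"""

print(extract_text(SAMPLE, "con-code bbor"))
```

With BeautifulSoup the same result takes a single find(...).get_text() call, which is why the rest of this post uses it.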
First, import the two modules our crawler needs:
from urllib2 import urlopen
from bs4 import BeautifulSoup
# Scrape the code snippet shown on http://www.codesnippet.cn/index.html
def GrapIndex():
    html = "http://www.codesnippet.cn/index.html"
    bsObj = BeautifulSoup(urlopen(html), 'html.parser')
    # Locate the <li class="con-code bbor"> tag and return its text
    return bsObj.find("li", {"class": "con-code bbor"}).get_text()
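Once we have the data we want, the next step would normally be to write it to a database. Since what we are scraping here is simple, writing it to a file is enough.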
def SaveResult():
    codeFile = open("code.txt", "a")  # open in append mode
    # Iterating over a string yields one character at a time
    for char in GrapIndex():
        codeFile.write(char)
    codeFile.close()
Running this raises the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u751f' in position 0: ordinal not in range(128)
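Python 2.7 falls back to the ASCII codec whenever it has to convert between unicode and byte strings implicitly, so any character outside the ASCII range raises an exception (ordinal not in range(128)).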
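The failure can be reproduced in isolation by encoding the character from the traceback by hand. This is a small sketch rather than part of the crawler; the helper name `try_encode` is made up for the illustration. The explicit "ascii" argument stands in for the implicit default that Python 2.7 applies.

```python
# -*- coding: utf-8 -*-
text = u"\u751f"  # the character named in the traceback


def try_encode(value, codec):
    """Return the encoded bytes, or the error's reason if encoding fails."""
    try:
        return value.encode(codec)
    except UnicodeEncodeError as exc:
        return exc.reason


ascii_result = try_encode(text, "ascii")   # fails: the character is not ASCII
utf8_result = try_encode(text, "utf-8")    # succeeds: UTF-8 covers all of Unicode

print(ascii_result)  # ordinal not in range(128)
print(utf8_result)
```

The same character that ASCII rejects encodes to three bytes in UTF-8, so the text can be written out once UTF-8 is the codec in use.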
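One workaround is to reset the default encoding to UTF-8:
import sys
reload(sys)  # restores setdefaultencoding, which is removed from sys at startup
sys.setdefaultencoding('utf-8')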
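The complete code:
from urllib2 import urlopen
from bs4 import BeautifulSoup
import sys

# Make UTF-8 the default codec for implicit unicode/str conversions
reload(sys)
sys.setdefaultencoding('utf-8')


def GrapIndex():
    html = "http://www.codesnippet.cn/index.html"
    bsObj = BeautifulSoup(urlopen(html), 'html.parser')
    # Locate the <li class="con-code bbor"> tag and return its text
    return bsObj.find("li", {"class": "con-code bbor"}).get_text()


def SaveResult():
    codeFile = open("code.txt", "a")  # open in append mode
    # Iterating over a string yields one character at a time
    for char in GrapIndex():
        codeFile.write(char)
    codeFile.close()


if __name__ == '__main__':
    # Fetch the page and save a snippet nine times
    for i in range(0, 9):
        SaveResult()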
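An alternative that leaves the interpreter-wide default encoding untouched is to give the output file an explicit encoding. The sketch below is one possible variant, not the original author's method. `save_text` is a hypothetical helper name, and it writes the whole string in one call instead of character by character. `io.open` accepts the `encoding` argument on both Python 2.7 and Python 3.

```python
import io


def save_text(text, path="code.txt"):
    """Append unicode text to path, encoding it as UTF-8 explicitly."""
    with io.open(path, "a", encoding="utf-8") as code_file:
        code_file.write(text)
```

The crawler would then call save_text(GrapIndex()), since get_text() already returns a unicode string.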