To scrape web-page data with Python's BeautifulSoup package, the package must first be installed on your machine. It is available through pip and can be installed with:
pip3 install BeautifulSoup4
Reading web-page data in Python is similar to ordinary file I/O. The following code reads the Wikipedia page on "Smog":
from urllib.request import urlopen
from bs4 import BeautifulSoup

raw = urlopen("http://en.wikipedia.org/wiki/Smog").read()
print(type(raw))
print(raw[100:200])
The data is read in as a byte stream (a bytes object).
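Since the downloaded data is bytes, it can be decoded into a string before further processing; a minimal sketch, using a hard-coded byte string in place of the downloaded page:

```python
# raw stands in for the bytes returned by urlopen(...).read()
raw = b"<p>Smog is a type of air pollution.</p>"
# Decode the byte stream into a str for text processing
text = raw.decode('utf-8')
print(type(text).__name__)  # str
```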
Passing it to the BeautifulSoup class converts it into a BeautifulSoup object:
soup = BeautifulSoup(raw, 'html.parser')
print(type(soup))
The following code extracts the paragraph text between <p></p> tags in the page and stores it in the list texts:
texts = []
for para in soup.find_all('p'):
    text = para.text
    texts.append(text)
print(texts[:10])
To strip all citation markers (such as "[1]" and "[2]") from the extracted text, process it as follows:
import re

regex = re.compile(r'\[[0-9]*\]')  # matches citation markers like [1]
joined_texts = '\n'.join(texts)
joined_texts = re.sub(regex, '', joined_texts)
print(type(joined_texts))
print(joined_texts)
The string joined_texts is first built by joining the elements of texts with newline characters. It is then used as the input text, and every Wikipedia citation marker in it is replaced with the empty string.
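As a small illustration of the substitution, here is the same pattern applied to a made-up sentence:

```python
import re

# Citation markers such as [1] or [23] are matched by \[[0-9]*\]
regex = re.compile(r'\[[0-9]*\]')
sample = "Smog is a type of air pollution.[1][23] It reduces visibility.[4]"
cleaned = re.sub(regex, '', sample)
print(cleaned)  # Smog is a type of air pollution. It reduces visibility.
```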
The resulting joined_texts can then be processed further, for example:
import nltk

# word_tokenize requires the 'punkt' tokenizer data: nltk.download('punkt')
wordlist = nltk.word_tokenize(joined_texts)
print(wordlist[:8])
good_text = nltk.Text(wordlist)
good_text.concordance('smog')
This prints concordance lines showing how the word "smog" is used in the text.
More information about NLTK can be found in its documentation.
Finally, we can write the processed document out. The text can be written directly as a word list:
NLTK_file = open("NLTK-Smog.txt", "w", encoding='UTF-8')
NLTK_file.write(str(wordlist))
NLTK_file.close()
or as the processed text itself:
text_file = open("Smog-text.txt", "w", encoding='UTF-8')
text_file.write(joined_texts)
text_file.close()
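To check that the output was written correctly, the file can be read back and compared; a sketch using a hypothetical file name, demo.txt, rather than the files above:

```python
# Write some text, then read it back to verify the round trip
content = "smog line one\nsmog line two"
with open("demo.txt", "w", encoding="UTF-8") as f:
    f.write(content)
with open("demo.txt", "r", encoding="UTF-8") as f:
    readback = f.read()
print(readback == content)  # True
```

Using with blocks also guarantees the file is closed even if an exception occurs, which the open/write/close pattern above does not.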
Applying BeautifulSoup to pages in other languages (such as Chinese) still requires special handling, and the code in this article may not apply directly. For example, Chinese pages often contain non-breaking space characters (u'\xa0'), and the structure of many Chinese sites differs from typical international ones, so several passes of cleanup are usually needed. The following is a typical script, written for a celebrity-chasing neighbor, that scrapes a Baidu Baike page:
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

raw = urlopen("https://baike.baidu.com/item/%E7%8E%8B%E4%BF%8A%E5%87%AF/75850").read()
# Testing code to check whether the link was fetched successfully:
#print(type(raw))
#print(raw[100:200])

text_soup = BeautifulSoup(raw, 'html.parser')
#print(text_soup)

texts = []
for para in text_soup.find_all('div'):
    text = para.text
    texts.append(text)
texts = texts[72 : -104]  # keep only the slice containing the article body
#print(texts)

regex = re.compile(r'(<cite>([^<>\/].+?)</cite>)+')  # strip leftover <cite> fragments
joined_texts = ''.join(texts)
joined_texts = re.sub(regex, '', joined_texts)
regex = re.compile(r'\[[0-9]*\]')  # strip citation markers like [1]
joined_texts = re.sub(regex, '', joined_texts)

words = joined_texts.split('\n')
while u'\xa0' in words:  # remove non-breaking-space entries
    words.remove(u'\xa0')
while '' in words:       # remove empty lines
    words.remove('')
joined_texts = '\n'.join(words)

text_file = open(u"王俊凯.txt", "w", encoding='UTF-8')
text_file.write(joined_texts)
text_file.close()
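The repeated while ... remove loops above rescan the list once per deletion; a single list comprehension achieves the same filtering in one pass. A sketch with made-up sample data:

```python
# Drop empty strings and non-breaking-space entries in one pass
words = ["Smog", "", "\xa0", "air", "", "pollution", "\xa0"]
words = [w for w in words if w not in ('', '\xa0')]
print(words)  # ['Smog', 'air', 'pollution']
```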