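I have been learning Python, so these notes use Python libraries. If you have not learned Python yet, you can pick it up on your own; it is an easy language to learn. All examples assume Python 3.x.
When you start collecting data, you step into the world behind the web page and see HTML, CSS, and JavaScript. This can be intimidating at first: unless you wrote the page yourself, it is rare to fully follow the code a web page file presents. To human eyes there is simply too much of it, and it is too messy.
To extract a web page, you first need to make a network connection. How does Python do that? Create a new file named scrapetest.py with the following content: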
from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
Running the file with Python returns:
[clgo@localhost ps]$ python scrapetest.py
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
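This is the complete HTML of the page named in the code. More precisely, the output is the source of page1.html in the pages folder under the web application root on the http://pythonscraping.com server.
Reference: https://docs.python.org/3/library/urllib.html
urlopen opens and reads a remote object fetched over the network. It is a function in the urllib.request module of the standard library, and it is very general-purpose. The data it reads comes back as bytes, which is why the output above starts with b'.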
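Since urlopen hands back bytes, a common next step is decoding them into a string. The following is a minimal sketch, not part of the original notes: fetch_text is a hypothetical helper name, and UTF-8 is an assumed encoding. The demo call uses a data: URL only so that it runs without a network connection; an http:// URL is passed the same way.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Fetch url and return its body as str (hypothetical helper, assumes UTF-8)."""
    # Using the response as a context manager closes it after reading.
    with urlopen(url) as response:
        raw = response.read()  # bytes, like the b'...' output of scrapetest.py
    return raw.decode(encoding)


# Offline demo with a data: URL; replace it with a real page URL as needed.
print(fetch_text("data:text/html,<h1>Title</h1>"))
```

Decoding explicitly makes the result easier to search or print than the raw bytes representation.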
Installing beautifulsoup4 on CentOS:
pip install beautifulsoup4
After installing, test it by importing the package. If the import raises no error, the installation succeeded.
[clgo@localhost ps]$ python
Python 3.5.1 (default, May 6 2016, 21:20:38)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>>
Running BeautifulSoup: change the file above to the following and run it:
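from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page1.html")
# Name the parser explicitly; otherwise BeautifulSoup picks one and warns
bs0bj = BeautifulSoup(html.read(), "html.parser")
print(bs0bj)      # the whole parsed document
print(bs0bj.h1)   # the first <h1> tag only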
The result is:
<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

<h1>An Interesting Title</h1>
Here the call print(bs0bj.h1) returns <h1>An Interesting Title</h1>. That is one example of using BeautifulSoup to extract a node from the HTML.
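To make the node lookup easier to try without a network connection, here is a small sketch that parses an inline string instead of a downloaded page. The sample HTML and the variable names are made up for illustration; only the attribute-style access (soup.h1) comes from the example above.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a downloaded page (made-up sample).
SAMPLE_HTML = """<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Some body text.</div>
</body>
</html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

h1_tag = soup.h1               # first <h1> found anywhere in the document
h1_text = soup.h1.get_text()   # the text inside the tag, without markup
same_tag = soup.html.body.h1   # the same node, reached through its parents
missing = soup.h2              # None, because the sample has no <h2>

print(h1_tag)
print(h1_text)
```

Note that asking for a tag that does not exist gives None rather than raising an error, which matters once lookups are chained.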
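3. Reliable network connections
Network connections are complicated and can fail in many ways, so exception handling has to be part of the design of a crawler.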
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    # Exception handling: the server may answer with an HTTP error
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    # A tag in the chain may be missing, e.g. bsObj.body can be None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
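The function above only catches HTTPError, which covers the case where the server answers with an error status. If the server cannot be reached at all, urlopen raises URLError instead. The sketch below shows one way to handle both; safe_read is a hypothetical helper name, and the file: and data: URLs in the demo are stand-ins chosen so the example runs offline.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def safe_read(url):
    """Return the response body as bytes, or None on failure (hypothetical helper)."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as e:
        # The server was reached but returned an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", e.code)
    except URLError as e:
        # The resource could not be opened at all (bad host, missing file, ...).
        print("Could not open URL:", e.reason)
    return None


# Offline demo: a nonexistent local file fails, a data: URL succeeds.
print(safe_read("file:///no/such/page.html"))
print(safe_read("data:text/html,<h1>ok</h1>"))
```

Returning None on every failure keeps the caller's check the same as in getTitle: test the result against None before using it.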