Python Web Scraping Notes: Collecting Content with PhantomJS + Selenium

Date: 2019-11-13



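For most websites, fetching the page and parsing it with Python's BeautifulSoup is enough to scrape the data. Some sites, however, only fill in their content after the JavaScript on the page has run; for those, PhantomJS + Selenium can be used instead. I ran into exactly this problem while learning, so I am writing it down here.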

The example site I use is China Weather (http://www.weather.com.cn/weather40d/101020100.shtml).

What I want to scrape is the daily maximum temperature from Shanghai's 40-day forecast. So I first tried the usual approach:

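from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the raw HTML exactly as the server sends it (no JavaScript is executed)
html = urlopen('http://www.weather.com.cn/weather40d/101020100.shtml')
# Name the parser explicitly so the result does not depend on which parsers are installed
html_parse = BeautifulSoup(html, "html.parser")
# Collect every <span class="max"> element, which should hold the maximum temperatures
temp = html_parse.find_all("span", {"class": "max"})
print(temp)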

However, print(temp) outputs only a list of empty tags, with no temperature values inside them.
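One way to confirm this kind of diagnosis is to check whether the matching elements in the raw HTML actually contain any text. The sketch below is only an illustration: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs, and STATIC_HTML, MaxSpanCollector and max_span_texts are hypothetical names with made-up markup, not the real page from the weather site.

```python
from html.parser import HTMLParser


class MaxSpanCollector(HTMLParser):
    """Collect the text inside every <span class="max"> element.

    Simplification: assumes the matching spans do not contain nested spans.
    """

    def __init__(self):
        super().__init__()
        self.texts = []        # one entry per matching span
        self._inside = False   # True while we are inside a matching span

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "max" in classes:
            self._inside = True
            self.texts.append("")

    def handle_data(self, data):
        if self._inside:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False


def max_span_texts(html):
    """Return the stripped text of each <span class="max"> in the given HTML."""
    parser = MaxSpanCollector()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


# Hypothetical stand-in for what a server sends before any script has run.
STATIC_HTML = """
<ul>
  <li><span class="max"></span><span class="min"></span></li>
  <li><span class="max"></span><span class="min"></span></li>
</ul>
<script>/* fills in the spans after the page loads */</script>
"""

texts = max_span_texts(STATIC_HTML)
print(texts)       # one empty string per matching span
print(any(texts))  # False means none of the spans carries data yet
```

If every collected string is empty, as here, the values are most likely written into the page by a script afterwards, which is the situation described above.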

So I concluded that the data only becomes available after the JavaScript has run, and switched to PhantomJS + Selenium to get it. The code is as follows:

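from bs4 import BeautifulSoup
from selenium import webdriver
import time
# Note: webdriver.PhantomJS was deprecated in later Selenium 3 releases and removed in Selenium 4, so this needs an older Selenium release
driver = webdriver.PhantomJS(executable_path='F:\\python\\phantomjs-2.1.1-windows\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
driver.get("http://www.weather.com.cn/weather40d/101020100.shtml")
time.sleep(3)  # give the page's JavaScript time to run
pageSource = driver.page_source  # HTML of the page after the scripts have run
html_parse = BeautifulSoup(pageSource, "html.parser")
driver.quit()  # shut down the PhantomJS process once the source has been captured
temp = html_parse.find_all("span", {"class": "max"})
print(temp)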

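This code creates a new Selenium WebDriver and uses it to load the page. Because the page's JavaScript needs time to run, we give it 3 seconds (time.sleep(3)). After that, since I personally prefer BeautifulSoup and the WebDriver's page_source property returns the page source as a string, lines 8 and 9 of the code take us back to parsing the page with the familiar BeautifulSoup. The final output, with the tag markup left out, is: [9, 9 ...... 12, 12, , , , , , , ], so the data is essentially all retrieved.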
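A fixed time.sleep(3) either wastes time when the page is ready early or fails when it is slow. A common alternative is to poll until a condition holds. The following is a minimal sketch of that idea, not the code used above: wait_until, FakeDriver and rendered_source are hypothetical names, and FakeDriver only imitates a page whose span gets filled in on the third read, so the example runs without a browser.

```python
import re
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


class FakeDriver:
    """Hypothetical stand-in for a WebDriver; the span is filled on the third read."""

    def __init__(self):
        self._reads = 0

    @property
    def page_source(self):
        self._reads += 1
        if self._reads < 3:
            return '<span class="max"></span>'
        return '<span class="max">9</span>'


driver = FakeDriver()


def rendered_source():
    """Return the page source once at least one max span has content, else None."""
    html = driver.page_source
    # Assumption: a filled span has non-space text right after the opening tag.
    return html if re.search(r'class="max">[^<\s]', html) else None


# Stop waiting as soon as the data appears instead of always sleeping 3 seconds.
source = wait_until(rendered_source, timeout=5.0, poll_interval=0.01)
print(source)
```

With a real driver, Selenium's own WebDriverWait class (in selenium.webdriver.support.ui) offers the same poll-until-true behaviour through its until() method, and is probably the better choice in practice.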

Although the weather example is simple, the basic idea stays the same however the details change; more advanced techniques are left for further study.

If there are any mistakes or inaccuracies in this post, please point them out so that we can learn and improve together.