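1. The code in this post was run and verified on Python 2; on Python 3 at most two lines need to change. It also relies on third-party Python modules: requests, selenium, and bs4 (BeautifulSoup).
2. Target site, my blog: https://home.cnblogs.com/u/yoyoketang
What to scrape: the names of all of my blog's followers, saved to a txt file.
3. Logging in to cnblogs (博客园) requires a human-verification check, so logging in directly with a username and password is not possible; Selenium is used to handle the login instead.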
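1. Prerequisite: first operate the browser by hand, log in to my blog, and have the browser remember the password
(so that after the browser is closed, the blog is still in a logged-in state the next time the browser opens it).
2. By default Selenium starts the browser with an empty profile and does not load the saved profile and cache files, so first locate the browser's profile directory. Firefox is used as the example here.
3. Use the driver.get_cookies() method to get the browser's cookies.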
# coding:utf-8
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

# Path to the Firefox profile directory
profile_directory = r'C:\Users\admin\AppData\Roaming\Mozilla\Firefox\Profiles\yn80ouvt.default'
# Load the profile
profile = webdriver.FirefoxProfile(profile_directory)
# Launch the browser with that profile
driver = webdriver.Firefox(profile)
driver.get("https://home.cnblogs.com/u/yoyoketang/followers/")
time.sleep(3)
cookies = driver.get_cookies()  # get the browser's cookies
print(cookies)
driver.quit()
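(Note: if the browser launched by this script opens the blog page in a logged-out state, there is no point reading further; first check whether the profile path is wrong.)
1. Once the browser's cookies have been obtained, use requests to create a session and add the post-login cookies to it.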
s = requests.session()  # create a new session
# Copy the cookies into a CookieJar
c = requests.cookies.RequestsCookieJar()
for i in cookies:
    c.set(i["name"], i['value'])
s.cookies.update(c)  # update the session's cookies
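To make the cookie hand-off concrete: driver.get_cookies() returns a list of dicts, one per cookie, with keys such as name and value, and the loop above carries over only those two fields. The following is a minimal sketch of that reduction that runs without a browser; cookies_to_dict and the sample cookies are hypothetical and not part of the original script.

```python
def cookies_to_dict(selenium_cookies):
    """Reduce Selenium-style cookie dicts to a plain {name: value} mapping."""
    return {c["name"]: c["value"] for c in selenium_cookies}


# Hypothetical sample in the shape returned by driver.get_cookies();
# the names, values and domain below are made up for illustration.
sample_cookies = [
    {"name": "session_token", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "theme", "value": "dark", "domain": ".example.com", "path": "/"},
]

name_to_value = cookies_to_dict(sample_cookies)
print(name_to_value)
```

Since only name and value are copied, the domain and path restrictions of the browser cookies are dropped, which is acceptable here because the session only talks to one site.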
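1. My follower data is paginated, and a single request returns only 45 entries, so first get the total number of followers and then work out the total number of pages.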
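# Send the request
r1 = s.get("https://home.cnblogs.com/u/yoyoketang/relation/followers")
soup = BeautifulSoup(r1.content, "html.parser")
# Extract the follower count from the navigation label
fensinub = soup.find_all(class_="current_nav")
print(fensinub[0].string)
# The label text in the pattern must match what the page shows
num = re.findall(u"个人粉丝\\((.+?)\\)", fensinub[0].string)
print(u"Follower count: %s" % str(num[0]))
# Work out the number of pages, 45 entries per page
ye = int(int(num[0])/45)+1
print(u"Total pages: %s" % str(ye))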
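The count extraction and the page arithmetic can be tried on their own, without a network request. This is a sketch under stated assumptions: parse_follower_count and page_count are hypothetical helpers, the label value 123 is made up, and the pattern here matches only the parenthesised digits so it does not depend on the exact label wording.

```python
# coding: utf-8
import re

PER_PAGE = 45  # page size stated in the text


def parse_follower_count(label):
    """Return the integer inside the parentheses of a label such as u'个人粉丝(123)'."""
    match = re.search(u"\\((\\d+)\\)", label)
    if match is None:
        raise ValueError("no count found in label")
    return int(match.group(1))


def page_count(total, per_page=PER_PAGE):
    """Ceiling division: the number of pages needed to show `total` entries."""
    return (total + per_page - 1) // per_page


total = parse_follower_count(u"个人粉丝(123)")
print(total, page_count(total))  # 123 followers need 3 pages
```

A formula of the form int(total / 45) + 1 gives the same result except when the total is an exact multiple of 45, where it asks for one extra page; ceiling division avoids that extra request.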
# Scrape the first page
fensi = soup.find_all(class_="avatar_name")
for i in fensi:
    name = i.string.replace("\n", "").replace(" ","")
    print(name)
    with open("name.txt", "a") as f:  # append
        f.write(name.encode("utf-8")+"\n")  # Python 2 form

# Scrape page 2 onward
for page in range(2, ye+1):
    r2 = s.get("https://home.cnblogs.com/u/yoyoketang/relation/followers?page=%s" % str(page))
    # Parse this page's response (r2), not the first page's (r1)
    soup = BeautifulSoup(r2.content, "html.parser")
    # Extract the follower names
    fensi = soup.find_all(class_="avatar_name")
    for i in fensi:
        name = i.string.replace("\n", "").replace(" ","")
        print(name)
        with open("name.txt", "a") as f:  # append
            f.write(name.encode("utf-8")+"\n")  # Python 2 form
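One fragile spot in the loop is i.string: BeautifulSoup returns None for .string when a tag has more than one child, and calling .replace on None raises AttributeError. The sketch below shows a tolerant clean-up step together with the per-page URL rule; clean_name and follower_page_url are hypothetical helper names, not part of the original script.

```python
def clean_name(raw):
    """Strip newlines and spaces from a scraped name; tolerate a missing value."""
    if raw is None:
        return ""
    return raw.replace("\n", "").replace(" ", "")


def follower_page_url(base_url, page):
    """Build the followers URL for a 1-based page number."""
    url = base_url + "/relation/followers"
    return url if page <= 1 else url + "?page=%s" % page


print(clean_name("\n   yoyo   \n"))
print(follower_page_url("https://home.cnblogs.com/u/yoyoketang", 3))
```

Returning an empty string for a missing name keeps the loop running; whether to skip such entries or log them is a design choice left to the caller.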
# coding:utf-8
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

# Path to the Firefox profile directory
profile_directory = r'C:\Users\admin\AppData\Roaming\Mozilla\Firefox\Profiles\yn80ouvt.default'

s = requests.session()  # create a new session
url = "https://home.cnblogs.com/u/yoyoketang"


def get_cookies(url):
    '''Launch Selenium and return the cookies of the logged-in browser'''
    try:
        # Load the profile
        profile = webdriver.FirefoxProfile(profile_directory)
        # Launch the browser with that profile
        driver = webdriver.Firefox(profile)
        driver.get(url+"/followers")
        time.sleep(3)
        cookies = driver.get_cookies()  # get the browser's cookies
        print(cookies)
        driver.quit()
        return cookies
    except Exception as msg:
        print(u"Failed to launch the browser: %s" % str(msg))


def add_cookies(cookies):
    '''Add the cookies to the session'''
    try:
        # Copy the cookies into a CookieJar
        c = requests.cookies.RequestsCookieJar()
        for i in cookies:
            c.set(i["name"], i['value'])
        s.cookies.update(c)  # update the session's cookies
    except Exception as msg:
        print(u"Failed to add cookies: %s" % str(msg))


def get_ye_nub(url):
    '''Return the number of follower pages'''
    try:
        # Send the request
        r1 = s.get(url+"/relation/followers")
        soup = BeautifulSoup(r1.content, "html.parser")
        # Extract the follower count from the navigation label
        fensinub = soup.find_all(class_="current_nav")
        print(fensinub[0].string)
        num = re.findall(u"个人粉丝\\((.+?)\\)", fensinub[0].string)
        print(u"Follower count: %s" % str(num[0]))
        # Work out the number of pages, 45 entries per page
        ye = int(int(num[0])/45)+1
        print(u"Total pages: %s" % str(ye))
        return ye
    except Exception as msg:
        print(u"Failed to get the page count, returning the default of 1: %s" % str(msg))
        return 1


def save_name(nub):
    '''Scrape the follower names on one page'''
    try:
        # Page 1 takes no page parameter
        if nub <= 1:
            url_page = url+"/relation/followers"
        else:
            url_page = url+"/relation/followers?page=%s" % str(nub)
        print(u"Scraping page: %s" % url_page)
        r2 = s.get(url_page, verify=False)
        soup = BeautifulSoup(r2.content, "html.parser")
        fensi = soup.find_all(class_="avatar_name")
        for i in fensi:
            name = i.string.replace("\n", "").replace(" ","")
            print(name)
            with open("name.txt", "a") as f:  # append
                f.write(name.encode("utf-8")+"\n")
            # On Python 3, replace the two lines above with:
            # with open("name.txt", "a", encoding="utf-8") as f:  # append
            #     f.write(name+"\n")
    except Exception as msg:
        print(u"Error while scraping follower names: %s" % str(msg))


if __name__ == "__main__":
    cookies = get_cookies(url)
    add_cookies(cookies)
    n = get_ye_nub(url)
    for i in range(1, n+1):
        save_name(i)
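The two commented-out lines in save_name are the Python 3 form of the file write. As a possible alternative, io.open accepts an encoding argument on both Python 2 and Python 3, so one version of the write can serve both. The sketch below assumes the names are unicode text (as BeautifulSoup strings are); append_names and the sample names other than the author's are hypothetical.

```python
# coding: utf-8
import io
import os
import tempfile


def append_names(path, names):
    """Append one name per line as UTF-8, on Python 2 and Python 3 alike."""
    with io.open(path, "a", encoding="utf-8") as f:
        for name in names:
            f.write(name + u"\n")


# Write to a throwaway directory so the demo leaves name.txt elsewhere untouched
demo_path = os.path.join(tempfile.mkdtemp(), "name.txt")
append_names(demo_path, [u"上海-悠悠", u"yoyo"])
append_names(demo_path, [u"tester"])

with io.open(demo_path, encoding="utf-8") as f:
    saved = f.read().splitlines()
print(saved)
```

Opening the file once per batch instead of once per name also avoids reopening name.txt for every follower.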
Author: 上海-悠悠. QQ discussion group: 588402570