Python Web Scraping (Part 1)

Date: 2019-11-19

Environment: Windows 10 + Python 3.4

First, install requests:

python3 -m pip install requests

1. Start by trying it out on the Youth Digest (青年文摘) website

import requests
html = requests.get("http://www.qnwz.cn/index.html")
print(html.content)
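
Note that html.content is the raw response body as bytes, so the output may show escape sequences rather than readable Chinese. requests also provides html.text, which decodes those bytes using the encoding stored in html.encoding. The sketch below shows the bytes-versus-text difference using only the standard library. The sample string and the GBK encoding are assumptions chosen for illustration, not something confirmed about this site.

```python
# Assumed sample: a page title encoded as GBK (an assumption, not checked against the site)
raw = "青年文摘".encode("gbk")

print(raw)                # bytes, shown as escape sequences
text = raw.decode("gbk")  # decoding with the matching encoding gives readable text
print(text)

# Decoding with a mismatched encoding either fails or produces garbage
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

If html.text comes out garbled on a real page, setting html.encoding to the page's actual encoding before reading html.text is usually the first thing to try.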

To save the page content to a file, you can do this:

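filename = 'index.html'  # example output name; use any path you like
with open(filename, 'wb') as fd:
    # iter_content() yields 1 byte at a time by default, so ask for larger chunks
    for c in html.iter_content(chunk_size=8192):
        fd.write(c)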

2. Sometimes a page cannot be fetched directly, and you may need to submit a form

import requests
# 'xxx' stands in for the form's real field names and values
params = {'xxx': 'xxx', 'xxx': 'xxx'}
r = requests.post("http://who_am_i.com/form.php", data=params)  # note: the form is submitted with the POST method
print(r.text)
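
To see what requests actually sends for data=params, you can build the request without sending it. The following is a minimal sketch. The field names user and lang are made-up placeholders, and nothing goes over the network.

```python
import requests

# Hypothetical form fields; a real form defines its own names
params = {'user': 'alice', 'lang': 'python'}

# Build the POST request and prepare it, but do not send it
req = requests.Request("POST", "http://who_am_i.com/form.php", data=params)
prepared = req.prepare()

print(prepared.method)                   # POST
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # user=alice&lang=python
```

Comparing this body with what the browser's developer tools show for a real submission is a quick way to find the field names a form expects.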

You may also be asked to upload a file or an image:

import requests
with open('1.png', 'rb') as f:  # the file is closed once the upload finishes
    r = requests.post("http://who_am_i.com/xixi.php", files={'f': f})
print(r.text)  # doesn't look too complicated either
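
With files=, requests encodes the upload as multipart/form-data. The sketch below prepares such a request locally to show the result. It assumes an in-memory stand-in for 1.png (just the PNG signature bytes) and sends nothing.

```python
import io
import requests

# In-memory stand-in for open('1.png', 'rb'); these bytes are only the PNG signature
fake_png = io.BytesIO(b"\x89PNG\r\n\x1a\n")

# A (filename, file object, content type) tuple sets the upload's name and type explicitly
files = {'f': ('1.png', fake_png, 'image/png')}

req = requests.Request("POST", "http://who_am_i.com/xixi.php", files=files)
prepared = req.prepare()  # built locally, nothing is sent

print(prepared.headers["Content-Type"])      # multipart/form-data; boundary=<random>
print(b'filename="1.png"' in prepared.body)  # True
```

The dictionary key ('f' here) must match the name of the file input in the target form.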

You may also need to handle login and cookies:

import requests
session = requests.Session()
params = {'username': 'username', 'password': 'password'}
s = session.post("http://who_am_i.com/haha.php", params)
print(s.cookies.get_dict())
s = session.get("http://who_am_i.com/a.php")
print(s.text)
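
The second request works because the Session object stores the cookies returned by the login response and attaches them to later requests to the same site. Here is a minimal sketch of that behavior with no network access. The cookie name sessionid and its value are invented, and the cookie is set by hand to stand in for what a server would return.

```python
import requests

session = requests.Session()

# Stand-in for the cookie a login response would set (name and value are made up)
session.cookies.set('sessionid', 'abc123', domain='who_am_i.com', path='/')
print(session.cookies.get_dict())

# Prepare a later request through the session without sending it
req = requests.Request("GET", "http://who_am_i.com/a.php")
prepared = session.prepare_request(req)
print(prepared.headers.get("Cookie"))  # the stored cookie is attached automatically
```

Calling requests.get() directly instead of session.get() would start without these cookies, which is why the logged-in page has to be fetched through the same session.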

Sometimes the site pops up a login dialog (HTTP basic authentication); requests handles this gracefully as well:

import requests
from requests.auth import HTTPBasicAuth
auth = HTTPBasicAuth('username', 'password')
r = requests.post(url="http://who_am_i.com//login.php", auth=auth)
print(r.text)
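
Under the hood, HTTPBasicAuth only adds an Authorization header to the request. The sketch below prepares the same request locally to show that header, using the placeholder credentials from above and sending nothing.

```python
import base64
import requests
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth('username', 'password')
req = requests.Request("POST", "http://who_am_i.com//login.php", auth=auth)
prepared = req.prepare()  # built locally, nothing is sent

header = prepared.headers["Authorization"]
print(header)

# The header is "Basic " followed by base64("username:password")
expected = "Basic " + base64.b64encode(b"username:password").decode("ascii")
print(header == expected)
```

Because base64 is an encoding rather than encryption, anyone who can see the request can read the credentials, which is why basic authentication is normally used over HTTPS.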

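Logins that require a CAPTCHA are harder to handle. The usual approach is to write code that downloads the CAPTCHA image, then either type the answer in by hand or run a recognition tool on the image and submit the result automatically. Python's pytesseract, for example, can recognize CAPTCHA text and is worth a try.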
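
The flow described above can be sketched as follows. Everything here is hypothetical scaffolding: the function names, the captcha form field, and the captcha.png path are assumptions, and a real site will use its own URLs and field names. The session argument is expected to behave like a requests.Session, and the CAPTCHA must be fetched with the same session used for the login so the server can match the answer to the image.

```python
def login_with_captcha(session, captcha_url, login_url, credentials, solve):
    """Fetch the CAPTCHA with the session, solve it, then submit the login form.

    `solve` is any callable that turns image bytes into text: a function that
    asks a human, or a wrapper around an OCR tool.
    """
    image_bytes = session.get(captcha_url).content
    form = dict(credentials)
    form["captcha"] = solve(image_bytes)  # hypothetical field name
    return session.post(login_url, data=form)


def ask_human(image_bytes, path="captcha.png"):
    """Manual solver: save the image so it can be opened, then prompt for the text."""
    with open(path, "wb") as fd:
        fd.write(image_bytes)
    return input("Text shown in %s: " % path)
```

Passing solve=ask_human gives the manual variant. For the automatic variant, solve could wrap pytesseract.image_to_string on an image opened with Pillow, though recognition quality on distorted CAPTCHAs varies a lot.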

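3. Some pages are rendered with JavaScript, so the page fetched directly with requests does not match what the browser shows. In that case you can download PhantomJS to render the JavaScript. To use it from Python, install selenium, a package that can simulate a browser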

from selenium import webdriver
import time
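# Set executable_path to where PhantomJS is installed, e.g. D:\p\bin\PhantomJs
# (a raw string keeps the backslashes from being read as escape sequences)
driver = webdriver.PhantomJS(executable_path=r'D:\p\bin\PhantomJs')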
driver.get("http://weixin.sogou.com/weixin?type=1&query=dp")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()
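
The time.sleep(3) above is a fixed guess at how long rendering takes. A common alternative is to poll until the content appears, giving up after a timeout. Selenium ships its own WebDriverWait for this; the sketch below only illustrates the idea with a hypothetical helper, wait_until, written with the standard library.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)
```

With the driver above, the condition could be something like lambda: driver.find_elements_by_id("content"), which returns an empty list until the element exists.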

Those are the basic steps; for a deeper understanding, see the selenium documentation.