Some thoughts on scraping asynchronously loaded AJAX data with Python

Date: 2019-11-16


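When scraping data with Python, you will inevitably run into this situation: the page shows all of its data when opened in a browser, but the source fetched with requests is only a page skeleton with no data in it.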

This happens because the site loads its data asynchronously via AJAX, so requests receives only the page skeleton.

There are a few ways to handle this:

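1. Inspect the responses in the Network tab of the browser developer tools (F12) to find the AJAX endpoints, then request the data from those endpoints directly.

2. Use selenium, a web automation testing tool, to obtain the source. Because it drives a real browser and reads the page after it has loaded and its scripts have run, it can in theory return the complete page data. Content that arrives after the initial load may still need an explicit wait.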

I tested this on a page myself and did get the complete data. Anyone who needs it can try it for themselves.

Below we run an experiment for the second approach: locally, create a front-end file json.html and a back-end script json.php. The web server is Apache (from the XAMPP bundle).

json.php

<?php

    header('Access-Control-Allow-Origin:*'); // allow requests from any origin

    $arr = array(
        'testarr' => array(
            'name' => 'panchao',
            'age' => 18,
            'tel' => '15928838350',
            'addr' => 'test'
        )
    );
    
    echo json_encode($arr);

?>
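
Since json.php returns plain JSON, the first approach listed above (calling the AJAX endpoint directly) can be sketched in a few lines. This is a minimal sketch and not part of the original experiment: it uses only the standard library (requests.get(url).json() would work just as well), and the helper names parse_name and fetch_name are made up for illustration.

```python
import json
from urllib.request import urlopen


def parse_name(body):
    """Decode the JSON text produced by json.php and return the name field."""
    payload = json.loads(body)
    return payload["testarr"]["name"]


def fetch_name(url, timeout=10):
    """Request the AJAX endpoint directly, skipping the HTML page entirely."""
    with urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return parse_name(body)
```

With the XAMPP server from this experiment running, fetch_name("http://localhost:8080/json/json.php") should return the name field without rendering any page.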

json.html

<div id='test'>
test
</div>
<script src="https://cdn.bootcss.com/jquery/3.4.1/jquery.min.js"></script>
<script>
    function easyAjax(requestUrl){
        $.ajax({
            url: requestUrl,
            type: "GET",
            dataType: "json", 
            success: function(msg){
            var a = "<span>"+msg.testarr.name+"</span>";
            // dynamically append an HTML element to the page
            $("#test").append(a);
            },
            error: function(XMLHttpRequest, textStatus, errorThrown) {
            alert(XMLHttpRequest.status);
            alert(XMLHttpRequest.readyState);
            alert(textStatus);
            }
        });

    }
    easyAjax("http://localhost:8080/json/json.php")
</script>

Then we run the experiment with Python's requests and with selenium (webdriver.Chrome).

requests

import requests
r = requests.get("http://localhost:8080/json/json.html")
r.encoding = 'utf-8'
print(r.text)

selenium (webdriver.Chrome). How to use selenium is covered in my earlier posts.

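from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without opening a visible window
chrome_options = Options()
chrome_options.add_argument("--headless")
# Note: Selenium 4 deprecates executable_path; there the driver path is passed through a Service object instead
driver = webdriver.Chrome(executable_path=(r'C:\Users\0923\AppData\Local\Google\Chrome\Application\chromedriver.exe'), options=chrome_options)
base_url = "http://localhost:8080/json/json.html"
driver.get(base_url)
# page_source returns the DOM as currently rendered, including elements added by JavaScript
print(driver.page_source)
driver.quit()  # close the browser and stop the chromedriver process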

Let's look at the results:

First, the page data returned by the Python requests approach:

<div id='test'>
test
</div>
<script src="https://cdn.bootcss.com/jquery/3.4.1/jquery.min.js"></script>
<script>
    function easyAjax(requestUrl){
    $.ajax({

    url: requestUrl,
    type: "GET",
    //async : false,
    dataType: "json", 

    success: function(msg){

    var a = "<span>"+msg.testarr.name+"</span>";

    
    console.log(msg);
    $("#test").append(a);
    },
    error: function(XMLHttpRequest, textStatus, errorThrown) {
    alert(XMLHttpRequest.status);
    alert(XMLHttpRequest.readyState);
    alert(textStatus);
    }
    });

    }
    easyAjax("http://localhost:8080/json/json.php")
</script>

Second, the page data returned by the selenium (webdriver.Chrome) approach:

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><div id="test">
test
<span>panchao</span></div>
<script src="https://cdn.bootcss.com/jquery/3.4.1/jquery.min.js"></script>
<script>
    function easyAjax(requestUrl){
    $.ajax({

    url: requestUrl,
    type: "GET",
    //async : false,
    dataType: "json", 

    success: function(msg){

    var a = "&lt;span&gt;"+msg.testarr.name+"&lt;/span&gt;";

    
    console.log(msg);
    $("#test").append(a);
    },
    error: function(XMLHttpRequest, textStatus, errorThrown) {
    alert(XMLHttpRequest.status);
    alert(XMLHttpRequest.readyState);
    alert(textStatus);
    }
    });

    }
    easyAjax("http://localhost:8080/json/json.php")
</script></body></html>

Comparing the two results, the key difference is that the page source from the second approach (selenium with webdriver.Chrome) contains the data loaded asynchronously via AJAX.

<div id="test">
test
<span>panchao</span></div>

The page source from the first approach (Python requests) does not contain the AJAX-loaded data.

<div id='test'>
test
</div>
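
The same comparison can be made in code rather than by eye. The following is a rough sketch using the standard library's html.parser; the class TestDivSpans and the helper ajax_spans are illustrative names, not something from the original experiment. It collects the text of any span found inside the div with id "test".

```python
from html.parser import HTMLParser


class TestDivSpans(HTMLParser):
    """Collect the text of <span> elements nested inside <div id="test">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # greater than 0 while inside <div id="test">
        self.in_span = False
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Start counting at the target div, then track nested divs too
            if self.div_depth or dict(attrs).get("id") == "test":
                self.div_depth += 1
        elif tag == "span" and self.div_depth:
            self.in_span = True
            self.spans.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.spans[-1] += data


def ajax_spans(html):
    """Return the text of every span found inside the #test div."""
    parser = TestDivSpans()
    parser.feed(html)
    parser.close()
    return parser.spans


static_html = "<div id='test'>\ntest\n</div>"
rendered_html = '<div id="test">\ntest\n<span>panchao</span></div>'
print(ajax_spans(static_html))    # []
print(ajax_spans(rendered_html))  # ['panchao']
```

An empty list means the fetched source is only the skeleton; a non-empty list means the AJAX callback had already run when the source was captured.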

These results show that fetching a page with selenium (webdriver.Chrome) does capture data loaded by JavaScript.

You may also have noticed that the selenium (webdriver.Chrome) approach automatically filled in the missing HTML tags (html, head, and body) for us.

I hope this helps anyone who needs it.