[Repost] A Summary of Using requests and BeautifulSoup

Date: 2019-11-10

Tags: requests, beautifulsoup, usage summary


Reposted from https://www.cnblogs.com/wupeiqi/articles/6283017.html

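The Python standard library provides modules such as urllib, urllib2, and httplib for making HTTP requests, but their APIs are awkward. They were built for a different era and a different web, and they demand a large amount of work, sometimes even method overrides, to accomplish the simplest tasks.

Requests is an HTTP library written in Python and released under the Apache2 License. It is a high-level wrapper around Python's built-in modules that makes network requests far more pleasant for Python developers. With Requests you can easily perform essentially any operation a browser can.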

1. GET requests

 
    # 1. Example without parameters
    import requests

    ret = requests.get('https://github.com/timeline.json')
    print(ret.url)
    print(ret.text)

    # 2. Example with parameters
    import requests

    payload = {'key1': 'value1', 'key2': 'value2'}
    ret = requests.get("http://httpbin.org/get", params=payload)
    print(ret.url)   # the query string is built from payload
    print(ret.text)
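
To see what `params` does without touching the network, the request can be prepared but not sent. The following is a minimal sketch: the helper name `build_get_url` is made up for illustration, while `requests.Request(...).prepare()` is the library's own way to build a request object.

```python
import requests


def build_get_url(url, params):
    """Return the final URL Requests would use for a GET, without sending it."""
    prepared = requests.Request("GET", url, params=params).prepare()
    return prepared.url


final_url = build_get_url("http://httpbin.org/get",
                          {"key1": "value1", "key2": "value2"})
print(final_url)  # http://httpbin.org/get?key1=value1&key2=value2
```

This is the same URL that `ret.url` reports after a real request with the same `params`.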

2. POST requests

 
    # 1. Basic POST example
    import requests

    payload = {'key1': 'value1', 'key2': 'value2'}
    ret = requests.post("http://httpbin.org/post", data=payload)
    print(ret.text)

    # 2. Example sending request headers and data
    import requests
    import json

    url = 'https://api.github.com/some/endpoint'
    payload = {'some': 'data'}
    headers = {'content-type': 'application/json'}

    # The body is serialized to JSON by hand, so the header must say so
    ret = requests.post(url, data=json.dumps(payload), headers=headers)
    print(ret.text)
    print(ret.cookies)
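
The two POST styles above differ in how the body is encoded. As a rough illustration, the sketch below prepares both requests without sending them and inspects the result. It assumes a Requests version that supports the `json` keyword; the variable names are illustrative only.

```python
import json

import requests

url = "http://httpbin.org/post"

# data=<dict> is sent as an HTML form (application/x-www-form-urlencoded)
form = requests.Request(
    "POST", url, data={"key1": "value1", "key2": "value2"}
).prepare()

# json=<dict> serializes the body and sets the JSON content type for you
as_json = requests.Request("POST", url, json={"some": "data"}).prepare()

print(form.headers["Content-Type"], form.body)
print(as_json.headers["Content-Type"], json.loads(as_json.body))
```

Passing `json=payload` is therefore a shorthand for the `json.dumps(payload)` plus `content-type` header combination shown in the second example.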

3. Other requests

 
    
     
       
       
    requests.get(url, params=None, **kwargs)
    requests.post(url, data=None, json=None, **kwargs)
    requests.put(url, data=None, **kwargs)
    requests.head(url, **kwargs)
    requests.delete(url, **kwargs)
    requests.patch(url, data=None, **kwargs)
    requests.options(url, **kwargs)

    # All of the methods above are built on top of this one
    requests.request(method, url, **kwargs)
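
One way to check that the helpers delegate to `requests.request` is to stub that function out and record what each helper passes to it. This is only a sketch: it assumes the helpers live in the `requests.api` module and pass the method name as the first positional argument, which is an implementation detail rather than a documented contract. The function name `recorded_methods` and the URL are made up.

```python
from unittest import mock

import requests


def recorded_methods(url="http://example.invalid/"):
    """Call each helper with requests.request stubbed out; return the methods seen."""
    seen = []
    helpers = (requests.get, requests.post, requests.put, requests.head,
               requests.delete, requests.patch, requests.options)
    # Nothing is sent: the stub replaces the function every helper calls
    with mock.patch("requests.api.request") as fake_request:
        for helper in helpers:
            helper(url)
            seen.append(fake_request.call_args[0][0])
    return seen


methods = recorded_methods()
print(methods)
```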
        
 
     
 
    
  

4. More parameters

Parameter list

Parameter examples

Official documentation: http://cn.python-requests.org/zh_CN/latest/user/quickstart.html#id4

BeautifulSoup

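BeautifulSoup is a module that takes an HTML or XML string and parses it into a structured document. You can then use the methods it provides to quickly find specific elements, which makes locating elements in HTML or XML simple.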

 
    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    asdf
        <div class="title">
            <b>The Dormouse's story总共</b>
            <h1>f</h1>
        </div>
    <div class="story">Once upon a time there were three little sisters; and their names were
        <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</div>
    ad<br/>sf
    <p class="story">...</p>
    </body>
    </html>
    """

    soup = BeautifulSoup(html_doc, features="lxml")

    # Find the first <a> tag
    tag1 = soup.find(name='a')

    # Find all <a> tags
    tag2 = soup.find_all(name='a')

    # Find the tag with id=link2 (select returns a list)
    tag3 = soup.select('#link2')
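
The three lookups above return different kinds of results, which the short sketch below makes visible. It uses a trimmed-down document and the standard-library `html.parser` backend instead of `lxml`, purely so that it runs without an extra dependency; that substitution is an assumption of this example, not something the original article does.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="story">
<a class="sister0" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</div>
"""

soup = BeautifulSoup(html_doc, features="html.parser")

first_link = soup.find(name="a")       # a single Tag (or None if absent)
all_links = soup.find_all(name="a")    # a list-like collection of Tags
by_id = soup.select("#link2")          # CSS selector, always a list

print(first_link["id"])                      # link1
print([a.get("href") for a in all_links])    # missing href gives None
print(by_id[0].get_text())                   # Lacie
```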

Installation:

 
         pip3 install beautifulsoup4

Usage example:

 
    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
        ...
    </body>
    </html>
    """

    soup = BeautifulSoup(html_doc, features="lxml")

1. name, the tag name

2. attrs, the tag's attributes

3. children, all direct child nodes

 
    # body = soup.find('body')
    # v = body.children
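
A small runnable sketch of the three attributes introduced so far. The HTML fragment is made up for illustration, and `html.parser` is used only to avoid depending on `lxml`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><div class="title" id="t1"><b>Story</b><h1>f</h1></div></body>',
    features="html.parser",
)
div = soup.find("div")

print(div.name)    # div
print(div.attrs)   # {'class': ['title'], 'id': 't1'}

# children is an iterator over direct children only
child_names = [child.name for child in div.children]
print(child_names)  # ['b', 'h1']
```

Note that `class` comes back as a list, because an element may carry several classes.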

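4. descendants, all descendant nodes at every depth

5. clear, removes all children of the tag (the tag itself is kept)

6. decompose, recursively removes the tag and everything inside it

7. extract, recursively removes the tag and returns the removed tag

8. decode, converts to a string (including the current tag); decode_contents (excluding the current tag)

9. encode, converts to bytes (including the current tag); encode_contents (excluding the current tag)

10. find, returns the first matching tag

11. find_all, returns all matching tags

12. has_attr, checks whether the tag has the given attribute

13. get_text, returns the text inside the tag

14. index, returns the position of a tag within its parent tag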

 
    # tag = soup.find('body')
    # v = tag.index(tag.find('div'))
    # print(v)

    # tag = soup.find('body')
    # for i,v in enumerate(tag):
    #     print(i,v)
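
Several of the methods listed so far can be tried on one small document. The sketch below is illustrative only: the HTML fragment and variable names are made up, and `html.parser` stands in for `lxml`.

```python
from bs4 import BeautifulSoup

html = '<body><div class="title"><b>Story</b><h1>f</h1></div><p id="x">tail</p></body>'
soup = BeautifulSoup(html, features="html.parser")
body = soup.find("body")
div = body.find("div")

position = body.index(div)          # position of div among body's children
text = div.get_text()               # all text inside div, concatenated
has_class = div.has_attr("class")   # attribute check
outer = div.decode()                # markup including the <div> itself
inner = div.decode_contents()       # markup of the children only

removed = body.find("p").extract()  # detach <p> from the tree and keep it
div.find("h1").decompose()          # remove <h1> and discard it
remaining = body.decode()

print(position, text, has_class)
print(outer)
print(inner)
print(removed.decode())
print(remaining)
```

The practical difference between `extract` and `decompose` is that the extracted tag remains usable afterwards, while a decomposed tag should not be touched again.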

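15. is_empty_element, whether the tag is an empty (void) or self-closing element, that is, whether it is one of the following tags: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

16. Tags related to the current tag

17. Finding tags related to a given tag

18. select, select_one: CSS selectors

19. The content of a tag

20. append, appends a tag inside the current tag

21. insert, inserts a tag at a given position inside the current tag

22. insert_after, insert_before: insert after or before the current tag

23. replace_with, replaces the current tag with the specified tag

24. Establishing relationships between tags

25. wrap, wraps the current tag in the specified tag

26. unwrap, removes the current tag but keeps the content it wrapped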

 
    # tag = soup.find('a')
    # v = tag.unwrap()
    # print(soup)
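
The tree-editing methods in items 20 to 26 can be chained on a tiny document to see their combined effect. This is a sketch under assumptions: the HTML fragment and tag names are invented, `html.parser` replaces `lxml`, and new tags are created with `soup.new_tag`, which the list above does not mention.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><a id="link1">Elsie</a></div>', features="html.parser")
div = soup.find("div")
link = soup.find("a")

# append: add a new tag as the last child of div
note = soup.new_tag("i")
note.string = "new"
div.append(note)

# insert: add a new tag at position 0 inside div
first = soup.new_tag("b")
first.string = "first"
div.insert(0, first)

# wrap: put the <a> inside a new <p>
link.wrap(soup.new_tag("p"))

# unwrap: drop the <a> itself but keep its text
link.unwrap()

# replace_with: swap the <i> for an <em>
em = soup.new_tag("em")
em.string = "done"
note.replace_with(em)

result = soup.decode()
print(result)  # <div><b>first</b><p>Elsie</p><em>done</em></div>
```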

For more parameters, see the official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
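
For the navigation and selector entries in the list above (items 16 to 19), the following sketch shows a few of the attributes Beautiful Soup offers for moving between related tags. The fragment is invented and written without whitespace between tags so that siblings are tags rather than text nodes; `html.parser` is used in place of `lxml`.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="story"><a id="link1">Elsie</a>'
    '<a id="link2" class="sister">Lacie</a>'
    '<a id="link3" class="sister">Tillie</a></div>'
)
soup = BeautifulSoup(html, features="html.parser")
second = soup.select_one("#link2")   # first match only, not a list

parent_name = second.parent.name                  # enclosing tag
prev_id = second.previous_sibling["id"]           # node just before
next_id = second.next_sibling["id"]               # node just after
found_next = second.find_next_sibling("a")["id"]  # search among later siblings
sister_ids = [a["id"] for a in soup.select("a.sister")]
content = second.string                           # the tag's single text child

print(parent_name, prev_id, next_id, found_next)
print(sister_ids, content)
```

In real pages the whitespace between tags shows up as text nodes, so `next_sibling` often returns a string; `find_next_sibling` skips those and is usually the safer choice.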

A batch of "automatic login" examples

Chouti (抽屉新热榜)

GitHub

Zhihu (知乎)

cnblogs (博客园)

Lagou (拉勾网)