Web Scraping: the urllib.request Module

Date: 2019-12-20


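urllib.request is one way for a crawler to make network requests.

For extracting data from the response, we use regular expressions.

What we use: regular expressions

The re module is covered in my other posts.

Request is used to build the request object.

urlopen sends the request.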
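As a minimal sketch of what a Request object holds before it is sent, the snippet below builds one and inspects it without touching the network. The URL and the User-Agent value are hypothetical placeholders, not values from the original post. It also shows the bytes/str round trip that matters once urlopen returns a response body.

```python
from urllib.request import Request

# Hypothetical URL and User-Agent, used only for illustration.
req = Request(url="https://example.com/", headers={"User-Agent": "Mozilla/5.0"})

print(req.full_url)
print(req.get_method())              # GET, because no request body was supplied
print(req.get_header("User-agent"))  # Request stores header names capitalized

# urlopen(req) would send the request; the body it returns is bytes.
raw = "爬虫".encode("utf-8")   # str -> bytes
text = raw.decode("utf-8")     # bytes -> str
print(type(raw), type(text))
```

Note that the header is looked up as "User-agent": Request normalizes header names with str.capitalize() when storing them.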

Imports:

import re
from urllib.request import Request, urlopen


class CSDNSpider(object):

    def __init__(self, url):
        self.url = url
        # Browser identifier (User-Agent); left blank here
        self.user_agent = "       "

    def get_page_code(self):
        # Build the request object
        request = Request(url=self.url, headers={'User-Agent': self.user_agent})
        # Send the request
        try:
            response = urlopen(request)
            # Get the page source as a string from the response object.
            # response.read(): returns <class 'bytes'>, a type new in Python 3
            # decode(): converts bytes to str
            # encode(): converts str to bytes
            data = response.read().decode()
        except Exception as e:
            # On failure, report the error; the method then returns None
            print('Request failed:', e)
        else:
            return data

    def parse_data_by_html(self, html):
        """
        Parse the HTML and extract the data.
        :param html: page source code
        :return: the parsed data
        """
        # The pattern is left blank here; re.S lets '.' match newlines too
        pattern = re.compile(r'   ', re.S)
        res = re.findall(pattern, html)
        return res
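The original pattern is left blank, so here is a small sketch of how re.findall with re.S behaves on multi-line HTML. The HTML snippet and the pattern are made up for illustration and are not the ones the author used against CSDN.

```python
import re

# Hypothetical HTML fragment; the first link text spans two lines.
html = """<li>
<a href="/a/1">First
post</a></li>
<li><a href="/a/2">Second post</a></li>"""

# With several capture groups, findall returns a list of tuples.
# re.S (DOTALL) lets '.' match newlines, so the first link is found.
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>', re.S)
rows = pattern.findall(html)
print(rows)

# Without re.S, '.' stops at the newline and the first link is missed.
rows_no_dotall = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
print(len(rows_no_dotall))
```

The first tuple still carries the embedded newline from the page source, which is the kind of stray character that needs cleaning afterwards.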

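The data in res may contain strings we don't need. Note: because the regular expression is matched against a raw string, the matches may include stray characters such as whitespace and newlines.

So we need to post-process res.

The approach is to write a function that cleans the data.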

class DataParserTool(object):
    @classmethod
    def parser_data(cls, data):
        """
        Clean the data.
        :param data: list of tuples [(), (), ()]
        :return: [(), (), ()]
        """
        data_list = []

        for n1, n2, n3, n4, n5, n6 in data:
            n1 = n1.strip()  # strip whitespace from both ends
            n2 = n2.replace('\n', '')  # remove newlines
            data_list.append((n1, n2, n3, n4, n5, n6))
        return data_list

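With @classmethod, the method is called directly on the class: DataParserTool.parser_data(). Without it, you have to create an instance first and then call the method on it: DataParserTool().parser_data().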
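The difference between the two call styles can be sketched as follows. Cleaner is a hypothetical stand-in for DataParserTool, reduced to two fields per tuple so the example stays short.

```python
class Cleaner:
    """Hypothetical stand-in for DataParserTool, reduced to two fields."""

    @classmethod
    def clean(cls, rows):
        # Called on the class itself; no instance is required.
        return [(a.strip(), b.replace("\n", "")) for a, b in rows]

    def clean_on_instance(self, rows):
        # A plain method needs an instance to be created first.
        return [(a.strip(), b.replace("\n", "")) for a, b in rows]


rows = [("  /a/1 ", "First\npost")]

via_class = Cleaner.clean(rows)                   # class-level call
via_instance = Cleaner().clean_on_instance(rows)  # instantiate, then call
print(via_class)
print(via_instance)
```

Since the cleaning logic uses no per-object state, the class-level call avoids creating an object that serves no purpose.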
