爬取w3c课程—Urllib库使用

时间 2020-04-29

标签 w3c 课程 urllib 使用栏目 C&C++ 繁體版

原文原文链接

爬虫原理html

浏览器获取网页内容的步骤：浏览器提交请求、下载网页代码、解析成页面，爬虫要作的就是：python

模拟浏览器发送请求：经过HTTP库向目标站点发起请求Request，请求能够包含额外的header等信息，等待服务器响应
获取响应内容：若是服务器正常响应，会获得一个响应Response，响应的内容即是所要获取的页面内容，类型多是HTML,Json字符串，二进制数据（图片或者视频）等
解析响应内容：获取响应内容后，解析各类数据，如：解析html数据：正则表达式，第三方解析库，解析json数据：json模块，解析二进制数据:进一步处理或以wb的方式写入文件
保存数据：保存为文本，数据库，或者保存特定格式的文件

简单例子：利用Urllib库爬取w3c网站教程正则表达式

一、urllib的request模块能够很是方便地抓取URL内容，也就是发送一个GET请求到指定的页面，而后返回HTTP的响应：例如，对百度的一个w3c发送一个GET请求，并返回响应：数据库

# coding:utf-8
import urllib.request my_url='https://www.w3cschool.cn/tutorial'#要获取课程的网址
page = urllib.request.urlopen(my_url) html = page.read().decode('utf-8') print(html)

把发送一个GET请求到指定的页面，返回HTTP的响应写成一个函数：json

def get_html(url):#访问url
    page = urllib.request.urlopen(url) html = page.read().decode('utf-8') return html

将返回以下内容，这与在浏览器查看源码看到的是同样的，接下来能够根据返回的内容进行解析：浏览器

二、利用正则表达式的分组提取课程名称、课程简介、课程连接，导入python里面的re库服务器

reg = r'<a href="([\s\S]*?)" title=[\s\S]*?<h4>(.+)</h4>\n<p>([\s\S]*?)</p>'#运用正则表达式，分组提取数据
reg_tutorial = re.compile(reg)#编译一下正则表达式，运行更快
tutorial_list = reg_tutorial.findall(get_html(my_url))#进行匹配，

到如今代码以下：函数

# coding:utf-8
import urllib.request import re my_url='https://www.w3cschool.cn/tutorial'#要获取课程的网址


def get_html(url):#访问url
    page = urllib.request.urlopen(url) html = page.read().decode('utf-8') return html reg = r'<a href="([\s\S]*?)" title=[\s\S]*?<h4>(.+)</h4>\n<p>([\s\S]*?)</p>'#运用正则表达式，分组提取数据
reg_tutorial = re.compile(reg)#编译一下正则表达式，运行更快
tutorial_list = reg_tutorial.findall(get_html(my_url))#进行匹配

print("一共有课程数：" + str(len(tutorial_list)))#打印出有多少课程

for i in range(len(tutorial_list)):#把课程名称、课程简介、课程连接写到excel，python里面excel从0开始计算
    print (tutorial_list[i])

运行，打印结果：字体

三、保存数据，保存数据到excel里面，用到excel第三方库xlwt，也能够只用openpyxl，库的使用能够参照官网：http://www.python-excel.org/网站

本次须要新建一个Excel，把课程名称、课程简介、课程连接写到Excel里面，课程连接用xlwt.Formula设置超连接，Excel第一行设置为宋体，加粗，写一些课程内容外的东西

import xlwt excel_path=r'tutorial.xlsx'#excel的路径
book = xlwt.Workbook(encoding='utf-8', style_compression=0)# 建立一个Workbook对象，这就至关于建立了一个Excel文件
sheet = book.add_sheet('课程',cell_overwrite_ok=True)# 添加表
style = xlwt.XFStyle()#初始化样式
font = xlwt.Font()#建立字体
font.name = '宋体'#指定字体名字
font.bold = True#字体加粗
style.font = font#将该font设定为style的字体
sheet.write(0, 0, '序号',style)#用以前的style格式写第一行，行、列从0开始计算
sheet.write(0, 1, '课程',style) sheet.write(0, 2, '简介',style) sheet.write(0, 3, '课程连接',style)

写课程内容到Excel

for i in range(len(tutorial_list)):#把课程名称、课程简介、课程连接写到excel，python里面excel从0开始计算
    print (tutorial_list[i]) sheet.write(i+1, 0, i+1) sheet.write(i+1, 1, tutorial_list[i][1]) sheet.write(i+1, 2, tutorial_list[i][2]) sheet.write(i+1, 3, xlwt.Formula("HYPERLINK(" +'"'+"https:" + tutorial_list[i][0]+'"'+')'))#把连接写进去，并用xlwt.Formula设置超连接
 book.save(excel_path)#保存到excel

Excel内容：

所有代码以下：

# coding:utf-8
import urllib.request import re import xlwt excel_path=r'tutorial.xlsx'#excel的路径
my_url='https://www.w3cschool.cn/tutorial'#要获取课程的网址
book = xlwt.Workbook(encoding='utf-8', style_compression=0)# 建立一个Workbook对象，这就至关于建立了一个Excel文件
sheet = book.add_sheet('课程',cell_overwrite_ok=True)# 添加表
style = xlwt.XFStyle()#初始化样式
font = xlwt.Font()#建立字体
font.name = '宋体'#指定字体名字
font.bold = True#字体加粗
style.font = font#将该font设定为style的字体
sheet.write(0, 0, '序号',style)#用以前的style格式写第一行，行、列从0开始计算
sheet.write(0, 1, '课程',style) sheet.write(0, 2, '简介',style) sheet.write(0, 3, '课程连接',style) def get_html(url):#访问url
    page = urllib.request.urlopen(url) html = page.read().decode('utf-8') return html reg = r'<a href="([\s\S]*?)" title=[\s\S]*?<h4>(.+)</h4>\n<p>([\s\S]*?)</p>'#运用正则表达式，分组提取数据
reg_tutorial = re.compile(reg)#编译一下正则表达式，运行更快
tutorial_list = reg_tutorial.findall(get_html(my_url))#进行匹配

print("一共有课程数：" + str(len(tutorial_list)))#打印出有多少课程

for i in range(len(tutorial_list)):#把课程名称、课程简介、课程连接写到excel，python里面excel从0开始计算
    print (tutorial_list[i]) sheet.write(i+1, 0, i+1) sheet.write(i+1, 1, tutorial_list[i][1]) sheet.write(i+1, 2, tutorial_list[i][2]) sheet.write(i+1, 3, xlwt.Formula("HYPERLINK(" +'"'+"https:" + tutorial_list[i][0]+'"'+')'))#把连接写进去，并用xlwt.Formula设置超连接
 book.save(excel_path)#保存到excel