Python-15: Web Crawlers: The Single-Threaded Crawler

Date: 2019-11-17

Tags: python, crawler, single-thread  Category: Python


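1. Introducing and Installing Requests

　　requests: HTTP for Humans

　　A complete replacement for Python's urllib2 module

　　More automation

　　A friendlier user experience

　　How to install: on Windows, run pip install requests

　　Tips for installing third-party libraries

　　　　Avoid easy_install, because it can only install packages and cannot uninstall them

　　　　Prefer installing with pip

　　　　When the usual download sources are blocked from inside China, you can use this site instead: http://www.lfd.uci.edu/~gohlke/pythonlibs/

　　　　Open the site, search for requests, and click the entry marked with a red box in the screenshot to download it

　　　　Open the folder containing the downloaded file and change its extension from whl to zip

　　　　Unzip it to get two folders, requests and requests-2.17.3.dist-info; copy the whole requests folder into C:\Python27\Lib (the Lib folder under the Python installation directory)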
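Whichever installation route is used, it helps to confirm that Python can actually find the package afterwards. The following is a minimal sketch added for illustration, not part of the original post: it is written for Python 3, uses only the standard library, and the helper name `is_installed` is made up here.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Reports whether the requests package is importable in this interpreter.
    print("requests installed:", is_installed("requests"))
```

If this reports False after copying the folder by hand, the folder most likely ended up outside the directories that the interpreter searches.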

2. A First Web Crawler

　　Fetching a page's source with requests

　　　　Fetching the source directly

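#coding:utf-8
import requests

# Download the blog's front page
qhmu = requests.get("http://www.cnblogs.com/jiyongxin")
# The parenthesized form of print works on both Python 2 and Python 3
print(qhmu.text)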

　　　　Output

<!DOCTYPE html>
<html lang="zh-cn">
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>小新丶 - 博客园</title>
<link type="text/css" rel="stylesheet" href="/bundles/blog-common.css?v=m_FXmwz3wxZoecUwNEK23PAzc-j9vbX_C6MblJ5ouMc1"/>
<link id="MainCss" type="text/css" rel="stylesheet" href="/skins/CodingLife/bundle-CodingLife.css?v=s0uk-4nDGKcoZzCtu5RG1QHcsOLuk4tFWHoK2TgaeRE1"/>
<link id="mobile-style" media="only screen and (max-width: 768px)" type="text/css" rel="stylesheet" href="/skins/CodingLife/bundle-CodingLife-mobile.css?v=svj45cmFr8dtGVB0Cq4o-6MjM6Bz3Y76mAYaZnSJon01"/>
<link title="RSS" type="application/rss+xml" rel="alternate" href="http://www.cnblogs.com/jiyongxin/rss"/>
<link title="RSD" type="application/rsd+xml" rel="EditURI" href="http://www.cnblogs.com/jiyongxin/rsd.xml"/>
<link type="application/wlwmanifest+xml" rel="wlwmanifest" href="http://www.cnblogs.com/jiyongxin/wlwmanifest.xml"/>
<script src="//common.cnblogs.com/script/jquery.js" type="text/javascript"></script>  
<script type="text/javascript">var currentBlogApp = 'jiyongxin', cb_enable_mathjax=false;var isLogined=false;</script>
<script src="/bundles/blog-common.js?v=E1-LyrzANB2jbN9omtnpOHx3eU0Kt3DyislfhU0b5p81" type="text/javascript"></script>
</head>
<body>
<a name="top"></a>

<!--done-->
<div id="home">
<div id="header">
    <div id="blogTitle">
    <a id="lnkBlogLogo" href="http://www.cnblogs.com/jiyongxin/"><img id="blogLogo" src="/Skins/custom/images/logo.gif" alt="返回主页" /></a>            
        
<!--done-->
<h1><a id="Header1_HeaderTitle" class="headermaintitle" href="http://www.cnblogs.com/jiyongxin/">小新丶</a></h1>
<h2>小白程序员大杂烩学习之路</h2>



        
    </div><!--end: blogTitle 博客的标题和副标题 -->
    <div id="navigator">
        
<ul id="navList">
<li><a id="blog_nav_sitehome" class="menu" href="http://www.cnblogs.com/">博客园</a></li>
<li><a id="blog_nav_myhome" class="menu" href="http://www.cnblogs.com/jiyongxin/">首页</a></li>
<li><a id="blog_nav_newpost" class="menu" rel="nofollow" href="https://i.cnblogs.com/EditPosts.aspx?opt=1">新随笔</a></li>
<li><a id="blog_nav_contact" class="menu" rel="nofollow" href="https://msg.cnblogs.com/send/%E5%B0%8F%E6%96%B0%E4%B8%B6">联系</a></li>
<li><a id="blog_nav_rss" class="menu" href="http://www.cnblogs.com/jiyongxin/rss">订阅</a>
<!--<a id="blog_nav_rss_image" class="aHeaderXML" href="http://www.cnblogs.com/jiyongxin/rss"><img src="//www.cnblogs.com/images/xml.gif" alt="订阅" /></a>--></li>
<li><a id="blog_nav_admin" class="menu" rel="nofollow" href="https://i.cnblogs.com/">管理</a></li>
</ul>
        <div class="blogStats">
            
            <div id="blog_stats">
<span id="stats_post_count">随笔 - 18&nbsp; </span>
<span id="stats_article_count">文章 - 0&nbsp; </span>
<span id="stats-comment_count">评论 - 2</span>
</div>
            
        </div><!--end: blogStats -->
    </div><!--end: navigator 博客导航栏 -->
</div><!--end: header 头部 -->

<div id="main">
    <div id="mainContent">
    <div class="forFlow">
        

<!--done-->


<div class="day">
    <div class="dayTitle">
        <a id="homepage1_HomePageDays_DaysList_ctl00_ImageLink" href="http://www.cnblogs.com/jiyongxin/archive/2017/06/12.html">2017年6月12日</a>                  
    </div>

    
            <div class="postTitle">
                <a id="homepage1_HomePageDays_DaysList_ctl00_DayList_TitleUrl_0" class="postTitle2" href="http://www.cnblogs.com/jiyongxin/p/6993224.html">Python-15：爬虫之正则表达式应用举例</a>
            </div>
            <div class="postCon"><div class="c_b_p_desc">摘要: 首先，咱们从最简单的使用python来读取本地文件中的文本内容来开始 一、在本地新建一个html文档，内容以下 二、在与html文件相同目录下建立咱们的Python文件，内容以下 三、输出结果为 这样咱们就使用Python简单的一句代码将HTML文件读取出来了，那么，咱们想要获得HTML中特定的信息<a href="http://www.cnblogs.com/jiyongxin/p/6993224.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>
            <div class="clear"></div>
            <div class="postDesc">posted @ 2017-06-12 16:24 小新丶 阅读(2) 评论(0)  <a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6993224" rel="nofollow">编辑</a></div>
            <div class="clear"></div>
        
</div>


<div class="day">
    <div class="dayTitle">
        <a id="homepage1_HomePageDays_DaysList_ctl01_ImageLink" href="http://www.cnblogs.com/jiyongxin/archive/2017/06/07.html">2017年6月7日</a>                  
    </div>

    
            <div class="postTitle">
                <a id="homepage1_HomePageDays_DaysList_ctl01_DayList_TitleUrl_0" class="postTitle2" href="http://www.cnblogs.com/jiyongxin/p/6957997.html">Python-14：爬虫之正则表达式1</a>
            </div>
            <div class="postCon"><div class="c_b_p_desc">摘要: 通过一段时间的基础知识铺垫，终于能够开始学习爬虫了，对，就是爬虫，能够爬东西的虫子！ 爬虫基础，一个是Python，一个就是正则表达式了，固然，正则表达式在任什么时候候都不容忽视它的重要性！ 接下来，咱们从一个破解小密码来开始学习正则表达式 laodhfejzuwxyzixyzladivhwanxyzl<a href="http://www.cnblogs.com/jiyongxin/p/6957997.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>
            <div class="clear"></div>
            <div class="postDesc">posted @ 2017-06-07 17:04 小新丶 阅读(9) 评论(0)  <a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6957997" rel="nofollow">编辑</a></div>
            <div class="clear"></div>
        
            <div class="postSeparator"></div>
        
            <div class="postTitle">
                <a id="homepage1_HomePageDays_DaysList_ctl01_DayList_TitleUrl_1" class="postTitle2" href="http://www.cnblogs.com/jiyongxin/p/6957635.html">Python-13：模块</a>
            </div>
            <div class="postCon"><div class="c_b_p_desc">摘要: 一、认识Python模块 函数是能够实现一项或多项功能的一段程序，模块是能够实现一项或多项功能的程序块 安装目录下的lib文件夹中都是模块 如何导入模块 使用import关键字，若是使用这个模块必须先导入 sys模块 在Python中有一些模块是不用咱们本身去定义的，Python官方提供的自带的模块<a href="http://www.cnblogs.com/jiyongxin/p/6957635.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>
            <div class="clear"></div>
            <div class="postDesc">posted @ 2017-06-07 16:14 小新丶 阅读(5) 评论(0)  <a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6957635" rel="nofollow">编辑</a></div>
            <div class="clear"></div>
        
</div>


<div class="day">
    <div class="dayTitle">
        <a id="homepage1_HomePageDays_DaysList_ctl02_ImageLink" href="http://www.cnblogs.com/jiyongxin/archive/2017/05/18.html">2017年5月18日</a>                  
    </div>

    
            <div class="postTitle">
                <a id="homepage1_HomePageDays_DaysList_ctl02_DayList_TitleUrl_0" class="postTitle2" href="http://www.cnblogs.com/jiyongxin/p/6873408.html">使用地图切片并最终将地图发布在arcgis for server</a>
            </div>
            <div class="postCon"><div class="c_b_p_desc">摘要: 一、记录好下载的离线地图切片文件夹所在的位置（上个随笔有介绍如何下载离线地图） 二、打开arcmap 三、新建一个空的模板 四、点击add data 五、选择咱们下载的切片点击add 六、界面如图所示 七、生成发布所需的文件并在arcgis for server上进行发布 依次点击File-Shar<a href="http://www.cnblogs.com/jiyongxin/p/6873408.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>
            <div class="clear"></div>
            <div class="postDesc">posted @ 2017-05-18 14:56 小新丶 阅读(2) 评论(0)  <a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6873408" rel="nofollow">编辑</a></div>
            <div class="clear"></div>
        
            <div class="postSeparator"></div>
        
            <div class="postTitle">
                <a id="homepage1_HomePageDays_DaysList_ctl02_DayList_TitleUrl_1" class="postTitle2" href="http://www.cnblogs.com/jiyongxin/p/6873158.html">使用下载器下载适用于arcgis的离线地图切片</a>
            </div>
            <div class="postCon"><div class="c_b_p_desc">摘要: 一、下载太乐地图下载器或水经注离线地图下载器（官网下载为适用版，加水印且控制下载大小） 二、地图下载器界面以下 太乐地图下载器 水经注离线地图下载器 三、两款软件操做相似，咱们就以水经注离线地图下载器为例，详解各个步骤 ①点击设置 ②选择在线地图 ③选择本身须要下载的地图，这里以谷歌卫星地图为例（注<a href="http://www.cnblogs.com/jiyongxin/p/6873158.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>
            <div class="clear"></div>
            <div class="postDesc">posted @ 2017-05-18 14:25 小新丶 阅读(2) 评论(0)  <a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6873158" rel="nofollow">编辑</a></div>
            <div class="clear"></div>
        
            <div class="postSeparator"></div>
        
            <div class="postTitle">
                <a id="homepage1_HomePageDays_DaysList_ctl02_DayList_TitleUrl_2" class="postTitle2" href="http://www.cnblogs.com/jiyongxin/p/6835990.html">Python-12：Python语法基础-函数</a>
            </div>
            <div class="postCon"><div class="c_b_p_desc">摘要: 一、函数 function，通俗来说函数就是功能，函数是用来封装功能的，函数分为两种类型，一种是系统自带的不用咱们编写就可使用的。另外一种函数是自定义的，须要咱们编写其功能，这种函数自由度高，叫作自定义函数。 函数的定义： ①声明这个指定的部分是函数 ②编写这个函数的功能 格式：def 函数名():<a href="http://www.cnblogs.com/jiyongxin/p/6835990.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>
            <div class="clear"></div>
            <div class="postDesc">posted @ 2017-05-18 14:02 小新丶 阅读(7) 评论(0)  <a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6835990" rel="nofollow">编辑</a></div>
            <div class="clear"></div>
        
</div>


<div class="day">
    <div class="dayTitle">
        <a id="homepage1_HomePageDays_DaysList_ctl03_ImageLink" href="http://www.cnblogs.com/jiyongxin/archive/2017/05/10.html">2017年5月10日</a>                  
    </div>

    
            <div class="postTitle">
                <a id="homepage1_HomePageDays_DaysList_ctl03_DayList_TitleUrl_0" class="postTitle2" href="http://www.cnblogs.com/jiyongxin/p/6835196.html">Python-11：Python语法基础-控制流</a>
            </div>
            <div class="postCon"><div class="c_b_p_desc">摘要: 一、Python中的三种控制流 程序中代码的执行是有顺序的，有的代码会从上到下按顺序执行，有的程序代码会跳转着执行，有的程序代码会选择不一样的分支执行，有的代码会循环着执行，什么样的程序应该选择分支执行，什么样的代码应该循环着执行，在Python中是有相应的控制语句控制的，控制语句能控制某段代码的执行<a href="http://www.cnblogs.com/jiyongxin/p/6835196.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>
            <div class="clear"></div>
            <div class="postDesc">posted @ 2017-05-10 11:51 小新丶 阅读(7) 评论(0)  <a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6835196" rel="nofollow">编辑</a></div>
            <div class="clear"></div>
        
            <div class="postSeparator"></div>
        
            <div class="postTitle">
                <a id="homepage1_HomePageDays_DaysList_ctl03_DayList_TitleUrl_1" class="postTitle2" href="http://www.cnblogs.com/jiyongxin/p/6831564.html">Python-10：Python语法基础-运算符与表达式</a>
            </div>
            <div class="postCon"><div class="c_b_p_desc">摘要: 一、Python运算符简介 1）什么是运算符 在Python中常常须要对一个或多个数字进行操做，2+3中的+是运算符,&quot;hello&quot;*20中的*也是运算符 2）运算符有哪些 + - * / ** &lt; &gt; != // % &amp; | ^ ~ &gt;&gt; &lt;&lt; &lt;= &gt;= == not and or 3）运算符的<a href="http://www.cnblogs.com/jiyongxin/p/6831564.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>
            <div class="clear"></div>
            <div class="postDesc">posted @ 2017-05-10 09:43 小新丶 阅读(7) 评论(0)  <a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6831564" rel="nofollow">编辑</a></div>
            <div class="clear"></div>
        
</div>


<div class="day">
    <div class="dayTitle">
        <a id="homepage1_HomePageDays_DaysList_ctl04_ImageLink" href="http://www.cnblogs.com/jiyongxin/archive/2017/05/09.html">2017年5月9日</a>                  
    </div>

    
            <div class="postTitle">
                <a id="homepage1_HomePageDays_DaysList_ctl04_DayList_TitleUrl_0" class="postTitle2" href="http://www.cnblogs.com/jiyongxin/p/6831239.html">Python-09：Python语法基础-行与缩进</a>
            </div>
            <div class="postCon"><div class="c_b_p_desc">摘要: 一、逻辑行和物理行 Python中逻辑行主要指一段代码，在乎义上它的行数，而物理行，指的是咱们实际看到的行数 二、行中分号的使用规则 在Python中一个物理行通常能够包括多个逻辑行，在一个物理行中编写多个逻辑行的时候，逻辑行与逻辑行用;号隔开。 每一个逻辑行是必需要有分号的，可是咱们在编写程序的时候<a href="http://www.cnblogs.com/jiyongxin/p/6831239.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>
            <div class="clear"></div>
            <div class="postDesc">posted @ 2017-05-09 16:35 小新丶 阅读(8) 评论(0)  <a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6831239" rel="nofollow">编辑</a></div>
            <div class="clear"></div>
        
            <div class="postSeparator"></div>
        
            <div class="postTitle">
                <a id="homepage1_HomePageDays_DaysList_ctl04_DayList_TitleUrl_1" class="postTitle2" href="http://www.cnblogs.com/jiyongxin/p/6830731.html">Python-08：Python语法基础-标识符和对象</a>
            </div>
            <div class="postCon"><div class="c_b_p_desc">摘要: 一、什么是标识（zhi）符？ Python中咱们在编程的时候，起的名字就叫作标识符。其中变量和常量就是标识符的一种 二、标识符的命名规则 ①标识符的第一个字符必须是字母或者下划线，不能是数字或者特殊符号等 ②除了第一个字符外，其余的可使字母下划线和数字 ③大小写敏感 stuName和stuname<a href="http://www.cnblogs.com/jiyongxin/p/6830731.html" class="c_b_p_desc_readmore">阅读全文</a></div></div>
            <div class="clear"></div>
            <div class="postDesc">posted @ 2017-05-09 15:19 小新丶 阅读(7) 评论(0)  <a href ="https://i.cnblogs.com/EditPosts.aspx?postid=6830731" rel="nofollow">编辑</a></div>
            <div class="clear"></div>
        
</div>

<div class="topicListFooter"><div id="nav_next_page"><a href="http://www.cnblogs.com/jiyongxin/default.html?page=2">下一页</a></div></div>


    </div><!--end: forFlow -->
    </div><!--end: mainContent 主体内容容器-->

    <div id="sideBar">
        <div id="sideBarMain">
            
<!--done-->
<div class="newsItem">
<h3 class="catListTitle">公告</h3>
    <div id="blog-news"></div><script type="text/javascript">loadBlogNews();</script>
</div>

            <div id="blog-calendar" style="display:none"></div><script type="text/javascript">loadBlogDefaultCalendar();</script>
            
            <div id="leftcontentcontainer">
                <div id="blog-sidecolumn"></div><script type="text/javascript">loadBlogSideColumn();</script>
            </div>
            
        </div><!--end: sideBarMain -->
    </div><!--end: sideBar 侧边栏容器 -->
    <div class="clear"></div>
    </div><!--end: main -->
    <div class="clear"></div>
    <div id="footer">
        
<!--done-->
Copyright &copy;2017 小新丶
    </div><!--end: footer -->
</div><!--end: home 自定义的最大容器 -->
</body>
</html>

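　Fetching the source with a modified HTTP header

　　　Some sites have anti-crawler measures that check whether the visiting client is a browser: a browser is let in, and anything else is rejected. In that case we can disguise our crawler as a browser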

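#coding:utf-8
import requests

# Send a browser-like User-Agent so the site treats the crawler as a browser
_header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
qhmu = requests.get("http://jp.tingroom.com/yuedu/yd300p/", headers=_header)
# Set the encoding explicitly so that .text decodes the body as UTF-8
qhmu.encoding = 'utf-8'
print(qhmu.text)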

　　This also retrieves the page's source
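To see the User-Agent check in action without touching a real site, the sketch below runs a toy server on the local machine and queries it twice. It is an illustration added to the post, not the author's code: it targets Python 3, uses the standard library's urllib in place of requests so that it runs offline, and the names `PickyHandler`, `fetch_status` and `demo`, as well as the rule "accept anything containing Mozilla", are assumptions made for the demo.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BROWSER_UA = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")


class PickyHandler(BaseHTTPRequestHandler):
    """Toy server: answers 200 to browser-like clients and 403 to everyone else."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        status = 200 if "Mozilla" in agent else 403
        body = b"<html>ok</html>" if status == 200 else b"forbidden"
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_status(url, headers=None):
    """Return the HTTP status code of a GET request sent with the given headers."""
    request = urllib.request.Request(url, headers=headers or {})
    # An empty ProxyHandler bypasses any system proxy for this local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def demo():
    """Start the toy server on a free local port and query it twice."""
    server = HTTPServer(("127.0.0.1", 0), PickyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    try:
        plain = fetch_status(url)  # urllib's default User-Agent
        disguised = fetch_status(url, {"User-Agent": BROWSER_UA})
    finally:
        server.shutdown()
        server.server_close()
    return plain, disguised


if __name__ == "__main__":
    print(demo())
```

The first request goes out with the library's default User-Agent and is refused; the second carries the browser string and is accepted. Real sites may check more than this one header, so the disguise is not guaranteed to work everywhere.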

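　　Getting the page source is only the first step; the real goal is to extract the information we need. Let's scrape the post titles from my blog

　　Look at the markup around a title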

<div class="postTitle">
    <a XXXXXX>Python-09：Python语法基础-行与缩进</a>
</div>

　　We can first find every div on the page whose class is postTitle, and then search within each match using the pattern ">(.*?)</a>"
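Two details of this approach are easy to overlook: the `re.S` flag and the non-greedy `.*?`. The sketch below, an addition written for Python 3, shows what each one changes. The sample HTML and its titles are invented for the demonstration.

```python
import re

SAMPLE = """<div class="postTitle">
    <a href="#">First post</a>
</div>
<div class="postTitle">
    <a href="#">Second post</a>
</div>"""

PATTERN = '<div class="postTitle">(.*?)</div>'

# Without re.S, "." stops at newlines, so the multi-line divs are never matched.
without_flag = re.findall(PATTERN, SAMPLE)

# With re.S, "." also matches newlines and each div is captured separately.
non_greedy = re.findall(PATTERN, SAMPLE, re.S)

# The greedy ".*" runs on to the LAST </div>, merging both divs into one match.
greedy = re.findall('<div class="postTitle">(.*)</div>', SAMPLE, re.S)

print(len(without_flag), len(non_greedy), len(greedy))
```

Only the combination of `.*?` and `re.S` yields one match per div, which is why the pattern is written that way.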

　　Step 1: find all the divs that contain post titles

#coding:utf-8
import requests
import re

_header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
qhmu = requests.get("http://www.cnblogs.com/jiyongxin", headers=_header)
qhmu.encoding = 'utf-8'
# re.S lets "." match newlines, because each div spans several lines
div_Title = re.findall('<div class="postTitle">(.*?)</div>', qhmu.text, re.S)
for each in div_Title:
    print(each)

　　The output is:

　　Step 2: extract the title from each div

#coding:utf-8
import requests
import re

_header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
qhmu = requests.get("http://www.cnblogs.com/jiyongxin", headers=_header)
qhmu.encoding = 'utf-8'
div_Title = re.findall('<div class="postTitle">(.*?)</div>', qhmu.text, re.S)
for each in div_Title:
    # Inside each div, the title is the text between ">" and "</a>"
    a_title = re.findall('>(.*?)</a>', each, re.S)
    print(a_title[0])

　　The output is:

　　All the titles have been scraped successfully
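The two steps above can also be packaged as a function and tried offline on a small piece of HTML. This is a sketch under stated assumptions rather than the author's code: it is written for Python 3, the function name `extract_titles` and the sample markup are invented, and it adds one guard the original loop lacks, skipping a div with no link instead of failing on `a_title[0]`.

```python
import re

DIV_PATTERN = re.compile('<div class="postTitle">(.*?)</div>', re.S)
LINK_TEXT_PATTERN = re.compile('>(.*?)</a>', re.S)


def extract_titles(html):
    """Return the link text found inside each postTitle div of the given HTML."""
    titles = []
    for block in DIV_PATTERN.findall(html):
        found = LINK_TEXT_PATTERN.findall(block)
        if found:  # skip a div that contains no link instead of raising IndexError
            titles.append(found[0].strip())
    return titles


SAMPLE = """
<div class="postTitle">
    <a class="postTitle2" href="/p/1.html">Python-09: lines and indentation</a>
</div>
<div class="postTitle">no link here</div>
<div class="postTitle">
    <a class="postTitle2" href="/p/2.html">Python-08: identifiers and objects</a>
</div>
"""

print(extract_titles(SAMPLE))
```

With a live page, the same function could be called on the `text` of the response fetched earlier. Regular expressions remain sensitive to changes in the page's markup, so the patterns may need adjusting if the site's template changes.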

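3. Submitting Data to a Web Page

　　Introduction to GET and POST

　　　　GET: retrieves data from the server

　　　　POST: sends data to the server

　　　　GET works by building parameters into the URL

　　　　POST submits its data in the body of the request rather than in the URL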
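The difference between the two methods can be seen by building the requests without sending them. The sketch below is an addition written for Python 3 with the standard library only, so it runs offline. The address http://example.com/submit and the form field names are placeholders, not taken from the original post. With requests, the corresponding arguments are `params=` for `requests.get` and `data=` for `requests.post`.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: the parameters are encoded into the URL itself.
get_request = Request(
    "http://www.cnblogs.com/jiyongxin/default.html?" + urlencode({"page": 2})
)

# POST: the same kind of key/value data travels in the request body instead.
# The URL and field names below are placeholders for illustration.
form_body = urlencode({"name": "crawler", "page": 2}).encode("ascii")
post_request = Request("http://example.com/submit", data=form_body)

# Nothing is sent over the network; we only inspect the prepared requests.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

The GET request carries `page=2` in its URL, while the POST request keeps a clean URL and holds the encoded form fields in `data`.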

　　Analyzing the target site

　　The form-submission feature of requests