Recently my boss handed me a task: export all the data from our organization's database and save it into a local Excel spreadsheet. I thought, how hard can that be? Give me a database account and password and a few SQL statements will do the job.
The reality was disappointing. The database can only be viewed through the back-office management system, the platform offers no bulk export at all, and direct database access is completely out of the question; the higher-ups won't approve it.
So the only option left was the clumsy one: crawl it all down with a web crawler!
And so I picked up Python again after leaving it untouched for more than half a year, and started working out how to write a simple little crawler.
The idea behind writing a crawler in Python is actually quite simple. Briefly:
1) Simulate the login in Python, mainly to obtain the session cookie.
2) Analyze the HTTP traffic exchanged with the platform, i.e. what the requests and responses look like.
The odd thing about this platform is that the data cannot be extracted in one pass. First you fetch a big list, which is paginated, and then you have to click into each item of the list to see its details.
From analyzing the HTTP packets going back and forth, the flow is roughly:
simulated login -> list request (POST, JSON) -> list data returned (JSON) -> detail request (GET) -> detail page returned (HTML)
A complete record has to be stitched together from the list data and the detail page. The former only requires parsing JSON; the detail page, however, is HTML and has to be parsed to pull out the fields I need.
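To make the flow concrete, here is a minimal sketch using the requests library; this is not the actual script, and the URLs, form fields and JSON keys are made-up placeholders, since the real platform's endpoints obviously can't be shown here.

# -*- coding: utf-8 -*-
# Sketch of the login -> list (POST, JSON) -> detail (GET, HTML) flow.
# All URLs, form fields and JSON keys below are hypothetical placeholders.
import requests

session = requests.Session()

# 1) simulated login; the Session object keeps the cookies for later requests
session.post('http://example.com/login', data={'user': 'me', 'password': 'secret'})

# 2) fetch one page of the list as JSON
resp = session.post('http://example.com/api/list', json={'page': 1, 'pageSize': 50})
items = resp.json().get('rows', [])

# 3) fetch the HTML detail page for each list item; merge the two later
for item in items:
    detail_html = session.get('http://example.com/detail', params={'id': item['id']}).text
    # ... parse detail_html (regex / BeautifulSoup) and combine with the JSON fields ...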
The flow is not complicated, but writing it was full of pitfalls. This post records the pits I fell into, three big ones in particular.
Pit No. 1: Python's painful string encodings
This pit breaks down into a few smaller questions.
1) What is the relationship between Unicode and UTF-8?
A one-sentence answer on Zhihu puts it well: UTF-8 is one particular way of encoding the Unicode character set.
The Unicode character set itself is a mapping: it ties every real-world character to a numeric value (a code point); it is a logical relationship. UTF-8 is a separate encoding scheme on top of that, an algorithm for encoding the values that Unicode assigns.
Put simply: character -> Unicode code point -> UTF-8 bytes
For example, the Chinese "你好" -> \u4f60\u597d -> \xe4\xbd\xa0\xe5\xa5\xbd
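You can check this chain in a Python 2 shell; a quick sketch, assuming a terminal that can display the characters:

u = u'你好'
print repr(u)                    # u'\u4f60\u597d'  : the Unicode code points
print [hex(ord(c)) for c in u]   # ['0x4f60', '0x597d']
print repr(u.encode('utf-8'))    # '\xe4\xbd\xa0\xe5\xa5\xbd'  : the UTF-8 byte stream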
2) And what is the relationship between str and unicode?
str and unicode are Python 2.x concepts.
For example, take s = u'你好'.
The variable s is a Unicode string, i.e. a unicode object (type(s) == unicode). Strictly speaking, unicode is a data type defined inside Python; it is abstract, not a storage format.
The official Python language reference explains it as follows:
Unicode
The items of a Unicode object are Unicode code units. A Unicode code unit is represented by a Unicode object of one item and can hold either a 16-bit or 32-bit value representing a Unicode ordinal (the maximum value for the ordinal is given in sys.maxunicode, and depends on how Python is configured at compile time). Surrogate pairs may be present in the Unicode object, and will be reported as two separate items. The built-in functions unichr() and ord() convert between code units and nonnegative integers representing the Unicode ordinals as defined in the Unicode Standard 3.0. Conversion from and to other encodings are possible through the Unicode method encode() and the built-in function unicode().
Here len(s) == 2, and the stored value is \u4f60\u597d.
As for str, besides representing ordinary strings it can also represent raw data in Python. Think of it as a byte stream, i.e. binary data.
The official Python language reference says:
Strings
The items of a string are characters. There is no separate character type; a character is represented by a string of one item. Characters represent (at least) 8-bit bytes. The built-in functions chr() and ord() convert between characters and nonnegative integers representing the byte values. Bytes with the values 0-127 usually represent the corresponding ASCII values, but the interpretation of values is up to the program. The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file. (On systems whose native character set is not ASCII, strings may use EBCDIC in their internal representation, provided the functions chr() and ord() implement a mapping between ASCII and EBCDIC, and string comparison preserves the ASCII order. Or perhaps someone can propose a better rule?)
There is also the following description:
Python has two different datatypes. One is 'unicode' and other is 'str'.
Type 'unicode' is meant for working with codepoints of characters.
Type 'str' is meant for working with encoded binary representation of characters.
Taking "你好" again as the example: its Unicode form is \u4f60\u597d. That value can then be UTF-8-encoded once more, ending up as a new byte stream, namely \xe4\xbd\xa0\xe5\xa5\xbd.
In Python 3, str is what unicode used to be, and bytes takes over the role that str played in 2.x.
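A quick side-by-side sketch of the same literal in the two versions (the Python 3 behavior is given in comments so the snippet itself stays Python 2):

# -*- coding: utf-8 -*-
# Python 2
s = '你好'    # type(s) is str: a byte string; its bytes depend on the source encoding
u = u'你好'   # type(u) is unicode: a sequence of code points
# Python 3: '你好' is str (what Python 2 called unicode),
#           and b'\xe4\xbd\xa0\xe5\xa5\xbd' is bytes (what Python 2 called str)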
A Stack Overflow answer puts it nicely: http://stackoverflow.com/questions/18034272/python-str-vs-unicode-types
unicode, which is python 3's str, is meant to handle text. Text is a sequence of code points which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes (e.g. utf-8, latin-1...). Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
On the contrary str is a plain sequence of bytes. It does not represent text! In fact, in python 3 str is called bytes. You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str. Note that using str you have a lower-level control on the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level.
With that, things become much clearer.
3) How to use encode and decode.
With the two points above as background, the usage is not hard.
encode goes unicode -> str: meaningful text turned into a byte stream.
decode goes str -> unicode: a byte stream turned back into meaningful text.
decode is called on str; encode is called on unicode.
For example:
u_a = u'你好'              # a unicode string
u_a                        # outputs u'\u4f60\u597d'
s_a = u_a.encode('utf-8')  # UTF-8-encode u_a into a byte stream
s_a                        # outputs '\xe4\xbd\xa0\xe5\xa5\xbd'
u_a_ = s_a.decode('utf-8') # UTF-8-decode s_a back to unicode
u_a_                       # outputs u'\u4f60\u597d'
UTF-8 is just one encoding; other common ones include GBK and so on.
4) What is the difference between #coding:utf-8 and setdefaultencoding?
#coding:utf-8 declares the encoding of the source file itself; without it, the source file may not contain Chinese characters.
setdefaultencoding sets the default encoding that unicode data uses at runtime ("Set the current default string encoding used by the Unicode implementation."). It matters because Unicode can be encoded in many ways, including UTF-8, UTF-16 and UTF-32, plus GBK for Chinese. When decode or encode is called without an explicit argument, this default codec is the one that gets used.
Note that in IDLE on Windows, a string literal without the u prefix is GBK-encoded by default.
Here is an example:
a = '你好'          # in IDLE on Windows, a is GBK-encoded
a                   # outputs '\xc4\xe3\xba\xc3', which is GBK
b = a.decode('gbk') # GBK-decode it into unicode
b                   # outputs u'\u4f60\u597d'
print b             # outputs 你好
b = a.decode()      # with no argument the default ASCII codec is used, so this raises
                    # UnicodeDecodeError: 'ascii' codec can't decode byte
a = u'你好'
b = a.encode()      # likewise raises
                    # UnicodeEncodeError: 'ascii' codec can't encode characters
So when exactly does Python fall back to the default encoding?
I will not attempt an exhaustive list, but from what I have seen in practice, the following cases definitely involve an implicit default conversion.
1. Calling encode on a str, or calling decode on a unicode.
Someone on Stack Overflow explains it: http://stackoverflow.com/questions/11339955/python-string-encode-decode
In the second case you do the reverse attempting to encode a byte string. Encoding is an operation that converts unicode to a byte string so Python helpfully attempts to convert your byte string to unicode first
In other words, str.encode() is effectively equivalent to str.decode(sys.getdefaultencoding()).encode().
When you encode a str, Python first implicitly decodes it with the default encoding, and only then performs the encode().
If the system default is ASCII and the str contains code points outside the ASCII range, such as Chinese characters, that implicit ASCII decode is bound to blow up.
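A small sketch of the trap, under the assumption that the default encoding is the usual ASCII:

import sys
s = '\xe4\xbd\xa0\xe5\xa5\xbd'     # UTF-8 bytes of "你好", i.e. a str
# s.encode('gbk') looks harmless, but it really runs as
#     s.decode(sys.getdefaultencoding()).encode('gbk')
# and the hidden ASCII decode raises
#     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 ...
s.decode('utf-8').encode('gbk')     # the explicit version works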
2. Anywhere str() may be called implicitly, for example when calling the file method write.
Look at the following example:
a = u'你好'
f = open('test.txt', 'w')
f.write(a)  # raises UnicodeEncodeError: 'ascii' codec can't
            # encode characters in position 0-1: ordinal not in range(128)
str(a)      # raises the same error
Because a is unicode, writing it to a file, or converting it to str, necessarily involves an encode. If the system default is ASCII, that encode fails.
In the code above, changing f.write(a) to f.write(a.encode('utf-8')) fixes it; just specify the encoding explicitly.
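If you would rather not sprinkle .encode('utf-8') all over the code, another option (a sketch, not what the original script did) is to let the file object do the encoding, for example with codecs.open in Python 2:

import codecs
a = u'你好'
f = codecs.open('test.txt', 'w', encoding='utf-8')
f.write(a)    # the wrapper encodes the unicode text to UTF-8 for us
f.close()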
If you do not want the hassle, adding sys.setdefaultencoding("utf-8") at the top of the script also works (note that a reload(sys) is needed first, since the function is removed from sys at interpreter start-up).
[Note] With the rise of Python 3, the sys.setdefaultencoding("utf-8") trick can be dropped. Even in 2.x it was never a recommended practice.
(See: http://stackoverflow.com/questions/3828723/why-we-need-sys-setdefaultencodingutf-8-in-a-py-script
"Also, the use of sys.setdefaultencoding() has always been discouraged, and it has become a no-op in py3k. The encoding is hard-wired to utf-8 and changing it raises an error.")
So, for all the discussion it has generated, it is of no use at all in the Python 3 era. Still, knowing a bit about how Python has evolved and corrected itself gives you a fresh perspective on the language.
Pit No. 2: bizarre user input and fiddly regular expressions
I previously had only a rough acquaintance with regular expressions and had never used them in earnest. This time, since I needed to pull specific text out of HTML, I had no choice but to roll up my sleeves and get serious with them.
By convention one would use something like BeautifulSoup for the extraction. Unfortunately, none of the key fields carried any useful div/name/class markup, so in the end I had to brute-force it with regular expressions.
I will not repeat the basic syntax here; I only want to describe a few small problems I, as a beginner, ran into in actual use.
1) User input is bizarre, and I failed to anticipate every case.
Originally I wanted to extract Chinese text and assumed [\u4e00-\u9fa5] would do. I was wrong.
Not that this range is wrong for Chinese characters; it is the user input that was too weird.
For example, one teacher filled in the school address as: 北京市 海淀区 学院路(32)号
Yes, you read that right: spaces! a full-width parenthesis! a half-width parenthesis! and digits!
Pitfalls like this are impossible to guard against completely. In the end I had to widen the character class, and that solved the problem.
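As an illustration of the widened class (a sketch; the exact class in my script may have differed), covering CJK characters plus digits, whitespace and both half- and full-width parentheses:

# -*- coding: utf-8 -*-
import re

addr = u'北京市 海淀区 学院路(32)号'
print re.search(u'[\u4e00-\u9fa5]+', addr).group(0)     # 北京市 : stops at the first space
pattern = u'[\u4e00-\u9fa50-9\s()()]+'                  # CJK + digits + whitespace + both kinds of parentheses
print re.search(pattern, addr).group(0)                  # the whole address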
2) Understand the difference between (.*) and something like (b*)
The * means the preceding element is matched repeatedly (zero or more times); that element can be the wildcard '.' or a concrete character such as 'b'.
The thing to remember is that once matching of the repeated element starts, only a consecutive run is counted.
The reason .* keeps on matching is that, from the point of view of '.', different characters all count as repetitions of the same thing.
For example, .* run against abc matches the whole of abc, because a, b and c all belong to '.'; here they all count as a "consecutive run", each one a repetition of '.'.
By the same logic, the concrete pattern b* run against abcd yields just the b. One might argue that c also matches (b repeated zero times), and d as well (again zero times).
That reading is completely wrong. b* starts counting the moment a b is encountered, keeps going while the b's stay consecutive, and stops as soon as a non-b character appears; the characters after that are no longer matched against b*, so there is no "zero repetitions" to speak of for them.
So get the concepts exactly right.
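A quick interactive check of the difference (Python 2 shell); note that away from a run of b's, b* can only produce zero-length matches:

>>> import re
>>> re.search(r'.*', 'abcd').group(0)       # '.' treats a, b, c, d all as repeats
'abcd'
>>> re.findall(r'b*', 'abcd')               # b* only grows while the b's are consecutive
['', 'b', '', '', '']
>>> re.search(r'a(b*)', 'abbbcd').group(1)  # the run of b's right after 'a'
'bbb'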
3) General principles of regex matching
By general principles I mainly mean the greedy vs. lazy question.
We know that .* is greedy and .*? is lazy, but what happens when the two appear together in one pattern? How is their precedence resolved?
There is a very good English article on this, reproduced below.
Regular Expressions: The Rules
By hari on Jan 24, 2010

The following are the rules a non-POSIX regular expression engine (such as in PERL, JAVA, etc.) would adhere to while attempting to match a string.
Notation: the examples list the given regex (pattern), the string tested against (string), and the actual match in the string between '<<<' and '>>>'.

1. The match that begins earliest/leftmost wins.
The intention is to match the cat at the end, but the 'cat' in the catalogue won the match as it appears leftmost in the string.
pattern: cat
string:  This catalogue has the names of different species of cat.
Matched: This <<< cat >>>alogue has the names of different species of cat.

1a. The leftmost match in the string wins, irrespective of the order a pattern appears in the alternation.
Though last in the alternation, 'catalogue' got the match as it appeared leftmost among the patterns in the alternation.
pattern: species|names|catalogue
string:  This catalogue has the names of different species of cat.
Matched: This <<< catalogue >>> has the names of different species of cat.

1b. If more than one plausible match occurs at the same position, then the order of the plausible matching patterns in the alternation counts.
All three patterns have a possible match at the same position, but 'over' is successful as it appeared first in the alternation.
pattern: over|o|overnight
string:  Actually, I'm an overnight success. But it took twenty years.
Matched: Actually, I'm an <<< over >>>night success. But it took twenty years.

2. The standard quantifiers (*, +, ? and {m,n}) are greedy.
Greediness (*, +, ?) would always try to match more before it tries to match the minimum characters needed for the match to be successful (0 for * and ?; 1 for +).
The intention is to match "Joy is prayer", but .* ran past all the double quotes, grabbing everything, only to match the last double quote (").
pattern: ".*"
string:  "Joy is prayer"."Joy is strength"."Joy is Love".
Matched: <<< "Joy is prayer"."Joy is strength"."Joy is Love" >>>.

2a. Lazy quantifiers would favor the minimum match.
Laziness (*?, +?, ??) would always try to settle with the minimum characters needed for the match to be successful before it tries to match the maximum.
The first double quote (") that appeared was matched using the lazy quantifier.
pattern: ".*?"
string:  "Joy is prayer"."Joy is strength"."Joy is Love".
Matched: <<< "Joy is prayer" >>>."Joy is strength"."Joy is Love".

2b. The only time the greedy quantifiers would give up what they've matched earlier and settle for less is when matching too much ends up causing some later part of the regex to fail.
The \w* would match the whole word 'regular_expressions' initially. Later, since 's' didn't have a character left to match and tended to fail, it would trigger the \w* to backtrack and match one character less. Thus the final 's' matches the 's' just released by \w* and the whole match succeeds.
Note: though the pattern would work the same way without parentheses, they are used here to show the individual matches in $1, $2, etc.
pattern: (\w*)(s)
string:  regular_expressions
Matched: <<< regular_expressions >>>
$1 = regular_expression
$2 = s
Similarly, the initial match 'x' by 'x*' was given up in favor of the last 'x' in the pattern.
pattern: (x*)(x)
string:  ox
Matched: o<<< x >>>
$1 =
$2 = x

2c. When more than one greedy quantifier appears in a pattern, the first greedy one gets the preference.
Though the .* initially matched the whole string, the [0-9]+ was able to grab just one digit, '5', from the .*, and the [0-9]+ settles with it since that satisfies its minimum match criteria. Note that the '+' is also a greedy quantifier, and here it can't grab beyond its minimum requirement, since there is already another greedy quantifier sharing the same match.
pattern: (.*)([0-9]+)
string:  Bangalore-560025
Matched: <<< Bangalore-560025 >>>
$1 = Bangalore-56002
$2 = 5

3. Overall match takes precedence.
The ability to report a successful match takes precedence. As shown in the previous example, if it is necessary for a successful match, the quantifiers (greedy or lazy) work in harmony with the rest of the pattern.
Three main rules in total; see the article for the details. If we abbreviate greedy as G and lazy as L, let's look at the precedence of the different combinations.
First, the highest priority of all: making the overall match succeed.
Then, provided the overall match can succeed, there are four combinations:
G+G: the first quantifier is free to be greedy and match as much as possible; the second settles for its minimum.
G+L: same as above; the first matches as much as possible, the second settles for its minimum.
L+G: the first matches as little as possible, the second as much as possible; the first's minimum is guaranteed first, then the second is accommodated.
L+L: both match as little as possible; the first takes priority, then the second.
Summed up it is actually simple: first make sure the whole pattern can match, then the quantifier that comes first matches first and has the higher priority (whether at its laziest or its greediest). That is easy to get dizzy over, so here are some examples. They are a bit messy; take your time with them.
>>> t = u'城市:</td> <td> ="张好"; 参数";'
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]*)";', t)
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]+)";', t)
>>> print temp.group(1)
数
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+)";', t)
>>> print temp.group(1)
张好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]*)";', t)
>>> print temp.group(1)
张好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]?)";', t)
>>> print temp.group(1)
好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]?)";', t)
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]*?)";', t)
>>> print temp.group(1)
张好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+?)";', t)
>>> print temp.group(1)
张好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+?)', t)
>>> print temp.group(1)
张
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+)', t)
>>> print temp.group(1)
张好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]+)"', t)
>>> print temp.group(1)
数
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]?)"', t)
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]+?)"', t)
>>> print temp.group(1)
数
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]*)"', t)
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]{2})"', t)
>>> print temp.group(1)
参数
>>>
Pit No. 3: reading and writing files in Python
Again, let me break this into a few small questions.
1) What is the difference between w+ and r+?
r+ : Open for reading and writing. The stream is positioned at the
beginning of the file.
w+ : Open for reading and writing. The file is created if it does not
exist, otherwise it is truncated. The stream is positioned at
the beginning of the file.
In essence, r+ is "read first, then write", while w+ is "write first, then read". r+ does not clear the file, but w+ truncates it to empty the moment it is opened.
So when you need both reading and writing, be clear about which comes first. If you need to read the existing content but open the file with w+, there will be nothing left to read.
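A small sketch of the difference, assuming test.txt already contains some text:

f = open('test.txt', 'r+')    # existing content is preserved
print repr(f.read())           # the original content
f.close()

f = open('test.txt', 'w+')    # the file is truncated the moment it is opened
print repr(f.read())           # ''  : nothing left to read
f.close()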
2) Things to watch out for when using w+ and r+
One example: with r+ on an empty file, calling readline and then write raises IOError: [Errno 0] Error, and you must call seek(0) before the write in order to continue writing. If the file is not empty, calling readline and then write simply continues writing after the current position, with no error.
The reason, as the official documentation explains:
When the "r+", "w+", or "a+" access type is specified, both reading and writing are allowed (the file is said to be open for "update"). However, when you switch between reading and writing, there must be an intervening fflush, fsetpos, fseek, or rewind operation. The current position can be specified for the fsetpos or fseek operation, if desired.
In other words, between a read and a write you must insert a seek to reposition the file pointer; otherwise you will be plagued by all sorts of errors.
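A sketch of the safe pattern with r+:

f = open('test.txt', 'r+')
line = f.readline()         # read phase
f.seek(0, 2)                # reposition before switching to writing (here: jump to the end)
f.write('appended line\n')
f.seek(0)                   # reposition again before switching back to reading
print f.read()
f.close()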
3) Using the truncate function
If you need to keep writing data into a file but want to clear it first, you can call truncate.
Note that when truncate() is called without an argument, it cuts the file off at the current file position by default. So, to empty the file,
either call seek(0) first and then truncate(), or call truncate(0) directly. Ponder the definition below:
The method truncate() truncates the file's size. If the optional size argument is present, the file is truncated to (at most) that size.
The size defaults to the current position. The current file position is not changed. Note that if a specified size exceeds the file's current size, the result is platform-dependent.
Note: This method would not work in case file is opened in read-only mode.
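A sketch of emptying a file that is being rewritten repeatedly:

f = open('data.txt', 'r+')
f.read()           # the file position is now at the end
f.seek(0)          # move back to the start...
f.truncate()       # ...and cut everything after it, i.e. empty the file
f.write('fresh content\n')
f.close()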
4) What do write and flush actually do?
Quoting an answer from Stack Overflow:
There's typically two levels of buffering involved:
- Internal buffers
- Operating system buffers
The internal buffers are buffers created by the runtime/library/language that you're programming against and is meant to speed things up by avoiding system calls for every write. Instead, when you write to a file object, you write into its buffer, and whenever the buffer fills up, the data is written to the actual file using system calls.
However, due to the operating system buffers, this might not mean that the data is written to disk. It may just mean that the data is copied from the buffers maintained by your runtime into the buffers maintained by the operating system.
If you write something, and it ends up in the buffer (only), and the power is cut to your machine, that data is not on disk when the machine turns off.
So, in order to help with that you have the flush and fsync methods, on their respective objects.
The first, flush, will simply write out any data that lingers in a program buffer to the actual file. Typically this means that the data will be copied from the program buffer to the operating system buffer.
Specifically what this means is that if another process has that same file open for reading, it will be able to access the data you just flushed to the file. However, it does not necessarily mean it has been "permanently" stored on disk.
To do that, you need to call the os.fsync method which ensures all operating system buffers are synchronized with the storage devices they're for; in other words, that method will copy data from the operating system buffers to the disk.
Typically you don't need to bother with either method, but if you're in a scenario where paranoia about what actually ends up on disk is a good thing, you should make both calls as instructed.
① write: the application writes into the program buffer;
② flush: copies data from the program buffer to the OS buffer;
③ os.fsync: copies data from the OS buffer to the disk.
f.close() implicitly performs a flush, and likewise does not guarantee the data reaches the disk.
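The three levels in code, as a sketch:

import os

f = open('data.txt', 'w')
f.write('important record\n')   # 1) into the program (library) buffer
f.flush()                       # 2) program buffer -> OS buffer; other processes can now read it
os.fsync(f.fileno())            # 3) OS buffer -> disk; survives a power cut
f.close()                       # close() flushes as well, but does not fsync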
Now here comes the interesting part. Consider the following scenario:
On Windows, open a cmd window and run a Python script that writes to a file in a loop (once every 2 seconds, 20 times in total), wrapped in a try statement; the finally clause writes a marker string (say 'haha', anything will do, just to show the clause ran) and then calls f.close(). Now suppose the program is halfway through, still in the middle of its writes (a sketch of such a test script appears after the discussion below):
1. If you press Ctrl+C, the program effectively catches a KeyboardInterrupt and the finally clause runs. Checking the file, 'haha' was written successfully.
2. If you close the cmd window, the file still contains everything written so far, but no 'haha'. This shows that the OS does some cleanup when it tears the process down, such as closing file handles and pushing the program buffer out to disk; but this is not an exception, so the finally clause is never invoked.
3. If you kill the task directly from Task Manager, the file ends up empty, without a single character, even though part of the data had already been written.
Why? I will not dig into exactly what Windows does in each of these three ways of ending the program, but at the very least we know the hidden processes behind them are not the same.
In case 2, the way the process is shut down lets the operating system complete the program-buffer-to-OS-buffer step; in case 3 that step never happens.
In fact, in case 3, if every write is followed by a flush, the characters do get written to the file even when the process is killed from Task Manager.
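For reference, a minimal sketch of the test script described above (the file name and marker string are arbitrary):

import time

f = open('test.txt', 'w')
try:
    for i in range(20):
        f.write('line %d\n' % i)
        # f.flush()         # uncomment and the data survives a kill from Task Manager
        time.sleep(2)
finally:
    f.write('haha')         # marker showing the finally clause ran
    f.close()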
Understanding the underlying reasons would require learning much more about what the operating system does behind the scenes; that is a pit to fill another day.
Likewise, when writing multi-process and multi-threaded Python programs, signal handling, the operating system's behavior underneath, and the working modes and relationship of parent and child processes are all still rather unclear to me, and full of pits as well; I will chew on them properly when I have the time and energy.
One last gripe:
If you are building a web system, please standardize the front-end, validate and normalize user input before it goes into the database, and do not let the database be so bad that a single page takes 15 seconds to load. This gripe is aimed purely at that management platform, nothing else.
That's all.