Commonly Used Modules: hashlib, subprocess, logging, re, collections

Date: 2019-11-07


hashlib

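What is hashlib

What is a hash? A hash is an algorithm. In Python 3.x the hashlib module replaces the old md5 and sha modules and mainly provides the SHA1, SHA224, SHA256, SHA384, SHA512 and MD5 algorithms. The algorithm takes the content passed in and computes a hash value from it.

A hash value has the following properties:
The same input always produces the same hash value, so a hash can be used to verify file integrity.
The content cannot be recovered from the hash value, so store a password as its hash and never transmit the plaintext password over the network.
As long as the hash algorithm stays the same, the hash value has a fixed length no matter how large the input is.

What is a digest algorithm? A digest algorithm is also called a hash algorithm. It uses a function to convert data of any length into a data string of fixed length (usually shown as a hexadecimal string).

A digest algorithm uses a digest function f() to compute a fixed-length digest from data of any length. The purpose is to detect whether the original data has been tampered with.

A digest can reveal tampering because the digest function is one-way: computing f(data) is easy, but deriving data from the digest is very hard. In addition, changing a single bit of the original data produces a completely different digest.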
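The three properties above can be checked directly. The following is a minimal sketch that uses SHA-256; the helper name `sha256_hex` and the sample inputs are made up for illustration.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of text as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

short_digest = sha256_hex("a")
long_digest = sha256_hex("a" * 100_000)
changed_digest = sha256_hex("b")

# Same input, same digest.
print(short_digest == sha256_hex("a"))
# Fixed length regardless of input size (64 hex characters for SHA-256).
print(len(short_digest), len(long_digest))
# A small change in the input gives a different digest.
print(short_digest != changed_digest)
```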

Taking the common digest algorithm MD5 as an example, compute the MD5 value of a string:

import hashlib

data = 'how to use md5 in python hashlib?'
md5 = hashlib.md5(data.encode('utf-8'))
print(md5.hexdigest())
The result is:
d26a53750bc40b38b65a520292f69306

If the data is large, you can call update() several times on successive chunks; the final result is the same:

import hashlib
md5 = hashlib.md5()
md5.update(b'how to use md5 in ')
md5.update(b'python hashlib?')
print(md5.hexdigest())

d26a53750bc40b38b65a520292f69306
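Chunked update() calls are what make it practical to hash a large file without loading it into memory. Below is a rough sketch; the function name `file_sha256`, the chunk size and the temporary file are assumptions chosen for the demonstration.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=64 * 1024):
    """Hash a file by feeding it to update() one chunk at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

payload = b"how to use md5 in python hashlib?" * 10_000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    chunked = file_sha256(tmp.name, chunk_size=1024)
    whole = hashlib.sha256(payload).hexdigest()
    print(chunked == whole)
finally:
    os.remove(tmp.name)
```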


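The examples above hash English text. Hashing Chinese text raises an error if the str is passed in directly, because the hash functions only accept bytes; the string has to be encoded first.

# hashing Chinese text
m1 = hashlib.sha512()
str_cn ='你好，世界'
# a str must be encoded to bytes, so specify the encoding
m1.update(str_cn.encode("utf-8"))
print(m1.hexdigest())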

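I use MD5 to name downloaded images so that the same image is not saved twice while crawling. Put it inside the image download loop, for example (i, img_name and make_files come from the surrounding crawler code):

for ii in i.xpath('div/div/img/@data-original'):
    img_url = ii[2:]
    ext = img_url[-4:]  # file extension, e.g. '.jpg'
    if ext in ['.jpg','.gif','.png']:
        # name the file after the MD5 of its URL, so the same URL always maps to the same file
        name = hashlib.md5(img_url.encode("utf-8")).hexdigest()
        make_files(img_name + '\\' + name + ext, img_url)
    else:
        print(img_url)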

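The algorithms above are strong, but they have a weakness: a digest can be reversed by looking it up in a table of precomputed digests of known passwords. It is therefore necessary to add a custom key to the algorithm before hashing.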

import hashlib
passwds=[
    'alex3714',
    'alex1313',
    'alex94139413',
    'alex123456',
    '123456alex',
    'a123lex',
    ]
def make_passwd_dic(passwds):
    dic={}
    for passwd in passwds:
        m=hashlib.md5()
        m.update(passwd.encode('utf-8'))
        dic[passwd]=m.hexdigest()
    return dic

def break_code(cryptograph,passwd_dic):
    for k,v in passwd_dic.items():
        if v == cryptograph:
            print('password is ===>\033[46m%s\033[0m' %k)

cryptograph='aee949757a2e698417463d47acac93df'
break_code(cryptograph,make_passwd_dic(passwds))

Simulating a lookup attack with a password dictionary

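Python also has an hmac module, which internally processes the key we supply together with the content before hashing:

import hmac
h = hmac.new('alvin'.encode('utf8'), digestmod='md5')  # digestmod is required since Python 3.8; MD5 was the old default
h.update('hello'.encode('utf8'))
print (h.hexdigest())#320df9832eab4c038b6c1d7ed73a5940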


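To get the same final hmac result, you must make sure that:
1: the initial key passed to hmac.new is the same
2: no matter how many times update is called, the content adds up to the same bytes

import hmac

h1=hmac.new(b'egon', digestmod='md5')
h1.update(b'hello')
h1.update(b'world')
print(h1.hexdigest())

h2=hmac.new(b'egon', digestmod='md5')
h2.update(b'helloworld')
print(h2.hexdigest())

h3=hmac.new(b'egonhelloworld', digestmod='md5')
print(h3.hexdigest())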

'''
f1bf38d054691688f89dcd34ac3c27f2
f1bf38d054691688f89dcd34ac3c27f2
bcca84edd9eeb86f30539922b28f3981
'''
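A typical use of hmac is signing a message and checking the signature later. The sketch below assumes a shared key named `SECRET_KEY` and two helper functions, `sign` and `verify`, none of which appear in the original text; `hmac.compare_digest` is the standard-library function for comparing digests in a way that resists timing analysis.

```python
import hashlib
import hmac

SECRET_KEY = b"egon"  # hypothetical shared key

def sign(message: bytes) -> str:
    """Return the hex HMAC-SHA256 signature of message."""
    return hmac.new(SECRET_KEY, message, digestmod=hashlib.sha256).hexdigest()

def verify(message: bytes, signature: str) -> bool:
    """Recompute the signature and compare it with the one supplied."""
    return hmac.compare_digest(sign(message), signature)

tag = sign(b"helloworld")
print(verify(b"helloworld", tag))   # same key, same content
print(verify(b"helloworld!", tag))  # content changed
```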


Any website that lets users log in stores their user names and passwords. How should they be stored? The usual way is a database table:

name    | password
michael | 123456
bob     | abc999
alice   | alice2008
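If passwords are stored in plaintext and the database leaks, every user's password falls into the attacker's hands. Operations staff can also access the database and therefore read every user's password. The correct approach is to store a digest of the password, such as its MD5, instead of the plaintext: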

username | password
michael  | e10adc3949ba59abbe56e057f20f883e
bob      | 878ef96e86145580c38c87f0410ad153
alice    | 99b1c2188db85afee403b1536010c2c9
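Consider this situation: many users like simple passwords such as 123456, 888888 and password, so an attacker can compute the MD5 of these common passwords in advance and build a reverse lookup table: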

'e10adc3949ba59abbe56e057f20f883e': '123456'
'21218cca77804d2ba1922c33e0151105': '888888'
'5f4dcc3b5aa765d61d8327deb882cf99': 'password'
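This way, without cracking anything, the attacker only has to compare the MD5 values in the database against the table to obtain the accounts of users with common passwords.
Users should of course avoid overly simple passwords. But can the program itself strengthen the protection of simple passwords?

Because the MD5 of a common password is easy to precompute, we must make sure the stored digest is not the MD5 of a common password that has already been computed. This is done by adding a complex string to the original password, commonly called "salting":

hashlib.md5("salt".encode("utf8"))  # seed the hash with the salt, then update() it with the password
With an MD5 digest processed with a salt, as long as the attacker does not know the salt, it is hard to recover the plaintext password from the MD5 even if the user chose a simple password.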

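But if two users choose the same simple password, such as 123456, the database stores two identical MD5 values, which reveals that the two users share a password. Is there a way to make users with the same password store different MD5 values?

If we assume a user cannot change their login name, the login name can be used as part of the salt when computing the MD5, so that users with the same password store different MD5 values.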
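The idea can be sketched as follows, using MD5 as in the text. The salt value `SITE_SALT` and the function name `password_digest` are placeholders invented for this example, and the order in which password, login name and salt are concatenated is an arbitrary choice. For real systems the standard library also provides `hashlib.pbkdf2_hmac`, which is designed for password hashing.

```python
import hashlib

SITE_SALT = "the-Salt"  # hypothetical site-wide salt

def password_digest(username: str, password: str) -> str:
    """Digest of the password salted with the login name and a site salt."""
    salted = password + username + SITE_SALT
    return hashlib.md5(salted.encode("utf-8")).hexdigest()

michael = password_digest("michael", "123456")
bob = password_digest("bob", "123456")

# Same password, different users: the stored digests differ.
print(michael != bob)
# Same user and password: the digest is reproducible, so login checks still work.
print(michael == password_digest("michael", "123456"))
```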

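Digest algorithms are used widely. Note that a digest algorithm is not an encryption algorithm and cannot be used for encryption (the plaintext cannot be recovered from the digest); it can only be used to detect tampering. Its one-way nature, however, makes it possible to verify a user's password without storing the plaintext.

Applications of digest algorithms and solutions to some of the problems they face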

subprocess

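We often need to run a system command or script from Python. A shell command runs independently of your Python process: every command you run starts a new process. The traditional way to call a system command or script from Python, available since Python 2, is os.system:

>>> import os
>>> os.system('uname -a')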
Darwin Alexs-MacBook-Pro.local 15.6.0 Darwin Kernel Version 15.6.0: Sun Jun  4 21:43:07 PDT 2017; root:xnu-3248.70.3~1/RELEASE_X86_64 x86_64
0

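How does this command work under the hood?

Besides os.system, the commands and popen2 modules could also call system commands, which was rather messy. The subprocess module was therefore introduced to provide a single module for calling system commands and scripts.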

The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. This module intends to replace several older modules and functions:

os.system
os.spawn*

The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle. For more advanced use cases, the underlying Popen interface can be used directly.

The run() function was added in Python 3.5; if you need to retain compatibility with older versions, see the Older high-level API section.

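Three ways to run a command

subprocess.run(*popenargs, input=None, timeout=None, check=False, **kwargs) # officially recommended
subprocess.call(*popenargs, timeout=None, **kwargs) # much the same as the above, an older spelling
subprocess.Popen() # the low-level interface that the functions above wrap

run() method

From the source docstring

Run a command with arguments and return a CompletedProcess instance. The returned instance has the attributes args, returncode, stdout and stderr. By default stdout and stderr are not captured, and those attributes are None. Pass stdout=PIPE and/or stderr=PIPE to capture them.

If check is true and the exit code is non-zero, a CalledProcessError is raised. The CalledProcessError object carries the return code in its returncode attribute, plus the output and stderr attributes if those streams were captured.

If timeout is given and the process takes too long, a TimeoutExpired exception is raised.

The other arguments are the same as for the Popen constructor.

Standard usage

subprocess.run(['df','-h'],stderr=subprocess.PIPE,stdout=subprocess.PIPE,check=True)

A command that involves a pipe | has to be written like this

subprocess.run('df -h|grep disk1',shell=True) # shell=True hands the command string to the system shell, so Python does not parse it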
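As a portable illustration of run(), the sketch below starts a child Python interpreter via `sys.executable` instead of a Unix command such as df. The child commands are made up for the demonstration; the parameters and the CalledProcessError exception are the ones described above.

```python
import subprocess
import sys

# Run a child Python process and capture its output as text.
result = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,  # decode the captured bytes to str
    check=True,
)
print(result.returncode, result.stdout.strip())

# With check=True a non-zero exit status raises CalledProcessError.
try:
    subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"], check=True)
except subprocess.CalledProcessError as exc:
    failed_code = exc.returncode
    print("command failed with exit status", failed_code)
```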

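call() method

# run a command and return its exit status, 0 or non-zero
>>> retcode = subprocess.call(["ls", "-l"])

# run a command; return normally if the exit status is 0, otherwise raise an exception
>>> subprocess.check_call(["ls", "-l"])
0

# take the command as a string and return a tuple: element 1 is the exit status, element 2 is the output
>>> subprocess.getstatusoutput('ls /bin/ls')
(0, '/bin/ls')

# take the command as a string and return its output
>>> subprocess.getoutput('ls /bin/ls')
'/bin/ls'

# run a command and return its output; the output is returned, not printed. Here it is bound to res
>>> res=subprocess.check_output(['ls','-l'])
>>> res
b'total 0\ndrwxr-xr-x 12 alex staff 408 Nov 2 11:05 OldBoyCRM\n'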

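Popen() method

Common parameters:

args: the command, either a string or a sequence (e.g. list, tuple)

stdin, stdout, stderr: the program's standard input, output and error handles

preexec_fn: only valid on Unix; specifies a callable object that is called in the child process just before the command runs

shell: as with run(), when True the command is executed through the system shell

cwd: sets the child process's current directory

env: specifies the child process's environment variables. If env = None, the child inherits the parent's environment.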

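What is the difference between these two statements?

a=subprocess.run('sleep 10',shell=True,stdout=subprocess.PIPE)
a=subprocess.Popen('sleep 10',shell=True,stdout=subprocess.PIPE)

The difference is that Popen returns immediately after starting the command, without waiting for it to finish. What is the benefit?

If the command or script you call takes 10 minutes, your main program does not have to block for 10 minutes. It can carry on with other work and check from time to time whether the command has finished.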
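The periodic check mentioned above is what poll() is for. The following sketch assumes a short-lived child (a Python process that sleeps for half a second) in place of a real long-running script.

```python
import subprocess
import sys
import time

# Popen returns immediately; the child keeps running in the background.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.5)"])

polls = 0
while proc.poll() is None:  # None means the child is still running
    polls += 1
    time.sleep(0.1)  # the main program could do useful work here instead

print("exit status:", proc.returncode, "after", polls, "polls")
```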

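Running a shell script

There are several ways to run a shell script; in the end I chose subprocess from the Python standard library.

The subprocess module makes it very convenient to start a child process and control its input and output.

class Popen(args, bufsize=0, executable=None,
            stdin=None, stdout=None, stderr=None,
            preexec_fn=None, close_fds=False, shell=False,
            cwd=None, env=None, universal_newlines=False,
            startupinfo=None, creationflags=0):
This is the Python 2 signature; Python 3 adds further parameters and changes some defaults. The parameters are:
args should be a string or a sequence of program arguments. The program to execute is normally the first item of the args sequence, or the string itself, but it can be set explicitly with the executable argument.
On UNIX with shell=False (the default): the Popen class uses os.execvp() to execute the child program. args should normally be a sequence. A string is treated as a sequence whose only item is that string (the program to execute).

On UNIX with shell=True: if args is a string, it specifies the command string to execute through the shell. If args is a sequence, the first item specifies the command string and any additional items are treated as additional shell arguments.

You can start by creating a simple shell script a.sh

$1 and $2 stand for the first and second arguments passed to the script

If you do not pass shell=True, the default shell=False applies, and the first item of args must be the path of the interpreter that runs the script

bufsize, if given, has the same meaning as the corresponding argument of the built-in open() function: 0 means unbuffered, 1 means line buffered, and any other positive value means use a buffer of (approximately) that size. A negative bufsize means use the system default, which usually means fully buffered. The default for bufsize is 0 (unbuffered) in Python 2 and -1 in Python 3.

stdin, stdout and stderr specify the executed program's standard input, standard output and standard error file handles. Valid values are PIPE, an existing file descriptor (a positive integer), an existing file object, and None. PIPE means a new pipe to the child should be created. With None, no redirection occurs and the child's file handles are inherited from the parent. In addition, stderr can be STDOUT, which means the child's stderr data should be captured into the same file handle as stdout.
When creating the Popen object you can pass stdout=subprocess.PIPE and then read the process's standard output through the pipe with p.stdout.read()

preexec_fn: if set to a callable object, the object is called in the child process just before the child is executed.

If close_fds is true, all file descriptors except 0, 1 and 2 are closed before the child process is executed.

If shell is true, the specified command is executed through the shell.

If cwd is not None, the current directory is changed to cwd before the child is executed.

If env is not None, it defines the environment variables for the new process.

If universal_newlines is true, the file objects stdout and stderr are opened as text files, and lines may be terminated by any of \n (the Unix convention), \r (the old Macintosh convention) or \r\n (the Windows convention). All of these external representations are seen as \n by the Python program. Note: this feature is only available if Python is built with universal newline support (the default). Also, the newlines attribute of the file objects stdout, stdin and stderr is not updated by the communicate() method.

startupinfo and creationflags, if given, are passed to the underlying CreateProcess() function. They can specify things such as the appearance of the main window and the priority of the new process. (Windows only)

Popen returns an object through which you can get the command's result, status and so on. The object has the following methods and attributes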

poll()
    Check if child process has terminated. Returns returncode

wait()
    Wait for child process to terminate. Returns returncode attribute.

terminate()
    Terminate the process with SIGTERM

kill()
    Kill the process with SIGKILL

communicate()
    Interact with the process: send data to stdin, read the output from stdout, then wait for the process to finish

send_signal(signal.xxx)
    Send a system signal to the process

pid
    The process ID of the started process

>>> a = subprocess.Popen('python3 guess_age.py',stdout=subprocess.PIPE,stderr=subprocess.PIPE,stdin=subprocess.PIPE,shell=True)
 
>>> a.communicate(b'22')
 
(b'your guess:try bigger\n', b'')
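Since guess_age.py is not shown, here is a self-contained sketch of the same communicate() pattern. The one-line child program in `child_code` is a stand-in invented for this example; it reads a number from stdin and prints a hint.

```python
import subprocess
import sys

# Hypothetical stand-in for guess_age.py: read an age and print a hint.
child_code = "age = int(input()); print('try bigger' if age < 30 else 'ok')"

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

# communicate() writes to stdin, reads stdout and stderr until EOF, then waits.
out, err = proc.communicate(b"22\n", timeout=10)
print(out.strip(), proc.returncode)
```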

For details see the official documentation: subprocess

logging

CRITICAL = 50 #FATAL = CRITICAL
ERROR = 40
WARNING = 30 #WARN = WARNING
INFO = 20
DEBUG = 10
NOTSET = 0 # not set

import logging

logging.debug('调试debug')
logging.info('消息info')
logging.warning('警告warn')
logging.error('错误error')
logging.critical('严重critical')

'''
WARNING:root:警告warn
ERROR:root:错误error
CRITICAL:root:严重critical
'''

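By default only messages of level WARNING and above are printed to the terminal.

Set a global configuration for the logging module, including the output format. It applies to all loggers and can direct output to a file.

The logging.basicConfig() function accepts the following parameters to change the default behavior of the logging module:
filename: create a FileHandler with the given file name (the concept of a handler is explained later), so that logs are stored in that file.
filemode: the file open mode, used when filename is given. The default is "a"; "w" can also be specified.
format: the log format used by the handler.
datefmt: the date/time format.
level: the log level of the root logger (explained later).
stream: create a StreamHandler with the given stream. It can be sys.stderr, sys.stdout or a file; the default is sys.stderr. filename and stream are incompatible: recent Python 3 versions raise ValueError if both are given, while older versions ignored stream.


Format fields that may be used in the format parameter:
%(name)s Name of the logger
%(levelno)s Log level as a number
%(levelname)s Log level as text
%(pathname)s Full path of the module that issued the logging call; may be unavailable
%(filename)s File name of the module that issued the logging call
%(module)s Name of the module that issued the logging call
%(funcName)s Name of the function that issued the logging call
%(lineno)d Source line number of the logging call
%(created)f Time the record was created, as a UNIX timestamp (floating point)
%(relativeCreated)d Time the record was created, in milliseconds relative to when the logging module was loaded
%(asctime)s Time the record was created, as a string. The default format is "2003-07-08 16:49:45,896"; the part after the comma is milliseconds
%(thread)d Thread ID; may be unavailable
%(threadName)s Thread name; may be unavailable
%(process)d Process ID; may be unavailable
%(message)s The message supplied by the user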


import logging
logging.basicConfig(filename='access.log',
                    format='%(asctime)s - %(name)s - %(levelname)s -%(module)s:  %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S %p',
                    level=10)

logging.debug('调试debug')
logging.info('消息info')
logging.warning('警告warn')
logging.error('错误error')
logging.critical('严重critical')


Contents of access.log:
2017-07-28 20:32:17 PM - root - DEBUG -test:  调试debug
2017-07-28 20:32:17 PM - root - INFO -test:  消息info
2017-07-28 20:32:17 PM - root - WARNING -test:  警告warn
2017-07-28 20:32:17 PM - root - ERROR -test:  错误error
2017-07-28 20:32:17 PM - root - CRITICAL -test:  严重critical

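The Formatter, Handler, Logger and Filter objects of the logging module

Logger: the object that produces log records
Filter: the object that filters log records
Handler: receives log records and sends them to different destinations; FileHandler writes to a file, StreamHandler prints to the terminal
Formatter: defines a log format; different Formatter objects can be bound to different Handler objects to control the format each Handler uses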

'''
critical=50
error =40
warning =30
info = 20
debug =10
'''
import logging


# 1. Logger object: produces log records, passes them to Filters, then to the attached Handlers for output
logger=logging.getLogger(__file__)

# 2. Filter object: rarely used, skipped

# 3. Handler object: receives records from the logger and controls where they go
h1=logging.FileHandler('t1.log') # write to a file
h2=logging.FileHandler('t2.log') # write to a file
h3=logging.StreamHandler() # print to the terminal

# Some other commonly used Handler methods:
# Handler.__init__(level=NOTSET)  initialize the handler with a log level, set up its filter list and create a lock for serializing access to I/O. Subclasses must call this from their own constructor
# Handler.createLock()  create a lock so the handler can be used safely from multiple threads
# Handler.acquire()  acquire the thread lock
# Handler.release()  release the thread lock
# Handler.flush()  make sure all log output has been written
# Handler.close()  release all resources used by the handler

# 4. Formatter object: the log format
formatter1=logging.Formatter('%(asctime)s - %(name)s - %(levelname)s -%(module)s:  %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S %p',)
formatter2=logging.Formatter('%(asctime)s :  %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S %p',)
formatter3=logging.Formatter('%(name)s %(message)s',)


# 5. Bind a format to each Handler object
h1.setFormatter(formatter1)
h2.setFormatter(formatter2)
h3.setFormatter(formatter3)

# 6. Add the Handlers to the logger and set the log level
logger.addHandler(h1)
logger.addHandler(h2)
logger.addHandler(h3)
logger.setLevel(10)

# 7. Test
logger.debug('debug')
logger.info('info')
logger.warning('warning')
logger.error('error')
logger.critical('critical')

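Logger and Handler levels

The logger is the first level of filtering; only then does a record reach the handlers. We can set a level on both the logger and the handler, but note the following:

The Logger is the first to filter messages by level. If you set the Logger to INFO and all handlers to DEBUG, you still will not receive DEBUG messages on the handlers, because the Logger itself rejects them. If you set the Logger to DEBUG but all handlers to INFO, you will not receive any DEBUG messages either: when the Logger says "ok, process this", the handlers reject it (DEBUG < INFO).

# verification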
import logging


form=logging.Formatter('%(asctime)s - %(name)s - %(levelname)s -%(module)s:  %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S %p',)

ch=logging.StreamHandler()

ch.setFormatter(form)
# ch.setLevel(10)
ch.setLevel(20)

l1=logging.getLogger('root')
# l1.setLevel(20)
l1.setLevel(10)
l1.addHandler(ch)

l1.debug('l1 debug')


Logger inheritance

import logging

formatter=logging.Formatter('%(asctime)s - %(name)s - %(levelname)s -%(module)s:  %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S %p',)

ch=logging.StreamHandler()
ch.setFormatter(formatter)


logger1=logging.getLogger('root')
logger2=logging.getLogger('root.child1')
logger3=logging.getLogger('root.child1.child2')


logger1.addHandler(ch)
logger2.addHandler(ch)
logger3.addHandler(ch)
logger1.setLevel(10)
logger2.setLevel(10)
logger3.setLevel(10)

logger1.debug('log1 debug')
logger2.debug('log2 debug')
logger3.debug('log3 debug')
'''
2017-07-28 22:22:05 PM - root - DEBUG -test:  log1 debug
2017-07-28 22:22:05 PM - root.child1 - DEBUG -test:  log2 debug
2017-07-28 22:22:05 PM - root.child1 - DEBUG -test:  log2 debug
2017-07-28 22:22:05 PM - root.child1.child2 - DEBUG -test:  log3 debug
2017-07-28 22:22:05 PM - root.child1.child2 - DEBUG -test:  log3 debug
2017-07-28 22:22:05 PM - root.child1.child2 - DEBUG -test:  log3 debug
'''

Each record is handled by the logger's own handler and then propagates to the handlers of every ancestor logger, which is why log2 appears twice and log3 three times.
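The repeated lines come from records propagating from a child logger to its ancestors' handlers. The sketch below makes this visible with a small handler that collects messages in a list; the class `ListHandler` and the logger names `demo` and `demo.child` are made up for the example. Setting `propagate = False` on the child stops the duplication.

```python
import logging

class ListHandler(logging.Handler):
    """Collect formatted messages in a list instead of printing them."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

handler = ListHandler()
parent = logging.getLogger("demo")
child = logging.getLogger("demo.child")
parent.setLevel(logging.DEBUG)
child.setLevel(logging.DEBUG)
parent.addHandler(handler)
child.addHandler(handler)

child.debug("first")     # handled by child's handler, then by parent's: 2 copies
child.propagate = False  # stop passing records up to ancestor loggers
child.debug("second")    # handled by child's handler only: 1 copy

print(handler.messages)
```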


Using logger in an application

"""
logging configuration
"""

import os
import logging.config

# Define three log output formats: start

standard_format = '[%(asctime)s][%(threadName)s:%(thread)d][task_id:%(name)s][%(filename)s:%(lineno)d]' \
                  '[%(levelname)s][%(message)s]' # name is the name passed to getLogger

simple_format = '[%(levelname)s][%(asctime)s][%(filename)s:%(lineno)d]%(message)s'

id_simple_format = '[%(levelname)s][%(asctime)s] %(message)s'

# Define log output formats: end

logfile_dir = os.path.dirname(os.path.abspath(__file__))  # directory of the log file

logfile_name = 'all2.log'  # log file name

# Create the log directory if it does not exist
if not os.path.isdir(logfile_dir):
    os.mkdir(logfile_dir)

# Full path of the log file
logfile_path = os.path.join(logfile_dir, logfile_name)

# logging configuration dict
LOGGING_DIC = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'standard': {
            'format': standard_format
        },
        'simple': {
            'format': simple_format
        },
    },
    'filters': {},
    'handlers': {
        # log printed to the terminal
        'console': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',  # print to the screen
            'formatter': 'simple'
        },
        # log written to a file (the level here is DEBUG, so everything is collected)
        'default': {
            'level': 'DEBUG',
            'class': 'logging.handlers.RotatingFileHandler',  # save to a file
            'formatter': 'standard',
            'filename': logfile_path,  # log file
            'maxBytes': 1024*1024*5,  # log size 5M
            'backupCount': 5,
            'encoding': 'utf-8',  # encoding of the log file, so Chinese log text is not garbled
        },
    },
    'loggers': {
        # configuration of the logger returned by logging.getLogger(__name__)
        '': {
            'handlers': ['default', 'console'],  # attach both handlers defined above, so log data is written to the file and printed to the screen
            'level': 'DEBUG',
            'propagate': True,  # pass records up to ancestor (higher-level) loggers
        },
    },
}


def load_my_logging_cfg():
    logging.config.dictConfig(LOGGING_DIC)  # load the logging configuration defined above
    logger = logging.getLogger(__name__)  # get a logger instance
    logger.info('It works!')  # record that this file ran

if __name__ == '__main__':
    load_my_logging_cfg()

logging configuration file

#logging_config.py
import os
import logging
# BASE_LOG_DIR is assumed to be defined elsewhere in the Django settings
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'standard': {
            'format': '[%(asctime)s][%(threadName)s:%(thread)d][task_id:%(name)s][%(filename)s:%(lineno)d]'
                      '[%(levelname)s][%(message)s]'
        },
        'simple': {
            'format': '[%(levelname)s][%(asctime)s][%(filename)s:%(lineno)d]%(message)s'
        },
        'collect': {
            'format': '%(message)s'
        }
    },
    'filters': {
        'require_debug_true': {
            '()': 'django.utils.log.RequireDebugTrue',
        },
    },
    'handlers': {
        # log printed to the terminal
        'console': {
            'level': 'DEBUG',
            'filters': ['require_debug_true'],
            'class': 'logging.StreamHandler',
            'formatter': 'simple'
        },
        # log written to a file: collects INFO and above
        'default': {
            'level': 'INFO',
            'class': 'logging.handlers.RotatingFileHandler',  # save to a file, rotate automatically
            'filename': os.path.join(BASE_LOG_DIR, "xxx_info.log"),  # log file
            'maxBytes': 1024 * 1024 * 5,  # log size 5M
            'backupCount': 3,
            'formatter': 'standard',
            'encoding': 'utf-8',
        },
        # log written to a file: collects ERROR and above
        'error': {
            'level': 'ERROR',
            'class': 'logging.handlers.RotatingFileHandler',  # save to a file, rotate automatically
            'filename': os.path.join(BASE_LOG_DIR, "xxx_err.log"),  # log file
            'maxBytes': 1024 * 1024 * 5,  # log size 5M
            'backupCount': 5,
            'formatter': 'standard',
            'encoding': 'utf-8',
        },
        # log written to a file
        'collect': {
            'level': 'INFO',
            'class': 'logging.handlers.RotatingFileHandler',  # save to a file, rotate automatically
            'filename': os.path.join(BASE_LOG_DIR, "xxx_collect.log"),
            'maxBytes': 1024 * 1024 * 5,  # log size 5M
            'backupCount': 5,
            'formatter': 'collect',
            'encoding': "utf-8"
        }
    },
    'loggers': {
        # configuration of the logger returned by logging.getLogger(__name__)
        '': {
            'handlers': ['default', 'console', 'error'],
            'level': 'DEBUG',
            'propagate': True,
        },
        # configuration of the logger returned by logging.getLogger('collect')
        'collect': {
            'handlers': ['console', 'collect'],
            'level': 'INFO',
        }
    },
}


# -----------
# Usage: get two loggers

logger = logging.getLogger(__name__) # normal production log
collect_logger = logging.getLogger("collect") # a separate log tailored for management to read

Another configuration file, from Django, for reference

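Note
The benefit of the approach above is that all configuration related to the logging module lives in one dict, which is clearer and easier to manage.


What we need to take care of:
1. Load the configuration from the dict: logging.config.dictConfig(settings.LOGGING_DIC)
2. Get a logger object to produce logs
Logger objects are configured in the sub-dict under the loggers key.
In the logging module, things are looked up by name, that is, by key,
so to get different logger objects we write
logger=logging.getLogger('key name in the loggers sub-dict')

Never construct a Logger instance directly. Always obtain it through the module-level function logging.getLogger(name); calling it several times with the same name always returns the same instance.

    The problem: if we want logger objects with different names to share one piece of configuration, we surely do not want to define n keys in the loggers sub-dict: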
 'loggers': {
        'l1': {
            'handlers': ['default', 'console'],
            'level': 'DEBUG',
            'propagate': True,  # pass records up to ancestor (higher-level) loggers
        },
        'l2': {
            'handlers': ['default', 'console'],
            'level': 'DEBUG',
            'propagate': False,  # do not pass records up to ancestor loggers
        },
        'l3': {
            'handlers': ['default', 'console'],
            'level': 'DEBUG',
            'propagate': True,  # pass records up to ancestor (higher-level) loggers
        },

}

Our solution is to define an empty key
    'loggers': {
        '': {
            'handlers': ['default', 'console'], 
            'level': 'DEBUG',
            'propagate': True, 
        },

}

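When we then get a logger with logging.getLogger(__name__), __name__ differs from file to file, which keeps the identifying information in each log line distinct. No key with that name exists in loggers, so the logger has no configuration of its own: its records propagate up to the root logger, which is what the key '' configures.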
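The effect of the empty key can be seen in a small, self-contained sketch. The dict below is a trimmed-down configuration written for this example, and the logger names `shop.orders` and `shop.users` are hypothetical module names.

```python
import logging
import logging.config

LOGGING_DIC = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "[%(levelname)s][%(name)s] %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "level": "DEBUG",
            "formatter": "simple",
        },
    },
    "loggers": {
        # The empty key configures the root logger.
        "": {"handlers": ["console"], "level": "DEBUG"},
    },
}
logging.config.dictConfig(LOGGING_DIC)

orders_logger = logging.getLogger("shop.orders")  # hypothetical module name
users_logger = logging.getLogger("shop.users")    # hypothetical module name

# Neither logger has handlers or a level of its own; both fall back to the root.
print(orders_logger.handlers)
print(orders_logger.getEffectiveLevel() == logging.DEBUG)

# The %(name)s field still tells the two sources apart.
orders_logger.info("order created")
users_logger.info("user created")
```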

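A bug you can run into when using logging

Symptom:

The production center submitted 300 burn-in test tasks. After a while the tasks stopped being scheduled, and the background log showed that output had stopped at a certain point in time.

Analysis:
First confirm that the process exists and has not died.
Then look at the process with strace -p: it is stuck in a futex call, so something is probably wrong with a lock operation.
Attach gdb to the process ID and use py-bt to view the stack. The stack is roughly: sig_handler (a signal handler) -> auroralogger (a custom logging function) -> logging (Python's logging module) -> threading.acquire (acquiring a lock). The gdb backtrace largely confirms the guess above: a deadlock has occurred.
Python's logging module itself is unlikely to have such a deadlock bug, so the problem is most likely in how we use it. The logging documentation has a Thread Safety section; it is short, but it explains the problem I ran into right away: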


The logging module is intended to be thread-safe without any special work needing to be done by its clients. It achieves this though using threading
 locks; there is one lock to serialize access to the module’s shared data, and each handler also creates a lock to serialize access to its underlying I/O.

If you are implementing asynchronous signal handlers using the signal module,
 you may not be able to use logging from within such handlers. This is because lock implementations in the threading module
 are not always re-entrant, and so cannot be invoked from such signal handlers.


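The first part says that logging is thread-safe: shared data is protected with locks from the threading module.
The second part specifically says that the logging module may not be usable inside asynchronous signal handlers, because the lock implementations in threading are not always re-entrant.
This explains the deadlock I ran into: I called the non-reentrant logging module from a signal handler.
Thread safety and reentrancy:

As the logging module shows, thread safety and reentrancy are not equivalent. How are the two concepts related, and how do they differ?

Reentrant function: taken literally, a function that can be entered again while it is already running. Even when several threads execute it out of order or interleaved, its output is the same as when it runs once on its own. In other words, the output depends only on the input.
Thread-safe function: a function that can be called by multiple threads with the guarantee that it never reads wrong or dirty data. The output of a thread-safe function may depend not only on the input but also on the order of the calls.

The biggest difference between reentrant functions and thread-safe functions is async-signal safety. A reentrant function can be called safely from an asynchronous signal handler, while a thread-safe function is not guaranteed to be safe there.

The logging module we ran into is not async-signal-safe. The main thread was inside a log call that had acquired a lock through threading. At that moment an asynchronous signal arrived and execution jumped to the signal handler, which happened to call the log function too. The earlier log call had not yet released the lock, so the result was a deadlock.

A reentrant function is necessarily both thread-safe and async-signal-safe; a thread-safe function is not necessarily reentrant.

Summary:
An asynchronous signal handler must be kept as simple as possible and must not call non-reentrant functions.
Python's logging module is thread-safe but not reentrant.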
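One common way to follow this advice is to let the signal handler only record that the signal arrived and do the logging later in normal program flow. The sketch below is an assumption about how that could look, not the fix applied in the incident above; the logger name `worker`, the flag `signal_seen` and the use of SIGINT are illustrative choices, and `signal.raise_signal` (Python 3.8+) merely simulates an incoming signal.

```python
import logging
import signal
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("worker")  # hypothetical logger name

signal_seen = False

def on_signal(signum, frame):
    # Keep the handler trivial: no logging, no locks, just record the event.
    global signal_seen
    signal_seen = True

signal.signal(signal.SIGINT, on_signal)
signal.raise_signal(signal.SIGINT)  # simulate an incoming signal

# Main loop: wait (with a bound) until the flag has been set.
deadline = time.monotonic() + 2.0
while not signal_seen and time.monotonic() < deadline:
    time.sleep(0.01)

if signal_seen:
    # Logging happens here, outside the signal handler.
    logger.info("received SIGINT, shutting down cleanly")
```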

re

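What is a regular expression?

A regular expression is a way of describing characters or strings by combining symbols that have special meanings (the combination is called a regular expression). Put differently, a regular expression is a rule that describes a class of things. In Python it is built into the language and made available through the re module. Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C.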

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.

The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.

This module exports the following functions:
    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a match object for each match.
    compile   Compile a pattern into a RegexObject.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.

Some of the functions in this module takes flags as optional parameters:
    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                   match the corresponding ASCII character categories
                   (rather than the whole Unicode categories, which is the
                   default).
                   For bytes patterns, this flag is the only available
                   behaviour and needn't be specified.
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     For compatibility only. Ignored for string patterns (it
                   is the default), and forbidden for bytes patterns.

This module also defines an exception 'error'.

"""

Docstring of the re module (Python 3.6)

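Character sets: [0-9][a-z][A-Z]

The characters that may appear at a given position form a character set, written with [] in a regular expression. Characters fall into many classes, such as digits, letters and punctuation.
If a position may only hold one digit, the character at that position can only be one of the ten digits 0, 1, 2 ... 9.
Ranges can be combined, as in [0-5a-eA-Z].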

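Characters

 .   matches any character except a newline
\w  matches a letter, digit or underscore
\s  matches any whitespace character
\d  matches a digit
\n  matches a newline
\t  matches a tab
\b  matches a word boundary (the start or end of a word)
^   matches the start of the string
$   matches the end of the string
\W  matches anything that is not a letter, digit or underscore
\D  matches a non-digit
\S  matches a non-whitespace character
a|b matches character a or character b
()  matches the expression inside the parentheses and also forms a group
[...]   matches any character in the set
[^...]  matches any character not in the set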

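Quantifiers

Quantifier  Meaning
*   repeat zero or more times
+   repeat one or more times
?   repeat zero or one time
{n} repeat n times
{n,}    repeat n or more times
{n,m}   repeat n to m times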

.^$

Regex     Text to match    Matches            Notes
东.       东方东娇东东     东方/东娇/东东      matches every "东" followed by one character
^东.      东方东娇东东     东方               matches "东." only at the start
东.$      东方东娇东东     东东               matches "东." only at the end
*+?{}

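Regex    Text to match              Matches                     Notes
李.?     李杰和李莲英和李二棍子     李杰/李莲/李二              ? repeats zero or one time, so "李" is followed by at most one arbitrary character
李.*     李杰和李莲英和李二棍子     李杰和李莲英和李二棍子      * repeats zero or more times, so "李" is followed by zero or more arbitrary characters
李.+     李杰和李莲英和李二棍子     李杰和李莲英和李二棍子      + repeats one or more times, so "李" is followed by one or more arbitrary characters
李.{1,2} 李杰和李莲英和李二棍子     李杰和/李莲英/李二棍        {1,2} matches one or two arbitrary characters

Note: the quantifiers *, + and ? above are greedy, meaning they match as much as possible. Adding ? after them makes them lazy.
Regex     Text to match                Matches         Notes
李.*?     李杰和李莲英和李二棍子       李/李/李        lazy matching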
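The rows of these two tables can be reproduced with re.findall. This is a small check written for this section; the variable names are arbitrary.

```python
import re

names = '李杰和李莲英和李二棍子'

optional = re.findall('李.?', names)     # at most one character after 李
greedy_star = re.findall('李.*', names)  # as many characters as possible
greedy_plus = re.findall('李.+', names)
bounded = re.findall('李.{1,2}', names)  # one or two characters, preferring two
lazy_star = re.findall('李.*?', names)   # as few characters as possible

for result in (optional, greedy_star, greedy_plus, bounded, lazy_star):
    print(result)
```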

字符集[][^]

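Regex                  Text to match              Matches                  Notes
李[杰莲英二棍子]*      李杰和李莲英和李二棍子     李杰/李莲英/李二棍子     matches "李" followed by any number of characters from [杰莲英二棍子]
李[^和]*               李杰和李莲英和李二棍子     李杰/李莲英/李二棍子     matches "李" followed by any number of characters other than "和"
[\d]                   456bdha3                   4/5/6/3                  matches any single digit; 4 results
[\d]+                  456bdha3                   456/3                    matches a run of digits; 2 results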

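Grouping (), alternation | and [^]

An ID card number is a string of 15 or 18 characters. A 15-character number consists entirely of digits and cannot start with 0. In an 18-character number the first 17 characters are digits and the last may be a digit or x. Let us try to express this as a regular expression:

Regex                               Text to match               Matches                 Notes
^[1-9]\d{13,16}[0-9x]$              110101198001017032          110101198001017032      matches a valid ID number
^[1-9]\d{13,16}[0-9x]$              1101011980010170            1101011980010170        also matches this string, which is not a valid ID number: it has 16 digits
^[1-9]\d{14}(\d{2}[0-9x])?$         1101011980010170            False                   no longer matches the invalid number; () groups \d{2}[0-9x] so that the group as a whole is constrained to appear 0 or 1 times
^([1-9]\d{16}[0-9x]|[1-9]\d{14})$   110105199812067023          110105199812067023      tries [1-9]\d{16}[0-9x] first and falls back to [1-9]\d{14} if that does not match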
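The table can be verified by compiling the three patterns and running them against the sample numbers. The constant names `LOOSE`, `GROUPED` and `ALTERNATION` are labels chosen for this sketch.

```python
import re

LOOSE = re.compile(r'^[1-9]\d{13,16}[0-9x]$')
GROUPED = re.compile(r'^[1-9]\d{14}(\d{2}[0-9x])?$')
ALTERNATION = re.compile(r'^([1-9]\d{16}[0-9x]|[1-9]\d{14})$')

samples = ['110101198001017032', '1101011980010170', '110105199812067023']

for s in samples:
    # Print the length and whether each pattern accepts the string.
    print(len(s), bool(LOOSE.match(s)), bool(GROUPED.match(s)), bool(ALTERNATION.match(s)))
```

Only the loose pattern accepts the 16-digit string; the grouped and alternation patterns accept exactly 15 or 18 characters.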

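The escape character \

Regular expressions contain many metacharacters with special meanings, such as \d and \s. To match the literal text "\d" rather than "a digit", the "\" has to be escaped, giving '\\'.
In Python both the regular expression and the text to match are strings, and \ also has a special meaning inside a string and needs escaping there too. So to match one literal "\d", the text is written as the string '\\d' and the regex as "\\\\d", which is cumbersome.
This is where raw strings come in: with the r prefix the regex can simply be written as r'\\d'.
Regex                  Text to match               Matches                 Notes
\d                     \d                          False                   \ has a special meaning in a regular expression, so the expression \d cannot match the literal text \d
\\d                    \d                          True                    escaping \ as \\ makes it match
"\\\\d"                '\\d'                       True                    inside a Python string each '\' must itself be escaped once more
r'\\d'                  r'\d'                     True                    the r prefix stops the string from being escape-processed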
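The rows of the table can be checked in a few lines. The variable name `text` is arbitrary; it holds the two characters backslash and d.

```python
import re

text = r"\d"  # two characters: a backslash followed by d
print(len(text))

# The regex \d means "a digit", so it does not match the literal text.
print(re.search(r"\d", text))

# The regex \\d means "a backslash followed by d".
print(re.search(r"\\d", text).group() == text)

# Without the r prefix every backslash has to be doubled again.
print("\\\\d" == r"\\d")
```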

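Greedy matching

Using the re module in Python

1.compile
re.compile(pattern[, flags])
Purpose: turn a regular expression into a regular expression object

obj = re.compile(r'\d{3}')  # compile the regular expression into a pattern object; the rule matches 3 digits
ret = obj.search('abc123eeee') # call search on the pattern object with the text to match
print(ret.group())  # result: 123

The flags include:
re.I: ignore case
re.L: make \w, \W, \b, \B, \s, \S depend on the current locale
re.M: multi-line mode
re.S: make '.' match any character including a newline (by default '.' does not match a newline)
re.U: make \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character properties database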
Example of multi-line matching
import re

line = "IF_MIB::=Counter32: 12345\nIF_MIB::=Counter32: 1234556"
result = re.findall(r'(?<=\:\s)\d+$', line, re.M)

if result:
    print(result)
else:
    print("Nothing found!!")

'''without re.M
output: ['1234556']
    with re.M
output: ['12345', '1234556']
'''

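2.search
re.search(pattern, string[, flags])

ret = re.search('a', 'eva egon yuan').group()
print(ret) # result: 'a'
# The function scans the string for the pattern and stops at the first match, returning an object with the match information. Call group() on it to get the matched string. If nothing matches, it returns None.

Purpose: find the position in the string where the pattern matches and return a match object, or None if no position matches.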

3.match
re.match(pattern, string[, flags])
match(string[, pos[, endpos]])
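Purpose: match() only tries to match at the beginning of the string, that is, it only reports a match that starts at position 0,
whereas search() scans the whole string for a match. To look for a match anywhere in the string, use search().

ret = re.match('a', 'abc').group()  # like search, but only matches at the start of the string
print(ret)
# result: a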


Example:
import re
r1 = re.compile(r'world')
if r1.match('helloworld'):
    print('match succeeds')
else:
    print('match fails')
if r1.search('helloworld'):
    print('search succeeds')
else:
    print('search fails')
Result
#match fails
#search succeeds


4.split
re.split(pattern, string[, maxsplit=0, flags=0])
split(string[, maxsplit=0])
Purpose: splits the string wherever the pattern matches and returns the pieces as a list

import re
inputStr = 'abc aa;bb,cc | dd(xx).xxx 12.12'
print(re.split(' ', inputStr))
Result
#['abc', 'aa;bb,cc', '|', 'dd(xx).xxx', '12.12']
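
Splitting on a single space does not need a regular expression. re.split becomes useful once several delimiters are involved. A sketch on the same input, where the delimiter set is an arbitrary choice for illustration:

```python
import re

input_str = 'abc aa;bb,cc | dd(xx).xxx 12.12'

# Split on any run of spaces, semicolons, commas or pipes
parts = re.split(r'[ ;,|]+', input_str)
print(parts)

# maxsplit limits the number of splits; the rest of the string stays in the last item
head, rest = re.split(r'[ ;,|]+', input_str, maxsplit=1)
print(head, '->', rest)
```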

5.findall
re.findall(pattern, string[, flags])
findall(string[, pos[, endpos]])

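ret = re.findall('a', 'eva egon yuan')  # returns every match as a list
print(ret)  # result: ['a', 'a']
Purpose: finds every substring that the pattern matches and returns them as a list
A typical use is extracting the content enclosed in [], where greedy and non-greedy patterns give different results.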
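
The bracket case can be sketched as follows. The sample text is invented; the three patterns differ only in greediness and in whether a capturing group is used.

```python
import re

text = "tags: [python] [regex] [re]"

greedy = re.findall(r"\[.*\]", text)     # .* runs to the last ]
lazy = re.findall(r"\[.*?\]", text)      # .*? stops at the nearest ]
inner = re.findall(r"\[(.*?)\]", text)   # a group returns only the enclosed text

print(greedy)
print(lazy)
print(inner)
```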

6.finditer
re.finditer(pattern, string[, flags])
finditer(string[, pos[, endpos]])
Description: like findall, finds every substring that the pattern matches, but returns an iterator of match objects instead of a list.

ret = re.finditer(r'\d', 'ds3sy4784a')   # finditer returns an iterator over the matches
print(ret)  # <callable_iterator object at 0x10195f940>
print(next(ret).group())  # first result
print(next(ret).group())  # second result
print([i.group() for i in ret])  # the remaining results


7.sub
re.sub(pattern, repl, string[, count, flags])
sub(repl, string[, count=0])
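Description: finds every substring of string that matches pattern and replaces it with repl. If pattern matches nothing, string is returned unchanged.
repl can be either a string or a function. When it is a function, it is called with each match object and returns the replacement text.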

import re
def pythonReSubDemo():
    inputStr = "hello 123,my 234,world 345"
    def _add111(matched):
        # called once per match; returns the replacement text
        intStr = int(matched.group("number"))
        _addValue = intStr + 111
        _addValueStr = str(_addValue)
        return _addValueStr

    # count=1: only the first match is replaced
    replaceStr = re.sub(r"(?P<number>\d+)", _add111, inputStr, count=1)
    print("replaceStr=", replaceStr)

if __name__ == '__main__':
    pythonReSubDemo()
##############
#replaceStr= hello 234,my 234,world 345


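Notes:
#1 findall and capturing groups
import re

ret = re.findall(r'www\.(baidu|oldboy)\.com', 'www.oldboy.com')
print(ret)  # ['oldboy']     when the pattern has a group, findall returns the group contents; to get the whole match, make the group non-capturing with ?:

ret = re.findall(r'www\.(?:baidu|oldboy)\.com', 'www.oldboy.com')
print(ret)  # ['www.oldboy.com']

#2 split and capturing groups
ret = re.split(r"\d+", "eva3egon4yuan")
print(ret)  # result: ['eva', 'egon', 'yuan']

ret = re.split(r"(\d+)", "eva3egon4yuan")
print(ret)  # result: ['eva', '3', 'egon', '4', 'yuan']

# Wrapping the pattern in () changes what split returns:
# without (), the matched separators are dropped; with (), they are kept in the result.
# This matters whenever the separators themselves are needed later.

# findall  returns a list directly;
#          the pattern is an ordinary regular expression,
#          but if it contains groups only the group contents are returned

# search   returns a match object; call .group() on it
# match    returns a match object; call .group() on it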

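Greedy matching: when a match is possible, match the longest string possible. This is the default behaviour.
Pattern                Text to match               Matches                     Notes
<.*>                   <script>...<script>         <script>...<script>         Greedy by default: matches as much as possible
<.*?>                  <script>...<script>         <script> and <script>       Adding ? switches to non-greedy matching: matches as little as possible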

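Common non-greedy quantifiers
*? repeat any number of times, but as few as possible
+? repeat 1 or more times, but as few as possible
?? repeat 0 or 1 times, but as few as possible
{n,m}? repeat n to m times, but as few as possible
{n,}? repeat n or more times, but as few as possible

Using .*?
. matches any character (except a newline, unless re.S is set)
* repeats it 0 or more times
? makes the repetition non-greedy
Together they mean "take as few characters as possible". The combination is rarely written on its own; it usually appears as:
.*?x

which takes characters up to the first x that appears.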
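
A short sketch of the .*?x idiom, using ; as the terminating character. The sample text is made up for illustration.

```python
import re

text = "key=value;other=thing;"

up_to_first = re.match(r".*?;", text).group()   # stops at the first ;
up_to_last = re.match(r".*;", text).group()     # runs to the last ;
alone = re.match(r".*?", text).group()          # on its own, .*? matches nothing

print(repr(up_to_first), repr(up_to_last), repr(alone))
```

The last line shows why .*? is rarely written alone: with nothing after it, the shortest possible match is the empty string.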


#!/usr/bin/env python
# -*- coding: utf-8 -*-

# 1. Matching tags
import re
ret = re.search(r"<(?P<tag_name>\w+)>\w+</(?P=tag_name)>", "<h1>hello</h1>")
# A group can be named with the ?P<name> form
# The named group's text is then available through group('name')
print(ret.group('tag_name'))  # result: h1
print(ret.group())  # result: <h1>hello</h1>

ret = re.search(r"<(\w+)>\w+</\1>", "<h1>hello</h1>")
# Without a name, \number refers back to a group and requires the same text that group matched
# The group's text is available through group(number)
print(ret.group(1))  # result: h1
print(ret.group())  # result: <h1>hello</h1>

# 2. Matching integers
ret = re.findall(r"\d+", "1-2*(60+(-40.35/5)-(-4*3))")
print(ret)  # ['1', '2', '60', '40', '35', '5', '4', '3']
# The float alternative has no group, so findall reports '' where it matched
ret = re.findall(r"-?\d+\.\d*|(-?\d+)", "1-2*(60+(-40.35/5)-(-4*3))")
print(ret)  # ['1', '-2', '60', '', '5', '-4', '3']
ret.remove("")
print(ret)  # ['1', '-2', '60', '5', '-4', '3']

# 3. Number-matching exercises
# 1) Match the email address on each line of a text
#       http://blog.csdn.net/make164492212/article/details/51656638
# 2) Match the date string on each line of a text, e.g. '1990-07-12';
#
#    the 12 months of a year: ^(0?[1-9]|1[0-2])$
#    the 31 days of a month: ^((0?[1-9])|((1|2)[0-9])|30|31)$
# 3) Match a QQ number (Tencent QQ numbers start at 10000): [1-9][0-9]{4,}
# 4) Match a floating-point number: ^(-?\d+)(\.\d+)?$   or  -?\d+\.?\d*
# 5) Match Chinese characters: ^[\u4e00-\u9fa5]{0,}$
# 6) Match all integers

re exercises (the code above)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import json
import requests  # third-party HTTP client; urllib.request.urlopen would also work

def getPage(url):
    response = requests.get(url)
    return response.text

def parsePage(s):
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)

    ret = com.finditer(s)
    for i in ret:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }

def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)
    print(ret)  # a generator object
    # the with block closes the file even if writing fails
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:  # iterate over the generator
            print(obj)
            data = json.dumps(obj, ensure_ascii=False)
            f.write(data + "\n")

count = 0
for i in range(10):
    main(count)
    count += 25

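Scraping a Douban page and matching it with re (the code above)

collections

On top of the built-in data types (dict, list, set, tuple), the collections module provides several additional data types, including Counter, deque, defaultdict, namedtuple and OrderedDict.

namedtuple: creates a tuple whose elements can be accessed by name

deque: a double-ended queue with fast appends and pops at either end

Counter: a counter, mainly used for counting

OrderedDict: an ordered dictionary

defaultdict: a dictionary with a default value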

namedtuple

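A tuple can represent an immutable collection. For example, the 2D coordinates of a point can be written as:

>>> p = (1, 2)

However, looking at (1, 2), it is hard to tell that this tuple represents a coordinate.

This is where namedtuple comes in: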

>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> p = Point(1, 2)
>>> p.x
1
>>> p.y
2

Similarly, a circle described by its coordinates and radius can also be defined with namedtuple:

# namedtuple('name', [list of field names]):
Circle = namedtuple('Circle', ['x', 'y', 'r'])
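
A brief sketch of what the Circle type offers beyond attribute access. The field values are arbitrary; _replace, _asdict and _fields are standard namedtuple members.

```python
from collections import namedtuple

Circle = namedtuple('Circle', ['x', 'y', 'r'])

c = Circle(x=0, y=0, r=5)
x, y, r = c                 # unpacks like an ordinary tuple
bigger = c._replace(r=10)   # namedtuples are immutable; _replace returns a new one

print(c.r, bigger.r)
print(c._asdict())          # field names mapped to values
print(Circle._fields)
```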

deque

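A list gives fast access by index, but inserting or deleting elements away from the end is slow: a list is stored contiguously, so every later element has to be shifted, and the cost grows with the size of the list.

A deque is a double-ended list built for efficient insertion and deletion at both ends, which makes it a good fit for queues and stacks: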

>>> from collections import deque
>>> q = deque(['a', 'b', 'c'])
>>> q.append('x')
>>> q.appendleft('y')
>>> q
deque(['y', 'a', 'b', 'c', 'x'])

Besides list's append() and pop(), deque also supports appendleft() and popleft(), so elements can be added to or removed from the head very efficiently.
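
A minimal sketch of two common deque uses: a FIFO queue and a bounded buffer of recent items. The variable names and sizes are arbitrary.

```python
from collections import deque

# Use as a FIFO queue: append on the right, pop from the left
queue = deque(['a', 'b', 'c'])
queue.append('d')
first = queue.popleft()

# A bounded deque drops items from the opposite end once it is full
recent = deque(maxlen=3)
for n in range(5):
    recent.append(n)

print(first, list(queue), list(recent))
```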

OrderedDict

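Before Python 3.7, the order of keys in a dict was not guaranteed, so iterating over a dict could yield the keys in any order. Since Python 3.7 a plain dict preserves insertion order.

OrderedDict keeps keys in insertion order on every Python version and adds order-aware features such as move_to_end() and order-sensitive equality:

>>> from collections import OrderedDict
>>> d = dict([('a', 1), ('b', 2), ('c', 3)])
>>> d # on Python 3.7+ insertion order is kept; older versions could show e.g. {'a': 1, 'c': 3, 'b': 2}
{'a': 1, 'b': 2, 'c': 3}
>>> od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
>>> od # OrderedDict keys are ordered
OrderedDict([('a', 1), ('b', 2), ('c', 3)])

Note that OrderedDict keys are ordered by insertion, not sorted by the keys themselves: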

>>> od = OrderedDict()
>>> od['z'] = 1
>>> od['y'] = 2
>>> od['x'] = 3
>>> list(od.keys()) # keys come back in insertion order
['z', 'y', 'x']
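
Now that plain dicts also keep insertion order, the features below are the usual reasons to still reach for OrderedDict. This is a sketch with made-up keys.

```python
from collections import OrderedDict

od = OrderedDict([('z', 1), ('y', 2), ('x', 3)])

od.move_to_end('z')                   # move an existing key to the end
oldest = od.popitem(last=False)       # pop from the front (FIFO)

# Unlike plain dicts, two OrderedDicts compare equal only if the order matches too
same_order = OrderedDict(a=1, b=2) == OrderedDict(a=1, b=2)
different_order = OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1)

print(list(od), oldest, same_order, different_order)
```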

defaultdict

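Given the collection of values [11,22,33,44,55,66,77,88,99,90...], store every value greater than 66 under the first key of a dictionary and every other value (66 or less) under the second key.

That is: { 'k1' : values greater than 66 , 'k2' : values of 66 or less }

values = [11, 22, 33,44,55,66,77,88,99,90]

my_dict = {}

for value in values:
    if value > 66:
        if 'k1' in my_dict:  # dict.has_key() was removed in Python 3
            my_dict['k1'].append(value)
        else:
            my_dict['k1'] = [value]
    else:
        if 'k2' in my_dict:
            my_dict['k2'].append(value)
        else:
            my_dict['k2'] = [value]

With defaultdict(list), a missing key starts out as an empty list, so the membership checks are no longer needed: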

from collections import defaultdict

values = [11, 22, 33,44,55,66,77,88,99,90]

my_dict = defaultdict(list)

for value in  values:
    if value>66:
        my_dict['k1'].append(value)
    else:
        my_dict['k2'].append(value)

With a plain dict, referencing a key that does not exist raises KeyError. To get a default value back instead, use defaultdict:

>>> from collections import defaultdict
>>> dd = defaultdict(lambda: 'N/A')
>>> dd['key1'] = 'abc'
>>> dd['key1'] # key1 exists
'abc'
>>> dd['key2'] # key2 does not exist, so the default is returned
'N/A'
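
One detail worth seeing in code: indexing a missing key on a defaultdict inserts that key, while .get() does not. A sketch with an invented word list:

```python
from collections import defaultdict

words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

# Group words by first letter; list() supplies the empty list for a new key
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

# Looking up a missing key with [] inserts it; .get() does not
probe = defaultdict(int)
probe.get('x')
after_get = len(probe)
probe['x']
after_index = len(probe)

print(dict(by_letter), after_get, after_index)
```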

Counter

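The Counter class keeps track of how many times each value occurs. It is a dict subclass that stores elements as keys and their counts as values. A count can be any integer, including zero and negative numbers. Counter is similar to the bags or multisets of other languages.

Creation

A Counter can be created in four ways: empty, from an iterable, from a mapping, or from keyword arguments.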
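
The four ways of creating a Counter can be sketched as follows. The sample values follow the style of the official documentation, and the variable names are illustrative.

```python
from collections import Counter

empty = Counter()                               # 1. an empty counter
from_iterable = Counter('gallahad')             # 2. from an iterable
from_mapping = Counter({'red': 4, 'blue': 2})   # 3. from a mapping
from_kwargs = Counter(cats=4, dogs=8)           # 4. from keyword arguments

print(empty, from_iterable['a'], from_mapping['red'], from_kwargs['dogs'])
```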

Accessing counts and missing keys

Accessing a key that does not exist returns 0 instead of raising KeyError; otherwise the key's count is returned.
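
A small sketch of this lookup behaviour on an arbitrary sample string. Note that reading a missing key does not add it to the counter.

```python
from collections import Counter

c = Counter('abcdeabcdabcaba')

present = c['a']              # count of an existing element
missing = c['z']              # missing element: 0, no KeyError
still_absent = 'z' not in c   # the lookup did not add the key

print(present, missing, still_absent)
```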

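Updating a counter (update and subtract)

A counter's counts can be updated from an iterable or from another Counter.

Updates go in two directions. The update() method adds counts.

The subtract() method removes counts, and the results may drop to zero or below.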
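
A sketch of both directions, with a made-up stock example. Unlike dict.update(), Counter.update() adds to existing counts rather than replacing them.

```python
from collections import Counter

stock = Counter(apple=3, pear=1)

stock.update(['apple', 'fig'])            # from an iterable: each item adds 1
stock.update(Counter(pear=2))             # from another Counter: counts are added
after_update = dict(stock)

stock.subtract(Counter(apple=1, fig=2))   # counts may go to zero or negative
after_subtract = dict(stock)

print(after_update, after_subtract)
```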

Modifying and deleting keys

A count of 0 does not mean the element has been removed. To remove an element, use del.
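
The difference between a zero count and a deleted key, in a minimal sketch:

```python
from collections import Counter

c = Counter('aab')
c['b'] = 0                 # the key stays, with a count of zero
kept = 'b' in c
del c['b']                 # del removes the entry entirely
removed = 'b' not in c

print(kept, removed, c)
```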

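elements()

Returns an iterator in which each element is repeated as many times as its count. Elements are yielded in the order first encountered (Python 3.7+; earlier versions give no guaranteed order), and elements with a count less than 1 are left out.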
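
A short sketch of elements() on a counter that includes zero and negative counts. The result is sorted here only to make the output independent of ordering.

```python
from collections import Counter

c = Counter(a=2, b=0, c=-1, d=3)
expanded = sorted(c.elements())   # b and c are skipped: their counts are below 1
print(expanded)
```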

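most_common([n])

Returns a list of the n most common elements and their counts. If n is not given, all elements are returned. Elements with equal counts are ordered by first occurrence (Python 3.7+; earlier versions give no guaranteed order).

>>> from collections import Counter
>>> c = Counter('abracadabra')
>>> c.most_common()
[('a', 5), ('b', 2), ('r', 2), ('c', 1), ('d', 1)]
>>> c.most_common(3)
[('a', 5), ('b', 2), ('r', 2)]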

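Arithmetic and set operations

The +, -, & and | operators also work on Counters. & and | return, for each element, the minimum and the maximum of the two counts respectively. Note that the resulting Counter drops any element whose count is less than 1.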
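
The four operators in a minimal sketch with two small counters:

```python
from collections import Counter

c = Counter(a=3, b=1)
d = Counter(a=1, b=2)

added = c + d         # add counts
subtracted = c - d    # subtract, keeping only positive counts
intersection = c & d  # minimum of each count
union = c | d         # maximum of each count

print(added, subtracted, intersection, union)
```

This differs from subtract(), which changes the counter in place and keeps zero and negative counts.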

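Other common operations

The official Python documentation lists several common Counter idioms, such as totalling all counts, listing the unique elements and retrieving the n least common elements.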
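
A sketch of several of those idioms on one sample counter. The counter contents and variable names are made up; the calls themselves are standard.

```python
from collections import Counter

c = Counter(a=3, b=1, c=0, d=-2)

total = sum(c.values())                  # total of all counts
unique = list(c)                         # unique elements
as_dict = dict(c)                        # convert to a regular dict
pairs = list(c.items())                  # list of (element, count) pairs
least_two = c.most_common()[:-2 - 1:-1]  # the 2 least common elements
positive = +c                            # drop zero and negative counts
c.clear()                                # reset all counts

print(total, unique, least_two, positive, c)
```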