Python Regular Expressions

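What is a regular expression

A regular expression (also called a regex or pattern expression) is a special sequence of characters that makes it easy to check whether a string matches a given pattern. It is typically used to search for and replace text that fits a pattern. Python's re module provides the language's regular expression support, and the rest of this article walks through it in detail.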

Python regular expression functions

Python's re module offers a number of functions for working with regular expressions. The sections below go through them one by one.

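1. re.compile()

# re.compile(pattern, flags=0)
# Compiles the regular expression pattern with the matching mode given by flags
# (default 0, meaning no flags) and returns a compiled pattern object
# (shown as SRE_Pattern in Python 2; it is re.Pattern in current Python 3).
# A pattern can be used for matching without compiling it first, but when the same
# pattern is used many times it is worth compiling it once and reusing the result:
# compilation takes time, and a compiled pattern can be reused without that work
# being repeated, which can improve performance.

# Example
import re

key = '121aaa111'
pattern = re.compile(r'\d+')
print(pattern)  # a pattern object: <_sre.SRE_Pattern object at 0x0000000002444588> in Python 2, re.compile('\\d+') in Python 3
value = re.search(pattern, key).group()
value2 = re.search(r'\d+', key).group()
print(value)   # 121
print(value2)  # 121

# The object returned by re.compile has methods that mirror the module-level functions
# (search, match, findall, finditer, split, sub, ...), but no compile method.
# For example:
#     key = '123fdd'
#     pattern = re.compile(r'\d+')
#     pattern.findall(key) is equivalent to re.findall(pattern, key)
# pattern.findall() passes the pattern object itself as the first argument.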
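The reuse case described above can be sketched as follows. This is a minimal illustration for Python 3; the helper name `count_numbers` and the sample lines are made up for the example.

```python
import re

# Compile once at module level, then reuse the same object for every line.
NUMBER = re.compile(r'\d+')

def count_numbers(lines):
    """Return how many runs of digits appear across all lines (hypothetical helper)."""
    return sum(len(NUMBER.findall(line)) for line in lines)

sample = ['121aaa111', 'no digits here', 'a1b22c333']
print(count_numbers(sample))  # 5
```

Keeping the compiled pattern in a named constant also gives the pattern a readable name at the call site.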

2. re.findall()

# re.findall(pattern, string, flags=0)
# pattern is the regular expression, string is the text to search, and flags selects the matching mode.
# The function finds all substrings of string that match the pattern and returns them as a list;
# if nothing matches, it returns an empty list.

key = '''first line second line third line'''
# Use findall with a precompiled pattern
pattern = re.compile(r"\w+")
print(re.findall(pattern, key))  # ['first', 'line', 'second', 'line', 'third', 'line']
# Call findall(key) directly on the returned object
print(pattern.findall(key))  # ['first', 'line', 'second', 'line', 'third', 'line']

3. re.search()

# re.search(pattern, string, flags=0)
# Scans string for the first place where the pattern matches and returns a match object for it,
# without looking any further. Returns None if nothing matches.

key = '''first line second line third line'''
pattern = re.compile(r"\w+")
print(pattern.search(key).group())  # first
print(re.search(pattern, key).group())  # first

key = '''@@@ line second line third line'''
pattern = re.compile(r"\w+")
print(pattern.search(key).group())  # line
print(re.search(pattern, key).group())  # line

4. re.match()

# re.match(pattern, string, flags=0)
# Works like search, except that match only tries to match at the beginning of the string:
# if the start does not match, it does not look any further. The return value is a match object
# (SRE_Match in Python 2), or None when there is no match.

key = '''first line second line third line'''
pattern = re.compile(r"\w+")
print(pattern.match(key).group())  # first
print(re.match(pattern, key).group())  # first

key = '''@@@ line second line third line'''
pattern = re.compile(r"\w+")
print(pattern.match(key).group())  # raises AttributeError: match returned None, so .group() cannot be called
print(re.match(pattern, key).group())  # raises AttributeError for the same reason
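Because match returns None when the start of the string does not match, the result is usually checked before .group() is called. The following is a small sketch of that check for Python 3; the helper name `first_word` is an assumption for illustration only.

```python
import re

def first_word(text):
    """Return the word at the very start of text, or None if text does not start with one."""
    m = re.match(r'\w+', text)
    # re.match returns None when the start of the string does not match,
    # so test the result before calling .group().
    return m.group() if m is not None else None

print(first_word('first line'))   # first
print(first_word('@@@ line'))     # None

# re.search scans the whole string; re.fullmatch requires the entire string to match.
print(re.search(r'\w+', '@@@ line').group())   # line
print(re.fullmatch(r'\w+', 'first line'))      # None
```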

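5. re.escape(pattern)

# re.escape(pattern)
# Escaping: if the text you want to match contains regex metacharacters, each of them has to be
# preceded by a backslash \ in the pattern in order to match itself. When there are many such
# characters, the pattern becomes messy and tedious to write, and this function does the work for you.
# In the escaped string the metacharacters lose their special meaning entirely. A typical use is to
# escape literal strings and then join them into a new regular expression.

key = r".+\d123"
pattern = re.escape(r".+\d123")
# Inspect the escaped string
print(pattern)  # \.\+\\d123 (an ordinary string)
# Inspect the match result
print(re.findall(pattern, key))  # ['.+\\d123']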
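The idea of escaping several literal strings and joining them into one pattern can be sketched like this. It is a minimal example for Python 3; `build_literal_pattern` is a hypothetical helper, not part of the re module.

```python
import re

def build_literal_pattern(literals):
    """Compile a pattern that matches any of the given strings literally (hypothetical helper)."""
    # Try longer strings first so that a literal is not shadowed by one of its prefixes.
    ordered = sorted(literals, key=len, reverse=True)
    return re.compile('|'.join(re.escape(s) for s in ordered))

pattern = build_literal_pattern(['1+1', '(x)', '.'])
print(pattern.findall('1+1 = 2. f(x) = 0.5'))  # ['1+1', '.', '(x)', '.']
```

Without re.escape, the characters +, ( and . in these strings would be read as metacharacters rather than matched literally.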

6. re.finditer()

# re.finditer(pattern, string, flags=0)
# Takes the same arguments and does the same job as findall. The difference is that findall returns
# a list, while finditer returns an iterator whose items are not strings but match objects
# (SRE_Match in Python 2); see re.match for how such objects are used.

key = '''first line second line third line'''
pattern = re.compile(r"\w+")
print(pattern.finditer(key))  # an iterator object
print(re.finditer(pattern, key))  # an iterator object
for i in pattern.finditer(key):
    print(i)
    print(i.group())  # use group() to get the matched text of each item
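Since each item produced by finditer is a match object, the position of every match is available along with its text. A short sketch for Python 3, using a made-up sample string:

```python
import re

key = 'first line second line'
pattern = re.compile(r'\w+')

# Collect the matched text together with its start and end offsets.
words = [(m.group(), m.start(), m.end()) for m in pattern.finditer(key)]
print(words)
# [('first', 0, 5), ('line', 6, 10), ('second', 11, 17), ('line', 18, 22)]
```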

7. re.split()

# re.split(pattern, string, maxsplit=0, flags=0)
# maxsplit sets the maximum number of splits (0 means no limit). The function uses the pattern to find
# the split points and returns a list of the resulting pieces; if the pattern never matches,
# the list contains just the original string.

key = '''first 111 line
second 222 line
third 333 line'''
# Split on runs of digits
pattern = re.compile(r'\d+')
print(pattern.split(key, maxsplit=1))  # ['first ', ' line\nsecond 222 line\nthird 333 line']
print(re.split(r'\d+', key))  # ['first ', ' line\nsecond ', ' line\nthird ', ' line']
# \.+ never matches, so the result is a list containing the original string
print(re.split(r'\.+', key, maxsplit=1))  # ['first 111 line\nsecond 222 line\nthird 333 line']
# The maxsplit argument
print(re.split(r'\d+', key, maxsplit=1))  # ['first ', ' line\nsecond 222 line\nthird 333 line']
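One detail the examples above do not show: when the pattern contains a capturing group, the matched separators are kept in the result list. A brief sketch for Python 3, with a shortened sample string chosen for illustration:

```python
import re

key = 'first 111 line second 222 line'

# Without a group, the separators are dropped.
without_group = re.split(r'\d+', key)
print(without_group)   # ['first ', ' line second ', ' line']

# With a capturing group, the matched separators are kept in the result.
with_group = re.split(r'(\d+)', key)
print(with_group)      # ['first ', '111', ' line second ', '222', ' line']

# maxsplit, passed by keyword, limits the number of splits.
limited = re.split(r'(\d+)', key, maxsplit=1)
print(limited)         # ['first ', '111', ' line second 222 line']
```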

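8. re.sub()

# re.sub(pattern, repl, string, count=0, flags=0)
# Replacement function: every substring matched by pattern is replaced with the string given by repl.
# count sets the maximum number of replacements (0 means replace all matches).

key = "the sum of 7 and 9 is [7+9]."
# Basic usage: replace the target with a fixed string
print(re.sub(r'\[7\+9\]', '16', key))  # the sum of 7 and 9 is 16.
# Advanced usage 1: reuse what was matched; \1 stands for the text captured by the first group in pattern
print(re.sub(r'\[(7)\+(9)\]', r'\2\1', key))  # the sum of 7 and 9 is 97.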

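A special use of sub: replacing with a function:

key = "the sum of 7 and 9 is [7+9]."

def replacement(m):
    p_str = m.group()
    if p_str == '7':
        return '77'
    if p_str == '9':
        return '99'
    return ''

# In this form every match is passed to the given function as a match object, the function is called
# once per match, and its return value is used as the replacement. A count can also be given as a final argument.
print(re.sub(r'\d', replacement, key))  # the sum of 77 and 99 is [77+99].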
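The count argument mentioned above also works together with a replacement function. The sketch below, for Python 3, uses a made-up function `double` that computes the replacement from the matched text:

```python
import re

key = 'the sum of 7 and 9 is [7+9].'

def double(m):
    """Replace a matched number with twice its value."""
    return str(int(m.group()) * 2)

all_replaced = re.sub(r'\d+', double, key)
first_two = re.sub(r'\d+', double, key, count=2)
print(all_replaced)  # the sum of 14 and 18 is [14+18].
print(first_two)     # the sum of 14 and 18 is [7+9].
```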

9. re.purge()

# re.purge() clears the regular expression cache; just call it with no arguments.

Those are the commonly used functions of the re module.

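Regular expression metacharacters

Standard character classes: escape sequences that each match any one of a whole set of ordinary characters, as listed below.

Class	Meaning
\d	Matches any single digit 0-9; equivalent to [0-9]
\D	Matches any non-digit character; equivalent to [^0-9]
\w	Matches any single letter, digit, or underscore; equivalent to [A-Za-z0-9_]
\W	Matches any character that is not a letter, digit, or underscore; equivalent to [^A-Za-z0-9_]
\s	Matches any whitespace character, including space, tab, and form feed; equivalent to [ \f\n\r\t\v]
\S	Matches any non-whitespace character; equivalent to [^ \f\n\r\t\v]
\n	Matches a newline
\r	Matches a carriage return
\t	Matches a tab
\v	Matches a vertical tab
\f	Matches a form feed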

Examples:

import re

key = '@132fjasldfj1231\n aaa,\n bbb\n'
pattern = re.compile(r'\d+')
print(pattern.search(key).group())  # 132
print(pattern.findall(key))  # ['132', '1231']

pattern0 = re.compile(r'\D+')
print(pattern0.search(key).group())  # @
print(pattern0.findall(key))  # ['@', 'fjasldfj', '\n aaa,\n bbb\n']

pattern1 = re.compile(r'\w+')
print(pattern1.search(key).group())  # 132fjasldfj1231
print(pattern1.findall(key))  # ['132fjasldfj1231', 'aaa', 'bbb']

pattern2 = re.compile(r'\W+')
print(pattern2.search(key).group())  # @
print(pattern2.findall(key))  # ['@', '\n ', ',\n ', '\n']

pattern3 = re.compile(r'\s+')
print(pattern3.search(key).group())  # a newline followed by a space, so the output looks empty
print(pattern3.findall(key))  # ['\n ', '\n ', '\n']

pattern4 = re.compile(r'\S+')
print(pattern4.search(key).group())  # @132fjasldfj1231
print(pattern4.findall(key))  # ['@132fjasldfj1231', 'aaa,', 'bbb']

pattern5 = re.compile(r'\n+')
print(pattern5.search(key).group())  # a newline, so the output looks empty
print(pattern5.findall(key))  # ['\n', '\n', '\n']

pattern6 = re.compile(r'\t+')
print(pattern6.search(key).group())  # raises AttributeError: there is no tab, so search returned None
print(pattern6.findall(key))  # []

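Special characters: some characters have special meanings, as shown in the table below:

Character	Meaning
\	Escape character; marks the next character as special, or makes a special character literal
^	Matches the position at the start of the string
$	Matches the position at the end of the string
*	Matches the preceding character or subexpression zero or more times
+	Matches the preceding character or subexpression one or more times
?	Matches the preceding character or subexpression zero or one time
.	The dot matches any single character except a newline ("\n")
|	Alternation (or)
[ ]	Character set
( )	Grouping; to match a literal parenthesis, use "\(" or "\)"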

import re

key = """Select **)()*^aaaa1214301asjl$@#! aaa 111, ^$&*@&dfjsalfjl"""
pattern = re.compile(r'[\w\*]+')  # * is escaped here (inside [] the backslash is optional)
print(pattern.findall(key))  # ['Select', '**', '*', 'aaaa1214301asjl', 'aaa', '111', '*', 'dfjsalfjl']
print(pattern.search(key).group())  # Select

# The string must start with Select, otherwise search returns None.
# [\s+] is a character set: one whitespace character or a literal +.
pattern1 = re.compile(r'^Select[\s+]')
print(pattern1.findall(key))  # ['Select ']
print(pattern1.search(key).group())  # 'Select ' (including the trailing space)

pattern2 = re.compile(r'\w+l$')  # anchored to the end of the string
print(pattern2.findall(key))  # ['dfjsalfjl']
print(pattern2.search(key).group())  # dfjsalfjl

pattern3 = re.compile(r'\w')
pattern31 = re.compile(r'\w+')
pattern32 = re.compile(r'\w*')
print(pattern3.findall(key))  # ['S', 'e', 'l', 'e', 'c', 't', 'a', 'a', 'a', ...
print(pattern3.search(key).group())  # S
print(pattern31.findall(key))  # ['Select', 'aaaa1214301asjl', 'aaa', '111', 'dfjsalfjl']
print(pattern31.search(key).group())  # Select
print(pattern32.findall(key))  # ['Select', '', '', '', '', '', '', '', '', 'aaaa1214301asjl', '', ...
print(pattern32.search(key).group())  # Select
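The table also lists alternation, character sets, and escaped parentheses. A small sketch of those entries for Python 3, with a sample string invented for the purpose:

```python
import re

text = 'cat bat rat (hat) c.t'

print(re.findall(r'cat|rat', text))    # ['cat', 'rat']   alternation
print(re.findall(r'[cb]at', text))     # ['cat', 'bat']   character set
print(re.findall(r'[^cb ]at', text))   # ['rat', 'hat']   negated character set
print(re.findall(r'\(\w+\)', text))    # ['(hat)']        escaped parentheses
print(re.findall(r'c.t', text))        # ['cat', 'c.t']   "." matches any character
print(re.findall(r'c\.t', text))       # ['c.t']          an escaped "." matches a literal dot
```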

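Quantifiers: these limit how many times the preceding item is matched

Quantifier	Meaning
*	Matches the preceding character or subexpression zero or more times
+	Matches the preceding character or subexpression one or more times
?	Matches the preceding character or subexpression zero or one time
{n}	n is a non-negative integer; matches exactly n times
{n,}	n is a non-negative integer; matches at least n times
{n,m}	n and m are non-negative integers with n<=m; matches at least n and at most m times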
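The counted forms in the table can be tried out as follows. This is a minimal sketch for Python 3 with made-up sample strings; \b is used only so that each run of letters is tested as a whole word.

```python
import re

text = 'a aa aaa aaaa'

exactly_two = re.findall(r'\ba{2}\b', text)
at_least_two = re.findall(r'\ba{2,}\b', text)
two_to_three = re.findall(r'\ba{2,3}\b', text)
optional_u = re.findall(r'colou?r', 'color colour')

print(exactly_two)    # ['aa']
print(at_least_two)   # ['aa', 'aaa', 'aaaa']
print(two_to_three)   # ['aa', 'aaa']
print(optional_u)     # ['color', 'colour']
```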

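Anchors: these constrain where a match may occur, such as where it must start or end

Anchor	Meaning
^	Matches the position at the start of the string
$	Matches the position at the end of the string
\b	Matches a word boundary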
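A quick way to see the effect of each anchor is to run the same word through findall with and without it. A minimal sketch for Python 3; the sample string is invented for illustration.

```python
import re

text = 'cat concat cat'

print(re.findall(r'cat', text))       # ['cat', 'cat', 'cat']   no anchor
print(re.findall(r'^cat', text))      # ['cat']                 only at the start
print(re.findall(r'cat$', text))      # ['cat']                 only at the end
print(re.findall(r'\bcat\b', text))   # ['cat', 'cat']          whole words only

# The start offsets show which occurrences \b accepted.
whole_word_starts = [m.start() for m in re.finditer(r'\bcat\b', text)]
print(whole_word_starts)              # [0, 11]
```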

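Matching modes (flags) of the re module

Flag	Description
re.I	Makes matching case-insensitive
re.L	Makes matching locale-aware
re.M	Multi-line matching; affects ^ and $
re.S	Makes . match every character, including newlines
re.U	Interprets characters according to the Unicode character set; this flag affects \w, \W, \b, \B
re.X	Verbose mode: allows a more flexible layout so that the pattern can be written in a more readable way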
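The effect of several of these flags can be seen by running one pattern with and without the flag. The following is a sketch for Python 3; the sample text and the `date` pattern are assumptions made for the example.

```python
import re

text = 'Hello\nhello world'

print(re.findall(r'hello', text))               # ['hello']
print(re.findall(r'hello', text, re.I))         # ['Hello', 'hello']
print(re.findall(r'^\w+', text))                # ['Hello']
print(re.findall(r'^\w+', text, re.M))          # ['Hello', 'hello']
print(re.findall(r'Hello.hello', text))         # []
print(re.findall(r'Hello.hello', text, re.S))   # ['Hello\nhello']

# re.X lets a pattern be spread over several lines with comments;
# flags are combined with the | operator.
date = re.compile(r'''
    (\d{4})   # year
    -(\d{2})  # month
    -(\d{2})  # day
''', re.X | re.I)
print(date.findall('released 2016-10-26'))      # [('2016', '10', '26')]
```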

Using groups in the re module

(...) is a group and captures by default, meaning the text matched by the group can be retrieved on its own. Each group has an index, starting from 1, determined by the order of the opening "(".

import re

s = '123123asasas1212aaasss1'
p = r'([a-zA-Z]*)(\d*)'
pattern = re.compile(p)
print(pattern.findall(s))  # [('', '123123'), ('asasas', '1212'), ('aaasss', '1'), ('', '')]
print(re.search(pattern, s).group())  # 123123
print(re.match(pattern, s).group())  # 123123

# \D matches a non-digit; equivalent to [^0-9]
# \s matches any whitespace character; equivalent to [ \t\n\r\f\v]
s = r'123123asasas 1212aaa sss 1'
p = r'([a-zA-Z]*\s)(\d*)'
pattern = re.compile(p)
print(pattern.findall(s))  # [('asasas ', '1212'), ('aaa ', ''), ('sss ', '1')]

# \S matches a non-whitespace character; equivalent to [^ \t\n\r\f\v]
s = r'123123asasas 1212aaa sss 1'
p = r'(\S+)(\d*)'
pattern = re.compile(p)
print(pattern.findall(s))  # [('123123asasas', ''), ('1212aaa', ''), ('sss', ''), ('1', '')]

# \w matches any single digit, letter, or underscore; equivalent to [a-zA-Z0-9_]
s = r'123123asasas _12_ 1212aaa sss 1'
p = r'(\w+)(\w+)'
pattern = re.compile(p)
print(pattern.findall(s))  # [('123123asasa', 's'), ('_12', '_'), ('1212aa', 'a'), ('ss', 's')]

# \W matches any character that is not a digit, letter, or underscore; equivalent to [^a-zA-Z0-9_]
s = r'12738647@qq.com//@'
p = r'(\W+)qq(\W+)com(\W+)'
pattern = re.compile(p)
print(pattern.findall(s))  # [('@', '.', '//@')]

# \b matches the empty string at the start or end of a word, that is, the boundary
# between a word character and a non-word character (in either order)
s = r'absd @bsc //bsc@ 1234'
p = r'(\b\w+\b)'
pattern = re.compile(p)
print(pattern.findall(s))  # ['absd', 'bsc', 'bsc', '1234']

# \B matches a position that is not a word boundary: between two word characters
# or between two non-word characters
s = r' @absd @bsc ///bs___adc 1234@qq.com'
p = r'(\B.+\B)'
pattern = re.compile(p)
print(pattern.findall(s))  # [' @absd @bsc ///bs___adc 1234@qq.co']
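The examples above read groups through findall. Groups can also be read by index or by name from a match object, as in this sketch for Python 3 (the sample strings are chosen for illustration):

```python
import re

m = re.search(r'(\d{4})/(\d{2})/(\d{2})', 'Hello, Mr.Gumby : 2016/10/26')

print(m.group())      # 2016/10/26   (group 0 is the whole match)
print(m.group(1))     # 2016
print(m.group(2, 3))  # ('10', '26')
print(m.groups())     # ('2016', '10', '26')

# Named groups, written (?P<name>...), can be read by name as well as by index.
named = re.search(r'(?P<year>\d{4})/(?P<month>\d{2})', '2016/10/26')
print(named.group('year'), named.group(2))   # 2016 10
print(named.groupdict())                     # {'year': '2016', 'month': '10'}

# (?:...) groups without capturing, so it does not receive an index.
non_capturing = re.findall(r'(?:ab)+(\d)', 'abab1 ab2')
print(non_capturing)                         # ['1', '2']
```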

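Using lookaround in the re module

Lookaround goes by other names as well, such as delimiting, assertion, or pre-search; the terminology varies.
Lookaround is a special piece of regex syntax: what it matches is not text but a position. In effect, the pattern states what must or must not appear to the left or right of a position, and the engine then looks for such a position.
There are four lookaround forms: (?=...), (?!...), (?<=...), and (?<!...). Basic usage is as follows:

import re

s = 'Hello, Mr.Gumby : 2016/10/26 Hello,r.Gumby: 2016/10/26'
# No lookaround
print(re.compile(r"(?P<name>\w+\.\w+)").findall(s))  # ['Mr.Gumby', 'r.Gumby']
# The text to the left of the lookaround position is "Hello, "
print(re.compile(r"(?<=Hello, )(?P<name>\w+\.\w+)").findall(s))  # ['Mr.Gumby']
# The text to the left of the lookaround position is not ","
print(re.compile(r"(?<!,)(?P<name>\w+\.\w+)").findall(s))  # ['Mr.Gumby']
# The text to the right of the lookaround position is "M"
print(re.compile(r"(?=M)(?P<name>\w+\.\w+)").findall(s))  # ['Mr.Gumby']
# The text to the right of the lookaround position is not "r"
print(re.compile(r"(?!r)(?P<name>\w+\.\w+)").findall(s))  # ['Mr.Gumby']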
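Because lookaround matches positions rather than text, it can be used to insert something at a position without consuming any characters. The sketch below, for Python 3, is one common illustration; the helper name `add_thousands_separators` and the sample inputs are assumptions, not part of the original examples.

```python
import re

def add_thousands_separators(digits):
    """Insert commas into a string of digits, e.g. '1234567' -> '1,234,567'."""
    # The pattern matches empty positions only: there must be a digit on the left,
    # and a whole number of three-digit groups up to the end of the string on the right.
    return re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', digits)

print(add_thousands_separators('1234567'))  # 1,234,567
print(add_thousands_separators('999'))      # 999

# The lookahead is not part of the match, so only the number itself is returned.
usd_amounts = re.findall(r'\d+(?= USD)', '15 USD, 20 EUR, 7 USD')
print(usd_amounts)                          # ['15', '7']
```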