Getting to Know Python's Regular Expression Module re

Date: 2019-11-19


Python's regular expression module re

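　　For the matching, searching, and replacing functions built into Python's re module, see the blog post "python附录-re.py模块源码（含re官方文档连接）" (an appendix covering the re.py module source, with a link to the official re documentation).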

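　　Regular expressions are used to process strings. The post "python-基础学习篇（二）" (Python basics, part 2) mentioned that the string type has built-in methods for processing strings, but those built-in methods only suit simple string handling; complex string processing remains the domain of regular expressions. As for why the built-in methods exist at all, my personal view is that simple string handling should not require a full, systematic knowledge of regular expressions, and that this is also part of what makes Python easy to pick up.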

　　Python's regular expression support lives in the re module, which must be imported before regular expressions can be used.

import re

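　　With the regular expression basics covered earlier, we can now write some regular expressions (pattern). How do we use one to process a string (string)? Only through the functions built into the re module.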

　　Functions built into the re module

　　re.compile(pattern, flags=0)

　　re.compile() compiles a regular expression into a regular expression object (PatternObj). The returned object is what the other string-processing methods are called on.

pattern_obj = re.compile(pattern)
match_obj = pattern_obj.match(string)

　　is equivalent to

match_obj = re.match(pattern, string)

　　In fact, re.match() performs the re.compile() step internally. The source of match():

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

　　As this shows, what match() returns is simply the result of calling match() on the compiled pattern object pattern_obj.
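　　The equivalence can be checked with a small sketch. The log lines and the pattern below are hypothetical, chosen only for illustration; the point is that a pattern compiled once and reused gives the same results as calling re.match() with the pattern string each time.

```python
import re

# Hypothetical sample data, used only to illustrate the equivalence.
lines = ['error: disk full', 'warning: low memory', 'error: timeout']

# Compile once, then reuse the pattern object for every line.
error_pattern = re.compile(r'error: (\w+)')


def first_group(match_obj):
    """Return group 1 of a match object, or None if there was no match."""
    return match_obj.group(1) if match_obj else None


compiled_results = [first_group(error_pattern.match(line)) for line in lines]
direct_results = [first_group(re.match(r'error: (\w+)', line)) for line in lines]

print(compiled_results)                    # ['disk', None, 'timeout']
print(compiled_results == direct_results)  # True
```

　　Compiling explicitly mainly pays off for readability when the same pattern is used in many places, since the module-level functions go through the same compilation step internally.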

　　re.search(pattern, string, flags=0)

　　re.search() scans the whole string, finds the first part that matches the regular expression, and returns a match object (MatchObject). If nothing matches, it returns None.

import re


pattern = r'the'
match_obj = re.search(pattern, 'The dog is eating the bone', re.I)
print(match_obj.group(0))
print(match_obj)


# The
# <re.Match object; span=(0, 3), match='The'>

　　re.match(pattern, string, flags=0)

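　　re.match() matches from the beginning of the string. If the characters at the start of the string match the pattern, it returns a match object (MatchObject), even when only part of the string is matched; otherwise it returns None.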

import re


pattern = r'the'
match_obj = re.match(pattern, 'Dog is eating the bone', re.I)
print(match_obj)

# None

　　For comparison:

import re


pattern = r'the'
match_obj = re.match(pattern, 'The dog is eating the bone', re.I)
print(match_obj.group(0))
print(match_obj)

# The
# <re.Match object; span=(0, 3), match='The'>

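　　Comparing re.search() and re.match(): in terms of position, search() can match anywhere in the string, whereas match() must match at the beginning of the string and fails as soon as the beginning does not match. In terms of result, search() scans the string from left to right and returns a match object for the first location where the pattern matches, while match() only ever tries the pattern at the start of the string. In multiline mode, match() still matches only at the start of the string, whereas search() combined with "^" can match at the start of any line.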
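　　The multiline behaviour can be illustrated with a short sketch. The two-line string below is a made-up example, not taken from the original post.

```python
import re

# Hypothetical two-line input; 'the' appears only at the start of line 2.
text = 'first line\nthe second line'

# match() only tries position 0 of the string, even with re.M.
match_result = re.match(r'the', text, re.M)
print(match_result)                 # None

# search() with '^' and re.M can match at the start of any line.
search_multiline = re.search(r'^the', text, re.M)
print(search_multiline.span())      # (11, 14)

# Without re.M, '^' only matches at the very start of the string.
search_plain = re.search(r'^the', text)
print(search_plain)                 # None
```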

　　re.fullmatch(pattern, string, flags=0)

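　　re.fullmatch() is similar to re.match() in that matching starts at the beginning of the string. re.match() may match only part of the string or all of it, whereas re.fullmatch() must match all of it: it returns a match object (MatchObject) if and only if the regular expression matches the entire string, and None otherwise.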
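　　A minimal sketch of the difference, using a four-digit pattern and date strings chosen purely for illustration:

```python
import re

pattern = r'\d{4}'

# match() succeeds because the string starts with four digits.
prefix_match = re.match(pattern, '2019-11-19')
print(prefix_match.group(0))    # 2019

# fullmatch() fails because '-11-19' is left over after the four digits.
whole_fail = re.fullmatch(pattern, '2019-11-19')
print(whole_fail)               # None

# fullmatch() succeeds only when the pattern covers the entire string.
whole_ok = re.fullmatch(pattern, '2019')
print(whole_ok.group(0))        # 2019
```

　　This makes fullmatch() a natural fit for validating a whole input value, where a partial match at the start would not be enough.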

　　re.split(pattern, string, maxsplit=0, flags=0)

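　　re.split() splits string at each place the regular expression pattern matches and returns the pieces as a list. maxsplit is the maximum number of splits: if it is nonzero, at most maxsplit splits are made, from left to right, and the rest of the string is returned as the last element of the list. The default of 0 means no limit.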

import re


split_list_default = re.split(r'\W+', 'Words, words, words.')
print(split_list_default)

# ['Words', 'words', 'words', '']
# \W+ splits on runs of one or more non-word characters. The trailing empty
# string is what remains after the final separator '.'.

split_list_max = re.split(r'\W+', 'Words, words, words.', maxsplit=1)
print(split_list_max)

# ['Words', 'words, words.']
# With maxsplit=1 only the leftmost split is made; the rest of the string
# is returned as the last element.

split_list_couple = re.split(r'(\W+)', 'Words, words, words.')
print(split_list_couple)

# ['Words', ', ', 'words', ', ', 'words', '.', '']
# With a capturing group (\W+), the matched separators are also included in the list.

　　re.findall(pattern, string, flags=0)

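　　re.findall() is similar to re.search(). re.search() returns a match object (MatchObject) for the first part of the string that matches the regular expression, whereas re.findall() finds all non-overlapping matches in the string and returns them as a list, in the order they are found scanning from left to right. If nothing matches, it returns an empty list.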

import re


pattern = r'\d{3}'
find = re.findall(pattern, 'include21321exclude13243alert213lib32')
print(find)

# ['213', '132', '213']

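　　Note: when the regular expression passed to re.findall() contains two or more groups, the groups are matched from left to right and each match's group contents are collected into a tuple, so the returned list consists of tuples.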

import re


pattern = r'(\d{3})(1)'
find = re.findall(pattern, 'include21321exclude13243alert213lib32')
print(find)

# [('132', '1')]
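　　For contrast, here is a sketch of how the shape of the result changes with the number of groups, reusing the same input string. The no-group and one-group patterns are variations added here for illustration.

```python
import re

text = 'include21321exclude13243alert213lib32'

# No groups: each element is the whole matched text.
no_group = re.findall(r'\d{3}1', text)
print(no_group)     # ['1321']

# Exactly one group: each element is the text captured by that group.
one_group = re.findall(r'(\d{3})1', text)
print(one_group)    # ['132']

# Two or more groups: each element is a tuple of the captured texts.
two_groups = re.findall(r'(\d{3})(1)', text)
print(two_groups)   # [('132', '1')]
```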

　　re.finditer(pattern, string, flags=0)

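　　re.finditer() is similar to re.findall(): it searches the string for every match of the regular expression, but returns an iterator that yields a match object (MatchObject) for each match. In other words, each piece of matched text is wrapped in a match object, and the iterator produces those match objects one by one.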

import re


pattern = r'\d{3}'
find = re.finditer(pattern, 'include21321exclude13243alert213lib32')
print(find)
for i in find:
    print(i)
    print(i.group(0))

# <callable_iterator object at 0x00000000028FB0F0>
# <re.Match object; span=(7, 10), match='213'>
# 213
# <re.Match object; span=(19, 22), match='132'>
# 132
# <re.Match object; span=(29, 32), match='213'>
# 213

　　re.sub(pattern, repl, string, count=0, flags=0)

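　　re.sub() matches the regular expression against string, replaces each matched part with repl, and returns the resulting string. Matching proceeds from left to right and may find several matches, and replacement proceeds in the same order. To replace only the leftmost matches, set count, which must be a non-negative integer: at most count replacements are made, and the default of 0 replaces every match. If nothing matches, no replacement is made and the original string is returned.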

import re


pattern = r'\d+'
find_default = re.sub(pattern, ' ', 'include21321exclude13243alert213lib32')
print(find_default)

find_count = re.sub(pattern, ' ', 'include21321exclude13243alert213lib32', count=2)
print(find_count)

# include exclude alert lib   (ends with a space, since the final '32' is also replaced)
# include exclude alert213lib32

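　　Note: repl can be a string or a function. If repl is a function, it must accept a single match object (MatchObject) argument; it is called for each match, and the string it returns is used as the replacement for that match.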

import re


def replace_func(match_obj):
    if match_obj.group(0).isdigit():
        return ' '
    else:
        return '-'


pattern = r'\d+'
find_default = re.sub(pattern, replace_func,  'include21321exclude13243alert213lib32')
print(find_default)

# include exclude alert lib   (ends with a space, since the final '32' is also replaced)

　　re.subn(pattern, repl, string, count=0, flags=0)

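　　re.subn() does the same thing as re.sub(); only the return value differs. re.sub() returns the new string, whereas re.subn() returns a tuple made up of the new string and the number of replacements made.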

import re


def replace_func(match_obj):
    if match_obj.group(0).isdigit():
        return ' '
    else:
        return '-'


pattern = r'\d+'
find_default = re.subn(pattern, replace_func,  'include21321exclude13243alert213lib32')
print(find_default)

# ('include exclude alert lib ', 4)

　　re.escape(pattern)

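　　Escapes the characters in pattern that have a special meaning in regular expressions. This is mainly useful when the literal text to be matched itself contains regular expression metacharacters.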

import re


result = re.escape(r'\d*')
print(result)

# \\d\*
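　　A sketch of why escaping matters when searching for literal text. The strings below are hypothetical; the point is that an unescaped "+" is read as a quantifier, while the escaped form matches the text literally.

```python
import re

# Hypothetical literal text to look for, and a string to search in.
literal = '1+1'
text = '1+1=2, 11'

# Unescaped, '1+1' means "one or more '1' followed by '1'", so it finds '11'.
unescaped = re.findall(literal, text)
print(unescaped)    # ['11']

# Escaped, the '+' is treated as an ordinary character.
escaped = re.findall(re.escape(literal), text)
print(escaped)      # ['1+1']
```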

　　re.purge()

　　Clears the regular expression cache.

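　　The flags parameter

　　The functions above take a parameter flags with a default of 0. Passing specific values for flags in the call selects the matching mode. Commonly used values are:

　　re.I (re.IGNORECASE): case-insensitive matching;

　　re.M (re.MULTILINE): multiline mode, in which "^" and "$" also match at the start and end of each line;

　　re.S (re.DOTALL): single-line mode, in which "." also matches a newline;

　　re.X (re.VERBOSE): verbose (commented) mode, in which whitespace in the pattern is ignored and "#" comments are allowed.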
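　　The following sketch shows each of these flags in use, including combining two flags with "|". The sample text and patterns are assumptions made up for this illustration.

```python
import re

# Hypothetical two-line input.
text = 'Name: Snow\nname: Stack'

# re.I | re.M: ignore case, and let '^' match at the start of every line.
names = re.findall(r'^name: (\w+)', text, re.I | re.M)
print(names)                    # ['Snow', 'Stack']

# Without re.S, '.' does not match the newline between the two lines.
without_dotall = re.search(r'Snow.name', text)
print(without_dotall)           # None

# With re.S, '.' matches the newline as well.
with_dotall = re.search(r'Snow.name', text, re.S)
print(repr(with_dotall.group(0)))   # 'Snow\nname'

# re.X: whitespace in the pattern is ignored and '#' starts a comment.
verbose_pattern = re.compile(r'''
    (\w+)   # key
    :\s*    # separator
    (\w+)   # value
''', re.X)
pair = verbose_pattern.match(text).groups()
print(pair)                     # ('Name', 'Snow')
```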

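　　Regular expression objects (Pattern)

　　A regular expression object can call the functions above directly as methods, as shown earlier in the discussion of re.compile() and re.match(). Everything from match() through subn() is available as an instance method of the Pattern object. Its instance attributes are: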

　　Pattern.flags

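　　The matching flags of the pattern. The attribute is read-only: it cannot be assigned (for example, Pattern.flags = re.I raises an AttributeError), so it is used to retrieve the flags the pattern was compiled with.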

　　Pattern.groups

　　Returns the number of capturing groups in the pattern.

　　Pattern.pattern

　　Returns the original regular expression string.
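　　A short sketch of reading these attributes from a compiled pattern. The pattern itself is only an example. Note that for str patterns the flags value also includes flags Python sets implicitly, so the sketch tests for re.I with a bitwise AND rather than comparing for equality.

```python
import re

pattern_obj = re.compile(r'(\w+) (\w+)', re.I)

print(pattern_obj.pattern)      # (\w+) (\w+)
print(pattern_obj.groups)       # 2

# Check whether re.I is among the flags the pattern was compiled with.
has_ignorecase = bool(pattern_obj.flags & re.I)
print(has_ignorecase)           # True

# The attributes are read-only; assigning to one raises AttributeError.
try:
    pattern_obj.flags = re.M
    assignment_failed = False
except AttributeError:
    assignment_failed = True
print(assignment_failed)        # True
```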

　　Match objects (MatchObject)

　　A match object wraps the content that was matched.

　　MatchObject.group(num)

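　　Returns the matched content held by the match object. group(0) returns the entire match; an argument of 1 or greater returns the content of the corresponding capturing group.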

import re


pattern = r'(\w+) (\w+)'
match_obj = re.match(pattern, 'Snow Stack')
print(match_obj.group(0))
print(match_obj.group(1))
print(match_obj.group(2))

# Snow Stack
# Snow
# Stack

　　MatchObject.__getitem__(num)

　　Does the same as MatchObject.group(num).

import re


pattern = r'(\w+) (\w+)'
match_obj = re.match(pattern, 'Snow Stack')
print(match_obj[0])
print(match_obj[1])
print(match_obj[2])

# Snow Stack
# Snow
# Stack

　　MatchObject.groups()

　　Returns the contents of all capturing groups as a tuple. Only the captured groups are returned, not the rest of the matched text.

import re


pattern = r'(\w+) (\w+)'
match_obj = re.match(pattern, 'Snow Stack')
print(match_obj.groups())

# ('Snow', 'Stack')

　　MatchObject.groupdict()

　　Returns a dictionary containing all named groups. Each key is a group name and each value is the text matched by that group.

import re


pattern = r'(?P<first_name>\w+) (?P<last_name>\w+)'
match_obj = re.match(pattern, 'Snow Stack')
print(match_obj.groupdict())

# {'first_name': 'Snow', 'last_name': 'Stack'}