Python Study Notes: Capture Groups and Special Matching Strings

Date: 2020-05-06


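In the previous article we introduced Python's character classes and took a closer look at metacharacters. This article covers capture groups and special matching strings. The previous article is here: https://www.cnblogs.com/dustman/p/10036661.html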

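Capture groups
A group is created by surrounding part of a regular expression with parentheses. This means a whole group can be given as the argument to a metacharacter such as * or ?.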

import re

pattern = r"python(ice)*"
string1 = "python!"
string2 = "ice"
string3 = "pythonice"

# re.match only matches at the start of the string
match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)
match3 = re.match(pattern, string3)

if match1:
    print(match1.group())
    print("match 1")
if match2:
    print(match2.group())
    print("match 2")
if match3:
    print(match3.group())
    print("match 3")

Output:

>>>
python
match 1
pythonice
match 3
>>>

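In the example above, (ice) is a capture group, so * applies to the whole sequence "ice" rather than to a single character.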
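To make the effect of the parentheses concrete, here is a minimal sketch that compares the grouped pattern with an ungrouped variant. The ungrouped pattern and the test string "pythoniceice" are illustrative assumptions, not part of the original example.

```python
import re

text = "pythoniceice"  # illustrative input, not from the original article

# With parentheses, * repeats the whole group "ice".
grouped = re.match(r"python(ice)*", text)
# Without parentheses, * repeats only the preceding character "e".
ungrouped = re.match(r"pythonice*", text)

grouped_text = grouped.group()
ungrouped_text = ungrouped.group()
print(grouped_text)    # pythoniceice
print(ungrouped_text)  # pythonice
```

The grouped pattern consumes both repetitions of "ice", while the ungrouped one stops after the first because only "e" may repeat.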

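In the earlier articles on metacharacters and character classes we already used the group function to access matched content. group(0) or group() returns the whole match, group(n) with n greater than 0 returns the nth group, and groups() returns a tuple containing all capture groups.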

import re

pattern = r"j(av)(ap)(yt(h)o)n"
string = "javapythonhtmlmysql"

match = re.match(pattern, string)
if match:
    print(match.group())    # whole match
    print(match.group(0))   # same as group()
    print(match.group(1))   # first group
    print(match.group(2))   # second group
    print(match.groups())   # all groups as a tuple

Output:

>>>
javapython
javapython
av
ap
('av', 'ap', 'ytho', 'h')
>>>

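Capture groups can also be nested: one group can sit inside another, as (h) sits inside (yt(h)o) above. Groups are numbered by the position of their opening parenthesis, counting from left to right.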
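The numbering of nested groups can be checked with a small sketch. The date pattern and the sample string below are assumptions chosen for illustration only.

```python
import re

# Hypothetical pattern: the outer group (year-month) opens first,
# so it becomes group 1; the groups inside it are 2 and 3.
match = re.match(r"((\d{4})-(\d{2}))-(\d{2})", "2020-05-06")

year_month = match.group(1)
year = match.group(2)
month = match.group(3)
day = match.group(4)
all_groups = match.groups()
print(all_groups)  # ('2020-05', '2020', '05', '06')
```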

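There are two special kinds of groups: named groups and non-capturing groups.
A named group has the format (?P<name>...), with an uppercase P, where name is the name of the group and ... is the expression to match. It behaves exactly like a normal group, except that besides its index it can also be accessed with group(name).
A non-capturing group has the format (?:...). It takes part in matching but does not capture the result and is not assigned a group number, so it cannot be referred to later in the expression or in the program.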

import re

pattern = r"(?P<python>123)(?:456)(789)"
string = "123456789"

match = re.match(pattern, string)
if match:
    print(match.group("python"))  # access the named group by name
    print(match.groups())         # (?:456) is not captured

Output:

>>>
123
('123', '789')
>>>
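As a further sketch, named groups can be collected into a dictionary with groupdict(), and a non-capturing group is handy when a quantifier must apply to several characters without adding a group number. The e-mail-like pattern and the address are hypothetical and deliberately simplistic.

```python
import re

# Hypothetical pattern: (?:\w+\.)+ repeats "label." without capturing it.
pattern = r"(?P<user>\w+)@(?P<host>(?:\w+\.)+\w+)"
match = re.match(pattern, "alice@mail.example.com")

by_name = match.group("user")   # access by name
by_index = match.group(1)       # the same group by index
named = match.groupdict()       # all named groups as a dict
all_groups = match.groups()     # only the two capturing groups appear

print(by_name)     # alice
print(named)       # {'user': 'alice', 'host': 'mail.example.com'}
print(all_groups)  # ('alice', 'mail.example.com')
```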

The metacharacter | means "or": red|blue matches either red or blue.

import re

string1 = "python"
string2 = "pyihon"
string3 = "pylhon"
pattern = r"py(t|i)hon"  # "t" or "i" in the third position

match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)
match3 = re.match(pattern, string3)

if match1:
    print(match1.group())
    print("match 1")
if match2:
    print(match2.group())
    print("match 2")
if match3:
    print(match3.group())
    print("match 3")

Output:

>>>
python
match 1
pyihon
match 2
>>>
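The parentheses in py(t|i)hon matter because | otherwise splits the entire pattern. A short sketch of the difference, using the hypothetical words "gray" and "grey":

```python
import re

# Without a group, | separates the whole alternatives "gra" and "ey".
loose = re.compile(r"gra|ey")
# Inside a group, | only chooses between "a" and "e".
tight = re.compile(r"gr(a|e)y")

words = ("gray", "grey", "gra", "ey")
loose_hits = [bool(loose.fullmatch(w)) for w in words]
tight_hits = [bool(tight.fullmatch(w)) for w in words]

print(loose_hits)  # [False, False, True, True]
print(tight_hits)  # [True, True, False, False]
```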

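Special matching strings
Special sequences
Regular expressions support a variety of special sequences. They are written as a backslash followed by another character.
One useful special sequence is a backslash followed by a number from 1 to 99, for example \1. The number refers to the capture group with that number, so a pattern can refer back to an earlier group.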

import re

string1 = "html python"
string2 = "python python"
string3 = "java java"
pattern = r"(.+) \1"  # some text, a space, then the same text again

match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)
match3 = re.match(pattern, string3)

if match1:
    print(match1.group())
    print("match 1")
if match2:
    print(match2.group())
    print("match 2")
if match3:
    print(match3.group())
    print("match 3")

Output:

>>>
python python
match 2
java java
match 3
>>>

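Note: (.+) \1 is not the same as (.+) (.+). \1 refers to the text that the first group actually matched, not to the group's pattern, so the second part must repeat that text exactly.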
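The difference described in the note can be sketched by running both patterns against the same string. The doubled-word search at the end is an additional illustrative use, with a made-up sentence.

```python
import re

backref = r"(.+) \1"     # second part must equal the text group 1 captured
repeated = r"(.+) (.+)"  # second part may be any text

text = "html python"
backref_match = re.match(backref, text)
repeated_match = re.match(repeated, text)
repeated_groups = repeated_match.groups()

print(backref_match)    # None
print(repeated_groups)  # ('html', 'python')

# Illustrative use: find an accidentally doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
doubled_text = doubled.group()
print(doubled_text)     # is is
```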

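Other useful special sequences are \d, \s, and \w, which match digits, whitespace, and word characters respectively. In ASCII mode they are equivalent to [0-9], [ \t\n\r\f\v], and [a-zA-Z0-9_]. In Unicode mode, the default for str patterns in Python 3, they match other characters as well; for example, \w also matches accented letters.
The uppercase versions \D, \S, and \W mean the opposite of the lowercase ones. For example, \D matches anything that is not a digit.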

import re

string1 = "python 2017!"
string2 = "1,00,867!"
string3 = "!@#?"
pattern = r"(\D+\d)"  # one or more non-digits, then one digit

match1 = re.match(pattern, string1)
match2 = re.match(pattern, string2)
match3 = re.match(pattern, string3)

if match1:
    print(match1.group())
    print("match 1")
if match2:
    print(match2.group())
    print("match 2")
if match3:
    print(match3.group())
    print("match 3")

Output:

>>>
python 2
match 1
>>>

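(\D+\d) matches one or more non-digits followed by a single digit. The second string fails because it starts with a digit, and the third because it contains no digit at all.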
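The difference between ASCII mode and Unicode mode can be observed with the re.ASCII flag. This is a minimal sketch; the sample strings are assumptions chosen for illustration.

```python
import re

word = "caf\u00e9"  # "café", with an accented letter

# Unicode mode is the default for str patterns in Python 3.
unicode_ok = bool(re.fullmatch(r"\w+", word))
# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_ok = bool(re.fullmatch(r"\w+", word, re.ASCII))

print(unicode_ok)  # True
print(ascii_ok)    # False

# \w includes the underscore but not the hyphen.
word_chars = re.findall(r"\w", "a_1-")
# \S+ collects runs of non-whitespace characters.
tokens = re.findall(r"\S+", "x \t y\n")

print(word_chars)  # ['a', '_', '1']
print(tokens)      # ['x', 'y']
```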

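Special matches
Further special sequences are \A, \Z, and \b. \A matches only at the start of the string; in most cases it works the same as ^ in a pattern. \Z matches only at the end of the string; in most cases it works the same as $.
\b matches a word boundary: a position where a word character (\w) sits next to a non-word character (\W) or next to the start or end of the string. In other words, it matches the empty string between \w and \W.
\B matches a position that is not a word boundary, that is, one where the characters on both sides are of the same kind: both word characters or both non-word characters. For this purpose the start and end of a string count as non-word.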

import re

string1 = "The dog eat!"
string2 = "<dog>dog<>?"
string3 = "dogeatpython"
pattern = r"\b(dog)\b"  # "dog" as a whole word

# re.search scans the whole string, not just the start
search1 = re.search(pattern, string1)
search2 = re.search(pattern, string2)
search3 = re.search(pattern, string3)

if search1:
    print(search1.group())
    print("search 1")
if search2:
    print(search2.group())
    print("search 2")
if search3:
    print(search3.group())
    print("search 3")

Output:

>>>
dog
search 1
dog
search 2
>>>

Note: a word boundary is not part of the matched text. In other words, the boundary itself has length 0, so \b(dog)\b matches just "dog".
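The cases where \A and \Z differ from ^ and $ can be sketched as follows, together with a small \B example. The multi-line text and the word list are assumptions for illustration.

```python
import re

text = "dog\ncat\n"  # illustrative two-line string with a trailing newline

# Under re.MULTILINE, ^ matches at every line start; \A only at the very start.
caret_hits = re.findall(r"^\w+", text, re.MULTILINE)
start_hits = re.findall(r"\A\w+", text, re.MULTILINE)
print(caret_hits)  # ['dog', 'cat']
print(start_hits)  # ['dog']

# $ also matches just before a trailing newline; \Z only at the very end.
dollar_hit = re.search(r"cat$", text)
end_hit = re.search(r"cat\Z", text)
print(dollar_hit is not None)  # True
print(end_hit)                 # None

# \B requires a position that is not a word boundary,
# so only the "dog" inside "hotdogs" qualifies.
inner = [m.start() for m in re.finditer(r"\Bdog\B", "dog hotdogs dog")]
print(inner)  # [7]
```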

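"A great marriage is not when the 'perfect couple' comes together. It is when an imperfect couple learns to enjoy their differences." -- David Bowie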

Original source: https://www.cnblogs.com/dustman/p/10040430.html