Python Cookbook, Chapter 2: Strings (Part 2)

Date: 2019-12-06

6. Searching and replacing text, ignoring case

To ignore case, pass the re.IGNORECASE flag to the re functions.

>>> import re
>>> text = 'UPPER PYTHON, lower python, Mixed Python'
>>> re.findall('python', text, flags=re.IGNORECASE)
['PYTHON', 'python', 'Python']
>>> re.sub('python', 'snake', text, flags=re.IGNORECASE)
'UPPER snake, lower snake, Mixed snake'
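
One limitation of the substitution above is that the replacement text does not follow the case of the text it replaces. A minimal sketch of one way to handle this, assuming a hypothetical helper named matchcase (it is not part of re) that builds a callback for re.sub:

```python
import re

def matchcase(word):
    """Return a re.sub callback that copies the case of each match onto word."""
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

text = 'UPPER PYTHON, lower python, Mixed Python'
result = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(result)  # UPPER SNAKE, lower snake, Mixed Snake
```

re.sub accepts a function as the replacement argument and calls it once per match, which is what makes this approach possible.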

7. Non-greedy (shortest) matching with regular expressions

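Regular expressions are greedy by default: as the engine scans from left to right, each quantifier takes the longest string that still lets the pattern match.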

For the shortest match, append "?" to the quantifier; it can follow "*", "+", or "?".

Greedy match:
>>> text2 = 'Computer says "no." Phone says "yes."'
>>> str_pat = re.compile(r'\"(.*)\"')
>>> str_pat.findall(text2)
['no." Phone says "yes.']
Non-greedy match:
>>> str_pat = re.compile(r'\"(.*?)\"')
>>> str_pat.findall(text2)
['no.', 'yes.']
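
To make the effect of the "?" suffix concrete, here is a small sketch that applies each quantifier in its greedy and non-greedy form to the same input. The sample string 'aaa' is an arbitrary choice for illustration, and the negated character class at the end is an alternative technique, not the one used above.

```python
import re

sample = 'aaa'  # hypothetical input, chosen only for illustration

# Greedy quantifiers take as much as they can.
greedy_star = re.match(r'a*', sample).group()   # 'aaa'
greedy_plus = re.match(r'a+', sample).group()   # 'aaa'
greedy_opt = re.match(r'a?', sample).group()    # 'a'

# Appending "?" makes each quantifier take as little as it can.
lazy_star = re.match(r'a*?', sample).group()    # ''
lazy_plus = re.match(r'a+?', sample).group()    # 'a'
lazy_opt = re.match(r'a??', sample).group()     # ''

# For quoted strings, a negated character class gives the same
# result as the non-greedy pattern on this input.
text2 = 'Computer says "no." Phone says "yes."'
quoted = re.findall(r'"([^"]*)"', text2)        # ['no.', 'yes.']
print(quoted)
```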

8. Matching across multiple lines with regular expressions

In a regular expression, "." matches any character except a newline.

>>> comment = re.compile(r'/\*(.*?)\*/')
>>> text1 = '/* this is a comment */'
>>> text2 = '''/* this is a
... multiline comment */
... '''
>>> comment.findall(text1)
[' this is a comment ']
>>> comment.findall(text2)
[]
In this case, add explicit support for newlines to the pattern:
>>> comment = re.compile(r'/\*((?:.|\n)*?)\*/')
>>> comment.findall(text2)
[' this is a\nmultiline comment ']
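Here (?:.|\n) defines a non-capturing group. Without the ?:, the nested group (.|\n) is captured as well, and findall returns a tuple for each match; the inner group holds only the last character it matched, here the space before */: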
>>> comment = re.compile(r'/\*((.|\n)*?)\*/')
>>> comment.findall(text2)
[(' this is a\nmultiline comment ', ' ')]
re.compile also accepts a flag, re.DOTALL, which makes "." match any character, including a newline:
>>> comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
>>> comment.findall(text2)
[' this is a\nmultiline comment ']
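However, if the pattern is complex, or several separate patterns are combined into one, this flag can cause problems, so it is usually better to write the pattern so that it works without it.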
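
The sketch below compares the two approaches on the same text and then shows one way the flag can surprise you. Both input strings are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical input: a multiline comment followed by a short one.
source = '/* this is a\nmultiline comment */ x = 1; /* second */'

explicit = re.compile(r'/\*((?:.|\n)*?)\*/')
dotall = re.compile(r'/\*(.*?)\*/', re.DOTALL)

explicit_hits = explicit.findall(source)
dotall_hits = dotall.findall(source)
print(explicit_hits == dotall_hits)  # True

# Side effect of the flag: a greedy ".*" that was meant to stop at the
# end of a line now runs past the newline.
config = 'name=alpha\nsize=3'
without_flag = re.findall(r'name=(.*)', config)
with_flag = re.findall(r'name=(.*)', config, re.DOTALL)
print(without_flag)  # ['alpha']
print(with_flag)     # ['alpha\nsize=3']
```

Because re.DOTALL changes the meaning of every "." in the pattern, a sub-pattern that relied on "." stopping at a line end silently changes behavior once the flag is applied to the combined expression.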

9. Normalizing Unicode text to a standard form (Python 3 only)

In Python 3.x, str is always Unicode text, so every string a program handles is a Unicode string.

In Unicode, some characters can be represented by more than one valid sequence of code points.

Python 3 treats string literals as Unicode and encodes them on output, typically as UTF-8:
>>> s1 = 'Spicy Jalape\u00f1o'
>>> s2 = 'Spicy Jalapen\u0303o'
>>> s1
'Spicy Jalapeño'
>>> s2
'Spicy Jalapeño'
>>> s1==s2
False
In this case comparing the strings gives the wrong answer. Use unicodedata to normalize them first:
>>> import unicodedata
>>> t1 = unicodedata.normalize('NFC', s1)
>>> t1
'Spicy Jalapeño'
>>> t2 = unicodedata.normalize('NFC', s2)
>>> t2
'Spicy Jalapeño'
>>> t1==t2
True
>>> print(t1)
Spicy Jalapeño
>>> print(ascii(t1))
'Spicy Jalape\xf1o'
>>> t3 = unicodedata.normalize('NFD', s1)
>>> t4 = unicodedata.normalize('NFD', s2)
>>> t3 == t4
True
>>> print(ascii(t3))
'Spicy Jalapen\u0303o'
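The first argument to normalize specifies the normalization form: NFC composes characters into single code points where possible, while NFD decomposes them into a base character followed by combining characters.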
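
As a sketch of how normalization might be used in practice, the two helpers below compare strings after normalizing them and strip accents by dropping combining characters. The names same_text and strip_marks are hypothetical and exist only for this illustration.

```python
import unicodedata

def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

def strip_marks(s):
    """Decompose s, then drop combining characters such as U+0303."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

s1 = 'Spicy Jalape\u00f1o'   # precomposed ñ
s2 = 'Spicy Jalapen\u0303o'  # n followed by a combining tilde

print(same_text(s1, s2))                      # True
print(len(unicodedata.normalize('NFC', s2)))  # 14
print(len(unicodedata.normalize('NFD', s1)))  # 15
print(strip_marks(s1))                        # Spicy Jalapeno
```

The length difference shows what the two forms do: NFC stores ñ as one code point, while NFD stores it as two.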

10. Regular expressions and Unicode characters

>>> import re
>>> num = re.compile(r'\d+')
>>> num.match('123')
<_sre.SRE_Match object; span=(0, 3), match='123'>
>>> num.match('\u0661\u0662\u0663')  # matches out of the box only in Python 3
<_sre.SRE_Match object; span=(0, 3), match='١٢٣'>
Beware of special cases: case-insensitive matching does not work for some Unicode characters:
>>> pat = re.compile('stra\u00dfe', re.IGNORECASE)
>>> s = 'straße'
>>> pat.match(s)
<_sre.SRE_Match object; span=(0, 6), match='straße'>
>>> pat.match(s.upper())  # no match
>>> s.upper()
'STRASSE'
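
Two follow-ups to the examples above, offered as a sketch and not as the only way to handle these cases. The re.ASCII flag restricts \d to 0-9 when Unicode digits are unwanted, and str.casefold can be applied to both pattern and text as a workaround for the ß case. The helper name matches_folded is hypothetical.

```python
import re

# re.ASCII limits \d to [0-9], so Arabic-Indic digits no longer match.
arabic = '\u0661\u0662\u0663'
ascii_match = re.compile(r'\d+', re.ASCII).match(arabic)
unicode_match = re.match(r'\d+', arabic)
print(ascii_match is None, unicode_match is not None)  # True True

def matches_folded(word, text):
    """Match word at the start of text after casefolding both sides."""
    return re.match(re.escape(word.casefold()), text.casefold()) is not None

s = 'stra\u00dfe'
print(matches_folded(s, s))          # True
print(matches_folded(s, s.upper()))  # True
```

Note that casefolding changes the text before matching ('straße' becomes 'strasse'), so match positions refer to the folded string, not the original.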