String Operations

Date: 2021-08-13


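1. Splitting a string on multiple delimiters

The split() method of string objects only suits very simple cases: it does not allow multiple delimiters, nor an indeterminate amount of whitespace around a delimiter. When you need more flexible splitting, use re.split() instead:

>>> line = 'asdf fjdk; afed, fjek,asdf, foo'
>>> import re
>>> re.split(r'[;,\s]\s*', line)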
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
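
As a small extension of the example above, the sketch below shows what happens when the delimiter pattern is wrapped in a capture group: re.split() then keeps the matched delimiters in the result. The variable names (values, delimiters) are only illustrative.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# Plain split: the delimiters are discarded.
fields = re.split(r'[;,\s]\s*', line)

# With a capture group, each matched delimiter is kept in the result list,
# alternating with the fields.
parts = re.split(r'(;|,|\s)\s*', line)
values = parts[::2]        # every even index is a field
delimiters = parts[1::2]   # every odd index is a captured delimiter

print(fields)
print(values)
print(delimiters)
```

If you need grouping but not the delimiters, a non-capturing group such as (?:,|;|\s) keeps the result identical to the plain split.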

2. Matching text at the start or end of a string

A simple way to check the start or end of a string is to use the str.startswith() or str.endswith() method. For example:

>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True
>>> filename.startswith('file:')
False
>>> url = 'http://www.python.org'
>>> url.startswith('http:')
True

To check against several possible matches, put all the candidates into a tuple and pass it to startswith() or endswith(): [name for name in filenames if name.endswith(('.c', '.h'))]
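
The following sketch shows the tuple form in use. The file names and the choices list are made up for illustration. It also shows that the argument must be a tuple: a list raises TypeError, so convert it with tuple() first.

```python
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

# Keep only C sources and headers.
c_sources = [name for name in filenames if name.endswith(('.c', '.h'))]

# Check whether any Python file is present.
has_python = any(name.endswith('.py') for name in filenames)

# startswith()/endswith() accept a str or a tuple of str, not a list.
choices = ['http:', 'ftp:']
url = 'http://www.python.org'
try:
    url.startswith(choices)
    list_accepted = True
except TypeError:
    list_accepted = False

is_remote = url.startswith(tuple(choices))

print(c_sources, has_python, list_accepted, is_remote)
```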

3. Matching strings with Unix shell wildcard patterns (such as *.py or Dat[0-9]*.csv)

The fnmatch module provides two functions, fnmatch() and fnmatchcase(), that can be used for this kind of matching. Usage is as follows:

>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True
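
A minimal sketch of filtering a list of names with these functions. The names list is invented for illustration. fnmatch() follows the case-sensitivity rules of the underlying operating system, whereas fnmatchcase() always compares case-sensitively, so the sketch uses the latter to get the same result everywhere.

```python
from fnmatch import fnmatchcase

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# Select the data files with a shell-style pattern.
csv_files = [name for name in names if fnmatchcase(name, 'Dat[0-9]*.csv')]

# fnmatchcase() is case-sensitive on every platform.
upper_match = fnmatchcase('foo.txt', '*.TXT')

print(csv_files)
print(upper_match)
```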

4. Regular expressions

match() always tries to match at the start of the string. To find every occurrence of a pattern anywhere in the string, use findall() instead. For example:

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
['11/27/2012', '3/13/2013']

When defining a regular expression, it is common to use parentheses to introduce capture groups. For example:

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

findall() searches the text and returns all matches as a list. To step through the matches one at a time instead, use finditer().
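
To make the difference between these methods concrete, here is a short sketch using the grouped date pattern from above. The variable names (at_start, tuples, iso) are illustrative only.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# match() only looks at the beginning of the string, so it finds nothing here.
at_start = datepat.match(text)

# With capture groups, findall() returns a list of tuples of the groups.
tuples = datepat.findall(text)

# finditer() yields match objects one at a time.
iso = []
for m in datepat.finditer(text):
    month, day, year = m.groups()
    iso.append('{}-{}-{}'.format(year, month, day))

print(at_start)
print(tuples)
print(iso)
```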

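5. Searching for and replacing text

For simple literal patterns, use the str.replace() method directly. For more complicated patterns, use the sub() function in the re module.

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2012-11-27. PyCon starts 2013-3-13.'

The first argument to sub() is the pattern to match and the second is the replacement pattern. A backslash followed by a digit, such as \3, refers to the capture group with that number in the pattern. If you also want to know how many substitutions were made, use re.subn() instead. For example:

>>> newtext, n = datepat.subn(r'\3-\1-\2', text)
>>> newtext
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>> n
2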
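
When the replacement needs more than reordering groups, a function can be passed in place of the replacement string. The sketch below zero-pads the month and day, which the \3-\1-\2 template cannot do. The helper name to_iso is an assumption made for this example.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

def to_iso(m):
    # m is the match object; groups are month, day, year in that order.
    month, day, year = m.group(1), m.group(2), m.group(3)
    return '{}-{:02d}-{:02d}'.format(year, int(month), int(day))

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
newtext, count = datepat.subn(to_iso, text)

print(newtext)
print(count)
```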

6. Case-insensitive searching and replacing

To ignore case in text operations, use the re module and pass the re.IGNORECASE flag to those operations. For example:

>>> text = 'UPPER PYTHON, lower python, Mixed Python'
>>> re.findall('python', text, flags=re.IGNORECASE)
['PYTHON', 'python', 'Python']
>>> re.sub('python', 'snake', text, flags=re.IGNORECASE)
'UPPER snake, lower snake, Mixed snake'

The last example reveals a small limitation: the replacement text does not automatically follow the case of the matched text. To fix this, you may need a helper function such as the following:

def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

Here is how to use it:

>>> re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
'UPPER SNAKE, lower snake, Mixed Snake'

matchcase('snake') returns a callback function (whose argument must be a match object). Besides a replacement string, sub() also accepts such a callback.

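7. Matching across multiple lines

By default the dot (.) does not match a newline, so a pattern such as (.*?) fails on text that spans several lines. One fix is to replace (.*?) with ((?:.|\n)*?). In this pattern, (?:.|\n) specifies a non-capturing group (that is, a group used purely for matching, which is not captured separately or given a number). Alternatively, re.compile() accepts a flag, re.DOTALL, which is very useful here: it makes the dot (.) in a regular expression match any character, including a newline. For example:

>>> text2 = '/* this is a\n multiline comment */'
>>> comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
>>> comment.findall(text2)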
[' this is a\n multiline comment ']
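
The sketch below puts the three variants side by side on a single-line and a multi-line C-style comment. The sample strings and pattern names are illustrative.

```python
import re

text1 = '/* this is a comment */'
text2 = '/* this is a\n multiline comment */'

plain = re.compile(r'/\*(.*?)\*/')               # dot does not match '\n'
nongroup = re.compile(r'/\*((?:.|\n)*?)\*/')     # explicit newline alternative
dotall = re.compile(r'/\*(.*?)\*/', re.DOTALL)   # dot matches everything

print(plain.findall(text1))     # single-line comment is found
print(plain.findall(text2))     # multi-line comment is missed
print(nongroup.findall(text2))
print(dotall.findall(text2))
```

For a simple pattern like this one, re.DOTALL works well. In larger patterns, or when several expressions are combined, writing the newline handling into the pattern itself keeps its behaviour independent of flags.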

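8. Stripping unwanted characters from strings

The strip() method removes characters from the beginning and end of a string. lstrip() and rstrip() do the same from the left and right side respectively. By default these methods remove whitespace, but you can specify other characters instead.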
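
A brief sketch of these methods, with made-up sample strings. Note that stripping only affects the ends of the string. To collapse whitespace in the middle you need something else, for instance re.sub().

```python
import re

s = '  hello world \n'
both = s.strip()      # whitespace removed from both ends
left = s.lstrip()     # only the leading whitespace removed
right = s.rstrip()    # only the trailing whitespace removed

t = '-----hello====='
no_dashes = t.lstrip('-')   # strip a specific leading character
bare = t.strip('-=')        # strip any of the given characters from both ends

u = '  hello     world  '
ends_only = u.strip()                       # inner spaces are untouched
collapsed = re.sub(r'\s+', ' ', u.strip())  # collapse inner runs of whitespace

print(repr(both), repr(left), repr(right))
print(repr(no_dashes), repr(bare))
print(repr(ends_only), repr(collapsed))
```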

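9. Sanitizing and cleaning up text

In Python 2, str.translate() converts the characters of a string according to the table given in the table argument (a string of 256 characters); characters to be filtered out are listed in the deletechars argument. Syntax: str.translate(table[, deletechars])

#!/usr/bin/python
# Python 2 syntax (print statement, string.maketrans).
from string import maketrans   # Required to call maketrans function.

intab = "aeiou"
outtab = "12345"
trantab = maketrans(intab, outtab)

text = "this is string example....wow!!!"
print text.translate(trantab, 'xm')

Output: th3s 3s str3ng 21pl2....w4w!!!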
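
The listing above uses Python 2 syntax. As a rough Python 3 equivalent, the translation table can be built with the static method str.maketrans(), whose optional third argument lists the characters to delete. translate() then takes just the table.

```python
# Python 3: build the mapping and the deletion set in one call.
table = str.maketrans('aeiou', '12345', 'xm')

text = 'this is string example....wow!!!'
result = text.translate(table)

print(result)
```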

10. Aligning strings

For basic alignment, use the ljust(), rjust() and center() methods of strings. For example:

>>> text = 'Hello World'
>>> text.ljust(20)
'Hello World         '
>>> text.rjust(20)
'         Hello World'
>>> text.center(20)
'    Hello World     '

All of these methods accept an optional fill character. For example:

>>> text.rjust(20,'=')
'=========Hello World'

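The format() function can also be used to align strings easily. All you need to do is use the <, > or ^ character followed by the desired width. For example:

>>> format(text, '>20')
'         Hello World'
>>> format(text, '<20')
'Hello World         '
>>> format(text, '^20')
'    Hello World     '

To specify a fill character other than a space, write it just before the alignment character:

>>> format(text, '=>20s')
'=========Hello World'

These format codes can also be used in the format() method when formatting multiple values. For example:

>>> '{:>10s} {:>10s}'.format('Hello', 'World')
'     Hello      World'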
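
One advantage of format() over ljust() and friends is that it is not specific to strings. The sketch below aligns numbers and lays out a small two-column table. The rows data is invented for illustration.

```python
text = 'Hello World'

# Center with a custom fill character.
centered = format(text, '*^20s')

# The same alignment codes work for numbers, combined with a precision.
number = format(1.2345, '>10.2f')

# A small table: names left-aligned in 10 columns, prices right-aligned in 8.
rows = [('apple', 1.5), ('kiwi', 12.25)]
lines = ['{:<10}{:>8.2f}'.format(name, price) for name, price in rows]

print(centered)
print(number)
print('\n'.join(lines))
```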

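11. Interpolating variables in strings

Use the format() method, or combine format_map() with vars(). vars() returns a dictionary of an object's attributes and their values. Called with no argument, it returns the names and values in the current scope, which is equivalent to locals(); called with an argument, it returns only the attributes and values of that object.

>>> s = '{name} has {n} messages.'
>>> s.format(name='Guido', n=37)
'Guido has 37 messages.'
>>> name = 'Guido'
>>> n = 37
>>> s.format_map(vars())
'Guido has 37 messages.'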

Another interesting feature of vars() is that it also works with instances. For example:

>>> class Info:
...     def __init__(self, name, n):
...         self.name = name
...         self.n = n
...
>>> a = Info('Guido',37)
>>> s.format_map(vars(a))
'Guido has 37 messages.'

One drawback of format() and format_map() is that they do not cope gracefully with missing values. For example:

>>> s.format(name='Guido')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'n'

One way to avoid this error is to define a separate dictionary class with a __missing__() method, like this:

class safesub(dict):
    """Guard against missing keys."""
    def __missing__(self, key):
        return '{' + key + '}'

Now you can use this class to wrap the input before passing it to format_map():

>>> del n # Make sure n is undefined
>>> s.format_map(safesub(vars()))
'Guido has {n} messages.'

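If you find yourself performing these steps frequently in your code, you can hide the variable substitution behind a small utility function, like this:

import sys

def sub(text):
    return text.format_map(safesub(sys._getframe(1).f_locals))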

Now you can write things like this:

>>> name = 'Guido'
>>> n = 37
>>> print(sub('Hello {name}'))
Hello Guido
>>> print(sub('You have {n} messages.'))
You have 37 messages.
>>> print(sub('Your favorite color is {color}'))
Your favorite color is {color}

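sub() uses sys._getframe(1) to get the caller's stack frame, from which the f_locals attribute gives access to the local variables. Note that f_locals is a dictionary holding a copy of the calling function's local variables: although you can change its contents, the change has no effect on later variable access. (This describes Python up to 3.12; from Python 3.13 onward, f_locals in a function is a write-through proxy.)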
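
The pieces of this recipe are spread over several interactive snippets, so here is a self-contained sketch that can be run as a script. The class is the safesub class from above, spelled SafeSub here; the greet() function is a hypothetical caller added only to show that sub() picks up the local variables of whichever function calls it.

```python
import sys

class SafeSub(dict):
    """Dictionary that leaves unknown placeholders in place."""
    def __missing__(self, key):
        return '{' + key + '}'

def sub(text):
    # Frame 1 is the caller of sub(); f_locals holds its local variables.
    caller_locals = sys._getframe(1).f_locals
    return text.format_map(SafeSub(caller_locals))

def greet():
    # Hypothetical caller: 'color' is deliberately left undefined.
    name = 'Guido'
    n = 37
    return sub('{name} has {n} messages; favorite color is {color}')

result = greet()
print(result)
```

Reading the caller's frame is convenient but implicit. Where that is undesirable, passing the values explicitly, as in text.format_map(SafeSub(values)), gives the same tolerant behaviour without sys._getframe().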

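12. Reformatting text to a fixed number of columns

Use the textwrap module to reformat text for output. For example, suppose you have the following long string: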

s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."

The following shows several ways to reformat it with textwrap:

>>> import textwrap
>>> print(textwrap.fill(s, 70))
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes, don't look around the eyes, look into my eyes,
you're under.
>>> print(textwrap.fill(s, 40))
Look into my eyes, look into my eyes,
the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes,
look into my eyes, you're under.
>>> print(textwrap.fill(s, 40, initial_indent='    '))
    Look into my eyes, look into my
eyes, the eyes, the eyes, the eyes, not
around the eyes, don't look around the
eyes, look into my eyes, you're under.
>>> print(textwrap.fill(s, 40, subsequent_indent='    '))
Look into my eyes, look into my eyes,
    the eyes, the eyes, the eyes, not
    around the eyes, don't look around
    the eyes, look into my eyes, you're 
    under.

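The textwrap module is very useful for printing text, especially when you want the output to fit the size of the terminal. You can use os.get_terminal_size() to obtain the terminal size. For example:

>>> import os
>>> os.get_terminal_size().columns
80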
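
Putting the two together, the sketch below wraps the text to the terminal width. It uses shutil.get_terminal_size() rather than os.get_terminal_size(), because the shutil version returns a fallback size instead of raising an error when output is not attached to a terminal (for example when piped to a file). The 20 and 60 column bounds are arbitrary choices for this example.

```python
import shutil
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# Falls back to 80x24 when no terminal is available.
columns = shutil.get_terminal_size(fallback=(80, 24)).columns

# Clamp the width to a readable range (bounds chosen arbitrarily).
width = max(20, min(columns, 60))

wrapped = textwrap.fill(s, width=width)
print(wrapped)
```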

13. Handling HTML and XML in strings

You want to replace HTML or XML entities such as &entity; or &#code; with the corresponding text, or you need to escape particular characters in text (such as <, >, or &). To replace '<' or '>' in a string, the html.escape() function does the job easily.

>>> s = 'Elements are written as "<tag>text</tag>".'
>>> import html
>>> print(html.escape(s))
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
>>> print(html.escape(s, quote=False))  # leave quotes unescaped
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".
>>> s = 'Spicy &quot;Jalape&#241;o&quot.'
>>> html.unescape(s)  # HTMLParser().unescape() was removed in Python 3.9
'Spicy "Jalapeño".'
>>> t = 'The prompt is &gt;&gt;&gt;'
>>> from xml.sax.saxutils import unescape
>>> unescape(t)
'The prompt is >>>'
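
As a quick check of how escaping and unescaping fit together, the sketch below round-trips a string through html.escape() and html.unescape(). It also shows one way to emit non-ASCII characters as numeric character references, using the 'xmlcharrefreplace' error handler when encoding to ASCII. The sample strings are illustrative.

```python
import html

s = 'Elements are written as "<tag>text</tag>".'

escaped = html.escape(s)            # &, <, > and quotes become entities
restored = html.unescape(escaped)   # back to the original text

# Non-ASCII characters can be written as numeric character references.
t = 'Spicy Jalapeño'
ascii_bytes = t.encode('ascii', errors='xmlcharrefreplace')
decoded = html.unescape(ascii_bytes.decode('ascii'))

print(escaped)
print(ascii_bytes)
print(decoded)
```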