Python Cookbook, Chapter 2: Strings (Part 4)

Date: 2019-12-05


16. Reformatting text to a fixed number of columns (text layout)

Use the textwrap module:

>>> s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
... the eyes, not around the eyes, don't look around the eyes, \
... look into my eyes, you're under."
>>> import textwrap
>>> print(textwrap.fill(s, 70))
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes, don't look around the eyes, look into my eyes,
you're under.
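textwrap.fill() sets the maximum number of characters per line and wraps at word boundaries rather than in the middle of a word. The initial_indent and subsequent_indent options set the prefix for the first line and for all following lines, respectively:
>>> print(textwrap.fill(s, 40, initial_indent='    '))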
    Look into my eyes, look into my
eyes, the eyes, the eyes, the eyes, not
around the eyes, don't look around the
eyes, look into my eyes, you're under.
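The two indent options can also be combined with textwrap.wrap(), which returns the wrapped lines as a list instead of one joined string. The following is a small sketch; the widths and the '> ' prefix are chosen only for illustration.

```python
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# Hanging indent: no prefix on the first line, four spaces on the rest.
hanging = textwrap.fill(s, 40, subsequent_indent='    ')
print(hanging)

# wrap() returns the wrapped lines as a list rather than one string.
# Here every line gets the same (illustrative) quote prefix.
lines = textwrap.wrap(s, 40, initial_indent='> ', subsequent_indent='> ')
for line in lines:
    print(line)
```

The width passed to fill() and wrap() includes the indent, so the prefixed lines still fit within 40 characters.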

17. Handling HTML and XML entities in text (Python 3 only)

Python 2 has two string types: Unicode strings and non-Unicode (byte) strings. Python 3 has only one text string type: the Unicode string (str).

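You want to replace HTML or XML entities such as &entity; or &#code; with their corresponding text, or you need to generate text while escaping certain characters such as <, > or &.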

html.escape() replaces special characters such as '<' and '>' with their entities:
>>> s = 'Elements are written as "<tag>text</tag>".'
>>> import html
>>> print(s)
Elements are written as "<tag>text</tag>".
>>> print(html.escape(s))
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
>>> # Disable escaping of quotes
>>> print(html.escape(s, quote=False))
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".
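If you need to produce ASCII-only text with non-ASCII characters written as numeric character references, pass errors='xmlcharrefreplace' to encode() or to the various I/O functions: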
>>> s = 'Spicy Jalapeño'
>>> s.encode('ascii', errors='xmlcharrefreplace')
b'Spicy Jalape&#241;o'
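If for some reason you receive raw text that contains entities and want to replace them manually, you can use the utility functions that come with the HTML and XML modules: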
>>> s = 'Spicy &quot;Jalape&#241;o&quot.'
>>> import html
>>> html.unescape(s)   # HTMLParser().unescape() was removed in Python 3.9
'Spicy "Jalapeño".'
>>>
>>> t = 'The prompt is &gt;&gt;&gt;'
>>> from xml.sax.saxutils import unescape
>>> unescape(t)
'The prompt is >>>'
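As a quick check that escaping and unescaping are inverses of each other, here is a minimal round-trip sketch. The sample strings are made up for illustration, and html.unescape() is used because it is the module-level function available since Python 3.4.

```python
import html
from xml.sax.saxutils import escape as xml_escape, unescape as xml_unescape

raw = 'Elements are written as "<tag>text</tag>" & more.'

# html.escape() converts &, <, > and (by default) quotes to entities.
escaped = html.escape(raw)
# html.unescape() converts named and numeric entities back to characters.
round_trip = html.unescape(escaped)

# Non-ASCII characters become numeric character references while encoding
# and can be recovered after decoding.
ascii_bytes = 'Spicy Jalapeño'.encode('ascii', errors='xmlcharrefreplace')
restored = html.unescape(ascii_bytes.decode('ascii'))

# The xml.sax.saxutils helpers handle only &amp;, &lt; and &gt; by default.
xml_escaped = xml_escape('a < b & c')
xml_restored = xml_unescape(xml_escaped)

print(escaped)
print(ascii_bytes, restored)
print(xml_escaped, '->', xml_restored)
```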

18. Tokenizing text

Suppose you have a string of text:
text = 'foo = 23 + 42 * 10'
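To tokenize it, matching patterns is not enough; you also need to identify what kind of pattern each match is, so that the string becomes a sequence of pairs like this: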
tokens = [('NAME', 'foo'), ('EQ','='), ('NUM', '23'), ('PLUS','+'),
('NUM', '42'), ('TIMES', '*'), ('NUM', '10')]
The regular expressions used to capture the tokens are as follows:
import re
NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

The ?P<TOKENNAME> syntax assigns a name to each pattern (a named group), which is used later to identify the token type.

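scanner() creates a scanner object whose match() method can be called repeatedly to step through the supplied text one match at a time. match() returns None once the text at the current position matches none of the patterns, which also happens at the end of the text.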

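The order of the patterns in the combined regular expression also matters: when one pattern is a prefix of a longer pattern, list the longer pattern first.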
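To see why the order matters, consider a sketch with two hypothetical tokens, LT for '<' and LE for '<=' (these names are not part of the patterns above). Regular expression alternation takes the first alternative that matches, not the longest one, so putting the shorter pattern first splits '<=' into two tokens.

```python
import re

# Hypothetical token patterns, used only to illustrate ordering.
LT = r'(?P<LT><)'
LE = r'(?P<LE><=)'
EQ = r'(?P<EQ>=)'

good = re.compile('|'.join([LE, LT, EQ]))   # longer pattern first
bad = re.compile('|'.join([LT, LE, EQ]))    # shorter pattern first

def kinds(pat, text):
    """Return (group name, matched text) pairs for successive matches."""
    return [(m.lastgroup, m.group()) for m in pat.finditer(text)]

print(kinds(good, '<='))  # [('LE', '<=')]
print(kinds(bad, '<='))   # [('LT', '<'), ('EQ', '=')]
```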

>>> scanner = master_pat.scanner('foo = 42')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()  # '_' holds the result of the previous expression, here scanner.match()
('NAME', 'foo')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677738>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677759>
>>> _.lastgroup, _.group()
('EQ', '=')
>>> scanner.match()
<_sre.SRE_Match object at 0x100677768>
>>> _.lastgroup, _.group()
('WS', ' ')
>>> scanner.match()
<_sre.SRE_Match object at 0x1006777390>
>>> _.lastgroup, _.group()
('NUM', '42')
>>> scanner.match()
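The manual calls to match() can be wrapped in a generator. The following is a minimal sketch of that idea; Token and generate_tokens are illustrative names, and the patterns are repeated so that the block runs on its own.

```python
import re
from collections import namedtuple

NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ = r'(?P<EQ>=)'
WS = r'(?P<WS>\s+)'
master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

# Illustrative container for one token: its kind and its text.
Token = namedtuple('Token', ['type', 'value'])

def generate_tokens(pat, text):
    """Yield a Token per successive match; stop at the first non-match."""
    scanner = pat.scanner(text)
    # iter(callable, sentinel) keeps calling scanner.match() until it
    # returns None.
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

text = 'foo = 23 + 42 * 10'
# Drop whitespace tokens to get the sequence of pairs shown earlier.
tokens = [tok for tok in generate_tokens(master_pat, text) if tok.type != 'WS']
for tok in tokens:
    print(tok)
```

Because the generator stops silently at the first character that no pattern matches, a real tokenizer would probably want to check that the whole input was consumed.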

20. Text operations on byte strings (Python 3 only)

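Byte strings support most of the built-in operations that text strings do, but there are exceptions. For example, indexing a byte string yields integers rather than characters: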

>>> b = b'Hello World'
>>> b
b'Hello World'
>>> b[0]
72
>>> b[1]
101
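A short sketch of the operations that do carry over, and of two ways to get a one-character result instead of an integer (the sample values are illustrative):

```python
b = b'Hello World'

# Indexing gives an integer, but slicing gives a bytes object.
first_int = b[0]        # 72
first_byte = b[0:1]     # b'H'

# Decoding first gives ordinary text indexing.
first_char = b.decode('ascii')[0]   # 'H'

# Many familiar string methods exist on bytes; arguments must be bytes too.
parts = b.split()
starts = b.startswith(b'Hello')
replaced = b.replace(b'Hello', b'Hello Cruel')

print(first_int, first_byte, first_char)
print(parts, starts, replaced)
```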
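Before Python 3.5, byte strings also did not support the % formatting operator (Python 3.5 added it for bytes via PEP 461; the format() method is still unavailable on bytes). On Python 3.4 and earlier:
>>> b'%10s %10d %10.2f' % (b'ACME', 100, 490.1)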
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for %: 'bytes' and 'tuple'
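Formatting with a text string works, although a bytes argument is then rendered as its repr (note the b'ACME' in the output):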
>>> '%10s %10d %10.2f' % (b'ACME', 100, 490.1)
"   b'ACME'        100     490.10"
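To avoid the b'...' repr in the result, one option is to decode the bytes argument, format as text, and encode the result. The sketch below assumes ASCII data; the second half relies on %-formatting for bytes, which requires Python 3.5 or later.

```python
# Format as text, then encode the result back to bytes.
formatted = '{:10s} {:10d} {:10.2f}'.format(b'ACME'.decode('ascii'), 100, 490.1)
as_bytes = formatted.encode('ascii')
print(as_bytes)

# Python 3.5+ only: %-formatting applied directly to a bytes object.
# Note that %10s right-aligns, while {:10s} left-aligns text.
direct = b'%10s %10d %10.2f' % (b'ACME', 100, 490.1)
print(direct)
```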