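Approach: first build a dictionary whose keys are the elements of the list and whose values are their occurrence counts, then sort the dictionary items by count.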
>>> from random import randint
>>> data = [randint(1,21) for _ in xrange(30)]
>>> data
[18, 5, 21, 6, 13, 18, 3, 20, 14, 3, 8, 20, 12, 16, 21, 11, 9, 17, 14, 1, 19, 2, 6, 9, 6, 20, 8, 14, 18, 2]
>>> dicData = dict.fromkeys(data,0)
>>> dicData
{1: 0, 2: 0, 3: 0, 5: 0, 6: 0, 8: 0, 9: 0, 11: 0, 12: 0, 13: 0, 14: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}
>>> for x in data:
        dicData[x] += 1

>>> dicData
{1: 1, 2: 2, 3: 2, 5: 1, 6: 3, 8: 2, 9: 2, 11: 1, 12: 1, 13: 1, 14: 3, 16: 1, 17: 1, 18: 3, 19: 1, 20: 3, 21: 2}
>>> sortDicData = sorted(dicData.iteritems(),key=lambda x:x[1],reverse=True)
>>> sortDicData
[(6, 3), (14, 3), (18, 3), (20, 3), (2, 2), (3, 2), (8, 2), (9, 2), (21, 2), (1, 1), (5, 1), (11, 1), (12, 1), (13, 1), (16, 1), (17, 1), (19, 1)]
>>> newdicData = dict(sortDicData[:4])
>>> newdicData
{18: 3, 20: 3, 14: 3, 6: 3}
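The session above is Python 2 (xrange, iteritems). As a minimal sketch, assuming Python 3, the same count-then-sort idea could look like this; the short fixed list and the names counts, by_count and top3 are illustrative only and are not taken from the original session.

```python
# Count occurrences with a plain dict, then sort by count (Python 3 syntax).
data = [18, 5, 21, 6, 13, 18, 3, 6, 6, 5]  # small fixed sample, not the random list above

counts = dict.fromkeys(data, 0)   # every distinct element starts at 0
for x in data:
    counts[x] += 1

# items() takes the place of the Python 2 iteritems(); sort by count, largest first
by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
top3 = by_count[:3]
print(top3)
```

Sorting the whole item list is fine for small inputs; the Counter class shown next avoids writing the loop by hand.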
Using the same list as in the previous example. Counter is a subclass of the built-in dict.
>>> data
[18, 5, 21, 6, 13, 18, 3, 20, 14, 3, 8, 20, 12, 16, 21, 11, 9, 17, 14, 1, 19, 2, 6, 9, 6, 20, 8, 14, 18, 2]
>>> from collections import Counter
>>> dict1 = Counter(data)
>>> dict1
Counter({6: 3, 14: 3, 18: 3, 20: 3, 2: 2, 3: 2, 8: 2, 9: 2, 21: 2, 1: 1, 5: 1, 11: 1, 12: 1, 13: 1, 16: 1, 17: 1, 19: 1})
>>> dict1[6]
3
>>> dict1[20]
3
>>> dict1[2]
2
>>> dict1.most_common(3)
[(6, 3), (14, 3), (18, 3)]
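Approach: read the article into one string, then use the split function of the regular expression module to break it into individual words.
>>> from collections import Counter
>>> import re  # regular expression module
# Note: a Word .doc file cannot be read like a plain text file; it needs a module dedicated to reading .doc files
# Open collections.txt, read it in full and assign the result to txt; txt is then one very long string
>>> txt = open(r"C:\视频\python高效实践技巧笔记\collections.txt").read()
# Then split the whole string on runs of non-word characters (\W, i.e. anything other than letters, digits and underscore) with re.split('\W+',txt), which yields a list of the individual words. Counter() then counts word frequencies in that list, as described above
>>> c3 = Counter(re.split(r'\W+',txt))
# Get the list of the 10 most frequent words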
>>> c3.most_common(10)
[('the', 177), ('a', 126), ('to', 96), ('and', 93), ('is', 73), ('d', 73), ('in', 72), ('for', 69), ('of', 64), ('2', 53)]
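The result above contains entries such as 'd' and '2' because \W+ splits on every non-word character, so contractions and digits become "words" of their own. The following is a minimal sketch of the same word count on an inline sample string (an assumption standing in for collections.txt), written for Python 3. Lower-casing the words and dropping empty strings are additions that are not in the original session.

```python
import re
from collections import Counter

# Inline sample text stands in for the contents of collections.txt (an assumption)
txt = "The cat and the hat. The cat sat; and the hat fell!"

# Splitting on runs of non-word characters can leave an empty string at the
# edges (here after the final '!'), so empty pieces are dropped before counting.
words = [w.lower() for w in re.split(r'\W+', txt) if w]
c3 = Counter(words)
print(c3.most_common(3))
```

Without the `if w` filter the empty string would be counted as a word whenever the text starts or ends with punctuation.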
>>> help(dict)
Help on class dict in module __builtin__:

class dict(object)
 |  dict() -> new empty dictionary
 |  dict(mapping) -> new dictionary initialized from a mapping object's
 |      (key, value) pairs
 |  dict(iterable) -> new dictionary initialized as if via:
 |      d = {}
 |      for k, v in iterable:
 |          d[k] = v
 |  dict(**kwargs) -> new dictionary initialized with the name=value pairs
 |      in the keyword argument list.  For example:  dict(one=1, two=2)
 |
 |  Methods defined here:
 |
 |  __cmp__(...)
 |      x.__cmp__(y) <==> cmp(x,y)
 |
 |  __contains__(...)
 |      D.__contains__(k) -> True if D has a key k, else False
 |
 |  __delitem__(...)
 |      x.__delitem__(y) <==> del x[y]
 |
 |  __eq__(...)
 |      x.__eq__(y) <==> x==y
 |
 |  __ge__(...)
 |      x.__ge__(y) <==> x>=y
 |
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |
 |  __gt__(...)
 |      x.__gt__(y) <==> x>y
 |
 |  __init__(...)
 |      x.__init__(...) initializes x; see help(type(x)) for signature
 |
 |  __iter__(...)
 |      x.__iter__() <==> iter(x)
 |
 |  __le__(...)
 |      x.__le__(y) <==> x<=y
 |
 |  __len__(...)
 |      x.__len__() <==> len(x)
 |
 |  __lt__(...)
 |      x.__lt__(y) <==> x<y
 |
 |  __ne__(...)
 |      x.__ne__(y) <==> x!=y
 |
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |
 |  __setitem__(...)
 |      x.__setitem__(i, y) <==> x[i]=y
 |
 |  __sizeof__(...)
 |      D.__sizeof__() -> size of D in memory, in bytes
 |
 |  clear(...)
 |      D.clear() -> None.  Remove all items from D.
 |
 |  copy(...)
 |      D.copy() -> a shallow copy of D
 |
 |  fromkeys(...)
 |      dict.fromkeys(S[,v]) -> New dict with keys from S and values equal to v.
 |      v defaults to None.
 |
 |  get(...)
 |      D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.
 |
 |  has_key(...)
 |      D.has_key(k) -> True if D has a key k, else False
 |
 |  items(...)
 |      D.items() -> list of D's (key, value) pairs, as 2-tuples
 |
 |  iteritems(...)
 |      D.iteritems() -> an iterator over the (key, value) items of D
 |
 |  iterkeys(...)
 |      D.iterkeys() -> an iterator over the keys of D
 |
 |  itervalues(...)
 |      D.itervalues() -> an iterator over the values of D
 |
 |  keys(...)
 |      D.keys() -> list of D's keys
 |
 |  pop(...)
 |      D.pop(k[,d]) -> v, remove specified key and return the corresponding value.
 |      If key is not found, d is returned if given, otherwise KeyError is raised
 |
 |  popitem(...)
 |      D.popitem() -> (k, v), remove and return some (key, value) pair as a
 |      2-tuple; but raise KeyError if D is empty.
 |
 |  setdefault(...)
 |      D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D
 |
 |  update(...)
 |      D.update([E, ]**F) -> None.  Update D from dict/iterable E and F.
 |      If E present and has a .keys() method, does:     for k in E: D[k] = E[k]
 |      If E present and lacks .keys() method, does:     for (k, v) in E: D[k] = v
 |      In either case, this is followed by: for k in F: D[k] = F[k]
 |
 |  values(...)
 |      D.values() -> list of D's values
 |
 |  viewitems(...)
 |      D.viewitems() -> a set-like object providing a view on D's items
 |
 |  viewkeys(...)
 |      D.viewkeys() -> a set-like object providing a view on D's keys
 |
 |  viewvalues(...)
 |      D.viewvalues() -> an object providing a view on D's values
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  __hash__ = None
 |
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T
 |  fromkeys(...)
 |      dict.fromkeys(S[,v]) -> New dict with keys from S and values equal to v.
 |      v defaults to None.

Uses the values of a sequence as the keys of a new dictionary.
>>> data = [3,1,56]
>>> data1 = dict.fromkeys(data)
>>> data1
{56: None, 1: None, 3: None}
>>> data2 = dict.fromkeys(data,3)
>>> data2
{56: 3, 1: 3, 3: 3}
>>>

 |  iteritems(...)
 |      D.iteritems() -> an iterator over the (key, value) items of D

Continuing the example above: this is an iterator over (key, value) pairs.
>>> data2.iteritems()
<dictionary-itemiterator object at 0x02D812A0>

 |  iterkeys(...)
 |      D.iterkeys() -> an iterator over the keys of D

Continuing the example above: this is an iterator over the keys. Note that the method has to be called; without the parentheses you only get the method object itself.
>>> data2.iterkeys
<built-in method iterkeys of dict object at 0x02E3BDB0>
>>> data2.iterkeys()
<dictionary-keyiterator object at 0x02E27F00>

 |  itervalues(...)
 |      D.itervalues() -> an iterator over the values of D

Continuing the example above: this is an iterator over the values.
>>> data2.itervalues()
<dictionary-valueiterator object at 0x02D81810>
>>> import collections
>>> help(collections)
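This prints practically the entire official documentation for the module. The official documentation is still the most convenient learning resource.
namedtuple is covered in "2-2 Naming the elements of a tuple".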
>>> import collections
>>> help(collections.namedtuple)
Help on function namedtuple in module collections:

namedtuple(typename, field_names, verbose=False, rename=False)
    Returns a new subclass of tuple with named fields.

    >>> Point = namedtuple('Point', ['x', 'y'])
    >>> Point.__doc__                   # docstring for the new class
    'Point(x, y)'
    >>> p = Point(11, y=22)             # instantiate with positional args or keywords
    >>> p[0] + p[1]                     # indexable like a plain tuple
    33
    >>> x, y = p                        # unpack like a regular tuple
    >>> x, y
    (11, 22)
    >>> p.x + p.y                       # fields also accessible by name
    33
    >>> d = p._asdict()                 # convert to a dictionary
    >>> d['x']
    11
    >>> Point(**d)                      # convert from a dictionary
    Point(x=11, y=22)
    >>> p._replace(x=100)               # _replace() is like str.replace() but targets named fields
    Point(x=100, y=22)
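namedtuple is a function that creates a custom tuple type with a fixed number of elements, whose elements can be referenced by attribute name rather than by index.
This makes it easy to define a data type with namedtuple that keeps the immutability of a tuple while allowing access by attribute, which is very convenient.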
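A short sketch of defining and using such a type follows. The type name Student, its fields and the sample values are hypothetical, chosen only for illustration; the code assumes Python 3.

```python
from collections import namedtuple

# 'Student' and its fields are hypothetical names chosen for illustration
Student = namedtuple('Student', ['name', 'age', 'email'])

s = Student('Jim', 16, email='jim@example.com')
print(s.name, s[1])           # access by attribute or by index
name, age, email = s          # unpacks like an ordinary tuple

older = s._replace(age=17)    # tuples are immutable; _replace returns a new one
print(older)
print(isinstance(s, tuple))
```

Because the result is still a tuple, it can be passed to any code that expects a plain tuple while the call sites stay readable.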
>>> import collections
>>> help(collections.Counter)
The printed documentation is very long.
most_common()
 |  most_common(self, n=None)
 |      List the n most common elements and their counts from the most
 |      common to the least.  If n is None, then list all element counts.
 |
 |      >>> Counter('abcdeabcdabcaba').most_common(3)
 |      [('a', 5), ('b', 4), ('c', 3)]
Official documentation:
Py2.7: https://docs.python.org/2.7/library/re.html
Py3: https://docs.python.org/3/library/re.html
>>> help(re)
Help on module re:

NAME
    re - Support for regular expressions (RE).

FILE
    c:\python27\lib\re.py

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.

    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last matches the string 'last'.

    The special characters are:
        "."      Matches any character except a newline.
        "^"      Matches the start of the string.
        "$"      Matches the end of the string or just before the newline at
                 the end of the string.
        "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
                 Greedy means that it will match as many repetitions as possible.
        "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
        "?"      Matches 0 or 1 (greedy) of the preceding RE.
        *?,+?,?? Non-greedy versions of the previous three special characters.
        {m,n}    Matches from m to n repetitions of the preceding RE.
        {m,n}?   Non-greedy version of the above.
        "\\"     Either escapes special characters or signals a special sequence.
        []       Indicates a set of characters.
                 A "^" as the first character indicates a complementing set.
        "|"      A|B, creates an RE that will match either A or B.
        (...)    Matches the RE inside the parentheses.
                 The contents can be retrieved or matched later in the string.
        (?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below).
        (?:...)  Non-grouping version of regular parentheses.
        (?P<name>...) The substring matched by the group is accessible by name.
        (?P=name)     Matches the text matched earlier by the group named name.
        (?#...)  A comment; ignored.
        (?=...)  Matches if ... matches next, but doesn't consume the string.
        (?!...)  Matches if ... doesn't match next.
        (?<=...) Matches if preceded by ... (must be fixed length).
        (?<!...) Matches if not preceded by ... (must be fixed length).
        (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                           the (optional) no pattern otherwise.

    The special sequences consist of "\\" and a character from the list
    below.  If the ordinary character is not on the list, then the
    resulting RE will match the second character.
        \number  Matches the contents of the group of the same number.
        \A       Matches only at the start of the string.
        \Z       Matches only at the end of the string.
        \b       Matches the empty string, but only at the start or end of a word.
        \B       Matches the empty string, but not at the start or end of a word.
        \d       Matches any decimal digit; equivalent to the set [0-9].
        \D       Matches any non-digit character; equivalent to the set [^0-9].
        \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v].
        \S       Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].
        \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
                 With LOCALE, it will match the set [0-9_] plus characters defined
                 as letters for the current locale.
        \W       Matches the complement of \w.
        \\       Matches a literal backslash.

    This module exports the following functions:
        match    Match a regular expression pattern to the beginning of a string.
        search   Search a string for the presence of a pattern.
        sub      Substitute occurrences of a pattern found in a string.
        subn     Same as sub, but also return the number of substitutions made.
        split    Split a string by the occurrences of a pattern.
        findall  Find all occurrences of a pattern in a string.
        finditer Return an iterator yielding a match object for each match.
        compile  Compile a pattern into a RegexObject.
        purge    Clear the regular expression cache.
        escape   Backslash all non-alphanumerics in a string.

    Some of the functions in this module takes flags as optional parameters:
        I  IGNORECASE  Perform case-insensitive matching.
        L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
        M  MULTILINE   "^" matches the beginning of lines (after a newline)
                       as well as the string.
                       "$" matches the end of lines (before a newline) as well
                       as the end of the string.
        S  DOTALL      "." matches any character at all, including the newline.
        X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
        U  UNICODE     Make \w, \W, \b, \B, dependent on the Unicode locale.

    This module also defines an exception 'error'.

CLASSES
    exceptions.Exception(exceptions.BaseException)
        sre_constants.error

    class error(exceptions.Exception)
     |  Method resolution order:
     |      error
     |      exceptions.Exception
     |      exceptions.BaseException
     |      __builtin__.object
     |
     |  Data descriptors defined here:
     |
     |  __weakref__
     |      list of weak references to the object (if defined)
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from exceptions.Exception:
     |
     |  __init__(...)
     |      x.__init__(...) initializes x; see help(type(x)) for signature
     |
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from exceptions.Exception:
     |
     |  __new__ = <built-in method __new__ of type object>
     |      T.__new__(S, ...) -> a new object with type S, a subtype of T
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from exceptions.BaseException:
     |
     |  __delattr__(...)
     |      x.__delattr__('name') <==> del x.name
     |
     |  __getattribute__(...)
     |      x.__getattribute__('name') <==> x.name
     |
     |  __getitem__(...)
     |      x.__getitem__(y) <==> x[y]
     |
     |  __getslice__(...)
     |      x.__getslice__(i, j) <==> x[i:j]
     |
     |      Use of negative indices is not supported.
     |
     |  __reduce__(...)
     |
     |  __repr__(...)
     |      x.__repr__() <==> repr(x)
     |
     |  __setattr__(...)
     |      x.__setattr__('name', value) <==> x.name = value
     |
     |  __setstate__(...)
     |
     |  __str__(...)
     |      x.__str__() <==> str(x)
     |
     |  __unicode__(...)
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from exceptions.BaseException:
     |
     |  __dict__
     |
     |  args
     |
     |  message

FUNCTIONS
    compile(pattern, flags=0)
        Compile a regular expression pattern, returning a pattern object.

    escape(pattern)
        Escape all non-alphanumeric characters in pattern.

    findall(pattern, string, flags=0)
        Return a list of all non-overlapping matches in the string.

        If one or more groups are present in the pattern, return a
        list of groups; this will be a list of tuples if the pattern
        has more than one group.

        Empty matches are included in the result.

    finditer(pattern, string, flags=0)
        Return an iterator over all non-overlapping matches in the
        string.  For each match, the iterator returns a match object.

        Empty matches are included in the result.

    match(pattern, string, flags=0)
        Try to apply the pattern at the start of the string, returning
        a match object, or None if no match was found.

    purge()
        Clear the regular expression cache

    search(pattern, string, flags=0)
        Scan through string looking for a match to the pattern, returning
        a match object, or None if no match was found.

    split(pattern, string, maxsplit=0, flags=0)
        Split the source string by the occurrences of the pattern,
        returning a list containing the resulting substrings.

    sub(pattern, repl, string, count=0, flags=0)
        Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl.  repl can be either a string or a callable;
        if a string, backslash escapes in it are processed.  If it is
        a callable, it's passed the match object and must return
        a replacement string to be used.

    subn(pattern, repl, string, count=0, flags=0)
        Return a 2-tuple containing (new_string, number).
        new_string is the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in the source
        string by the replacement repl.  number is the number of
        substitutions that were made. repl can be either a string or a
        callable; if a string, backslash escapes in it are processed.
        If it is a callable, it's passed the match object and must
        return a replacement string to be used.

    template(pattern, flags=0)
        Compile a template pattern, returning a pattern object

DATA
    DOTALL = 16
    I = 2
    IGNORECASE = 2
    L = 4
    LOCALE = 4
    M = 8
    MULTILINE = 8
    S = 16
    U = 32
    UNICODE = 32
    VERBOSE = 64
    X = 64
    __all__ = ['match', 'search', 'sub', 'subn', 'split', 'findall', 'comp...
    __version__ = '2.2.1'

VERSION
    2.2.1
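Source: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Regular expressions are a powerful tool for processing strings. They have their own syntax and a separate processing engine; they may be less efficient than the built-in str methods, but they are far more capable. Because of this, the regular expression syntax is largely the same in every language that provides it; the differences lie in how much of the syntax each implementation supports. There is no need to worry, though: the unsupported parts are usually the rarely used ones.
The figure below shows the flow of matching with a regular expression:
The figure below lists the regular expression metacharacters and syntax supported by Python:
Regular expressions are usually used to find matching strings in text. In Python, quantifiers are greedy by default (in a few languages the default may be non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, the regular expression "ab*" applied to "abbbc" finds "abbb", whereas the non-greedy "ab*?" finds "a".
Test: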
>>> print re.match('ab*','abbbc').group()
abbb
>>> print re.match('ab*?','abbbc').group()
a
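The difference matters most when a pattern contains a wildcard between two delimiters. The sketch below, assuming Python 3 and using a made-up sample string, contrasts the greedy and non-greedy forms on tag-like text.

```python
import re

s = '<b>bold</b> and <i>italic</i>'   # made-up sample text

greedy = re.search(r'<.*>', s).group()    # runs on to the last '>' in the string
lazy = re.search(r'<.*?>', s).group()     # stops at the first '>'
print(greedy)
print(lazy)
```

The greedy form swallows everything up to the final closing bracket, while the non-greedy form returns only the first tag.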
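As in most programming languages, regular expressions use "\" as the escape character, which can lead to a plague of backslashes. If you need to match a literal "\" in the text, the regular expression written as an ordinary string literal needs four backslashes, "\\\\": the first pair and the second pair are each turned into one backslash by the programming language, and the resulting two backslashes are then interpreted by the regular expression engine as one escaped backslash. Python's raw strings solve this problem neatly: the regular expression in this example can be written r"\\". Likewise, "\\d", which matches a digit, can be written r"\d". With raw strings you no longer need to worry about a missing backslash, and the expressions you write are easier to read.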
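The equivalence between the two spellings can be checked directly. This is a small sketch assuming Python 3; the variable names and the sample text are illustrative only.

```python
import re

text = 'C:\\temp'          # the actual text is C:\temp (one backslash)

normal = '\\\\'            # ordinary literal: four typed backslashes give the regex \\
raw = r'\\'                # raw literal: two typed backslashes give the same regex

print(normal == raw)
print(re.findall(raw, text))
print(re.findall(r'\d', 'a1b2') == re.findall('\\d', 'a1b2'))
```

Both literals produce the same two-character pattern, so the raw form is purely a notational convenience.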
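Regular expressions provide several matching modes, such as ignoring case or multi-line matching. These are introduced together with the Pattern factory method re.compile(pattern[, flags]).
Python supports regular expressions through the re module. The usual steps are: compile the string form of the regular expression into a Pattern instance, use the Pattern instance to process the text and obtain a match result (a Match instance), and finally use the Match instance to extract information and carry out further operations.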
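# Compile the regular expression into a Pattern object
>>> pattern = re.compile(r'hello')
# Match the text with the Pattern; the result is None when there is no match
>>> match = pattern.match('hello world!')
# Use the Match object to get group information
>>> print (match.group())
hello
This approach is mostly used when writing scripts or modules: complex rules, or rules that are used frequently, are compiled once and then reused.
>>> help(re.compile)
Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a pattern object.

re.compile(strPattern[, flag]):
This function is the factory for Pattern objects; it compiles the string form of a regular expression into a Pattern object. The second argument, flag, is the matching mode; values can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M. You can also specify the mode inside the pattern string: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern'). (See the special constructs that do not act as groups.)
Available values (described in the help(re) output above): re.I (IGNORECASE), re.L (LOCALE), re.M (MULTILINE), re.S (DOTALL), re.U (UNICODE) and re.X (VERBOSE). In re.X mode whitespace and comments inside the pattern are ignored, so the two patterns below are equivalent:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")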
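Both equivalences mentioned above (flags passed as an argument versus inline flags, and a verbose pattern versus its compact form) can be checked with a few lines. This is a sketch assuming Python 3; the sample text and the names by_flag and inline are made up for illustration.

```python
import re

text = 'Hello\nhello world'   # made-up two-line sample

by_flag = re.compile(r'^hello', re.I | re.M)   # flags passed as an argument
inline = re.compile(r'(?im)^hello')            # same flags written inside the pattern

print(by_flag.findall(text))
print(inline.findall(text))

# The verbose (re.X) pattern and its compact form accept the same input
a = re.compile(r"""\d +   # the integral part
                   \.     # the decimal point
                   \d *   # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
print(a.match('3.14').group(), b.match('3.14').group())
```

With re.M the '^' anchor matches at the start of each line, and with re.I the case of 'Hello' is ignored, so both compiled patterns find two matches.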
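>>> help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

>>> m = re.match(r'hello', 'hello world!')
>>> m.group()
'hello'
A Match object is the result of one match and holds a lot of information about it; this information is available through the read-only attributes and methods that Match provides.
Attributes:
(1) string: the text used for matching.
(2) re: the Pattern object used for matching.
(3) pos: the index in the text at which the regular expression starts searching. Same value as the pos argument of Pattern.match() and Pattern.search().
(4) endpos: the index in the text at which the regular expression stops searching. Same value as the endpos argument of Pattern.match() and Pattern.search().
(5) lastindex: the index (group number) of the last captured group. None if no group was captured.
(6) lastgroup: the alias of the last captured group. None if that group has no alias or if no group was captured.
>>> m.string
'hello world!'
>>> m.re
<_sre.SRE_Pattern object at 0x02CC6D40>
>>> m.pos
0
>>> m.endpos
12
>>> m.lastindex
>>> m.lastgroup
>>>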
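Methods:
(1) group([group1, …]):
Returns the string captured by one or more groups; when several arguments are given, the result is a tuple. group1 can be a number or an alias; number 0 stands for the whole matched substring; with no arguments, group(0) is returned. A group that captured nothing returns None; a group that captured several times returns the last captured substring.
(2) groups([default]):
Returns the strings captured by all groups as a tuple. Equivalent to calling group(1,2,…last). default is the value used for groups that captured nothing; it defaults to None.
(3) groupdict([default]):
Returns a dictionary whose keys are the aliases of the named groups and whose values are the substrings those groups captured; groups without an alias are not included. default has the same meaning as above.
(4) start([group]):
Returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.
(5) end([group]):
Returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
(6) span([group]):
Returns (start(group), end(group)).
(7) expand(template):
Substitutes the matched groups into template and returns the result. template can reference groups with \id, \g<id> or \g<name>; \0 cannot be used as a group reference. \id and \g<id> are equivalent, but \10 is read as group 10; to express \1 followed by the character '0', you must write \g<1>0.
Example:
The pattern has three groups: (1) one or more word characters; (2) one or more word characters; (3) a group with the extra alias "sign" that matches zero or more of any character. The string to match is "hello world!".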
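>>> m2 = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')
>>> m2.string   # the text used for matching, i.e. the string to match
'hello world!'
>>> m2.re   # the Pattern object used for matching, i.e. the compiled rule
<_sre.SRE_Pattern object at 0x02CB8B00>
>>> m2.pos   # index in the text at which the search starts
0
>>> m2.endpos   # index in the text at which the search stops
12
>>> m2.lastindex   # group number of the last captured group
3
>>> m2.lastgroup   # alias of the last captured group; None if that group has no alias or nothing was captured, so there is output only when a named group was captured last
'sign'
>>> m3 = re.match(r'(\w+) (\w+)(.*)', 'hello world!')
>>> m3.lastgroup
>>>
>>> m2.group()   # string captured by one or more groups; several arguments give a tuple
'hello world!'
>>> m2.group(0)
'hello world!'
>>> m2.group(1)
'hello'
>>> m2.group(2)
'world'
>>> m2.group(3)
'!'
>>> m2.group(1,2)
('hello', 'world')
>>> m2.group(1,3)
('hello', '!')
>>> m2.group(1,2,3)
('hello', 'world', '!')
>>> m2.groups()   # strings captured by all groups, as a tuple
('hello', 'world', '!')
>>> m2.groups(1)   # the argument is the default for groups that captured nothing; every group captured here, so the result is unchanged
('hello', 'world', '!')
>>> m2.groups(2)
('hello', 'world', '!')
# Dictionary keyed by the aliases of the named groups, with the captured substrings as values; groups without an alias are not included
>>> m2.groupdict()
{'sign': '!'}
# Start index in string of the substring captured by the given group (index of its first character)
>>> m2.start()
0
>>> m2.start(0)
0
>>> m2.start(1)
0
>>> m2.start(2)
6
>>> m2.start(3)
11
# End index in string of the substring captured by the given group (index of its last character + 1)
>>> m2.end()
12
>>> m2.end(0)
12
>>> m2.end(1)
5
>>> m2.end(2)
11
>>> m2.end(3)
12
# Substitute the matched groups into the template and return them in the rearranged order
>>> m2.expand(r'\3\2\1')
'!worldhello'
>>> m2.expand(r'\3 \2 \1')
'! world hello'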
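In the session above every group captures something, so the default argument of groups() and groupdict() has no visible effect. The following sketch, assuming Python 3, uses a slightly different (hypothetical) pattern whose named group is optional, and also shows the \g<1>0 form of a group reference in expand().

```python
import re

# Variant of the pattern above in which the named group is optional (an assumption
# made for illustration), so that it can stay unmatched
m = re.match(r'(\w+) (\w+)(?P<sign>!)?', 'hello world')

print(m.groups())              # the unmatched group is reported as None
print(m.groups('?'))           # ... or as the supplied default
print(m.groupdict('none'))
print(m.lastindex)             # number of the last group that actually captured

# \g<1>0 means "group 1 followed by the character 0"; \10 would mean group 10
print(m.expand(r'\g<1>0 \g<2>'))
```

Supplying a default is handy when the tuple is fed straight into string formatting and None would be awkward.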
A Pattern object is a compiled regular expression; the methods that Pattern provides can be used to match and search text.
>>> help(m2.re)
Help on SRE_Pattern object:

class SRE_Pattern(__builtin__.object)
 |  Compiled regular expression objects
 |
 |  Methods defined here:
 |
 |  __copy__(...)
 |
 |  __deepcopy__(...)
 |
 |  findall(...)
 |      findall(string[, pos[, endpos]]) --> list.
 |      Return a list of all non-overlapping matches of pattern in string.
 |
 |  finditer(...)
 |      finditer(string[, pos[, endpos]]) --> iterator.
 |      Return an iterator over all non-overlapping matches for the
 |      RE pattern in string. For each match, the iterator returns a
 |      match object.
 |
 |  match(...)
 |      match(string[, pos[, endpos]]) --> match object or None.
 |      Matches zero or more characters at the beginning of the string
 |
 |  scanner(...)
 |
 |  search(...)
 |      search(string[, pos[, endpos]]) --> match object or None.
 |      Scan through string looking for a match, and return a corresponding
 |      match object instance. Return None if no position in the string matches.
 |
 |  split(...)
 |      split(string[, maxsplit = 0])  --> list.
 |      Split string by the occurrences of pattern.
 |
 |  sub(...)
 |      sub(repl, string[, count = 0]) --> newstring
 |      Return the string obtained by replacing the leftmost non-overlapping
 |      occurrences of pattern in string by the replacement repl.
 |
 |  subn(...)
 |      subn(repl, string[, count = 0]) --> (newstring, number of subs)
 |      Return the tuple (new_string, number_of_subs_made) found by replacing
 |      the leftmost non-overlapping occurrences of pattern with the
 |      replacement repl.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  flags
 |
 |  groupindex
 |
 |  groups
 |
 |  pattern
Pattern cannot be instantiated directly; it must be constructed with re.compile().
Pattern provides the following read-only attributes for inspecting the expression:
(1) pattern: the expression string used at compile time.
(2) flags: the matching mode used at compile time, as a number.
(3) groups: the number of groups in the expression.
(4) groupindex: a dictionary whose keys are the aliases of the named groups in the expression and whose values are the corresponding group numbers; groups without an alias are not included.
import re
p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

print "p.pattern:", p.pattern
print "p.flags:", p.flags
print "p.groups:", p.groups
print "p.groupindex:", p.groupindex

### output ###
# p.pattern: (\w+) (\w+)(?P<sign>.*)
# p.flags: 16
# p.groups: 3
# p.groupindex: {'sign': 3}
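3.3.2.3.2 Instance methods [ | re module functions]:
1. match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):
 |  match(...)
 |      match(string[, pos[, endpos]]) --> match object or None.
 |
This method tries to match pattern starting at index pos of string. If the whole pattern can be matched, a Match object is returned; if the pattern fails to match along the way, or endpos is reached before the match is complete, None is returned.
pos and endpos default to 0 and len(string) respectively. re.match() cannot take these two arguments; its flags argument specifies the matching mode used when compiling pattern.
Note: this method does not require a full match. If string still has characters left when pattern ends, the match still counts as successful. To require a full match, add the boundary anchor '$' to the end of the expression.
See section 3.3.2.1 for an example.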
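The prefix-only behaviour of match(), the effect of a trailing '$', and the pos argument of the compiled pattern's method can be seen in this short sketch (Python 3 assumed; the sample strings and the name anchored are made up).

```python
import re

p = re.compile(r'\d+')
anchored = re.compile(r'\d+$')

print(p.match('123abc') is not None)        # the prefix matches, the rest is ignored
print(anchored.match('123abc') is None)     # '$' demands the end of the string
print(anchored.match('123').group())

# As the help text above notes, '$' also matches just before a trailing newline
print(anchored.match('123\n') is not None)

# pos/endpos are only available on the compiled pattern's method
print(p.match('abc123', 3).group())
```

If a trailing newline must not be accepted, \Z (listed in the help text above as matching only at the end of the string) is the stricter anchor.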
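2. search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):
This method looks for a substring of string that the pattern can match. It tries to match pattern starting at index pos of string; if the whole pattern can be matched, a Match object is returned. Otherwise pos is increased by 1 and the match is tried again; if there is still no match when pos reaches endpos, None is returned.
pos and endpos default to 0 and len(string) respectively. re.search() cannot take these two arguments; its flags argument specifies the matching mode used when compiling pattern.
# Compile the regular expression into a Pattern object
>>> pattern = re.compile(r'world')
# Use search() to find a matching substring; None is returned when no substring matches
# match() would fail in this example, because the string does not start with 'world'; a pattern of 'hello' would match() successfully
>>> match = pattern.search('hello world!')
# Use the Match object to get group information
>>> match.group()
'world'
>>>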
3. split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
Splits string at every substring that the pattern matches and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
 |  split(...)
 |      split(string[, maxsplit = 0])  --> list.
 |      Split string by the occurrences of pattern.

>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

>>> p = re.compile(r'\d+')
>>> p
<_sre.SRE_Pattern object at 0x02D53F70>
>>> p.split('one1two2three3four4five5six6seven7eight8nine9ten10')
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', '']
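The trailing empty string in the result above appears because the input ends with a match. The sketch below (Python 3 assumed, with a shortened sample string) shows that effect, the maxsplit argument, and what happens when the pattern contains a capturing group.

```python
import re

p = re.compile(r'\d+')
s = 'one1two2three3'   # shortened sample string

print(p.split(s))                      # trailing '' because the string ends with a match
print(p.split(s, 1))                   # at most one split
print([w for w in p.split(s) if w])    # drop empty pieces

# A capturing group in the pattern keeps the separators in the result
print(re.split(r'(\d+)', 'one1two2three'))
```

Keeping the separators is useful when the text has to be reassembled after the pieces are processed.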
4. findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):
Searches string and returns all matching substrings as a list.
>>> p = re.compile(r'\d+')
>>> p.findall('one1two2three3four4five5six6seven7eight8nine9ten10')
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
5. finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
Searches string and returns an iterator that yields each match result (a Match object) in order.
>>> p = re.compile(r'\d+')
>>> piter = p.finditer('one1two2three3four4')
>>> piter
<callable-iterator object at 0x02E153B0>
>>> for x in piter:
        print x

<_sre.SRE_Match object at 0x02EAE800>
<_sre.SRE_Match object at 0x02EAE838>
<_sre.SRE_Match object at 0x02EAE800>
<_sre.SRE_Match object at 0x02EAE838>
# The iterator is exhausted after one pass, so it is created again before the second loop
>>> piter = p.finditer('one1two2three3four4')
>>> for x in piter:
        print x.group(),

1 2 3 4
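Because finditer() yields Match objects rather than plain strings, the position of each match is available as well. A minimal sketch, assuming Python 3 and a made-up input string:

```python
import re

p = re.compile(r'\d+')

# Collect each matched substring together with its (start, end) position
found = [(m.group(), m.span()) for m in p.finditer('one1two22three333')]
print(found)
```

findall() would return only the strings; finditer() is the better choice when positions or named groups are needed.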
6. sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
Replaces every matching substring in string with repl and returns the resulting string.
When repl is a string, it can reference groups with \id, \g<id> or \g<name>; \0 is not available as a group reference.
When repl is a function, it must accept exactly one argument (a Match object) and return the replacement string (the returned string is used literally; group references in it are not expanded).
count specifies the maximum number of replacements; if it is not given, every match is replaced.
(1) repl as a string
>>> p = re.compile(r'(\w+) (\w+)')
>>> s = 'i say, hello world'
>>> p.sub(r'\2 \1',s)
'say i, world hello'
Note: the pattern has only two groups; referencing a group number beyond that raises an exception.
>>> p.sub(r'\3 \1',s)

Traceback (most recent call last):
  File "<pyshell#207>", line 1, in <module>
    p.sub(r'\3 \1',s)
  File "C:\Python27\lib\re.py", line 291, in filter
    return sre_parse.expand_template(template, match)
  File "C:\Python27\lib\sre_parse.py", line 833, in expand_template
    raise error, "invalid group reference"
error: invalid group reference
(2) repl as a function
>>> def fun(m):
        return m.group(1).title()+ ' ' + m.group(2).title()

>>> p.sub(fun,s)
'I Say, Hello World'
>>> help(str.title)
Help on method_descriptor:

title(...)
    S.title() -> string

    Return a titlecased version of S, i.e. words start with uppercase
    characters, all remaining cased characters have lowercase.

Returns the string with the first letter of each word capitalised.
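The session above uses numbered references and replaces every match. The following sketch, assuming Python 3, shows the \g<name> form, the count argument and a callable replacement; the group names first and second and the function name shout are hypothetical.

```python
import re

# Same pattern as above, with hypothetical group names added
p = re.compile(r'(?P<first>\w+) (?P<second>\w+)')
s = 'i say, hello world'

print(p.sub(r'\g<second> \g<first>', s))   # swap the words using group names
print(p.sub(r'\2 \1', s, count=1))         # only the first match is replaced

def shout(m):
    # Called once per match; must return the replacement string
    return m.group(0).upper()

print(p.sub(shout, s))
```

A callable is the natural choice whenever the replacement depends on a computation over the matched text rather than on a fixed template.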
7. subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):
Returns (sub(repl, string[, count]), number of replacements).
>>> help(p.subn)
Help on built-in function subn:

subn(...)
    subn(repl, string[, count = 0]) --> (newstring, number of subs)
    Return the tuple (new_string, number_of_subs_made) found by replacing
    the leftmost non-overlapping occurrences of pattern with the
    replacement repl.

>>> p = re.compile(r'(\w+) (\w+)')
>>> s = 'i say, hello world!'
>>> p.subn(r'\2 \1', s)
('say i, world hello!', 2)
>>> p.subn(r'\2',s)
('say, world!', 2)
>>> p.subn(r'\1',s)
('i, hello!', 2)
>>> p.subn(r'\1 \2',s)
('i say, hello world!', 2)
# funn only prints and returns None, so each match is replaced by an empty string
>>> def funn(m):
        print(m.group(1)+' '+ m.group(2))

>>> p.subn(funn,s)
i say
hello world
(', !', 2)
>>> def funn(m):
        return(m.group(1)+' '+ m.group(2))

>>> p.subn(funn,s)
('i say, hello world!', 2)