2-3 Counting how often elements occur in a sequence

Date: 2019-11-11


1. Ways to count occurrences in a sequence

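1.1 Initialize a dict with the fromkeys method, then count occurrences by iterating in a for loop.

Approach: first build a dictionary whose keys are the elements of the list and whose values are their occurrence counts, then sort the dictionary's items.

(1) Generate a list of 30 random integers from 1 to 21 (randint includes both endpoints)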

>>> from random import randint
>>> data = [randint(1,21) for _ in xrange(30)] 
>>> data
[18, 5, 21, 6, 13, 18, 3, 20, 14, 3, 8, 20, 12, 16, 21, 11, 9, 17, 14, 1, 19, 2, 6, 9, 6, 20, 8, 14, 18, 2]


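(2) Use the list's elements as keys to build a dictionary whose values are all 0; duplicate elements in the list collapse into a single key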

>>> dicData = dict.fromkeys(data,0) 
>>> dicData
{1: 0, 2: 0, 3: 0, 5: 0, 6: 0, 8: 0, 9: 0, 11: 0, 12: 0, 13: 0, 14: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}


(3) Iterate over the original list and add 1 to the matching dictionary value for each occurrence

>>> for x in data:
    dicData[x] += 1
>>> dicData
{1: 1, 2: 2, 3: 2, 5: 1, 6: 3, 8: 2, 9: 2, 11: 1, 12: 1, 13: 1, 14: 3, 16: 1, 17: 1, 18: 3, 19: 1, 20: 3, 21: 2}


(4) Sort the dictionary's items by value in descending order, producing a list of (key, value) tuples

>>> sortDicData = sorted(dicData.iteritems(),key=lambda x:x[1],reverse=True)
>>> sortDicData
[(6, 3), (14, 3), (18, 3), (20, 3), (2, 2), (3, 2), (8, 2), (9, 2), (21, 2), (1, 1), (5, 1), (11, 1), (12, 1), (13, 1), (16, 1), (17, 1), (19, 1)]


(5) Slice the first 4 tuples from the list (the four elements tied at a count of 3) and convert them back into a dictionary.

>>> newdicData = dict(sortDicData[:4])
>>> newdicData
{18: 3, 20: 3, 14: 3, 6: 3}
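
The five steps above can be collected into one short script. This is a minimal sketch written for Python 3 (the session above is Python 2), so items() stands in for iteritems(); the sample list and the variable names are made up for illustration.

```python
from operator import itemgetter

# Fixed sample data (hypothetical) so that the result is reproducible.
data = [6, 14, 6, 2, 14, 6, 3, 2, 18]

counts = dict.fromkeys(data, 0)   # one key per distinct element, all counts 0
for x in data:
    counts[x] += 1                # tally each occurrence

# Sort the (key, value) pairs by value, largest first.
ranked = sorted(counts.items(), key=itemgetter(1), reverse=True)
top3 = dict(ranked[:3])           # keep the three most frequent elements

print(ranked)
print(top3)
```

itemgetter(1) does the same job as lambda x: x[1] in the original sort call.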


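1.2 Using a collections.Counter object

This uses the same list as the previous example. Counter is a subclass of dict.

(1) Pass the sequence to the Counter constructor; the resulting Counter object is a dictionary of element frequencies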

>>> data
[18, 5, 21, 6, 13, 18, 3, 20, 14, 3, 8, 20, 12, 16, 21, 11, 9, 17, 14, 1, 19, 2, 6, 9, 6, 20, 8, 14, 18, 2]
>>> from collections import Counter
>>> dict1 = Counter(data)
>>> dict1
Counter({6: 3, 14: 3, 18: 3, 20: 3, 2: 2, 3: 2, 8: 2, 9: 2, 21: 2, 1: 1, 5: 1, 11: 1, 12: 1, 13: 1, 16: 1, 17: 1, 19: 1})


(2) Look up the value stored under a key (that is, its occurrence count)

>>> dict1[6]
3
>>> dict1[20]
3
>>> dict1[2]
2


(3) Call the most_common(n) method of dict1 to get a list of the n most frequent elements

>>> dict1.most_common(3)
[(6, 3), (14, 3), (18, 3)]
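
A few more Counter behaviours are worth seeing next to the calls above. The sketch below assumes Python 3 and uses a made-up sample list; it shows that a missing key counts as 0 and that counts can be extended with update().

```python
from collections import Counter

data = [6, 14, 6, 2, 14, 6, 3, 2, 18]   # hypothetical sample data
freq = Counter(data)

print(freq[6])             # count of an element that is present
print(freq[99])            # a missing element counts as 0, no KeyError
print(freq.most_common(2))

freq.update([18, 18, 18])  # add further observations to the same counter
print(freq.most_common(1))
```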


2. Counting word frequencies in an English text

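Approach: read the text into a string, then use the split function of the regular expression module to break it into individual words.

>>> from collections import Counter

>>> import re  # regular expression module

# Note: a Word .doc file cannot be read like a plain text file; it needs a module dedicated to reading .doc files.

# Open collections.txt and read the whole file into txt; txt is then one very long string.

>>> txt = open(r"C:\视频\python高效实践技巧笔记\collections.txt").read()

# Next, split the whole string on runs of non-word characters with re.split(r'\W+', txt), which yields a list of the individual words. \W matches anything other than letters, digits, and the underscore, so numbers survive as tokens. Then use Counter() to count word frequencies in that list, as described above.

>>> c3 = Counter(re.split(r'\W+', txt))

# Get the list of the 10 most frequent words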

>>> c3.most_common(10)

[('the', 177), ('a', 126), ('to', 96), ('and', 93), ('is', 73), ('d', 73), ('in', 72), ('for', 69), ('of', 64), ('2', 53)]
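
One detail of the split-based approach is that re.split can leave an empty string at either end of the list when the text starts or ends with a non-word character. The sketch below, which assumes Python 3 and uses a short inline sample in place of collections.txt, shows that effect and a variant based on re.findall that also folds case. The findall variant is an alternative, not what the session above does.

```python
import re
from collections import Counter

# Hypothetical inline sample standing in for the contents of collections.txt.
txt = "The quick fox. The lazy dog! the END"

# Splitting on \W+ leaves an empty string when the text ends with punctuation.
print(re.split(r'\W+', "fox. dog!"))

# findall with \w+ returns only the word tokens; lower() merges 'The' and 'the'.
words = re.findall(r'\w+', txt.lower())
word_counts = Counter(words)
print(word_counts.most_common(1))
```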

3 Further reading

3.1 Dictionaries

>>> help(dict)
Help on class dict in module __builtin__:

class dict(object)
 |  dict() -> new empty dictionary
 |  dict(mapping) -> new dictionary initialized from a mapping object's
 |      (key, value) pairs
 |  dict(iterable) -> new dictionary initialized as if via:
 |      d = {}
 |      for k, v in iterable:
 |          d[k] = v
 |  dict(**kwargs) -> new dictionary initialized with the name=value pairs
 |      in the keyword argument list.  For example:  dict(one=1, two=2)
 |  
 |  Methods defined here:
 |  
 |  __cmp__(...)
 |      x.__cmp__(y) <==> cmp(x,y)
 |  
 |  __contains__(...)
 |      D.__contains__(k) -> True if D has a key k, else False
 |  
 |  __delitem__(...)
 |      x.__delitem__(y) <==> del x[y]
 |  
 |  __eq__(...)
 |      x.__eq__(y) <==> x==y
 |  
 |  __ge__(...)
 |      x.__ge__(y) <==> x>=y
 |  
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(...)
 |      x.__gt__(y) <==> x>y
 |  
 |  __init__(...)
 |      x.__init__(...) initializes x; see help(type(x)) for signature
 |  
 |  __iter__(...)
 |      x.__iter__() <==> iter(x)
 |  
 |  __le__(...)
 |      x.__le__(y) <==> x<=y
 |  
 |  __len__(...)
 |      x.__len__() <==> len(x)
 |  
 |  __lt__(...)
 |      x.__lt__(y) <==> x<y
 |  
 |  __ne__(...)
 |      x.__ne__(y) <==> x!=y
 |  
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |  
 |  __setitem__(...)
 |      x.__setitem__(i, y) <==> x[i]=y
 |  
 |  __sizeof__(...)
 |      D.__sizeof__() -> size of D in memory, in bytes
 |  
 |  clear(...)
 |      D.clear() -> None.  Remove all items from D.
 |  
 |  copy(...)
 |      D.copy() -> a shallow copy of D
 |  
 |  fromkeys(...)
 |      dict.fromkeys(S[,v]) -> New dict with keys from S and values equal to v.
 |      v defaults to None.
 |  
 |  get(...)
 |      D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.
 |  
 |  has_key(...)
 |      D.has_key(k) -> True if D has a key k, else False
 |  
 |  items(...)
 |      D.items() -> list of D's (key, value) pairs, as 2-tuples
 |  
 |  iteritems(...)
 |      D.iteritems() -> an iterator over the (key, value) items of D
 |  
 |  iterkeys(...)
 |      D.iterkeys() -> an iterator over the keys of D
 |  
 |  itervalues(...)
 |      D.itervalues() -> an iterator over the values of D
 |  
 |  keys(...)
 |      D.keys() -> list of D's keys
 |  
 |  pop(...)
 |      D.pop(k[,d]) -> v, remove specified key and return the corresponding value.
 |      If key is not found, d is returned if given, otherwise KeyError is raised
 |  
 |  popitem(...)
 |      D.popitem() -> (k, v), remove and return some (key, value) pair as a
 |      2-tuple; but raise KeyError if D is empty.
 |  
 |  setdefault(...)
 |      D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D
 |  
 |  update(...)
 |      D.update([E, ]**F) -> None.  Update D from dict/iterable E and F.
 |      If E present and has a .keys() method, does:     for k in E: D[k] = E[k]
 |      If E present and lacks .keys() method, does:     for (k, v) in E: D[k] = v
 |      In either case, this is followed by: for k in F: D[k] = F[k]
 |  
 |  values(...)
 |      D.values() -> list of D's values
 |  
 |  viewitems(...)
 |      D.viewitems() -> a set-like object providing a view on D's items
 |  
 |  viewkeys(...)
 |      D.viewkeys() -> a set-like object providing a view on D's keys
 |  
 |  viewvalues(...)
 |      D.viewvalues() -> an object providing a view on D's values
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __hash__ = None
 |  
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T


3.1.1 fromkeys

 |  fromkeys(...)
 |      dict.fromkeys(S[,v]) -> New dict with keys from S and values equal to v.
 |      v defaults to None.
Builds a dictionary that uses the values of a sequence as its keys.
>>> data = [3,1,56]
>>> data1 = dict.fromkeys(data)
>>> data1
{56: None, 1: None, 3: None}
>>> data2 = dict.fromkeys(data,3)
>>> data2
{56: 3, 1: 3, 3: 3}
>>>


3.1.2 iteritems

 |  iteritems(...)
 |      D.iteritems() -> an iterator over the (key, value) items of D
Continuing the previous example: this is an iterator over (key, value) pairs
>>> data2.iteritems()
<dictionary-itemiterator object at 0x02D812A0>


3.1.3 iterkeys

 |  iterkeys(...)
 |      D.iterkeys() -> an iterator over the keys of D

Continuing the previous example: this is an iterator over the keys
>>> data2.iterkeys
<built-in method iterkeys of dict object at 0x02E3BDB0>
>>> data2.iterkeys()
<dictionary-keyiterator object at 0x02E27F00>


3.1.4 itervalues

 |      D.itervalues() -> an iterator over the values of D
Continuing the previous example: this is an iterator over the values
>>> data2.itervalues()
<dictionary-valueiterator object at 0x02D81810>

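
The iteritems, iterkeys, and itervalues methods shown above exist only in Python 2. As a rough sketch of the Python 3 counterparts, items(), keys(), and values() return view objects that are iterated lazily and reflect later changes to the dictionary. The variable names are illustrative.

```python
d = dict.fromkeys([3, 1, 56], 3)

items_view = d.items()         # Python 3: a view, not a list
print(list(items_view))

d[7] = 0                       # views reflect later changes to the dictionary
print(list(d.keys()))
print(list(d.values()))

for key, value in d.items():   # iterating a view needs no intermediate list
    print(key, value)
```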

3.2 collections

>>> import collections

>>> help(collections)

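This prints essentially the whole of the official online documentation for the module; the official documentation remains the most convenient learning resource.

3.2.1 namedtuple

Introduced in "2-2 Naming the elements of a tuple".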

>>> import collections
>>> help(collections.namedtuple)
Help on function namedtuple in module collections:

namedtuple(typename, field_names, verbose=False, rename=False)
    Returns a new subclass of tuple with named fields.
    
    >>> Point = namedtuple('Point', ['x', 'y'])
    >>> Point.__doc__                   # docstring for the new class
    'Point(x, y)'
    >>> p = Point(11, y=22)             # instantiate with positional args or keywords
    >>> p[0] + p[1]                     # indexable like a plain tuple
    33
    >>> x, y = p                        # unpack like a regular tuple
    >>> x, y
    (11, 22)
    >>> p.x + p.y                       # fields also accessible by name
    33
    >>> d = p._asdict()                 # convert to a dictionary
    >>> d['x']
    11
    >>> Point(**d)                      # convert from a dictionary
    Point(x=11, y=22)
    >>> p._replace(x=100)               # _replace() is like str.replace() but targets named fields
    Point(x=100, y=22)


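namedtuple is a function that creates a custom tuple type: it fixes the number of elements and lets each element be referred to by attribute name rather than by index.

This makes it easy to define a small data type with namedtuple: it keeps the immutability of a tuple while allowing access by attribute, which is very convenient.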
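
A small sketch of those two properties, access by name and immutability. The Student type and its field values are invented for illustration.

```python
from collections import namedtuple

# Hypothetical record type for illustration.
Student = namedtuple('Student', ['name', 'age', 'email'])
s = Student('Jim', 16, 'jim@example.com')

print(s.name, s[0])          # the same field, by attribute and by index
print(isinstance(s, tuple))  # it is still a tuple

try:
    s.age = 17               # fields are read-only, as with any tuple
except AttributeError as exc:
    print('cannot assign:', exc)

older = s._replace(age=17)   # _replace returns a new tuple; s is unchanged
print(older.age, s.age)
```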

3.2.2 Counter

>>> import collections

>>> help(collections.Counter)

The printed documentation is long.

most_common()

 |  most_common(self, n=None)
 |      List the n most common elements and their counts from the most
 |      common to the least.  If n is None, then list all element counts.
 |      
 |      >>> Counter('abcdeabcdabcaba').most_common(3)
 |      [('a', 5), ('b', 4), ('c', 3)]

3.3 The re regular expression module

Official documentation:

Py2.7:https://docs.python.org/2.7/library/re.html

Py3 :https://docs.python.org/3/library/re.html

>>> help(re)
Help on module re:

NAME
    re - Support for regular expressions (RE).

FILE
    c:\python27\lib\re.py

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last matches the string 'last'.
    
    The special characters are:
        "."      Matches any character except a newline.
        "^"      Matches the start of the string.
        "$"      Matches the end of the string or just before the newline at
                 the end of the string.
        "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
                 Greedy means that it will match as many repetitions as possible.
        "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
        "?"      Matches 0 or 1 (greedy) of the preceding RE.
        *?,+?,?? Non-greedy versions of the previous three special characters.
        {m,n}    Matches from m to n repetitions of the preceding RE.
        {m,n}?   Non-greedy version of the above.
        "\\"     Either escapes special characters or signals a special sequence.
        []       Indicates a set of characters.
                 A "^" as the first character indicates a complementing set.
        "|"      A|B, creates an RE that will match either A or B.
        (...)    Matches the RE inside the parentheses.
                 The contents can be retrieved or matched later in the string.
        (?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below).
        (?:...)  Non-grouping version of regular parentheses.
        (?P<name>...) The substring matched by the group is accessible by name.
        (?P=name)     Matches the text matched earlier by the group named name.
        (?#...)  A comment; ignored.
        (?=...)  Matches if ... matches next, but doesn't consume the string.
        (?!...)  Matches if ... doesn't match next.
        (?<=...) Matches if preceded by ... (must be fixed length).
        (?<!...) Matches if not preceded by ... (must be fixed length).
        (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                           the (optional) no pattern otherwise.
    
    The special sequences consist of "\\" and a character from the list
    below.  If the ordinary character is not on the list, then the
    resulting RE will match the second character.
        \number  Matches the contents of the group of the same number.
        \A       Matches only at the start of the string.
        \Z       Matches only at the end of the string.
        \b       Matches the empty string, but only at the start or end of a word.
        \B       Matches the empty string, but not at the start or end of a word.
        \d       Matches any decimal digit; equivalent to the set [0-9].
        \D       Matches any non-digit character; equivalent to the set [^0-9].
        \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v].
        \S       Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].
        \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
                 With LOCALE, it will match the set [0-9_] plus characters defined
                 as letters for the current locale.
        \W       Matches the complement of \w.
        \\       Matches a literal backslash.
    
    This module exports the following functions:
        match    Match a regular expression pattern to the beginning of a string.
        search   Search a string for the presence of a pattern.
        sub      Substitute occurrences of a pattern found in a string.
        subn     Same as sub, but also return the number of substitutions made.
        split    Split a string by the occurrences of a pattern.
        findall  Find all occurrences of a pattern in a string.
        finditer Return an iterator yielding a match object for each match.
        compile  Compile a pattern into a RegexObject.
        purge    Clear the regular expression cache.
        escape   Backslash all non-alphanumerics in a string.
    
    Some of the functions in this module takes flags as optional parameters:
        I  IGNORECASE  Perform case-insensitive matching.
        L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
        M  MULTILINE   "^" matches the beginning of lines (after a newline)
                       as well as the string.
                       "$" matches the end of lines (before a newline) as well
                       as the end of the string.
        S  DOTALL      "." matches any character at all, including the newline.
        X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
        U  UNICODE     Make \w, \W, \b, \B, dependent on the Unicode locale.
    
    This module also defines an exception 'error'.

CLASSES
    exceptions.Exception(exceptions.BaseException)
        sre_constants.error
    
    class error(exceptions.Exception)
     |  Method resolution order:
     |      error
     |      exceptions.Exception
     |      exceptions.BaseException
     |      __builtin__.object
     |  
     |  Data descriptors defined here:
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from exceptions.Exception:
     |  
     |  __init__(...)
     |      x.__init__(...) initializes x; see help(type(x)) for signature
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from exceptions.Exception:
     |  
     |  __new__ = <built-in method __new__ of type object>
     |      T.__new__(S, ...) -> a new object with type S, a subtype of T
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from exceptions.BaseException:
     |  
     |  __delattr__(...)
     |      x.__delattr__('name') <==> del x.name
     |  
     |  __getattribute__(...)
     |      x.__getattribute__('name') <==> x.name
     |  
     |  __getitem__(...)
     |      x.__getitem__(y) <==> x[y]
     |  
     |  __getslice__(...)
     |      x.__getslice__(i, j) <==> x[i:j]
     |      
     |      Use of negative indices is not supported.
     |  
     |  __reduce__(...)
     |  
     |  __repr__(...)
     |      x.__repr__() <==> repr(x)
     |  
     |  __setattr__(...)
     |      x.__setattr__('name', value) <==> x.name = value
     |  
     |  __setstate__(...)
     |  
     |  __str__(...)
     |      x.__str__() <==> str(x)
     |  
     |  __unicode__(...)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from exceptions.BaseException:
     |  
     |  __dict__
     |  
     |  args
     |  
     |  message

FUNCTIONS
    compile(pattern, flags=0)
        Compile a regular expression pattern, returning a pattern object.
    
    escape(pattern)
        Escape all non-alphanumeric characters in pattern.
    
    findall(pattern, string, flags=0)
        Return a list of all non-overlapping matches in the string.
        
        If one or more groups are present in the pattern, return a
        list of groups; this will be a list of tuples if the pattern
        has more than one group.
        
        Empty matches are included in the result.
    
    finditer(pattern, string, flags=0)
        Return an iterator over all non-overlapping matches in the
        string.  For each match, the iterator returns a match object.
        
        Empty matches are included in the result.
    
    match(pattern, string, flags=0)
        Try to apply the pattern at the start of the string, returning
        a match object, or None if no match was found.
    
    purge()
        Clear the regular expression cache
    
    search(pattern, string, flags=0)
        Scan through string looking for a match to the pattern, returning
        a match object, or None if no match was found.
    
    split(pattern, string, maxsplit=0, flags=0)
        Split the source string by the occurrences of the pattern,
        returning a list containing the resulting substrings.
    
    sub(pattern, repl, string, count=0, flags=0)
        Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl.  repl can be either a string or a callable;
        if a string, backslash escapes in it are processed.  If it is
        a callable, it's passed the match object and must return
        a replacement string to be used.
    
    subn(pattern, repl, string, count=0, flags=0)
        Return a 2-tuple containing (new_string, number).
        new_string is the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in the source
        string by the replacement repl.  number is the number of
        substitutions that were made. repl can be either a string or a
        callable; if a string, backslash escapes in it are processed.
        If it is a callable, it's passed the match object and must
        return a replacement string to be used.
    
    template(pattern, flags=0)
        Compile a template pattern, returning a pattern object

DATA
    DOTALL = 16
    I = 2
    IGNORECASE = 2
    L = 4
    LOCALE = 4
    M = 8
    MULTILINE = 8
    S = 16
    U = 32
    UNICODE = 32
    VERBOSE = 64
    X = 64
    __all__ = ['match', 'search', 'sub', 'subn', 'split', 'findall', 'comp...
    __version__ = '2.2.1'

VERSION
    2.2.1


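A Guide to Python Regular Expressions

Source: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

3.3.1 Regular expression basics

3.3.1.1 Brief introduction

Regular expressions are a powerful tool for processing strings. They have their own syntax and a dedicated matching engine; they may be slower than the methods built into str, but they are far more capable. Because of this, regular expression syntax is much the same in every language that provides it; the difference lies in how much of the syntax each language's implementation supports. There is no need to worry, though: the unsupported syntax is usually the rarely used part.

[Figure: the workflow of matching with a regular expression]

[Figure: the regular expression metacharacters and syntax supported by Python]

3.3.1.2 Greedy and non-greedy quantifiers

Regular expressions are typically used to find matching strings in text. Quantifiers in Python are greedy by default (a few languages may default to non-greedy) and always try to match as many characters as possible; non-greedy quantifiers do the opposite and try to match as few as possible. For example, the regular expression "ab*" applied to "abbbc" finds "abbb", whereas the non-greedy "ab*?" finds "a".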

Test:

>>> print re.match('ab*','abbbc').group()
abbb
>>> print re.match('ab*?','abbbc').group()
a
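
The same comparison as a Python 3 sketch, with one extra case added for illustration: a non-greedy quantifier still expands when the rest of the pattern cannot match otherwise.

```python
import re

text = 'abbbc'
print(re.match(r'ab*', text).group())    # greedy: takes every available 'b'
print(re.match(r'ab*?', text).group())   # non-greedy: takes as few as possible
# The trailing 'c' forces b*? to consume all three 'b' characters.
print(re.match(r'ab*?c', text).group())
```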

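3.3.1.3 The backslash problem

As in most programming languages, regular expressions use "\" as the escape character, which can lead to a pile-up of backslashes. To match a literal "\" in text, the regular expression written as an ordinary string literal needs four backslashes, "\\\\": the string literal turns each pair into a single backslash, leaving two, and the regular expression engine then reads those two as one literal backslash. Python's raw strings solve this neatly: the same regular expression can be written r"\\". Likewise, "\\d", which matches a digit, can be written r"\d". With raw strings you no longer need to worry about a missing backslash, and the expressions are easier to read.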
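
The equivalence can be checked directly. A minimal sketch, assuming Python 3; the sample path is made up.

```python
import re

path = 'C:\\temp'                        # the string holds a single backslash
print(len('\\\\'), len(r'\\'))           # both literals are the same 2-character pattern
print(re.search('\\\\', path).group())   # ordinary literal: four backslashes
print(re.search(r'\\', path).group())    # raw string: two backslashes
print(re.findall(r'\d', 'a1b22'))        # r'\d' is the same pattern as '\\d'
```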

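3.3.1.4 Matching modes

Regular expressions offer several matching modes, such as case-insensitive matching and multi-line matching. These are covered together with re.compile(pattern[, flags]), the factory function for Pattern objects.

3.3.2 The re module

3.3.2.1 re.compile

Python supports regular expressions through the re module. The usual workflow is to compile the string form of the regular expression into a Pattern instance, use that Pattern to process text and obtain a match result (a Match instance), and finally use the Match instance to extract information and carry on from there.

# Compile the regular expression into a Pattern object
>>> pattern = re.compile(r'hello')
# Match text with the Pattern; returns None if it cannot match
>>> match = pattern.match('hello world!')
# Use the Match object to get group information
>>> print (match.group())
hello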

In scripts and modules, this approach is mostly used for more complex rules or rules that are applied repeatedly: compile once, then reuse.

>>> help(re.compile)
Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a pattern object.

re.compile(strPattern[, flag]):

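This is the factory function for Pattern objects: it compiles a regular expression given as a string into a Pattern object. The second argument, flag, is the matching mode; several values can be combined with the bitwise OR operator '|' to take effect together, for example re.I | re.M. The mode can also be given inside the pattern string: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern'). (See the special constructs that do not form a group.)
Possible values:

re.I (re.IGNORECASE): ignore case (the full name is in parentheses, likewise below)
re.M (re.MULTILINE): multi-line mode; changes the behavior of '^' and '$' (see the figure above)
re.S (re.DOTALL): dot-matches-all mode; changes the behavior of '.'
re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale
re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode
re.X (re.VERBOSE): verbose mode. The regular expression can span several lines, whitespace is ignored, and comments can be added. The following two regular expressions are equivalent: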

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
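
The claim that flags passed as an argument and flags written inline are equivalent can be tried out as follows. This is a sketch for Python 3 with a made-up sample text; it also runs a verbose pattern of the same shape as the one above.

```python
import re

text = 'Hello\nhello world'

by_flags = re.compile(r'^hello', re.I | re.M)   # flags passed as an argument
inline = re.compile(r'(?im)^hello')             # the same flags written inline

print(by_flags.findall(text))
print(inline.findall(text))

# VERBOSE lets the pattern be laid out with whitespace and comments.
verbose = re.compile(r"""\d +   # integral part
                         \.     # decimal point
                         \d *   # fractional digits""", re.X)
print(verbose.match('3.14').group())
```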


3.3.2.2 re.match

>>> help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.
>>> m = re.match(r'hello', 'hello world!')
>>> m.group()
'hello'

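A Match object is the result of one match and holds a good deal of information about it, which can be read through the Match object's read-only attributes and methods.

Attributes:

(1) string: the text used for matching.

(2) re: the Pattern object used for matching.

(3) pos: the index in the text at which the regular expression started searching. Same value as the parameter of the same name in Pattern.match() and Pattern.search().

(4) endpos: the index in the text at which the regular expression stopped searching. Same value as the parameter of the same name in Pattern.match() and Pattern.search().

(5) lastindex: the number of the last captured group (its index among the groups, not a position in the text). None if no group was captured.

(6) lastgroup: the alias of the last captured group. None if that group has no alias or no group was captured.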

>>> m.string
'hello world!'
>>> m.re
<_sre.SRE_Pattern object at 0x02CC6D40>
>>> m.pos
0
>>> m.endpos
12
>>> m.lastindex
>>> m.lastgroup
>>>

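Methods:

(1) group([group1, …]):
Returns the string captured by one or more groups; with several arguments the result is a tuple. group1 can be a number or an alias; number 0 stands for the whole matched substring; with no argument, group(0) is returned; a group that captured nothing returns None; a group that captured several times returns its last capture.

(2) groups([default]):
Returns a tuple of the strings captured by all groups, equivalent to calling group(1,2,…last). default is the value used for groups that captured nothing; it defaults to None.

(3) groupdict([default]):
Returns a dictionary whose keys are the aliases of the named groups and whose values are the substrings those groups captured; groups without an alias are not included. default has the same meaning as above.

(4) start([group]):
Returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.

(5) end([group]):
Returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.

(6) span([group]):
Returns (start(group), end(group)).

(7) expand(template):
Substitutes the matched groups into template and returns the result. template can refer to groups with \id, \g<id>, or \g<name>; \0 cannot be used for the whole match (write \g<0> instead). \id and \g<id> are equivalent, but \10 is read as group 10; to write group 1 followed by the character '0', use \g<1>0.

Example:

Match three groups: (1) one or more word characters, (2) one or more word characters, (3) a group with the alias "sign" that matches zero or more of any character. The string to match is "hello world!".

>>> m2 = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')
>>> m2.string  # the text used for matching, i.e. the string being matched
'hello world!'
>>> m2.re     # the Pattern object used for matching, i.e. the compiled rule
<_sre.SRE_Pattern object at 0x02CB8B00>
>>> m2.pos  # index in the text where the search started
0
>>> m2.endpos # index in the text where the search stopped
12
>>> m2.lastindex  # number of the last captured group
3
>>> m2.lastgroup  # alias of the last captured group; None if that group has no alias or nothing was captured, so a value appears only when a named group was captured
'sign'
>>> m3 = re.match(r'(\w+) (\w+)(.*)', 'hello world!')
>>> m3.lastgroup
>>> 


>>> m2.group() # returns the string captured by one or more groups; with several arguments the result is a tuple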
'hello world!'
>>> m2.group(0)
'hello world!'
>>> m2.group(1)
'hello'
>>> m2.group(2)
'world'
>>> m2.group(3)
'!'
>>> m2.group(1,2)
('hello', 'world')
>>> m2.group(1,3)
('hello', '!')
>>> m2.group(1,2,3)
('hello', 'world', '!')

>>> m2.groups()  # returns the strings captured by all groups as a tuple
('hello', 'world', '!')
>>> m2.groups(1)
('hello', 'world', '!')
>>> m2.groups(2)
('hello', 'world', '!')
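# Returns a dictionary keyed by the aliases of the named groups, with the captured substrings as values; groups without an alias are not included.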
>>> m2.groupdict()  
{'sign': '!'}

# Returns the start index in string of the substring captured by the given group (index of its first character)
>>> m2.start()
0
>>> m2.start(0)
0
>>> m2.start(1)
0
>>> m2.start(2)
6
>>> m2.start(3)
11

# Returns the end index in string of the substring captured by the given group (index of its last character + 1)
>>> m2.end()
12
>>> m2.end(0)
12
>>> m2.end(1)
5
>>> m2.end(2)
11
>>> m2.end(3)
12
# Substitutes the matched groups into the template and returns them in the rearranged order
>>> m2.expand(r'\3\2\1')
'!worldhello'
>>> m2.expand(r'\3 \2 \1')
'! world hello'
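
In the session above every group captures something, so the default argument of groups() has no visible effect. The sketch below (Python 3) uses a slightly different, made-up pattern with an optional group to show where default matters, and also exercises span() and the \g<1>0 form of expand().

```python
import re

# The third group is optional and does not take part in this match.
m = re.match(r'(\w+) (\w+)(?P<sign>[!?])?', 'hello world')

print(m.groups())              # an unmatched group is reported as None
print(m.groups('n/a'))         # default replaces None for unmatched groups
print(m.groupdict('n/a'))
print(m.span(2))               # same as (m.start(2), m.end(2))
print(m.lastindex)             # number of the last group that matched
# \g<1>0 means "group 1 followed by the character 0"; \10 would mean group 10.
print(m.expand(r'\g<1>0 \2'))
```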

3.3.2.3 Pattern

A Pattern object is a compiled regular expression; the methods it provides are used to match and search text.

>>> help(m2.re)
Help on SRE_Pattern object:

class SRE_Pattern(__builtin__.object)
 |  Compiled regular expression objects
 |  
 |  Methods defined here:
 |  
 |  __copy__(...)
 |  
 |  __deepcopy__(...)
 |  
 |  findall(...)
 |      findall(string[, pos[, endpos]]) --> list.
 |      Return a list of all non-overlapping matches of pattern in string.
 |  
 |  finditer(...)
 |      finditer(string[, pos[, endpos]]) --> iterator.
 |      Return an iterator over all non-overlapping matches for the 
 |      RE pattern in string. For each match, the iterator returns a
 |      match object.
 |  
 |  match(...)
 |      match(string[, pos[, endpos]]) --> match object or None.
 |      Matches zero or more characters at the beginning of the string
 |  
 |  scanner(...)
 |  
 |  search(...)
 |      search(string[, pos[, endpos]]) --> match object or None.
 |      Scan through string looking for a match, and return a corresponding
 |      match object instance. Return None if no position in the string matches.
 |  
 |  split(...)
 |      split(string[, maxsplit = 0])  --> list.
 |      Split string by the occurrences of pattern.
 |  
 |  sub(...)
 |      sub(repl, string[, count = 0]) --> newstring
 |      Return the string obtained by replacing the leftmost non-overlapping
 |      occurrences of pattern in string by the replacement repl.
 |  
 |  subn(...)
 |      subn(repl, string[, count = 0]) --> (newstring, number of subs)
 |      Return the tuple (new_string, number_of_subs_made) found by replacing
 |      the leftmost non-overlapping occurrences of pattern with the
 |      replacement repl.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  flags
 |  
 |  groupindex
 |  
 |  groups
 |  
 |  pattern

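A Pattern cannot be instantiated directly; it must be constructed with re.compile().

3.3.2.3.1 Pattern provides several read-only attributes that describe the expression:

(1) pattern: the expression string used at compile time.

(2) flags: the matching mode used at compile time, as a number.

(3) groups: the number of groups in the expression.

(4) groupindex: a dictionary whose keys are the aliases of the named groups in the expression and whose values are the corresponding group numbers; groups without an alias are not included.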

import re
p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)
 
print "p.pattern:", p.pattern
print "p.flags:", p.flags
print "p.groups:", p.groups
print "p.groupindex:", p.groupindex
 
### output ###
# p.pattern: (\w+) (\w+)(?P<sign>.*)
# p.flags: 16
# p.groups: 3
# p.groupindex: {'sign': 3}
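
The script above is Python 2. A Python 3 sketch of the same inspection follows; one caveat is that in Python 3 the flags of a str pattern also include re.UNICODE, so the number printed for p.flags is not 16, and testing a flag with a bit mask is more robust than comparing the whole value.

```python
import re

p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

print(p.pattern)
print(bool(p.flags & re.DOTALL))   # test one flag with a bit mask
print(p.groups)
print(dict(p.groupindex))          # alias -> group number
```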

3.3.2.3.2 Instance methods [ | re module functions]:

1. match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):

 |  match(...)
 |      match(string[, pos[, endpos]]) --> match object or None.
 |

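This method tries to match pattern starting at index pos of string. If the whole pattern is matched, a Match object is returned; if the pattern fails to match at some point, or endpos is reached before the pattern is finished, None is returned.
pos and endpos default to 0 and len(string) respectively. re.match() cannot take these two parameters; its flags parameter sets the matching mode used when compiling pattern.
Note: this is not a full match. If string still has characters left when the pattern is finished, the match still counts as a success. For a full match, add the boundary anchor '$' to the end of the expression.

See section 3.3.2.1 for an example.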
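
The effect of pos, endpos, and a trailing '$' can be seen in a few lines. A minimal sketch for Python 3; the sample strings are made up.

```python
import re

p = re.compile(r'hello')

print(p.match('hello world!') is not None)   # a prefix match is enough
print(p.match('say hello'))                  # None: the match must start at index 0
print(p.match('say hello', 4).group())       # pos moves the starting point
print(p.match('hello world!', 0, 3))         # None: endpos cuts the text to 'hel'

anchored = re.compile(r'hello$')             # '$' requires the text to end here
print(anchored.match('hello world!'))        # None
print(anchored.match('hello').group())
```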

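2. search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):

This method looks for a substring of string that matches. It tries to match pattern starting at index pos; if the whole pattern is matched, a Match object is returned; otherwise pos is advanced by 1 and the match is retried, and None is returned if no match has been found by the time pos reaches endpos.
pos and endpos default to 0 and len(string) respectively. re.search() cannot take these two parameters; its flags parameter sets the matching mode used when compiling pattern.

# Compile the regular expression into a Pattern object
>>> pattern  = re.compile(r'world')
# Use search() to find a matching substring; returns None if no substring matches
# match() would fail in this example; a pattern of 'hello' would succeed with match()
>>> match = pattern.search('hello world!')
# Use the Match object to get group information
>>> match.group()
'world'
>>>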

3. split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):

Splits string at each substring that matches and returns the pieces as a list. maxsplit sets the maximum number of splits; if omitted, every match is split on.

 |  split(...)
 |      split(string[, maxsplit = 0])  --> list.
 |      Split string by the occurrences of pattern.

>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

>>> p = re.compile(r'\d+')
>>> p
<_sre.SRE_Pattern object at 0x02D53F70>
>>> p.split('one1two2three3four4five5six6seven7eight8nine9ten10')
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', '']
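
The maxsplit parameter described above is not exercised in the session, so here is a short Python 3 sketch of it. It also shows one further behaviour of split: when the pattern contains a capturing group, the matched separators are kept in the result.

```python
import re

p = re.compile(r'\d+')
s = 'one1two2three3four'

print(p.split(s))                 # split at every run of digits
print(p.split(s, maxsplit=2))     # stop after two splits; the rest stays whole
# With a capturing group, the matched separators are kept in the result.
print(re.split(r'(\d+)', s))
```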

4. findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):

Searches string and returns all matching substrings as a list.

>>> p = re.compile(r'\d+')
>>> p.findall('one1two2three3four4five5six6seven7eight8nine9ten10')
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

5. finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):

Searches string and returns an iterator that yields each match result (a Match object) in order.

>>> p = re.compile(r'\d+')
>>> piter = p.finditer('one1two2three3four4')
>>> piter
<callable-iterator object at 0x02E153B0>
>>> for x in piter:
    print x

    
<_sre.SRE_Match object at 0x02EAE800>
<_sre.SRE_Match object at 0x02EAE838>
<_sre.SRE_Match object at 0x02EAE800>
<_sre.SRE_Match object at 0x02EAE838>

>>> piter = p.finditer('one1two2three3four4')
>>> for x in piter:
    print x.group(),
1 2 3 4
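
Because finditer yields Match objects rather than strings, position information is available for each hit, which findall does not give. A small Python 3 sketch with a made-up input string:

```python
import re

p = re.compile(r'\d+')
for m in p.finditer('one1two22three333'):
    # Each item is a Match object, so the span is available as well.
    print(m.group(), m.span())

# Collect the matched text together with its start index.
matches = [(m.group(), m.start()) for m in p.finditer('one1two22three333')]
print(matches)
```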

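6. sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):

Replaces every matching substring in string with repl and returns the resulting string.
When repl is a string, it can refer to groups with \id, \g<id>, or \g<name>; \0 cannot be used for the whole match (write \g<0> instead).
When repl is a function, it should accept a single argument (a Match object) and return the replacement string (group references in the returned string are not expanded).
count sets the maximum number of replacements; if omitted, all matches are replaced.

(1) When repl is a string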

>>> p = re.compile(r'(\w+) (\w+)')
>>> s = 'i say, hello world'
>>> p.sub(r'\2 \1',s)
'say i, world hello'

Note: the pattern has only two groups; referring to a group number beyond that raises an exception.

>>> p.sub(r'\3 \1',s)

Traceback (most recent call last):
  File "<pyshell#207>", line 1, in <module>
    p.sub(r'\3 \1',s)
  File "C:\Python27\lib\re.py", line 291, in filter
    return sre_parse.expand_template(template, match)
  File "C:\Python27\lib\sre_parse.py", line 833, in expand_template
    raise error, "invalid group reference"
error: invalid group reference

(2) When repl is a function

>>> def fun(m):
    return m.group(1).title()+ ' ' + m.group(2).title()

>>> p.sub(fun,s)
'I Say, Hello World'

>>> help(str.title)
Help on method_descriptor:

title(...)
    S.title() -> string
    
    Return a titlecased version of S, i.e. words start with uppercase
    characters, all remaining cased characters have lowercase.
That is, it returns the string with the first letter of each word capitalized.
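
The remaining options of sub (named group references, count, and \g<0> for the whole match) can be sketched as follows for Python 3. The group names first and second are made up for this example.

```python
import re

p = re.compile(r'(?P<first>\w+) (?P<second>\w+)')
s = 'i say, hello world'

print(p.sub(r'\g<second> \g<first>', s))        # refer to groups by name
print(p.sub(r'\2 \1', s, count=1))              # replace only the first match
print(p.sub(r'[\g<0>]', s))                     # \g<0> is the whole match
print(p.sub(lambda m: m.group(0).upper(), s))   # a callable returns the replacement
```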

7. subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):

Returns (sub(repl, string[, count]), number of substitutions).

>>> help(p.subn)
Help on built-in function subn:

subn(...)
    subn(repl, string[, count = 0]) --> (newstring, number of subs)
    Return the tuple (new_string, number_of_subs_made) found by replacing
    the leftmost non-overlapping occurrences of pattern with the
    replacement repl.
>>> p = re.compile(r'(\w+) (\w+)')
>>> s = 'i say, hello world!'
>>> p.subn(r'\2 \1', s)
('say i, world hello!', 2)

>>> p.subn(r'\2',s)
('say, world!', 2)
>>> p.subn(r'\1',s)
('i, hello!', 2)
>>> p.subn(r'\1 \2',s)
('i say, hello world!', 2)

>>> def funn(m):
    print(m.group(1)+' '+ m.group(2))

    
>>> p.subn(funn,s)
i say
hello world
(', !', 2)

>>> def funn(m):
    return(m.group(1)+' '+ m.group(2))

>>> p.subn(funn,s)
('i say, hello world!', 2)
