Nginx Regular Expressions (Part 1)

Date: 2019-11-17


WeChat official account: 郑尔多斯


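Preface

Nginx features such as location, server_name, and rewrite make heavy use of regular expressions. Regular expressions enable some very powerful functionality, but they also make the source code considerably harder to read. This article concentrates on how Nginx handles regular expressions, so that we can understand Nginx's behavior more thoroughly.

Background

Nginx uses PCRE-style regular expressions and wraps a few commonly used functions from the PCRE library. Once we understand these functions, the regular expression handling in Nginx becomes easy to follow.

Compiling a regular expression

Before a regular expression can be used, it must be compiled. Compilation produces a data structure that is then used for matching and for querying information about the pattern.
PCRE provides two compile functions, pcre_compile() and pcre_compile2(), which behave similarly. Nginx uses the former, so we analyze pcre_compile().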

pcre *pcre_compile(
     const char *pattern, 
     int options, 
     const char **errptr, 
     int *erroffset, 
     const unsigned char *tableptr
);

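Parameters:
pattern: the regular expression to compile.
options: options used during compilation. Nginx only uses PCRE_CASELESS, which makes matching case-insensitive.
errptr: receives a pointer to an error message if compilation fails. If errptr itself is NULL, pcre_compile() gives up immediately and returns NULL.
erroffset: receives the offset in pattern of the character where the error was detected.
tableptr: a pointer to character tables built by pcre_maketables(), used for locale-specific character handling. The documentation says it may be NULL, in which case PCRE's built-in default tables are used. Nginx passes NULL, so this parameter can be ignored here.

Return value:
On success the function returns a pcre * pointer to the compiled pattern. This pointer can be used to query information about the compiled pattern, and it is also passed to pcre_exec() to perform the actual matching.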

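Getting information about a compiled pattern

The structure returned by compilation gives access to a lot of information about the pattern, such as its capture groups. The following function retrieves that information.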

int pcre_fullinfo(
      const pcre *code, 
      const pcre_extra *extra, 
      int what, 
      void *where
);

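Parameters:
code: the structure returned by pcre_compile() above.
extra: the structure returned by pcre_study(), or NULL if the pattern was not studied.
what: which piece of information to retrieve.
where: pointer to the variable that receives the result.

Return value:
The function returns 0 on success.
Nginx uses this function to retrieve the following information:

PCRE_INFO_CAPTURECOUNT: the total number of capturing subpatterns, counting both named and unnamed capture groups.

PCRE_INFO_NAMECOUNT: the number of named subpatterns only; unnamed subpatterns are not counted.

One point needs explaining: PCRE supports both named capture groups and unnamed ones (which are identified by number). A name is just an additional way to identify a group; every named group also gets a number. PCRE provides convenience functions that fetch the content of a capture group directly by name, for example pcre_get_named_substring().
The same result can be obtained in two steps:

Convert the name of the capture group to its number.
Use that number to fetch the group's information.
This involves a name-to-number conversion. PCRE maintains a name-to-number map for this purpose, and the map is described by the following three items: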

PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE

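The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT gives the number of entries (that is, the number of named capture groups), and PCRE_INFO_NAMEENTRYSIZE gives the size of each entry. In both cases the last argument to pcre_fullinfo() is a pointer to an int. The entry size depends on the length of the longest name.

PCRE_INFO_NAMETABLE returns a pointer to the first entry of the map (a pointer to char). The first two bytes of each entry hold the group number, most significant byte first, and the rest of the entry is the group name, terminated by '\0'. The entries are ordered alphabetically by name.

Here is an example from the official PCRE documentation: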

When PCRE_DUPNAMES is set, duplicate names are in order of their parentheses numbers. For example, consider the following pattern (assume PCRE_EXTENDED is set, so white space - including newlines - is ignored):
(?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )
There are four named subpatterns, so the table has four entries, and each entry in the table is eight bytes long. The table is as follows, with non-printing bytes shows in hexadecimal, and undefined bytes shown as ??:
00 01 d a t e 00 ??
00 05 d a y 00 ?? ??
00 04 m o n t h 00
00 02 y e a r 00 ??
When writing code to extract data from named subpatterns using the name-to-number map, remember that the length of the entries is likely to be different for each compiled pattern.
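
To make the table layout concrete, here is a minimal sketch in C of a name-to-number lookup over such a table. It is an illustration only: demo_table is a hand-written copy of the four entries above (with the undefined ?? bytes filled with zeros), and name_to_number() is a hypothetical helper, not a PCRE function. In real code the table pointer, the entry count, and the entry size would come from pcre_fullinfo().

```c
#include <stddef.h>
#include <string.h>

/* Hand-written copy of the documentation example: 4 entries of 8 bytes.
 * The undefined "??" bytes are filled with 0 here (an assumption). */
const unsigned char demo_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00,
};

/* Hypothetical helper: return the group number stored for `name`,
 * or -1 if no entry in the table carries that name. */
int name_to_number(const unsigned char *table, int name_count,
                   int entry_size, const char *name)
{
    for (int i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;

        /* Bytes 0..1: group number, most significant byte first.
         * Bytes 2.. : the name, terminated by '\0'. */
        if (strcmp((const char *)(entry + 2), name) == 0) {
            return (entry[0] << 8) | entry[1];
        }
    }
    return -1;
}
```

With this table, name_to_number(demo_table, 4, 8, "month") yields 4. PCRE itself offers pcre_get_stringnumber() for this lookup; the sketch only shows what the entries contain.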

Example

Here is an example I found online. Unfortunately I can no longer find the link to the original:

// gcc pcre_test.c -o pcre_test -L /usr/lib64/ -lpcre
#include <stdio.h>
#include <pcre.h>

int main(void)
{
    pcre        *re;
    const char  *errstr;
    int          erroff;
    int          captures = 0, named_captures = 0, name_size = 0;
    const char  *name;   /* first entry of the name-to-number table */
    const char  *p;
    int          n, i, j;
    /* PCRE_EXTENDED is not set, so the spaces are literal characters.
       That changes what the pattern matches, but not the group table. */
    const char  *data = "(?<date> (?<year>(\\d\\d)?\\d\\d) - (?<month>\\d\\d) - (?<day>\\d\\d) )";

    printf("%s \n", data);

    re = pcre_compile(data, PCRE_CASELESS, &errstr, &erroff, NULL);
    if (re == NULL) {
        printf("compile pcre failed at offset %d: %s\n", erroff, errstr);
        return 1;
    }

    /* Total number of capture groups, named and unnamed. */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_CAPTURECOUNT, &captures);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_CAPTURECOUNT failed %d \n", n);
        pcre_free(re);
        return 1;
    }
    printf(" captures %d \n", captures);

    /* Number of named capture groups (entries in the name table). */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMECOUNT, &named_captures);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_NAMECOUNT failed %d \n", n);
        pcre_free(re);
        return 1;
    }
    printf("named_captures %d \n", named_captures);

    /* Size in bytes of each entry in the name table. */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMEENTRYSIZE, &name_size);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_NAMEENTRYSIZE failed %d \n", n);
        pcre_free(re);
        return 1;
    }
    printf("name_size %d \n", name_size);

    /* Pointer to the first entry of the name table. */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMETABLE, &name);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_NAMETABLE failed %d \n", n);
        pcre_free(re);
        return 1;
    }

    /* Each entry: two bytes of group number, then the NUL-terminated name. */
    p = name;
    for (j = 0; j < named_captures; j++) {
        for (i = 0; i < 2; i++) {
            printf("%x ", (unsigned int) (unsigned char) p[i]);
        }
        printf("%s \n", &p[2]);
        p += name_size;
    }

    pcre_free(re);
    return 0;
}

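The output shows the following:
There are 5 capture groups in total.
4 of them are named capture groups.
Each entry is 8 bytes long. The longest name is month: 2 bytes for the group number, 5 bytes for the name, and 1 byte for the terminating '\0' add up to 8.
Every named capture group is also assigned a number. Named and unnamed subpatterns share one numbering sequence, which follows the order of their opening parentheses.

Matching

With compilation and information retrieval covered, what remains is the most important part: matching.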

int pcre_exec(
    const pcre *code, 
    const pcre_extra *extra,
    const char *subject, 
    int length, 
    int startoffset, 
    int options, 
    int *ovector, 
    int ovecsize
);

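Parameters:
code: the return value of the compile function.
extra: the return value of pcre_study(); may be NULL.
subject: the string to match against.
length: the length of subject.
startoffset: the offset in subject at which matching starts.
options: matching options.
ovector: the array that receives the match results.
ovecsize: the number of elements in ovector; it should be a multiple of 3.
Below are excerpts from the PCRE documentation about this function, most of them followed by my own notes: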

How pcre_exec() returns captured substrings
In general, a pattern matches a certain portion of the subject, and in addition, further substrings from the subject may be picked out by parts of the pattern. Following the usage in Jeffrey Friedl's book, this is called "capturing" in what follows, and the phrase "capturing subpattern" is used for a fragment of a pattern that picks out a substring. PCRE supports several other kinds of parenthesized subpattern that do not cause substrings to be captured.

In other words, a pattern matches one particular portion of the subject. In addition, if the pattern contains capture groups, parts of the subject are picked out by those groups.

Captured substrings are returned to the caller via a vector of integer offsets whose address is passed in ovector. The number of elements in the vector is passed in ovecsize, which must be a non-negative number. Note: this argument is NOT the size of ovector in bytes.

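The ovector argument of pcre_exec() receives a series of integer offsets, from which the contents of the capture groups can be recovered. ovecsize gives the number of elements in ovector (not its size in bytes) and should be a multiple of three.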

The first two-thirds of the vector is used to pass back captured substrings, each substring using a pair of integers. The remaining third of the vector is used as workspace by pcre_exec()while matching capturing subpatterns, and is not available for passing back information. The length passed in ovecsize should always be a multiple of three. If it is not, it is rounded down.

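The first two-thirds of ovector hold the captured substrings (for example $1, $2, and so on), with each substring taking a pair of integers. The remaining third is workspace used by pcre_exec() while matching capture groups and does not carry any results back to the caller.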

When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of a pair is set to the offset of the first character in a substring, and the second is set to the offset of the first character after the end of a substring. The first pair, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on. The value returned by pcre_exec() is one more than the highest numbered pair that has been set. For example, if two substrings have been captured, the returned value is 3. If there are no capturing subpatterns, the return value from a successful match is 1, indicating that just the first pair of offsets has been set.

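After a successful match, the pairs are filled in from the start of ovector, using at most the first two-thirds of it. In each pair, the first element is the offset in subject of the first character of the captured substring, and the second is the offset of the first character after its end. The first pair, ovector[0] and ovector[1], identifies the part of subject matched by the whole pattern. The next pair describes the first capture group, and so on. The return value of pcre_exec() is one more than the highest-numbered pair that has been set. For example, if two substrings have been captured, the return value is 3. If the pattern has no capture groups at all, a successful match returns 1.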

If a capturing subpattern is matched repeatedly, it is the last portion of the string that it matched that is returned.

If a capture group matches more than once, the offsets of its last match are the ones returned.

If the vector is too small to hold all the captured substring offsets, it is used as far as possible (up to two-thirds of its length), and the function returns a value of zero. In particular, if the substring offsets are not of interest, pcre_exec() may be called with ovector passed as NULL and ovecsize as zero. However, if the pattern contains back references and the ovector is not big enough to remember the related substrings, PCRE has to get additional memory for use during matching. Thus it is usually advisable to supply an ovector.

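If ovector is too small to hold all captured substrings, PCRE fills in as much as it can (at most two-thirds of the array) and pcre_exec() returns 0. In particular, if the capture groups are of no interest, ovector can be passed as NULL with ovecsize set to 0. If the pattern contains back references, however, PCRE then has to obtain extra memory during matching, so supplying an ovector is usually advisable.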
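
The return-value rules quoted so far can be summarized in a small helper. This is only a sketch based on the excerpts above; usable_pairs() is a hypothetical name, not part of PCRE, and it treats every negative return value (no match or an error) the same way.

```c
/* Hypothetical helper: given the return value of pcre_exec() and the
 * ovecsize that was passed in, return how many (start, end) pairs at the
 * front of ovector were filled in. */
int usable_pairs(int rc, int ovecsize)
{
    if (rc > 0) {
        /* rc is one more than the highest-numbered pair that was set,
         * so pairs 0 .. rc-1 are valid (unset groups hold -1, -1). */
        return rc;
    }
    if (rc == 0) {
        /* Match succeeded, but ovector was too small: only the pairs
         * that fit into the first two-thirds were stored. */
        return ovecsize / 3;
    }
    /* Negative: no match or an error, nothing to read. */
    return 0;
}
```

For instance, with ovecsize set to 6 and a return value of 0, two pairs (the whole match and group 1) can be read.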

The pcre_info() function can be used to find out how many capturing subpatterns there are in a compiled pattern. The smallest size for ovector that will allow for n captured substrings, in addition to the offsets of the substring matched by the whole pattern, is (n+1)*3.

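pcre_info() reports how many capture groups a compiled pattern has (in practice pcre_fullinfo() is what gets used nowadays). If the pattern has n capture groups, ovecsize should be at least (n + 1) * 3, so that the offsets of all n groups plus those of the whole match fit.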
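
As a quick worked example of the (n+1)*3 rule, the following sketch computes the minimum ovecsize from a capture count. ovector_size_for() is a hypothetical helper name; in real code the count would come from pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT.

```c
/* Hypothetical helper: smallest ovecsize that can hold the offsets of the
 * whole match plus `capture_count` capture groups.
 * Each of the (capture_count + 1) pairs needs 2 ints, and PCRE reserves
 * one more int per pair as workspace, hence the factor of 3. */
int ovector_size_for(int capture_count)
{
    return (capture_count + 1) * 3;
}
```

For the date pattern in the earlier example, which has 5 capture groups, this gives 18.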

It is possible for capturing subpattern number n+1 to match some part of the subject when subpattern n has not been used at all. For example, if the string "abc" is matched against the pattern (a|(z))(bc) the return from the function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this happens, both values in the offset pairs corresponding to unused subpatterns are set to -1.

For example, matching "abc" against "(a|(z))(bc)" makes pcre_exec() return 4. Groups 1 and 3 capture something, but group 2 does not, so both values of the offset pair for group 2 are set to -1.

Offset values that correspond to unused subpatterns at the end of the expression are also set to -1. For example, if the string "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The return from the function is 2, because the highest used capturing subpattern number is 1. However, you can refer to the offsets for the second and third capturing subpatterns if you wish (assuming the vector is large enough, of course).
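
To illustrate how the offset pairs are read back, here is a minimal sketch in C that copies one capture group out of the subject and treats a (-1, -1) pair as "unset". copy_group() is a hypothetical helper written for this article; PCRE ships its own pcre_copy_substring() for the same job.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy capture group `group` of `subject` into `buf`.
 * `rc` is the positive value returned by pcre_exec().
 * Returns the substring length, or -1 if the group is out of range,
 * unset (its offsets are -1), or does not fit into `buf`. */
int copy_group(const char *subject, const int *ovector, int rc,
               int group, char *buf, size_t bufsize)
{
    if (group < 0 || group >= rc) {
        return -1;                      /* pair was not reported as set */
    }

    int start = ovector[2 * group];     /* offset of first character */
    int end = ovector[2 * group + 1];   /* offset just past the last one */

    if (start < 0 || end < start) {
        return -1;                      /* group did not take part in the match */
    }

    size_t len = (size_t)(end - start);
    if (len + 1 > bufsize) {
        return -1;                      /* no room for the string plus '\0' */
    }

    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

For the "(a|(z))(bc)" example, the documented result corresponds to rc = 4 and the pairs (0,3), (0,1), (-1,-1), (1,3). With those values, group 0 yields "abc", group 1 yields "a", group 2 is reported as unset, and group 3 yields "bc".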

References

PCRE library documentation: http://regexkit.sourceforge.net/Documentation/pcre/pcreapi.html#SEC1
Microsoft's documentation on regular expressions: https://docs.microsoft.com/zh-cn/dotnet/standard/base-types/anchors-in-regular-expressions
