Documentation for pregexp.scm, a portable regular expression library for Scheme (translation)

Date: 2019-12-20


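pregexp.scm is used as the built-in regular expression engine by quite a few Scheme implementations. For example, the regexp engine in Racket was developed from it, and even the documentation is largely the same, so most of this article applies to Racket as well. Commendably, pregexp uses no implementation-specific syntax or features, so it is highly portable and needs only minor changes to run on almost any implementation. That said, pregexp was written a long time ago, so the Racket implementation may include performance improvements or bug fixes.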

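1. Introduction

A regular expression (regexp) is a string that describes a pattern. A regexp matcher tries to match this pattern against (a portion of) another string, which is treated as raw text rather than as a pattern.

Most characters in a regexp pattern match occurrences of themselves in the text string. Thus, the pattern "abc" matches a string that contains the characters a, b, c in succession.

In a regexp pattern, some characters act as metacharacters, and some character sequences act as metasequences. That is, they specify something other than their literal selves. For example, in the pattern "a.c", the characters a and c stand for themselves, but the metacharacter . can match any character (other than newline). The pattern "a.c" therefore matches any three-character sequence that begins with a and ends with c, such as "abc", "aac", "afc", "a*c", and so on.

To match the . character itself, escape it by preceding it with a backslash (\). The backslash is itself a metacharacter: it matches nothing on its own, but turns the metacharacter that follows it into an ordinary character. For example, "a\\.c" matches "a.c". The backslash is doubled because, within a Scheme string, the backslash is already the escape character, so a literal backslash must be written as two backslashes, just as in C. Another example of a metasequence is \t, which is a readable way to represent the tab character.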
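As a small illustration of the escaping rule, the sketch below contrasts the unescaped and the escaped period. It assumes that pregexp.scm sits in the current directory so that it can be loaded by that name; adjust the path for your setup. The procedure pregexp-match is described in section 2.3.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

; Unescaped: . matches any character, so the first hit is "abc".
(pregexp-match "a.c" "abc a.c")
; => ("abc")

; Escaped: \. matches only a literal period.
(pregexp-match "a\\.c" "abc a.c")
; => ("a.c")
```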

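We will call the string representation of a regexp the U-regexp, where U can be taken to mean Unix-style or universal, because this notation for regexps is universally familiar. Our implementation uses an intermediate tree-like representation called the S-regexp, where S can stand for Scheme, symbolic, or S-expression. S-regexps are more verbose and less readable than U-regexps, but they are much easier for Scheme's recursive procedures to process.

2. Regexp procedures

pregexp.scm provides the procedures pregexp, pregexp-match-positions, pregexp-match, pregexp-split, pregexp-replace, pregexp-replace*, and pregexp-quote. All the identifiers introduced by pregexp.scm have the prefix pregexp, so they are unlikely to clash with other names in Scheme, including the names of any regexp procedures that the implementation itself provides.

2.1 pregexp

The procedure pregexp takes a U-regexp, that is, a regexp pattern written as a string, and returns an S-regexp.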

(pregexp "c.r")
=> (:sub (:or (:seq #\c :any #\r)))

2.2 pregexp-match-positions

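The procedure pregexp-match-positions takes a regexp pattern and a text string. It returns a match if the regexp matches (some part of) the text string, and #f otherwise.

The regexp may be either a U-regexp string or an S-regexp tree. Internally, pregexp-match-positions first compiles a U-regexp to an S-regexp and then performs the match. If a regexp is likely to be used several times, it is wise to convert it explicitly to an S-regexp with pregexp and keep the result in a variable, which saves recompiling it on every call.

pregexp-match-positions returns #f if the match fails, or a list of index pairs if it succeeds.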

(pregexp-match-positions "brain" "bird")
=> #f

(pregexp-match-positions "needle" "hay needle stack")
=> ((4 . 10))

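In the second example, the integers 4 and 10 identify the matched substring: 4 is the inclusive starting index and 10 is the exclusive ending index, which is the usual convention for substring indices.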

(substring "hay needle stack" 4 10)
=> "needle"

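Here, pregexp-match-positions returns a list with a single index pair, which gives the position of the matched substring within the whole string. When we discuss subpatterns later, we will see how a single match operation can yield a list of submatches.

pregexp-match-positions takes optional third and fourth arguments that specify the start and end indices of the portion of the text string to be matched.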

(pregexp-match-positions "needle"
  "his hay needle stack -- my hay needle stack -- her hay needle stack"
  24 43)
=> ((31 . 37))

Note that the returned indices are still reckoned relative to the full text string.
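The advice about compiling a regexp once and reusing it can be sketched as follows. The names needle-re and haystacks are made up for this illustration, and the load path is an assumption about where pregexp.scm lives.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

; Compile the U-regexp once and keep the resulting S-regexp.
(define needle-re (pregexp "needle"))

; Hypothetical inputs, used only for this illustration.
(define haystacks
  '("hay needle stack" "no match here" "needle first"))

; Each call reuses the compiled S-regexp instead of recompiling the string.
(map (lambda (text) (pregexp-match-positions needle-re text))
     haystacks)
; => (((4 . 10)) #f ((0 . 6)))
```

Passing the string "needle" directly gives the same results; the only difference is that the string would be compiled again on every call.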

2.3 pregexp-match

pregexp-match is called like pregexp-match-positions, but it returns the matching substrings instead of index pairs.

(pregexp-match "brain" "bird")
=> #f

(pregexp-match "needle" "hay needle stack")
=> ("needle")

pregexp-match also takes the optional third and fourth arguments.
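A minimal sketch of the optional arguments with pregexp-match, reusing the text string from section 2.2. The variable name text is introduced here only for illustration, and the load path is an assumption.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

(define text
  "his hay needle stack -- my hay needle stack -- her hay needle stack")

; Only the portion from index 24 up to (not including) 43 is searched.
(pregexp-match "needle" text 24 43)
; => ("needle")

; "his" occurs only at the very start, outside that portion.
(pregexp-match "his" text 24 43)
; => #f
```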

2.4 pregexp-split

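The procedure pregexp-split takes two arguments, a regexp pattern and a text string, and returns a list of substrings of the text string, where the pattern identifies the delimiter separating the substrings.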

(pregexp-split ":" "/bin:/usr/bin:/usr/bin/X11:/usr/local/bin")
=> ("/bin" "/usr/bin" "/usr/bin/X11" "/usr/local/bin")

(pregexp-split " " "pea soup")
=> ("pea" "soup")

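If the first argument can match the empty string, as the empty pattern does, the result is the list of all single-character substrings: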

(pregexp-split "" "smithereens")
=> ("s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s")

To treat a run of more than one space as a single delimiter, use the regexp " +", not " *":

(pregexp-split " +" "split pea     soup")
=> ("split" "pea" "soup")

(pregexp-split " *" "split pea     soup")
=> ("s" "p" "l" "i" "t" "p" "e" "a" "s" "o" "u" "p")

2.5 pregexp-replace

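The procedure pregexp-replace replaces the matched portion of the text string with another string. The first argument is the pattern, the second the text string, and the third the insert string.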

(pregexp-replace "te" "liberte" "ty")
=> "liberty"

If the pattern finds no match in the text string, the text string is returned unchanged (the very same object, in the sense of eq?).
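The two behaviours described above can be checked with a short sketch. The variable names are hypothetical, the load path is an assumption, and the eq? result relies on the statement above that the original object is returned.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

(define text "liberte egalite fraternite")

; Only the first match is replaced.
(pregexp-replace "te" text "ty")
; => "liberty egalite fraternite"

; No match: the text string itself comes back.
(define unchanged (pregexp-replace "xyz" text "ty"))
(eq? unchanged text)
; => #t
```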

2.6 pregexp-replace*

pregexp-replace* replaces all the matches in the text string:

(pregexp-replace* "te" "liberte egalite fraternite" "ty")
=> "liberty egality fratyrnity"

As with pregexp-replace, if the pattern finds no match, the text string is returned unchanged.

2.7 pregexp-quote

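The procedure pregexp-quote takes an arbitrary string and returns a U-regexp (a string) that matches it exactly. In particular, characters in the input string that could serve as regexp metacharacters are escaped with a backslash, so that they safely match only themselves.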

(pregexp-quote "cons")
=> "cons"

(pregexp-quote "list?")
=> "list\\?"

pregexp-quote is useful when building a composite regexp from a mix of regexp strings and verbatim strings.
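As a sketch of such a composite regexp, the example below anchors a verbatim identifier between ^ and $. The names name, quoted-re, and unquoted-re are hypothetical, and the load path is an assumption.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

; A verbatim string that happens to contain the metacharacter ?.
(define name "list?")

; Regexp pieces ("^" and "$") combined with the quoted verbatim string.
(define quoted-re (string-append "^" (pregexp-quote name) "$"))
(pregexp-match quoted-re "list?")
; => ("list?")

; Without quoting, ? makes the t optional and no longer matches itself.
(define unquoted-re (string-append "^" name "$"))
(pregexp-match unquoted-re "list?")
; => #f
(pregexp-match unquoted-re "lis")
; => ("lis")
```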

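3. The regexp pattern language

This section gives a complete description of the regexp pattern language that pregexp recognizes.

3.1 Basic assertions

The assertions ^ and $ identify the beginning and the end of the text string, respectively. They ensure that their adjoining regexps match at the beginning or the end of the text string. For example: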

(pregexp-match-positions "^contact" "first contact")
=> #f

The match fails because 'contact' does not occur at the beginning of the text string.

(pregexp-match-positions "laugh$" "laugh laugh laugh laugh")
=> ((18 . 23))

The regexp matches the last 'laugh'.

The metasequence \b asserts that a word boundary exists.

(pregexp-match-positions "yack\\b" "yackety yack")
=> ((8 . 12))

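The 'yack' in 'yackety' is not followed by a word boundary, so it is not matched. The second 'yack' is, so it is matched.

The metasequence \B has the opposite effect: it asserts that a word boundary does not exist.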

(pregexp-match-positions "an\\B" "an analysis")
=> ((3 . 5))

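The first 'an' is followed by a space, which is a word boundary, so it is not matched. The 'an' at the start of 'analysis' is immediately followed by 'alysis' with no boundary in between, so it is matched.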

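3.2 Characters and character classes

Typically, a character in the regexp matches the same character in the text string. Sometimes it is necessary or convenient to use a regexp metasequence to refer to a single character. Thus, the metasequences \n, \r, \t, and \. match the newline, return, tab, and period characters, respectively.

The metacharacter . matches any character other than newline.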

(pregexp-match "p.t" "pet")
=> ("pet")

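It also matches 'pat', 'pit', 'pot', 'put', and 'p8t', but not 'pfffft'.

A character class matches any one character from a set of characters. A typical format for this is the bracketed character class [...], which matches any one character from the non-empty sequence of characters enclosed within the brackets. Thus, "p[aeiou]t" matches 'pat', 'pet', 'pit', 'pot', and 'put'.

Inside the brackets, a hyphen (-) between two characters specifies the range of characters between them in ASCII order. For example, "ta[b-dgn-p]" matches 'tab', 'tac', 'tad', 'tag', 'tan', 'tao', and 'tap'.

An initial caret (^) after the left bracket inverts the set specified by the rest of the contents, that is, it specifies the set of characters other than those identified in the brackets. For example, "do[^g]" matches every three-character sequence starting with 'do' except 'dog'.

Note that the metacharacter ^ inside brackets means something quite different from what it means outside. Most other metacharacters (., *, +, ?, and so on) cease to be metacharacters inside brackets, although you may still escape them for peace of mind. The hyphen (-) is a metacharacter only inside brackets, and only when it is neither the first nor the last character there.

Bracketed character classes cannot contain other bracketed character classes (although they can contain certain other types of character classes, described below). Thus, a left bracket ([) inside a bracketed character class need not be a metacharacter; it can stand for itself. For example, "[a[b]" matches 'a', '[', and 'b'.

Furthermore, since empty bracketed character classes are disallowed, a right bracket (]) immediately after the opening left bracket is not a metacharacter either. For example, "[]ab]" matches ']', 'a', and 'b'.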
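The bracket rules above can be tried out with a few calls. This is only a sketch: the text strings are made up, the load path is an assumption, and the expected results are the ones the rules above predict.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

; Ranges: b-d and n-p, plus the single character g.
(pregexp-match "ta[b-dgn-p]" "tag")
; => ("tag")
(pregexp-match "ta[b-dgn-p]" "tax")
; => #f

; Negation: any character after "do" except g.
(pregexp-match "do[^g]" "dog")
; => #f
(pregexp-match "do[^g]" "dot")
; => ("dot")

; A right bracket directly after the opening bracket is a literal.
(pregexp-match "[]ab]" "]")
; => ("]")

; A left bracket inside a class is a literal too.
(pregexp-match "[a[b]" "[")
; => ("[")
```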

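3.2.1 Some frequently used character classes

Some standard character classes can be conveniently represented as metasequences instead of as explicit bracketed expressions. \d matches a digit ([0-9]); \s matches a whitespace character; and \w matches a character that could be part of a "word". (Following regexp custom, we take "word" characters to be [A-Za-z0-9_], that is, the letters, digits, and underscore that can appear in a C identifier. This is probably more restrictive than what a Scheme programmer would consider a word, since Lisp and Scheme allow many more characters in identifiers.)

The upper-case versions of these metasequences stand for the inversions of the corresponding character classes: \D matches a non-digit, \S a non-whitespace character, and \W a non-"word" character.

Remember to write a double backslash when putting these metasequences in a Scheme string: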

(pregexp-match "\\d\\d"
  "0 dear, 1 have 2 read catch 22 before 9")
=> ("22")

These character classes can be used inside a bracketed expression. For example, "[a-z\\d]" matches a lower-case letter or a digit.
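A short sketch of the inverted classes and of a metasequence used inside brackets. The text strings are invented for the illustration, and the load path is an assumption.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

; \D+ : the first run of non-digits is the first hyphen.
(pregexp-match "\\D+" "2019-12-20")
; => ("-")

; \S+ : the first run of non-whitespace characters.
(pregexp-match "\\S+" "  hello world")
; => ("hello")

; \W : the underscore is a "word" character, so the space is found first.
(pregexp-match "\\W" "foo_bar baz")
; => (" ")

; \d inside a bracketed expression: lower-case letters or digits.
(pregexp-match "[a-z\\d]+" "ABC abc123 DEF")
; => ("abc123")
```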

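3.2.2 POSIX character classes

A POSIX character class is a special metasequence of the form [:...:] that can be used only inside a bracketed expression. The POSIX classes supported are: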

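[:alnum:]       ;; letters and digits
[:alpha:]       ;; letters
[:algor:]       ;; the letters 'c', 'h', 'a' and 'd'
[:ascii:]       ;; 7-bit ASCII characters
[:blank:]       ;; widthful whitespace, i.e. space and tab
[:cntrl:]       ;; control characters, i.e. those with ASCII code below 32
[:digit:]       ;; digits, same as \d
[:graph:]       ;; visible characters, i.e. those that use ink
[:lower:]       ;; lower-case letters
[:print:]       ;; visible characters plus widthful whitespace
[:space:]       ;; whitespace, same as \s
[:upper:]       ;; upper-case letters
[:word:]        ;; letters, digits, and underscore, same as \w
[:xdigit:]      ;; hexadecimal digits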

For example, the regexp "[[:alpha:]_]" matches a letter or an underscore.

(pregexp-match "[[:alpha:]_]" "--x--")
=> ("x")

(pregexp-match "[[:alpha:]_]" "--_--")
=> ("_")

(pregexp-match "[[:alpha:]_]" "--:--")
=> #f

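The POSIX class notation is valid only inside a bracketed expression. For instance, "[:alpha:]", when not inside a bracketed expression, is not read as the letter class. By the rules given earlier, it is an ordinary bracketed class that matches only the characters ':', 'a', 'l', 'p', and 'h'.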

(pregexp-match "[:alpha:]" "--a--")
=> ("a")

(pregexp-match "[:alpha:]" "--_--")
=> #f

Inserting a ^ immediately after [: gives the inversion of a POSIX character class. Thus, [:^alpha:] is the class containing all characters except the letters.
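The inverted POSIX class can be sketched as below. As with every POSIX class it has to sit inside a bracketed expression. The text string is invented, the load path is an assumption, and the expected results are what the description above implies rather than output copied from the original documentation.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

; [:^alpha:] inside brackets: a run of non-letters.
(pregexp-match "[[:^alpha:]]+" "abc123def")
; => ("123")

; The same set written by negating the whole bracketed expression.
(pregexp-match "[^[:alpha:]]+" "abc123def")
; => ("123")
```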

3.3 Quantifiers

The quantifiers *, +, and ? match, respectively, zero or more, one or more, and zero or one instances of the preceding subpattern.

(pregexp-match-positions "c[ad]*r" "cadaddadddr")
=> ((0 . 11))
(pregexp-match-positions "c[ad]*r" "cr")
=> ((0 . 2))

(pregexp-match-positions "c[ad]+r" "cadaddadddr")
=> ((0 . 11))
(pregexp-match-positions "c[ad]+r" "cr")
=> #f

(pregexp-match-positions "c[ad]?r" "cadaddadddr")
=> #f
(pregexp-match-positions "c[ad]?r" "cr")
=> ((0 . 2))
(pregexp-match-positions "c[ad]?r" "car")
=> ((0 . 3))

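3.3.1 Numeric quantifiers

You can use braces to specify much finer-tuned quantification than is possible with *, +, and ?.

The quantifier {m} matches exactly m instances of the preceding subpattern; m must be a nonnegative integer.

The quantifier {m,n} matches at least m and at most n instances. m and n must be nonnegative integers with m <= n. Either or both may be omitted, in which case m defaults to 0 and n to infinity.

It is evident that + and ? are abbreviations for {1,} and {0,1} respectively. * abbreviates {,}, which is the same as {0,}.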

(pregexp-match "[aeiou]{3}" "vacuous")
=> ("uou")

(pregexp-match "[aeiou]{3}" "evolve")
=> #f

(pregexp-match "[aeiou]{2,3}" "evolve")
=> #f

(pregexp-match "[aeiou]{2,3}" "zeugma")
=> ("eu")

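3.3.2 Non-greedy quantifiers

The quantifiers described above are greedy, that is, they match the maximal number of instances that still leads to an overall match for the full pattern.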

(pregexp-match "<.*>" "<tag1> <tag2> <tag3>")
=> ("<tag1> <tag2> <tag3>")

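To make these quantifiers non-greedy, append a ? to them. Non-greedy quantifiers match the minimal number of instances needed to ensure an overall match.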

(pregexp-match "<.*?>" "<tag1> <tag2> <tag3>")
=> ("<tag1>")

The non-greedy quantifiers are, respectively: *?, +?, ??, {m}?, and {m,n}?. Note the two uses of the metacharacter ?.
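To complement the *? example, here is a sketch that contrasts the greedy and non-greedy forms of a numeric quantifier and of +. The text strings are made up and the load path is an assumption.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

; Greedy {2,4} takes as many a's as allowed.
(pregexp-match "a{2,4}" "aaaaa")
; => ("aaaa")

; Non-greedy {2,4}? stops at the minimum.
(pregexp-match "a{2,4}?" "aaaaa")
; => ("aa")

; Greedy + runs to the last > in the string.
(pregexp-match "<.+>" "<b>bold</b>")
; => ("<b>bold</b>")

; Non-greedy +? stops at the first > that allows a match.
(pregexp-match "<.+?>" "<b>bold</b>")
; => ("<b>")
```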

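3.4 Clusters

Clustering, that is, enclosure within parentheses (...), identifies the enclosed subpattern as a single entity. It causes the matcher to capture the submatch, that is, the portion of the text string matching the subpattern, in addition to the overall match. The result therefore lists the overall match first, followed by one submatch for each parenthesized subpattern, in the order in which the subpatterns appear.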

(pregexp-match "([a-z]+) ([0-9]+), ([0-9]+)" "jan 1, 1970")
=> ("jan 1, 1970" "jan" "1" "1970")

Clustering also causes a following quantifier to treat the entire enclosed subpattern as a single entity.

(pregexp-match "(poo )*" "poo poo platter")
=> ("poo poo " "poo ")

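The number of submatches returned is always equal to the number of subpatterns specified in the regexp, even if a particular subpattern happens to match more than one substring or no substring at all.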

(pregexp-match "([a-z ]+;)*" "lather; rinse; repeat;")
=> ("lather; rinse; repeat;" " repeat;")

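Here the *-quantified subpattern matches three times, but only the last submatch is returned.

It is also possible for a quantified subpattern to fail to match, even if the overall pattern matches. In such cases, the failing submatch is represented by #f.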

(define date-re
  ;match `month year' or `month day, year'.
  ;subpattern matches day, if present
  (pregexp "([a-z]+) +([0-9]+,)? *([0-9]+)"))

(pregexp-match date-re "jan 1, 1970")
=> ("jan 1, 1970" "jan" "1," "1970")

(pregexp-match date-re "jan 1970")
=> ("jan 1970" "jan" #f "1970")

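3.4.1 Backreferences

Submatches can be used in the insert string argument of the procedures pregexp-replace and pregexp-replace*. The insert string can use \n as a backreference to refer back to the nth submatch, that is, the substring that matched the nth subpattern. \0 refers to the entire match, and it can also be written as \&.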

(pregexp-replace "_(.+?)_"
  "the _nina_, the _pinta_, and the _santa maria_"
  "*\\1*")
=> "the *nina*, the _pinta_, and the _santa maria_"

(pregexp-replace* "_(.+?)_"
  "the _nina_, the _pinta_, and the _santa maria_"
  "*\\1*")
=> "the *nina*, the *pinta*, and the *santa maria*"

;recall: \S stands for non-whitespace character

(pregexp-replace "(\\S+) (\\S+) (\\S+)"
  "eat to live"
  "\\3 \\2 \\1")
=> "live to eat"

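Use \\ in the insert string to specify a literal backslash. Also, \$ stands for an empty string, and is useful for separating a backreference \n from an immediately following digit.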
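The insert-string escapes \&, \0, and \$ can be sketched as follows. The text strings are invented for the illustration and the load path is an assumption. Remember that each backslash is doubled inside a Scheme string.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

; \& and \0 both stand for the entire match.
(pregexp-replace "needle" "hay needle stack" "[\\&]")
; => "hay [needle] stack"
(pregexp-replace "needle" "hay needle stack" "[\\0]")
; => "hay [needle] stack"

; \$ keeps the literal 0 from being read as part of the backreference,
; so this appends a 0 to every digit instead of referring to submatch 10.
(pregexp-replace* "(\\d)" "a1b2" "\\1\\$0")
; => "a10b20"
```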

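Backreferences can also be used within the regexp pattern to refer back to an already matched subpattern in the pattern. \n stands for an exact repeat of the nth submatch.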

(pregexp-match "([a-z]+) and \\1"
  "billions and billions")
=> ("billions and billions" "billions")

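Note that the backreference is not simply a repeat of the previous subpattern. Rather, it is a repeat of the particular substring already matched by the subpattern.

In the above example, the backreference can only match 'billions'. It will not match 'millions', even though the subpattern it harks back to, ([a-z]+), would have had no problem doing so: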

(pregexp-match "([a-z]+) and \\1"
  "billions and millions")
=> #f

The following corrects doubled words:

(pregexp-replace* "(\\S+) \\1"
  "now is the the time for all good men to to come to the aid of of the party"
  "\\1")
=> "now is the time for all good men to come to the aid of the party"

The following marks all immediately repeating patterns in a number string:

(pregexp-replace* "(\\d+)\\1"
  "123340983242432420980980234"
  "{\\1,\\1}")
=> "12{3,3}40983{24,24}3242{098,098}0234"

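3.4.2 Non-capturing clusters

It is often required to specify a cluster (typically for quantification) without triggering the capture of submatch information. Such clusters are called non-capturing. In such cases, use (?: instead of ( as the cluster opener. In the following example, the non-capturing cluster eliminates the "directory" portion of a given pathname, and the capturing cluster identifies the basename.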

(pregexp-match "^(?:[a-z]*/)*([a-z]+)$"
  "/usr/local/bin/mzscheme")
=> ("/usr/local/bin/mzscheme" "mzscheme")

3.4.3 Cloisters

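The location between the ? and the : of a non-capturing cluster is called a cloister. You can put modifiers there that will cause the enclosed subpattern to be treated specially. The modifier i causes the subpattern to match case-insensitively: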

(pregexp-match "(?i:hearth)" "HeartH")
=> ("HeartH")

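The modifier x causes the subpattern to match space-insensitively, that is, spaces and comments within the subpattern are ignored. Comments are introduced as usual with a semicolon (;) and extend to the end of the line. If you need to include a literal space or semicolon in a space-insensitized subpattern, escape it with a backslash.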

(pregexp-match "(?x: a   lot)" "alot")
=> ("alot")

(pregexp-match "(?x: a  \\  lot)" "a lot")
=> ("a lot")

(pregexp-match "(?x:
   a \\ man  \\; \\   ; ignore
   a \\ plan \\; \\   ; me
   a \\ canal         ; completely
   )"
 "a man; a plan; a canal")
=> ("a man; a plan; a canal")

The global variable *pregexp-comment-char* contains the comment character (#\;). For Perl-like comments, set it as follows:

(set! *pregexp-comment-char* #\#)

You can put more than one modifier in the cloister:

(pregexp-match "(?ix:
   a \\ man  \\; \\   ; ignore
   a \\ plan \\; \\   ; me
   a \\ canal         ; completely
   )"
 "A Man; a Plan; a Canal")
=> ("A Man; a Plan; a Canal")

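A minus sign before a modifier inverts its meaning. Thus, you can use -i and -x in a subcluster to overturn the insensitivities caused by an enclosing cluster.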

(pregexp-match "(?i:the (?-i:TeX)book)"
  "The TeXbook")
=> ("The TeXbook")

This regexp allows any casing for 'the' and 'book' but insists that 'TeX' not be differently cased.

3.5 Alternation

You can specify a list of alternate subpatterns by separating them by |. The | separates subpatterns in the nearest enclosing cluster (or in the entire pattern string if there are no enclosing parens).

(pregexp-match "f(ee|i|o|um)" "a small, final fee")
=> ("fi" "i")

(pregexp-replace* "([yi])s(e[sdr]?|ing|ation)"
   "it is energising to analyse an organisation
   pulsing with noisy organisms"
   "\\1z\\2")
=> "it is energizing to analyze an organization
   pulsing with noisy organisms"

Note again that if you wish to use clustering merely to specify a list of alternate subpatterns but do not want the submatch, use (?: instead of (.

(pregexp-match "f(?:ee|i|o|um)" "fun for all")
=> ("fo")

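An important thing to note about alternation is that the leftmost matching alternate is picked regardless of its length. Thus, if one of the alternates is a prefix of a later alternate, the latter may not have a chance to match.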

(pregexp-match "call|call-with-current-continuation"
  "call-with-current-continuation")
=> ("call")

To allow the longer alternate to have a shot at matching, place it before the shorter one:

(pregexp-match "call-with-current-continuation|call"
  "call-with-current-continuation")
=> ("call-with-current-continuation")

In any case, an overall match for the entire regexp is always preferred to an overall nonmatch. In the following, the longer alternate still wins, because its preferred shorter prefix fails to yield an overall match.

(pregexp-match "(?:call|call-with-current-continuation) constrained"
  "call-with-current-continuation constrained")
=> ("call-with-current-continuation constrained")

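3.6 Backtracking

We have already seen that greedy quantifiers match the maximal number of times, but the overriding priority is that the overall match succeed. Consider: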

(pregexp-match "a*a" "aaaa")

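The regexp consists of two subregexps, a* followed by a. Even though * is a greedy quantifier, the subregexp a* cannot be allowed to match all four a's in the text string "aaaa". It may match only the first three, leaving the last one for the second subregexp. This ensures that the full regexp matches successfully.

The regexp matcher accomplishes this via a process called backtracking. The matcher tentatively allows the greedy quantifier to match all four a's, but when it becomes clear that the overall match is then in jeopardy, it backtracks to a less greedy match of three a's. If even this fails, as in the call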

(pregexp-match "a*aa" "aaaa")

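the matcher backtracks even further. Overall failure is conceded only when all possible backtracking has been tried with no success.

Backtracking is not restricted to greedy quantifiers. Non-greedy quantifiers match as few instances as possible, and progressively backtrack to more and more instances in order to attain an overall match. There is backtracking in alternation too, as the more rightward alternates are tried when locally successful leftward ones fail to yield an overall match.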
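The results of the two calls discussed in this section, together with one non-greedy and one alternation case, can be sketched as follows. The last two patterns and text strings are invented for the illustration, and the load path is an assumption.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

; a* gives back one a so that the final a can match.
(pregexp-match "a*a" "aaaa")
; => ("aaaa")

; a* gives back two a's so that the final aa can match.
(pregexp-match "a*aa" "aaaa")
; => ("aaaa")

; Non-greedy a*? starts with zero a's and extends until b can match.
(pregexp-match "a*?b" "aaab")
; => ("aaab")

; The alternate foo matches locally but baz then fails,
; so the matcher falls back to the alternate foobar.
(pregexp-match "(?:foo|foobar)baz" "foobarbaz")
; => ("foobarbaz")
```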

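3.6.1 Disabling backtracking

Sometimes it is efficient to disable backtracking. For example, we may wish to commit to a choice, or we may know that trying alternatives is fruitless. A non-backtracking regexp is enclosed in (?>...).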

(pregexp-match "(?>a+)." "aaaa")
=> #f

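In this call, the subregexp ?>a+ greedily matches all four a's, and is denied the opportunity to backtrack, so the overall match fails. The effect of the regexp is therefore to match one or more a's followed by something that is definitely not a.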
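For comparison, the sketch below puts the backtracking and non-backtracking versions side by side. The text string "aaab" is invented for the illustration and the load path is an assumption.

```scheme
; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

; With backtracking, a+ gives back the last a for . to match.
(pregexp-match "a+." "aaaa")
; => ("aaaa")

; Without backtracking, a+ keeps all four a's and . has nothing left.
(pregexp-match "(?>a+)." "aaaa")
; => #f

; When a non-a follows the run of a's, the match succeeds.
(pregexp-match "(?>a+)." "aaab")
; => ("aaab")
```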

3.7 Looking ahead and behind

You can have assertions in your pattern that look ahead or behind to ensure that a subpattern does or does not occur. These “look around” assertions are specified by putting the subpattern checked for in a cluster whose leading characters are: ?= (for positive lookahead), ?! (negative lookahead), ?<= (positive lookbehind), ?<! (negative lookbehind). Note that the subpattern in the assertion does not generate a match in the final result. It merely allows or disallows the rest of the match.

3.7.1 Lookahead

Positive lookahead (?=) peeks ahead to ensure that its subpattern could match.

(pregexp-match-positions "grey(?=hound)"
  "i left my grey socks at the greyhound")
=> ((28 . 32))

The regexp "grey(?=hound)" matches grey, but only if it is followed by hound. Thus, the first grey in the text string is not matched.

Negative lookahead (?!) peeks ahead to ensure that its subpattern could not possibly match.

(pregexp-match-positions "grey(?!hound)"
  "the gray greyhound ate the grey socks")
=> ((27 . 31))

The regexp "grey(?!hound)" matches grey, but only if it is not followed by hound. Thus the grey just before socks is matched.

3.7.2 Lookbehind

Positive lookbehind (?<=) checks that its subpattern could match immediately to the left of the current position in the text string.

(pregexp-match-positions "(?<=grey)hound"
  "the hound in the picture is not a greyhound")
=> ((38 . 43))

The regexp (?<=grey)hound matches hound, but only if it is preceded by grey.

Negative lookbehind (?<!) checks that its subpattern could not possibly match immediately to the left.

(pregexp-match-positions "(?<!grey)hound"
  "the greyhound in the picture is not a hound")
=> ((38 . 43))

The regexp (?<!grey)hound matches hound, but only if it is not preceded by grey.

Lookaheads and lookbehinds can be convenient when they are not confusing.