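lilac-parser is a library I wrote in ClojureScript that can do some of the things regular expressions do.
As the name suggests, it was designed more as a parser,
but in practice it also works fairly naturally as a regex replacement, although it is not as short and concise as a regex.
The main drawback of regexes is that they are written as strings: they need escaping, and long rules become hard to maintain.
With lilac-parser, rules are easy to compose. Here are some examples.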
First is is+, a rule that performs exact matching:

(parse-lilac "x" (is+ "x")) ; {:ok? true, :rest nil}
(parse-lilac "xyz" (is+ "xyz")) ; {:ok? true, :rest nil}
(parse-lilac "xy" (is+ "x")) ; {:ok? true, :rest ("y")}
(parse-lilac "y" (is+ "x")) ; {:ok? false}
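As you can see, any expression whose head matches returns true.
Whether there is more content after the match has to be checked separately through the :rest field.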
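Since a successful result only says that the head of the input matched, a small helper can turn it into a whole-input check. The following is a minimal sketch in plain Clojure that assumes the result shape shown in the examples above (a map with :ok? and :rest). full-match? is a hypothetical helper name and is not part of lilac-parser.

```clojure
;; Hypothetical helper, not part of lilac-parser.
;; Assumes results look like {:ok? true, :rest nil} as in the examples above.
(defn full-match?
  "True when the rule matched and no input is left over."
  [result]
  (boolean (and (:ok? result)
                (empty? (:rest result)))))

(full-match? {:ok? true, :rest nil})     ; true
(full-match? {:ok? true, :rest '("y")})  ; false
(full-match? {:ok? false})               ; false
```

The helper would be applied to whatever parse-lilac returns, so the head-match behavior of the rules themselves stays unchanged.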
Exact matching is the simple case. Next is choice matching, where any one character from the given set is accepted:

(parse-lilac "x" (one-of+ "xyz")) ; {:ok? true}
(parse-lilac "y" (one-of+ "xyz")) ; {:ok? true}
(parse-lilac "z" (one-of+ "xyz")) ; {:ok? true}
(parse-lilac "w" (one-of+ "xyz")) ; {:ok? false}
(parse-lilac "xy" (one-of+ "xyz")) ; {:ok? true, :rest ("y")}
Conversely, there is a rule for exclusion:

(parse-lilac "x" (other-than+ "abc")) ; {:ok? true, :rest nil}
(parse-lilac "xy" (other-than+ "abc")) ; {:ok? true, :rest ("y")}
(parse-lilac "a" (other-than+ "abc")) ; {:ok? false}
Building on this, some logic can be added to say that a rule is allowed to be absent.
Of course, once absence is allowed, the result can always fall back to true:

(parse-lilac "x" (optional+ (is+ "x"))) ; {:ok? true, :rest nil}
(parse-lilac "" (optional+ (is+ "x"))) ; {:ok? true, :rest nil}
(parse-lilac "x" (optional+ (is+ "y"))) ; {:ok? true, :rest ("x")}
A rule can also be set to match repeatedly, that is, one or more times (there is currently no way to control the exact count):

(parse-lilac "x" (many+ (is+ "x")))
(parse-lilac "xx" (many+ (is+ "x")))
(parse-lilac "xxx" (many+ (is+ "x")))
(parse-lilac "xxxy" (many+ (is+ "x")))
If zero occurrences are also allowed, the rule to use is some+ rather than many+:

(parse-lilac "" (some+ (is+ "x")))
(parse-lilac "x" (some+ (is+ "x")))
(parse-lilac "xx" (some+ (is+ "x")))
(parse-lilac "xxy" (some+ (is+ "x")))
(parse-lilac "y" (some+ (is+ "x")))
Correspondingly, an or rule can be written:

(parse-lilac "x" (or+ [(is+ "x") (is+ "y")]))
(parse-lilac "y" (or+ [(is+ "x") (is+ "y")]))
(parse-lilac "z" (or+ [(is+ "x") (is+ "y")]))
combine+ is used to compose several rules in sequence:

(parse-lilac "xy" (combine+ [(is+ "x") (is+ "y")])) ; {:ok? true, :rest nil}
(parse-lilac "xyz" (combine+ [(is+ "x") (is+ "y")])) ; {:ok? true, :rest ("z")}
(parse-lilac "xy" (combine+ [(is+ "y") (is+ "x")])) ; {:ok? false}
interleave+ takes two rules and repeats them alternately.
This comes up a lot when handling comma-separated expressions:

(parse-lilac "xy" (interleave+ (is+ "x") (is+ "y")))
(parse-lilac "xyx" (interleave+ (is+ "x") (is+ "y")))
(parse-lilac "xyxy" (interleave+ (is+ "x") (is+ "y")))
(parse-lilac "yxy" (interleave+ (is+ "x") (is+ "y")))
The current code also provides a few built-in rules for recognizing letters, digits, and Chinese characters:

(parse-lilac "a" lilac-alphabet)
(parse-lilac "A" lilac-alphabet)
(parse-lilac "." lilac-alphabet) ; {:ok? false}
(parse-lilac "1" lilac-digit)
(parse-lilac "a" lilac-digit) ; {:ok? false}
(parse-lilac "汉" lilac-chinese-char)
(parse-lilac "E" lilac-chinese-char) ; {:ok? false}
(parse-lilac "," lilac-chinese-char) ; {:ok? false}
For other special characters, the only option for now is to specify a Unicode range:

(parse-lilac "a" (unicode-range+ 97 122))
(parse-lilac "z" (unicode-range+ 97 122))
(parse-lilac "A" (unicode-range+ 97 122))
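The bounds 97 and 122 in the calls above are the code points of a and z. Here is a small sketch of how such bounds can be looked up, using the host string method codePointAt, which exists on both JavaScript and Java strings. code-point and in-range? are hypothetical helper names for illustration only; they do not come from lilac-parser and say nothing about how unicode-range+ treats its bounds.

```clojure
;; Hypothetical helpers for illustration; not part of lilac-parser.
(defn code-point
  "Code point of the first character of the string s."
  [s]
  (.codePointAt s 0))

(defn in-range?
  "True when the first character of s lies within [low, high], inclusive."
  [s low high]
  (<= low (code-point s) high))

(code-point "a")        ; 97
(code-point "z")        ; 122
(in-range? "A" 97 122)  ; false, since "A" is 65
```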
With these rules, regex features can be emulated by composition, for example counting how many matches there are:

(find-lilac "write cumulo and respo" (or+ [(is+ "cumulo") (is+ "respo")])) ; find 2
(find-lilac "write cumulo and phlox" (or+ [(is+ "cumulo") (is+ "respo")])) ; find 1
(find-lilac "write cumulo and phlox" (or+ [(is+ "cirru") (is+ "respo")])) ; find 0
Or doing string replacement directly, which is much like a regex:

(replace-lilac "cumulo project" (or+ [(is+ "cumulo") (is+ "respo")]) (fn [x] "my")) ; "my project"
(replace-lilac "respo project" (or+ [(is+ "cumulo") (is+ "respo")]) (fn [x] "my")) ; "my project"
(replace-lilac "phlox project" (or+ [(is+ "cumulo") (is+ "respo")]) (fn [x] "my")) ; "phlox project"

As you can see, the rules are built by composition. They are longer to write than a regex, but you can define variables and build abstractions.
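To illustrate what defining variables buys, here is a toy sketch in plain Clojure. It does not use lilac-parser and is not how the library is implemented: is*, or*, combine* and parse* are hypothetical stand-ins that only mimic the result shape of the examples above. The point is that a rule is an ordinary value, so it can be named once and reused inside larger rules.

```clojure
;; Toy stand-ins for illustration only; NOT lilac-parser's implementation.
;; A rule is a function from a seq of characters to a result map.

(defn is*
  "Rule that matches the literal string s at the head of the input."
  [s]
  (fn [chars]
    (let [n (count s)]
      (if (= (seq s) (seq (take n chars)))
        {:ok? true, :value s, :rest (seq (drop n chars))}
        {:ok? false}))))

(defn or*
  "Rule that succeeds with the first rule in rules that matches."
  [rules]
  (fn [chars]
    (or (some (fn [rule]
                (let [result (rule chars)]
                  (when (:ok? result) result)))
              rules)
        {:ok? false})))

(defn combine*
  "Rule that applies rules one after another, collecting their values."
  [rules]
  (fn [chars]
    (loop [remaining rules, input chars, values []]
      (if (empty? remaining)
        {:ok? true, :value values, :rest (seq input)}
        (let [result ((first remaining) input)]
          (if (:ok? result)
            (recur (rest remaining) (:rest result) (conj values (:value result)))
            {:ok? false}))))))

(defn parse* [text rule] (rule (seq text)))

;; Name a rule once...
(def project-name (or* [(is* "cumulo") (is* "respo")]))

;; ...and reuse it inside larger rules.
(def project-title (combine* [project-name (is* " project")]))
(def project-import (combine* [(is* "import ") project-name]))

(:ok? (parse* "respo project" project-title))   ; true
(:ok? (parse* "import cumulo" project-import))  ; true
(:ok? (parse* "phlox project" project-title))   ; false
```

With a regex, sharing the cumulo-or-respo alternative between two patterns would mean concatenating pattern strings by hand. Here the shared part is just a def.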
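Simple examples may not show why this is useful; it may just look longer, with worse performance on top.
My project includes a simple JSON parsing example, which a regex probably could not handle.
The code is copied directly below: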
; Assumes clojure.string is aliased as string, and that the lilac-parser rules
; (is+, or+, some+, many+, ...) and the defparser macro are referred into this namespace.

; value-parser+ refers to the two parsers below before they are defined
(declare array-parser+ object-parser+)

; Match either true or false; the result is a boolean
(def boolean-parser
  (label+ "boolean"
    (or+ [(is+ "true") (is+ "false")]
         (fn [x] (if (= x "true") true false)))))

(def space-parser (label+ "space" (some+ (is+ " ") (fn [x] nil))))

; Compose a parser for a comma surrounded by whitespace; label is only an annotation and can be ignored
(def comma-parser
  (label+ "comma"
    (combine+ [space-parser (is+ ",") space-parser] (fn [x] nil))))

(def digits-parser
  (many+ (one-of+ "0123456789") (fn [xs] (string/join "" xs))))

; For simplicity, null and undefined both return nil
(def nil-parser
  (label+ "nil" (or+ [(is+ "null") (is+ "undefined")] (fn [x] nil))))

; For numbers, allow for a leading minus sign and a trailing decimal part
; Scientific notation is skipped here out of laziness...
(def number-parser
  (label+ "number"
    (combine+
      ; the minus sign, optional
      [(optional+ (is+ "-"))
       digits-parser
       ; compose the decimal part, which is also optional
       (optional+ (combine+ [(is+ ".") digits-parser]
                            (fn [xs] (string/join "" xs))))]
      (fn [xs] (js/Number (string/join "" xs))))))

(def string-parser
  (label+ "string"
    (combine+
      ; a string starts with a quote and ends with a quote
      [(is+ "\"")
       ; in between are non-quote characters or escape sequences
       (some+ (or+ [(other-than+ "\"\\") (is+ "\\\"") (is+ "\\\\") (is+ "\\n")]))
       (is+ "\"")]
      (fn [xs] (string/join "" (nth xs 1))))))

(defparser value-parser+ () identity
  (or+ [number-parser string-parser nil-parser boolean-parser
        (array-parser+) (object-parser+)]))

(defparser object-parser+ () identity
  (combine+
    [(is+ "{")
     (optional+
       ; objects are more involved; the interleave part is the core, the rest only handles the braces
       (interleave+
         (combine+ [string-parser space-parser (is+ ":") space-parser (value-parser+)]
                   (fn [xs] [(nth xs 0) (nth xs 4)]))
         comma-parser
         (fn [xs] (take-nth 2 xs))))
     (is+ "}")]
    (fn [xs] (into {} (nth xs 1)))))

(defparser array-parser+ () (fn [x] (vec (first (nth x 1))))
  (combine+
    [(is+ "[")
     ; arrays are another case of interleave
     (some+ (interleave+ (value-parser+) comma-parser (fn [xs] (take-nth 2 xs))))
     (is+ "]")]))
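As you can see, building rules with lilac-parser makes it fairly easy to produce a JSON parser. The supported rules are fairly simple and performance is not ideal, but compared with regexes the code is much more readable. I believe this approach can be useful in many text-processing scenarios. In the future, a simplified version might be provided for direct use in JavaScript as a replacement for regexes.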