ANTLR (DSL)

Date: 2019-11-19

Tags: antlr, dsl

ANTLR

Name: ANother Tool for Language Recognition

Site:

https://github.com/antlr/

https://theantlrguy.atlassian.net/wiki/display/ANTLR3/ANTLR+v3+documentation

http://www.antlr3.org/grammar/list.html

http://www.crifan.com/files/doc/docbook/antlr_tutorial/release/pdf/antlr_tutorial.pdf

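Purpose: generates a lexer, parser, tree walker, or combined lexer and parser for a given language

Use cases: Hibernate parses HQL

Spring parses EL

GemFire (or Geode) parses OQL

Version: 3.3 (the parser that the 3.3 tool uses to read grammars was itself generated by ANTLR 2.7 from the Antlr.g grammar file)

(This parser reads our grammar file; a companion library, StringTemplate, then generates our parser or lexer.)

Input: a grammar file (.g file) for some language A

Output: a parsing program for language A (in Java, C#, C++, and so on)

Grammar file: usually contains a header block, an options block, the parser class with its rule definitions, and the lexer class with its token definitions. The rule and token definitions matter most. A rule definition closely resembles Extended Backus-Naur Form (EBNF) from compiler theory: a rule name, a rule body, a terminating semicolon, and an optional exception-handling part.

A rule name must begin with a lowercase letter; a token name must begin with an uppercase letter.

The sections of a .g file appear in a fixed order, with the rules last: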

     optionSpec
     tokensSpec
     attributeScopes
     actions
     rule1 : ...  | ;
     rule2 :      | ;

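Basics

The parser that ANTLR generates is a recursive-descent parser, one kind of top-down parser. As the name suggests, parsing starts at the root of the syntax tree and recurses down to the leaves (tokens), so the call graph of the generated code mirrors the nodes of the tree. ANTLR generates one function per rule; for a terminal symbol the function calls match().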
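The one-function-per-rule idea can be sketched by hand. The following is a minimal Python illustration, not code produced by ANTLR; the two-rule grammar in the docstring, the class name ListParser, and the helper parse_list are all assumptions made for this example.

```python
import re


class ListParser:
    """Hand-written recursive-descent parser for the illustrative grammar:

         list : '[' elem (',' elem)* ']' ;
         elem : INT | list ;
    """

    def __init__(self, text):
        # Very small tokenizer: integers and the three punctuation marks.
        self.tokens = re.findall(r"\d+|[\[\],]", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected):
        """Consume one terminal, or fail if the lookahead differs."""
        tok = self.peek()
        if tok != expected:
            raise SyntaxError(f"expected {expected!r}, found {tok!r}")
        self.pos += 1
        return tok

    def list_(self):
        """One method per rule: this one implements `list`."""
        self.match("[")
        items = [self.elem()]
        while self.peek() == ",":
            self.match(",")
            items.append(self.elem())
        self.match("]")
        return items

    def elem(self):
        """Implements `elem`: choose an alternative from one token of lookahead."""
        tok = self.peek()
        if tok == "[":
            return self.list_()
        if tok is not None and tok.isdigit():
            return int(self.match(tok))
        raise SyntaxError(f"unexpected token {tok!r}")


def parse_list(text):
    parser = ListParser(text)
    result = parser.list_()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return result
```

Calling parse_list("[1,[2,3],4]") descends list_ -> elem -> list_ -> elem, so the chain of calls has the same shape as the nested list it returns.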

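In ANTLR 4, every rule gets a generated Context object that records everything known about that rule's match. ANTLR provides two traversal mechanisms, Listener and Visitor. The Listener is fully automatic: ANTLR drives a depth-first traversal and we only handle the events it raises. The Visitor gives a controllable traversal: we decide ourselves whether to call visit explicitly on the child nodes.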
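The difference between the two traversal styles can be shown on a toy tree. This is a sketch under stated assumptions: Node, walk, TraceListener, and SumVisitor are hypothetical names written for this example and are not the ANTLR runtime API.

```python
class Node:
    """A toy parse-tree node (stand-in for a generated Context object)."""

    def __init__(self, rule, children=(), text=None):
        self.rule = rule
        self.children = list(children)
        self.text = text


def walk(listener, node):
    """Listener style: the walker drives the depth-first traversal."""
    listener.enter(node)
    for child in node.children:
        walk(listener, child)
    listener.exit(node)


class TraceListener:
    """Only reacts to events; it has no say in which nodes are visited."""

    def __init__(self):
        self.events = []

    def enter(self, node):
        self.events.append("enter " + node.rule)

    def exit(self, node):
        self.events.append("exit " + node.rule)


class SumVisitor:
    """Visitor style: the visitor itself decides which children to visit."""

    def visit(self, node):
        if node.rule == "int":
            return int(node.text)
        if node.rule == "ignored":
            return 0  # deliberately do not descend into this subtree
        return sum(self.visit(child) for child in node.children)


tree = Node("add", [
    Node("int", text="1"),
    Node("ignored", [Node("int", text="100")]),
    Node("int", text="2"),
])

listener = TraceListener()
walk(listener, tree)          # fires enter/exit for all five nodes
total = SumVisitor().visit(tree)  # skips the "ignored" subtree
```

The listener sees every node because the walker controls the traversal, while the visitor never reaches the 100 under the "ignored" node because it chose not to visit it.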

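LL(k) grammars

LL parsing is top-down: it starts from the grammar's start symbol (the root of the tree) and builds the syntax tree downward until every leaf has been created. Recursive descent is the usual way to implement it.

Nondeterministic top-down parsing

In essence this is an exhaustive search from the start symbol that continues until a matching string is found (valid input) or the search is exhausted (invalid input). Each step expands the leftmost nonterminal of the current sentential form (nonterminals are usually written as uppercase letters). When several productions apply, they can only be tried one at a time, and a failed attempt backtracks to the earlier string, hence the name parsing with backtracking. It is inefficient.

Nondeterministic pushdown automaton

It consists of an input string, a read head, a finite-state control, and a last-in-first-out pushdown stack. In essence, consuming the input drives the machine into some state.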

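Eliminating left recursion from a grammar:

1. Use extended BNF (EBNF)

        {} means 0 to n occurrences
        [] means 0 or 1 occurrence
        () groups a common factor: A -> x(y|w|..|z)

2. Rewrite directly to eliminate the left recursion

        For example  U  →  Ux | y
        becomes      U  →  yU1
                     U1 →  xU1 | null      (null is the empty string)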
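The direct rewrite in step 2 is mechanical enough to sketch in code. The function below is a minimal illustration that handles only immediate left recursion in a single rule; the function name and the list-of-symbols representation are assumptions made for this example, and an empty list stands for the empty (null) alternative.

```python
def eliminate_left_recursion(name, alternatives):
    """Rewrite  U -> U x | y   as   U -> y U1 ;  U1 -> x U1 | (empty).

    `alternatives` is a list of alternatives, each a list of symbol names.
    Only immediate left recursion of the rule `name` is handled.
    """
    # Tails of the alternatives that start with the rule itself (the "x" parts).
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    # Alternatives that do not start with the rule (the "y" parts).
    others = [alt for alt in alternatives if not alt or alt[0] != name]
    if not recursive:
        return {name: alternatives}  # nothing to rewrite

    tail = name + "1"  # new helper nonterminal, e.g. U1
    return {
        name: [alt + [tail] for alt in others],
        tail: [alt + [tail] for alt in recursive] + [[]],  # [] is the empty alternative
    }


rewritten = eliminate_left_recursion("U", [["U", "x"], ["y"]])
```

For the example rule, rewritten maps U to the single alternative y U1 and U1 to the alternatives x U1 and empty, matching the hand rewrite.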

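LL(k) grammars are a proper subset of the context-free grammars.

LL(k) grammars are also the largest class of grammars that can be parsed deterministically with a left-to-right scan of the input and a top-down technique.

LR parsing is bottom-up: it starts from the given input string (the leaves of the syntax tree) and reduces upward until it reaches the root. Operator-precedence parsing is another bottom-up method.

In ANTLR, operator precedence has to be expressed through the nesting of grammar rules.

In ANTLR, parser rules and lexer rules are distinguished by the first character of the rule name: parser rule names start with a lowercase letter and lexer rule names with an uppercase letter.

ANTLR supports several target languages: the analyzer can be generated as Java, C#, C, Python, JavaScript, and others. The default target is Java; use options {language=?;} to change it.

ANTLR grammar files are written in EBNF, so anyone who remembers compiler theory will find them easy to use; what takes further study is how to build your own recognizer and translator. Many grammars need not be written from scratch: most language standards already describe their syntax in EBNF, and the ANTLR site http://www.antlr.org/grammar/list has a large collection of ready-made grammar files to consult.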

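EBNF extension symbols

()   : groups productions
?    : 0 or 1 occurrence
*    : 0 or more occurrences
+    : 1 or more occurrences
.    : any single character
~    : anything except the character that follows
..   : character range

See also http://www.cl.cam.ac.uk/~mgk25/iso-ebnf.html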

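Integer definition:

integer
    : (HEX_PREFIX | OCTAL_PREFIX)? DIGITS ; // optional hex or octal prefix; without one the number is decimal
HEX_PREFIX
    : '0x' ; // hexadecimal prefix
OCTAL_PREFIX
    : '0o' ; // octal prefix
DIGITS
    : '1'..'9' '0'..'9'* ; // first character must be 1-9, followed by any number of 0-9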

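Definition of the multi-line comment delimiters /* and */:

ML_COMMENT : '/*' ( options {greedy=false;} : . )* '*/' ;

The comment starts with /* and ends with */, and (.)* allows any number of characters in between. options {greedy=false;} is a subrule option that tells the analyzer not to match greedily: it stops at the first */ that follows rather than trying to reach the last */ in the input stream.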
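The effect of greedy versus non-greedy matching can be seen with Python's re module. This is only an analogy to the greedy=false option, not ANTLR code; the sample string is made up for the illustration.

```python
import re

source = "a /* first */ b /* second */ c"

# Greedy: .* runs to the LAST "*/" in the input, swallowing the code in between.
greedy = re.findall(r"/\*.*\*/", source, flags=re.DOTALL)

# Non-greedy: .*? stops at the FIRST "*/" after each "/*".
non_greedy = re.findall(r"/\*.*?\*/", source, flags=re.DOTALL)
```

Here greedy holds one oversized match spanning both comments, while non_greedy holds the two comments separately, which is the behavior the comment rule wants.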

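Left recursion and right recursion

If a production contains itself at its leftmost position, it is left-recursive, for example exp: A | exp ',' A. In contrast, exp: A ',' exp | A is right-recursive. Concrete examples follow.

1. Right recursion

For example, implement an expression where +8 evaluates to 9, ++8 to 10, +++8 to 11, and so on. The right-recursive grammar is: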

expr: PLUS expr|INT;
INT: '1'..'9''0'..'9'*;
PLUS: '+';

Written so that it can be run with the C# target, the grammar file looks like this:

left returns [int value] : e=expr { $value = $e.value; };
expr returns [int value] :
    PLUS e=expr { $value=$e.value+1; }
    | INT { $value=int.Parse( $INT.text ); };
INT: '1'..'9' '0'..'9'*;
PLUS: '+';

One point is easy to misread: should expr be parsed as expr: (PLUS expr) | INT; or as expr: PLUS (expr | INT);? The first reading is the correct one.
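To make the right-recursive evaluation concrete, here is a hand-written Python sketch of the same rule, expr : PLUS expr | INT, with the +1 action applied on the way back out of the recursion. It is an illustration rather than ANTLR output; tokenize, expr, and evaluate are names chosen for this example.

```python
import re


def tokenize(text):
    """Split input into PLUS and INT tokens; reject anything else."""
    tokens = re.findall(r"\+|[1-9][0-9]*", text)
    if "".join(tokens) != text:
        raise SyntaxError(f"cannot tokenize {text!r}")
    return tokens


def expr(tokens, pos=0):
    """expr : PLUS expr | INT ;  returns (value, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] == "+":
        # First alternative: consume PLUS, recurse, then add 1 to the result.
        value, pos = expr(tokens, pos + 1)
        return value + 1, pos
    # Second alternative: a single INT token.
    return int(tokens[pos]), pos + 1


def evaluate(text):
    tokens = tokenize(text)
    value, pos = expr(tokens)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return value
```

With this sketch, evaluate("+8") yields 9 and evaluate("+++8") yields 11, one recursive call per leading plus sign.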

2. Left recursion

For example, implement an expression where 8+ evaluates to 9, 8++ to 10, 8+++ to 11, and so on. The left-recursive grammar is:

expr: expr PLUS|INT;
INT: '1'..'9''0'..'9'*;
PLUS: '+';

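ANTLR 3 does not support left recursion; generating code from the grammar above fails with: error(210): The following sets of rules are mutually left-recursive [expr].

Rewriting the grammar as follows removes the left recursion.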

expr: INT PLUS*;
INT: '1'..'9''0'..'9'*;
PLUS: '+';
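The rewritten rule expr : INT PLUS* turns the recursion into a loop. A minimal Python sketch of that loop follows; it is hand-written for illustration, and the function name evaluate_postfix is an assumption of this example.

```python
import re


def evaluate_postfix(text):
    """expr : INT PLUS* ;  each trailing PLUS adds 1 to the integer."""
    tokens = re.findall(r"\+|[1-9][0-9]*", text)
    if not tokens or "".join(tokens) != text or tokens[0] == "+":
        raise SyntaxError(f"expected INT at start of {text!r}")

    value = int(tokens[0])  # match INT
    pos = 1
    while pos < len(tokens) and tokens[pos] == "+":  # match PLUS*
        value += 1
        pos += 1

    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return value
```

For the inputs from the text, evaluate_postfix("8+") gives 9 and evaluate_postfix("8+++") gives 11, with no recursive call involved.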

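Miscellaneous concepts

Grammar ambiguity: the issue that deserves the most attention in grammar design. The first kind is a mismatch between how a person reads the grammar and what the parser actually does, as with the expr question mentioned above. The second is ambiguity in the grammar itself: the same input can be interpreted in more than one way, each consistent with the rules.

lexer: the lexical analyzer; it turns the input character stream into a sequence of tokens.

parser: the syntax analyzer; it analyzes the tokens and produces a syntax tree (an abstract syntax tree, AST).

EBNF itself does not distinguish lexical analysis from syntax analysis; it has only terminals and nonterminals. Terminals describe the input, and nonterminals describe the tree structure the input expresses. ANTLR uses the lexer to recognize terminals and the parser to analyze nonterminals (two analyzer source files are generated, a lexer ***Lexer and a parser ***Parser), and it requires every lexer rule name to start with an uppercase letter and every parser rule name with a lowercase letter. Every parser rule must ultimately bottom out in lexer rules; otherwise the grammar is invalid and generation reports an error.

In the code ANTLR generates, every parser rule has a method of the same name, whereas the generated names for lexer rules differ from those in the grammar file. If you need to use these lexer rules, one workable approach is to define a parser rule that corresponds to each; another is to define a token table, for example tokens{...}. The advantage of a token table is that its entries are matched first, which keeps keywords, for instance, from being matched by other rules.

Regarding rule order: rules that appear earlier in the grammar file take priority when matching.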
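How definition order can decide between competing lexer rules is easy to demonstrate with a toy tokenizer. The sketch below assumes a simplified policy (the longest match wins, and on a tie the rule defined earlier wins); it is not a description of ANTLR's actual lexer algorithm, and the rule names and the tokenize function are made up for this example.

```python
import re

# Order matters: IF is listed before ID so the keyword wins a tie.
RULES = [("IF", r"if"), ("ID", r"[a-z]+"), ("WS", r"\s+")]


def tokenize(text, rules=RULES):
    """Longest match wins; among equally long matches the earlier rule wins."""
    pos, out = 0, []
    while pos < len(text):
        best = None  # (rule name, end position)
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict ">" keeps the earlier rule when lengths are equal.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise SyntaxError(f"no rule matches at position {pos}")
        name, end = best
        if name != "WS":  # drop whitespace
            out.append((name, text[pos:end]))
        pos = end
    return out


keyword_first = tokenize("if iffy")
id_first = tokenize("if", rules=[("ID", r"[a-z]+"), ("IF", r"if"), ("WS", r"\s+")])
```

With IF defined first, "if" is an IF token and "iffy" is an ID; with the two rules swapped, "if" is swallowed by ID, which is the kind of surprise that careful rule ordering avoids.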

ANTLRv3.g

grammarDef
      :   DOC_COMMENT?
    ('lexer'  {gtype=LEXER_GRAMMAR;}    // pure lexer
    |   'parser' {gtype=PARSER_GRAMMAR;}   // pure parser
    |   'tree'   {gtype=TREE_GRAMMAR;}     // a tree parser
    |     {gtype=COMBINED_GRAMMAR;} // merged parser/lexer
    )
    g='grammar' id ';' optionsSpec? tokensSpec? attrScope* action*
    rule+
    EOF
    -> ^( {adaptor.create(gtype,$g)}
               id DOC_COMMENT? optionsSpec? tokensSpec? attrScope* action* rule+
             )
    ;

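Lexer: the lexical analyzer class. It splits the incoming character stream into pieces according to the rules, like cutting a long noodle into lengths of the size you want, without modifying the content in any way.

TypeName : the concrete rule to match

Parser: the parser class. It processes the pieces produced by the Lexer.

StartRuleName :
RuleInstanceName : TypeName or RuleName
{ Java statements ...; } ;
...

Start rule name: arbitrary.

Rule instance name: like the s in the Java declaration "String s;". The instance name is used to refer to the match in the Java statements that follow.

Type name or rule name: either a type name defined in the Lexer or a rule name defined in the Parser. The difference feels a bit like that between int and Integer.

Java statements: the statements executed when the current rule matches. ANTLR embeds them automatically in the generated Java class.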

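Practice:

1) In lexical analysis, suppose ">=" and ">" are described as

BEQUAL : ('>' '=') ;
BIGER  : '>' ;

When the lexer sees a ">", it cannot tell whether to treat it as BEQUAL or as BIGER, which is a conflict. It can be resolved with a definition of this form:

BEQUAL : ('>' '=') => ('>' '=') | '>' { $setType(BIGER); } ; // the form is (...) => (...) | (...), comparable to the ternary expression (1) ? (2) : (3) in ordinary languages: if 1 holds, alternative 2 is taken, otherwise alternative 3.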
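What the predicate does amounts to one character of lookahead. The hand-written Python sketch below shows that decision in isolation; it reuses the token names BEQUAL and BIGER from the text, while the function lex_comparison is a hypothetical helper for this example.

```python
def lex_comparison(text):
    """Tokenize '>' and '>=' by peeking one character past each '>'."""
    tokens, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch == ">":
            # Lookahead: is the next character '='? (Slicing is safe at end of input.)
            if text[pos + 1:pos + 2] == "=":
                tokens.append(("BEQUAL", ">="))
                pos += 2
            else:
                tokens.append(("BIGER", ">"))
                pos += 1
        elif ch.isspace():
            pos += 1  # skip whitespace
        else:
            raise SyntaxError(f"unexpected character {ch!r}")
    return tokens
```

For example, lex_comparison("> >= >") produces BIGER, BEQUAL, BIGER: the token type is chosen only after the character following ">" has been inspected.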

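2) A rule in ANTLR corresponds to a function in Java, so it can take parameters and return a value, for example:

expr [HashMap hm] returns [String s]
// equivalent in Java to:
public String expr(HashMap hm){…}

3) Target-language code can be embedded in an ANTLR grammar

{import java.lang.Math;}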

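4) Punctuation and keywords

Symbol    Description
(...)     subrule
(...)*    closure subrule (zero or more)
(...)+    positive closure subrule (one or more)
(...)?    optional (zero or one)
{...}     semantic action
[...]     rule arguments
{...}?    semantic predicate
(...)=>   syntactic predicate
|         alternative operator
..        range operator
~         not
.         wildcard
=         assignment
:         label operator, rule start
;         rule end
<...>     element option
class     grammar class
extends   specifies the grammar base class
returns   specifies the return type
options   options section
tokens    tokens section (token definitions)
header    header section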

!		do not include node or subtree (if referencing a rule) in subtree
^		make node root of subtree created for entire enclosing rule even if nested in a subrule

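5) Rule references. An identifier that starts with a lowercase letter is a reference to an ANTLR parser rule. The following characters may be any letter, digit, or underscore. Lexer rules cannot reference parser rules. Lexer rule names start with an uppercase letter.

6) Actions. A character sequence enclosed in {} curly braces is a semantic action (possibly nested). Curly braces inside string and character literals are not action delimiters.

7) Action arguments. A character sequence enclosed in [ ] square brackets is an action argument list (possibly nested). Square brackets inside string and character literals are not action delimiters. The arguments inside [] are written in the syntax of the generated language and separated by commas.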

codeBlock
[int scope, String name] // input arguments
returns [int x]          // return values
 : ...
 ;

// pass 2 args, get return
testcblock
{int y;}
 : y=codeBlock[1,"John"] ;

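8) The header section

A header section contains source code that is copied verbatim into the generated parser, ahead of all ANTLR-generated code. It is used mainly for C++ output, because C++ requires some elements to be declared before they are referenced. In Java it can be used to specify the package for the generated parser. A header section looks like this:

  header {    source code in the language generated by ANTLR;  }

The header section is the first section of a grammar file. Depending on the chosen target language, different kinds of header sections may appear.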

9) A typical lexer

{optional class code preamble } 
class YourLexerClass extends Lexer; 
options 
tokens  
{ optional action for instance vars/methods }
lexer rules..

10) A typical parser

{ optional class code preamble }  
class YourParserClass extends Parser; 
options 
tokens  
{ optional action for instance vars/methods }  
parser rules...

11) A typical tree parser

{ optional class code preamble }  
class YourTreeParserClass extends TreeParser; 
options 
tokens  
{ optional action for instance vars/methods } 
tree parser rules...

12) Keywords

Keyword  |  Description
---------+--------------------------------------------------------
scope    |  Dynamically-scoped attribute
fragment |  lexer rule is a helper rule, not real token for parser
lexer    |  grammar type
tree     |  grammar type
parser   |  grammar type
grammar  |  grammar header
returns  |  rule return value(s)
throws   |  rule throws exception(s)
catch    |  catch rule exceptions
finally  |  do this no matter what
options  |  grammar or rule options
tokens   |  can add tokens with this; usually imaginary tokens
import   |  import grammar(s)

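13) fragment literally means a piece: something used by other, complete rules.

On its own it cannot serve as a token; it can only be referenced from other lexer rules.

Its meaning is somewhat similar to:

an inline function, with the effect of direct substitution

a private member of a struct or class (not visible from outside, used only internally within the file)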

14）Options

language	The target language for code generation. Default is Java. See Code Generation Targets for list of currently supported target languages.
tokenVocab	Where ANTLR should get predefined tokens and token types. Tree grammars need it to get the token types from the parser that creates its trees. TODO: Default value? Example?
output 		The type of output the generated parser should return. Valid values are AST and template. TODO: Briefly, what are the interpretations of these values? Default value?
ASTLabelType Set the type of all tree labels and tree-valued expressions. Without this option, trees are of type Object. TODO: Cross-reference default impl (org.antlr.runtime.tree.CommonTree in Java)?
TokenLabelType	Set the type of all token-valued expressions. Without this option, tokens are of type org.antlr.runtime.Token in Java (IToken in C#).
superClass	Set the superclass of the generated recognizer. TODO: Default value (org.antlr.runtime.Parser in Java)?
filter		In the lexer, this allows you to try a list of lexer rules in order. The first one that matches, wins. This is the token that nextToken() returns. If nothing matches, the lexer consumes a single character and tries the list of rules again. See Lexical filters for more.
rewrite		Valid values are true and false. Default is false. Use this option when your translator output looks very much like the input. Your actions can modify the TokenRewriteStream to insert, delete, or replace ranges of tokens with another object. Used in conjunction with output=template, you can very easily build translators that tweak input files.
k		Limit the lookahead depth for the recognizer to at most k symbols. This prevents the decision from using acyclic LL* DFA.
backtrack	Valid values are true and false. Default is false. Taken from http://www.antlr.org:8080/pipermail/antlr-interest/2006-July/016818.html : The new feature (a big one) is the backtrack=true option for grammar, rule, and block that lets you type in any old crap and ANTLR will backtrack if it can't figure out what you meant. No errors are reported by antlr during analysis. It implicitly adds a syn pred in front of every production, using them only if static grammar LL* analysis fails. Syn pred code is not generated if the pred is not used in a decision. This is essentially a rapid prototyping mode. It is what I have used on the java.g. Oh, it doesn't memoize partial parses (i.e. rule parsing results) during backtracking automatically now. You must also say memoize=true. Can make a HUGE difference to turn on.
memoize		Valid values are true and false. When backtracking, remember whether or not rule references succeed so that the same input position cannot be parsed more than once by the same rule. This effectively guarantees linear parsing when backtracking at the cost of more memory. TODO: Default value (false)?

15）Rules

T				Token reference. An uppercase identifier; lexer grammars may use optional arguments for fragment token rules.
T<node=V> or T<V>	Token reference with the optional token option node to indicate tree construction node type; can be followed by arguments on right hand side of -> rewrite rule
T[«args»]		Lexer rule (token rule) reference. Lexer grammars may use optional arguments for fragment token rules.
r [«args»]		Rule reference. A lowercase identifier with optional arguments.
'«one-or-more-char»'	String or char literal in single quotes. In parser, a token reference; in lexer, match that string.
{«action»}		An action written in target language. Executed right after previous element and right before next element.
{«action»}?		Semantic predicate.
{«action»}?=>	Gated semantic predicate.
(«subrule»)=>	Syntactic predicate.
(«x»|«y»|«z»)	Subrule. Like a call to a rule with no name.
(«x»|«y»|«z»)?	Optional subrule.
(«x»|«y»|«z»)*	Zero-or-more subrule.
(«x»|«y»|«z»)+	One-or-more subrule.
«x»?			Optional element.
«x»*			Zero-or-more element.
«x»+			One-or-more element.

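16) Other notes

1. grammar SimpleCalc; defines the grammar name. The grammar file (.g file) name must match the name given here. By default the generated parser class is named grammar name + Parser and the lexer class grammar name + Lexer.
2. options {...} defines global configuration parameters that control ANTLR's generation process.
3. tokens {...} defines the global token table.
4. @members {...}: the code given here is placed into the generated parser class as its member fields, methods, and so on.
   The example adds a Main method to the parser, which saves writing a separate test class.
   Main first reads the input characters from the command line and passes them to the generated lexer SimpleCalcLexer, yielding the token sequence CommonTokenStream. The generated parser SimpleCalcParser then analyzes the token sequence, and the expr() method returns the computed result directly while the syntax tree is being built.
5. Parser rules.
   Strip the ANTLR actions such as {...} and [...] from a parser rule and what remains is EBNF.
   In the parser code ANTLR generates, each parser rule gets a method that carries out the analysis for that rule. The {...}, [...] and similar constructs are the extra control (actions) that our own recognizer or translator needs. ANTLR does the matching; the actions we add decide what happens to the matched result: build a syntax tree for another program to translate, or translate, convert, or compute directly during ANTLR's analysis. The example evaluates the expression directly.
   returns [...] tells ANTLR what the generated method should return. [int value] means the return type is int and the name is value; it declares a variable that carries the return value. In the method body, the code inside {...} performs the evaluation and assigns the result to value. {...} can be seen as a macro or a template in which we can use variables and objects that ANTLR provides during code generation. They start with $ (like StringTemplate syntax, but without a closing $), and ANTLR sets them up for us during code generation. The rest of the code is left unchanged and placed at the corresponding position in the method.
6. Lexer rules.
   { $channel = HIDDEN; }. By default ANTLR uses two channels between the lexer and the parser, default and hidden. The parser listens on the default channel for the token sequence, so a token sent to the hidden channel is ignored by the parser. The example filters out whitespace such as carriage returns and line feeds so that they are not parsed.
   fragment: applies to lexer rules (no effect on parser rules has been observed). It has no corresponding node in the resulting syntax tree. It can be understood as a macro defined in the grammar: other lexer rules can call it, but it never becomes a node of the final syntax tree.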
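The channel mechanism described in item 6 can be sketched as tokens that carry a channel tag, with the parser reading only one channel. This is a toy Python model, not the ANTLR runtime: the channel constants, the token tuple layout, and the functions lex and parser_view are all assumptions made for this example.

```python
import re

# Illustrative channel numbers; the real runtime defines its own constants.
DEFAULT, HIDDEN = 0, 1

TOKEN_RE = re.compile(r"(?P<INT>\d+)|(?P<PLUS>\+)|(?P<WS>\s+)")


def lex(text):
    """Produce (type, text, channel) tuples; whitespace goes to HIDDEN.

    Note: characters matching no rule are silently skipped in this sketch.
    """
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        channel = HIDDEN if kind == "WS" else DEFAULT
        tokens.append((kind, m.group(), channel))
    return tokens


def parser_view(tokens):
    """What the parser sees: only tokens on the DEFAULT channel."""
    return [(kind, text) for kind, text, channel in tokens if channel == DEFAULT]


all_tokens = lex("1 + 2")
visible = parser_view(all_tokens)
```

The whitespace tokens still exist in all_tokens, so a tool that needs them (a pretty-printer, for instance) can read the hidden channel, while visible contains only INT, PLUS, INT.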
