ANTLR White Space Question (and not the typical one)

【Question title】: ANTLR White Space Question (and not the typical one)
【Posted】: 2011-02-15 20:56:24
【Question】:

Consider this short SmallC program:

#include "lib"
main() {
    int bob;
}

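My ANTLR grammar works fine if I set the line endings to "Mac (CR)" in ANTLRWorks when using the interpreter. If I set the line endings option to Unix (LF), the grammar throws a NoViableAltException and recognises nothing after the end of the include statement. The error disappears if I add an extra newline after the include. The computer I am using is a Mac, so I figured it made sense that the line endings had to be set to Mac format. I therefore switched to a Linux machine, and got the same result. Whatever I type into the ANTLRWorks Interpreter box, if I do not select Mac (CR) line endings I get the problem of insufficient blank lines described above and, in addition, the last statement of each statement block needs an extra space after the semicolon (i.e. after bob; above).

These errors show up again when I run the Java version of the grammar on a source file that I want to parse...

What could the problem be? I would understand if the problem were too many newlines, in a format that the parser perhaps does not understand or that my whitespace rule does not catch. But in this case the problem is a lack of newlines.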
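When behaviour changes with the line-ending setting, it can help to confirm which terminators the input text really contains before looking at the grammar. The following is a minimal sketch in plain Java, independent of ANTLR; the class name `LineEndingCount` and its `count` method are made up for illustration.

```java
public class LineEndingCount {

    /** Returns {CRLF pairs, lone CRs, lone LFs} found in the text. */
    static int[] count(String text) {
        int crlf = 0, cr = 0, lf = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    crlf++;
                    i++; // skip the LF that belongs to this CRLF pair
                } else {
                    cr++;
                }
            } else if (c == '\n') {
                lf++;
            }
        }
        return new int[] { crlf, cr, lf };
    }

    public static void main(String[] args) {
        // The SmallC program from the question, written with Unix (LF) endings.
        String unix = "#include \"lib\"\nmain() {\n    int bob;\n}";
        int[] n = count(unix);
        System.out.println("CRLF=" + n[0] + " CR=" + n[1] + " LF=" + n[2]);
    }
}
```

For the LF version of the program above, only the LF counter should be non-zero; the same text saved with Mac (CR) endings would show up in the CR counter instead.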

My whitespace declaration is as follows:

WS      :   ( '\t' | ' ' | '\r' | '\n' )+   { $channel = HIDDEN; } ;

Or could this be due to an ambiguity problem?

Here is the full grammar file (feel free to ignore the first few blocks, which override ANTLR's default error-handling mechanisms):

grammar SmallC;

options {
    output = AST ;  // Set output mode to AST
}

tokens {
    DIV = '/' ;
    MINUS   = '-' ;
    MOD = '%' ;
    MULT    = '*' ;
    PLUS    = '+' ;
    RETURN  = 'return' ;
    WHILE   = 'while' ;

    // The following are empty tokens used in AST generation
    ARGS ;
    CHAR ;
    DECLS ;
    ELSE ;
    EXPR ;
    IF ;
    INT ;
    INCLUDES ;
    MAIN ;
    PROCEDURES ;
    PROGRAM ;
    RETURNTYPE ;
    STMTS ;
    TYPEIDENT ;
}

@members { 
// Force error throwing, and make sure we don't try to recover from invalid input.
// The exceptions are handled in the FrontEnd class, and gracefully end the
// compilation routine after displaying an error message.
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
    throw new MismatchedTokenException(ttype, input);
} 
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)throws RecognitionException {
    throw e;
}
protected Object recoverFromMismatchedToken(IntStream input, int ttype, BitSet follow) throws RecognitionException {
     throw new MissingTokenException(ttype, input, null);
}

// We override getErrorMessage() to include information about the specific
// grammar rule in which the error happened, using a stack of nested rules.
Stack paraphrases = new Stack();
public String getErrorMessage(RecognitionException e, String[] tokenNames) {
    String msg = super.getErrorMessage(e, tokenNames);
    if ( paraphrases.size()>0 ) {
        String paraphrase = (String)paraphrases.peek();
        msg = msg+" "+paraphrase;
    }
    return msg;
}

// We override displayRecognitionError() to specify a clearer error message,
// and to include the error type (ie. class of the exception that was thrown)
// for the user's reference. The idea here is to come as close as possible
// to Java's exception output.
public void displayRecognitionError(String[] tokenNames, RecognitionException e)
{
    String exType;
    String hdr;
    if (e instanceof UnwantedTokenException) {
        exType = "UnwantedTokenException";
    } else if (e instanceof MissingTokenException) {
        exType = "MissingTokenException";
    } else if (e instanceof MismatchedTokenException) {
        exType = "MismatchedTokenException";
    } else if (e instanceof MismatchedTreeNodeException) {
        exType = "MismatchedTreeNodeException";
    } else if (e instanceof NoViableAltException) {
        exType = "NoViableAltException";
    } else if (e instanceof EarlyExitException) {
        exType = "EarlyExitException";
    } else if (e instanceof MismatchedSetException) {
        exType = "MismatchedSetException";
    } else if (e instanceof MismatchedNotSetException) {
        exType = "MismatchedNotSetException";
    } else if (e instanceof FailedPredicateException) {
        exType = "FailedPredicateException";
    } else {
        exType = "Unknown";
    }

    if ( getSourceName()!=null ) {
        hdr = "Exception of type " + exType + " encountered in " + getSourceName() + " at line " + e.line + ", char " + e.charPositionInLine + ": "; 
    } else {
        hdr = "Exception of type " + exType + " encountered at line " + e.line + ", char " + e.charPositionInLine + ": "; 
    }
    String msg = getErrorMessage(e, tokenNames);
    emitErrorMessage(hdr + msg + ".");
}
}

// Force the parser not to try to guess tokens and resume on faulty input,
// but rather display the error, and throw an exception for the program
// to quit gracefully.
@rulecatch {
catch (RecognitionException e) {
    reportError(e);
    throw e;
} 
}

/*------------------------------------------------------------------
 * PARSER RULES
 *
 * Many of these make use of ANTLR's rewrite rules to allow us to
 * specify the roots of AST sub-trees, and to allow us to do away
 * with certain insignificant literals (like parentheses and commas
 * in lists) and to add empty tokens to disambiguate the tree 
 * construction
 *
 * The @init and @after definitions populate the paraphrase
 * stack to allow us to specify which grammar rule we are in when
 * errors are found.
 *------------------------------------------------------------------*/

args
@init { paraphrases.push("in these procedure arguments"); }
@after { paraphrases.pop(); }
        :   ( typeident ( ',' typeident )* )?   ->  ^( ARGS ( typeident ( typeident )* )? )? ;

body
@init { paraphrases.push("in this procedure body"); }
@after { paraphrases.pop(); }
        :   '{'! decls stmtlist '}'! ;

decls
@init { paraphrases.push("in these declarations"); }
@after { paraphrases.pop(); }
        :   ( typeident ';' )*  ->  ^( DECLS ( typeident )* )? ;

exp
@init { paraphrases.push("in this expression"); }
@after { paraphrases.pop(); }
        :   lexp ( ( '>' | '<' | '>=' | '<=' | '!=' | '==' )^ lexp )? ;

factor      :   '(' lexp ')'
        |   ( MINUS )? ( IDENT | NUMBER ) 
        |   CHARACTER
        |   IDENT '(' ( IDENT ( ',' IDENT )* )? ')' ;

lexp        :   term ( ( PLUS | MINUS )^ term )* ;

includes
@init { paraphrases.push("in the include statements"); }
@after { paraphrases.pop(); }
        :   ( '#include' STRING )*  ->  ^( INCLUDES ( STRING )* )? ;

main    
@init { paraphrases.push("in the main method"); }
@after { paraphrases.pop(); }
        :   'main' '(' ')' body ->  ^( MAIN body ) ;

procedure
@init { paraphrases.push("in this procedure"); }
@after { paraphrases.pop(); }
        :   ( proc_return_char | proc_return_int )? IDENT^ '('! args ')'! body ;

procedures  :   ( procedure )*  ->  ^( PROCEDURES ( procedure)* )? ;

proc_return_char
        :   'char'  ->  ^( RETURNTYPE CHAR ) ;

proc_return_int :   'int'   ->  ^( RETURNTYPE INT ) ;

// We hard-code the regex (\n)* to fix a bug whereby a program would be accepted
// if it had 0 or more than 1 new lines before EOF but not if it had exactly 1,
// and not if it had 0 new lines between components of the following rule.
program     :   includes decls procedures main EOF ;

stmt
@init { paraphrases.push("in this statement"); }
@after { paraphrases.pop(); }
        :   '{'! stmtlist '}'!
        |   WHILE '(' exp ')' s=stmt    ->  ^( WHILE ^( EXPR exp ) $s )
        |   'if' '(' exp ')' s=stmt ( options {greedy=true;} : 'else' s2=stmt )?    ->  ^( IF ^( EXPR exp ) $s ^( ELSE $s2 )? )
        |   IDENT '='^ lexp ';'! 
        |   ( 'read' | 'output' | 'readc' | 'outputc' )^ '('! IDENT ')'! ';'!
        |   'print'^ '('! STRING ( options {greedy=true;} : ')'! ';'! )
        |   RETURN ( lexp )? ';'    ->  ^( RETURN ( lexp )? ) 
        |   IDENT^ '('! ( IDENT ( ','! IDENT )* )? ')'! ';'!;

stmtlist    :   ( stmt )*   ->  ^( STMTS ( stmt )* )? ;

term        :   factor ( ( MULT | DIV | MOD )^ factor )* ;

// We divide typeident into two grammar rules depending on whether the
// ident is of type 'char' or 'int', to allow us to implement different
// rewrite rules in each case.
typeident   :   typeident_char | typeident_int ;

typeident_char  :   'char' s2=IDENT ->  ^( CHAR $s2 ) ;

typeident_int   :   'int' s2=IDENT  ->  ^( INT $s2 ) ;

/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

// Must come before CHARACTER to avoid ambiguity ('i' matches both IDENT and CHARACTER)
IDENT       :   ( LCASE_ALPHA | UCASE_ALPHA | '_' ) ( LCASE_ALPHA | UCASE_ALPHA | DIGIT | '_' )* ;

CHARACTER   :   PRINTABLE_CHAR
        |   '\n' | '\t' | EOF ;

NUMBER      :   ( DIGIT )+ ;

STRING      :   '\"' ( ~( '"' | '\n' | '\r' | 't' ) )* '\"' ;

WS      :   ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+    { $channel = HIDDEN; } ;

fragment 
DIGIT       :   '0'..'9' ;

fragment
LCASE_ALPHA :   'a'..'z' ;

fragment
NONALPHA_CHAR   :   '`' | '~' | '!' | '@' | '#' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-'
        |   '_' | '+' | '=' | '{' | '[' | '}' | ']' | '|' | '\\' | ';' | ':' | '\''
        |   '\\"' | '<' | ',' | '>' | '.' | '?' | '/' ; 

fragment
PRINTABLE_CHAR  :   LCASE_ALPHA | UCASE_ALPHA | DIGIT | NONALPHA_CHAR ;
fragment
UCASE_ALPHA :   'A'..'Z' ;

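【Comments】:

Could you post enough of the grammar so that it properly parses the SmallC snippet you posted?
Dear Bart - I have added a lot more code. I did not want to post the whole thing because it would be very long. If you think you might have an idea and do not mind me sending it to you personally, let me know and I will send it.
@Geoffroy, cheers. I may have a look at it tomorrow: I'm turning in for now!
@Geoffroy, there is still not much for me to go on. I need something I can copy and paste that shows the behaviour you mention. Perhaps the easiest is to post everything, because I see a few other things in the snippet you posted that could cause problems (but I can't be sure without seeing more). It does not matter if it is big: it will be cropped, just as the grammar you already posted is cropped vertically.
@Geoffroy, forget ANTLRWorks for the moment. What does the command java -cp antlr-3.2.jar org.antlr.Tool SmallC.g produce? My guess is that errors or warnings are produced, in which case you can't predict what ANTLRWorks' interpreter will cough up.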

Tags: whitespace antlr antlrworks

【Solution 1】:

From the command line, I do get a warning:

java -cp antlr-3.2.jar org.antlr.Tool SmallC.g 
warning(200): SmallC.g:182:37: Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

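But that does not prevent the lexer/parser from being generated.

Anyway, the problem: ANTLR's lexer tries to match the first lexer rule it encounters in the grammar file and, if that rule cannot match the token, it trickles down to the next lexer rule. You have defined the CHARACTER rule before the WS rule, and both match the character \n. That is why it did not work under Linux: the \n was tokenized as a CHARACTER. If you define the WS rule before the CHARACTER rule, everything works properly: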

// other rules ...

WS
  :  ('\t' | ' ' | '\r' | '\n' | '\u000C')+ { $channel = HIDDEN; } 
  ;

CHARACTER   
  :  PRINTABLE_CHAR | '\n' | '\t' | EOF 
  ;

// other rules ...
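To make the effect of rule order concrete, here is a toy tokenizer in plain Java. It is only a simplified model (longest match wins, and on a tie the rule listed first wins) and not how ANTLR's generated lexer is actually implemented. The class `RuleOrderDemo` and the regular expressions standing in for WS and CHARACTER are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleOrderDemo {

    /** Builds a toy rule table; insertion order stands in for grammar order. */
    static Map<String, Pattern> rules(boolean wsFirst) {
        Pattern ws = Pattern.compile("[\\t \\r\\n\\f]+");
        // Simplified stand-in for CHARACTER: one printable char, '\n' or '\t'.
        Pattern character = Pattern.compile("[\\x21-\\x7E\\n\\t]");
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (wsFirst) {
            rules.put("WS", ws);
            rules.put("CHARACTER", character);
        } else {
            rules.put("CHARACTER", character);
            rules.put("WS", ws);
        }
        return rules;
    }

    /** Returns the rule names chosen for the input, in order. */
    static List<String> tokenize(String input, boolean wsFirst) {
        Map<String, Pattern> rules = rules(wsFirst);
        List<String> names = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strict '>' keeps the earlier rule when lengths are equal.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestName = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            names.add(bestName);
            pos += bestLen;
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\n", false));   // CHARACTER first, lone LF -> [CHARACTER]
        System.out.println(tokenize("\r", false));   // CHARACTER first, lone CR -> [WS]
        System.out.println(tokenize("\n\n", false)); // CHARACTER first, two LFs -> [WS]
        System.out.println(tokenize("\n", true));    // WS first, lone LF        -> [WS]
    }
}
```

In this model a lone \n becomes a CHARACTER when CHARACTER is listed first, whereas a \r, or a longer run of whitespace, is still claimed by WS. That is at least consistent with the symptoms in the question: CR line endings work, and an extra newline or trailing space hides the problem.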

Running the test class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source = 
                "#include \"lib\"\n" + 
                "main() {\n" + 
                "   int bob;\n" + 
                "}\n";
        ANTLRStringStream in = new ANTLRStringStream(source);
        SmallCLexer lexer = new SmallCLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        SmallCParser parser = new SmallCParser(tokens);
        SmallCParser.program_return returnValue = parser.program();
        CommonTree tree = (CommonTree)returnValue.getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}

produces an AST (image not shown) without any error messages.

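But you should fix the grammar warning, and remove \n from the CHARACTER rule: once WS is defined first, \n can never be matched by CHARACTER.

One more thing: you are mixing a lot of keywords into your parser rules without defining them explicitly as lexer rules. Because lexer rules are first come, first served, that is tricky: you do not want 'if' to be accidentally tokenized as an IDENT. It is better to do it like this: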

IF : 'if';
IDENT : 'a'..'z' ... ; // After the `IF` rule!
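The same first-come-first-served idea can be sketched for keywords in plain Java. Again this is only a toy model of rule priority, not ANTLR's real algorithm; `KeywordOrderDemo`, `classify` and the two patterns are hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class KeywordOrderDemo {

    /** Returns the name of the first listed rule that matches the whole word. */
    static String classify(String word, boolean keywordFirst) {
        Pattern keyword = Pattern.compile("if");
        Pattern ident = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

        // Insertion order stands in for the order of rules in the grammar.
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (keywordFirst) {
            rules.put("IF", keyword);
            rules.put("IDENT", ident);
        } else {
            rules.put("IDENT", ident);
            rules.put("IF", keyword);
        }

        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            if (rule.getValue().matcher(word).matches()) {
                return rule.getKey();
            }
        }
        return "NO_MATCH";
    }

    public static void main(String[] args) {
        System.out.println(classify("if", true));    // IF
        System.out.println(classify("if", false));   // IDENT
        System.out.println(classify("iffy", true));  // IDENT
    }
}
```

With the keyword rule listed first, "if" is classified as IF while longer names such as "iffy" still fall through to IDENT; with IDENT listed first, the keyword rule is never reached.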

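【Comments】:

Bart, thank you very much, this does indeed fix the problem. So I guess it was an ambiguity problem after all! And thanks for pointing out the other anomalies in my code. Best regards, Geoffroy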