【发布时间】:2011-02-15 20:56:24
【问题描述】:
考虑一下这个简短的 SmallC 程序:
#include "lib"
main() {
int bob;
}
如果我在 ANTLWorks 中以及在使用解释器时指定行尾 ->“Mac (CR)”,我的 ANTLR 语法就可以正常使用。如果我将行尾选项设置为 Unix (LF),则语法会抛出 NoViableAltException 并且在 include 语句结束后不识别任何内容。如果我在包含末尾添加换行符,此错误就会消失。我为此使用的计算机是 Mac,所以我认为必须将行尾设置为 Mac 格式是有意义的。因此,我改用 Linux 机器——得到了同样的结果。如果我在 ANTLRWorks Interpreter 框中键入任何内容,并且如果我没有选择行尾 Mac (CR),则会出现上述情况下空行不足的问题,此外,每个语句块的最后一条语句都需要分号后面的额外空格(即,在 bob; 之后)。
当我在要解析的代码输入文件上运行 Java 版本的语法时,这些错误再次出现......
可能是什么问题?我会理解问题是否存在太多新行,格式可能解析器不理解/没有被我的空格规则捕获。但是在这种情况下,这是缺少新行的问题。
我的空格声明如下:
WS : ( '\t' | ' ' | '\r' | '\n' )+ { $channel = HIDDEN; } ;
或者,这可能是由于模棱两可的问题吗?
这是完整的语法文件(请随意忽略前几个块,它们会覆盖 ANTLR 的默认错误处理机制:
grammar SmallC;
options {
output = AST ; // Set output mode to AST
}
tokens {
DIV = '/' ;
MINUS = '-' ;
MOD = '%' ;
MULT = '*' ;
PLUS = '+' ;
RETURN = 'return' ;
WHILE = 'while' ;
// The following are empty tokens used in AST generation
ARGS ;
CHAR ;
DECLS ;
ELSE ;
EXPR ;
IF ;
INT ;
INCLUDES ;
MAIN ;
PROCEDURES ;
PROGRAM ;
RETURNTYPE ;
STMTS ;
TYPEIDENT ;
}
@members {
// Force error throwing, and make sure we don't try to recover from invalid input.
// The exceptions are handled in the FrontEnd class, and gracefully end the
// compilation routine after displaying an error message.
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MismatchedTokenException(ttype, input);
}
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)throws RecognitionException {
throw e;
}
protected Object recoverFromMismatchedToken(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MissingTokenException(ttype, input, null);
}
// We override getErrorMessage() to include information about the specific
// grammar rule in which the error happened, using a stack of nested rules.
Stack paraphrases = new Stack();
public String getErrorMessage(RecognitionException e, String[] tokenNames) {
String msg = super.getErrorMessage(e, tokenNames);
if ( paraphrases.size()>0 ) {
String paraphrase = (String)paraphrases.peek();
msg = msg+" "+paraphrase;
}
return msg;
}
// We override displayRecognitionError() to specify a clearer error message,
// and to include the error type (ie. class of the exception that was thrown)
// for the user's reference. The idea here is to come as close as possible
// to Java's exception output.
public void displayRecognitionError(String[] tokenNames, RecognitionException e)
{
String exType;
String hdr;
if (e instanceof UnwantedTokenException) {
exType = "UnwantedTokenException";
} else if (e instanceof MissingTokenException) {
exType = "MissingTokenException";
} else if (e instanceof MismatchedTokenException) {
exType = "MismatchedTokenException";
} else if (e instanceof MismatchedTreeNodeException) {
exType = "MismatchedTreeNodeException";
} else if (e instanceof NoViableAltException) {
exType = "NoViableAltException";
} else if (e instanceof EarlyExitException) {
exType = "EarlyExitException";
} else if (e instanceof MismatchedSetException) {
exType = "MismatchedSetException";
} else if (e instanceof MismatchedNotSetException) {
exType = "MismatchedNotSetException";
} else if (e instanceof FailedPredicateException) {
exType = "FailedPredicateException";
} else {
exType = "Unknown";
}
if ( getSourceName()!=null ) {
hdr = "Exception of type " + exType + " encountered in " + getSourceName() + " at line " + e.line + ", char " + e.charPositionInLine + ": ";
} else {
hdr = "Exception of type " + exType + " encountered at line " + e.line + ", char " + e.charPositionInLine + ": ";
}
String msg = getErrorMessage(e, tokenNames);
emitErrorMessage(hdr + msg + ".");
}
}
// Force the parser not to try to guess tokens and resume on faulty input,
// but rather display the error, and throw an exception for the program
// to quit gracefully.
@rulecatch {
catch (RecognitionException e) {
reportError(e);
throw e;
}
}
/*------------------------------------------------------------------
* PARSER RULES
*
* Many of these make use of ANTLR's rewrite rules to allow us to
* specify the roots of AST sub-trees, and to allow us to do away
* with certain insignificant literals (like parantheses and commas
* in lists) and to add empty tokens to disambiguate the tree
* construction
*
* The @init and @after definitions populate the paraphrase
* stack to allow us to specify which grammar rule we are in when
* errors are found.
*------------------------------------------------------------------*/
args
@init { paraphrases.push("in these procedure arguments"); }
@after { paraphrases.pop(); }
: ( typeident ( ',' typeident )* )? -> ^( ARGS ( typeident ( typeident )* )? )? ;
body
@init { paraphrases.push("in this procedure body"); }
@after { paraphrases.pop(); }
: '{'! decls stmtlist '}'! ;
decls
@init { paraphrases.push("in these declarations"); }
@after { paraphrases.pop(); }
: ( typeident ';' )* -> ^( DECLS ( typeident )* )? ;
exp
@init { paraphrases.push("in this expression"); }
@after { paraphrases.pop(); }
: lexp ( ( '>' | '<' | '>=' | '<=' | '!=' | '==' )^ lexp )? ;
factor : '(' lexp ')'
| ( MINUS )? ( IDENT | NUMBER )
| CHARACTER
| IDENT '(' ( IDENT ( ',' IDENT )* )? ')' ;
lexp : term ( ( PLUS | MINUS )^ term )* ;
includes
@init { paraphrases.push("in the include statements"); }
@after { paraphrases.pop(); }
: ( '#include' STRING )* -> ^( INCLUDES ( STRING )* )? ;
main
@init { paraphrases.push("in the main method"); }
@after { paraphrases.pop(); }
: 'main' '(' ')' body -> ^( MAIN body ) ;
procedure
@init { paraphrases.push("in this procedure"); }
@after { paraphrases.pop(); }
: ( proc_return_char | proc_return_int )? IDENT^ '('! args ')'! body ;
procedures : ( procedure )* -> ^( PROCEDURES ( procedure)* )? ;
proc_return_char
: 'char' -> ^( RETURNTYPE CHAR ) ;
proc_return_int : 'int' -> ^( RETURNTYPE INT ) ;
// We hard-code the regex (\n)* to fix a bug whereby a program would be accepted
// if it had 0 or more than 1 new lines before EOF but not if it had exactly 1,
// and not if it had 0 new lines between components of the following rule.
program : includes decls procedures main EOF ;
stmt
@init { paraphrases.push("in this statement"); }
@after { paraphrases.pop(); }
: '{'! stmtlist '}'!
| WHILE '(' exp ')' s=stmt -> ^( WHILE ^( EXPR exp ) $s )
| 'if' '(' exp ')' s=stmt ( options {greedy=true;} : 'else' s2=stmt )? -> ^( IF ^( EXPR exp ) $s ^( ELSE $s2 )? )
| IDENT '='^ lexp ';'!
| ( 'read' | 'output' | 'readc' | 'outputc' )^ '('! IDENT ')'! ';'!
| 'print'^ '('! STRING ( options {greedy=true;} : ')'! ';'! )
| RETURN ( lexp )? ';' -> ^( RETURN ( lexp )? )
| IDENT^ '('! ( IDENT ( ','! IDENT )* )? ')'! ';'!;
stmtlist : ( stmt )* -> ^( STMTS ( stmt )* )? ;
term : factor ( ( MULT | DIV | MOD )^ factor )* ;
// We divide typeident into two grammar rules depending on whether the
// ident is of type 'char' or 'int', to allow us to implement different
// rewrite rules in each case.
typeident : typeident_char | typeident_int ;
typeident_char : 'char' s2=IDENT -> ^( CHAR $s2 ) ;
typeident_int : 'int' s2=IDENT -> ^( INT $s2 ) ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// Must come before CHARACTER to avoid ambiguity ('i' matches both IDENT and CHARACTER)
IDENT : ( LCASE_ALPHA | UCASE_ALPHA | '_' ) ( LCASE_ALPHA | UCASE_ALPHA | DIGIT | '_' )* ;
CHARACTER : PRINTABLE_CHAR
| '\n' | '\t' | EOF ;
NUMBER : ( DIGIT )+ ;
STRING : '\"' ( ~( '"' | '\n' | '\r' | 't' ) )* '\"' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;
fragment
DIGIT : '0'..'9' ;
fragment
LCASE_ALPHA : 'a'..'z' ;
fragment
NONALPHA_CHAR : '`' | '~' | '!' | '@' | '#' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-'
| '_' | '+' | '=' | '{' | '[' | '}' | ']' | '|' | '\\' | ';' | ':' | '\''
| '\\"' | '<' | ',' | '>' | '.' | '?' | '/' ;
fragment
PRINTABLE_CHAR : LCASE_ALPHA | UCASE_ALPHA | DIGIT | NONALPHA_CHAR ;
fragment
UCASE_ALPHA : 'A'..'Z' ;
【问题讨论】:
-
您能否发布足够多的语法,以便正确解析您发布的 SmallC 代码 sn-p?
-
亲爱的 Bart - 我添加了大量的代码。我不想发布整个事情,因为那会很长。如果您认为您可能有想法并且不介意我亲自将其发送给您,请告诉我,我会发送给您。
-
@Geoffroy,干杯。明天我可能会看看它:我现在要开枪了!
-
@Geoffroy,对我来说还有很多事情要继续。我需要一些可以复制和粘贴的东西来显示你提到的行为。也许最简单的方法是发布所有内容,因为我在您发布的 sn-p 中看到了一些其他可能导致问题的内容(但如果没有看到更多内容则无法确定)。它是否很大并不重要:它会被裁剪,就像您已经发布的语法被垂直裁剪一样。
-
@Geoffroy,暂时忘记 ANTLRWorks。命令
java -cp antlr-3.2.jar org.antlr.Tool SmallC.g产生什么?我的猜测是产生了错误或警告,在这种情况下,您无法预测 ANTLRWorks 的解释器会咳出什么。
标签: whitespace antlr antlrworks