ANTLR4 解析器问题答案

【问题标题】：ANTLR4 Parser issuesANTLR4 解析器问题
【发布时间】：2020-10-19 13:21:29
【问题描述】：

我正在尝试为 c++ 样式的头文件编写解析器，但未能正确配置解析器。

词法分析器：

lexer grammar HeaderLexer;

SectionLineComment
    :   LINE_COMMENT_SIGN Section CharacterSequence
    ;

Pragma
    : POUND 'pragma'
    ;

Section
    :  AT_SIGN 'section'
    ;

Define
    : POUND 'define'
    | LINE_COMMENT_SIGN POUND 'define'
    ;

Booleanliteral
   : False
   | True
   ;

QuotedCharacterSequence
    :   '"' .*?  '"'
    ;

ArraySequence
    :   '{' .*?  '}'
    |   '[' .*?  ']'
    ;

IntNumber
    :   Digit+
    ;

DoubleNumber
    :   Digit+ POINT Digit+
    |   ZERO POINT Digit+
    ;

CharacterSequence
    :   Text+
    ;

Identifier
    :   [a-zA-Z_0-9]+
    ;

BlockComment
    : '/**' .*? '*/'
    ;

LineComment
    :   LINE_COMMENT_SIGN ~[\r\n]*
    ;

EmptyLineComment
    :   LINE_COMMENT_SIGN -> skip
    ;

Newline
    :   (   '\r' '\n'?
        |   '\n'
        )
        -> skip
    ;

WhiteSpace
   : [ \r\n\t]+ -> skip;

fragment POUND : '#';
fragment AT_SIGN : '@';
fragment LINE_COMMENT_SIGN : '//';
fragment POINT : '.';
fragment ZERO : '0';

fragment Digit
    :   [0-9]
    ;

fragment Text
    :   [a-zA-Z0-9.]
    ;


fragment False
   : 'false'
   ;

fragment True
   : 'true'
   ;

解析器：

parser grammar HeaderParser;

options { tokenVocab=HeaderLexer; }

compilationUnit: statement* EOF;

statement
    : comment? pragmaDirective
    | comment? defineDirective
    | section
    | comment
    ;

pragmaDirective
    :   Pragma CharacterSequence
    ;

defineDirective
    :   Define Identifier Booleanliteral LineComment?
    |   Define Identifier DoubleNumber LineComment?
    |   Define Identifier IntNumber LineComment?
    |   Define Identifier CharacterSequence LineComment?
    |   Define Identifier QuotedCharacterSequence LineComment?
    |   Define Identifier ArraySequence LineComment?
    |   Define Identifier
    ;

section: SectionLineComment;

comment
    : BlockComment
    | LineComment+
    ;

要解析的文本：

/**
 * BLOCK COMMENT
 */
#pragma once

/**
 * BLOCK COMMENT
 */
#define CONFIGURATION_H_VERSION 12345

#define IDENTIFIER abcd
#define IDENTIFIER_1 abcd
#define IDENTIFIER_1 abcd.dd

#define IDENTIFIER_2 true // Line
#define IDENTIFIER_20 {ONE, TWO} // Line
#define IDENTIFIER_20_30   { 1, 2, 3, 4 }
#define IDENTIFIER_20_30_A   [ 1, 2, 3, 4 ]
#define DEFAULT_A 10.0

//================================================================
//============================= INFO =============================
//================================================================

/**
 * SEPARATE BLOCK COMMENT
 */

//==================================================================
//============================= INFO ===============================
//==================================================================
// Line 1
// Line 2
//

// @section test

// Line 3
#define IDENTIFIER_TWO "(ONE, TWO, THREE)" // Line 4
//#define IDENTIFIER_3 Version.h // Line 5

// Line 6
#define IDENTIFIER_THREE

使用这个配置我有几个问题：

解析器无法正确解析第 11 行的“#define IDENTIFIER abcd”
“//@section test”第36行被解析为行注释，但我需要将其解析为单独的token
注释的define指令的解析不起作用“//#define IDENTIFIER_3 Version.h //第5行”

【问题讨论】：

标签： parsing antlr4

【解决方案1】：

只要在解析时出现问题，就应该检查词法分析器生成的标记类型。

这是您的词法分析器生成的标记：

BlockComment              `/**\n * BLOCK COMMENT\n */`
Pragma                    `#pragma`
CharacterSequence         `once`
BlockComment              `/**\n * BLOCK COMMENT\n */`
Define                    `#define`
Identifier                `CONFIGURATION_H_VERSION`
IntNumber                 `12345`
Define                    `#define`
CharacterSequence         `IDENTIFIER`
CharacterSequence         `abcd`
Define                    `#define`
Identifier                `IDENTIFIER_1`
CharacterSequence         `abcd`
Define                    `#define`
Identifier                `IDENTIFIER_1`
CharacterSequence         `abcd.dd`
Define                    `#define`
Identifier                `IDENTIFIER_2`
Booleanliteral            `true`
LineComment               `// Line`
Define                    `#define`
Identifier                `IDENTIFIER_20`
ArraySequence             `{ONE, TWO}`
LineComment               `// Line`
Define                    `#define`
Identifier                `IDENTIFIER_20_30`
ArraySequence             `{ 1, 2, 3, 4 }`
Define                    `#define`
Identifier                `IDENTIFIER_20_30_A`
ArraySequence             `[ 1, 2, 3, 4 ]`
Define                    `#define`
Identifier                `DEFAULT_A`
DoubleNumber              `10.0`
LineComment               `//================================================================`
LineComment               `//============================= INFO =============================`
LineComment               `//================================================================`
BlockComment              `/**\n * SEPARATE BLOCK COMMENT\n */`
LineComment               `//==================================================================`
LineComment               `//============================= INFO ===============================`
LineComment               `//==================================================================`
LineComment               `// Line 1`
LineComment               `// Line 2`
LineComment               `//`
LineComment               `// @section test`
LineComment               `// Line 3`
Define                    `#define`
Identifier                `IDENTIFIER_TWO`
QuotedCharacterSequence   `"(ONE, TWO, THREE)"`
LineComment               `// Line 4`
LineComment               `//#define IDENTIFIER_3 Version.h // Line 5`
LineComment               `// Line 6`
Define                    `#define`
Identifier                `IDENTIFIER_THREE`

正如您在上面的列表中看到的，#define IDENTIFIER abcd 没有被正确解析，因为它产生了以下标记：

Define                    `#define`
CharacterSequence         `IDENTIFIER`
CharacterSequence         `abcd`

因此可以不匹配解析器规则：

defineDirective
    :   ...
    |   Define Identifier CharacterSequence LineComment?
    |   ...
    ;

如您所见，词法分析器独立于解析器运行。无论解析器是否尝试为文本 "IDENTIFIER" 匹配一个 Identifier，词法分析器都会为此生成一个 CharacterSequence 标记。

词法分析器仅根据 2 条规则创建标记：

尝试匹配尽可能多的字符
如果 2 个（或更多）词法分析器规则可以匹配相同的字符，则首先定义的规则“获胜”

由于上述规则，//#define IDENTIFIER_3 Version.h // Line 5 被标记为LineComment（适用规则 1：尽可能匹配）。并且像once 这样的输入被标记为CharacterSequence 而不是Identifier（适用规则2：CharacterSequence 在Identifier 之前定义）

要在评论内外对#define 进行相同处理，您可以使用lexical modes。每当词法分析器看到// 时，它就会进入一个特殊的注释模式，一旦进入这种注释模式，您还将识别#define 和@section 标记。当您看到其中一个标记时（或者当您看到换行符时），您就会退出此模式。

快速演示：

lexer grammar HeaderLexer;

SPACES          : [ \r\n\t]+ -> skip;
COMMENT_START   : '//' -> pushMode(COMMENT_MODE);
PRAGMA          : '#pragma';
SECTION         : '@section';
DEFINE          : '#define';
BOOLEAN_LITERAL :  'true' | 'false';
STRING          : '"' .*? '"';
IDENTIFIER      : [a-zA-Z_] [a-zA-Z_0-9]*;
BLOCK_COMMENT   : '/**' .*? '*/';
OTHER           : .;
NUMBER          : [0-9]+ ('.' [0-9]+)?;
CHAR_SEQUENCE   : [a-zA-Z_] [a-zA-Z_0-9.]*;
ARRAY_SEQUENCE  : '{' .*?  '}' | '[' .*?  ']';

mode COMMENT_MODE;

  // If we match one of the followinf 3 rules, leave this comment mode
  COMMENT_MODE_DEFINE     : '#define' -> type(DEFINE), popMode;
  COMMENT_MODE_SECTION    : '@section' -> type(SECTION), popMode;
  COMMENT_MODE_LINE_BREAK : [\r\n]+ -> skip, popMode;

  // If none of the 3 rules above matched, consume a single
  // character (which is part of the comment)
  COMMENT_MODE_PART       : ~[\r\n];

然后解析器可能如下所示：

parser grammar HeaderParser;

options { tokenVocab=HeaderLexer; }

compilationUnit
 : statement* EOF
 ;

statement
 : comment? pragmaDirective
 | comment? defineDirective
 | sectionLineComment
 | comment
 ;

pragmaDirective
 :   PRAGMA char_sequence
 ;

defineDirective
 : DEFINE IDENTIFIER BOOLEAN_LITERAL line_comment?
 | DEFINE IDENTIFIER NUMBER line_comment?
 | DEFINE IDENTIFIER char_sequence line_comment?
 | DEFINE IDENTIFIER STRING line_comment?
 | DEFINE IDENTIFIER ARRAY_SEQUENCE line_comment?
 | DEFINE IDENTIFIER
 ;

sectionLineComment
 : COMMENT_START COMMENT_MODE_PART? SECTION char_sequence
 ;

comment
 : BLOCK_COMMENT
 | line_comment
 ;

line_comment
 : COMMENT_START COMMENT_MODE_PART*
 ;

char_sequence
 : CHAR_SEQUENCE
 | IDENTIFIER
 ;

【讨论】：

感谢您的回答。它有效，但未能在 2 种情况下生成正确的语句： 1. 对于文本 code #define DEFAULT_A 10.0 //================================================================ //============================= INFO ============================= //================================================================ code`，第一行注释在 DEFINE 语句中显示为标记，但目的是分隔不同的行 cmets提到 DEFINE 的行。 2. NUMBER 词法分析器无法标记单个数字
不客气。我不太明白问题出在哪里：这些评论框不太适合问答。我确信我的简短演示不会处理您所有的案例，并且在修复您现在提到的事情之后，我相信会有更多的极端案例。正如我在回答中解释的那样：首先检查创建了哪些令牌。先尝试自己解决。如果遇到困难，请创建一个新问题并详细说明问题所在。
好的，如果我将其他词法分析器移动到 ARRAY_SEQUENCE 下方，它似乎可以解决一位数的问题，但我不确定这是否是正确的方法。
是的，我会先尝试自己解决问题，如果我无法解决问题，我会创建另一个问题主题作为单独的线程。谢谢！