How can I ignore comments in a string based on compiler design? (Answers)

【Question title】: How can I ignore comments in a string based on compiler design?
【Posted】: 2021-11-22 17:08:01
【Question description】:

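I want to ignore every comment, such as { comments } and // comments. I have a pointer called peek that walks through my string one character at a time. I know how to ignore newlines, tabs, and spaces, but I don't know how to ignore comments.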

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    comment\n
}\n end
"""

tokens = []
tmp = ''

for peek in string.lower():
    if peek == ' ' or peek == '\n':
        # ignoring whitespace (comments are not handled yet)
        if len(tmp) > 0:
            tokens.append(tmp)
            print(tmp)

        tmp = ''

    else:
        tmp += peek

This is my result:

begin
west
west
north//
comment1
north
north
west
east
east
south
{
comment2
}
end

As you can see, whitespace is ignored, but comments are not.

How can I get the following result?

begin
west
west
north
north
north
west
east
east
south
end

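【Comments】:

For a basic implementation, check whether the current character is {; if it is, ignore everything until the closing }, then carry on as usual.
@AndrewMcClement That is exactly the problem: I don't know how to implement this step. I don't know whether I should use a regex or something else. Another question is what I should do if the input string has multi-line comments.
Just use if peek == '{': skip = True elif peek == '}': skip = False and run the rest of your if/else inside if not skip: ... your code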

Tags: python compiler-construction lexical-analysis

【Solution 1】:

Simply use a global variable skip = False: set it to True when you get {, set it to False when you get }, and run the rest of the if/else inside if not skip:

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    comment\n
}\n end
"""

tokens = []
tmp = ''
skip = False

for peek in string.lower():

    if peek == '{':
        skip = True
    elif peek == '}':
        skip = False
    elif not skip:

        if peek == ' ' or peek == '\n':
            # ignoring whitespace; text inside { } never reaches this branch
            if len(tmp) > 0:
                tokens.append(tmp)
                print(tmp)
            tmp = ''
        else:
            tmp += peek

Because you may have nested { { } }, like

{\n
    { comment1 }\n
    comment2\n
    { comment3 }\n
}\n

it is better to use skip to count the { }:

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    { comment1 }\n
    comment2\n
    { comment3 }\n
}\n end
"""

tokens = []
tmp = ''
skip = 0

for peek in string.lower():

    if peek == '{':
        skip += 1
    elif peek == '}':
        skip -= 1
    elif not skip:  # elif skip == 0:

        if peek == ' ' or peek == '\n':
            # ignoring whitespace; text inside { } never reaches this branch
            if len(tmp) > 0:
                tokens.append(tmp)
                print(tmp)
            tmp = ''
        else:
            tmp += peek

But it might be better to turn everything into tokens and then filter the tokens. I skipped this idea, though.
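The tokenize-then-filter idea can be sketched with the standard re module. This is only an illustration under stated assumptions, not the answer's own code: the token names (LINE_COMMENT, BLOCK_COMMENT, NAME, WS, ERROR) and the helper functions tokenize and significant are hypothetical, and the block-comment pattern assumes { } comments are not nested.

```python
import re

# Hypothetical token names; alternatives are tried left to right at each position.
TOKEN_RE = re.compile(r"""
    (?P<LINE_COMMENT>//[^\n]*)          # two slashes up to the end of the line
  | (?P<BLOCK_COMMENT>\{[^{}]*\})       # brace comment, assumed not nested
  | (?P<NAME>[A-Za-z_][A-Za-z0-9_]*)    # identifiers and keywords
  | (?P<WS>\s+)                         # spaces, tabs, newlines
  | (?P<ERROR>.)                        # any other single character
""", re.VERBOSE)


def tokenize(text):
    """Return every lexeme as a (kind, value) pair, comments included."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def significant(tokens):
    """Second pass: drop whitespace and comment tokens, lowercase the rest."""
    skip = {"WS", "LINE_COMMENT", "BLOCK_COMMENT"}
    return [value.lower() for kind, value in tokens if kind not in skip]


text = "beGIn west//comment1\nnorth { comment } East\n// comment west\nend\n"
print(significant(tokenize(text)))
# ['begin', 'west', 'north', 'east', 'end']
```

Because a whole comment becomes a single token, the filter only has to look at the token kind and never at the individual characters.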

EDIT:

A version using the Python module sly, which is similar to the C/C++ tools lex/yacc.

I found the regex for MULTI_LINE_COMMENT in another tool for building parsers, lark:

syntax for multiline comments

from sly import Lexer

class MyLexer(Lexer):
    # Create it before defining the regexes for the tokens
    tokens = { NAME, ONE_LINE_COMMENT, MULTI_LINE_COMMENT }

    ignore = ' \t'

    # Tokens (raw strings avoid invalid escape sequences such as '\/')
    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    ONE_LINE_COMMENT = r'//.*'
    MULTI_LINE_COMMENT = r'\{(.|\n)*\}'

    # Ignored pattern
    ignore_newline = r'\n+'

    # Extra action for newlines
    def ignore_newline(self, t):
        self.lineno += t.value.count('\n')

    # Work with errors
    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

if __name__ == '__main__':
    
    text =  """  beGIn west   WEST north//comment1 
north       north west East east south
// comment west
{
    { comment1 }
    comment2
    { comment3 }
}
 end
"""
    
    lexer = MyLexer()
    tokens = lexer.tokenize(text)
    for item in tokens:
        print(item.type, ':', item.value)

Result:

NAME : beGIn
NAME : west
NAME : WEST
NAME : north
ONE_LINE_COMMENT : //comment1 
NAME : north
NAME : north
NAME : west
NAME : East
NAME : east
NAME : south
ONE_LINE_COMMENT : // comment west
MULTI_LINE_COMMENT : {
    { comment1 }
    comment2
    { comment3 }
}
NAME : end
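One caveat that may be worth checking: the (.|\n)* part of the MULTI_LINE_COMMENT pattern is greedy, so when the input contains two separate { } comments it matches from the first { to the last }, including any tokens in between. The sketch below shows this with the standard re module only (it does not use sly). The alternative pattern FLAT is an assumption that comments do not nest; a single regular expression of this kind cannot track arbitrary nesting depth.

```python
import re

# Equivalent to the MULTI_LINE_COMMENT pattern in the lexer
GREEDY = re.compile(r'\{(.|\n)*\}')
# Assumption: comments are not nested, so a comment body contains no braces
FLAT = re.compile(r'\{[^{}]*\}')

text = "north { c1 } west { c2 } end"

# The greedy pattern swallows "west" together with both comments.
print(GREEDY.search(text).group())
# { c1 } west { c2 }

# The flat pattern finds each comment separately.
print([m.group() for m in FLAT.finditer(text)])
# ['{ c1 }', '{ c2 }']

# Replacing each comment with a space leaves only the real words.
print(FLAT.sub(' ', text).split())
# ['north', 'west', 'end']
```

For nested comments, a depth counter like the skip variable in the loop-based version is the simpler tool.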

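【Discussion】:

Maybe it would be better to turn everything into tokens and filter the tokens later. I skipped this idea, though.
The lexer uses peek as a pointer: when peek meets a letter, it moves forward until it meets something that is neither a letter nor a digit, and that is how a token is made; then we have to check whether the token is in the symbol table. I had your idea in mind too, but I'm not sure it is correct. Suppose we have a string like start west// north; if we tokenize it, we get start, west, //, north as tokens. I don't think any compiler works like that.
What do you think about ignoring comments that start with //? peek checks one character at a time, but now we have to check two characters, and it has to ignore characters until the end of the line! @furas
I think a compiler could tokenize west, // north as two elements and then simply drop // north. I never write compilers from scratch; I would rather use the C/C++ tools lex/yacc, or the Python modules ply or sly, which work similarly to lex/yacc. But I have never tried tokenizing comments.
I added an example with the module sly that recognizes ONE_LINE_COMMENT and MULTI_LINE_COMMENT. It simply uses regex for this.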
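
The two-character check for // raised in this discussion can also be done by hand, with an index instead of a for loop, so the scanner can look one character ahead and jump to the end of the line. This is a minimal sketch rather than code from the thread: the function name scan and the helper flush are hypothetical, and it assumes that { } comments may nest and that a word directly followed by a comment ends at the comment.

```python
def scan(text):
    """Split text into lowercase words, skipping // and nested { } comments."""
    tokens = []
    tmp = ''
    depth = 0          # current nesting depth of { } comments
    i = 0
    n = len(text)

    def flush():
        # Emit the pending word, if any.
        nonlocal tmp
        if tmp:
            tokens.append(tmp.lower())
            tmp = ''

    while i < n:
        peek = text[i]
        if depth > 0:
            # Inside a block comment: only the braces matter.
            if peek == '{':
                depth += 1
            elif peek == '}':
                depth -= 1
        elif peek == '{':
            flush()
            depth = 1
        elif peek == '/' and i + 1 < n and text[i + 1] == '/':
            flush()
            # One character of lookahead confirmed "//": skip to the end of
            # the line. The newline itself is handled on the next iteration.
            while i < n and text[i] != '\n':
                i += 1
            continue
        elif peek.isspace():
            flush()
        else:
            tmp += peek
        i += 1

    flush()
    return tokens


sample = (
    "  beGIn west   WEST north//comment1 \n"
    "north       north west East east south\n"
    "// comment west\n"
    "{\n"
    "    { comment1 }\n"
    "    comment2\n"
    "    { comment3 }\n"
    "}\n"
    " end\n"
)
print(scan(sample))
# ['begin', 'west', 'west', 'north', 'north', 'north', 'west', 'east', 'east', 'south', 'end']
```

This gives the word list requested in the question, without regular expressions or a lexer generator.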