【Question title】: Regular expression for parsing similar assembler instructions
【Posted】: 2013-09-29 10:33:24
【Question】:

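The intro is a bit long, so please bear with me. :)

I am writing a simple regex-based parser for a large source file written in assembler. Most of the instructions are just moves, adds, subtracts and jumps, but it is a very large file that I need to port to two different languages, and I am too lazy to do it by hand. This is the requirement and there is nothing I can do about it (so please don't answer with "why don't you simply use ANTLR").

So, after some preprocessing (I have already done this part: replaced defines and macros and stripped redundant whitespace and comments), I now basically have to read the file line by line and parse one or possibly several lines into an "intermediate" instruction, which I will then use to generate a more or less 1-to-1 equivalent (using plain integer arithmetic and a bunch of GOTOs).

So, suppose I can have all these different addressing modes:

I can go two different ways:

  1. Have a single MOV regex that handles all of these cases, or
  2. Have multiple MOV regexes, one per instruction variant. The problem with this approach is that I have to design each regex very carefully to avoid any ambiguity. There also seems to be a lot of repetition, since the source and destination operands share many addressing modes.

My question is: if I have a single regex for all the variants, how should I specify my groups and captures so that I can easily tell the different modes apart?

Or should I simply capture everything and then process the source/destination operands after the initial match?

For example, a fairly simple match-everything regex would be: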

^MOV\s+(?<dest>[^\s,]+)[\s,]*(?<src>[^\s,]+)$

(split into multiple lines with comments):

^MOV              (?#instruction)
\s+               (?#some whitespace)
(?<dest>[^\s,]+)  (?#match everything except whitespace and comma)
\s*,\s*           (?#match comma, allow some whitespace)
(?<src>[^\s,]+)   (?#match everything except whitespace and comma)$

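So, I can certainly do this and then process the dest and src groups separately. But would it be better to create one nasty, complex regex that matches every addressing-mode case? In that case, I am not sure how I would interpret the captures to find out which addressing mode was matched.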
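The "capture everything, then classify each operand" option could look roughly like the following. This is a minimal sketch, not a definitive implementation: the `AddressingMode` enum, the `Classify` helper, and the register syntax `R0`, `R1`, ... are assumptions made for illustration, since the actual register names are not given here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical set of addressing modes, named for illustration only.
enum AddressingMode { Register, Immediate, Indirect, IndirectPostIncrement, IndirectPreDecrement, Unknown }

static class MovParser
{
    // The catch-all pattern from the question, in its commented multi-line form.
    // IgnorePatternWhitespace lets the pattern span several lines.
    static readonly Regex Mov = new Regex(@"
        ^MOV              (?#instruction)
        \s+               (?#some whitespace)
        (?<dest>[^\s,]+)  (?#everything except whitespace and comma)
        \s*,\s*           (?#comma, optional whitespace)
        (?<src>[^\s,]+)   (?#everything except whitespace and comma)
        $", RegexOptions.IgnorePatternWhitespace);

    // One small, unambiguous pattern per addressing mode.
    // Assumption: registers are written R<number>.
    static AddressingMode Classify(string operand)
    {
        if (Regex.IsMatch(operand, @"^R\d+$")) return AddressingMode.Register;
        if (Regex.IsMatch(operand, @"^#\w+$")) return AddressingMode.Immediate;
        if (Regex.IsMatch(operand, @"^\[R\d+\]$")) return AddressingMode.Indirect;
        if (Regex.IsMatch(operand, @"^\[R\d+\+\]$")) return AddressingMode.IndirectPostIncrement;
        if (Regex.IsMatch(operand, @"^\[-R\d+\]$")) return AddressingMode.IndirectPreDecrement;
        return AddressingMode.Unknown;
    }

    static void Main()
    {
        foreach (var line in new[] { "MOV R1, #0x10", "MOV [R2+], [R3]", "MOV [-R0], R4" })
        {
            Match m = Mov.Match(line);
            if (!m.Success) continue;

            string dest = m.Groups["dest"].Value;
            string src = m.Groups["src"].Value;
            Console.WriteLine($"{line}: dest={Classify(dest)}, src={Classify(src)}");
        }
    }
}
```

Because source and destination share the same `Classify` helper, the addressing-mode patterns are written only once, which avoids the repetition mentioned in option 2.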

I am using C#, if that makes any difference.

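【Comments】:

  • Don't always try to solve problems like this with pure regex alone; in practice you need more preprocessing and postprocessing beyond what the regex output gives you.
  • Why not just parse from top to bottom? Why do you need a regex?
  • Hm, your comments make sense; I might be overthinking this. The thing is, I already have a bunch of regexes for the other instructions, which was quicker than iterating over characters by hand, but for instructions with multiple variants (like MOV) it is probably better to simply match the opcode and then parse the rest with a few if clauses.
  • Port to *what* other two languages? Two other assemblers? Similar instruction sets? Similar syntax? What does it mean to "port" individual instructions?

Tags: c# regex parsing assembly abstract-syntax-tree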


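【Answer 1】:

You are discovering what happens when you try to bring a lexer to a parser's job. I think most of your difficulty comes from trying to do too much with regexes.

And yes, I am going to recommend a parser generator such as ANTLR or similar.

If you go that route, you write many small regexes that recognize tokens ("MOV", "#", "[", ...), and then you write a grammar that defines how those tokens combine into instructions. If nothing else, this makes the parsing part much easier to write.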
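To illustrate the "many small regexes for tokens" idea without committing to ANTLR, here is a hand-rolled tokenizer sketch in C#. The token kinds, the mnemonics other than MOV, and the `R<number>` register syntax are assumptions for illustration; a parser generator would normally produce this layer for you.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AsmLexer
{
    // One small pattern per token kind. \G anchors each match at the
    // current scan position, so no characters are silently skipped.
    static readonly (string Kind, Regex Pattern)[] Rules =
    {
        ("Whitespace", new Regex(@"\G\s+")),
        ("Mnemonic",   new Regex(@"\G(?:MOV|ADD|SUB|JMP)\b")),   // assumed mnemonic set
        ("Register",   new Regex(@"\GR\d+\b")),                  // assumed register syntax
        ("Number",     new Regex(@"\G(?:0x[0-9A-Fa-f]+|\d+)")),
        ("Punct",      new Regex(@"\G[#\[\],+\-]")),
    };

    public static List<(string Kind, string Text)> Tokenize(string line)
    {
        var tokens = new List<(string Kind, string Text)>();
        int pos = 0;
        while (pos < line.Length)
        {
            bool matched = false;
            foreach (var (kind, pattern) in Rules)
            {
                Match m = pattern.Match(line, pos);
                if (!m.Success) continue;

                // Whitespace is consumed but not reported as a token.
                if (kind != "Whitespace") tokens.Add((kind, m.Value));
                pos += m.Length;
                matched = true;
                break;
            }
            if (!matched)
                throw new FormatException($"Unexpected character '{line[pos]}' at column {pos}");
        }
        return tokens;
    }

    static void Main()
    {
        foreach (var (kind, text) in Tokenize("MOV [R1+], #0x10"))
            Console.WriteLine($"{kind,-9} {text}");
    }
}
```

Each pattern stays trivial, and the question of how tokens combine into operands and instructions moves out of the regexes and into the grammar (or, in a hand-written parser, into a few small functions).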

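You can see what this looks like for assembly code. (It uses a system other than ANTLR, but the idea is the same.) It was really simple to write, with none of the pain of trying to write one regex to rule them all. [I did that example in one evening and have parsed large amounts of source with it.]

You aren't clear about what you mean by "port". I assume you are going to another assembly syntax, if not another machine architecture. To do that, you need access to the individual parts of each instruction (which a single regex for all possible MOV instructions won't give you). That is the beauty of parsing and building a tree: all of those parts are exposed to you, embedded in the structure where they belong. You can even generate a single instruction from several assembly language statements, because the tree contains the entire program. (On a system with gigabytes of RAM, "quite large" doesn't mean much in terms of tree size.)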
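As a rough illustration of why having the individual parts matters, the sketch below models an instruction as a small tree node and emits a statement for a hypothetical target language. Everything here (the `Operand` and `Instruction` classes, the `mem[...]` notation, the lower-case register variables) is an assumption made for the example, not the output format of any particular tool.

```csharp
using System;

enum OperandKind { Register, Immediate, Indirect }

// A leaf of the tree: one operand with its addressing mode exposed.
sealed class Operand
{
    public OperandKind Kind;
    public string Text;   // register name or literal value

    public Operand(OperandKind kind, string text) { Kind = kind; Text = text; }
}

// An instruction node with its parts available separately.
sealed class Instruction
{
    public string Opcode;
    public Operand Dest, Src;

    public Instruction(string opcode, Operand dest, Operand src)
    {
        Opcode = opcode; Dest = dest; Src = src;
    }
}

static class Emitter
{
    // Renders an operand in the (hypothetical) target syntax.
    static string Render(Operand o)
    {
        switch (o.Kind)
        {
            case OperandKind.Register:  return o.Text.ToLowerInvariant();
            case OperandKind.Immediate: return o.Text;
            case OperandKind.Indirect:  return $"mem[{o.Text.ToLowerInvariant()}]";
            default: throw new ArgumentOutOfRangeException(nameof(o));
        }
    }

    // Maps each opcode to a statement; only two opcodes are sketched.
    public static string Emit(Instruction i)
    {
        switch (i.Opcode)
        {
            case "MOV": return $"{Render(i.Dest)} = {Render(i.Src)};";
            case "ADD": return $"{Render(i.Dest)} += {Render(i.Src)};";
            default: throw new NotSupportedException(i.Opcode);
        }
    }

    static void Main()
    {
        var mov = new Instruction("MOV",
            new Operand(OperandKind.Indirect, "R1"),
            new Operand(OperandKind.Immediate, "16"));
        Console.WriteLine(Emit(mov)); // prints "mem[r1] = 16;"
    }
}
```

Supporting a second target language then means writing a second `Emit`, while the parsing side stays untouched.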

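【Comments】:

  • Well, I guess you're right. There is no "right" way to do it the way I am doing it now, other than going in a different direction. No matter, the instruction set is pretty limited, so I have already resorted to parsing the operand types manually after matching binary instructions.
  • I'd say ANTLR is overkill for this problem. Your first approach looks fine to me.
【Answer 2】:

Here is a regex that does almost what you need (you will have to edit it for your actual data format; i.e., instead of all the register labels ax, bx, ... I just used "reg", etc.):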

 (?<Opt1>MOV\s*Rw,\sRw)
|(?<Opt2>MOV\s*Rw,\s\#data4)
|(?<Opt3>MOV\s*Rw,\s\#data16)
|(?<Opt4>MOV\s*Rw,\s\[Rw\])
|(?<Opt5>MOV\s*Rw,\s\[Rw\+\])
|(?<Opt6>MOV\s*\[Rw\],\sRw)
|(?<Opt7>MOV\s*\[-Rw\],\sRw)
|(?<Opt8>MOV\s*\[Rw\],\s\[Rw\])
|(?<Opt9>MOV\s*\[Rw\+\],\s\[Rw\])
|(?<OptA>MOV\s*\[Rw\],\s\[Rw\+\]) 

With this data:

MOV Rw, Rw
MOV Rw, #data4
MOV Rw, #data16
MOV Rw, [Rw]
MOV Rw, [Rw+]
MOV [Rw], Rw
MOV [-Rw], Rw
MOV [Rw], [Rw]
MOV [Rw+], [Rw]
MOV [Rw], [Rw+]

RegexBuddy produces this:

Match 1:    MOV Rw, Rw       0      10
Group "Opt1":   MOV Rw, Rw       0      10
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 2:    MOV Rw, #data4      12      14
Group "Opt1" did not participate in the match
Group "Opt2":   MOV Rw, #data4      12      14
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 3:    MOV Rw, #data16     28      15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3":   MOV Rw, #data16     28      15
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 4:    MOV Rw, [Rw]        45      12
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4":   MOV Rw, [Rw]        45      12
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 5:    MOV Rw, [Rw+]       59      13
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5":   MOV Rw, [Rw+]       59      13
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 6:    MOV [Rw], Rw        74      12
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6":   MOV [Rw], Rw        74      12
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 7:    MOV [-Rw], Rw       88      13
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7":   MOV [-Rw], Rw       88      13
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 8:    MOV [Rw], [Rw]     103      14
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8":   MOV [Rw], [Rw]     103      14
Group "Opt9" did not participate in the match
Group "OptA" did not participate in the match
Match 9:    MOV [Rw+], [Rw]    119      15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9":   MOV [Rw+], [Rw]    119      15
Group "OptA" did not participate in the match
Match 10:   MOV [Rw], [Rw+]    136      15
Group "Opt1" did not participate in the match
Group "Opt2" did not participate in the match
Group "Opt3" did not participate in the match
Group "Opt4" did not participate in the match
Group "Opt5" did not participate in the match
Group "Opt6" did not participate in the match
Group "Opt7" did not participate in the match
Group "Opt8" did not participate in the match
Group "Opt9" did not participate in the match
Group "OptA":   MOV [Rw], [Rw+]    136      15
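In C#, the same "which group participated" information is available through `Group.Success`. The following sketch reuses the pattern above unchanged; the `MatchedOption` helper is a hypothetical name introduced for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class MovVariants
{
    // The alternation from this answer; IgnorePatternWhitespace allows
    // one alternative per line.
    static readonly Regex Mov = new Regex(@"
         (?<Opt1>MOV\s*Rw,\sRw)
        |(?<Opt2>MOV\s*Rw,\s\#data4)
        |(?<Opt3>MOV\s*Rw,\s\#data16)
        |(?<Opt4>MOV\s*Rw,\s\[Rw\])
        |(?<Opt5>MOV\s*Rw,\s\[Rw\+\])
        |(?<Opt6>MOV\s*\[Rw\],\sRw)
        |(?<Opt7>MOV\s*\[-Rw\],\sRw)
        |(?<Opt8>MOV\s*\[Rw\],\s\[Rw\])
        |(?<Opt9>MOV\s*\[Rw\+\],\s\[Rw\])
        |(?<OptA>MOV\s*\[Rw\],\s\[Rw\+\])",
        RegexOptions.IgnorePatternWhitespace);

    // Returns the name of the alternative that matched, or null if none did.
    static string MatchedOption(string line)
    {
        Match m = Mov.Match(line);
        if (!m.Success) return null;

        foreach (string name in Mov.GetGroupNames())
        {
            // Group "0" is the whole match and always succeeds; skip it.
            if (name != "0" && m.Groups[name].Success)
                return name;
        }
        return null;
    }

    static void Main()
    {
        Console.WriteLine(MatchedOption("MOV Rw, [Rw+]"));   // prints "Opt5"
        Console.WriteLine(MatchedOption("MOV [Rw], [Rw+]")); // prints "OptA"
    }
}
```

The group name can then drive a `switch` that builds the corresponding intermediate instruction.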
