Parsing Challenge - Fixing Broken Syntax: Answers

【Question title】: Parsing Challenge - Fixing Broken Syntax
【Posted】: 2018-12-20 22:43:05
【Question】:

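I have thousands of lines of code that use a particular non-standard syntax. I need to compile the code with a different compiler that does not support this syntax. I have tried to automate the necessary changes, but I am not very good with regular expressions and the like, and I have failed.

Here is what I am trying to achieve. Currently in my code, an object's methods and variables are called/accessed with the following possible syntaxes: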

call obj.method()
obj.method( )
obj.method( arg1, arg2, kwarg1=kwarg1 )
obj1.var = obj2.var2

Instead, I would like it to be:

call obj%method()
obj%method( )
obj%method( arg1, arg2, kwarg1=kwarg1 )
obj1%var = obj2%var2

And I would like to make these changes without affecting the following other possible occurrences of ".":

Decimal numbers:

a = 1.0
b = 1.d0

Logical operators (note the possible whitespace and the method call):

if (a.or.b) then
    if ( a .and. .not.(obj.l1(1.d0)) ) then

Anything that is commented out (the exclamation mark "!" is used for this purpose):

!>I am a commented line.
   ! > I am.a commented line with..leading blanks and extra periods.1.
b=a1.var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )

Anything inside quotes (i.e. string literals):

c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '

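Does anyone know how to approach this? I suppose regular expressions are the natural tool, but I am open to anything. (In case anyone cares: the code is written in Fortran. ifort is happy with the "." syntax; gfortran is not.)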

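【Question comments】:

Put all of your input examples into one file, so that we have a single input file to test with, and provide the expected output for that file. Add a more complicated example that must be handled, e.g. c="I am a string!"; b=a1.var() would be tricky, because in that case the ! is not the start of a comment.
Isolating the method calls is simple. The hard part is matching things like obj1.var = obj2.var2 without also matching b = 1.d0. I am not sure you can write a pattern compact enough to change what you want without changing more than you want.
Maybe you could try doing it in two steps.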
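The difficulty raised in these comments (matching obj1.var but not 1.d0) can be illustrated with a short Python sketch. It assumes that an identifier always starts with a letter, so a digit can never be the left-hand side of a match; the names MEMBER and replace_members are made up for this illustration. It is only the first of the "two steps": it knows nothing about logical operators, strings, or comments, so on its own it would also rewrite a.or.b.

```python
import re

# Assumption: an identifier starts with a letter and continues with letters,
# digits, or underscores. MEMBER and replace_members are hypothetical names.
MEMBER = re.compile(r"\b([A-Za-z][A-Za-z0-9_]*)\.([A-Za-z][A-Za-z0-9_]*)\b")


def replace_members(code: str) -> str:
    """Rewrite obj.name as obj%name, leaving numeric literals alone."""
    previous = None
    while previous != code:  # repeat so chains like a.b.c are fully rewritten
        previous = code
        code = MEMBER.sub(r"\1%\2", code)
    return code


print(replace_members("obj1.var = obj2.var2"))  # obj1%var = obj2%var2
print(replace_members("b = 1.d0"))              # b = 1.d0
```

A second step would have to hide comments, string literals, and logical operators before this substitution runs, which is what both solutions below do in their own way.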

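Tags: regex parsing sed

【Solution 1】:

Have you considered using flex for this? It works with regular expressions but goes a step further: at each position it tries all of the patterns and takes the longest match (if two rules match the same length, the one listed first wins). The rules could look like this: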

%%                                           /* rule part of the program */
!.*\n                     ECHO;              /* copy comments unchanged */
\"[^"\n]*\"|'[^'\n]*'     ECHO;              /* copy string literals unchanged */
[^A-Za-z_][0-9]+\.        ECHO;              /* copy numbers unchanged */
".and."|".or."|".not."    ECHO;              /* copy logical operators unchanged */
\.                        printf("%%");      /* now, replace the . by % */
[^\.]                     ECHO;              /* copy everything else unchanged */

%%                                           /* user code: run the scanner */
int main() {
    yylex();
    return 0;
}

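You may need to adapt the third rule. As written, it leaves untouched any . that follows a run of digits, provided the character in front of those digits is not A-Z, a-z, or _. If your identifiers may contain other characters, add them to that class.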
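To see what the number rule does and does not catch, its pattern can be transcribed into Python's re module. This is only an approximation (re.search finds the first match and does not emulate how flex chooses between rules), and the identifier obj12 is made up for the illustration. It suggests one case worth checking in your own code: because the first character class also accepts a digit, an identifier that ends in two or more digits looks like a number to this rule.

```python
import re

# Python transcription of the flex pattern [^A-Za-z_][0-9]+\.
# NUMBER_RULE is a hypothetical name used only in this sketch.
NUMBER_RULE = re.compile(r"[^A-Za-z_][0-9]+\.")

for text in [" 1.0", "(1.d0", "a1.var", "obj2.var2", "obj12.var"]:
    match = NUMBER_RULE.search(text)
    print(repr(text), "->", repr(match.group(0)) if match else None)

# ' 1.0' -> ' 1.'
# '(1.d0' -> '(1.'
# 'a1.var' -> None
# 'obj2.var2' -> None
# 'obj12.var' -> '12.'
```

If such identifiers occur, the rule would need an extra condition on what precedes the digits.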

If everything is correct, you should be able to turn the rules into a program. Copy them into a file called lex.l and run:

$ flex -o lex.yy.c lex.l
$ gcc -o lex.out lex.yy.c -lfl

That gives you the executable lex.out, which you can use on the command line:

cat unreplaced.txt | ./lex.out > replaced.txt

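This uses the same principle as Ed Morton's suggestion, but because it uses flex we can skip the manual bookkeeping. It will still fail in some cases, for example when a string contains \".

Sample input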

call obj.method()
obj.method( )
obj.method( arg1, arg2, kwarg1=kwarg1 )
obj1.var = obj2.var2
a = 1.0
b = 1.d0
if (a.or.b) then
    if ( a .and. .not.(obj.l1(1.d0)) ) then
!>I am a commented line.
   ! > I am.a commented line with..leading blanks and extra periods.1.
b=a1.var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1.var()

Output

call obj%method()
obj%method( )
obj%method( arg1, arg2, kwarg1=kwarg1 )
obj1%var = obj2%var2
a = 1.0
b = 1.d0
if (a.or.b) then
    if ( a .and. .not.(obj%l1(1.d0)) ) then
!>I am a commented line.
   ! > I am.a commented line with..leading blanks and extra periods.1.
b=a1%var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1%var()
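For readers without flex at hand, the same rule-per-token idea can be sketched in Python with a single alternation and a replacement callback. This is an illustration under assumptions, not a port of the scanner above: Python picks the first alternative that matches rather than the longest, so the order of the alternatives stands in for flex's rule priority; the list of dotted operators is an assumption about which intrinsic operators the code uses; and the names TOKEN and dots_to_percent are made up.

```python
import re

# Alternatives are tried left to right at each position. All names here are
# hypothetical; the operator list is an assumption and may need extending.
TOKEN = re.compile(
    r"""
      (?P<comment>!.*)                        # comment: runs to end of line
    | (?P<string>"[^"\n]*"|'[^'\n]*')         # string literal
    | (?P<number>(?<![A-Za-z0-9_])[0-9]+\.)   # digits plus dot, not part of a name
    | (?P<logical>\.(?i:and|or|not|eqv|neqv|eq|ne|lt|le|gt|ge|true|false)\.)
    | (?P<dot>\.)                             # any remaining dot
    """,
    re.VERBOSE,
)


def dots_to_percent(source: str) -> str:
    """Replace member-access dots with %, copying other tokens unchanged."""
    def convert(match):
        return "%" if match.lastgroup == "dot" else match.group(0)
    return TOKEN.sub(convert, source)


sample = '''if ( a .and. .not.(obj.l1(1.d0)) ) then
b=a1.var( 0.d0 ) !! b=a1.var( 0.d0 )
c="I am an exclaimed string!"; b=a1.var()'''
print(dots_to_percent(sample))
```

Like the flex version, this sketch does not handle every case: a user-defined operator such as .myop. or a string that spans a continuation line would need additional alternatives.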

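【Comments】:

@EdMorton Thanks for the idea. I have just added it, and it does give the same output.

【Solution 2】:

Without a parser for the language you cannot do this 100% robustly. For example, the following will fail in some cases if you have \" inside a double-quoted string; that is easy to handle, but it is just one of many possible failures not covered by your use cases. Still, this handles what you have shown us so far and a bit more. It uses GNU awk for gensub() and for the third argument to match().

Sample input: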

$ cat file
call obj.method()
obj.method( )
obj.method( arg1, arg2, kwarg1=kwarg1 )
obj1.var = obj2.var2
a = 1.0
b = 1.d0
if (a.or.b) then
    if ( a .and. .not.(obj.l1(1.d0)) ) then
!>I am a commented line.
   ! > I am.a commented line with..leading blanks and extra periods.1.
b=a1.var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1.var()

Expected output:

$ cat out
call obj%method()
obj%method( )
obj%method( arg1, arg2, kwarg1=kwarg1 )
obj1%var = obj2%var2
a = 1.0
b = 1.d0
if (a.or.b) then
    if ( a .and. .not.(obj%l1(1.d0)) ) then
!>I am a commented line.
   ! > I am.a commented line with..leading blanks and extra periods.1.
b=a1%var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1%var()

Script:

$ cat tst.awk
{
    # give us the ability to use @<any other char> strings as a
    # replacement/placeholder strings that cannot exist in the input.
    gsub(/@/,"@=")

    # ignore all !s inside double-quoted strings
    while ( match($0,/("[^"]*)!([^"]*")/,a) ) {
        $0 = substr($0,1,RSTART-1) a[1] "@-" a[2] substr($0,RSTART+RLENGTH)
    }

    # ignore all !s inside single-quoted strings
    while ( match($0,/('[^']*)!([^']*')/,a) ) {
        $0 = substr($0,1,RSTART-1) a[1] "@-" a[2] substr($0,RSTART+RLENGTH)
    }

    # Now we can separate comments from what comes before them
    comment = gensub(/[^!]*/,"",1)
    $0      = gensub(/!.*/,"",1)

    # ignore all .s inside double-quoted strings
    while ( match($0,/("[^"]*)\.([^"]*")/,a) ) {
        $0 = substr($0,1,RSTART-1) a[1] "@#" a[2] substr($0,RSTART+RLENGTH)
    }

    # ignore all .s inside single-quoted strings
    while ( match($0,/('[^']*)\.([^']*')/,a) ) {
        $0 = substr($0,1,RSTART-1) a[1] "@#" a[2] substr($0,RSTART+RLENGTH)
    }

    # convert all logical operators like a.or.b to a@#or@#b so the .s won't get replaced later
    while ( match($0,/\.([[:alpha:]]+)\./,a) ) {
        $0 = substr($0,1,RSTART-1) "@#" a[1] "@#" substr($0,RSTART+RLENGTH)
    }

    # convert all obj.var and similar to obj%var, etc.
    while ( match($0,/\<([[:alpha:]]+[[:alnum:]_]*)[.]([[:alpha:]]+[[:alnum:]_]*)\>/,a) ) {
        $0 = substr($0,1,RSTART-1) a[1] "%" a[2] substr($0,RSTART+RLENGTH)
    }

    # Convert all @#s in the precomment text back to .s
    gsub(/@#/,".")

    # Add the comment back
    $0 = $0 comment

    # Convert all @-s back to !s
    gsub(/@-/,"!")

    # Convert all @=s back to @s
    gsub(/@=/,"@")

    print
}
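The step that separates the comment from the code in front of it is the part most easily got wrong, because a ! inside a string literal does not start a comment. As an alternative illustration (not a translation of the awk script), here is a minimal Python sketch that scans a line character by character and tracks whether it is inside a quoted string. The name split_comment is made up, and the sketch assumes that string literals do not continue across lines.

```python
from typing import Tuple


def split_comment(line: str) -> Tuple[str, str]:
    """Split a line into (code, comment) at the first ! outside a string."""
    quote = None  # the quote character of the string we are inside, if any
    for index, char in enumerate(line):
        if quote is not None:
            if char == quote:      # closing quote ends the string
                quote = None
        elif char in "\"'":        # opening quote starts a string
            quote = char
        elif char == "!":          # first ! outside any string
            return line[:index], line[index:]
    return line, ""


print(split_comment('c="I am an exclaimed string!"; b=a1.var()'))
print(split_comment("b=a1.var( 0.d0 ) !! note"))
```

A doubled quote inside a string, such as 'it''s', is handled as a side effect: the scanner sees the string close and immediately reopen. The code part can then be rewritten on its own and the comment appended unchanged.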

Running the script and its output:

$ awk -f tst.awk file
call obj%method()
obj%method( )
obj%method( arg1, arg2, kwarg1=kwarg1 )
obj1%var = obj2%var2
a = 1.0
b = 1.d0
if (a.or.b) then
    if ( a .and. .not.(obj%l1(1.d0)) ) then
!>I am a commented line.
   ! > I am.a commented line with..leading blanks and extra periods.1.
b=a1%var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1%var()
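The placeholder scheme that tst.awk relies on (turn every @ into @= first, restore it last) is what guarantees that the markers @# and @- cannot collide with text already present in the input. A small Python sketch of just that round trip follows; the function names and the sample line are made up, and the blanket replace of . and ! merely stands in for the script's string-only masking.

```python
def escape_at(text: str) -> str:
    """Turn every @ into @= so that @# and @- cannot occur in the input."""
    return text.replace("@", "@=")


def restore(text: str) -> str:
    """Undo the placeholders; @= has to be restored last."""
    return text.replace("@#", ".").replace("@-", "!").replace("@=", "@")


line = 'mail = "user@#example.org!"'
escaped = escape_at(line)  # the literal @# in the input becomes @=#
masked = escaped.replace(".", "@#").replace("!", "@-")  # stand-in masking
print(masked)
print(restore(masked) == line)
```

Restoring @= before the other two markers would turn the escaped @=# back into @# too early, and the next replacement would then wrongly convert it into a dot.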
