Regular expressions to search and replace decimal dashes with a normal dash in Perl? (Answers)

【Question title】: Regular expressions to search and replace decimal dashes with a normal dash in perl?
【Posted】: 2015-09-15 09:36:18
【Question description】:

I currently need a regular expression that finds every |–| and replaces it with |-|. I am already replacing |`| with |'|, and that works using:

while($_ =~ s/`/'/g)
{
  print "Line: '$.'. ";
  print "Found '$&'. ";
}

However, the same approach does not work for any of my attempts below:

while($_ =~ s/\–/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}

while($_ =~ s/\&#8211/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}

while($_ =~ s/\&ndash/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}
while($_ =~ s/\–/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}

while($_ =~ s/&#8211/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}

while($_ =~ s/&ndash/-/g)
{
  print "Line: '$.'. ";
  print "Found '$&'.\n";
}

The current script is as follows:

#!/usr/bin/perl
use strict;
use warnings;
my $FILE;
my $filename = 'NoDodge.c';

open($FILE,"<service.c") or die "File not opened";
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
while (<$FILE>)
{
  while($_ =~ s/`/'/g)
  {
    print "Line: '$.'. ";
    print "Found '$&'. ";
  }
  while($_ =~ s/\&#8211/-/g)
  {
    print "Line: '$.'. ";
    print "Found '$&'.\n";
  }
  print $fh $_;
}
close $fh;
print "\nCompleted\n";

Sample of the current output:

Line: '152'. Found '`'.

Line: '162'. Found '`'.

Completed

Solution, provided by Borodin:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open qw/ :std :encoding(utf8) /;

my $FILE;
my $fh;
my $readfile = 'service.c';
my $writefile = 'NoDodge.c';

open($FILE,'<',$readfile) or die qq{Unable to open "$readfile" for input: $!};
open($fh, '>',$writefile) or die qq{Unable to open "$writefile" for output: $!};
while (<$FILE>)
{
  while(s/–/-/g)
  {
    print "Found: $& on Line: $.\n";
  }

  while(s/`/'/g)
  {
    print "Found: $& on Line: $.\n";
  }

  print $fh $_;
}
close $fh;
close $FILE;
print "\nService Migrated to $writefile\n";

Sample output:

Found: - on Line: 713

Found: ` on Line: 713

Found: - on Line: 724

Found: ` on Line: 724

Found: ` on Line: 794

Service Migrated to NoDodge.c

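【Comments on the question】:

There is no need for $i to count line numbers. You can use $., which holds the current line number of the last-read filehandle. See perlvar.
Thanks simbabque, I'll take a look.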
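
To illustrate the comment about `$.`, here is a minimal sketch that numbers lines without a manual counter. The sample text and the in-memory filehandle are assumptions made only to keep the example self-contained; with a real file the loop is the same.

```perl
use strict;
use warnings;

# Hypothetical sample input, read through an in-memory filehandle.
my $text = "first\nsecond\nthird\n";
open my $fh, '<', \$text or die "Unable to open in-memory file: $!";

my @numbered;
while (my $line = <$fh>) {
    chomp $line;
    # $. is the line number of the most recently read filehandle.
    push @numbered, "$.: $line";
}
close $fh;

print "$_\n" for @numbered;
```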

Tags: regex perl encoding utf-8

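【Solution 1】:

You need to add use utf8 at the top of your program; otherwise Perl sees the individual bytes of the en dash's UTF-8 encoding (E2 80 93) in the pattern rather than a single character. There is also no need to name $_ as the target of the substitution, because it is the default, and you don't need to escape the dash, because it is not a special character in a regex pattern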

use utf8;

...

while( s/–/-/g ) { ... }
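
The byte-versus-character distinction can be sketched as follows. The sample string is an assumption, and `Encode::decode` stands in for what an `:encoding` layer does when reading a file. The source stays pure ASCII, so it does not need `use utf8`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of EN DASH (U+2013) is the three bytes E2 80 93.
my $bytes = "a \xE2\x80\x93 b";         # what an undecoded read returns
my $chars = decode('UTF-8', $bytes);    # the same text as characters

# Three separate bytes before decoding, one character afterwards.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# A character-level pattern only matches the decoded string.
my $hits_raw     = () = $bytes =~ /\x{2013}/g;
my $hits_decoded = () = $chars =~ /\x{2013}/g;
print "matches in raw bytes: $hits_raw, in decoded text: $hits_decoded\n";
```

This prints `bytes: 7, characters: 5` and then `matches in raw bytes: 0, in decoded text: 1`. That is the symptom in the question: the pattern and the data have to agree on whether they are bytes or characters.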

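Or you may prefer to use the character's Unicode name, which makes it obvious at a glance what you are replacing. In that case you don't need use utf8, as long as you name every non-ASCII character instead of using it literally, like this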

while( s/\N{EN DASH}/-/g ) { ... }
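
A small self-contained sketch of the named-character form follows. The sample text is hypothetical. `use charnames ':full'` is only required on Perl versions older than 5.16 and is harmless on newer ones. `\x{2013}` is the same character written by code point.

```perl
use strict;
use warnings;
use charnames ':full';   # only required on Perl older than 5.16

# Sample text built from escapes, so this source file stays pure ASCII.
my $line = "pages 10\x{2013}12 and 20\N{EN DASH}25";

# s///g returns the number of replacements it made.
my $count = ($line =~ s/\N{EN DASH}/-/g);

print "replaced $count: $line\n";
```

This prints `replaced 2: pages 10-12 and 20-25`. Because `s///g` already reports how many replacements were made, the count can be used directly if a per-line total is all that is needed.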

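You also need to open the files, both input and output, with UTF-8 encoding. The simplest way is to make UTF-8 the default mode. You can add this line near the top of your program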

use open qw/ :std :encoding(utf8) /;

Or you can open each file explicitly with UTF-8 encoding, like this

my $filename = 'NoDodge.c';

open my $in_fh, '<:encoding(utf8)', 'service.c'
        or die qq{Unable to open "service.c" for input: $!};

open my $out_fh, '>:encoding(utf8)', $filename
        or die qq{Unable to open "$filename" for output: $!};
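
Putting the pieces together, a minimal round-trip sketch might look like the one below. It uses in-memory filehandles with hypothetical content in place of service.c and NoDodge.c, so it runs without any files on disk. It also spells the layer `:encoding(UTF-8)`, the strict variant, where the answer uses `utf8`.

```perl
use strict;
use warnings;

# Stand-in for the bytes of the input file: the second line holds an
# EN DASH stored as its UTF-8 bytes E2 80 93 (hypothetical content).
my $input_bytes  = "int a = 1;\n/* range 1 \xE2\x80\x93 5 */\n";
my $output_bytes = '';

open my $in_fh, '<:encoding(UTF-8)', \$input_bytes
    or die "Unable to open in-memory input: $!";
open my $out_fh, '>:encoding(UTF-8)', \$output_bytes
    or die "Unable to open in-memory output: $!";

while (my $line = <$in_fh>) {
    # The input layer has decoded the bytes, so U+2013 is one character here.
    my $n = ($line =~ s/\x{2013}/-/g);
    print "Replaced $n en dash(es) on line $.\n" if $n;
    print {$out_fh} $line;
}

close $in_fh;
close $out_fh or die "Unable to close output: $!";

print $output_bytes;
```

With real files, the two `open` calls would take the file names in place of the scalar references, as in the snippet above.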

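【Discussion】:

Hi, I have implemented the above, but the regex still doesn't find the en dashes in the service.c file. I have also changed my editor's encoding to UTF-8.
@Andrew: Are you opening the file with the correct encoding? Is it a UTF-8 file? Can you post the file on pastebin.com and put a link here?
Unfortunately the file contains confidential information, so I can't post it, but I can assure you it is UTF-8 encoded.
@Andrew: OK, and did you open it as a UTF-8 file?
@Andrew: If your sample code is correct, then you aren't even opening the files as UTF-8. I have added to my solution to explain how to do that.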
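
When a file cannot be shared, one way to check what it really contains is to dump the non-ASCII bytes in hexadecimal. The sketch below is only an illustration: `non_ascii_hex` is a hypothetical helper, and the sample strings stand in for lines read from a file opened with the `:raw` layer. A UTF-8 en dash shows up as `E2 80 93`. A lone `96` would suggest the file is actually Windows-1252, where that byte is the en dash.

```perl
use strict;
use warnings;

# Return each run of non-ASCII bytes in $bytes as a hex string.
sub non_ascii_hex {
    my ($bytes) = @_;
    my @found;
    while ($bytes =~ /([^\x00-\x7F]+)/g) {
        push @found, join ' ', map { sprintf('%02X', ord($_)) } split(//, $1);
    }
    return @found;
}

# Hypothetical sample lines, given as raw bytes.
my @utf8_dash   = non_ascii_hex("1 \xE2\x80\x93 5");   # UTF-8 en dash
my @single_byte = non_ascii_hex("1 \x96 5");           # one-byte en dash

print "UTF-8 sample:       @utf8_dash\n";
print "single-byte sample: @single_byte\n";
```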