Perl URL replace: answers

【Question title】: Perl URL replace
【Posted】: 2017-11-11 17:21:44
【Question】:

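I am trying to accomplish the following task:

Extract all URLs from a text.
If the domain is on the white list, replace the URL with a modified URL.

Here is the code.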

$text = '<a href="http://www.amazon.de/Lenovo-Moto-Smartphone-Android-schwarz/dp/B01FLZC8ZI"><img src="http://www.testurl.de/Sasdfhopr.jpg" width="80%"></a>';

$regex = '(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?';

@whiteList = ("www.amazon.de");

while ($text =~ /$regex/g) {
       # regex result has following groups as matches
       # $1 = scheme
       # $2 = domain
       # $3 = query parameters

       # check if domain is in white list
       if ( grep( /^$2$/, @whiteList ) ) {
           # build new url
           $new = "http://test.xyz.pqr/url=".$1."://".$2.$3;

           # recreate old url
           $old = $1."://".$2.$3;

           # replace it here, but its not replacing
           $text =~ s/$old/$new/g;

           # but as an example replacing 
           # domain name with test, its working. 
           # it appears to be something to with back slash or forward 
           # slashes
           $text =~ s/$2/test/g;
         }
}

print $text;

Any help or hints would be great, as I am new to Perl programming.

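【Comments on the question】:

You are missing use strict; use warnings;.
@melpomene Good point. After adding strict and warnings, I get warnings such as "my" not being used. But the problem is still not solved. Thanks.
@bharatesh: use strict and use warnings 'all' will not magically fix your program. They draw your attention to errors in your code, which you must then fix yourself. Do that, and if you still cannot fix it yourself, post your new code.
@SinanÜnür: "Use Regexp::Common" was my first thought, but after trying to do something similar in the past, I found that Regexp::Common::URI is in a half-finished state. It does not appear to support an authority or a fragment on the URL, and if you use $RE{URI}{HTTP}{-keep}, the captured fields are not documented. I started to assemble a solution for the OP along those lines, but realised that I would need URI as well and quickly gave up. I may write to the maintainers to see whether it can be fixed.
@SinanÜnür: The problem is that URI cannot find URLs in a block of text by itself. As I described, I considered a solution using both Regexp::Common::URI and URI, but it quickly became untenable. It is entirely possible that I made a mistake, but I could not find it, so I emailed the maintainers. If you want to experiment, my example is http://user:pass@www.example.com:88/path?query#fragment, which the module recognises but does not split correctly into parts with the {-keep} option. Even if it did, those parts are not documented.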

Tags: string perl replace

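【Solution 1】:

I would use Regexp::Common together with Regexp::Common::URI to locate the URLs, and URI to parse and transform them.

Your minimal data sample does not help much, but here is a proof of concept using that data.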

use strict;
use warnings 'all';

use Regexp::Common 'URI';
use URI;
use List::Util 'any';

use constant NEW_HOST => 'test.xyz.pqr';

my $text = <<'END';
<a href="http://www.amazon.de/Lenovo-Moto-Smartphone-Android-schwarz/dp/B01FLZC8ZI">
<img src="http://www.testurl.de/Sasdfhopr.jpg" width="80%">
</a>
END

my @white_list = qw/ www.amazon.de /;

$text =~ s{ ( $RE{URI}{HTTP} ) } {
    my $uri = URI->new($1);
    my $host = $uri->host;
    $uri->host(NEW_HOST) if any { $host eq $_ } @white_list;
    $uri->as_string;
}exg;

print $text, "\n";

Output

<a href="http://test.xyz.pqr/Lenovo-Moto-Smartphone-Android-schwarz/dp/B01FLZC8ZI">
<img src="http://www.testurl.de/Sasdfhopr.jpg" width="80%">
</a>
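
The substitution above swaps the host, whereas the question builds a new URL by putting the prefix http://test.xyz.pqr/url= in front of the original one. The following is only a minimal sketch of that variant using core Perl, not the answerer's code: it reuses the pattern string from the question unchanged, does the whole rewrite in a single s///ge pass so the text is never modified while a match loop is running, and looks the host up in a hash. The names %is_white and rewrite_urls are hypothetical and were chosen for this illustration.

```perl
use strict;
use warnings;

# Pattern string copied from the question: $1 = scheme, $2 = host,
# $3 = the rest of the URL (may be undefined).
my $regex = '(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])?';
my $url_re = qr/$regex/;

# Hypothetical white list stored as a hash for exact-match lookups.
my %is_white = map { $_ => 1 } qw( www.amazon.de );

# Return a copy of the text in which every white-listed URL is prefixed.
sub rewrite_urls {
    my ($text) = @_;
    $text =~ s{$url_re}{
        my ($scheme, $host, $rest) = ($1, $2, $3 // '');
        my $old = "$scheme://$host$rest";
        $is_white{$host} ? "http://test.xyz.pqr/url=$old" : $old;
    }ge;
    return $text;
}

my $text = '<a href="http://www.amazon.de/Lenovo-Moto-Smartphone-Android-schwarz/dp/B01FLZC8ZI"><img src="http://www.testurl.de/Sasdfhopr.jpg" width="80%"></a>';

print rewrite_urls($text), "\n";
```

Because the replacement is computed from the captured groups, no second pattern is ever built from the URL itself, so nothing in the URL needs escaping in this variant.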

【Comments】:

The code runs in any online compiler without any problems.
Thanks, I will try running your code, as it looks cleaner.

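【Solution 2】:

The URL in $old contains characters that Perl's regex engine treats as part of the pattern, rather than as literal characters, when you use it in a pattern match.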

$text =~ s/$old/$new/g;

You need to escape those. You can do that with the \Q and \E escapes.

$text =~ s/\Q$old\E/$new/g;

This should fix the problem, assuming the rest of your code works, which I have not tried.
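
To make the effect of \Q and \E visible, here is a small sketch. The URL http://example.com/search?q=a+b is an assumption picked for this illustration because it contains ? and +, which are quantifiers in a pattern. It is not taken from the question. The two substitutions are otherwise identical.

```perl
use strict;
use warnings;

# Illustrative values; the URL is hypothetical and contains the
# regex metacharacters '.', '?' and '+'.
my $text = 'see http://example.com/search?q=a+b for details';
my $old  = 'http://example.com/search?q=a+b';
my $new  = 'http://test.xyz.pqr/url=http://example.com/search?q=a+b';

# Without quoting, '?' and '+' act as quantifiers, so the pattern
# does not match the literal URL and the text is left unchanged.
(my $unquoted = $text) =~ s/$old/$new/g;

# With \Q...\E every character of $old is matched literally.
(my $quoted = $text) =~ s/\Q$old\E/$new/g;

print "unquoted: $unquoted\n";
print "quoted:   $quoted\n";
```

Only the pattern side needs quoting. The replacement side is interpolated as an ordinary string, so $new is inserted as-is. Calling quotemeta($old) beforehand gives the same escaping as \Q...\E.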
