【Question title】: Perl Regex: Matching From Start of File to Pattern
【Posted】: 2015-11-13 07:34:14
【Question description】:

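I have an XML file containing many HTTP responses, including their HTTP headers. I want to write each response to its own file, containing only the content and not the headers. I am struggling to strip the HTTP headers from the start of each response without mangling the rest of it.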

#!/usr/bin/perl
use XML::Simple;
use MIME::Base64;
use URI::Escape;

#CheckArgs
....
my $input = $ARGV[0];

# Parse XML
my $xml = new XML::Simple;
my $data = $xml->XMLin("$input");

# Iterate through the file
for (my $i=0; $i < @{$data->{item}}; $i++){ 
    my $status = $data->{item}[$i]->{status};
    my $path = $data->{item}[$i]->{path};
    if ($status != "200") {
        print "Skipping $path due to status of $status\n";
        next;
    }
    print "$status $path\n";
    my $filename = uri_escape($path);
    # The Content is Base64 Encoded
    my $encoded = $data->{item}[$i]->{response}->{content};
    my $decoded = decode_base64($encoded);

    # Remove HTTP headers
    $decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//gm; 
    open(IMGFILE, "> $filename") or die("Can't open $filename: ".$!);
    binmode IMGFILE;
    print IMGFILE $decoded;
    close IMGFILE;
}

Before the search and replace, $decoded looks like this:

HTTP/1.1 200 OK
Server: nginx
Date: Thu, 12 Nov 2025 20:79:99 GMT
Content-Type: application/pdf
Content-Length: 88151
Last-Modified: Mon, 14 Sep 2025 20:79:99 GMT
Connection: keep-alive
ETag: "123123-123546"
Expires: Thu, 19 Nov 2025 20:79:99 GMT
Cache-Control: max-age=123456
Accept-Ranges: bytes


%PDF-1.6
%âãÏÓ
54 0 obj
<< 
/Linearized 1 
/O 56 
/H [ 720 305 ] 
/L 45164 
/E 7644 
/N 10 
/T 43966 
>> 
endobj
[Lots more binary and text]

So I am trying to match from the start of the file up to and including the first occurrence of two consecutive line breaks:

$decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//m;
# s => Search Replace
# ^ => Start of file
# (.*?) => Non-greedy match anything including \r and \n
# ((\r\n)|\n|\r){2} => two new lines 
# // => Replace with empty string
# m multiline to allow . to match \r\n

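After a lot of experimenting with the regex I still cannot get the result I want. For the example above, I want my new file to start with the characters %PDF-1.6, and those characters and everything after them should be left untouched. Note that the PDF file is only an example; I want this to work for many other file types as well.

Edit 1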

$decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//m; 
# matches \r\n due to or. So Try
$decoded =~ s/^(.*?)((\r\n)|([^\r]\n)|(\r[^\n])){2}//m;

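【Comments】:

((\r\n)|\n|\r){2} is wrong because it can match a single \r\n line break (\r on the first repetition, \n on the second). Change it to (?:\n\n|\r\n?\r\n?)
How about s/^.*?\R{2,}//s
@Borodin Post that as an answer and I'll give you the internet points! (It worked, thanks!)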
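
The point made in the comment above can be checked directly. The following is a minimal sketch, not part of the original thread; the variable names are made up for illustration. It tests both patterns against a string that holds exactly one CRLF.

```perl
use strict;
use warnings;

# One CRLF line ending, i.e. NOT a blank line.
my $single_crlf = "\r\n";

# Pattern from the question: the {2} is satisfied by \r, then \n.
my $loose = $single_crlf =~ /\A((\r\n)|\n|\r){2}\z/ ? 1 : 0;

# Alternative from the comment: needs two real line endings.
my $strict = $single_crlf =~ /\A(?:\n\n|\r\n?\r\n?)\z/ ? 1 : 0;

# The stricter pattern still accepts a genuine blank line.
my $strict_blank = "\r\n\r\n" =~ /\A(?:\n\n|\r\n?\r\n?)\z/ ? 1 : 0;

print "loose=$loose strict=$strict strict_blank=$strict_blank\n";
```

This is why the original substitution stopped after the first header line instead of at the blank line that ends the headers.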

Tags: regex perl

【Solution 1】:

m multiline to allow . to match \r\n

The /m modifier only affects the ^ and $ anchors. You need /s to allow . to match LF.
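
To make the difference concrete, here is a small sketch (the sample string and variable names are illustrative assumptions, not from the question) that runs the same capture with no modifier, with /m, and with /s.

```perl
use strict;
use warnings;

my $text = "first\nsecond";

# No modifier: . stops at the line feed.
my ($no_flags) = $text =~ /^(.*)/;

# /m changes only the anchors; . still stops at the line feed.
my ($with_m) = $text =~ /^(.*)/m;

# /s lets . match the line feed, so the capture spans both lines.
my ($with_s) = $text =~ /^(.*)/s;

# What /m actually does: ^ also matches right after a newline.
my @line_starts = $text =~ /^(\w)/mg;

printf "no flags: %d, /m: %d, /s: %d, line starts: %s\n",
    length($no_flags), length($with_m), length($with_s),
    join(',', @line_starts);
```

The first two captures are both "first"; only the /s version reaches "second".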

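((\r\n)|\n|\r){2} => two new lines

There is already an escape sequence for this: \R (available since Perl 5.10) matches one generic line ending, whether it is CRLF, LF, or CR.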
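
As a quick illustration of \R, the sketch below (assuming Perl 5.10 or later; the test strings are invented for the example) checks that each common line-ending style is matched as exactly one line ending.

```perl
use strict;
use warnings;

# The three common line-ending styles.
my @endings = ("\r\n", "\n", "\r");

# Each one is matched in full by a single \R.
my @matched = grep { /\A\R\z/ } @endings;

# Count line endings in a string that mixes all three styles.
my $mixed = "a\r\nb\nc\rd";
my $count = () = $mixed =~ /\R/g;

print scalar(@matched), " styles matched, $count line endings found\n";
```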

I would suggest something like

$decoded =~ s/^.*?\R{2,}//s;

which will do what you want.
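
For a self-contained check of the suggested substitution, here is a hedged sketch. The helper name strip_http_headers and the shortened sample responses are assumptions made for illustration; they do not come from the original post. It uses \A rather than ^ only to make the "start of string" intent explicit; without /m the two behave the same here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the response with its header block removed.
sub strip_http_headers {
    my ($response) = @_;
    # /s lets .*? cross line endings; \R{2,} is the blank line after the headers.
    (my $body = $response) =~ s/\A.*?\R{2,}//s;
    return $body;
}

# Shortened sample responses, one with CRLF and one with LF line endings.
my $crlf = "HTTP/1.1 200 OK\r\nServer: nginx\r\n"
         . "Content-Type: application/pdf\r\n\r\n%PDF-1.6\r\n54 0 obj\r\n";
my $lf   = "HTTP/1.1 200 OK\nServer: nginx\n\n%PDF-1.6\n54 0 obj\n";

my $crlf_body = strip_http_headers($crlf);
my $lf_body   = strip_http_headers($lf);

print "CRLF body starts with: ", substr($crlf_body, 0, 8), "\n";
print "LF body starts with:   ", substr($lf_body, 0, 8), "\n";
```

One caveat of the {2,} quantifier: if the body itself begins with line-ending bytes, they are consumed too. Using \R{2} instead would stop right after the blank line.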

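【Comments】:

Thanks a lot, here are some internet points! Though it looks like you already have quite a few :)
@DavidWaters: Glad to help. Ignore the numbers, I'm just an ordinary guy who happens to know a bit about programming!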