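【Question title】: Perl - Parse text file with tags for data dump into new text file
【Posted】: 2015-01-20 04:46:46
【Question】:

I've received data in a .txt file that I need to format into something I can upload to a database. The text is marked up with tags. Depending on the tag, the data needs to be dumped into a specific txt file, tab-delimited. I have rarely used Perl, but I know it can handle this kind of task easily; I just don't know where to start. Outside of Java, SQL, and R, I'm pretty useless. Here is an example of a single entry (I have nearly 1,000 to process):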

<PaperTitle>True incidence of all complications following immediate and delayed breast reconstruction.</PaperTitle>
<Abstract>BACKGROUND: Improved self-image and psychological well-being after breast reconstruction are well documented. To determine methods that optimized results with minimal morbidity, the authors examined their results and complications based on reconstruction method and timing. METHODS: The authors reviewed all breast reconstructions after mastectomy for breast cancer performed under the supervision of a single surgeon over a 6-year period at a tertiary referral center. Reconstruction method and timing, patient characteristics, and complication rates were reviewed. RESULTS: Reconstruction was performed on 240 consecutive women (94 bilateral and 146 unilateral; 334 total reconstructions). Reconstruction timing was evenly split between immediate (n = 167) and delayed (n = 167). Autologous tissue (n = 192) was more common than tissue expander/implant reconstruction (n = 142), and the free deep inferior epigastric perforator was the most common free flap (n = 124). The authors found no difference in the complication incidence with autologous reconstruction, whether performed immediately or delayed. However, there was a significantly higher complication rate following immediate placement of a tissue expander when compared with delayed reconstruction (p = 0.008). Capsular contracture was a significantly more common late complication following immediate (40.4 percent) versus delayed (17.0 percent) reconstruction (p &lt; 0.001; odds ratio, 5.2; 95 percent confidence interval, 2.3 to 11.6). CONCLUSIONS: Autologous reconstruction can be performed immediately or delayed, with optimal aesthetic outcome and low flap loss risk. However, the overall complication and capsular contracture incidence following immediate tissue expander/implant reconstruction was much higher than when performed delayed. Thus, tissue expander placement at the time of mastectomy may not necessarily save the patient an extra operation and may compromise the final aesthetic outcome.</Abstract>
<BookTitle>Book1</BookTitle>
<Publisher>Publisher01, Boston</Publisher>
<Edition>1st</Edition>
<EditorList>
    <Editor>
        <LastName>Lewis</LastName>
        <ForeName>Philip M</ForeName>
        <Initials>PM</Initials>
    </Editor>
    <Editor>
        <LastName>Kiffer</LastName>
        <ForeName>Michael</ForeName>
        <Initials>M</Initials>
    </Editor>
</EditorList>
<Page>19-28</Page>
<Year>2008</Year>
<AuthorList>
                <Author ValidYN="Y">
                    <LastName>Sullivan</LastName>
                    <ForeName>Stephen R</ForeName>
                    <Initials>SR</Initials>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Fletcher</LastName>
                    <ForeName>Derek R D</ForeName>
                    <Initials>DR</Initials>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Isom</LastName>
                    <ForeName>Casey D</ForeName>
                    <Initials>CD</Initials>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Isik</LastName>
                    <ForeName>F Frank</ForeName>
                    <Initials>FF</Initials>
                </Author>
</AuthorList>
//

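PaperTitle, Abstract, and Page need to go into Papers.txt

PaperTitle, BookTitle, Edition, Publisher, and Year need to go into Book.txt

PaperTitle and all editor data (LastName, ForeName, Initials) need to go into Editors.txt

PaperTitle and all author data (LastName, ForeName, Initials) need to go into Authors.txt

// marks the end of an entry. All files need to be tab-delimited. While I wouldn't turn down finished code, I'm hoping for at least some ideas to point me in the right direction; given code that parses out even one of the files (such as Book.txt), I can most likely figure out the rest from there. Many thanks.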

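【Comments】:

  • I would start by looking at the Config::General module to handle the parsing and the Text::CSV_XS module to generate the output files.
  • It sounds like you need XML::Twig. Please show the contents of the files that this data should produce.

Tags: perl parsing text tabs tags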


【Solution 1】:

Please check this:

use strict;
use warnings;
use Cwd;

#Get Directory
my $dir = getcwd();

#Grep files from the directory
opendir(DIR, $dir) || die "Couldn't open/read the $dir: $!";
my @AllFiles = grep(/\.txt$/i, readdir(DIR));
closedir(DIR);

#Check files are available
if(@AllFiles)
{
    #Create Text Files as per Requirement
    open(PAP, ">$dir/Papers.txt") || die "Couldn't create Papers.txt: $!";
    open(BOOK, ">$dir/Book.txt") || die "Couldn't create Book.txt: $!";
    open(EDT, ">$dir/Editors.txt") || die "Couldn't create Editors.txt: $!";
    open(AUT, ">$dir/Authors.txt") || die "Couldn't create Authors.txt: $!";
}
else {  die "No .txt files found in $dir\n"; } #Die if no files are found
foreach my $input (@AllFiles)
{
    print "Processing file $input\n";
    open(IN, "<", "$dir/$input") || die "Couldn't open the file $input: $!";
    local $/; my $tmp = <IN>; #Slurp the whole file into $tmp
    close(IN);
    #Loop from <PaperTitle> to // end slash
    while($tmp=~m/(<PaperTitle>((?:(?!\/\/).)*)\/\/)/gs)
    {
        my $LoopCnt = $1;
        my ($pptle) = $LoopCnt=~m/<PaperTitle>([^<>]*)<\/PaperTitle>/g;
        my ($abstr) = $LoopCnt=~m/<Abstract>([^<>]*)<\/Abstract>/gs;
        my ($pgrng) = $LoopCnt=~m/<Page>([^<>]*)<\/Page>/g;
        my ($bktle) = $LoopCnt=~m/<BookTitle>([^<>]*)<\/BookTitle>/g;
        my ($edtns) = $LoopCnt=~m/<Edition>([^<>]*)<\/Edition>/g;
        my ($publr) = $LoopCnt=~m/<Publisher>([^<>]*)<\/Publisher>/g;
        my ($years) = $LoopCnt=~m/<Year>([^<>]*)<\/Year>/g;

        my ($EditorNames, $AuthorNames) = ("", "");
        #Collect the editor names (a plain match is enough; nothing needs substituting)
        if($LoopCnt=~m#<EditorList>((?:(?!</EditorList>).)*)</EditorList>#s)
        {
            my $edtList = $1; my @Edlines = split/\n/, $edtList;
            my $i = 1; #Editor count, prefixed to each last name
            foreach my $EdsngLine(@Edlines)
            {
                if($EdsngLine=~m/<LastName>([^<>]*)<\/LastName>/)
                {  $EditorNames .= $i."".$1."\t"; $i++; }
                elsif($EdsngLine=~m/<ForeName>([^<>]*)<\/ForeName>/)
                {  $EditorNames .= $1."\t"; }
                elsif($EdsngLine=~m/<Initials>([^<>]*)<\/Initials>/)
                {  $EditorNames .= $1."\t"; }
            }
        }
        #Collect the author names (a plain match is enough; nothing needs substituting)
        if($LoopCnt=~m#<AuthorList>((?:(?!</AuthorList>).)*)</AuthorList>#s)
        {
            my $autList = $1; my @Autlines = split/\n/, $autList;
            my $j = 1; #Author count, prefixed to each last name
            foreach my $AutsngLine(@Autlines)
            {
                if($AutsngLine=~m/<LastName>([^<>]*)<\/LastName>/)
                {  $AuthorNames .= $j."".$1."\t"; $j++; }
                elsif($AutsngLine=~m/<ForeName>([^<>]*)<\/ForeName>/)
                {  $AuthorNames .= $1."\t"; }
                elsif($AutsngLine=~m/<Initials>([^<>]*)<\/Initials>/)
                {  $AuthorNames .= $1."\t"; }
            }
        }

        #Print the output to the corresponding text files
        print PAP "$pptle\t$abstr\t$pgrng\t//\n";
        print BOOK "$pptle\t$bktle\t$edtns\t$publr\t$years\t//\n";
        print EDT "$pptle\t$EditorNames//\n";
        print AUT "$pptle\t$AuthorNames//\n";
    }
}

print "Process Completed...\n";

#Don't forget to close the files
close(PAP);
close(BOOK);
close(EDT);
close(AUT);
#End

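【Comments】:

  • There is no excuse for parsing XML with regular expressions.
  • @Borodin: I'd be interested in using an XML module. Could you complete the code, so that I can build on it in my own program? Thanks in advance.
  • Thanks to @Borodin and ssr1012 for the help here. I should have specified one more thing. I will have to run this script on many files (e.g. BC_Book, EC_Book, CC_Book, etc.), 15 files in total. I'd like to concatenate the data, or append to the output files, each time I run the script, but here new files are created every time. I should be able to work through the code myself, but I'm lazy/bogged down by other aspects of this project. Any extra help here would be greatly appreciated!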
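
One way to address the follow-up about accumulating output across runs is to open the output files in append mode (`>>`) rather than truncate mode (`>`). The following is only a minimal sketch under stated assumptions: `append_row` is a hypothetical helper, not part of either answer, the scratch directory comes from the core File::Temp module, and the second row is placeholder data invented for the demo.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: append one tab-separated row to $path.
# '>>' creates the file if it is missing and never truncates existing content.
sub append_row {
    my ($path, @fields) = @_;
    open(my $out, '>>', $path) or die "Couldn't open $path for appending: $!";
    print {$out} join("\t", @fields), "\n";
    close($out) or die "Couldn't close $path: $!";
}

# Demo in a scratch directory. The two rows stand in for data parsed from
# two different input files (BC_Book and EC_Book are the asker's examples).
my $dir  = tempdir(CLEANUP => 1);
my $book = File::Spec->catfile($dir, 'Book.txt');

append_row($book, 'Title from BC_Book', 'Book1', '1st', 'Publisher01, Boston', 2008);
append_row($book, 'Title from EC_Book', 'Book2', '2nd', 'Publisher02, Boston', 2009);

# Read the file back to confirm that both rows survived.
open(my $in, '<', $book) or die "Couldn't open $book: $!";
my @rows = <$in>;
close($in);
print scalar(@rows), " rows in Book.txt\n";
```

In Solution 1 the equivalent change would be to replace `">$dir/Papers.txt"` with `">>$dir/Papers.txt"`, and likewise for the other three files. Because the output files also end in .txt and sit in the directory being scanned, it is probably worth excluding them from `@AllFiles` or writing them to a separate directory.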
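【Solution 2】:

This example should help you. It uses XML::Twig, as I suggested, to extract the fields for the Papers.txt output file. The record separator is set to "//\n" so that a whole block of data is read at once, and each block is wrapped in <Paper>...</Paper> tags to make it valid XML before it is parsed.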

use strict;
use warnings;
use 5.010;
use autodie;

use XML::Twig;

my $twig = XML::Twig->new;

open my $fh, '<', 'papers.txt';
local $/ = "//\n";

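while (<$fh>) {
  chomp;              # strip the trailing "//\n" record separator
  next unless /\S/;   # skip a blank chunk after the last record
  $twig->parse("<Paper>\n$_\n</Paper>\n");
  my $root = $twig->root;
  say $root->field($_) for qw/ PaperTitle Abstract Page/;
  say '---';
}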

Output

True incidence of all complications following immediate and delayed breast reconstruction.
BACKGROUND: Improved self-image and psychological well-being after breast reconstruction are well documented. To determine methods that optimized results with minimal morbidity, the authors examined their results and complications based on reconstruction method and timing. METHODS: The authors reviewed all breast reconstructions after mastectomy for breast cancer performed under the supervision of a single surgeon over a 6-year period at a tertiary referral center. Reconstruction method and timing, patient characteristics, and complication rates were reviewed. RESULTS: Reconstruction was performed on 240 consecutive women (94 bilateral and 146 unilateral; 334 total reconstructions). Reconstruction timing was evenly split between immediate (n = 167) and delayed (n = 167). Autologous tissue (n = 192) was more common than tissue expander/implant reconstruction (n = 142), and the free deep inferior epigastric perforator was the most common free flap (n = 124). The authors found no difference in the complication incidence with autologous reconstruction, whether performed immediately or delayed. However, there was a significantly higher complication rate following immediate placement of a tissue expander when compared with delayed reconstruction (p = 0.008). Capsular contracture was a significantly more common late complication following immediate (40.4 percent) versus delayed (17.0 percent) reconstruction (p < 0.001; odds ratio, 5.2; 95 percent confidence interval, 2.3 to 11.6). CONCLUSIONS: Autologous reconstruction can be performed immediately or delayed, with optimal aesthetic outcome and low flap loss risk. However, the overall complication and capsular contracture incidence following immediate tissue expander/implant reconstruction was much higher than when performed delayed. Thus, tissue expander placement at the time of mastectomy may not necessarily save the patient an extra operation and may compromise the final aesthetic outcome.
19-28
---
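
The same approach can plausibly be extended to the other output files. The sketch below is an illustration rather than part of the original answer: it parses a shortened copy of the question's sample record from a string (instead of reading papers.txt), then builds one tab-separated row for Book.txt and one row per editor for Editors.txt. The variable names `$book_row` and `@editor_rows` are made up for the example, and it prints to standard output where a real script would write to the files.

```perl
use strict;
use warnings;
use XML::Twig;

# One record in the question's format, shortened for the demo.
my $record = <<'END';
<PaperTitle>True incidence of all complications following immediate and delayed breast reconstruction.</PaperTitle>
<BookTitle>Book1</BookTitle>
<Publisher>Publisher01, Boston</Publisher>
<Edition>1st</Edition>
<EditorList>
    <Editor>
        <LastName>Lewis</LastName>
        <ForeName>Philip M</ForeName>
        <Initials>PM</Initials>
    </Editor>
    <Editor>
        <LastName>Kiffer</LastName>
        <ForeName>Michael</ForeName>
        <Initials>M</Initials>
    </Editor>
</EditorList>
<Year>2008</Year>
END

# Wrap the record in a single root element so that it is well-formed XML.
my $twig = XML::Twig->new;
$twig->parse("<Paper>\n$record</Paper>\n");
my $root  = $twig->root;
my $title = $root->field('PaperTitle');

# Book.txt: one row per paper.
my $book_row = join "\t",
    map { $root->field($_) } qw/ PaperTitle BookTitle Edition Publisher Year /;

# Editors.txt: one row per editor, each prefixed with the paper title.
my @editor_rows;
if (my $list = $root->first_child('EditorList')) {
    for my $editor ($list->children('Editor')) {
        push @editor_rows,
            join "\t", $title, map { $editor->field($_) } qw/ LastName ForeName Initials /;
    }
}

print "$book_row\n";
print "$_\n" for @editor_rows;
```

Authors.txt would follow the same pattern with `AuthorList` and `Author` in place of `EditorList` and `Editor`. Emitting one row per editor, rather than one row per paper, is a design assumption here; it tends to map more directly onto a database table.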

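【Comments】:

  • Thanks for the help here, @Borodin. This gets me far enough that I can use the code to build my own full program. I follow everything you did here, and I appreciate the help.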