【Question title】: Using AWK to mail merge two files, producing a reformatted third file
【Posted】: 2020-08-01 21:02:42
【Question】:

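This is not really a question. It is also my first post. I'm not a newbie, but I am only a beginner at awk.

Recently I needed to generate some .xml configuration files from two sets of data that were not originally stored as XML.

I searched a lot for help with AWK, but I realized that 99% of the scripts on offer use advanced AWK techniques, which makes them hard for a beginner to follow. I believe this dampens interest and steepens the learning curve.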

E.g. awk '/ERROR/' /log/messages

For someone who doesn't write awk scripts much, it is not easy to tell what is going on there, yet it really does a lot.
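For readers new to this shorthand: a pattern with no action prints every line that matches it. The sketch below spells that out in long form so the two can be compared. The file name messages.log and its contents are made up for illustration.

```shell
# Create a small sample log (hypothetical content, for illustration only).
printf 'INFO boot ok\nERROR disk full\nINFO done\nERROR net down\n' > messages.log

# Short form: a pattern with no action prints every matching line.
short=$(awk '/ERROR/' messages.log)

# Long form: the same logic written out as an explicit test and print.
long=$(awk '{ if ($0 ~ /ERROR/) print $0 }' messages.log)

printf '%s\n' "$short"
```

Both commands select the same two ERROR lines; the short form simply relies on awk's default action.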

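So here I will present a newbie's way of accomplishing such a task. In return,

I would like suggestions for:

  1. A more optimized newbie version.
  2. An optimized advanced version, with proper explanation to help with the transition.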
$ ./test1.awk Samfig2.cfg user1.tsv
$ ls cfg*
cfg2ZR6ZS29XXOF.xml  cfg42IXEIGOQ0FG.xml  cfg759YUZKTS368.xml  cfgNTQALYCPLE06.xml  cfgYDMWJVLO6YWS.xml

test1.awk

#!/usr/bin/awk -f
BEGIN { 
        configfile=ARGV[1]
        Userfile=ARGV[2]
      
        if (ARGV[2] == "") {
                        print "ERROR: Need two files Usage "ENVIRON["_"]" Config.cfg Users.tsv" >"/dev/stderr"
                        exit }
       ARGV[1] = ""   # We want to control the manipulation of files
       ARGV[2] = ""
        FS = "=" ;  # originally set dynamically later on; setting it here cut execution time by almost 90%
        getline Header < Userfile;  # read the first line of the user file to get the headers
         gsub("\r","",Header);  # My production version doesn't need this, but the sample data seems to include \r at the end of the last field
        HeaderN=split(Header,Headarray,"\t");

# Expand begin block to include {} below to prevent pause for input       }
#{   

while ((getline User < Userfile) >0 )  # Read one row of the user file into the variable User; everything below runs once per record in Userfile.
   { 
    gsub("\r","",User); # My production version doesn't need this, but the sample data seems to include \r at the end of the last field
    n=split(User,Detailsarray,"\t");           # Split the row stored in User into the array Detailsarray on "\t"; n holds the number of elements
    filetostore=("cfg" Detailsarray[HeaderN] ".xml"); # Name each output file after the value in the last header column of the user file
    Recordtmp=""                               #To reduce file IO will append to string then output later. 
    Recordtmp ="<?xml version=\42""1.0\42 encoding=\42utf-8\42?>";   #\42 is the double quote ". Result is  <?xml version="1.0" encoding="utf-8"?>  
                                                                #without the "" set you would get <?xml version=.0" encoding="utf-8"?> as it would interpret as \421 
    Recordtmp = Recordtmp "\n<users_provision version=\42""1\42>";
    Recordtmp = Recordtmp "\n<config version=\42""1\42>"; 
    
    for(i=1; i<=HeaderN; i++)  # n would also work here, but just in case, I iterate based on the initial header count
        Recordtmp = Recordtmp  "\n    <" Headarray[i] ">" Detailsarray[i] "</" Headarray[i] ">"; 
                      
    while ((getline < configfile) >0 )
        {    
            Recordtmp = Recordtmp  "\n    <" $1 ">" $2 "</" $1 ">";
         }
    Recordtmp = Recordtmp  "\n</config>"; 
    Recordtmp = Recordtmp  "\n</users_provision>\n";
    
    close(configfile);
   
    print (Recordtmp)> filetostore;
    close(filetostore);     

   }
#}

# END {  # Had to expand begin block to avoid pause issue
      close(Userfile);
     }

Samfig.cfg

URL=msn.com
Dealer=RealtorSales
SQRFT=3600
Taxes=6000
Asking=1,800,000
Built=July/2019
Listed=07/12/2109
MSRP=2,000,000
Kitchen=5
Baths=2.5
floors=3
Rooms=5

user1.tsv

Name	StreeNum	StreetName	City	State	ZIP	IDcard
Ashanti Simmons	138	Jockey Hollow Avenue	Phillipsburg	NJ	08865	2ZR6ZS29XXOF
Bobby Marshall	7985	E. Beech Road	Flemington	NJ	08822	YDMWJVLO6YWS
Marianna Quinn	8950	Main St.	Moses Lake	WA	98837	42IXEIGOQ0FG
Jaslyn Fuentes	9581	Lafayette Dr.	Hummelstown	PA	17036	NTQALYCPLE06
Cory Jordan	26	Randall Mill Street	Bay City	MI	48706	759YUZKTS368

Contents of cfg2ZR6ZS29XXOF.xml

<?xml version="1.0" encoding="utf-8"?>
<users_provision version="1">
<config version="1">
    <Name>Ashanti Simmons</Name>
    <StreeNum>138</StreeNum>
    <StreetName>Jockey Hollow Avenue</StreetName>
    <City>Phillipsburg</City>
    <State>NJ</State>
    <ZIP>08865</ZIP>
    <IDcard>2ZR6ZS29XXOF</IDcard>
    <URL>msn.com</URL>
    <Dealer>RealtorSales</Dealer>
    <SQRFT>3600</SQRFT>
    <Taxes>6000</Taxes>
    <Asking>1,800,000</Asking>
    <Built>July/2019</Built>
    <Listed>07/12/2109</Listed>
    <MSRP>2,000,000</MSRP>
    <Kitchen>5</Kitchen>
    <Baths>2.5</Baths>
    <floors>3</floors>
    <Rooms>5</Rooms>
</config>
</users_provision>

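Improvements that could be made:

  1. Read the FS/split values into variables from the command line.
  2. Replace a default value from the config file only when the corresponding value in the data file is not empty.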
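A minimal sketch of both improvements, under stated assumptions: the variable names cfgsep and datasep, the sample files defaults.cfg and people.tsv, and the flat key=value; output are all hypothetical and chosen only to keep the example short. It also assumes config values contain no separator character. Separators come in through -v, and a config default is used for a column only when the data cell is empty.

```shell
# Hypothetical sample inputs (not the files from the question).
printf 'URL=msn.com\nRooms=5\n' > defaults.cfg
printf 'Name\tRooms\tIDcard\nAnn\t7\tAAA\nBob\t\tBBB\n' > people.tsv

merged=$(awk -v cfgsep='=' -v datasep='\t' '
    # First file: remember each default and the order the keys appeared in.
    NR == FNR { split($0, kv, cfgsep); keys[++nk] = kv[1]; def[kv[1]] = kv[2]; next }
    # Second file, first line: remember the column names.
    FNR == 1  { nh = split($0, head, datasep); for (i = 1; i <= nh; i++) incol[head[i]] = 1; next }
    {
        split($0, cell, datasep)
        line = ""
        # Data columns: fall back to the config default only when the cell is empty.
        for (i = 1; i <= nh; i++) {
            v = cell[i]
            if (v == "" && (head[i] in def)) v = def[head[i]]
            line = line head[i] "=" v ";"
        }
        # Config keys that are not data columns are appended unchanged.
        for (i = 1; i <= nk; i++)
            if (!(keys[i] in incol)) line = line keys[i] "=" def[keys[i]] ";"
        print line
    }' defaults.cfg people.tsv)

printf '%s\n' "$merged"
```

Ann keeps her own Rooms value of 7, while Bob's empty Rooms cell falls back to the default of 5.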

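【Comments】:

  • (Hmm, I can't get the formatting of the initial "calling script" code to clean up, i.e. $./test1.awk etc.) Otherwise you seem to have this working with awk. I don't understand your comment # Expand begin block to include {} below to prevent pause for input. Sometimes the best solution is to do all the processing inside BEGIN{} or END{}. If your XML output validates with xmllint, declare victory! Getting more advanced XML features out of awk is a real "learning opportunity"; see xmlawk (will try to find a link). Good luck.
  • sourceforge.net/projects/gawkextlib may help (at the cost of a steeper learning curve ;-)). Good luck.
  • Because I blanked out the ARGV values, awk paused waiting for user input once the script was written with a main block. Hence the comment.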
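The pause described in the last comment can be reproduced in a few lines. This is a sketch, not the asker's script: the operand ignored.txt is a made-up name and never needs to exist, because it is blanked out before awk would open it.

```shell
# BEGIN-only program: awk exits without reading any input, even though
# the only file operand was blanked out.
begin_only=$(awk 'BEGIN { ARGV[1] = ""; print "done" }' ignored.txt < /dev/null)

# Same blanking, but now a main rule exists and no file operand is left,
# so awk falls back to standard input. Here printf feeds it; at a terminal
# it would sit waiting for typed input.
with_main=$(printf 'typed line\n' | awk 'BEGIN { ARGV[1] = "" } { print "read: " $0 }' ignored.txt)

printf '%s\n%s\n' "$begin_only" "$with_main"
```

This is why keeping all of the work inside BEGIN avoided the pause: with no main or END rule, awk has no reason to read input at all.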

Tags: awk merge


【Solution 1】:

Something like this?

$ awk 'function bt(t)    {return "<"t">"}
       function et(t)    {return bt("/"t)}
       function tag(t,v) {return bt(t) v et(t)}
       function prolog() {return bt("?xml version=\"1.0\" encoding=\"utf-8\"?")}
       function start(t) {return bt(t " version=\"1\"")}

       NR==FNR {split($0,a,"="); ks[NR]=a[1]; vs[NR]=a[2]; nk=NR; next}
       FNR==1  {n=split($0,header); next}
               {file="cfg" $NF ".xml"
                print prolog() > file
                print start("users_provision") > file
                print start("config") > file
                for(i=1;i<=NF;i++) print "\t" tag(header[i],$i) > file
                for(i=1;i<=nk;i++) print "\t" tag(ks[i], vs[i]) > file
                print et("config") > file
                print et("users_provision") > file
                close(file)}' config FS='\t' user

This uses a few helper functions to simplify the main body of the code. There is no error checking or validation, though.

which produces

$ cat cfg2ZR6ZS29XXOF.xml

<?xml version="1.0" encoding="utf-8"?>
<users_provision version="1">
<config version="1">
        <Name>Ashanti Simmons</Name>
        <StreeNum>138</StreeNum>
        <StreetName>Jockey Hollow Avenue</StreetName>
        <City>Phillipsburg</City>
        <State>NJ</State>
        <ZIP>08865</ZIP>
        <IDcard>2ZR6ZS29XXOF</IDcard>
        <URL>msn.com</URL>
        <Dealer>RealtorSales</Dealer>
        <SQRFT>3600</SQRFT>
        <Taxes>6000</Taxes>
        <Asking>1,800,000</Asking>
        <Built>July/2019</Built>
        <Listed>07/12/2109</Listed>
        <MSRP>2,000,000</MSRP>
        <Kitchen>5</Kitchen>
        <Baths>2.5</Baths>
        <floors>3</floors>
        <Rooms>5</Rooms>
</config>
</users_provision>

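Unless you are dealing with large files, I think the thing to optimize for is ease of maintenance. Even though most awk scripts are short-lived, one with proper structure and comments can stay in service for a long time.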
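Two idioms in the script above may be unfamiliar: the NR==FNR test that singles out the first file, and a variable assignment (FS='\t') placed between the file operands so that it takes effect only once the second file begins. A small trace, using made-up files first.txt and second.txt, shows both:

```shell
# Hypothetical sample inputs, for illustration only.
printf 'a=1\nb=2\n' > first.txt
printf 'x\ty z\n' > second.txt

trace=$(awk '
    # NR counts lines across all files, FNR restarts for each file,
    # so NR == FNR is true only while the first file is being read.
    NR == FNR { print "first:  " $0; next }
    # By now FS has been switched to a tab by the FS= operand below.
    { print "second: NF=" NF ", last=" $NF }
' first.txt FS='\t' second.txt)

printf '%s\n' "$trace"
```

The second file's line splits into two tab-separated fields, with the space inside "y z" preserved. One caveat: if the first file is empty, NR==FNR is also true for the second file, so the idiom assumes a non-empty first input.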

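【Comments】:

  • This is a nicely optimized version. I was not aware of the procedural programming possibilities in awk. I don't see the pesky \r being stripped from $0. I see you used two arrays (key strings and value strings) to allow post-processing of the first input file. For the benefit of those who pass by later, I would have preferred some comments. I like it nonetheless.
【Solution 2】: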

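You seem to have missed the point of awk, which is that it reads the input files for you. As a result, you have written an awk script the way you would write a C program, with a bunch of while-read loops in the BEGIN section doing manually what awk does automatically. I think this is what you are trying to do: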

$ cat tst.awk
BEGIN {
    FS = "\t"
    fmt = "   <%s>%s</%s>\n"
}
{ sub(/\r$/,"") }
NR == FNR {
    tag = val = $0
    sub(/=.*/,"",tag)
    sub(/[^=]+=/,"",val)
    comm = comm sprintf(fmt, tag, val, tag)
    next
}
FNR == 1 {
    for (i=1; i<=NF; i++) {
        tags[i] = $i
    }
    next
}
{
    out = "cfg" $NF ".xml"

    print "<?xml version=\"1.0\" encoding=\"utf-8\"?>"  > out
    print "<users_provision version=\"1\">"             > out
    print "<config version=\"1\">"                      > out

    for (i=1; i<=NF; i++) {
        printf fmt, tags[i], $i, tags[i]                > out
    }

    printf "%s", comm                                   > out

    print "</config>"                                   > out
    print "</users_provision>"                          > out

    close(out)
}

.

$ awk -f tst.awk Samfig.cfg user1.tsv

.

$ head -50 cfg*.xml
==> cfg2ZR6ZS29XXOF.xml <==
<?xml version="1.0" encoding="utf-8"?>
<users_provision version="1">
<config version="1">
   <Name>Ashanti Simmons</Name>
   <StreeNum>138</StreeNum>
   <StreetName>Jockey Hollow Avenue</StreetName>
   <City>Phillipsburg</City>
   <State>NJ</State>
   <ZIP>08865</ZIP>
   <IDcard>2ZR6ZS29XXOF</IDcard>
   <URL>msn.com</URL>
   <Dealer>RealtorSales</Dealer>
   <SQRFT>3600</SQRFT>
   <Taxes>6000</Taxes>
   <Asking>1,800,000</Asking>
   <Built>July/2019</Built>
   <Listed>07/12/2109</Listed>
   <MSRP>2,000,000</MSRP>
   <Kitchen>5</Kitchen>
   <Baths>2.5</Baths>
   <floors>3</floors>
   <Rooms>5</Rooms>
</config>
</users_provision>

==> cfg42IXEIGOQ0FG.xml <==
<?xml version="1.0" encoding="utf-8"?>
<users_provision version="1">
<config version="1">
   <Name>Marianna Quinn</Name>
   <StreeNum>8950</StreeNum>
   <StreetName>Main St.</StreetName>
   <City>Moses Lake</City>
   <State>WA</State>
   <ZIP>98837</ZIP>
   <IDcard>42IXEIGOQ0FG</IDcard>
   <URL>msn.com</URL>
   <Dealer>RealtorSales</Dealer>
   <SQRFT>3600</SQRFT>
   <Taxes>6000</Taxes>
   <Asking>1,800,000</Asking>
   <Built>July/2019</Built>
   <Listed>07/12/2109</Listed>
   <MSRP>2,000,000</MSRP>
   <Kitchen>5</Kitchen>
   <Baths>2.5</Baths>
   <floors>3</floors>
   <Rooms>5</Rooms>
</config>
</users_provision>

==> cfg759YUZKTS368.xml <==
<?xml version="1.0" encoding="utf-8"?>
<users_provision version="1">
<config version="1">
   <Name>Cory Jordan</Name>
   <StreeNum>26</StreeNum>
   <StreetName>Randall Mill Street</StreetName>
   <City>Bay City</City>
   <State>MI</State>
   <ZIP>48706</ZIP>
   <IDcard>759YUZKTS368</IDcard>
   <URL>msn.com</URL>
   <Dealer>RealtorSales</Dealer>
   <SQRFT>3600</SQRFT>
   <Taxes>6000</Taxes>
   <Asking>1,800,000</Asking>
   <Built>July/2019</Built>
   <Listed>07/12/2109</Listed>
   <MSRP>2,000,000</MSRP>
   <Kitchen>5</Kitchen>
   <Baths>2.5</Baths>
   <floors>3</floors>
   <Rooms>5</Rooms>
</config>
</users_provision>

==> cfgNTQALYCPLE06.xml <==
<?xml version="1.0" encoding="utf-8"?>
<users_provision version="1">
<config version="1">
   <Name>Jaslyn Fuentes</Name>
   <StreeNum>9581</StreeNum>
   <StreetName>Lafayette Dr.</StreetName>
   <City>Hummelstown</City>
   <State>PA</State>
   <ZIP>17036</ZIP>
   <IDcard>NTQALYCPLE06</IDcard>
   <URL>msn.com</URL>
   <Dealer>RealtorSales</Dealer>
   <SQRFT>3600</SQRFT>
   <Taxes>6000</Taxes>
   <Asking>1,800,000</Asking>
   <Built>July/2019</Built>
   <Listed>07/12/2109</Listed>
   <MSRP>2,000,000</MSRP>
   <Kitchen>5</Kitchen>
   <Baths>2.5</Baths>
   <floors>3</floors>
   <Rooms>5</Rooms>
</config>
</users_provision>

==> cfgYDMWJVLO6YWS.xml <==
<?xml version="1.0" encoding="utf-8"?>
<users_provision version="1">
<config version="1">
   <Name>Bobby Marshall</Name>
   <StreeNum>7985</StreeNum>
   <StreetName>E. Beech Road</StreetName>
   <City>Flemington</City>
   <State>NJ</State>
   <ZIP>08822</ZIP>
   <IDcard>YDMWJVLO6YWS</IDcard>
   <URL>msn.com</URL>
   <Dealer>RealtorSales</Dealer>
   <SQRFT>3600</SQRFT>
   <Taxes>6000</Taxes>
   <Asking>1,800,000</Asking>
   <Built>July/2019</Built>
   <Listed>07/12/2109</Listed>
   <MSRP>2,000,000</MSRP>
   <Kitchen>5</Kitchen>
   <Baths>2.5</Baths>
   <floors>3</floors>
   <Rooms>5</Rooms>
</config>
</users_provision>

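When reading the .cfg file, I populate and use the tag and val variables the way I do, instead of setting FS to = and then using $1 and $2 or similar, so that the script still works even if a value contains =, e.g. Dealer=List=>Sold.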
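The difference is easy to see on that example line. The sketch below compares splitting on every = with cutting at the first = only, reusing the two sub() calls from the script above; the variable names by_fs and by_sub are just for this illustration.

```shell
line='Dealer=List=>Sold'

# Splitting on every "=" loses the part of the value after the second "=".
by_fs=$(printf '%s\n' "$line" | awk -F= '{ print $1 "|" $2 }')

# Cutting at the first "=" only keeps the whole value intact.
by_sub=$(printf '%s\n' "$line" | awk '{
    tag = val = $0
    sub(/=.*/, "", tag)      # drop everything from the first "=" onward
    sub(/[^=]+=/, "", val)   # drop the leading tag and its "="
    print tag "|" val
}')

printf '%s\n%s\n' "$by_fs" "$by_sub"
```

With -F= the value is truncated to List, whereas the sub() approach yields the full List=>Sold.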

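【Comments】:

  • This is also good. I didn't miss that point about awk; it was just getting in my way. In fact, I was actually passing the user file before the config file on the input list, and figured that the way awk processes files in order brought too many problems. I like how you build the common part first and then paste it into each user's output file. The same could be done for the head and tail tags.
  • wrt "it was just getting in my way" - that's because you misunderstood how to use awk. Yes, there are various alternatives, but I wanted to keep it as simple as possible since you are just learning, and there is no obvious benefit in building a string from the head and tail text before printing.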