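You (originally) asked for an awk-based solution. As others have mentioned in the comments, there are better tools for this job. That said, based on the gawk manual sections 4.9 Multiple-Line Records and 4.7 Defining Fields by Content, you could try the following: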
$ awk --version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
[...]
$ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
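- RS = ";\n" assumes that your input file holds multiple ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records, and that each record ends with a ; (the one after valueY in your example) followed by a newline.
- FPAT = "([^;]+)|(\"<p.+p>\")" is a "best effort" way of telling (g)awk what the fields of a record look like. You may need to adapt it to your data. It describes two field formats (note the (...)|(...)). The first matches a run of characters that contains no ; and covers every field except DESCRIPTION. The second matches a string that starts with "<p and ends with p>", which lets DESCRIPTION contain both ; and newlines.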
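To see the two FPAT alternatives at work on something smaller, here is a minimal sketch with a made-up one-line record. The helper name split_demo and the sample record are assumptions for illustration only; the RS and FPAT values are the ones described above, and gawk is required because both a multi-character RS and FPAT are gawk extensions.

```shell
# split_demo is a hypothetical helper name. It needs gawk: FPAT and a
# multi-character RS are gawk extensions.
split_demo() {
  gawk 'BEGIN { RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" }
        { for (i = 1; i <= NF; i++) printf("$%d = %s\n", i, $i) }'
}

# Made-up record: the quoted third field contains a ";" of its own.
printf '%s\n' 'a;b;"<p style=""x;"">text</p>";c;' | split_demo
```

The quoted field should come out as a single $3 even though it contains a ;, because at that position the second alternative yields the longer match.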
Running the awk one-liner against testfile, which contains two ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records:
$ cat testfile
ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;
ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;
$ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
NF = 7
$1 = ID
$2 = Name
$3 = value1
$4 = value2
$5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
$6 = valueX
$7 = valueY
NF = 7
$1 = ID
$2 = Name
$3 = value1
$4 = value2
$5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
$6 = valueX
$7 = valueY
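Note that $5 still carries its surrounding quotes and the doubled "" inside the HTML. If you want the bare markup, one possible follow-up is sketched below. It assumes the file uses CSV-style quoting (one " around the field, "" for a literal "), which your sample suggests but which I cannot confirm; unquote_description is a hypothetical helper name.

```shell
# unquote_description is a hypothetical helper name; it requires gawk.
# Assumption: DESCRIPTION is field 5 and uses CSV-style quoting.
unquote_description() {
  gawk 'BEGIN { RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" }
        {
          d = $5
          # Drop the opening and closing quote if both are present.
          if (d ~ /^".*"$/) d = substr(d, 2, length(d) - 2)
          # Collapse each doubled quote into a single one.
          gsub(/""/, "\"", d)
          print d
        }' "$@"
}
```

Call it as unquote_description testfile, or pipe the data in on standard input.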