sed/regex pattern to search and replace numbers in a filename

【Question title】: sed/regex pattern to search and replace numbers in a filename
【Posted】: 2016-09-06 06:13:18
【Question description】:

I have a set of 3 files whose names encode a date:

abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv

The last three numbers represent the date:

2815
11816
112116

I need a single regex filter that extracts only the numbers corresponding to the date in the file name and also converts the result to MMDDYY format:

020815
110816
112116

Thanks for your help!

【Question comments】:

Tags: regex string awk sed grep

【Solution 1】:

awk -F'[_.]' '{printf "%02d%02d%02d\n",$(NF-3),$(NF-2),$(NF-1)}'
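
The one-liner reads file names on standard input. As a minimal sketch of how it could be fed the three sample names, the function below wraps it; the name `mmddyy` is an assumption made here for illustration and is not part of the answer.

```shell
# mmddyy is a hypothetical wrapper name, not from the original answer.
# Each name is split on "_" or "."; the last field is the extension, so the
# three fields before it are month, day and year.
mmddyy() {
  awk -F'[_.]' '{ printf "%02d%02d%02d\n", $(NF-3), $(NF-2), $(NF-1) }'
}

printf '%s\n' \
  abc1_bbb_yyy_2_8_15.csv \
  abd1_bba_yzy_11_8_16.csv \
  aby1_qba_yay_11_21_16.csv | mmddyy
```

Because `%02d` pads to a minimum width of two, values that already have two digits pass through unchanged.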

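【Comments】:

Just wondering how your solution handles the year field correctly. I am trying a similar line, but it also adds a 0 to the year, so year=15 becomes 015. In short, an extra zero is being prepended to the year.
Very nice, I like the negative offset from the field count. It cleanly handles file names that may have underscores and digits before the date. @RandomUser, it works for me as is; do you have sample data?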
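
The point about counting fields from the end can be checked with a quick sketch. The file names below are made up for illustration (they are not from the question), and `date_from_name` is a hypothetical wrapper around the same awk program.

```shell
# date_from_name is a hypothetical name; the awk program is the one from the answer.
date_from_name() {
  awk -F'[_.]' '{ printf "%02d%02d%02d\n", $(NF-3), $(NF-2), $(NF-1) }'
}

# Made-up names: extra underscores and digits appear before the date,
# and the number of leading fields differs between the two names.
printf '%s\n' \
  x_9_report_7_3_9_17.csv \
  a_1_2_15.csv | date_from_name
```

Since the fields are addressed relative to `NF`, the number of underscore-separated parts before the date does not matter, and a two-digit year such as 15 is printed as 15 because `%02d` only sets a minimum width.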

【Solution 2】:

As others have pointed out, sed is not the most elegant tool for this job. With perl:

fn='abc1_bbb_yyy_2_8_15.csv abd1_bba_yzy_11_8_16.csv aby1_qba_yay_11_21_16.csv'
for x in $fn; do
  echo $x | perl -n -e 'printf("%02d%02d%02d\n",/(\d+)_(\d+)_(\d+)\./)'
done

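If you really are restricted to sed, here is one way. The first expression puts a zero in front of every digit that directly follows an underscore. The second finds each run of digits followed by an underscore or a dot and keeps only its last two digits, dropping the rest of the run and the delimiter. The last one extracts the final run of six digits, which may be preceded by anything but must be followed only by non-digits.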

for x in $fn; do
  echo $x | sed -e 's/_\([0-9]\)/_0\1/g' \
    -e 's/[0-9]*\([0-9]\{2\}\)[_.]/\1/g' \
    -e 's/.*\([0-9]\{6\}\)[^0-9]*$/\1/'
done
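
To see what each of the three expressions contributes, they can be run one at a time on a single name. This is only a sketch; the variable names `step1`, `step2` and `step3` are introduced here for illustration.

```shell
name=abc1_bbb_yyy_2_8_15.csv

# Step 1: put a 0 in front of every digit that directly follows "_".
step1=$(printf '%s\n' "$name" | sed -e 's/_\([0-9]\)/_0\1/g')

# Step 2: for each digit run that ends in "_" or ".", keep its last two
# digits and drop the delimiter.
step2=$(printf '%s\n' "$step1" | sed -e 's/[0-9]*\([0-9]\{2\}\)[_.]/\1/g')

# Step 3: keep only the last run of six digits.
step3=$(printf '%s\n' "$step2" | sed -e 's/.*\([0-9]\{6\}\)[^0-9]*$/\1/')

printf '%s\n' "$step1" "$step2" "$step3"
# prints:
#   abc1_bbb_yyy_02_08_015.csv
#   abc1_bbb_yyy_020815csv
#   020815
```

Note that step 1 over-pads two-digit values (15 becomes 015); step 2 is what trims them back to two digits.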

Result:

$ for x in $fn; do
>       echo $x | sed -e 's/_\([0-9]\)/_0\1/g' \
>         -e 's/[0-9]*\([0-9]\{2\}\)[_.]/\1/g' \
>         -e 's/.*\([0-9]\{6\}\)[^0-9]*$/\1/'
>     done
020815
110816
112116

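【Comments】:

【Solution 3】:

This seemed like an interesting problem to try solving with sed.

I prefer TessellatingHeckler's perl approach. :-)

edit: After sleeping on it, I prefer jthill's awk approach.
Solving the problem with sed is technically interesting, but I would not want to maintain it over the long term.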

foo.dat

Sample data file...

$ cat foo.dat
abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv
$

Sample results

Note that sed -r enables extended regular expressions.

$ sed -rf foo.sed < foo.dat
020815
110816
112116
$

foo.sed

Normally I wouldn't be this verbose. :-)

But I think the comments make the purpose clearer.

# Put a wedge between "prefix" and "date.CSV" part.
# We don't salvage the .csv extension, that drops off here.
# Note the space padding before/after \1, we'll use that shortly.
s/([0-9_]+)\.csv/ \1 /g
#    in:  "abc1_bbb_yyy_2_8_15.csv"
#    out: "abc1_bbb_yyy _2_8_15 "
# (If I knew how to do non-greedy matching in sed we could
# strip the prefix e.g. "abc1_bbb_yyy" part here as well,
# but if we try that we end up with just "_15 ", e.g. our
# other month & day get eaten).
# Hence sacrificial space character that our
# next substitution will use to cut the prefix.

# Cut the prefix.
# strip up to, but not including, the first non-space char.
# (I don't think you can do non-greedy matching in sed).

s/^.* ([^ ])/\1/
#    in:  "abc1_bbb_yyy _2_8_15 "
#    out:              "_2_8_15 "

# change our underscores to two space chars.
# (turns out we need two intermediate spaces for
# the next substitution to work as a single "global" substitution)
s/_/  /g
#    in:   "_2_8_15 "
#    out:  "  2  8  15 "
# At this point all of our month/day/year parts 
# have *two* spaces between them.

# Next we do zero-padding if necessary.
s/ ([0-9]) / 0\1 /g
# Important: we're looking for a single space before
# and after any single digit.
#    in:  "  2  8  15 "
#   out:  "  02  08  15 "
# input broken out by single chars with "spc"= 1 space char.
#         +---+---+---+---+---+---+---+---+---+---+---+
# input:  |spc|spc| 2 |spc|spc| 8 |spc|spc| 1 | 5 |spc|
#         +---+---+---+---+---+---+---+---+---+---+---+
#              \         / \         /     no match, not
#               \       /   \       /      a single digit.
#                \     /     \     /
#                match 1     match 2
# Each match is 3 chars in and 4 chars out (" 2 " -> " 02 "),
# so the result is two chars longer than the input:
#         +---+---+---+---+---+---+---+---+---+---+---+---+---+
# result: |spc|spc| 0 | 2 |spc|spc| 0 | 8 |spc|spc| 1 | 5 |spc|
#         +---+---+---+---+---+---+---+---+---+---+---+---+---+
#              <-- match 1 --> <-- match 2 -->
# Without "two spaces" between digits this
# would require 3 separate substitutions...
# doing a single global e.g. s/ ([0-9]) / 0\1 /

# Pretty much done, just strip the spaces.
s/ //g
#   in:   "  02  08  15 "
#   out:  "020815"
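
For a quick test without a separate script file, the same five substitutions can be passed inline. This is a sketch rather than the author's packaging: `wedge_dates` is a hypothetical function name, and it uses `sed -E`, which both GNU and BSD sed accept, where the answer uses `-r`.

```shell
# wedge_dates is a hypothetical name; the expressions are those of foo.sed.
wedge_dates() {
  sed -E \
    -e 's/([0-9_]+)\.csv/ \1 /g' \
    -e 's/^.* ([^ ])/\1/' \
    -e 's/_/  /g' \
    -e 's/ ([0-9]) / 0\1 /g' \
    -e 's/ //g'
}

printf '%s\n' \
  abc1_bbb_yyy_2_8_15.csv \
  abd1_bba_yzy_11_8_16.csv \
  aby1_qba_yay_11_21_16.csv | wedge_dates
```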

【Comments】:

【Solution 4】:

Try this:

REST=cat # whatever the rest of your pipeline is...

( cat <<EOF
abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv
EOF
)\
| cut -d_ -f4-6 \
| cut -d. -f1 \
| sed -e 's/\([0-9][0-9]*\)/0\1/g' \
    -e 's/0\([0-9][0-9]\)/\1/g' \
    -e 's/_//g' \
| $REST

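【Comments】:

I'm tempted to downvote for complexity and ugliness, but I'll leave this comment instead. If you are using sed with a bunch of regex substitutions anyway, you hardly need to spend two cut processes as well. As always, regex-based approaches are not particularly transparent or readable, which makes later maintenance costly.

【Solution 5】:

Put the file names into t.txt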
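
As a sketch of what the comment suggests, the two cut calls can be folded into a single sed invocation. This is one possible way to do it, not the commenter's own code; `pad_dates_one_sed` is a hypothetical name.

```shell
# pad_dates_one_sed is a hypothetical name used for illustration.
pad_dates_one_sed() {
  # 1. keep only "_M_D_Y" from the end of the name and append a closing "_"
  # 2. double every "_" so neighbouring numbers do not share a delimiter
  # 3. zero-pad any number that is a single digit
  # 4. drop the underscores
  sed -e 's/.*\(_[0-9][0-9]*_[0-9][0-9]*_[0-9][0-9]*\)\.csv$/\1_/' \
      -e 's/_/__/g' \
      -e 's/_\([0-9]\)_/_0\1_/g' \
      -e 's/_//g'
}

printf '%s\n' \
  abc1_bbb_yyy_2_8_15.csv \
  abd1_bba_yzy_11_8_16.csv \
  aby1_qba_yay_11_21_16.csv | pad_dates_one_sed
```

The doubling in step 2 matters because a global substitution never reuses a character it has already consumed, so adjacent single-digit fields such as `_2_8_` need their own delimiters on both sides.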

abc1_bbb_yyy_2_8_15.csv
abd1_bba_yzy_11_8_16.csv
aby1_qba_yay_11_21_16.csv

Then

$ cat t.txt | perl -p -e 's/(?<=_)(\d)(?=_)/0\1/g' | perl -p -e 's/.*(\d\d)_(\d\d)_(\d\d)\.csv/\1\2\3/'
020815
110816
112116

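This isn't exactly sed/awk/grep, because sed can't do lookarounds and I don't feel like AWK right now, but it is regex and *nixy.

[Edit: OK, Perl-disliking downvoters, my approach was to first prefix single digits with 0 and then extract the two-digit pairs. sed has a hard time doing that without lookarounds or non-capturing groups, but here is a sed answer that uses @jgreve's idea of inserting a wedge first. It also produces YYYYMMDD output, assuming every year is 20xx: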

#                  #wedge        #single n to 0n            #extract __mm__dd__yy                                   to 20yymmdd
cat t.txt | sed -e 's/_/__/g' -e 's/_\([0-9]\)_/_0\1_/g' -e 's/.*__\([0-9][0-9]\)__\([0-9][0-9]\)__\([0-9][0-9]\)\.csv/20\3\1\2/'

]

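【Comments】:

Now, how can the result be converted to YYYYMMDD format, using the output as input to the date command?
I would change the replacement part of the second perl regex from \1\2\3 to \3\1\2 to reorder the groups, which puts them in YYMMDD order. To make it YYYY, I'd be tempted to skip the date command and just put 20 in front, as in 20\3\1\2, which is fine for all years from 2000 through 2099.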
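
The same reordering can be sketched with awk instead of perl. This assumes, as the comment does, that every year falls between 2000 and 2099; `yyyymmdd` is a hypothetical function name.

```shell
# yyyymmdd is a hypothetical name. Assumption: all years are 20xx.
# Fields counted from the end: $(NF-3)=month, $(NF-2)=day, $(NF-1)=year.
yyyymmdd() {
  awk -F'[_.]' '{ printf "20%02d%02d%02d\n", $(NF-1), $(NF-3), $(NF-2) }'
}

printf '%s\n' \
  abc1_bbb_yyy_2_8_15.csv \
  abd1_bba_yzy_11_8_16.csv \
  aby1_qba_yay_11_21_16.csv | yyyymmdd
```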