【Question title】: Reverse newline tokenization in one-token per line files? - Unix
【Posted】: 2014-03-13 19:24:55
【Question description】:

"How to separate tokens in line using Unix?" shows that a file can be tokenized with sed or xargs.

Is there a way to do the reverse?

[in:]

some
sentences
are
like
this.

some
sentences
foo
bar
that

[out:]

some sentences are like this.
some sentences foo bar that

The only delimiter between sentences is \n\n. I could have done it in Python as follows, but is there a Unix way?

import codecs

def per_section(it):
  """Read lines from an iterable and yield sections, using an empty line as delimiter."""
  section = []
  for line in it:
    if line.strip('\n'):
      section.append(line)
    else:
      yield ''.join(section)
      section = []
  # yield any remaining lines as a section too
  if section:
    yield ''.join(section)

# print(...) with a single argument behaves the same in Python 2 and Python 3
print([i.replace("\n", " ") for i in per_section(codecs.open('outfile.txt', 'r', 'utf8'))])

[out:]

[u'some sentences are like this. ', u'some sentences foo bar that ']

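【Comments on the question】:

  • Is it always 5 words? What pattern, such as the period ., marks where one line ends and the next begins?
  • No, it is not always 5 words; the 5 words are a coincidence.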

Tags: python unix sed awk xargs


【Solution 1】:

You can use awk as follows:

awk -v RS="\n\n" '{gsub("\n"," ",$0);print $0}' file.txt 

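This sets the record separator to \n\n, so each record is a group of lines delimited by a blank line. Every \n inside the record is then replaced with a space and the record is printed. Note that a multi-character RS is an extension (gawk and mawk support it, for example); POSIX leaves the behavior unspecified when RS has more than one character.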
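
For comparison with the Python in the question, the same idea can be sketched in a few lines: split on the two-newline separator, then turn the remaining newlines into spaces. The function name `join_blank_line_records` and the `sample` string are made up for illustration, and the sketch mirrors only the idea, not awk's exact handling of edge cases.

```python
def join_blank_line_records(text):
    # Split on the two-newline separator, mirroring the idea of RS="\n\n".
    records = text.split("\n\n")
    # Replace the newlines left inside each record with spaces, as gsub does.
    # Empty records are skipped here; that filter is a choice made for this sketch.
    return [record.replace("\n", " ") for record in records if record]

# Hypothetical sample input: one token per line, blank line between sentences.
sample = "some\nsentences\nare\nlike\nthis.\n\nsome\nsentences\nfoo\nbar\nthat\n"

for line in join_blank_line_records(sample):
    print(line)
```

In this sketch the newline that ends the input is also replaced, so the last record ends with a trailing space.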

【Comments】:

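    【Solution 2】:

    This kind of task is easier to handle with awk. With RS="" awk reads each blank-line-separated paragraph as one record; $1=$1 makes awk rebuild the record with a single space between fields, and the constant 7 is a true condition that triggers the default print: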

    awk -v RS="" '{$1=$1}7' file
    

    If you want to preserve runs of multiple spaces within each line, you can use:

    awk -v RS="" -F'\n' '{$1=$1}7' file
    

    With your example:

    kent$  cat f
    some
    sentences
    are
    like
    this.
    
    some
    sentences
    foo
    bar
    that
    
    kent$  awk -v RS=""  '{$1=$1}7' f   
    some sentences are like this.
    some sentences foo bar that
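
To make the difference between the two variants concrete, here is a rough Python sketch of paragraph mode. It is an approximation written for illustration, not a description of awk's implementation; `paragraphs`, `join_fields`, `keep_inner_spaces` and `sample` are hypothetical names.

```python
import re

def paragraphs(text):
    # Paragraph mode (RS=""): records are separated by runs of blank lines,
    # and newlines at the very start or end of the input are ignored.
    return [p for p in re.split(r"\n{2,}", text.strip("\n")) if p]

def join_fields(text, keep_inner_spaces=False):
    out = []
    for record in paragraphs(text):
        if keep_inner_spaces:
            # Like -F'\n': only newlines separate fields.
            fields = record.split("\n")
        else:
            # Like the default field splitting: any run of whitespace separates fields.
            fields = record.split()
        # Rebuilding the record joins the fields with one space (the default OFS).
        out.append(" ".join(fields))
    return out

# Hypothetical input: the second sentence has two spaces inside one line.
sample = "some\nsentences\nare\nlike\nthis.\n\nsome  spaced\nwords\n"

print(join_fields(sample))
print(join_fields(sample, keep_inner_spaces=True))
```

With the default splitting, the double space in "some  spaced" collapses to one; with newline-only splitting it is kept.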
    

    【Comments】:

      【Solution 3】:
      sed -n --posix 'H;$ {x;s/\n\([^[:cntrl:]]\{1,\}\)/\1 /gp;}' YourFile
      

      This relies on the blank-line separator, so each sentence may have a different length.
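
The sed one-liner is terse, so here is a Python sketch of the steps it appears to perform: every line is appended to a buffer with a leading newline (H), and at the last line each newline followed by a run of non-control characters is replaced by that run plus a space. The function name `join_like_sed` and the `sample` string are hypothetical, and the character class only approximates `[:cntrl:]` for ASCII input.

```python
import re

def join_like_sed(text):
    # sed's H appends "\n" + line to the hold space for every input line,
    # so the buffer starts with a newline before the first word.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    buffer = "".join("\n" + line for line in lines)
    # "\n" followed by a run of non-control characters -> that run plus a space.
    # The newline in front of an empty line is followed by another newline
    # (a control character), so it does not match and survives as the line break.
    return re.sub(r"\n([^\x00-\x1f\x7f]+)", r"\1 ", buffer)

# Hypothetical sample input matching the question's example.
sample = "some\nsentences\nare\nlike\nthis.\n\nsome\nsentences\nfoo\nbar\nthat\n"

print(join_like_sed(sample))
```

Because the space is added after every word, each output line in this sketch ends with a trailing space.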

        【Comments】:
