【Question title】: Remove <s> and </s> from all lines in file
【Posted】: 2017-08-10 21:06:13
【Question】:

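I'm working with a fairly large file that I will use to create word2vec embeddings. The file contains one sentence per line, and every line starts with a <s> start tag and ends with a </s> end tag. What I want to do now is remove the start and end tags using sed, but I can't figure out how.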

I tried

sed myfile 's/<s> //g' > resultfile
sed resultfile 's/ </s>//g' > finalfile

but this produces an "extra characters after command" error.

I'd be very glad if someone could give me the correct pattern. Thanks in advance!

【Comments on the question】:

    Tags: regex sed


    【Solution 1】:

    Try this:

    sed 's#</\?s>##g' file
    
    • This removes both <s> and </s> in one pass.
    • # is used as the delimiter for sed's s command because your pattern already contains a slash.
    • </\?s> is the regular expression; it matches both <s> and </s>.
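The command above removes only the tags themselves, so the space after <s> and the space before </s> stay on each line. Also, `\?` is a GNU extension to basic regular expressions and may not work with other sed implementations. The following is a minimal sketch of one way to handle both points, assuming the tags sit at the very start and end of each line and are separated from the sentence by at most one space; the sample sentences and the variable names `sample` and `cleaned` are made up for illustration.

```shell
# Hypothetical sample input: one sentence per line, wrapped in <s> ... </s>.
sample='<s> the cat sat on the mat </s>
<s> word2vec needs plain text </s>'

# -E selects extended regular expressions, where ? needs no backslash.
# # is the delimiter, so the slash in </s> needs no escaping.
# First command: drop a leading "<s>" plus an optional following space.
# Second command: drop a trailing "</s>" plus an optional preceding space.
cleaned=$(printf '%s\n' "$sample" | sed -E 's#^<s> ?##; s# ?</s>$##')

printf '%s\n' "$cleaned"
```

Anchoring the patterns with ^ and $ means a literal <s> that happens to appear in the middle of a sentence is left alone, which may or may not be what you want for your data.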

    【Comments】:

      【Solution 2】:

      Your arguments are in the wrong order: sed expects the script first, then the input file.

      Try using:

      sed -e 's/<[^>]*>//g' myfile.txt
      

      to remove any HTML tags.

      Reference: Sed remove tags from html file
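For comparison, here is a sketch of the two commands from the question with only the argument order corrected and the delimiter changed. When no -e or -f option is given, sed treats its first non-option argument as the script, which is why passing the file name first fails. The second command would also fail with / as the delimiter, because the slash inside </s> ends the pattern early. The temporary directory and the sample sentences are assumptions added so the example can be run on its own.

```shell
# Work in a temporary directory so no real files are touched.
tmpdir=$(mktemp -d)
printf '%s\n' '<s> first sentence </s>' '<s> second sentence </s>' > "$tmpdir/myfile"

# The script comes first, then the input file. Using , as the delimiter
# avoids the clash between the / in </s> and the s command's own delimiter.
sed 's,<s> ,,g' "$tmpdir/myfile" > "$tmpdir/resultfile"
sed 's, </s>,,g' "$tmpdir/resultfile" > "$tmpdir/finalfile"

# Read the result back, show it, and clean up.
final=$(cat "$tmpdir/finalfile")
printf '%s\n' "$final"
rm -r "$tmpdir"
```

Unlike the `<[^>]*>` pattern, this removes only the two specific tags and the space next to each, and leaves any other angle-bracketed text in the file untouched.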

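      【Comments】:

      • This does something different!
      • That's why I mentioned it... it removes any HTML tag.
      • But are you answering the OP's question?
      • You're right, it isn't the exact answer, just a quick pointer to steer @Henrik toward a solution to the problem.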