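【Posted】: 2017-03-12 17:29:33
【Question】:
I am trying to parse a text file in Python so that I can compute some statistics on it. To do that, I want to replace certain punctuation with tokens. One example of such a token covers all sentence-terminating punctuation (.!? becomes <EndS>). I managed to do this with regular expressions. Now I am trying to handle quotation marks, so I think I need a way to tell opening quotes from closing quotes. I read the input file line by line, and I cannot guarantee that the quotes will be balanced.
For example: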
"Death to the traitors!" cried the exasperated burghers.
"Go along with you," growled the officer, "you always cry the same thing over again. It is very tiresome."
should become:
[Open] Death to the traitors! [Close] cried the exasperated burghers.
[Open] Go along with you, [Close] growled the officer, [Open] you always cry the same thing over again. It is very tiresome. [Close]
Is it possible to do this with regular expressions? Is there a simpler or better way to do it?
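To make the goal concrete, here is a minimal sketch of one possible regex approach. It classifies each quote by its neighbouring characters instead of counting quotes, so it does not rely on the quotes being balanced within a line. The names `tag_quotes`, `OPEN_RE`, and `CLOSE_RE` are hypothetical, and the sketch assumes straight ASCII double quotes (`"`) only. It is a heuristic, not a complete solution.

```python
import re

# Assumption: an opening quote sits at the start of the line or after
# whitespace, and is immediately followed by a non-space character.
OPEN_RE = re.compile(r'(?<!\S)"(?=\S)')

# Assumption: a closing quote follows a non-space character and is not
# immediately followed by a word character.
CLOSE_RE = re.compile(r'(?<=\S)"(?!\w)')


def tag_quotes(line):
    """Replace straight double quotes in one line with [Open]/[Close] tags."""
    line = OPEN_RE.sub("[Open] ", line)
    line = CLOSE_RE.sub(" [Close]", line)
    return line


if __name__ == "__main__":
    sample = [
        '"Death to the traitors!" cried the exasperated burghers.',
        '"Go along with you," growled the officer, "you always cry '
        'the same thing over again. It is very tiresome."',
    ]
    for text in sample:
        print(tag_quotes(text))
```

Because each line is handled independently, a quotation that starts on one line and ends on another still gets an `[Open]` on the first line and a `[Close]` on the later one. Quotes that match neither pattern, such as a quote directly after an opening parenthesis or one surrounded by spaces on both sides, are left unchanged and would need extra rules.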
【Discussion】:
Tags: python regex parsing nlp quotes