【Question title】: split text by periods except in certain cases [duplicate]
【Posted】: 2021-09-21 05:34:33
【Question description】:

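I am trying to split a string containing an entire text document into sentences so that I can convert it to a CSV. Naturally, I would use the period as the delimiter and call str.split('.'); however, the document contains the abbreviations "i.e." and "e.g.", and in those cases I want to ignore the periods.

For example,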

Original text: During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors.

Resulting list: ["During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing", "ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."]

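So far my only workaround has been to replace every 'i.e.' and 'e.g.' with 'ie' and 'eg', which is both inefficient and ungrammatical. I have been experimenting with Python's regular expression library, which I suspect can give me the answer I want, but my knowledge of it is novice level at best.

This is my first time posting a question here, so I apologize if I have used incorrect formatting or wording.

【Question discussion】:

Tags: python regex split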


【Solution 1】:

This should work!

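import re

p = "During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."

sentences = []
while p:
    sentence = ""
    while True:
        # An uppercase run followed by everything up to the next uppercase letter.
        match = re.search(r"[A-Z]+[^A-Z]+", p)
        if match is None:
            # No uppercase-led chunk is left: keep the remainder and stop,
            # so the outer loop cannot spin forever on leftover text.
            sentence += p
            p = ""
            break
        # Consume through the end of the match, including any text before it.
        sentence += p[:match.end()]
        p = p[match.end():]
        # A chunk ending in ". " closes the sentence, because the character
        # that follows it is an uppercase letter.
        if match.group(0).endswith(". "):
            break
    sentences.append(sentence)

print(sentences)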
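The loop above cuts a sentence wherever ". " is followed by an uppercase letter. As a rough sketch, the same rule can be written as a single re.split call with lookarounds. The function name split_before_capital and the shortened sample text are only illustrative, and the rule will still split wrongly when an abbreviation is followed by a capitalized word (for instance "e.g. Python").

```python
import re


def split_before_capital(text):
    """Split on whitespace that follows a period and precedes an uppercase letter."""
    # (?<=\.) requires a period just before the whitespace;
    # (?=[A-Z]) requires an uppercase letter just after it.
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text)


sample = (
    "Routing became complex. ISPs began to support policies, "
    "i.e. goals held by the router's owner."
)
pieces = split_before_capital(sample)
print(pieces)
```

Unlike the loop, this version keeps the closing period on each sentence and drops the whitespace between sentences.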

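【Discussion】:

  • Just edited the original question to include an example.
  • Using string.split(". ") still does not give the expected result, because it splits at 'i.e. goals...'
  • Yes, this is also only partly correct; still looking for a solution.
  • Yes, I think the latest version is correct. This is a hard problem.
【Solution 2】: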

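See "How can I split a text into sentences?", which suggests using the Natural Language Toolkit (NLTK).

To explain in more depth why, here is an example:

My name is I. Brown. I bet I will make a sentence hard to parse. No one is better suited to this task than I.

How would you split that into separate sentences?

You need semantics (a formal sentence is usually made up of a subject, an object, and a verb), and a regular expression cannot capture that. Regular expressions handle syntax very well, but not semantics (meaning).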
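A small sketch of why character-level rules fall short here. The text below is a variation on the example above, not the original wording, and the "single capital letter" guard is a hypothetical rule made up for illustration. A plain split breaks the name "I. Brown" apart, while the guarded rule keeps the name together but then misses the real sentence boundary after the pronoun "I".

```python
import re

text = (
    "My name is I. Brown. No one is better suited than I. "
    "Regexes see only characters."
)

# Naive rule: every ". " ends a sentence.
naive = text.split(". ")

# Guarded rule (hypothetical): do not split after a lone capital letter,
# treating it as an initial. (?<!\b[A-Z]) rejects a period that directly
# follows a single capital letter at the start of a word.
guarded = re.split(r"(?<!\b[A-Z])\.\s+", text)

print(naive)    # four pieces: the name "I. Brown" is cut in two
print(guarded)  # two pieces: "than I." is no longer seen as a sentence end
```

The text really contains three sentences, and neither rule finds them, because the initial "I." and the pronoun "I." look identical at the character level.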

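To prove the point: the answer someone else suggested, which uses a large, complicated regex, is fairly slow, and has 115 votes, would break on my simple sentences.

This is an NLP problem, so I linked to an answer that offers an NLP package.

【Discussion】:

    【Solution 3】: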

    Here is a rough implementation.

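    inp = input()
    res = []
    last = 0
    for x in range(2, len(inp) - 2):
        # A period ends a sentence unless another period sits two characters
        # before or after it, as in "i.e." or "e.g.".
        if inp[x] == "." and inp[x - 2] != "." and inp[x + 2] != ".":
            res.append(inp[last:x])
            last = x + 2  # skip the period and the space after it
    # Add the final sentence, dropping its trailing period only if present.
    res.append(inp[last:-1] if inp.endswith(".") else inp[last:])
    print(res)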
    

    If I use your input, I get this output (hopefully this is what you are looking for):

    ['During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing', 'ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors']
    

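    Note: if the text you are working with does not follow the usual writing conventions (for example, no space between letters or after the start of a new sentence), you may need to adjust this code.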
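One possible way to make that adjustment easier is to list the abbreviations explicitly instead of relying on character offsets. The following is a minimal sketch, not part of the answer above: the names ABBREVIATIONS and split_sentences are assumptions, and the list would need extending for other texts. It builds one fixed-width negative lookbehind per abbreviation, which Python's re module requires.

```python
import re

# Abbreviations written without their final period (assumed list; extend as needed).
ABBREVIATIONS = ["i.e", "e.g"]

# One negative lookbehind per abbreviation, then the period and the whitespace
# that separate two sentences.
_guards = "".join(f"(?<!{re.escape(abbr)})" for abbr in ABBREVIATIONS)
SENTENCE_END = re.compile(_guards + r"\.\s+")


def split_sentences(text):
    """Split on a period followed by whitespace, unless it closes a listed abbreviation."""
    return SENTENCE_END.split(text)


sample = (
    "Routing got complex. ISPs added policies, i.e. goals set by the owner. "
    "Some used filters, e.g. prefix lists."
)
result = split_sentences(sample)
print(result)
```

Because the pattern requires whitespace after the period, the last sentence keeps its final period, which matches the list shown in the question.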

    【Discussion】:
