Regex to separate two words linked by a full stop

【Question title】: Regex to separate two words linked by a full stop
【Posted】: 2020-07-08 19:34:04
【Question description】:

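I am processing a file and found words that are joined by a full stop. I think this is an error and I want to correct it, so I am looking for a regex to do that.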

['<repdns text="boys.aussi" />']
['<repdns text="interpretation.une" />']
['<repdns text="catastrophe.michelle" />']
['<repdns text="paquerettes.ewan" />']
['<repdns text="amour.hugh" />']

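I actually read a file and use TreeTagger to get the lemmas, but errors like the ones above show up, so I need to correct them before calling the TreeTagger function. I am stuck on which regex to use, because I do not want words containing ".com" or ".org" to be split.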

a = [' boys.aussi', 'interpretation.une', 'amour.hugh', 'amy.com', 'frenchemabassy.net']

alphabet = "([a-z][...])"
alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)[.]"
starters = "(M|Mr|Mme|Sr|Dr)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"
digits = "([0-9])"

# sépare les phrases (split the sentences)

def normalize(text):  # do_lower=False):
    text = re.sub(alphabets + "[.]" + alphabets,)
    
    
    return text

normalize(a)

Expected:

a = [' boys. aussi', 'interpretation. une', 'amour. hugh', 'amy.com', 'frenchemabassy.net']

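【Comments】:

"because I do not want words containing '.com' or '.org' to be split": note that '.com', '.org' and '.net' are not the only TLDs (top-level domains). How will you make sure the dot does not belong to one?
Might this help? (\w+\.)(?!org|net|com)(\w+). Add other TLDs accordingly. Let me know if it helps.
There is one more case I would like to know about: what happens with amy.co.in? Will the string always contain only one . (period character)?
What do you expect this line of code, text = re.sub(alphabets + "[.]" + alphabets,), to do?
You seem to want to cover both linguistic conventions and web conventions at once. Parsing natural language with regular expressions is really not possible. Naming your variables after parts of speech does not change that; it just does not work that way.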
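
As a rough illustration of the pattern suggested in the comments, here is a minimal sketch of how it could be applied with re.sub. The function name split_joined_words and the replacement string r'\1 \2' are assumptions made for this example; the comment itself only gave the pattern.

```python
import re

# Pattern taken from the comment above: a word ending in a dot, followed by
# a word that does not start with one of the listed TLDs.
PATTERN = re.compile(r'(\w+\.)(?!org|net|com)(\w+)')

def split_joined_words(text):
    # Hypothetical helper: re-emit both captured groups with a space between them.
    return PATTERN.sub(r'\1 \2', text)

a = [' boys.aussi', 'interpretation.une', 'amour.hugh', 'amy.com', 'frenchemabassy.net']
print([split_joined_words(s) for s in a])
```

This also makes the concern raised about amy.co.in concrete: with only org, net and com excluded, that string is split after "amy", because "co" is not in the exclusion list.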

Tags: python python-3.x regex list

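【Solution 1】:

Use a negative lookahead assertion in your regex, so that '.' is replaced with '. ' only when it is not followed by one of the special internet top-level domains: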

import re

def normalize(text):
    return re.sub(r'\.(?!(com|net|org|io|gov))', '. ', text)

a = [' boys.aussi', 'interpretation.une', 'amour.hugh', 'amy.com', 'frenchemabassy.net']
a = [normalize(s) for s in a]
print(a)

Prints:

[' boys. aussi', 'interpretation. une', 'amour. hugh', 'amy.com', 'frenchemabassy.net']

Note that I am only using the list of TLDs you had in your websites variable; there are many more that you may want to add.
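
One caveat worth illustrating: the lookahead (?!(com|net|org|io|gov)) rejects any following word that merely starts with a TLD, so a string like 'bonjour.commerce' would be left unsplit, and a dot that is not between two letters (as in '3.14') would still get a space after it. The sketch below is one possible stricter variant, not the answer's original code. The name normalize_strict, the TLDS tuple, and the choice to require a letter on both sides of the dot are assumptions for illustration.

```python
import re

# Assumed TLD list, copied from the question's `websites` variable.
TLDS = ("com", "net", "org", "io", "gov")

# Split only when the dot sits between two letters and the word after it is
# not exactly one of the TLDs. The \b keeps "com" from matching "commerce".
# [^\W\d_] matches a single letter (a word character that is not a digit or "_").
PATTERN = re.compile(
    r'(?<=[^\W\d_])\.(?!(?:' + '|'.join(TLDS) + r')\b)(?=[^\W\d_])'
)

def normalize_strict(text):
    # Hypothetical variant of normalize(): insert a space after qualifying dots.
    return PATTERN.sub('. ', text)

a = [' boys.aussi', 'interpretation.une', 'amour.hugh', 'amy.com', 'frenchemabassy.net']
print([normalize_strict(s) for s in a])
```

Multi-part domains such as amy.co.in are still not handled by this sketch; covering them would need either a longer exclusion list or a different approach.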

【Discussion】: