Counting presence of words within context near other words

【Question title】: Counting presence of words within context near other words
【Posted】: 2022-01-14 22:59:36
【Question】:

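I am trying to evaluate whether words from two named groups occur within 25 words of each other. I have two questions, which are probably closely related:

I am using this approach to evaluate whether certain words are near each other: http://www.regular-expressions.info/near.html. The original counter seems to work, but I want to split my code into two parts to double-check it. When I do that, however, my "counter3" shows a double-counting problem (i.e., it counts the word "purchased" when it should only count "sells"). Apart from using Python instead of Perl, this is almost identical to Counting presence of words within context (near other words).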

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"

counter1 = Counter(re.findall(r'\b((?:Androids\W+(?:\w+\W+){0,25}?\W+sells|purchased|resold)|sells|purchased|resold\W+(?:\w+\W+){0,25}?Androids)\b',text, re.DOTALL))

#to ensure my code is working correctly, I then want to split counter1 into two parts. However, counter3 is giving me a double-counting issue:
counter2 = Counter(re.findall(r'\b((?:Androids\W+(?:\w+\W+){0,25}?/W+sells|purchased|resold))\b',text, re.DOTALL))
counter3 = Counter(re.findall(r'\b((?:sells|purchased|resold\W+(?:\w+\W+){0,25}?/W+Androids))\b',text, re.DOTALL))

#Result: counter1= Counter({'sells': 1, 'purchased': 1, 'resold': 1})
#Result: counter2 = Counter({'purchased': 1, 'resold': 1})
#Result: counter3= Counter({'sells': 1, 'purchased': 1})


#I have also tried the below variation, which corrects counter3, but then causes an issue with counter2
counter2 = Counter(re.findall(r'\b((Androids)\W+(?:\w+\W+){0,25}?(sells|purchased|resold))\b',text, re.DOTALL))
counter3 = Counter(re.findall(r'\b((sells|purchased|resold)\W+(?:\w+\W+){0,25}?(Androids))\b',text, re.DOTALL))

#result counter2 = Counter({('Androids and Robots. Androids are then purchased',           'Androids','purchased'): 1})
#result counter3 = Counter({('sells Androids', 'sells', 'Androids'): 1})

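Next, I want to create variables for the word groups and then reference them in my regex. I am following this reference: How to use a variable inside a regular expression?. However, I still have problems (perhaps once question 1 is answered, it will lead me to the answer to question 2).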

Group1 ='Androids'
Group2 = 'sells |purchased |resold '

counter2 = Counter(re.findall(rf'\b(?:{Group1}\W+(?:\w+\W+){0,25}?{Group2})\b',text, re.DOTALL))
counter3 = Counter(re.findall(rf'\b(?:{Group2}\W+(?:\w+\W+){0,25}?{Group1})\b',text, re.DOTALL))


#Result - counter2 = Counter({'': 2})
#Result - counter3 = Counter({'': 2})

#interestingly, if I try an alternative variation (i.e., removing ?:), which fixed counter3 in my first question, it does not fix the issue when I try to reference the variables 

counter2 = Counter(re.findall(rf'\b({Group1}\W+(?:\w+\W+){0,25}?{Group2})\b',text, re.DOTALL))
counter3 = Counter(re.findall(rf'\b({Group2}\W+(?:\w+\W+){0,25}?{Group1})\b',text, re.DOTALL))

#Result - counter2 = Counter({('purchased ', ''): 1, ('resold ', ''): 1})
#Result counter3 = Counter({('sells ', ''): 1, ('purchased ', ''): 1})

Any help would be great, because I feel like I am going a bit crazy trying different variations to get this code to work! Thanks!

【Question comments】:

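You have both "Andriods" and "Androids" in text. Is that intentional? You have all of this code, but you never actually state in English what you are trying to count ("counting the word purchased" is a bit vague) or what output you expect. If you are trying to match, in the text string, 'Andriods' (note the spelling) separated from one of ('sells', 'purchased', 'resold') in either order, then there is only one match, namely 'sells Andriods'. So why do you have Group1 = 'Androids' (note the spelling) in part 2? And why would you expect to see the word "purchased"?
If you are in fact looking for one of ('sells', 'purchased', 'resold'), you are seeing "purchased" only because your regex is incorrect. Instead of sells|purchased|resold, you should have (?:sells|purchased|resold)
Thank you for looking into this! When I change the regex to include (?:sells|purchased|resold), the counter ends up empty (result = Counter()): counter3 = Counter(re.findall(r'\b((?:sells|purchased|resold)\W+(?:\w+\W+){0,25}(?:Androids))\b',text, re.DOTALL))
Also, thanks for spotting my typo. I have updated the code to use "Androids" throughout... the results did not change, which tells me I have more problems than I realized. For counter2, I expect Counter = "purchased", "resold", because the word Androids appears before purchased and resold. For counter3, I expect Counter = "sells", because the word sells appears before the word Androids.
I think you should update your question. If you are looking for "sells", "purchased", or "resold", say so in English and do not make us guess it from an incorrect regex. If not, still state what you are trying to match. But I do believe your regex does not follow the pattern from the link you cite.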

Tags: python regex count counter findall

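【Solution 1】:

If you are looking for "Androids" separated by at most 25 words from one of "sells", "purchased", or "resold", then the following finds all of the matches and counts every word across those matches. If you want something different, you should state in plain English what you want (this is based strictly on a simple but logical substitution into the pattern from the link the OP provided):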

import re
from collections import Counter

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"
regex = r'\b(?:Androids\W+(?:\w+\W+){0,25}?(?:sells|purchased|resold)|(?:sells|purchased|resold)\W+(?:\w+\W+){0,25}?Androids)\b'
matches = re.findall(regex, text)
print(matches)
c = Counter()
for match in matches:
    c.update(match.split())
print(c)

Prints:

['sells Androids', 'Androids are then purchased']
Counter({'Androids': 2, 'sells': 1, 'are': 1, 'then': 1, 'purchased': 1})

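The pattern in the link is designed for matching single words. When you plug in what you are looking for and have an "or" situation (a choice of words, any one of which satisfies the match), then because of operator precedence you must put parentheses around the group of words separated by |. And so as not to introduce an extra capturing group, these must be non-capturing parentheses, i.e. (?: ... ).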
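To make the precedence point concrete, here is a minimal sketch that runs three variants of the "keyword before Androids" half of the pattern on the same text. The variable names `ungrouped`, `non_capturing`, and `capturing` are illustrative only and do not come from the original answer.

```python
import re

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"

# No parentheses around the keywords: "|" has the lowest precedence, so the
# three alternatives are "sells", "purchased", and "resold\W+...Androids".
ungrouped = r'\b(?:sells|purchased|resold\W+(?:\w+\W+){0,25}?Androids)\b'

# Non-capturing parentheses confine the alternation to the keyword list.
non_capturing = r'\b(?:sells|purchased|resold)\W+(?:\w+\W+){0,25}?Androids\b'

# Capturing parentheses match the same text, but findall now returns the group.
capturing = r'\b(sells|purchased|resold)\W+(?:\w+\W+){0,25}?Androids\b'

print(re.findall(ungrouped, text))      # ['sells', 'purchased']
print(re.findall(non_capturing, text))  # ['sells Androids']
print(re.findall(capturing, text))      # ['sells']
```

The first result reproduces the double counting reported for counter3 in the question: "purchased" matches on its own, without any nearby "Androids".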

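Now, if you want to count things differently, use this as a starting point. But pay attention to what happens when you start adding capturing groups, so that you understand how they affect the findall method.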
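As one possible starting point, the sketch below builds both directional patterns from variables and captures only the keyword, so that findall returns the keyword strings directly. The names `target`, `keywords`, `after_target`, and `before_target` are hypothetical. It assumes the patterns are written as rf-strings, in which the quantifier braces must be doubled (`{{0,25}}`); a bare `{0,25}` inside an f-string is formatted as the Python tuple `(0, 25)`.

```python
import re
from collections import Counter

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"

# Hypothetical variable names; note there are no trailing spaces in the alternatives.
target = "Androids"
keywords = "sells|purchased|resold"

# {{0,25}} in an rf-string yields the literal regex quantifier {0,25}.
# The only capturing group is the keyword, so findall returns keyword strings.
after_target = rf'\b(?:{target})\W+(?:\w+\W+){{0,25}}?({keywords})\b'
before_target = rf'\b({keywords})\W+(?:\w+\W+){{0,25}}?(?:{target})\b'

counter2 = Counter(re.findall(after_target, text))
counter3 = Counter(re.findall(before_target, text))

print(counter2)  # Counter({'purchased': 1})
print(counter3)  # Counter({'sells': 1})
```

Because findall returns non-overlapping matches and the lazy quantifier stops at the first keyword, counter2 reports only "purchased" here; "resold" comes after the end of that match and is not counted. Counting every keyword that follows the target would need a different approach than this single pattern.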

【Comments】: