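【Posted】: 2022-01-14 22:59:36
【Question】:
I am trying to check whether words from two named groups occur within 25 words of each other. I have two questions, which are probably closely related:
- I am using the approach described at http://www.regular-expressions.info/near.html to check whether certain words occur near each other. The original counter seems to work, but I wanted to split my code into two parts to double-check it. When I do that, my "counter3" double counts (i.e., it counts the word "purchased" when it should only count "sells"). Apart from using Python instead of Perl, this is almost identical to "Counting presence of words within context (near other words)".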
import re
from collections import Counter

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"
counter1 = Counter(re.findall(r'\b((?:Androids\W+(?:\w+\W+){0,25}?\W+sells|purchased|resold)|sells|purchased|resold\W+(?:\w+\W+){0,25}?Androids)\b',text, re.DOTALL))
# To make sure my code is working correctly, I then split counter1 into two parts. However, counter3 gives me a double-counting issue:
counter2 = Counter(re.findall(r'\b((?:Androids\W+(?:\w+\W+){0,25}?/W+sells|purchased|resold))\b',text, re.DOTALL))
counter3 = Counter(re.findall(r'\b((?:sells|purchased|resold\W+(?:\w+\W+){0,25}?/W+Androids))\b',text, re.DOTALL))
#Result: counter1= Counter({'sells': 1, 'purchased': 1, 'resold': 1})
#Result: counter2 = Counter({'purchased': 1, 'resold': 1})
#Result: counter3= Counter({'sells': 1, 'purchased': 1})
#I have also tried the below variation, which corrects counter3, but then causes an issue with counter2
counter2 = Counter(re.findall(r'\b((Androids)\W+(?:\w+\W+){0,25}?(sells|purchased|resold))\b',text, re.DOTALL))
counter3 = Counter(re.findall(r'\b((sells|purchased|resold)\W+(?:\w+\W+){0,25}?(Androids))\b',text, re.DOTALL))
#result counter2 = Counter({('Androids and Robots. Androids are then purchased', 'Androids','purchased'): 1})
#result counter3 = Counter({('sells Androids', 'sells', 'Androids'): 1})
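- Next, I want to create variables for the word groups and then reference them in my regex. I am following this reference: "How to use a variable inside a regular expression?". However, I still have problems (perhaps once question 1 is answered, it will lead me to the answer to question 2).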
Group1 ='Androids'
Group2 = 'sells |purchased |resold '
counter2 = Counter(re.findall(rf'\b(?:{Group1}\W+(?:\w+\W+){0,25}?{Group2})\b',text, re.DOTALL))
counter3 = Counter(re.findall(rf'\b(?:{Group2}\W+(?:\w+\W+){0,25}?{Group1})\b',text, re.DOTALL))
#Result - counter2 = Counter({'': 2})
#Result - counter3 = Counter({'': 2})
#interestingly, if I try an alternative variation (i.e., removing ?:), which fixed counter3 in my first question, it does not fix the issue when I try to reference the variables
counter2 = Counter(re.findall(rf'\b({Group1}\W+(?:\w+\W+){0,25}?{Group2})\b',text, re.DOTALL))
counter3 = Counter(re.findall(rf'\b({Group2}\W+(?:\w+\W+){0,25}?{Group1})\b',text, re.DOTALL))
#Result - counter2 = Counter({('purchased ', ''): 1, ('resold ', ''): 1})
#Result counter3 = Counter({('sells ', ''): 1, ('purchased ', ''): 1})
Any help would be great, as I feel I am going a bit crazy trying different variations to get this code to work! Thank you!
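One likely cause of the empty-string results in the second question is the `{0,25}` quantifier sitting inside an f-string. Python evaluates `{0,25}` as the expression `0,25` (a tuple) and substitutes its text, so the quantifier turns into a capture group `(0, 25)`. A minimal sketch of the effect and of the usual fix (doubling the braces) follows. The lowercase names `group1`, `group2`, `broken`, `fixed`, `pattern`, and `matches` are illustrative, and the trailing spaces in the original `Group2` are dropped here as an assumption about what was intended.

```python
import re

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"

group1 = 'Androids'
group2 = 'sells|purchased|resold'  # assumption: no trailing spaces needed

# Inside an f-string, {0,25} is evaluated as the tuple (0, 25),
# so the regex quantifier is replaced by a capture group.
broken = rf'(?:\w+\W+){0,25}?'
print(broken)  # (?:\w+\W+)(0, 25)?

# Doubling the braces keeps them as literal regex braces.
fixed = rf'(?:\w+\W+){{0,25}}?'
print(fixed)   # (?:\w+\W+){0,25}?

# Wrap the alternation in its own group so \W+ applies to every verb,
# and capture only the verb so findall returns plain strings.
pattern = rf'\b(?:{group1})\W+(?:\w+\W+){{0,25}}?({group2})\b'
matches = re.findall(pattern, text)
print(matches)  # ['purchased']
```

Only 'purchased' is reported here because `re.findall` returns non-overlapping matches: the single match already consumes both occurrences of "Androids", so nothing is left to pair with "resold".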
【Comments】:
-
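You have both "Andriods" and "Androids" in text. Is that intentional? You have all this code, but you never actually state in plain English what you are trying to count ("counts the word purchased" is a bit vague) or what output you expect. If you are trying to match 'Andriods' (note the spelling) in the text string, separated in either order from one of ('sells', 'purchased', 'resold'), then there is only one match, 'sells Andriods'. So why does part 2 have Group1 = 'Androids' (note the spelling)? And why would you expect to see the word "purchased"?
-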
If you really are looking for one of ('sells', 'purchased', 'resold'), you see "purchased" only because your regex is incorrect. Instead of sells|purchased|resold, you should have (?:sells|purchased|resold)
-
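The grouping point in the comment above can be checked with a small sketch. It assumes the corrected spelling "Androids" in text and drops the stray `/W+` from the question's split patterns; the variable names `ungrouped`, `grouped`, and `verb_only` are illustrative and not from the original post.

```python
import re

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"

# Ungrouped: | has the lowest precedence, so this reads as three alternatives:
#   "sells"  |  "purchased"  |  "resold\W+ ... Androids"
ungrouped = re.findall(
    r'\b(?:sells|purchased|resold\W+(?:\w+\W+){0,25}?Androids)\b', text)
print(ungrouped)  # ['sells', 'purchased']

# Grouped: one of the three verbs, then up to 25 words, then "Androids".
grouped = re.findall(
    r'\b(?:sells|purchased|resold)\W+(?:\w+\W+){0,25}?Androids\b', text)
print(grouped)    # ['sells Androids']

# Capturing only the verb makes findall return just the word to count.
verb_only = re.findall(
    r'\b(sells|purchased|resold)\W+(?:\w+\W+){0,25}?Androids\b', text)
print(verb_only)  # ['sells']
```

The ungrouped result reproduces the reported counter3 double count: "sells" and "purchased" match on their own, with no "Androids" required after them.
-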
Thanks for looking into this! When I change the regex to include (?:sells|purchased|resold), the counter ends up empty (result = Counter()): counter3 = Counter(re.findall(r'\b((?:sells|purchased|resold)\W+(?:\w+\W+){0,25}(?:Androids))\ b',text, re.DOTALL))
-
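Also, thanks for spotting my typo. I have updated the code to use "Androids" throughout... the results did not change, which tells me I have more problems than I realized. For counter2, I expect Counter = "purchased", "resold", because the word Androids appears before purchased and resold. For counter3, I expect Counter = "sells", because the word sells appears before the word Androids.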
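The expected counts described here can be reproduced without a single large regex. The sketch below swaps in a different technique: it splits the text into words and compares word positions. That sidesteps the non-overlapping behaviour of `re.findall`, which is one reason "resold" goes missing in the regex versions. All names (`TARGET`, `VERBS`, `MAX_GAP`, `verbs_after_target`, `verbs_before_target`) are hypothetical, and `MAX_GAP` is assumed to mean the number of words allowed between the two terms, mirroring `{0,25}`.

```python
import re
from collections import Counter

text = "CompanyA sells Androids and Robots. Androids are then purchased and resold by Company"

TARGET = "Androids"                       # hypothetical names, not from the question
VERBS = {"sells", "purchased", "resold"}
MAX_GAP = 25                              # words allowed between the two terms

words = re.findall(r"\w+", text)
target_positions = [i for i, w in enumerate(words) if w == TARGET]

verbs_after_target = Counter()   # TARGET ... verb  (the intended counter2)
verbs_before_target = Counter()  # verb ... TARGET  (the intended counter3)

for i, word in enumerate(words):
    if word not in VERBS:
        continue
    # A verb counts once per direction if any TARGET lies within range.
    if any(0 < i - t <= MAX_GAP + 1 for t in target_positions):
        verbs_after_target[word] += 1
    if any(0 < t - i <= MAX_GAP + 1 for t in target_positions):
        verbs_before_target[word] += 1

print(verbs_after_target)   # Counter({'purchased': 1, 'resold': 1})
print(verbs_before_target)  # Counter({'sells': 1})
```

Like the regex versions, this sketch ignores sentence boundaries, so "sells" and the second "Androids" are treated as near each other even though a period separates them.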
-
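I think you should update your question. If you are looking for "sells", "purchased", or "resold", say so in plain English; don't make us guess it from an incorrect regex. If not, still say what you are trying to match. But I do believe your regex does not follow the pattern from the link you cite.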
Tags: python regex count counter findall