【Question Title】: How to retrieve the whole sentence around a selected word?
【Posted】: 2018-09-28 08:21:48
【Question】:

I want to find a selected word and take everything from the first period (.) before it to the first period (.) after it.

Example:

In a file called 'text.php':

'The price of blueberries has gone way up. In the year 2038 blueberries have 
 almost tripled in price from what they were ten years ago. Economists have 
 said that berries may going up 300% what they are worth today.'

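Code example: (I know that with code like this I can grab up to 5 words before and up to 5 words after the word ['that'], but I want everything between the period before it and the period after it, i.e. the whole sentence.)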

import re

# Adjacent string literals are joined, so the text can span several source lines
text = ('The price of blueberries has gone way up, that might cause trouble for farmers. '
        'In the year 2038 blueberries have almost tripled in price from what they were ten years '
        'ago. Economists have said that berries may going up 300% what they are worth today.')

# Up to 5 words before 'that' and up to 5 words after it
find = re.search(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,5}that(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,5}", text)
done = find.group()
print(done)

Returns:

'blueberries has gone way up, that might cause trouble for farmers'

I want it to return every sentence that contains ['that'].

Example return (what I am trying to get):

'The price of blueberries has gone way up, that might cause trouble for farmers',
'Economists have said that berries may going up 300% what they are worth today'

【Comments】:

  • What is capital = (if get('test') then get('friendly')) supposed to do? That syntax is not valid in Python. Describe the logic of your script.

Tags: python python-2.7 python-requests full-text-search


【Solution 1】:

I would do it like this:

text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
for sentence in text.split('.'):
    if 'that' in sentence:
        print(sentence.strip())

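The .strip() is only there to trim the extra whitespace left over from splitting on the period.

If you do want to use the re module, I would use something like this: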

import re

text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
# Each match is a run of non-period characters that contains 'that'
results = re.findall(r"[^.]+that[^.]+", text)
# A list comprehension prints as a list on both Python 2 and Python 3 (map() is lazy on Python 3)
results = [x.strip() for x in results]
print(results)

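This gives the same result.


Caveats:

  • If a sentence contains a word such as thatcher, that sentence will be printed as well. In the first solution you can use if 'that' in sentence.split(): instead, so that the string is split into words; in the second solution you can use re.findall(r"[^.]+\bthat\b[^.]+", text) (note the \b markers; they stand for word boundaries).

  • The script relies on the period (.) to delimit sentences. If a sentence itself contains words that use a period, the result may not be what you expect (e.g. for the sentence Dr. Tom is sick yet again today, so I'm substituting for him., the script will treat Dr as one sentence and Tom is sick yet again today, so I'm substituting for him as another).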
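
Both caveats can be softened with a little extra code. What follows is only a minimal sketch and not part of the original answer: the names split_sentences, sentences_with_word and ABBREVIATIONS are hypothetical, and the abbreviation list is an assumption that would need extending for real text. It splits on whitespace that follows a sentence-ending mark, re-joins pieces that were cut right after a known abbreviation, and matches the search word on word boundaries.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; extend it for real text.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace, then re-join
    pieces that were cut right after a known abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    carry = ""
    for piece in pieces:
        if not piece:
            continue
        current = carry + piece
        if current.split()[-1] in ABBREVIATIONS:
            carry = current + " "  # the sentence is not finished yet
        else:
            sentences.append(current)
            carry = ""
    if carry:
        sentences.append(carry.strip())
    return sentences


def sentences_with_word(text, word):
    """Return the sentences that contain `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [s for s in split_sentences(text) if pattern.search(s)]


print(sentences_with_word("The thatcher left. I think that works.", "that"))
```

Unlike text.split('.'), this keeps the closing punctuation on each sentence. For anything beyond simple text, a dedicated sentence tokenizer is likely to be more reliable than a hand-written abbreviation list.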


Edit: To answer your question from the comments, I would make the following changes:

Solution 1:

text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
sentences = text.split('.')
for i, sentence in enumerate(sentences):
    if 'almost' in sentence:
        before = '' if i == 0 else sentences[i-1].strip()
        middle = sentence.strip()
        after = '' if i == len(sentences)-1 else sentences[i+1].strip()
        print(". ".join([before, middle, after]))

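Solution 2:

import re

text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
# Optional previous sentence, the sentence containing 'almost', then the optional next sentence.
# The trailing group starts with '\. ' so that it can actually capture the following sentence.
results = re.findall(r"(?:[^.]+\. )?[^.]+almost[^.]+(?:\. [^.]+)?", text)
results = [x.strip() for x in results]
print(results)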

Note that these may produce overlapping results. For example, if the text is a. b. b. c. and you are looking for sentences containing b, you will get a. b. b and b. b. c
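
If the overlap is unwanted, one option is to work with sentence indices and merge ranges that overlap before joining them back into text. The sketch below is an illustration under assumptions rather than the answerer's code: sentences_with_context and its radius parameter are hypothetical names, sentences are split on '.' as in Solution 1, and the word is matched against whitespace-separated tokens, so a word with punctuation attached (such as "that,") would be missed.

```python
def sentences_with_context(text, word, radius=1):
    """Return passages made of each matching sentence plus `radius`
    neighbours on either side, merging passages that overlap."""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    hits = [i for i, s in enumerate(sentences) if word in s.split()]

    merged = []  # inclusive [start, end] index ranges, in ascending order
    for i in hits:
        start = max(0, i - radius)
        end = min(len(sentences) - 1, i + radius)
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of adding a new one
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    return [". ".join(sentences[a:b + 1]) for a, b in merged]


print(sentences_with_context("a. b. b. c.", "b"))
```

With the text a. b. b. c. this yields a single passage covering all four sentences instead of two overlapping ones.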

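【Comments】:

  • This works great and is exactly what I was looking for. Just out of curiosity, is there a way to also get the sentence after the one that was found? I think it could be done with the same code you provided, but skipping every other "." (period).
  • @johnsmith Sorry, I'm not quite sure what you mean :s
  • Say I put the word "almost" into the script instead of the word "that" and wanted to get all three sentences, so that it returns: "The price of blueberries has gone way up. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today." What would I add to the script so that it returns not only the sentence containing "almost" but also the sentences before and after it? Is that possible?
  • @johnsmith Ah! Yes, that is possible, but it is not easy to fit into a comment; I'll add it to my answer in a few
  • @johnsmith OK, added
【Solution 2】: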

This function should do the job:

old_text = 'test 1: test friendly, test 2: not friendly, test 3: test friendly, test 4: not friendly, test 5: not friendly'

replace_dict={'test 1':'tested 1','not':'very'}

Function:

def replace_me(text, replace_dict):
    # Apply each replacement in turn (plain substring replacement)
    for key, value in replace_dict.items():
        text = text.replace(str(key), str(value))
    return text

Result:

print(replace_me(old_text,replace_dict))
Out: 'tested 1: test friendly, test 2: very friendly, test 3: test friendly, test 4: very friendly, test 5: very friendly'
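
One thing to keep in mind with chained str.replace calls is that a later replacement can act on the output of an earlier one (for example with {'a': 'b', 'b': 'a'}). A possible alternative, sketched here under assumptions and not taken from the answer, is to do all replacements in a single regex pass; replace_once is a hypothetical name.

```python
import re


def replace_once(text, replace_dict):
    """Replace every key with its value in one pass over the text,
    so replaced text is never replaced a second time."""
    if not replace_dict:
        return text
    # Longest keys first so that 'test 1' wins over a shorter key such as 'test'
    keys = sorted(replace_dict, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: replace_dict[m.group(0)], text)


print(replace_once("a b", {"a": "b", "b": "a"}))
```

For the dictionary used above the result is the same as with replace_me; the two only differ when one replacement's output contains another key.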

【Comments】:
