【Question title】: Trim string (both left and right) to nearest word or sentence
【Posted】: 2018-01-20 10:20:16
【Question】:

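I'm writing a function that finds each occurrence of a keyword in a larger text, together with the text around it. So far it works, it just isn't pretty.

I can't trim the resulting string to the nearest sentence or whole word without leaving stray characters behind. The trim distance should be based on the number of words on either side of the keyword.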

keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

with 1 word distance (either side of key word) it should result in:
2 occurrences found
"This marble is..."
"...this marble. Kwoo-oooo-waaa!"

with 2 word distance:
2 occurrences found
"Right. This marble is as..."
"...as this marble. Kwoo-oooo-waaa! Ahhhk!"

What I have so far is based on character distance, not word distance:

2 occurrences found
"ght. This marble is as sli"
"y as this marble. Kwoo-ooo"

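A regular expression could probably trim this to the nearest whole word or sentence, though. Is that the most Pythonic way to achieve this? Here is what I have so far: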

import re

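def trim_string(s, num):
  # Keep the first `num` characters plus the rest of the word they end in.
  # `num` has to be interpolated into the pattern, and Python's re module
  # writes backreferences as \1 rather than $1.
  trimmed = re.sub(r"^(.{%d}[^\s]*).*" % num, r"\1", s) # will only trim from left and not very well
  #^(.*)(marble)(.+) # only finds second occurrence???

  return trimmed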

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
t = "Marble"


if t.lower() in s.lower():

  count = s.lower().count(t.lower())
  print ("%s occurrences of %s" %(count, t))

  original_s = s

  for i in range (0, count):
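    idx = s.lower().index(t.lower())  # case-insensitive search, consistent with the count above
    # print(idx)

    dist = 10
    start = max(idx-dist, 0)  # a negative start would slice from the end of the string
    end = len(t) + idx+dist
    a = s[start:end]

    print(a)
    print(trim_string(a,5))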

    s = s[idx+len(t):]

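Thanks.

【Comments】:

  • How do you want to handle whitespace? If you only consider single spaces between "words", you can use .split() on the input text, then use list indexing to work with subsets and rejoin the words into a single string. If that works for you, it saves you from using regex at all.
  • I don't want any leading or trailing whitespace in the result, if that's what you mean. The ellipses (...) are included to show that the string has been cut at that point.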
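
The .split() suggestion in the first comment can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name keyword_windows is hypothetical, words are taken to be whitespace-separated, and punctuation attached to a word is stripped before comparing it with the keyword.

```python
import string


def keyword_windows(text, keyword, distance):
    """Return the words around each occurrence of `keyword`.

    Hypothetical helper: splits on whitespace and compares each word
    case-insensitively after stripping surrounding punctuation.
    """
    words = text.split()
    target = keyword.lower()
    windows = []
    for i, word in enumerate(words):
        if word.strip(string.punctuation).lower() == target:
            start = max(i - distance, 0)  # clamp at the start of the text
            windows.append(" ".join(words[start:i + distance + 1]))
    return windows


s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
for window in keyword_windows(s, "Marble", 1):
    print(window)
```

Because the result is rebuilt with " ".join, it has no leading or trailing whitespace; adding the ellipses is left to the caller.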

Tags: python regex string python-2.7 split


【Solution 1】:

more_itertools.adjacent [1] is a tool that probes neighboring elements.

import operator as op
import itertools as it

import more_itertools as mit


# Given
keyword = "marble"
iterable = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

Code

words = iterable.split(" ")
pred = lambda x: x in (keyword, "".join([keyword, "."]))

neighbors = mit.adjacent(pred, words, distance=1)    
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]
# Out: ['This marble is', 'this marble. Kwoo-oooo-waaa!']

neighbors = mit.adjacent(pred, words, distance=2)
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]
# Out: ['Right. This marble is as', 'as this marble. Kwoo-oooo-waaa! Ahhhk!']

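The OP can adjust the final output of these results as desired.


Details

The given string is split into an iterable of words. A simple predicate [2] is defined that returns True if an item of the iterable is the keyword (or the keyword with a trailing period).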

words = iterable.split(" ")
pred = lambda x: x in (keyword, "".join([keyword, "."]))

neighbors = mit.adjacent(pred, words, distance=1)
list(neighbors)

The more_itertools.adjacent tool yields (bool, word) tuples, shown here as a list:

Output

[(False, 'Right.'),
 (True, 'This'),
 (True, 'marble'),
 (True, 'is'),
 (False, 'as'),
 (False, 'slippery'),
 (False, 'as'),
 (True, 'this'),
 (True, 'marble.'),
 (True, 'Kwoo-oooo-waaa!'),
 (False, 'Ahhhk!')]

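The first item of each tuple is True for any valid occurrence of the keyword and for the neighboring words within a distance of 1. We use this boolean together with itertools.groupby to find consecutive neighboring items and group them. For example: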

neighbors = mit.adjacent(pred, words, distance=1)
[(k, list(g)) for k, g in it.groupby(neighbors, op.itemgetter(0))]

Output

[(False, [(False, 'Right.')]),
 (True, [(True, 'This'), (True, 'marble'), (True, 'is')]),
 (False, [(False, 'as'), (False, 'slippery'), (False, 'as')]),
 (True, [(True, 'this'), (True, 'marble.'), (True, 'Kwoo-oooo-waaa!')]),
 (False, [(False, 'Ahhhk!')])]

Finally, we apply a condition that filters out the False groups and join the remaining strings together.

neighbors = mit.adjacent(pred, words, distance=1)    
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]

Output

['This marble is', 'this marble. Kwoo-oooo-waaa!']

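[1] more_itertools is a third-party library that implements many useful tools, including the itertools recipes.

[2] Note that a stronger predicate can certainly be written for keywords followed by any punctuation, but this one is kept simple.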
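
As an illustration of the "stronger predicate" mentioned in the second footnote, here is one possible sketch. The helper name make_keyword_pred is hypothetical (it is not part of more_itertools), and it assumes that punctuation only needs to be stripped from the ends of each word and that matching should ignore case.

```python
import string


def make_keyword_pred(keyword):
    """Build a predicate that matches `keyword` regardless of case or of
    punctuation attached to either end of the word (hypothetical helper)."""
    target = keyword.lower()

    def pred(word):
        return word.strip(string.punctuation).lower() == target

    return pred


pred = make_keyword_pred("marble")
words = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!".split(" ")
flags = [pred(w) for w in words]  # True only at the two keyword positions
print(flags)
```

The returned pred takes a single word and returns a bool, so it should be usable in place of the lambda passed to mit.adjacent above.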

【Comments】:

    【Solution 2】:

    If you ignore punctuation, you can do this without re:

    import itertools as it
    import string
    
    def nwise(iterable, n):
        ts = it.tee(iterable, n)
        for c, t in enumerate(ts):
            next(it.islice(t, c, c), None)
        return zip(*ts)
    
    def grep(s, k, n):
        m = str.maketrans('', '', string.punctuation)
        return [' '.join(x) for x in nwise(s.split(), n*2+1) if x[n].translate(m).lower() == k]
    
    In []:
    keyword = "marble"
    sentence = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
    print('...\n...'.join(grep(sentence, keyword, n=2)))
    
    Out[]:
    Right. This marble is as...
    ...as this marble. Kwoo-oooo-waaa! Ahhhk!
    
    In []:
    print('...\n...'.join(grep(sentence, keyword, n=1)))
    
    Out[]:
    This marble is...
    ...this marble. Kwoo-oooo-waaa!
    

    【Comments】:

      【Solution 3】:

      Using the ngrams() function from this answer, here is an approach that simply collects all n-grams and then selects those with the keyword in the middle:

      def get_ngrams(document, n):
          words = document.split(' ')
          ngrams = []
          for i in range(len(words)-n+1):
              ngrams.append(words[i:i+n])
          return ngrams
      
      keyword = "marble"
      string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
      
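      n = 3          # window size; use an odd n so the keyword sits in the middle
      pos = n // 2   # middle index; floor division behaves the same in Python 2 and 3
      # ignore punctuation by matching the middle word up to the number of chars in keyword
      result = [ng for ng in get_ngrams(string, n) if ng[pos][:len(keyword)] == keyword]
      # result: [['This', 'marble', 'is'], ['this', 'marble.', 'Kwoo-oooo-waaa!']]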
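
One thing to keep in mind with fixed-size n-grams: a keyword that is closer to the start or end of the text than the chosen distance never lands in the middle of a window, so it is skipped. A possible workaround, sketched here under assumptions, is to pad the word list on both sides. The function name get_centered_ngrams and the None padding are hypothetical choices, and the prefix comparison is carried over from the code above (so it would also match a word like "marbles").

```python
def get_centered_ngrams(document, keyword, distance):
    """Return joined windows of up to `distance` words on each side of
    every word that starts with `keyword` (hypothetical variant)."""
    pad = [None] * distance          # placeholders so edge words get a window
    words = pad + document.split(' ') + pad
    n = 2 * distance + 1             # odd window size, keyword in the middle
    hits = []
    for i in range(len(words) - n + 1):
        ng = words[i:i + n]
        middle = ng[distance]
        if middle is not None and middle[:len(keyword)] == keyword:
            # drop the padding again before joining
            hits.append(' '.join(w for w in ng if w is not None))
    return hits


print(get_centered_ngrams("marble is as slippery as this marble.", "marble", 2))
```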
      

      【Comments】:

        【Solution 4】:

        You can use this regex to match up to N non-whitespace substrings on either side of marble:

        For 2 words:

        (?:(?:\S+\s+){0,2})?\bmarble\b\S*(?:\s+\S+){0,2}
        

        RegEx breakdown:

        (?:(?:\S+\s+){0,2})? # optionally match up to 2 non-whitespace tokens (each followed by whitespace) before the keyword
        \bmarble\b\S*        # match the word "marble" followed by zero or more non-whitespace characters
        (?:\s+\S+){0,2}      # match up to 2 non-whitespace tokens after the keyword
        

        RegEx Demo

        Regex for 1 word:

        (?:(?:\S+\s+){0,1})?\bmarble\b\S*(?:\s+\S+){0,1}
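
To run this pattern from Python for an arbitrary distance, one option is to build it with string formatting and pass it to re.findall. This is a minimal sketch: the helper name keyword_context and the use of re.escape on the keyword are assumptions for illustration, not part of the answer, and the \b anchors assume the keyword starts and ends with a word character.

```python
import re


def keyword_context(text, keyword, distance):
    """Find `keyword` with up to `distance` whitespace-separated tokens on
    each side, using the pattern from this answer (hypothetical helper)."""
    pattern = r"(?:(?:\S+\s+){0,%d})?\b%s\b\S*(?:\s+\S+){0,%d}" % (
        distance, re.escape(keyword), distance)
    # Only non-capturing groups are used, so findall returns whole matches.
    return re.findall(pattern, text)


s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
for match in keyword_context(s, "marble", 2):
    print(match)
```

Matches found by re.findall do not overlap, so two occurrences that sit closer together than the distance will not each get a full window.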
        

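        【Comments】:

        • The regex will also capture words such as marbleilz; it should be \W* after the word, not \S*.
        • This still has a bug if there is no space after the ., for example: regex101.com/r/8HAdYg/3
        • It could be: (?:(?:\S+\s+){0,2})?\bmarble\b\S?(?:\s*\S+){0,2}, but we don't know whether missing whitespace between words is a realistic use case. Only the OP can tell us.
        • Fair enough :)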