【Posted】: 2018-01-20 10:20:16
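【Question】:
I'm writing a function that finds every occurrence of a keyword in a larger piece of text, together with the text around it. So far it works, it just isn't pretty.
I can't work out how to trim the resulting string to the nearest sentence or whole word without leaving stray characters hanging off either end. The trim distance should be based on the number of words on either side of the keyword.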
keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
With a distance of 1 word (on either side of the keyword) it should result in:
2 occurrences found
"This marble is..."
"...this marble. Kwoo-oooo-waaa!"
With a distance of 2 words:
2 occurrences found
"Right. This marble is as..."
"...as this marble. Kwoo-oooo-waaa! Ahhhk!"
What I have so far is based on character distance, not word distance:
2 occurrences found
"ght. This marble is as sli"
"y as this marble. Kwoo-ooo"
A regex could probably split it at the nearest whole word or sentence, though. Is that the most Pythonic way to achieve this? This is what I've got so far:
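import re

def trim_string(s, num):
    # Keep the first `num` characters plus the rest of the word they land in.
    # Python does not interpolate {num} inside the pattern, and the
    # replacement backreference is \1, not $1.
    trimmed = re.sub(r"^(.{%d}\S*).*" % num, r"\1", s)  # will only trim from left and not very well
    #^(.*)(marble)(.+) # only finds second occurrence???
    return trimmed

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
t = "Marble"
if t.lower() in s.lower():
    count = s.lower().count(t.lower())
    print("%s occurrences of %s" % (count, t))
    original_s = s
    for i in range(count):
        # Search case-insensitively, consistent with the count above
        idx = s.lower().index(t.lower())
        # print(idx)
        dist = 10
        start = max(0, idx - dist)  # a negative start would slice from the end
        end = len(t) + idx + dist
        a = s[start:end]
        print(a)
        print(trim_string(a, 5))
        s = s[idx + len(t):]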
Thanks.
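A note on the commented-out pattern ^(.*)(marble)(.+) in the code above: .* is greedy, so it consumes as much text as it can and the (marble) group ends up on the last occurrence, which is why only the second one is found. Below is a minimal sketch of the difference, plus re.finditer as one way to get the position of every occurrence without re-slicing the string. The variable names are only for illustration.

```python
import re

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

# Greedy: ^(.*) takes as much as possible, so group 2 is the LAST "marble".
greedy = re.match(r"^(.*)(marble)(.+)", s)
print(greedy.start(2))  # 42

# Non-greedy: ^(.*?) takes as little as possible, so group 2 is the FIRST one.
lazy = re.match(r"^(.*?)(marble)(.+)", s)
print(lazy.start(2))  # 12

# finditer yields every non-overlapping match; IGNORECASE lets "Marble"
# match "marble", and re.escape keeps the keyword literal.
starts = [m.start() for m in re.finditer(re.escape("Marble"), s, re.IGNORECASE)]
print(starts)  # [12, 42]
```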
【Comments】:
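- How do you want to handle whitespace? If you only care about single spaces between "words", you could use .split() on the input text, then use list indexing to work on subsets and re-join the words into a single string. If that works for you, it would save you from using regex at all.
- I don't want any leading or trailing whitespace in the result, if that's what you mean. The ellipsis (...) is only there to show that the string has been cut at that point.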
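The .split() suggestion from the first comment could look roughly like this. It is a sketch under stated assumptions, not a definitive answer. The function name keyword_contexts and the SENTENCE_END tuple are made up for illustration. It assumes that a word matches when it equals the keyword after surrounding punctuation is stripped, and that the ellipsis is omitted wherever the cut falls on a sentence boundary, which is what the expected output in the question suggests.

```python
import string

# Assumption: a word ending in one of these closes a sentence.
SENTENCE_END = (".", "!", "?")

def keyword_contexts(text, keyword, distance):
    """Return each keyword occurrence with `distance` words on either side."""
    words = text.split()  # splits on any run of whitespace
    key = keyword.lower()
    snippets = []
    for i, word in enumerate(words):
        # Compare without surrounding punctuation, so "marble." still matches.
        if word.lower().strip(string.punctuation) != key:
            continue
        start = max(0, i - distance)
        end = min(len(words), i + distance + 1)
        snippet = " ".join(words[start:end])
        # Leading ellipsis only if the cut is in the middle of a sentence.
        if start > 0 and not words[start - 1].endswith(SENTENCE_END):
            snippet = "..." + snippet
        # Trailing ellipsis only if the last kept word does not end a sentence.
        if end < len(words) and not words[end - 1].endswith(SENTENCE_END):
            snippet += "..."
        snippets.append(snippet)
    return snippets

text = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
for dist in (1, 2):
    found = keyword_contexts(text, "Marble", dist)
    print("%d occurrences found" % len(found))
    for snippet in found:
        print('"%s"' % snippet)
```

Because split() with no arguments collapses runs of whitespace and drops it at both ends, the snippets come back with single spaces between words and no leading or trailing whitespace.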
Tags: python regex string python-2.7 split