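You can use a regular expression that finds the first word after 'symptoms', optionally followed by further items that each start with a comma, maybe whitespace, and more word characters: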
import re
pattern = r"symptoms\s+(\w+)(?:,\s*(\w+))*"
regex = re.compile(pattern)
t = "kathy has symptoms cold,cough her gender is female. john's symptoms hunger, thirst."
symptoms = regex.findall(t)
print(symptoms)
Output:
[('cold', 'cough'), ('hunger', 'thirst')]
Explanation:
r"symptoms\s+(\w+)(?:,\s*(\w+))*"
# symptoms\s+    literal "symptoms" followed by 1+ whitespace characters
# (\w+)          1+ word characters (first symptom) as group 1
# (?:,\s*...)*   non-capturing group, repeated 0+ times: a comma plus optional whitespace
# (\w+)          1+ word characters (a further symptom) as group 2; if the group repeats, only the last repetition is kept
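A caveat worth illustrating: Python's re module keeps only the last repetition of a capturing group that sits inside a repeated group, so this pattern loses the middle items once there are more than two symptoms. A minimal sketch, using made-up input strings:

```python
import re

# Same pattern as above; group 2 sits inside a repeated non-capturing group.
regex = re.compile(r"symptoms\s+(\w+)(?:,\s*(\w+))*")

three = regex.findall("kathy has symptoms cold,cough,fever")
one = regex.findall("john's symptoms hunger.")

print(three)  # [('cold', 'fever')]  -> 'cough' is lost
print(one)    # [('hunger', '')]     -> group 2 did not participate
```

For lists of arbitrary length, capturing the whole list as one group and splitting it afterwards avoids this limitation.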
Another way:
import re
pattern = r"symptoms\s+(\w+(?:,\s*\w+)*(?:\s+and\s+\w+)?)"
regex = re.compile(pattern)
t1 = "kathy has symptoms cold,cough,fever and noseitch her gender is female. "
t2 = "john's symptoms hunger, thirst."
symptoms = regex.findall(t1+t2)
print(symptoms)
Output:
['cold,cough,fever and noseitch', 'hunger, thirst']
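This only works for "British" English. The American style, with a comma before "and",
"kathy has symptoms cold,cough,fever, and noseitch"
would only match cold,cough,fever, and because ", and" is consumed as just another comma-separated word.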
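One possible way to accept both styles is to stop the comma-separated part from consuming the word "and" and to allow an optional comma before the final "and". This is only a sketch; the negative lookahead `(?!and\b)` and the name `both_styles` are additions for illustration and not part of the pattern above:

```python
import re

# (?!and\b) keeps the comma list from swallowing the word "and";
# ,? allows an optional comma before the closing "and <word>".
both_styles = re.compile(
    r"symptoms\s+(\w+(?:,\s*(?!and\b)\w+)*(?:,?\s+and\s+\w+)?)"
)

american = "kathy has symptoms cold,cough,fever, and noseitch her gender is female. "
british = "kathy has symptoms cold,cough,fever and noseitch her gender is female. "
plain = "john's symptoms hunger, thirst."

matches = both_styles.findall(american + british + plain)
print(matches)
# ['cold,cough,fever, and noseitch', 'cold,cough,fever and noseitch', 'hunger, thirst']
```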
You can split each match at ',' and " and " to get the individual symptoms:
sym = [inner.split(",") for inner in (x.replace(" and ", ",") for x in symptoms)]
print(sym)
Output:
[['cold', 'cough', 'fever', 'noseitch'], ['hunger', ' thirst']]
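Note the leading space in ' thirst': splitting on a bare "," keeps whatever whitespace followed the comma. A sketch of one alternative is to split with a regular expression that also swallows the surrounding whitespace; the name `splitter` is hypothetical:

```python
import re

symptoms = ['cold,cough,fever and noseitch', 'hunger, thirst']

# Split on a comma (optionally followed by "and") or on a bare " and ",
# consuming the whitespace around the separator.
splitter = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")

sym = [splitter.split(match) for match in symptoms]
print(sym)  # [['cold', 'cough', 'fever', 'noseitch'], ['hunger', 'thirst']]
```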