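【Question title】: Python re.findall() returning empty list
【Posted】: 2017-03-18 23:11:30
【Question】:

I'm trying to match some words with regular expressions and wrote some Python code for it. Strangely, re.findall() returns an empty list instead of the matches. However, the patterns and the text file do show matches on regxr.com. Here is the code: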

import re

pat1 = '(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b'
pat2 = '(\S+?)_(?:RR|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat3 = '(\S+?)_(?:JJ)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat4 = '(\S+?)_(?:NN|NNS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat5 = '(\S+?)_(?:RB|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:VB|VBD|VBN|VBG)_\S+\b(?:\s+)\S*?_\S+?_\S+\b'

def process_file(content):
    res = []
    for line in content:
        matches = re.findall(pat1,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat2,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat3,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat4,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat5,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
    return res

def main(path):
    contents = []
    f = open(path)
    for line in f:
        contents.append(line)
    f.close()
    result = process_file(contents)
    print(result)

This is the text file I'm using:

sydney_NN_B-NP lumet_NN_I-NP 是_VBZ_B-VP the_DT_B-NP 主任_NN_I-NP 其_WP$_B-NP 工作_NN_I-NP 发生_VBZ_B-VP to_TO_I-VP 是_VB_I-VP of_IN_B-PP 变化_VBN_B-NP 质量_NN_I-NP ._._B-O he_PRP_B-NP 是_VBZ_B-VP 赞_VBN_I-VP 为_IN_B-PP 一些_DT_B-NP 的_IN_B-PP _DT_B-NP 最重要_RBS_I-NP 重要_JJ_I-NP 片_NNS_I-NP 的_IN_B-PP _DT_B-NP 以前_JJ_I-NP 几十年_NNS_I-NP ,_,_B-O 喜欢_IN_B- PP 十二_CD_B-NP 愤怒_JJ_I-NP men_NNS_I-NP ,_,_B-O serpico_NN_B-NP or_CC_B-O the_DT_B-NP 判决_NN_I-NP ._._B-O 但是_CC_B-O ,_,_I-O in_IN_B-PP _DT_B-NP 相同_JJ_I-NP 时间_NN_I-NP ,_,_B-O 几乎_RB_B-NP 任意_DT_I-NP of_IN_B-PP 这样_JJ_B-NP 珍珠_NNS_I-NP 是_VBZ_B-VP 跟随_VBN_I-VP 由_IN_B- PP stinkers_NNS_B-NP that_WDT_B-NP 篮子_VBP_B-VP lumet's_JJ_B-NP 信誉_NN_I-NP ._._B-O a_DT_B-NP 陌生人_NN_I-NP 在_IN_B-PP us_PRP_B-NP ,_,_B-O 1992_CD_B-NP 扯掉_NN_I-NP of_IN_B-PP peter_NN_B-NP 堰的_JJ_I-NP 见证人_NN_I-NP ,_,_B-O 属于_VBZ_B-VP to_TO_B -PP the_DT_B-NP 后者_NN_I-NP 类别_NN_I-NP ._._B-O the_DT_B-NP 女主角_NN_I-NP of_IN_B-PP this_DT_B-NP 电影_NN_I-NP is_VBZ_B-VP emily_JJ_B-NP eden_FW_I-NP (_(_B-O melanie_JJ_B-NP griffith_NN_I-NP )_)BO ,,_I -O强硬_JJ_B-NP女士_NN_I-NP警察_NN_I-NP谁_WP_B-NP有时_RB_B-ADVP显示_VBZ_B-VP太_RB_B-NP很多_JJ_I-NP热情_NN_I-NP in_IN_B-PP战斗_VBG_B-VP坏_JJ_B-NP家伙_NNS_I-NP on_IN_B-PP the_DT_B-NP街道_NNS_IN-NP of_IN_B-PP new_JJ_B-NP york_NN_I-NP ._._B-O 在_IN_B-PP one_CD_B-NP of_IN_B-PP such_JJ_B-NP actions_NNS_I-NP ,_,_B-O her_PRP$_B-NP partner_NN_I-NP nick_NN_I-NP (_(_B-O jamey_JJ_B-NP sheridan_NNS_I-NP )_)_B-O got_VBD_B-VP 受伤_VBN_I-VP 和_CC_B-O as_IN_B-PP a_DT_B-NP 结果_NN_I-NP ,_,_B-O she_PRP_B-NP 变成_VBZ_B-VP 郁闷_JJ_B-ADJP ._._B-O in_IN_B-PP order_NN_B-NP to_TO_B-VP help_VB_I-VP her_PRP_B-NP recovery_VB_B-VP ,_,_B-O bosses_NNS_B-NP give_VBP_B-VP her_PRP_B-NP 而不是_RB_I-NP easy_JJ_I-NP task_NN_I-NP of_IN_B-PP locating_VBG_B-VP missing_VBG_B- NP 珠宝商_NNS_I-NP who_WP_B-NP 属于_VBD_B-VP to_TO_B-PP hassidic_JJ_B-NP 犹太人_NN_I-NP 社区_NN_I-NP ._._B-O emily_NN_B-NP开始_VBZ_B-VP调查_NN_B-NP和_CC_B-O很快_RB_B-VP意识到_VBZ_I-VP那_IN_B-SBAR_DT_B-NP案件_NN_I-NP涉及_VBZ_B-VP谋杀_NN_B-NP._._B-O 结论_VBG_B-VP那_IN_B-SBAR_DT_B-NP犯罪者_NN_I-NP属于_VBZ_B-VP 
to_TO_B-PP社区_NN_B-NP ,_,_B-O she_PRP_B-NP决定_VBZ_B-VP to_TO_I-VP go_VB_I-VP卧底_JJ_B-ADJP ._._B-O that_DT_B-NP 不是_RB_B-O easy_JJ_B-ADJP ,_,_B-O 因为_IN_B-SBAR her_PRP$_B-NP 现代_JJ_I-NP 方式_NNS_I-NP 是_VBP_B-VP 冲突_VBG_I-VP 与_IN_B-PP 传统主义者_NN_B-NP 方式_NNS_I-NP ._._B -O things_NNS_B-NP get_VBP_B-VP even_RB_B-NP more_RBR_B-ADJP复杂_JJ_I-ADJP when_WRB_B-ADVP she_PRP_B-NP发展_VBZ_B-VP情怀_NNS_B-NP for_IN_B-PP年轻_JJ_B-NP cabalistic_JJ_I-NP学者_NN_I-NP ariel_NN_I-NP (_(_B-Oeric NP thal_NN_I-NP )_)BO .._I-O 使用_VBG_B-VP peter_NN_B-NP weir's_JJ_I-NP 公式_NN_I-NP 不是_:_B-O_DT_B-NP 最大_JJS_I-NP 缺陷_NN_I-NP of_IN_B-PP this_DT_B-NP 电影_NN_I-NP ._._B-O even_RB_B-NP the_DT_I-NP lame_JJ_I-NP and_CC_I-NP unspiring_JJ_I-NP 犯罪_NN_I-NP 神秘_NN_I-NP subplot_NN_I-NP 作品_VBZ_B-VP to_TO_B-PP the_DT_B-NP 确定_JJ_I-NP 范围_NN_I-NP ._._B-O 但是_CC_B-O the_DT_B-NP最差_JJS_I-NP侮辱_NN_I-NP to_TO_B-PP观众的_JJ_B-NP观众_NN_I-NP是_VBZ_B-VP可怕_JJ_B-NP误投_NN_I-NP of_IN_B-PP梅兰妮_JJ_B-NP格里菲斯_NN_I-NP._._B-O the_DT_B-NP 作者_NN_I-NP of_IN_B-PP this_DT_B-NP 评论_NN_I-NP never_RB_B-ADVP 喜欢_VBD_B-VP this_DT_B-NP 女演员_NN_I-NP 非常_RB_B-ADVP much_RB_I-ADVP ,_,_B-O 但是_CC_I-O she_PRP_B-NP 是_VBD_B-VP at_IN_B- ADVP 最少_JJS_I-ADVP 可容忍_JJ_B-ADJP in_IN_B-PP some_DT_B-NP of_IN_B-PP her_PRP$_B-NP 角色_NNS_I-NP ._._B-O 角色_NN_B-NP of_IN_B-PP emily_JJ_B-NP eden_NNS_I-NP ,_,_B-O 不幸的是_RB_B-ADVP ,_,_B-O 不是_VBZ_I-O one_CD_B-NP of_IN_B-PP 他们_PRP_B-NP ._._B-O first_RB_B-ADVP of_IN_B-PP all_DT_B-NP ,_,_B-O she_PRP_B-NP 不能her_PRP$_B-NP 尝试_NN_I-NP to_TO_B-VP pass_VB_I-VP for_IN_B-PP 正统_JJ_B-NP 犹太人_JJ_I-NP 女人_NN_I-NP 不是_RB_B-O 很多_RB_B-ADJP 更好_JJR_I-ADJP ._._B-O 剧本_NN_B-NP by_IN_B-PP robert_JJ_B-NP j_NN_I-NP ._._B-O avrech_NNS_B-NP make_VBZ_B-VP things_NNS_B-NP even_RB_B-ADJP 更糟_JJR_I-ADJP with_IN_B-PP some_DT_B-NP formulaic_JJ_I-NP red_JJ_I-NP herring_NN_I-NP subplots NP(_(_B-O场景_NN_B-NP涉及_VBG_B-VP两个_CD_B-NP意大利_JJ_I-NP黑帮_NNS_I-NP被_VBD_B-VP几乎_RB_B-ADJP太_RB_I-ADJP痛苦_JJ_I-ADJP到_TO_B-VP手表_VB_I-VP)_)BO。 em>._I-O 但是_CC_B-O ,_,_I-O on_IN_B-PP the_DT_B-NP 其他_JJ_I-NP 手_NN_I-NP ,_,_B-O 
其他_JJ_B-NP 演员_NNS_I-NP 是_VBP_B-VP 更多_RBR_B-ADJP 说服_JJ_I-ADJP (_(_B-O lee_NN_B- NP richardson_NN_I-NP as_IN_B-PP an_DT_B-NP old_JJ_I-NP rabbi_NN_I-NP ,_,_B-O thal_JJ_B-ADJP as_IN_B-PP ariel_NN_B-NP and_CC_B-O charm_JJ_B-NP mia_NN_I-NP sara_NN_I-NP as_IN_B-PP his_PRP$_B- NP 意图_VBN_I-NP 新娘_NN_I-NP )_)BO ,,_I-O 和_CC_I-O the_DT_B-NP 摄影_NN_I-NP by_IN_B-PP andrzej_JJ_B-NP bartkowiak_NN_I-NP 非常_RB_B-ADVP 有效_RB_I-ADVP 创建_VBZ_B-VP气氛_NN_B-NP 的_IN_B-PP 温暖_NN_B-NP 何时_WRB_B-ADVP _DT_B-NP 场景_NNS_I-NP 采取_VBP_B-VP 地点_NN_B-NP in_IN_B-PP hassidic_JJ_B-NP 社区_NN_I-NP ._._B-O 还有_RB_B-ADVP ,_,_B-O the_DT_B-NP 电影_NN_I-NP 可能_MD_B-VP 教育_VB_I-VP 观众_NNS_B-NP 关于_IN_B-PP hassidic_JJ_B-NP 文化_NN_I-NP ._._B-O that_DT_B-NP 是_VBZ_B-VP _DT_B-NP only_JJ_I-NP 事情_NN_I-NP 那_WDT_B-NP 防止_VBZ_B-VP it_PRP_B-NP 从_IN_B-PP 转向_VBG_B-VP 变成_IN_B-PP 总_JJ_B-NP 浪费_NN_I-NP _IN_B-PP 时间_NN_B-NP ._._B-哦

【Question comments】:

    Tags: python regex findall


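    【Answer 1】:

    You've been bitten by backslashes! In Python string literals (as in many other languages), the backslash introduces escape sequences. For example, \n means "newline", \r means "carriage return"... and \b means "backspace", a.k.a. \x08.

    Every one of your expressions contains \b.

    So when you write: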

    >>> pat1 = '...\b...'
    

    you get:

    >>> pat1
    '...\x08...'
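
To make the effect on matching concrete, here is a minimal sketch. The sample string and the short `NP` pattern are made up for illustration and are not taken from the question's file; the comments assume Python 3. The plain literal asks the regex engine to find a literal backspace character, while the raw literal asks for a word boundary:

```python
import re

# Made-up sample in the same word_TAG_CHUNK format as the question's data.
sample = "angry_JJ_I-NP men_NNS_I-NP"

plain = 'NP\b'   # Python turns \b into a backspace: 'NP' + '\x08'
raw = r'NP\b'    # backslash is kept: 'NP' + the regex word-boundary token

print(len(plain), len(raw))        # 3 4
print(plain == 'NP\x08')           # True
print(re.findall(plain, sample))   # [] (the text contains no backspace)
print(re.findall(raw, sample))     # ['NP', 'NP']
```

The empty list from the plain literal is the same symptom the question describes: the pattern is valid, it just looks for a character that never occurs in the text.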
    

    There are two ways to fix this. You can escape each backslash with another backslash, like this:

    >>> pat1 = '...\\b...'
    >>> pat1
    '...\\b...'
    

    Note that you see \\ there because that is Python's representation (repr) of the string; if we print pat1 instead, we get:

    >>> print(pat1)
    ...\b...
    

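    The simpler fix is to mark your regular expression strings as "raw strings":

    The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences.

    In other words: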

    pat1 = r'(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b'
    pat2 = r'(\S+?)_(?:RR|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
    pat3 = r'(\S+?)_(?:JJ)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
    pat4 = r'(\S+?)_(?:NN|NNS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
    pat5 = r'(\S+?)_(?:RB|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:VB|VBD|VBN|VBG)_\S+\b(?:\s+)\S*?_\S+?_\S+\b'
    

    With this change, I get matches using your sample data:

    >>> re.findall(pat1, data)
    [('important', 'films'), ('previous', 'decades'), ('angry', 'men'), ('same', 'time'), ('such', 'pearls'), ("lumet's", 'reputation'), ("weir's", 'witness'), ('melanie', 'griffith'), ('tough', 'lady'), ('much', 'enthusiasm'), ('bad', 'guys'), ('new', 'york'), ('such', 'actions'), ('jamey', 'sheridan'), ('easy', 'task'), ('hassidic', 'jew'), ('modern', 'manners'), ('cabalistic', 'scholar'), ('eric', 'thal'), ("weir's", 'formula'), ('unispiring', 'crime'), ('certain', 'extent'), ("viewer's", 'audience'), ('terrible', 'miscasting'), ('melanie', 'griffith'), ('emily', 'eden')]
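
As a side note, the five near-identical blocks in process_file can be collapsed into one loop once the patterns are raw strings. The following is only a sketch of that idea, not the original poster's code: the function name `extract_phrases` and the one-line sample are hypothetical, and only pat1 is shown for brevity.

```python
import re

# pat1 from the question, written as a raw string so \b reaches the regex engine.
pat1 = r'(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b'

def extract_phrases(lines, patterns):
    """Return 'word1 word2' for every match of every pattern, line by line."""
    compiled = [re.compile(p) for p in patterns]
    phrases = []
    for line in lines:
        for regex in compiled:
            # With two capturing groups, findall yields (group1, group2) tuples.
            for first, second in regex.findall(line):
                phrases.append('%s %s' % (first, second))
    return phrases

# Made-up one-line sample in the same word_TAG_CHUNK format.
sample = ["the_DT_B-NP angry_JJ_I-NP men_NNS_I-NP left_VBD_B-VP"]
print(extract_phrases(sample, [pat1]))   # ['angry men']
```

Passing all five raw-string patterns in the list should reproduce the order of results of the original function, since each line is still checked against the patterns in sequence.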
    

    【Comments】:
