【Question Title】: How to split the characters of a string by spaces and then resultant elements of list by special characters and numbers and then again join them?
【Posted】: 2021-10-26 11:59:23
【Question】:

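What I'm trying to do is convert some words in a string to their corresponding values in a dictionary and leave everything else as it is. For example, given this input: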

standarisationn("well-2-34 2   @$%23beach bend com")

I want the output to be:

"well-2-34 2 @$%23bch bnd com"

The code I used is:

import re

def standarisationn(addr):
    # Treat commas as spaces
    a = re.sub(',', ' ', addr)
    lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                  "arcade":"arc",
                  "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                  "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                  "beach":"bch",
                  "bend":"bnd",
                  "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                  "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                  "bottm":"bot","bottom":"bot",
                  "branch":"br","brnch":"br",
                  "brdge":"brg","bridge":"brg",
                  "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                  "camp":"cmp",
                  "canyn":"cny","canyon":"cny","cnyn":"cny",
                  "southwest":"sw", "northwest":"nw"}

    # Tokenize: runs of letters/digits, or any single non-whitespace character
    temp = re.findall(r"[A-Za-z0-9]+|\S", a)
    print(temp)
    res = []
    for wrd in temp:
        # Replace a token only if it is a dictionary key
        res.append(lookp_dict.get(wrd, wrd))
    res = ' '.join(res)
    return str(res)

But it gives the wrong output:

'well - 2 - 34 2 @ $ % 23beach bnd com'

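There are too many spaces, and it doesn't even convert "beach" to "bch". So that is the problem. My idea is to first split the string on spaces, then split the resulting elements on special characters and numbers, apply the dictionary, rejoin the pieces that were separated by special characters without adding spaces, and finally join the whole list with spaces. Can anyone suggest how to fix this, or a better approach?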
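The split-then-rejoin idea described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `standardise_by_tokens` and the shortened `LOOKUP` table are hypothetical, and it relies on `re.split` keeping the separators when the pattern contains a capturing group.

```python
import re

# Hypothetical, shortened subset of the lookup table from the question.
LOOKUP = {"beach": "bch", "bend": "bnd", "alley": "ale"}

def standardise_by_tokens(addr):
    out = []
    # str.split() with no argument splits on whitespace runs and drops empty strings.
    for token in addr.replace(",", " ").split():
        # The capturing group makes re.split return the non-letter separators too.
        parts = re.split(r"([^A-Za-z]+)", token)
        # Letter runs are looked up; separators and unknown words pass through unchanged.
        out.append("".join(LOOKUP.get(p, p) for p in parts))
    # Only the original whitespace boundaries get a space back.
    return " ".join(out)

print(standardise_by_tokens("well-2-34 2   @$%23beach bend com"))
```

Because each letter run is looked up as a whole, a longer word that merely starts with a key (such as "alleyyance") is left untouched.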

【Comments】:

    Tags: python dictionary join split python-re


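    【Solution 1】:

    You can build your regex from the dictionary's keys, making sure a key is not contained in another word (i.e. not directly preceded or followed by a letter):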

    import re
    def standarisationn(addr):
        addr = re.sub(r'(,|\s+)', " ", addr)
        lookp_dict = {"allee":"ale","alley":"ale","ally":"ale","aly":"ale",
                    "arcade":"arc",
                    "apartment":"apt","aprtmnt":"apt","aptmnt":"apt",
                    "av":"ave","aven":"ave","avenu":"ave","avenue":"ave","avn":"ave","avnue":"ave",
                    "beach":"bch",
                    "bend":"bnd",
                    "blfs":"blf","bluf":"blf","bluff":"blf","bluffs":"blf",
                    "boul":"blvd","boulevard":"blvd","boulv":"blvd",
                    "bottm":"bot","bottom":"bot",
                    "branch":"br","brnch":"br",
                    "brdge":"brg","bridge":"brg",
                    "bypa":"byp","bypas":"byp","bypass":"byp","byps":"byp",
                    "camp":"cmp",
                    "canyn":"cny","canyon":"cny","cnyn":"cny",
                    "southwest":"sw" ,"northwest":"nw"}
    
        for wrd in lookp_dict:
            addr = re.sub(rf'(?:^|(?<=[^a-zA-Z])){wrd}(?=[^a-zA-Z]|$)', lookp_dict[wrd], addr)
        return addr
    
    print(standarisationn("well-2-34 2   @$%23beach bend com"))
    

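    The expression has four parts:

    • ^ matches the beginning of the string
    • (?<=[^a-zA-Z]) is a lookbehind (a zero-width assertion that consumes no characters) that checks that the preceding character is not a letter
    • {wrd} is a key of your dictionary
    • (?=[^a-zA-Z]|$) is a lookahead (also zero-width) that checks that the following character is not a letter, or that the end of the string has been reached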

    Output:

    well-2-34 2 @$%23bch bnd com
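To see what the lookbehind and lookahead contribute, the pattern can be tried in isolation. The snippet below is a small sketch with the key "beach" filled in by hand; the helper name `matches` is hypothetical.

```python
import re

# Same pattern shape as in the answer, with {wrd} replaced by "beach".
pattern = re.compile(r"(?:^|(?<=[^a-zA-Z]))beach(?=[^a-zA-Z]|$)")

def matches(text):
    # True if "beach" occurs as a standalone run of letters somewhere in text.
    return pattern.search(text) is not None

print(matches("23beach"))     # True: digit before, end of string after
print(matches("beach road"))  # True: start of string before, space after
print(matches("beachfront"))  # False: a letter follows
print(matches("seabeach"))    # False: a letter precedes
```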
    

    Edit: if you replace the loop with the following, you can compile the whole expression and call re.sub only once:

    repl_pattern = re.compile(rf"(?:^|(?<=[^a-zA-Z]))({'|'.join(lookp_dict.keys())})(?=([^a-zA-Z]|$))")
    addr = re.sub(repl_pattern, lambda x: lookp_dict[x.group(1)], addr)
    

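    This should be much faster as your dictionary grows, because a single expression is built from all of your dictionary keys:

    • ({'|'.join(lookp_dict.keys())}) is interpreted as (allee|alley|...
    • the lambda function in re.sub replaces the matched element with the corresponding value in lookp_dict (for more details on this, see e.g. this link)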
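Put together, the single-pass variant might look like the sketch below. Two details are assumptions added here and are not part of the answer: the keys are passed through `re.escape` in case one ever contains a regex metacharacter, and they are sorted longest first. The dictionary is a shortened subset, for illustration only.

```python
import re

# Shortened subset of lookp_dict, for illustration only.
lookp_dict = {"av": "ave", "avenue": "ave", "alley": "ale",
              "beach": "bch", "bend": "bnd"}

# Assumption (not in the answer): escape each key and try longer keys first.
keys = sorted(lookp_dict, key=len, reverse=True)
alternation = "|".join(re.escape(k) for k in keys)

# Compiled once at import time, reused for every call.
repl_pattern = re.compile(rf"(?:^|(?<=[^a-zA-Z]))({alternation})(?=[^a-zA-Z]|$)")

def standarisationn(addr):
    # Replace commas and collapse whitespace runs into a single space.
    addr = re.sub(r"(,|\s+)", " ", addr)
    # group(1) is the matched key; look up its replacement in the dictionary.
    return repl_pattern.sub(lambda m: lookp_dict[m.group(1)], addr)

print(standarisationn("well-2-34 2   @$%23beach bend com"))
```

Compiling the pattern outside the function avoids rebuilding it on every call, which matters mainly when many addresses are processed.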

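    【Discussion】:

    • Hi, there is a problem with this. Suppose we have an input like this: standarisationn("well-2-34 2apartment alleyyenence @$%23beach Bend com"). It will also convert alleyyance to aleyance, i.e. it converts "alley" where that should not happen, because alleyance is a completely different word. Basically it should be an exact match. Thanks.
    • OK, I see! Check my updated answer! Is that closer to what you are looking for?
    • I think this solves the problem. Thanks. If you have any regex links that would help me understand the code better, please share them.
    • I updated my answer with an explanation and a link about handling the lambda in re.sub. Hope it all makes sense!
    • You're right! You have to actively check whether the matched string is at the beginning or the end of the string... I updated the answer accordingly.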