【Question title】:Find and replace an obfuscated word or phrase in a string
【Posted】:2019-08-14 03:05:02
【Question】:

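I'm trying to find a single word or an n-word phrase in a string and replace it with asterisks. The challenge is that I want to do this even when the word or n-word phrase is obfuscated by certain characters.

Assume the following. REPLACE_CHAR is the character I want to use to replace the word or n-word phrase. ILLEGAL_CHAR holds the characters I want to ignore. I also want the match to ignore case.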

REPLACE_CHAR = "*"
ILLEGAL_CHAR = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

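Example 1

Here I want to replace "dolor" with asterisks. In the string you can see that "dolor" is present, but it is obfuscated by random symbols and capitalization.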

string = "Lorem ipsum %@do^l&oR sit amet"
find = "dolor"

The ideal result would be "Lorem ipsum ***** sit amet", where the number of asterisks matches the length of the word found.

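Example 2

Here I want to replace "dolor sit" with asterisks while preserving the space. In the string you can see that "dolor sit" is present, but it is obfuscated by random symbols and capitalization.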

string = "Lorem ipsum %@do^l&oR s%)i!T~ amet"
find = "dolor sit"

The ideal result would be "Lorem ipsum ***** *** amet", where the number of asterisks matches the length of each word found.


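Update #1

This solution builds on @Ajax1234's answer.

Instead of using re.sub to remove ILLEGAL_CHAR, we use translate and build the table outside the function. This gives a slight performance gain.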

import re

REPLACE_CHAR = "*"
ILLEGAL_CHAR = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

trans = str.maketrans("", "", ILLEGAL_CHAR)
text = "Lorem ipsum %@do^l&oR sit amet"
token = "dolor sit"

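def replace(data, token):
    # drop every illegal character, then mask each word of the match
    data = data.translate(trans)
    # re.escape() keeps regex metacharacters in the token from being interpreted
    return re.sub(re.escape(token), lambda x: ' '.join(REPLACE_CHAR * len(i) for i in x.group().split(' ')), data, flags=re.I)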

print(replace(text, token))

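【Comments】:

  • I don't know of a way to do this with regex, but you could use the longest common subsequence to find the matching pattern.
  • What behavior do you want when there are obfuscating characters outside the word or phrase you're interested in? For example, if the string is string = "Lorem ipsum %@do^l&oR s%)i!T~ amet ()*+,-./:;<=>?@[\\]^_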
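
One possible answer to the second comment is to leave everything outside the phrase untouched. The sketch below is not taken from any of the answers: `mask_in_place`, `NOISE` and `STRIP_TABLE` are hypothetical names, and it assumes that obfuscating characters attached directly to the phrase should disappear with it (as in the question's expected output), while punctuation elsewhere in the string stays. It does not enforce word boundaries.

```python
import re

REPLACE_CHAR = "*"
ILLEGAL_CHAR = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Matches any run (possibly empty) of obfuscating characters.
NOISE = "[" + re.escape(ILLEGAL_CHAR) + "]*"
# Used to drop the obfuscating characters from a matched span.
STRIP_TABLE = str.maketrans("", "", ILLEGAL_CHAR)


def mask_in_place(text, phrase):
    """Hypothetical helper: mask `phrase` without stripping the rest of `text`."""
    # Allow noise between the letters of each word ...
    words = [NOISE.join(re.escape(ch) for ch in word) for word in phrase.split()]
    # ... and around the whitespace that separates the words.
    body = (NOISE + r"\s+" + NOISE).join(words)
    # Noise attached to either end of the phrase is swallowed by the match.
    pattern = NOISE + body + NOISE

    def stars(match):
        letters = match.group().translate(STRIP_TABLE)
        # Keep whitespace, replace every other character.
        return "".join(c if c.isspace() else REPLACE_CHAR for c in letters)

    return re.sub(pattern, stars, text, flags=re.IGNORECASE)


print(mask_in_place("Hello, world! %@do^l&oR s%)i!T~ amet", "dolor sit"))
# Hello, world! ***** *** amet
```

Because the illegal characters are matched rather than stripped up front, the comma and exclamation mark in "Hello, world!" survive, which the strip-then-search approaches cannot guarantee.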

Tags: python regex


【Solution 1】:
import re

ignore_chars = "!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~"

string = "Lorem ipsum %@do^l&oR s%)i!T~ amet"

clean_string = "".join(char for char in string if char not in ignore_chars)

bad_words = ["dolor", "sit"]

for bad_word in bad_words:
    pattern = f"\\b{bad_word}\\b"
    replace = "*" * len(bad_word)
    clean_string = re.sub(pattern, replace, clean_string, flags=re.IGNORECASE)

print(clean_string)

Output:

Lorem ipsum ***** *** amet

【Comments】:

  • This doesn't seem to preserve spaces when using n-word phrases. For example, use ["dolor sit"] for bad_words.
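
A minimal sketch of how this answer could be adapted so that phrases keep their spaces. It is an illustration, not the answerer's code: the `censor` function and the `IGNORE_CHARS` constant name are assumptions. The idea is to star out only the non-whitespace characters of each match.

```python
import re

IGNORE_CHARS = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"


def censor(text, bad_phrases):
    # Same first step as the answer: drop every ignored character.
    clean = "".join(ch for ch in text if ch not in IGNORE_CHARS)
    for phrase in bad_phrases:
        # re.escape() keeps the phrase literal; \b keeps whole-word matching.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        # Star out everything in the match except whitespace.
        clean = re.sub(
            pattern,
            lambda m: re.sub(r"\S", "*", m.group()),
            clean,
            flags=re.IGNORECASE,
        )
    return clean


print(censor("Lorem ipsum %@do^l&oR s%)i!T~ amet", ["dolor sit"]))
# Lorem ipsum ***** *** amet
```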
【Solution 2】:

You can use re.sub to remove the illegal characters, then apply a second re.sub with re.I:

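import re

def replace(word, target):
    # strip the obfuscating punctuation first (a raw string avoids invalid-escape warnings)
    w = re.sub(r'[!"#$%&\'()*+,\-./:;<=>?@\[\]^_`{|}~]+', '', word)
    # then mask each space-separated part of the match, ignoring case
    return re.sub(target, lambda x: ' '.join('*' * len(i) for i in x.group().split(' ')), w, flags=re.I)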

string = "Lorem ipsum %@do^l&oR sit amet"
find = "dolor"
r = replace(string, find)

Output:

'Lorem ipsum ***** sit amet'

string = "Lorem ipsum %@do^l&oR s%)i!T~ amet"
find = "dolor sit"
r = replace(string, find)

Output:

'Lorem ipsum ***** *** amet'

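【Comments】:

  • Using translate (with the translation table built outside the function) improves performance. Tested over 1 million iterations, translate was about 1 second faster.
  • @mondoshawan Good to know. Could you post your solution using translate?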
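
For anyone who wants to check the timing claim on their own machine, a rough benchmark could look like the sketch below. The helper names `strip_translate` and `strip_regex` and the iteration count are assumptions, and the measured gap will vary with the Python version and hardware.

```python
import re
import timeit

ILLEGAL_CHAR = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
TEXT = "Lorem ipsum %@do^l&oR s%)i!T~ amet"

# Both helpers are built once, outside the timed calls.
TRANS = str.maketrans("", "", ILLEGAL_CHAR)
STRIP_RE = re.compile("[" + re.escape(ILLEGAL_CHAR) + "]+")


def strip_translate(text):
    # Delete the illegal characters with a precomputed translation table.
    return text.translate(TRANS)


def strip_regex(text):
    # Delete the illegal characters with a precompiled character class.
    return STRIP_RE.sub("", text)


# A smaller iteration count than in the comment, to keep the run short.
n = 100000
t_translate = timeit.timeit(lambda: strip_translate(TEXT), number=n)
t_regex = timeit.timeit(lambda: strip_regex(TEXT), number=n)
print(f"translate: {t_translate:.3f}s  regex: {t_regex:.3f}s")
```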
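【Solution 3】:

With re.sub at your disposal, de-obfuscating and re-obfuscating words isn't hard! There are already plenty of good answers here; this one is designed to be easy to edit, especially if you plan to take input from a user or another external source.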

#we'll be using regex to solve this problem
import re


#establish some constants - these can be changed later, or even read as user input
REPLACE_CHAR = "*"
ILLEGAL_CHAR = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'


#your search string - this can be read as user input
search = "Lorem ipsum %@do^l&oR sit amet"

#this regex will remove the illegal characters - specifically, it substitutes an empty
#string ('') in place of any illegal character we find.
#note that re.escape() neutralises regex metacharacters such as ] and \ here, so the
#user can directly input illegal symbols themselves without worrying about formatting
strip = re.sub('[' + re.escape(ILLEGAL_CHAR) + ']', '', search)


#the string to obfuscate - this can also be read as user input
find = "ipsum dolor sit"

#this splits the words on spaces, so there are spaces between the asterisks
find_words = find.split(' ')


#now we'll check each find_word - we'll look for it in the string, and if we find it,
#we'll replace it with asterisks of the same length as the original word. 
#(we'll use a ranged for loop to go over the words)
for f_word in find_words:

  #check each f_word to see if it appears in the string. note "flags=re.I" - this 
  #tells our regex to use case-insensitive matching
  if(re.search(f_word, strip, flags=re.I)):

    #we found a word! check the length of the word, then substitute an equal number of
    #REPLACE_CHARs
    strip = re.sub(f_word, (REPLACE_CHAR * len(f_word)), strip, flags=re.I)

#ta-daa!
print(strip)


【Comments】:
