Delete the first X words and delimiters of a string - with multiple delimiters

【Question title】: delete first X words and delimiters of a string - with multiple delimiters
【Posted】: 2016-06-23 17:55:45
【Question】:

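I have a string, e.g. manipulate widgets add,1,2,3 (sorry, I can't change the format).

I want to delete the first X words, together with any delimiters that precede them.

Taking X = 3 as an example, that removes manipulate widgets add and leaves ,1,2,3

Or, removing two words (manipulate,widgets) from manipulate,widgets,add,1,2,3 leaves ,add,1,2,3

I can split the string into a list with words = re.split('[' + delimiters + ']',inputString.strip()), but I can't simply delete the first X words

with, say,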

for i in range(numWordsToRemove):
    del words[0]

followed by return ' '.join(words), because that gives me something like 1 2 3, with the original delimiters lost.

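How can I do this and keep the original delimiters of the words that are not deleted?

Just to make it interesting, the input string can contain multiple spaces or tabs between words; there is only one comma, but it might also have spaces before or after it: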

manipulate ,widgets add , 1, 2 , 3

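Note that the words are not guaranteed to be unique, so I can't take the index of the word after those being removed and use it to return a positional substring.

[Update] I accepted 'Kasramvd's solution, but then found that it does not correctly handle remover('LET FOUR = 2 + 2', 2) or remover('A -1 B text.txt', 2), so I am now offering a bounty.

[Update++] Delimiters are space, tab and comma. Everything else (including equals sign, minus sign, etc.) is part of a word (although I would be happy if answerers told me how to add a new delimiter in future, should that become necessary)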

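【Question comments】:

What do you consider a delimiter, and what do you consider a word?
Delimiters are space and tab. Everything else is part of a word (although I would be happy if answerers told me how to add a new delimiter in future, should that become necessary)
In the second case you are still treating the comma as a delimiter: manipulate,widgets,add,1,2,3 --> ,add,1,2,3. The first and second commas are treated as delimiters.
You are right (+1). On reflection, the comma should be treated as a delimiter. Thanks for pointing that out; I have updated the question.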

Tags: python regex string split

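【Solution 1】:

It is hard to tell what your definitions of "delimiter" and "word" are. For example, in the case of A -1 B text.txt, should -1 be treated as a word, or should the string be treated as having no words to remove? This is easy to customize using the regular expression Kasramvd provided. For example, if you treat -1 as a "word", then this basically does the job: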

import re


def remover(s, n):
    # Strip n leading runs of non-whitespace, plus the whitespace around them.
    return re.sub(r'^(\s?\s*[^\s]+\s?){%s}' % n, '', s)


s = 'manipulate widgets add,1,2,3'
print('\nString is: {}\n'.format(s))
for x in range(4):
    print('Remove {}: '.format(x), remover(s, x))

s = 'LET FOUR = 2 + 2 '
print('\nString is: {}\n'.format(s))
for x in range(7):
    print('Remove {}: '.format(x), remover(s, x))

s = 'A -1 B C text.txt'
print('\nString is: {}\n'.format(s))
for x in range(6):
    print('Remove {}: '.format(x), remover(s, x))

This produces:

String is: manipulate widgets add,1,2,3

Remove 0:  manipulate widgets add,1,2,3
Remove 1:  widgets add,1,2,3
Remove 2:  add,1,2,3
Remove 3:  

String is: LET FOUR = 2 + 2 

Remove 0:  LET FOUR = 2 + 2 
Remove 1:  FOUR = 2 + 2 
Remove 2:  = 2 + 2 
Remove 3:  2 + 2 
Remove 4:  + 2 
Remove 5:  2 
Remove 6:  

String is: A -1 B C text.txt

Remove 0:  A -1 B C text.txt
Remove 1:  -1 B C text.txt
Remove 2:  B C text.txt
Remove 3:  C text.txt
Remove 4:  text.txt
Remove 5:

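But what about =? Should = be a "word", a "delimiter", or something else? If the rules are different, please tell us what the real rules are.

【Comments】:

Delimiters are space and tab. Everything else is part of a word (although I would be happy if answerers told me how to add a new delimiter in future, should that become necessary). LET X = 42 is four words
I updated it to use whitespace as the delimiter. If you have additional delimiters, just add them to the [^\s] part of the regex. That part basically looks for characters that are not whitespace.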
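One possible reading of that suggestion, as a sketch rather than the answerer's actual code: build both the delimiter class and the non-delimiter class from a single string, so that a new delimiter only has to be added in one place. The delimiters parameter (assumed non-empty) and the {0,n} bound, which removes at most n words instead of failing to match when fewer exist, are assumptions made here for illustration.

```python
import re


def remover(s, n, delimiters=' \t,'):
    """Remove the first n words of s, plus the delimiters in front of each.

    `delimiters` is a hypothetical parameter: every character in it
    separates words, and everything else counts as part of a word.
    """
    cls = re.escape(delimiters)  # make the characters safe inside [...]
    # One unit = an optional run of delimiters, then a run of non-delimiters.
    # {0,n} greedily takes up to n whole units from the start of the string.
    pattern = r'^(?:[%s]*[^%s]+){0,%d}' % (cls, cls, n)
    return re.sub(pattern, '', s)


print(repr(remover('manipulate widgets add,1,2,3', 3)))   # ',1,2,3'
print(repr(remover('LET FOUR = 2 + 2', 2)))               # ' = 2 + 2'
print(repr(remover('a;b;c', 2, delimiters=' \t,;')))      # ';c'
```

The remainder keeps the delimiter that follows the last removed word, which matches the ,1,2,3 example in the question; call lstrip() on the result if leading whitespace is unwanted.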

【Solution 2】:

I think this approach is simple and does not need regular expressions:

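def delete_leading_words(string, num_words, delimiters=' \t,'):
    # Base case: nothing left to remove.
    if num_words == 0:
        return string

    i = 0
    # Skip the delimiters in front of the next word...
    while i < len(string) and string[i] in delimiters:
        i += 1
    # ...then skip the word itself.
    while i < len(string) and string[i] not in delimiters:
        i += 1

    # Recurse on the remainder with one fewer word to remove.
    return delete_leading_words(string[i:], num_words-1, delimiters)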
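The same scan can also be written without recursion. The sketch below is an illustrative variant, not part of the original answer: the name delete_leading_words_iter is hypothetical, and it advances a single index instead of slicing the string once per word.

```python
def delete_leading_words_iter(string, num_words, delimiters=' \t,'):
    """Iterative variant: move one index past num_words words."""
    i = 0
    for _ in range(num_words):
        # Skip the delimiters in front of the next word...
        while i < len(string) and string[i] in delimiters:
            i += 1
        # ...then skip the word itself.
        while i < len(string) and string[i] not in delimiters:
            i += 1
    # Everything from i onward is kept, including its leading delimiter.
    return string[i:]


print(repr(delete_leading_words_iter('manipulate widgets add,1,2,3', 3)))  # ',1,2,3'
print(repr(delete_leading_words_iter('A -1 B text.txt', 2)))               # ' B text.txt'
```

Asking for more words than the string contains simply returns an empty string, because both inner loops stop at the end of the input.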

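【Comments】:

Thanks Mawg. But I realized that it was possible to try to delete too many words and get an out-of-bounds error. See the edited code for the fix (the bounds checks in the while loops).

【Solution 3】:

This seems to work for your test cases: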

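>>> import re
>>> def remover(line, words):
...   # The pattern must never match the empty string: since Python 3.7,
...   # re.split also splits on empty matches.
...   parsed = re.split(r'(\s*,\s*|\s+)', line, maxsplit=words)
...   return ''.join(parsed[-2:]).lstrip()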
... 
>>> remover('LET FOUR = 2 + 2', 2)
'= 2 + 2'
>>> remover('A -1 B text.txt', 2)
'B text.txt'
>>> remover('manipulate widgets add,1,2,3', 3)
',1,2,3'
>>> remover('manipulate,widgets,add,1,2,3', 2)
',add,1,2,3'
>>> remover('manipulate  ,widgets     add ,  1, 2  ,    3', 2)
'add ,  1, 2  ,    3'

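It is not clear what should happen with leading whitespace. If it should be kept, the lstrip() can be removed.

【Comments】:

【Solution 4】:

@Original poster. Please edit the test cases, because some of your statements appear contradictory. Your second test case treats the comma as a delimiter. But it also leaves a comma in the remainder, which is the second problem. Either it is a delimiter or it is not.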

# test cases: string, number of words to remove, desired answer
s=['manipulate widgets add,1,2,3',
   'manipulate,widgets,add,1,2,3',
   'manipulate  ,widgets     add ,  1, 2  ,    3',
   'manipulate  ,widgets     add ,  1, 2  ,    3',
   'LET X = 42',
   'LET FOUR = 2 + 2',
   'LET FOUR = 2 + 2',
   'A -1 B text.txt']

X= [3,2,2,3,3,2,3,2]

a= [',1,2,3',
    'add,1,2,3',
    'add ,  1, 2  ,    3',
    ',  1, 2  ,    3',
    '42',
    '= 2 + 2',
    '2 + 2',
    'B text.txt']

#Just to make it interesting, the input string can contain multiple spaces or tabs between words;
#only one comma, but that might also have spaces preceding/succeeding it
# <-- does that make the comma a word?

# only delimiters are space and tab, not commas
# <-- **does that make a single standing comma a word? **
# **2nd test case is contradictory to later statements, as comma is a delimiter here!**

【Comments】:

I apologize. The comma is not a delimiter. I will update the question.

【Solution 5】:

You can use re.sub():

>>> import re
>>> def remover(s, n):
...     return re.sub(r'^(\s?\b\w+\b\s?){%s}'%n,'', s)
...

Demo:

>>> s = 'manipulate widgets add,1,2,3'
>>> remover(s,3)
',1,2,3'
>>> remover(s,2)
'add,1,2,3'
>>> remover(s,1)
'widgets add,1,2,3'
>>> remover(s,0)
'manipulate widgets add,1,2,3'

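【Comments】:

This does not work for the string LET FOUR = 2 + 2 when I ask it to remove the first 3 words :-(
It also does not work when the string contains a negative number at the point where the cut should happen, as in A B -1 C (note that the minus sign is not a delimiter), e.g. remover('A -1 B text.txt', 2)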
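A quick way to see why a \w-based pattern stops early on these inputs is to compare how the two definitions of "word" tokenize them. This is only a diagnostic sketch; the helper names words_w and words_nondelim are hypothetical, and the delimiter set is the one stated in the question's update (space, tab, comma).

```python
import re


def words_w(s):
    # \w+ matches only letters, digits and underscore, so '=', '+', '-'
    # and '.' are never part of a word under this definition.
    return re.findall(r'\w+', s)


def words_nondelim(s, delimiters=' \t,'):
    # A word is any run of characters that are not delimiters.
    return re.findall('[^%s]+' % re.escape(delimiters), s)


print(words_w('LET FOUR = 2 + 2'))         # ['LET', 'FOUR', '2', '2']
print(words_nondelim('LET FOUR = 2 + 2'))  # ['LET', 'FOUR', '=', '2', '+', '2']
print(words_w('A -1 B text.txt'))          # ['A', '1', 'B', 'text', 'txt']
print(words_nondelim('A -1 B text.txt'))   # ['A', '-1', 'B', 'text.txt']
```

Because = and - are neither \w characters nor whitespace, a repeated group of the form \s?\b\w+\b\s? cannot step over them, so the substitution finds no match and returns the string unchanged.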

【Solution 6】:

import re

s1='manipulate widgets add,1,2,3'
# output desired ',1,2,3'
s2='manipulate,widgets,add,1,2,3'
# delete two words (manipulate,widgets) and leave ,add,1,2,3
s3='manipulate  ,widgets     add ,  1, 2  ,    3'
# delete 2 or 3 words

# for illustration
print(re.findall(r'\w+',s1))
print(re.findall(r'\w+',s2))
print(re.findall(r'\w+',s3))
print()


def deletewords(s,n):
    # note: the remaining words are re-joined with commas,
    # so the original delimiters are not preserved
    a= re.findall(r'\w+',s)
    return ','.join(a[n:])

# examples for use
print(deletewords(s1,1))
print(deletewords(s2,2))
print(deletewords(s3,3))

Output:

['manipulate', 'widgets', 'add', '1', '2', '3']
['manipulate', 'widgets', 'add', '1', '2', '3']
['manipulate', 'widgets', 'add', '1', '2', '3']

widgets,add,1,2,3
add,1,2,3
1,2,3

【Comments】:

【Solution 7】:

How about the following approach:

from itertools import islice
import re

text = "manipulate widgets,.  add,1,2,3"

# islice(..., 2, 3) yields only the third match; slice the text after it
for x in islice(re.finditer(r'\b(\w+?)\b', text), 2, 3):
    print(text[x.end():])

This will display:

,1,2,3
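The same idea (find where the n-th word ends and slice the original text from there) can be wrapped in a function that uses the question's delimiter set instead of \w. This is a sketch under that assumption; the names WORD and drop_words are hypothetical, and returning an empty string when fewer than n words exist is a design choice made here, not something the thread specifies.

```python
import re
from itertools import islice

# Assumption: a word is a run of characters other than space, tab or comma.
WORD = re.compile(r'[^ \t,]+')


def drop_words(text, n):
    """Return text from the end of its n-th word onward."""
    if n <= 0:
        return text
    # islice picks out only the n-th match (index n - 1), if there is one.
    nth = next(islice(WORD.finditer(text), n - 1, n), None)
    return text[nth.end():] if nth else ''


print(repr(drop_words('manipulate widgets add,1,2,3', 3)))  # ',1,2,3'
print(repr(drop_words('LET FOUR = 2 + 2', 2)))              # ' = 2 + 2'
```

Since the result is a slice of the original string, every delimiter after the cut point is preserved exactly as it was typed.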

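【Comments】:

【Solution 8】:

You can define the regular expression like this

>>> import re
>>> regEx = re.compile(r'(\s*,\s*|\s+)')

This means: a comma with zero or more whitespace characters before or after it, or else a run of one or more whitespace characters. Written this way the pattern can never match the empty string, which matters because since Python 3.7 re.split also splits on empty matches. The parentheses create a capturing group, which makes the split keep the delimiters.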

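Now split on the RegEx, then skip the number of actual elements you do not want, plus the delimiters that go with them (for example, if you want to skip three elements, there are two delimiters between those three elements, so you need to drop the first five elements of the split data), and finally join the rest back together.

For example,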

>>> def splitter(data, count):
...     return "".join(re.split(regEx, data)[count + (count - 1):])
... 
>>> splitter("manipulate,widgets,add,1,2,3", 2)
',add,1,2,3'
>>> splitter("manipulate widgets add,1,2,3", 3)
',1,2,3'
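The index arithmetic can be sketched with its edge cases made explicit. This is an illustrative variant rather than the answerer's code: the names DELIM and drop_leading and the guard for count <= 0 are assumptions, and the pattern is written so that it cannot match the empty string.

```python
import re

# Assumption: a delimiter is a comma with optional whitespace around it,
# or a run of whitespace. The capturing group keeps delimiters in the split.
DELIM = re.compile(r'(\s*,\s*|\s+)')


def drop_leading(data, count):
    if count <= 0:
        return data
    parts = DELIM.split(data)
    # parts alternates word, delimiter, word, ...; word k sits at index 2*k.
    # Dropping count words and the count - 1 delimiters between them means
    # keeping everything from index 2*count - 1 onward.
    return ''.join(parts[2 * count - 1:])


print(repr(drop_leading('manipulate,widgets,add,1,2,3', 2)))  # ',add,1,2,3'
print(repr(drop_leading('manipulate widgets add,1,2,3', 3)))  # ',1,2,3'
print(repr(drop_leading('LET FOUR = 2 + 2', 2)))              # ' = 2 + 2'
```

The guard is there because with count = 0 the slice start count + (count - 1) evaluates to -1, which would keep only the last element of the split instead of the whole string. The sketch also assumes the input does not begin with a delimiter; if it did, the split would start with an empty string that gets counted as a word.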

【Comments】: