Answers to: Split strings by 2nd space

【Question title】: Split strings by 2nd space
【Posted】: 2013-08-31 08:53:40
【Question description】:

Input:

"The boy is running on the train"

Expected output:

["The boy", "boy is", "is running", "running on", "on the", "the train"]

What is the simplest way to achieve this in Python?

【Question discussion】:

Koustav Ghosal's solution is the fastest; see my edit.

Tags: python string split

【Solution 1】:

line="The boy is running on the train"
words=line.split()
k=[words[index]+' '+words[index+1] for index in xrange(len(words)-1)]
print k

Output

['The boy', 'boy is', 'is running', 'running on', 'on the', 'the train']

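【Discussion】:

Wow! I didn't expect so many answers! Thanks :)
I like this solution best, though not for performance reasons. I just find it the simplest and easiest to understand.
@Koustav I find that using the identifier c for an integer is a small flaw: it causes momentary confusion, because the reader first assumes it denotes a character. Also, as Paul McGuire pointed out, xrange would be better.
I changed c to "index". I hope it is more intuitive now. I also changed range to xrange. In Python 3, however, only "range" can be used, for obvious reasons. Thanks for all your efforts and comments.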
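
For readers on Python 3, where xrange no longer exists and print is a function, the same index-based approach might look like the sketch below. The function name adjacent_pairs is made up for illustration and is not part of the original answer.

```python
def adjacent_pairs(line):
    """Return each pair of neighbouring words in `line`, joined by one space."""
    # split() with no argument splits on any run of whitespace
    words = line.split()
    # range() in Python 3 is lazy, like xrange() in Python 2
    return [words[index] + ' ' + words[index + 1]
            for index in range(len(words) - 1)]


print(adjacent_pairs("The boy is running on the train"))
```

With fewer than two words the range is empty, so the function returns an empty list rather than raising an error.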

【Solution 2】:

Split on all whitespace, then rejoin the words in pairs:

words = inputstr.split()
secondwords = iter(words)
next(secondwords)

output = [' '.join((first, second)) 
          for first, second in zip(words, secondwords)]

Demo:

>>> inputstr = "The boy is running on the train"
>>> words = inputstr.split()
>>> secondwords = iter(words)
>>> next(secondwords)  # output is ignored
'The'
>>> [' '.join((first, second)) for first, second in zip(words, secondwords)]
['The boy', 'boy is', 'is running', 'running on', 'on the', 'the train']
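
The trick above is that secondwords is the same word list advanced by one item, so zip pairs each word with its successor and stops when the shorter iterator runs out. A minimal Python 3 sketch of that idea follows, next to a slice-based variant; both function names are hypothetical. Passing a default to next() is an addition here: without it, an empty input string would raise StopIteration.

```python
def pairs_with_iterator(inputstr):
    """Pair each word with the next one using a second, advanced iterator."""
    words = inputstr.split()
    secondwords = iter(words)
    # Skip the first word; the default avoids StopIteration on empty input
    next(secondwords, None)
    return [' '.join(pair) for pair in zip(words, secondwords)]


def pairs_with_slice(inputstr):
    """Same result, pairing the list with a copy shifted by one position."""
    words = inputstr.split()
    return [' '.join(pair) for pair in zip(words, words[1:])]


print(pairs_with_iterator("The boy is running on the train"))
print(pairs_with_slice("The boy is running on the train"))
```

The slice variant copies the list once, which is usually negligible for a single sentence; the iterator variant avoids that copy.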

【Discussion】:

【Solution 3】:

import re

s = "The boy is running on the train"

print map(' '.join,re.findall('([^ \t]+)[ \t]+(?=([^ \t]+))',s))
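
How the pattern works: the first group consumes a word, [ \t]+ consumes the separator, and the lookahead (?=...) captures the following word without consuming it, so that word is still available as the first half of the next match. A Python 3 sketch of the same pattern, assuming a hypothetical helper named regex_pairs (in Python 3, map returns an iterator, so a list comprehension is used instead):

```python
import re

# Same pattern as above, written as a raw string and compiled once
PAIR_RE = re.compile(r'([^ \t]+)[ \t]+(?=([^ \t]+))')


def regex_pairs(s):
    """Return adjacent word pairs found by the lookahead pattern."""
    # findall yields one (word, next_word) tuple per match
    return [' '.join(pair) for pair in PAIR_RE.findall(s)]


print(regex_pairs("The boy is running on the train"))
```

Unlike str.split() with no argument, this pattern treats only spaces and tabs as separators, so newlines stay part of a word.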

Edit

Koustav Ghosal's solution is the fastest:

import re
from time import clock
from itertools import izip
from collections import defaultdict

s = "The boy is    running on the train"

z = 200
p = '%-9.6f %6.1f%%  %s'
rgx = re.compile('([^ \t]+)[ \t]+(?=([^ \t]+))')
R = defaultdict(list)

for rep in xrange(3000):

    t0 = clock()
    for i in xrange(z):
        map(' '.join,re.findall('([^ \t]+)[ \t]+(?=([^ \t]+))',s))
    te1 = clock()-t0
    R['e1'].append(te1)

    t0 = clock()
    for i in xrange(z):
        map(' '.join,rgx.findall(s))
    te2 = clock()-t0
    R['e2'].append(te2)

    t0 = clock()
    for i in xrange(z):
        words = s.split()
        secondwords = iter(words)
        next(secondwords)
        [' '.join((first, second))
         for first, second in zip(words, secondwords)]
    tM1 = clock()-t0
    R['M1'].append(tM1)

    t0 = clock()
    for i in xrange(z):
        words = s.split()
        secondwords = iter(words)
        next(secondwords)
        [' '.join((first, second))
         for first, second in izip(words, secondwords)]
    tM2 = clock()-t0
    R['M2'].append(tM2)

    t0 = clock()
    for i in xrange(z):
        words = s.split()
        secondwords = iter(words)
        next(secondwords)
        [' '.join(x)
         for x in izip(words, secondwords)]
    tM3 = clock()-t0
    R['M3'].append(tM3)

    t0 = clock()
    for i in xrange(z):
        words=s.split()
        [words[c]+' '+words[c+1] for c in range(len(words)-1)]
    tK1 = clock() - t0
    R['K1'].append(tK1)

    t0 = clock()
    for i in xrange(z):
        words=s.split()
        [words[c]+' '+words[c+1] for c in xrange(len(words)-1)]
    tK2 = clock() - t0
    R['K2'].append(tK2)

tmax = min(R['e1'])
for k,s in (('e1','eyquem with re.findall(pat,string)'),
            ('e2','eyquem with compiled_regex.findall(string)'),
            ('M1','Martijn Pieters'),
            ('M2','Martijn Pieters with izip'),
            ('M3','Martijn Pieters with izip and direct join'),
            ('K1','Koustav Ghosal'),
            ('K2','Koustav Ghosal with xrange')):
    t = min(R[k])
    print p % (t,t/tmax*100,s)

Results with Python 2.7

0.007127   100.0%  eyquem with re.findall(pat,string)
0.004045    56.8%  eyquem with compiled_regex.findall(string)
0.003887    54.5%  Martijn Pieters
0.002522    35.4%  Martijn Pieters with izip
0.002152    30.2%  Martijn Pieters with izip and direct join
0.002030    28.5%  Koustav Ghosal
0.001856    26.0%  Koustav Ghosal with xrange

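【Discussion】:

If you want to use speed as the measure of quality, use the timeit module. After changing Martijn Pieters' solution to remove the tuple unpacking/repacking, and changing Koustav Ghosal's to use xrange instead of range, I can get these timings to fluctuate: sometimes Martijn's is faster, sometimes Koustav's. timeit will run multiple tests, throw out the extremes, and give the average time. Still not perfect, but somewhat better than choosing based on a single run of the code.
@Paul McGuire I gave up using timeit because 1/ I never remember the details of how to use it, 2/ I still don't fully understand exactly what Timer.timeit does, and 3/ the documentation of Timer.repeat carries a note saying that the best measure of a code snippet's execution time is probably the minimum of the times obtained over several repetitions of the snippet, and that "you should look at the entire vector and apply common sense rather than statistics". So I prefer using clock, which IMO is perfectly reasonable, over the statistics done by Timer.repeat, as the following quote supports.
@Paul McGuire On time.clock(): "this is the function to use for benchmarking Python or timing algorithms" for Linux systems, and "the resolution is typically better than one microsecond" for Windows systems (docs.python.org/2/library/time.html#time.clock)
When I measure times with clock(), I don't draw conclusions from a single run of the code. Even when the SO answers I post don't repeat the measurement, I run the code several times and then pick the run that seems to me to give the lowest values for all execution times; or I take the lowest execution time of each code snippet over many repetitions and then compare the minima obtained across several runs.
@Paul McGuire Taking your comment into account, I modified the answer above so that the code does all of this by itself. The result is that Koustav's solution is still the best.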
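
As an illustration of the approach discussed in these comments (repeat the measurement and keep the minimum), here is a minimal Python 3 sketch using timeit.repeat. The helper names, the repeat count and the loop count are arbitrary choices for the example, and the timings will differ by machine and Python version. Note that time.clock() was removed in Python 3.8, so the benchmark in the answer above runs only on older interpreters.

```python
import timeit

S = "The boy is    running on the train"


def index_pairs():
    """Index-based approach (Koustav Ghosal's answer, with range)."""
    words = S.split()
    return [words[i] + ' ' + words[i + 1] for i in range(len(words) - 1)]


def zip_pairs():
    """zip-based approach, pairing the list with a shifted copy."""
    words = S.split()
    return [' '.join(pair) for pair in zip(words, words[1:])]


def best_time(func, repeat=5, number=2000):
    """Run `func` `number` times per trial and return the fastest trial."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


for func in (index_pairs, zip_pairs):
    print('%-12s %.6f s' % (func.__name__, best_time(func)))
```

Taking the minimum rather than the mean follows the note in the timeit documentation quoted above: slower trials mostly reflect interference from other processes.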

【Solution 4】:

Alternatively, a solution with itertools.combinations:

>>> import itertools
>>> s = "The boy is running on the train"
>>> seen = set()
>>> new = []
>>> for tup in itertools.combinations(s.split(), 2):
...     if tup[0] not in seen:
...             new.append(' '.join(tup))
...             seen.add(tup[0])
... 
>>> print new
['The boy', 'boy is', 'is running', 'running on', 'on the', 'the train']

Although this is really not what itertools.combinations is meant for :p.

【Discussion】:
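
One caveat worth checking before using this approach: because seen records words rather than positions, a word that occurs twice contributes a pair only for its first occurrence. The Python 3 sketch below wraps the loop above in a function (the name combinations_pairs is hypothetical) and compares it with a plain zip over neighbouring words on a made-up input with a repeated word.

```python
import itertools


def combinations_pairs(s):
    """The combinations-based loop from the answer, wrapped in a function."""
    seen = set()
    new = []
    for tup in itertools.combinations(s.split(), 2):
        # Keep only the first combination that starts with each word
        if tup[0] not in seen:
            new.append(' '.join(tup))
            seen.add(tup[0])
    return new


def zip_pairs(s):
    """Pair each word with its immediate successor, by position."""
    words = s.split()
    return [' '.join(pair) for pair in zip(words, words[1:])]


print(combinations_pairs("the cat the dog"))  # the second "the" is skipped
print(zip_pairs("the cat the dog"))
```

On the question's sentence both functions agree, since "The" and "the" differ in case; combinations also generates every pair of words, so its cost grows quadratically with the number of words.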