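Python - how to delete all characters in all lines after some sign? Answers

【Question title】: Python - how to delete all characters in all lines after some sign?
【Posted】: 2014-06-01 22:03:12
【Question description】:

I want to delete all characters after the @ sign in every line. I wrote this code: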

#!/usr/bin/env python
import sys, re, urllib2
url = 'http://varenhor.st/wp-content/uploads/emails.txt'
document = urllib2.urlopen(url)
html = document.read()

html2 = html[0]
for x in html.rsplit('@'):
    print x

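But it only removes the @ sign and moves the remaining characters onto the next line. How can I modify this code so that it deletes all characters after the @ in every line? Should I use a regular expression?

【Question comments】:

Tags: python regex

【Solution 1】:

You are splitting too many times; use str.rpartition() instead and ignore the part after the @. Do this for each line: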

for line in html.splitlines():
    cleaned = line.rpartition('@')[0]
    print cleaned

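Alternatively, for older Python versions (str.rpartition() was added in Python 2.5), limit str.rsplit() to a single split and again take only the first result: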

for line in html.splitlines():
    cleaned = line.rsplit('@', 1)[0]
    print cleaned
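
The two variants above agree whenever a line contains an @, but they differ on lines that do not. The following is a minimal Python 3 sketch of that difference; the helper names cut_rpartition and cut_rsplit and the sample strings are made up for illustration and are not part of the original answer.

```python
# Python 3 sketch; the sample lines below are invented for illustration.
samples = ["alice@example.com", "a@b@example.org", "no-at-sign-here"]

def cut_rpartition(line):
    # rpartition returns (head, sep, tail); when '@' is absent, head is ''.
    return line.rpartition("@")[0]

def cut_rsplit(line):
    # rsplit with maxsplit=1 returns [line] unchanged when '@' is absent.
    return line.rsplit("@", 1)[0]

for line in samples:
    print(repr(cut_rpartition(line)), repr(cut_rsplit(line)))
```

So if the input may contain lines without an @, rpartition() yields an empty string for them while rsplit() keeps the whole line; which behavior is preferable depends on the data.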

I used str.splitlines() to split the text cleanly regardless of the newline style. You can also loop directly over the urllib2 response file object:

url = 'http://varenhor.st/wp-content/uploads/emails.txt'
document = urllib2.urlopen(url)
for line in document:
    cleaned = line.rpartition('@')[0]
    print cleaned
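
For readers on Python 3, the same line-by-line loop might look roughly like the sketch below. It assumes the response is iterated as bytes (as urllib.request responses are) and that the file is UTF-8 encoded; strip_after_at is a hypothetical helper name, and io.BytesIO stands in for the network response so the example runs offline. Unlike the loop above, it deliberately keeps lines without an @ unchanged.

```python
import io

def strip_after_at(lines, encoding="utf-8"):
    """Yield each line with everything from the last '@' onward removed."""
    for raw in lines:
        # Python 3 HTTP responses yield bytes, so decode before string work.
        line = raw.decode(encoding).rstrip("\r\n")
        head, sep, _tail = line.rpartition("@")
        # Assumption: keep lines that contain no '@' instead of emptying them.
        yield head if sep else line

# io.BytesIO stands in for the object returned by urllib.request.urlopen(url).
fake_response = io.BytesIO(b"alice@example.com\r\nbob@example.org\nno-at-sign\n")
for cleaned in strip_after_at(fake_response):
    print(cleaned)
```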

Demo:

>>> import urllib2
>>> url = 'http://varenhor.st/wp-content/uploads/emails.txt'
>>> document = urllib2.urlopen(url)
>>> for line in document:
...     cleaned = line.rpartition('@')[0]
...     print cleaned
... 
ADAKorb...
AllisonSarahMoo...
Artemislinked...
BTBottg...
BennettLee...
Billa...
# etc.

【Comments】:

【Solution 2】:

You can use Python's slice notation:

import re
import sys
import urllib2

url = 'http://varenhor.st/wp-content/uploads/emails.txt'
document = urllib2.urlopen(url)
html = document.read()

for line in html.splitlines():
    at_index = line.index('@')
    print line[:at_index]

Because strings are sequences, you can slice them. For example:

hello_world = 'Hello World'
hello = hello_world[:5]
world = hello_world[6:]

Keep in mind that slicing returns a new sequence and does not modify the original.
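
A small Python 3 sketch of both points: slicing builds a new string, and str.index() raises ValueError when the substring is missing, whereas str.find() returns -1. The helper name before_at and the sample address are hypothetical, and leaving @-less lines untouched is an assumption about the desired behavior.

```python
def before_at(line):
    # str.find returns -1 when '@' is absent; str.index would raise ValueError.
    at_index = line.find("@")
    if at_index == -1:
        return line  # assumption: leave lines without '@' untouched
    return line[:at_index]

original = "carol@example.com"
sliced = before_at(original)
print(sliced)    # carol
print(original)  # carol@example.com (the slice is a new string)
```

Note that find() and index() locate the first @, so for a line with several @ signs this cuts earlier than rpartition() would.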

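【Comments】:

Thanks for the answer, but when I run this code I get an error: Traceback (most recent call last): File "a.py", line 11, in <module> at_index = line.index('@') ValueError: substring not found
Did you run it against the URL provided (http://varenhor.st/wp-content/uploads/emails.txt)? I managed to run it. The error says that one of the lines has no '@' character.
In any case, @Martijn's solution is better (although not as efficient; that is not a problem in this case) :-) But I would love to know why you could not get this to work!
Edit: sorry s16h, your solution does work, of course. I had typed it in by hand; when I pasted it instead, it worked fine.

【Solution 3】:

Since you have already imported re, you can use it: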

# Relies on re, urllib2 and url from the question's script.
document = urllib2.urlopen(url)
reg_ptn = re.compile(r'@.*')
for line in document:
    print reg_ptn.sub('', line)
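
The pattern @.* matches from the first @ to the end of the line, so on lines with more than one @ it behaves differently from rpartition(). A minimal Python 3 sketch, with invented sample strings and a hypothetical helper name cut_regex:

```python
import re

AT_TO_END = re.compile(r"@.*")  # same pattern as in the answer above

def cut_regex(line):
    # '.' does not match a newline by default, so a trailing '\n' survives.
    return AT_TO_END.sub("", line)

print(repr(cut_regex("dave@example.com\n")))        # 'dave\n'
print(repr(cut_regex("a@b@example.org")))           # 'a'
print(repr("a@b@example.org".rpartition("@")[0]))   # 'a@b'
```

Because the newline is kept, printing each result adds a second line break unless the line is stripped first.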

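【Comments】:

Regular expressions are slow, but it is nice to see a different way of solving this.
Test before you make claims.
Regular expressions are known to be slow. Anyway, your code: 11143 function calls (11046 primitive calls) in 1.044 seconds. Martijn's code: 6248 function calls (6152 primitive calls) in 0.459 seconds. My code: 6253 function calls (6157 primitive calls) in 0.471 seconds.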
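
The figures quoted above appear to come from profiling whole scripts, which presumably includes the download. One way to compare only the string handling is a timeit run on in-memory data, sketched below in Python 3. The input lines are synthetic, the function names are hypothetical, and the measured times will vary by machine, so no results are claimed here.

```python
import re
import timeit

# Synthetic input; the addresses are invented, not taken from the original file.
lines = ["user%d@example.com" % i for i in range(1000)]
pattern = re.compile(r"@.*")

def with_rpartition():
    return [line.rpartition("@")[0] for line in lines]

def with_index_slice():
    return [line[:line.index("@")] for line in lines]

def with_regex():
    return [pattern.sub("", line) for line in lines]

# Time 100 passes over the 1000 lines for each approach.
for func in (with_rpartition, with_index_slice, with_regex):
    seconds = timeit.timeit(func, number=100)
    print("%-18s %.4f s" % (func.__name__, seconds))
```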