Stripping out unwanted characters that are breaking readline(): answers

【Question title】: Stripping out unwanted characters that are breaking readline()
【Posted】: 2019-08-12 18:15:28
【Question description】:

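I'm writing a small script to run through large folders of copyright notice emails and find relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on the same line, sometimes in different places, timestamps come in 4 different formats, etc.).

I've run into one weird problem where some of the files I'm parsing spew out weird characters in the middle of a line, breaking my parsing of the readline() return. When read in a text editor, the line in question looks normal, but readline() reads an '=' and two '\n' characters right in the middle of an IP.

For example: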

Normal return from readline():
"IP Address: xxx.xxx.xxx.xxx"

Broken readline() return:
"IP Address: xxx.xxx.xxx="

The next two lines after that being:
""
".xxx"

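Any idea how I can get around this? I can't really control what is causing it; I just need to handle it without going too crazy.

Relevant function, for reference (I know it's a mess):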

import codecs
import re

def getIP(em):
    ce = codecs.open(em, encoding='latin1')
    iplabel = ""
    while  not ("Torrent Hash Value: " in iplabel):
        iplabel = ce.readline()

    ipraw = ce.readline()
    if ("File Size" in ipraw):
        ipraw = ce.readline()

    ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw)
    if ip:
        ce.close()
        return ip[0]
    else:
        ipraw = ce.readline()
        ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw)
        if ip:
            ce.close()
            return ip[0]
        else:
            ce.close()
            return ("No IP found in: " + ipraw)

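【Comments】:

Are you sure there is only a single = character before the two \n? Could some other IPs contain other characters besides =, and possibly more than one? If all you have is =\n\n, you can account for it in your IP regex by adding (?:=\n*)? before the last part of the IP, .xxx.
The problem is that I only apply the regex after reading a line into a string, and the newlines split the string apart. My first instinct was to read 3 lines, concatenate them, and then run the regex, but that would be a fair amount of extra load if done on every run of the script, and it would be pretty spaghetti code if I just stuck it in another else: at the end, since I'd need to save the line position and return to it if the "normal" search didn't work.
If your data is split across multiple lines, I'd suggest processing a string built from at least two lines: at each step read one more line, drop the first line, and join the second line with the new one, iterating that way. Otherwise capturing the right pattern will be difficult.
Ended up just saving the earlier lines I'd read, combining them, and using re.sub to remove (=\r*\n), and it works (turns out there was also a \r character between the = and the \n, which was confusing for a minute). Thanks for the help.
If you have solved the problem, please add it as an answer and accept it, rather than putting the solution in the question.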

Tags: python regex email quoted-printable

【Solution 1】:

At least some of the emails that you are processing appear to have been encoded as quoted-printable.

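This encoding is used to make 8-bit character data transmissible over 7-bit (ASCII-only) systems, but it also enforces a maximum line length of 76 characters. Longer lines are wrapped by inserting a soft line break, which consists of "=" followed by the end-of-line marker.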

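Python provides the quopri module to handle encoding to and decoding from quoted-printable. Decoding your data from quoted-printable will remove these soft line breaks.

As an example, let's use the first paragraph of your question.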

>>> import quopri
>>> s = """I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.)."""

>>> # Encode to latin-1 as quopri deals with bytes, not strings.
>>> bs = s.encode('latin-1')

>>> # Encode
>>> encoded = quopri.encodestring(bs)
>>> # Observe the "=\n" inserted into the text.
>>> encoded
b"I'm writing a small script to run through large folders of copyright notice=\n emails and finding relevant information (IP and timestamp). I've already f=\nound ways around a few little formatting hurdles (sometimes IP and TS are o=\nn different lines, sometimes on same, sometimes in different places, timest=\namps come in 4 different formats, etc.)."

>>> # Printing without decoding from quoted-printable shows the "=".
>>> print(encoded.decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice=
 emails and finding relevant information (IP and timestamp). I've already f=
ound ways around a few little formatting hurdles (sometimes IP and TS are o=
n different lines, sometimes on same, sometimes in different places, timest=
amps come in 4 different formats, etc.).

>>> # Decode from quoted-printable to remove soft line breaks.
>>> print(quopri.decodestring(encoded).decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.).
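
Applied to the broken line from the question, the same call should stitch the IP address back together. This is only a minimal sketch: the address 192.0.2.10 is a made-up documentation address, and the CRLF line ending is an assumption based on the asker's later comment about a \r sitting between the = and the \n.

```python
import quopri

# Hypothetical sample: an IP address split by a quoted-printable soft line break.
raw = b"IP Address: 192.0.2=\r\n.10\r\n"

# decodestring() works on bytes, so decode to str afterwards.
decoded = quopri.decodestring(raw).decode('latin-1')
print(repr(decoded))
```

After decoding, the four octets sit on one line again, so the original IP regex can match them.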

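To decode correctly, the entire message body needs to be processed, which conflicts with your approach of using readline. One way around this is to load the decoded string into a buffer: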

import io
import quopri

def getIP(em):
    with open(em, 'rb') as f:
        bs = f.read()
    decoded = quopri.decodestring(bs).decode('latin-1')

    ce = io.StringIO(decoded)
    iplabel = ""
    while  not ("Torrent Hash Value: " in iplabel):
        iplabel = ce.readline()
        ...
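
For readers who want something they can run end to end, here is one possible way to finish that idea. It is a sketch, not the asker's function: the name find_ip_after_label and the sample bytes are hypothetical, and it simplifies the original logic by scanning every line after the label instead of only the next two or three.

```python
import io
import quopri
import re

IP_PATTERN = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

def find_ip_after_label(raw_bytes, label="Torrent Hash Value: "):
    """Return the first IPv4-looking string found after `label`, or None.

    Hypothetical helper; assumes the input is quoted-printable encoded.
    """
    # Decode the whole body first so soft line breaks ("=" + newline) vanish.
    text = quopri.decodestring(raw_bytes).decode('latin-1')
    lines = io.StringIO(text)

    # Skip ahead to the label line; give up if it never appears.
    for line in lines:
        if label in line:
            break
    else:
        return None

    # The StringIO iterator resumes on the line after the label.
    for line in lines:
        match = IP_PATTERN.search(line)
        if match:
            return match.group(0)
    return None

# Made-up sample data in the shape described in the question.
sample = (b"Torrent Hash Value: abc123\r\n"
          b"File Size: 42\r\n"
          b"IP Address: 192.0.2=\r\n.10\r\n")
print(find_ip_after_label(sample))
```

One caveat: if a file is not actually quoted-printable, a literal "=" followed by two hex digits would be treated as an escape and decoded, so this is best applied only to files known to use the encoding.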

If your files contain complete emails (including headers), then the tools in the email module will handle this decoding automatically.

import email
from email import policy

with open('message.eml') as f:
    s = f.read()
msg = email.message_from_string(s, policy=policy.default)
body = msg.get_content()
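
A variation on the same idea, in case the files hold multipart messages or bytes that do not decode cleanly as text: parse the raw bytes and ask for the plain-text body. This is a sketch under assumptions; the helper name get_text_body and the sample message are made up for illustration.

```python
from email import message_from_bytes, policy

def get_text_body(raw_bytes):
    """Return the decoded text/plain body of a raw email, or None.

    Hypothetical helper. The email package undoes the
    Content-Transfer-Encoding (quoted-printable here) for us.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # get_body() also works for simple, non-multipart messages.
    part = msg.get_body(preferencelist=('plain',))
    if part is None:
        return None
    return part.get_content()

# Made-up message with a soft line break in the middle of the IP.
raw = (b"From: sender@example.com\r\n"
       b"Subject: Notice\r\n"
       b"MIME-Version: 1.0\r\n"
       b"Content-Type: text/plain; charset=\"iso-8859-1\"\r\n"
       b"Content-Transfer-Encoding: quoted-printable\r\n"
       b"\r\n"
       b"IP Address: 192.0.2=\r\n"
       b".10\r\n")
body = get_text_body(raw)
print(body)
```

Reading the file in binary mode and using message_from_bytes avoids having to guess the file's text encoding before the headers have been parsed.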

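【Comments】:

【Solution 2】:

Solved. If anyone else has a similar problem: save each line as a string, merge them together, then use re.sub() to strip the characters out, keeping both the \r and \n characters in mind. My solution is a bit spaghetti, but it avoids running unneeded regexes on every file: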

import codecs
import re

def getIP(em):
    ce = codecs.open(em, encoding='latin1')
    iplabel = ""
    while  not ("Torrent Hash Value: " in iplabel):
        iplabel = ce.readline()

    ipraw = ce.readline()
    if ("File Size" in ipraw):
        ipraw = ce.readline()

    ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw)
    if ip:
        ce.close()
        return ip[0]
    else:
        ipraw2 = ce.readline()                              #made this a new var
        ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw2)
        if ip:
            ce.close()
            return ip[0]
        else:
            ipraw = ipraw + ipraw2                          #Added this section
            ipraw = re.sub(r'(=\r*\n)', '', ipraw)          #
            ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw)
            if ip:
                ce.close()
                return ip[0]
            else:
                ce.close()
                return ("No IP found in: " + ipraw)
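
The core of this approach, joining the saved lines and stripping the "=" plus line ending before matching, can be isolated into a small function. The following is only a sketch of that one step; the name find_ip and the sample lines are hypothetical.

```python
import re

SOFT_BREAK = re.compile(r'=\r*\n')             # "=" followed by optional \r, then \n
IP_PATTERN = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

def find_ip(lines):
    """Join already-read lines, drop soft line breaks, then look for an IP."""
    joined = ''.join(lines)
    joined = SOFT_BREAK.sub('', joined)
    match = IP_PATTERN.search(joined)
    return match.group(0) if match else None

# Made-up lines in the shape readline() returned in the question.
print(find_ip(["IP Address: 192.0.2=\r\n", ".10\r\n"]))
```

Note that this only removes soft line breaks. Unlike quopri, it does not decode other quoted-printable escapes such as =3D, so it fits best when only the wrapped lines are a problem.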

【Comments】: