【Question Title】:Access all fields in mbox using mailbox
【Posted】:2020-12-20 07:10:03
【Question】:

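I am trying to do some processing on emails in mbox format.

After some searching and a bit of trial and error, I have been working from https://docs.python.org/3/library/mailbox.html#mbox

With the test code listed below I have managed to do most of what I want (even though I had to write my own code to decode the Subject).

I find this a bit hit and miss. In particular, finding the key needed to look up the "subject" field was trial and error, and I can't find any way to list the candidate keys for a message. (I understand that the fields may vary from email to email.)

Can anyone help me list the possible values?

I have another question as well: an email may contain many "Received:" fields, e.g.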

Received: from awcp066.server-cpanel.com
Received: from mail116-213.us2.msgfocus.com ([185.187.116.213]:60917)
    by awcp066.server-cpanel.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256)

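I am interested in accessing the chronologically FIRST one. I am happy to search for it, but I can't find any way to access anything other than the first one in the file.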

#! /usr/bin/env python3
#import locale
#2020-08-31

"""
Extract Subject from MBOX file
"""

import os, time
import mailbox
import base64, quopri

def isbqencoded(s):
    """
    Test if Base64 or Quoted Printable strings
    """
    return s.upper().startswith('=?UTF-8?')

def bqdecode(s):
    """
    Convert UTF-8 Base64 or Quoted Printable string to str
    """
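    nd = s.find('?=', 10)
    if s.upper().startswith('=?UTF-8?B?'):   # Base64
        bbb = base64.b64decode(s[10:nd])
    elif s.upper().startswith('=?UTF-8?Q?'): # Quoted Printable
        # header=True also decodes '_' as a space, as RFC 2047 "Q" encoding requires
        bbb = quopri.decodestring(s[10:nd], header=True)
    else:
        return s                             # unknown encoding: leave unchanged
    return bbb.decode("utf-8")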

def sdecode(s):
    """
    Convert possibly multiline Base64 or Quoted Printable strings to str
    """
    outstr = ""
    if s is None:
        return outstr
    for ss in str(s).splitlines():   # split multiline strings
        sss = ss.strip()
        for sssp in sss.split(' '):   # split multiple strings
            if isbqencoded(sssp):
                outstr += bqdecode(sssp)
            else:
                outstr += sssp
            outstr+=' '
        outstr = outstr.strip()
    return outstr

INBOX = '~/temp/2020227_mbox'

print('Messages in ', INBOX)
mymail = mailbox.mbox(INBOX)
print('Values = ', mymail.values())
print('Keys = ', mymail.keys())
# print(mymail.items)
# for message in mailbox.mbox(INBOX):
for message in mymail:

#     print(message)
    subject = message['subject']
    to = message['to']
    id = message['id']
    received = message['Received']
    sender = message['from']
    ddate = message['Delivery-date']
    envelope = message['Envelope-to']


    print(sdecode(subject))
    print('To ', to)
    print('Envelope ', envelope)
    print('Received ', received)
    print('Sender ', sender)
    print('Delivery-date ', ddate)
#     print('Received ', received[1])

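Based on this answer I simplified the Subject decoding and got similar results.

I am still looking for suggestions on how to access the rest of the headers, in particular how to access multiple "Received:" fields.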

#! /usr/bin/env python3
#import locale
#2020-09-02

"""
Extract Subject from MBOX file
"""

import os, time
import mailbox
from email.parser import BytesParser
from email.policy import default

INBOX = '~/temp/2020227_mbox'
print('Messages in ', INBOX)

mymail = mailbox.mbox(INBOX, factory=BytesParser(policy=default).parse)

for message in mymail:
    print("date:  :", message['date'])
    print("to:    :", message['to'])
    print("from   :", message['from'])
    print("subject:", message['subject'])
    print('Received: ', message['received'])

    print("**************************************")
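
As a minimal sketch of why switching to `policy=default` removes the need for hand-written Subject decoding: the legacy `compat32` policy returns the raw RFC 2047 encoded word, while the `default` policy decodes it on access. The one-header message below is a made-up sample, not taken from the mailbox in the question.

```python
from email import message_from_bytes
from email.header import decode_header, make_header
from email.policy import compat32, default

# Hypothetical one-header message with an RFC 2047 encoded Subject.
RAW = b"Subject: =?UTF-8?Q?caf=C3=A9?=\n\nbody\n"

legacy = message_from_bytes(RAW, policy=compat32)
modern = message_from_bytes(RAW, policy=default)

# compat32 hands back the header text exactly as stored in the file.
raw_subject = legacy['subject']

# Standard-library decoding of the raw value, instead of custom base64/quopri code.
decoded_by_hand = str(make_header(decode_header(raw_subject)))

# The default policy decodes encoded words when the header is read.
decoded_by_policy = str(modern['subject'])

print(raw_subject)
print(decoded_by_hand)
print(decoded_by_policy)
```

Either route should give the same text; the policy route is simply less code when the whole mailbox is parsed with `BytesParser(policy=default)`.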

【Comments】:

    Tags: python email mbox


    【Solution 1】:

    Email message objects provide a get_all method, which returns all instances of a header, so we can use it to get the values of all the Received headers.

    # Pass [] as the fallback: get_all returns None by default when the header is absent
    for header in message.get_all('received', []):
        print('Received', header)
    

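    Each header is an instance of UnstructuredHeader. This is not very helpful for identifying the earliest Received header, because each header has to be parsed to extract its date before the headers can be sorted.

    However, according to this answer, which quotes the RFC, Received headers are always inserted at the beginning of the message. The docstring for EmailMessage.get_all() states:

    Return a list of all the values for the named field. These will be sorted in the order they appeared in the original message, and may contain duplicates.

    So the earliest Received header should be the last header in the list returned by EmailMessage.get_all().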
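
The reasoning above can be sketched as follows. The sample message, host names and timestamps are invented for illustration, and `received_date` is a hypothetical helper that assumes each Received header ends with `; <date>`. Real-world headers may be malformed, in which case the date parsing would raise.

```python
from email import message_from_string
from email.policy import default
from email.utils import parsedate_to_datetime

# Hypothetical sample message; hosts and timestamps are made up.
RAW = (
    "Received: from relay.example.net by mx.example.org; Sun, 20 Dec 2020 07:10:03 +0000\n"
    "Received: from sender.example.com by relay.example.net; Sun, 20 Dec 2020 07:09:58 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

def received_date(header):
    """Parse the timestamp that follows the last ';' in a Received header."""
    return parsedate_to_datetime(str(header).rpartition(';')[2].strip())

message = message_from_string(RAW, policy=default)
received = message.get_all('received', [])   # [] instead of None when absent

# Option 1: rely on header order (each hop prepends, so the last is the earliest).
earliest = received[-1] if received else None

# Option 2: rely on the timestamps instead of the order.
oldest_by_date = min(received, key=received_date, default=None)

print(earliest)
print(oldest_by_date)
```

Taking the last list element is cheaper and does not depend on date parsing; comparing timestamps is only a cross-check, since clocks on different mail servers are not guaranteed to agree.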

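    【Comments】:

      【Solution 2】:

      Following snakecharmerb's comment (now edited into the question), I simplified the process.
      In the end I did not need to decode Received, because Message-ID turned out to be the id I had been extracting from the original Received field.

      I have listed the code I ended up using in case it is useful to anyone else. This code only extracts the header fields of interest and prints them, but the full code goes on to analyse the messages.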

      #! /usr/bin/env python3
      #import locale
      #2020-09-05
      
      """
      Extract Message Header details from MBOX file
      """
      
      import os, time
      import mailbox
      from email.parser import BytesParser
      from email.policy import default
      
      INBOX = '~/temp/Gmail'
      
      print('Messages in ', INBOX)
      
      mymail = mailbox.mbox(INBOX, factory=BytesParser(policy=default).parse)
      
      for message in mymail:
          date = message['date']
          to = message['to']
          sender = message['from']
          subject = message['subject']
          messageID = message['Message-ID']
          received = message['received']
          deliveredTo = message['Delivered-To']
          if messageID is None:   # skip entries that carry no Message-ID header
              continue
      
          print("Date        :", date)
          print("From        :", sender)
          print("To:         :", to)
          print('Delivered-To:', deliveredTo)
          print("Subject     :", subject)
          print("Message-ID  :", messageID)
      #     print('Received    :', received)
      
          print("**************************************")
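
On the original question of how to list the candidate field names: message objects behave like a mapping, so `keys()` and `items()` expose whatever headers a given message carries. A minimal sketch on a made-up message (the addresses and hosts are placeholders); the same calls should work on each `message` yielded by the mailbox loop above.

```python
from email import message_from_string
from email.policy import default

# Hypothetical sample message; addresses and hosts are placeholders.
RAW = (
    "Received: from b.example.net\n"
    "Received: from a.example.com\n"
    "From: alice@example.com\n"
    "To: bob@example.org\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)

message = message_from_string(RAW, policy=default)

# keys() lists header names in file order, repeating names that occur more than once.
names = message.keys()
print(names)

# Header lookup is case-insensitive, so a lower-cased set gives the distinct fields.
unique = sorted({name.lower() for name in names})
print(unique)

# items() pairs every header occurrence with its value.
for name, value in message.items():
    print(f"{name}: {value}")
```

Because the set of headers differs from message to message, collecting the names across the whole mailbox (for example into a `set`) is one way to see every field that actually occurs.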
      

      【Comments】:
