Finding links in an email's body with Python (answers)

【Question title】: Finding links in an email's body with Python
【Posted】: 2018-08-05 17:53:10
【Question description】:

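I am currently working on a project in Python that connects to an email server and looks at the latest email, to tell the user whether it contains an attachment or an embedded link. I have the former working but not the latter.

I may be having trouble with the if any() part of my script, since it seems to half work when I test it. Then again, that may be down to how the email string is printed out.

Here is my code for connecting to Gmail and then looking for links.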

import imaplib
import email

word = ["http://", "https://", "www.", ".com", ".co.uk"] #list of strings to search for in email body

#connection to the email server
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('email@gmail.com', 'password')
mail.list()
# Out: list of "folders" aka labels in gmail.
mail.select("Inbox", readonly=True) # connect to inbox.

result, data = mail.uid('search', None, "ALL") # search and return uids instead

ids = data[0] # data is a list.
id_list = ids.split() # ids is a space separated string
latest_email_uid = data[0].split()[-1]

result, data = mail.uid('fetch', latest_email_uid, '(RFC822)') # fetch the email headers and body (RFC822) for the given ID


raw_email = data[0][1] # here's the body, which is raw headers and html and body of the whole email
# including headers and alternate payloads

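print("---------------------------------------------------------")
print("Are there links in the email?")
print("---------------------------------------------------------")

# imaplib returns bytes on Python 3, so parse with message_from_bytes
msg = email.message_from_bytes(raw_email)
for part in msg.walk():
    # each part is either non-multipart, or another multipart message
    # that contains further parts... Message is organized like a tree
    if part.get_content_type() == 'text/plain':
        # decode=True undoes base64 / quoted-printable transfer encoding
        charset = part.get_content_charset() or 'utf-8'
        plain_text = part.get_payload(decode=True).decode(charset, errors='replace')
        print(plain_text)  # prints the decoded text
        # the loop variable gets its own name so it is not confused with the `word` list
        if any(keyword in plain_text for keyword in word):
            print('****')
            print('found link in email body')
            print('****')
        else:
            print('****')
            print('no link in email body')
            print('****')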
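So basically, as you can see, I have a variable called "word" that holds a list of keywords to search for in the plain-text email.

When I send a test email with an embedded link in the "http://" or "https://" format, the script prints the email body with the link in the text, like so: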

---------------------------------------------------------
Are there links in the email?
---------------------------------------------------------
Test Link <http://www.google.com/>


****
found link in email body
****

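I get my printed message saying "found link in email body", which is the result I am looking for in the test phase; in the final program this will trigger something else.

However, if I add an embedded link to the email without http://, such as google.com, the link is not printed out and I do not get the result, even though the email does have an embedded link.

Is there a reason for this? I also suspect my if any() check may not be the best approach. I added it originally without really understanding it, but it worked for http:// links. Then I tried a .com link and ran into a problem I cannot find a solution to.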

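【Question comments】:

Your any code is not the problem. I tried it with word = ["http://", "https://", "www.", ".com", ".co.uk"] and plain_text = 'google.com', and the result of any(word in plain_text for word in word) is True.
OK, so basically I need to find a way to get "google.com" to print out in the email body string the same way an "http://" embedded link does, since the code itself works? I never thought to test the basic code in the console first, thanks.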
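
The console test described in the first comment can be reproduced with a few lines. This is a minimal sketch: the keyword list is copied from the question, while the helper name `contains_link_hint` and the sample strings are made up for illustration.

```python
# Keyword list copied from the question.
keywords = ["http://", "https://", "www.", ".com", ".co.uk"]


def contains_link_hint(text, hints=keywords):
    """Return True if any hint occurs as a substring of text (hypothetical helper)."""
    return any(hint in text for hint in hints)


print(contains_link_hint("google.com"))                          # True
print(contains_link_hint("Test Link <http://www.google.com/>"))  # True
print(contains_link_hint("Test Link"))                           # False
```

If the check returns False for a real email, the substring test is probably not at fault; the string being tested most likely does not contain the URL at all.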

Tags: python email parsing imaplib

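【Solution 1】:

To check whether an email has attachments, you can look at the Content-Type header and see whether it says "multipart/*". An email with a multipart content type may contain attachments.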
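
As a rough illustration of the attachment check, the sketch below uses the standard library `email` package (Python 3.6 or later). The helper name `has_attachment` and both test messages are hypothetical. It goes one step past the Content-Type test by also looking at each part's Content-Disposition, because a multipart message is not guaranteed to carry an attachment.

```python
from email.message import EmailMessage


def has_attachment(msg):
    """Return True if any part of msg is marked as an attachment (hypothetical helper)."""
    if not msg.is_multipart():
        # A single-part message has no separate attachment parts.
        return False
    return any(part.get_content_disposition() == "attachment"
               for part in msg.walk())


# Two made-up messages, one without and one with an attachment.
plain = EmailMessage()
plain["Subject"] = "No attachment"
plain.set_content("Just text.")

with_file = EmailMessage()
with_file["Subject"] = "With attachment"
with_file.set_content("See the attached file.")
with_file.add_attachment(b"hello", maintype="application",
                         subtype="octet-stream", filename="hello.bin")

print(has_attachment(plain))      # False
print(has_attachment(with_file))  # True
```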

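To check for links, images, and so on in the text, you can try regular expressions. In fact, I think that is probably your best option. With regular expressions (regex) you can find strings that match a given pattern. For example, the pattern "<a[^>]+href=\"(.*?)\"[^>]*>(.*)?</a>" should match the HTML links in an email, whether the visible link text is a single word or a full URL. I hope this helps! Here is an example of how to implement this in Python: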

import re

# Sample HTML body; the href value is wrapped in single quotes
text = ("This is your e-mail body. It contains a link to "
        "<a href='http://www.google.com'>Google</a>.")

# Capture the href value (group 1) and the visible link text (group 2)
link_pattern = re.compile(r"<a[^>]+href='(.*?)'[^>]*>(.*?)</a>")
search = link_pattern.search(text)
if search is not None:
    print("Link found! -> " + search.group(0))
else:
    print("No links were found.")

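To the "end user" the link shows up only as "Google", with no www and certainly no http(s). The source, however, contains the HTML that wraps it, so by inspecting the raw body you can find all the links in the message.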
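
To make that concrete, here is a minimal sketch that walks a message and collects href values from its text/html parts. It assumes Python 3.6 or later; the function name `links_in_html_parts`, the pattern, and the test message are illustrative and not taken from the question. The regular expression is a simplification and not a full HTML parser.

```python
import re
from email.message import EmailMessage

# Matches an href value in single or double quotes inside an <a ...> tag.
HREF_PATTERN = re.compile(r"""<a\b[^>]*\bhref\s*=\s*(["'])(.*?)\1""",
                          re.IGNORECASE | re.DOTALL)


def links_in_html_parts(msg):
    """Collect href values from every text/html part of msg (hypothetical helper)."""
    links = []
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            html = part.get_content()  # decoded str for text parts
            links.extend(m.group(2) for m in HREF_PATTERN.finditer(html))
    return links


# Made-up message: the plain-text alternative shows only the word "Google",
# while the HTML alternative carries the actual URL.
msg = EmailMessage()
msg.set_content("Google")
msg.add_alternative('<p>Visit <a href="http://google.com">Google</a>.</p>',
                    subtype="html")

print(links_in_html_parts(msg))  # ['http://google.com']
```

For a message fetched from a server, `get_content()` is only available if the bytes are parsed with the modern policy, for example `email.message_from_bytes(raw_email, policy=email.policy.default)`.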

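My code is not perfect, but I hope it gives you a general direction. You can look for several patterns in your email body, for images, videos, and so on. To learn regular expressions you will need to do a bit of studying; the Wikipedia article on regular expressions is one place to start.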

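【Comments】:

Thanks for the reply. I am not sure this works, unless I am not fully understanding you? Are you saying to import re, and then replace my original string with the one you provided? I still get the same answer as before: no link is found, even with a .com embedded.
No, sorry if I put it badly. You would use Python's regular expression module (re) to check your string and see whether it finds a match for a pattern that you create. I gave you one example pattern, but you can create endless patterns; just look up "regular expressions" on the Internet and try to learn the basics (that is enough for what you need). Good luck!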
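
As one example of "a pattern that you create", the sketch below looks for URL-like strings in plain text, including bare domains such as google.com. The pattern and the name `find_url_like` are hypothetical and deliberately loose: they will also match file names like report.pdf and the domain part of an email address, so treat this as a starting point and not a validator.

```python
import re

# Hypothetical pattern: optional scheme, optional "www.", a dotted host name
# ending in an alphabetic suffix of two or more letters, then an optional path.
URL_LIKE = re.compile(
    r"\b(?:https?://)?(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}\b(?:/\S*)?",
    re.IGNORECASE,
)


def find_url_like(text):
    """Return every URL-like substring found in text (hypothetical helper)."""
    return [match.group(0) for match in URL_LIKE.finditer(text)]


sample = "See google.com or https://www.example.co.uk/page for details."
print(find_url_like(sample))
# ['google.com', 'https://www.example.co.uk/page']
```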