How to extract text that is not inside any HTML tag using BeautifulSoup? (Answers)

[Question title]: How to extract texts that are not in any html tags using beautifulsoup?
[Posted]: 2019-06-30 12:46:16
[Question description]:

The email string:

can i buy a laptop<br><br>-- <br>
<div dir="ltr">
    <div>
        <div dir="ltr">
            <div>
                <div dir="ltr">
                    <div>
                        <div dir="ltr">
                            <div dir="ltr">
                                <p style="color:rgb(0,0,0);font-family:times;font-size:medium">
                                    Some important Text/ Email Signature 
                                </p>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
</div><br>

Required output:

{
   body: "can i buy a laptop",
   Signature: "Some important Text/ Email Signature"
}

Another problem is that the email text is dynamic. It can also look like this:

<div dir="ltr">Can i buy a phone?<br clear="all">
    <div><br>-- <br>
        <div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">
            <div dir="ltr"><span>
                    <div dir="ltr"><span style="color:rgb(136,136,136)"></span>
                        <div>
                            <div dir="ltr">
                                <div dir="ltr">
                                    <div dir="ltr">
                                    <div> Some Important Divs</div>
                                    </div>
                                </div>
                            </div>
                        </div>
                    </div>
                </span></div>
        </div>
    </div>
</div>

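So the body cannot be located by the dir="ltr" divs alone. So far, I have been extracting the first part from the first dir="ltr" div and the signature from the gmail_signature class.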

    import re
    from bs4 import BeautifulSoup, NavigableString

    # emailText is assumed to hold the raw HTML of the email
    allText = ""
    replies = ""
    soup = BeautifulSoup(emailText, 'html.parser')
    mainbody = soup.find('div', {'dir': 'ltr'})
    if mainbody is not None:
        # Text nodes that are direct children of the first dir="ltr" div
        texts = [t for t in mainbody.contents if isinstance(t, NavigableString)]
        print('Mainbody: ', mainbody)
        print('Texts: ', texts)
        # One text node per line
        allText = "\n".join(texts)
    quotes = soup.find('div', {'class': 'gmail_quote'})
    if quotes is not None:
        for div in quotes:
            replies += " " + div.text
            # replies = replies.replace("\n", "")
            replies = replies.replace("\r", "")
            replies = re.sub(' +', ' ', replies)

[Question comments]:

Tags: python python-3.x beautifulsoup html-parsing

[Solution 1]:

Try this. For the second example:

from bs4 import BeautifulSoup

data = {}
html_page = """<div dir="ltr">Can i buy a phone?<br clear="all">
    <div><br>-- <br>
        <div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">
            <div dir="ltr"><span>
                    <div dir="ltr"><span style="color:rgb(136,136,136)"></span>
                        <div>
                            <div dir="ltr">
                                <div dir="ltr">
                                    <div dir="ltr">
                                    <div> Some Important Divs</div>
                                    </div>
                                </div>
                            </div>
                        </div>
                    </div>
                </span></div>
        </div>
    </div>
</div>"""
soup = BeautifulSoup(html_page, 'html.parser')
# Collect every text node in document order (string=True replaces the older text=True)
text = soup.find_all(string=True)

output = ''
blacklist = [
    #'[document]',
    #'noscript',
    #'header',
    'html',
    #'meta',
    #'head',
    #'input',
    #'script',
    # there may be more elements you don't want, such as "style", etc.
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
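# Split on the signature delimiter "--"; fall back to "Best Regards " when it is absent
flat = output.replace("\n", "")
delimiter = "--" if "--" in flat else "Best Regards "
res = flat.split(delimiter, 1)

data["body"] = res[0]
# Avoid an IndexError when the email contains neither delimiter
data["signature"] = res[1].strip() if len(res) > 1 else ""
print(data)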

Output:

{'body': 'Can i buy a phone?  ', 'signature': 'Some Important Divs'}

For the first example:

from bs4 import BeautifulSoup

data = {}
html_page = """can i buy a laptop<br><br>-- <br>
<div dir="ltr">
    <div>
        <div dir="ltr">
            <div>
                <div dir="ltr">
                    <div>
                        <div dir="ltr">
                            <div dir="ltr">
                                <p style="color:rgb(0,0,0);font-family:times;font-size:medium">
                                    Some important Text/ Email Signature 
                                </p>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
</div><br>"""
soup = BeautifulSoup(html_page, 'html.parser')
# Collect every text node in document order (string=True replaces the older text=True)
text = soup.find_all(string=True)

output = ''
blacklist = [
    #'[document]',
    #'noscript',
    #'header',
    'html',
    #'meta',
    #'head',
    #'input',
    #'script',
    # there may be more elements you don't want, such as "style", etc.
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
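# Split on the signature delimiter "--"; fall back to "Best Regards " when it is absent
flat = output.replace("\n", "")
delimiter = "--" if "--" in flat else "Best Regards "
res = flat.split(delimiter, 1)

data["body"] = res[0]
# Avoid an IndexError when the email contains neither delimiter
data["signature"] = res[1].strip() if len(res) > 1 else ""
print(data)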

Output:

{'body': 'can i buy a laptop ', 'signature': 'Some important Text/ Email Signature'}
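
In the first pattern the body is not wrapped in any tag, so after parsing it appears as a text node directly under the document root. A minimal sketch of picking out only those top-level text nodes is shown here; the helper name `top_level_text` and the shortened sample string are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString


def top_level_text(html):
    """Return the non-empty text nodes that sit directly under the document root.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [
        node.strip()
        for node in soup.contents
        # type(...) is NavigableString skips subclasses such as Comment
        if type(node) is NavigableString and node.strip()
    ]


# Shortened version of the first email from the question
sample = (
    'can i buy a laptop<br><br>-- <br>\n'
    '<div dir="ltr"><p>Some important Text/ Email Signature </p></div><br>'
)
print(top_level_text(sample))  # prints ['can i buy a laptop', '--']
```

Text nested inside any element, such as the signature paragraph, is left out, so this only helps with emails shaped like the first example.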

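[Comments]:

Thanks for your answer, but unfortunately it does not print the expected result for the first example. You have provided a solution for the second example, but I already have a solution for the second case I posted here. The problem is with the first type.
@Anurag This is the output for the first example: can i buy a laptop -- Some important Text/ Email Signature. Can you tell me what your expected result is?
My expected result is { body: "can i buy a laptop", Signature: "Some important Text/Email Signature" }
@Anurag Can you tell me how you are building this dictionary? I can create it by splitting the data and taking the first part as the body and the other as the signature, but it will cause problems if any other data comes back.
I am creating an API for extracting the body and signature from emails. The problem is that the data is not uniform. I have solved some of the patterns, and this is one of the patterns I am struggling with. But we can assume that they will have a common signature marker, such as -- or Best Regards, etc.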
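
The assumption in the last comment, that a signature starts at a marker such as `--` or `Best Regards`, could be sketched roughly as follows. It walks the text nodes in document order with `stripped_strings` and splits at the first node that matches a marker. The function name `split_body_signature`, the marker pattern, and the shortened sample strings are assumptions made for illustration; real emails would likely need a longer marker list.

```python
import re
from bs4 import BeautifulSoup

# Assumed marker list: a text node that is exactly "--" or "Best Regards"
SIGNATURE_MARKER = re.compile(r"^(?:--|best regards[,.!]?)$", re.IGNORECASE)


def split_body_signature(html):
    """Split an HTML email into body and signature at the first marker node."""
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings yields every non-blank text node, already stripped
    strings = list(soup.stripped_strings)
    for i, s in enumerate(strings):
        if SIGNATURE_MARKER.match(s):
            return {
                "body": " ".join(strings[:i]),
                "signature": " ".join(strings[i + 1:]),
            }
    # No marker found: treat everything as body
    return {"body": " ".join(strings), "signature": ""}


# Shortened versions of the two emails from the question
first = (
    'can i buy a laptop<br><br>-- <br>\n'
    '<div dir="ltr"><p>Some important Text/ Email Signature </p></div><br>'
)
second = (
    '<div dir="ltr">Can i buy a phone?<br clear="all"><div><br>-- <br>'
    '<div dir="ltr" class="gmail_signature"><div> Some Important Divs</div></div>'
    '</div></div>'
)
print(split_body_signature(first))
print(split_body_signature(second))
```

Because the split works on text nodes rather than on tag structure, the same function covers both layouts, whether the body sits at the top level or inside a dir="ltr" div.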