【Title】: Beautiful Soup Can't Redact Phone Number with Parentheses
【Posted】: 2020-06-02 15:11:45
【Question】:

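I am trying to redact phone numbers from an HTML file. I can identify all of the phone numbers easily, but I can't figure out why the ones that contain parentheses are not being replaced. Example below: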

import re
from bs4 import BeautifulSoup

text = '''<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted@gmail.com</li>
<li>Phones: (555) 555-5555</li>
</ul><b>Category:</b> <ul><li>Title 2    </li><li>Fake Info</li></ul>

 City, MO 11111 | (555) 111-1111 | myemail@gmail.com

 Some Category / Some Name: 555-222-2222 | Record Number#: 

 </html>'''

soup = BeautifulSoup(text, 'html.parser')

def find_phone_numbers(text):
    phones = re.findall(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})", text)
    return phones

phones = find_phone_numbers(str(soup))

print(phones)

for i in phones:
    target = soup.find_all(text=re.compile(i, re.I))
    try:
        for v in target:
            v.replace_with(v.replace(i,'(XXX) XXX-XXXX'))
    except TypeError:
        pass;

print(soup)

These are the results when I run the above:

['(555) 555-5555', '(555) 111-1111', '555-222-2222']
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted@gmail.com</li>
<li>Phones: (555) 555-5555</li>
</ul><b>Category:</b> <ul><li>Title 2    </li><li>Fake Info</li></ul>

 City, MO 11111 | (555) 111-1111 | myemail@gmail.com


 Some Category / Some Name: (XXX) XXX-XXXX | Record Number#: 

 </div></body></html>

【Comments】:

    Tags: python html beautifulsoup phone-number redaction


    【Solution 1】:

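    You can use .find_all(string=True) (spelled text=True in older Beautiful Soup versions) to get every text node from the HTML soup, then run re.sub on each one. This keeps all tags intact, including <li>:

    for content in soup.find_all(string=True):
        # Mask anything shaped like a phone number, with or without parentheses.
        s = re.sub(r'(\(?\d{3}\)?)([\s.-]*)(\d{3})([\s.-]*)(\d{4})', '(XXX) XXX-XXXX', content)
        # Swap the text node for its redacted copy; the parent tag is untouched.
        content.replace_with(s)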
    
    print(soup)
    

    Prints:

    <html>
    <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
    <title>Big Title</title>
    <style type="text/css">
    .parsed {font-size: 75%; color: #474747;}
    </style>
    </head>
    <body>
    <div class="parsed">
    <h1>Redacted Redacted</h1>
    <h2> Contact Info</h2>
    <ul>
    <li>Position Title: My Fake Title</li>
    <li>Email: Redacted@gmail.com</li>
    <li>Phones: (XXX) XXX-XXXX</li>
    </ul><b>Category:</b> <ul><li>Title 2    </li><li>Fake Info</li></ul>
    
     City, MO 11111 | (XXX) XXX-XXXX | myemail@gmail.com
    
     Some Category / Some Name: (XXX) XXX-XXXX | Record Number#:
    
     </div></body></html>
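
A likely reason the loop in the question skipped the parenthesized numbers: each phone string is passed to re.compile unchanged, so "(" and ")" are read as a capture group rather than as literal characters. The sketch below uses only the standard re module and sample strings taken from the question to illustrate this; the variable names are illustrative only.

```python
import re

phone = "(555) 555-5555"          # a match returned by find_phone_numbers
line = "Phones: (555) 555-5555"   # the text node that should contain it

# Unescaped: the parentheses form a group, so the pattern looks for
# "555 555-5555" with no literal parentheses, which is not in the line.
unescaped_match = re.search(phone, line)

# Escaped: re.escape turns every regex metacharacter into a literal.
escaped_match = re.search(re.escape(phone), line)

print(unescaped_match)           # None
print(escaped_match.group(0))    # (555) 555-5555
```

Under that reading, changing the lookup in the question to soup.find_all(text=re.compile(re.escape(i))) should let the original loop find the parenthesized numbers too. The number 555-222-2222 worked without escaping because it contains no regex metacharacters.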
    

    【Comments】:

      【Solution 2】:

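      A slightly different approach: get all li tags, then for each one replace the phone number with your mask, if a phone number is present. I used a temporary variable (temp_text) for this, just to make the code more readable.

      all_li = soup.find_all('li')
      
      for li in all_li:
          # Mask any phone number found in this tag's text.
          temp_text = re.sub(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})", '(XXX) XXX-XXXX', li.text)
          # Only touch the tag when something was redacted. Assigning to
          # .string keeps the <li> tag itself; li.replace_with(temp_text)
          # would replace the whole tag with bare text.
          if temp_text != li.text:
              li.string = temp_text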
      

      Output of print(soup):

      【Comments】:

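      • I accepted the other suggestion because it covers the whole document, but I think this one will be useful when I'm targeting individual tags. Thanks for the help.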