【Question title】: How to convert html list to text list? [duplicate]
【Posted】: 2021-07-27 00:11:27
【Question description】:

Suppose you have the following list of HTML strings:

['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']

I want to query this list so that the output becomes:

https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf
https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf

That way I can visit these URLs and download the files from the links one by one.

So I wrote the following regex and code:

import re
r = re.compile(r'((?<=uploaden:\s).+)')
newlist = list(filter(r.match, mylist))
print(newlist)

However, this returns nothing (I assume because the list contains HTML):

[]

When I change the regex to .*, everything matches. How is that possible?

So my question is: how do I create a new list of strings from the HTML?

【Question comments】:

    Tags: python html regex pandas list


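    【Solution 1】:

    You can use the regex (?<=Factuur uploaden: <br>)[^\<]* to extract the required substring from each string.

    • (?<=Factuur uploaden: <br>): positive lookbehind; the match must be immediately preceded by Factuur uploaden: <br>
    • [^\<]*: any character other than <, repeated zero or more times

    Demo: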

    import re
    
    # Named `strings` rather than `list` to avoid shadowing the built-in.
    strings = ['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']
    
    for s in strings:
        print(re.findall(r'(?<=Factuur uploaden: <br>)[^\<]*', s))
    

    Output:

    ['https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf']
    ['https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf']
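The question also asks why the original attempt printed [] while .* matched everything. A likely explanation: r.match only tries to match at the very start of each string, where a lookbehind can never succeed, so filter() discards every item; .* does match at position 0, so every item passes. The sketch below illustrates this and collects the URLs into one flat list with search instead. The two strings are shortened, made-up stand-ins (example.com), not the asker's data.

```python
import re

# Shortened, made-up stand-ins for the strings in the question.
mylist = [
    'Welcome: <br>Factuur uploaden: <br>https://example.com/a.pdf<br><br>Date: May 4, 2021<br>',
    'Welcome: <br>Factuur uploaden: <br>https://example.com/b.pdf<br><br>Date: May 4, 2021<br>',
]

pattern = re.compile(r'(?<=Factuur uploaden: <br>)[^<]*')

# match() anchors at position 0, where the lookbehind cannot succeed.
print([bool(pattern.match(s)) for s in mylist])  # [False, False]

# search() scans the whole string; keep the first hit from each item.
newlist = [m.group(0) for m in map(pattern.search, mylist) if m]
print(newlist)
```

Items without a match are skipped here rather than raising an error; whether that is the right behavior depends on the data.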
    

    【Comments】:

      【Solution 2】:

      (?<=prefix): matches the regex only if it is preceded by prefix

      (?=suffix): matches the regex only if it is followed by suffix

      import re
      
      s = ['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']
      
      
      match = re.search(r'(?<=<br>Factuur uploaden: <br>)(.*)(?=<br><br><br>)', s[0])
      print(match.group(1))
      # https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf
      

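      To do this for every item in the list, you can define each prefix and suffix in a dictionary:

      ldict = {'item1': ['prefix1', 'suffix1'], 'item2': ['prefix2', 'suffix2'], 'item3': ['prefix3', 'suffix3']}
      

      An example (note that I added '?' to the regex, so that .*? stops at the first suffix instead of the last):

      Another, more Pythonic way: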

      import re
      
      s = ['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']
      
      regex_expr = r'(?<={0})(.*?)(?={1})'
      
      ldict = {'item1': ['<br>Factuur uploaden: <br>', '<br><br><br>'], 'item2': ['<br>Email: ', '<br>']}
      
      def func(m):
          return m.group(1)
      result = [list(map(func, [re.search(regex_expr.format(v[0], v[1]), e) for v in ldict.values()])) for e in s]
      
      
      print(result)
      # [['https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf', 'maxdenhil.com'], 
      # ['https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf', 'maxdeil.com']]
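One caveat with formatting labels straight into the pattern: a label such as KvK-nummer (8-cijfers): contains regex metacharacters, and re.search returns None when a field is absent, which makes m.group(1) raise. A minimal sketch of a more defensive variant follows. It uses a plain prefix + capture + suffix pattern rather than lookarounds; extract_fields, the field names, and the sample string are hypothetical.

```python
import re

def extract_fields(text, fields):
    """Return {name: value or None} for each (prefix, suffix) pair in fields."""
    result = {}
    for name, (prefix, suffix) in fields.items():
        # re.escape keeps characters such as '(' or '.' in a label literal.
        pattern = re.escape(prefix) + r'(.*?)' + re.escape(suffix)
        m = re.search(pattern, text)
        result[name] = m.group(1) if m else None
    return result

# Hypothetical field names; the labels follow the ones shown in the question.
fields = {
    'url': ('<br>Factuur uploaden: <br>', '<br>'),
    'kvk': ('<br>KvK-nummer (8-cijfers): ', '<br>'),
    'missing': ('<br>Telefoon: ', '<br>'),
}

# Shortened, made-up stand-in for one string from the question.
sample = ('Welcome: <br>Email: a@example.com<br>KvK-nummer (8-cijfers): 88888888'
          '<br>Factuur uploaden: <br>https://example.com/a.pdf<br><br><br>---<br>')

print(extract_fields(sample, fields))
```

Returning None for absent fields lets the caller decide whether to skip or report the item.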
      

      【Comments】:

      • And store it in a new list
      【Solution 3】:
      import re
      
      s = ['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']
      
      url_list = []
      
      # Compile the pattern once, outside the loop.
      pattern = re.compile(r'(?<=<br>Factuur uploaden: <br>)(.*)(?=<br><br><br>)')
      
      for data in s:
          url = pattern.findall(data)[0]
          url_list.append(url)
      
      print(url_list)
      

      The output is:

      ['https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf', 'https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf']
      

      I think this is what you need.
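Since the stated goal is to download the files behind these URLs, here is a rough sketch of that step using only the standard library. The helper names (filename_from_url, download_all) and the opener parameter are hypothetical; error handling, timeouts, and any authentication the site may require are left out.

```python
import os
import posixpath
import shutil
import urllib.request
from urllib.parse import urlsplit

def filename_from_url(url):
    """Return the last path segment of the URL, e.g. 'a.pdf'."""
    return posixpath.basename(urlsplit(url).path)

def download_all(urls, dest_dir, opener=urllib.request.urlopen):
    """Save each URL into dest_dir and return the local paths written."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        target = os.path.join(dest_dir, filename_from_url(url))
        # opener must return a readable, context-managed object, as urlopen does.
        with opener(url) as response, open(target, 'wb') as out:
            shutil.copyfileobj(response, out)
        saved.append(target)
    return saved
```

Calling download_all(url_list, 'invoices') would fetch every extracted URL into a local invoices folder (the folder name is only an example). The opener parameter exists so the function can be exercised without network access.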

      【Comments】:

        【Solution 4】:

        You can try this code (install bs4 first; pprint is part of the standard library)

        from pprint import pprint
        from bs4 import BeautifulSoup
        
        text = """your html goes here"""
        
        def convert(element):
            return [{li.a['href']: convert(li)}
                    for ul in element('ul', recursive=False)
                    for li in ul('li', recursive=False)]
        
        
        soup = BeautifulSoup(text, 'html.parser')
        data = convert(soup)
        pprint(data)
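# Note that convert() walks nested <ul>/<li> elements with links. The strings shown in the question appear to contain only <br> separators, so a plain-string sketch may be closer to the literal request of turning each HTML string into a list of text lines. The helper names and the sample string below are hypothetical.

```python
def html_to_lines(html):
    """Split on <br>, strip whitespace, and drop empty fragments."""
    return [part.strip() for part in html.split('<br>') if part.strip()]

def line_after_label(lines, label='Factuur uploaden:'):
    """Return the line following the first line equal to label, else None."""
    for current, following in zip(lines, lines[1:]):
        if current == label:
            return following
    return None

# Shortened, made-up stand-in for one string from the question.
sample = ('Welcome: <br>Email: a@example.com<br>Factuur uploaden: <br>'
          'https://example.com/a.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>\r\n\r\n')

lines = html_to_lines(sample)
print(lines)
print(line_after_label(lines))
```

# This assumes the separator is always written exactly as <br>; variants such as <br/> would need a parser or a regex split.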
        

        【Comments】:
