【Question title】: Formatting the output with the html2text library
【Posted】: 2016-11-10 15:16:46
【Question】:

I need to retrieve HTML table data (rows and columns) from an API and pass it on to other teams.

import requests
import html2text
#from bs4 import BeautifulSoup

headers = {
    'Authorization': 'Bearer hmy0w2ltszfxeysnq8cbjzfcyr4kzfk5k9a0vfca.t',
    'Content-Type': 'application/json',
}
data = '{}'
# Fetch the document from the API and decode the JSON response.
response = requests.get('https://sandbox.jiveon.com/api/core/v3/contents/436669', headers=headers, data=data)
data = response.json()
# The document body is returned as an HTML string that contains the table.
print(data['content']['text'])

To convert it to text:

# Named text_maker rather than format to avoid shadowing the built-in format().
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True
text_maker.bypass_tables = False
#text_maker.ignore_tables = True
text_maker.wrap_links = True
text_maker.ignore_images = True
text_maker.ignore_emphasis = True
print(text_maker.handle(data['content']['text']))

The output of the code snippets above is:

<body><!-- [DocumentBodyStart:756f88b6-eed4-4030-ada9-f74dc8e4418b] --><div class="jive-rendered-content"><p>DB Release&#160;</p><p style="min-height: 8pt; padding: 0px;">&#160;</p><div class="j-rte-table"><table class="j-table jiveBorder" style="border: 1px solid #c6c6c6;" width="100%"><thead><tr style="background-color: #efefef;"><th style="width: 11%;">Release Version</th><th style="width: 10%;">REFDB_ID</th><th style="width: 160%;">SVN URL</th></tr></thead><tbody><tr><td style="width: 11%;">3.7.3</td><td style="width: 10%;"><p style="background-color: #ffffff; border: 0px; padding: 0px;">3710002</p><p style="background-color: #ffffff; border: 0px; padding: 0px;">3710003 <br/>3710005 <br/>3710007 <br/>3710009<br/>3710011</p></td><td style="width: 160%;"><p style="background-color: #ffffff; border: 0px; padding: 0px;"><a class="jive-link-external-small" href="http://svnurl.com" rel="nofollow">http://svnurl1.com&#160;</a></p><p style="background-color: #ffffff; border: 0px; padding: 0px;"><a class="jive-link-external-small" href="http://svnurl2.com" rel="nofollow">http://svnurl2.com</a></p></td></tr></tbody></table></div></div><!-- [DocumentBodyEnd:756f88b6-eed4-4030-ada9-f74dc8e4418b] --></body>

DB Release

Release Version| REFDB_ID| SVN URL
---|---|---
3.7.3|

3710002

3710003
3710005
3710007
3710009
3710011

|

http://svnurl1.com

http://svnurl2.com

whereas my expected output is

【Question discussion】:

    Tags: python beautifulsoup html-parsing


    【Solution 1】:

    I arrived at a solution that filters the table data based on a command-line argument.

    import sys

    import requests
    from bs4 import BeautifulSoup

    # Release version to look for. Assumption: it is passed as the first
    # command-line argument; otherwise fall back to the original hard-coded value.
    version = sys.argv[1] if len(sys.argv) > 1 else '3.7.4'

    headers = {
        'Authorization': 'Bearer hmy0w2ltszfxeysnq8cbjzfcyr4kzfk5k9a0vfca.t',
        'Content-Type': 'application/json',
    }
    data = '{}'
    response = requests.get('https://sandbox.jiveon.com/api/core/v3/contents/436669', headers=headers, data=data)
    data = response.json()
    html_doc = data['content']['text']
    soup = BeautifulSoup(html_doc, 'html.parser')

    # Keep the cells of the last table row that mentions the requested version.
    cells = []
    for row in soup.find_all('tr'):
        if version in str(row):
            cells = row.find_all('td')

    finaldata = []
    for cell in cells:
        text = cell.get_text()
        if text.startswith('http:'):
            # A cell may hold several URLs run together; split on '.com'
            # and restore the suffix on every non-empty piece.
            for part in text.split('.com'):
                if part:
                    finaldata.append(part + '.com')
        else:
            finaldata.append(text)

    for val in finaldata:
        print(val)
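
The row filtering described above can also be tried offline. The following is only a minimal sketch, not the method used in the answer: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages, and it uses a trimmed copy of the HTML from the question in place of the API response. The names `TableRows`, `rows_for_version` and `SAMPLE` are made up for this illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect table rows; each cell is a list of its text fragments."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(self._cell)
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text separated by <p>, <a> or <br/> arrives as separate chunks,
        # so a cell with several values yields several fragments.
        text = data.strip()
        if self._cell is not None and text:
            self._cell.append(text)


# Trimmed copy of the HTML shown in the question (styles and classes removed).
SAMPLE = (
    '<table><thead><tr><th>Release Version</th><th>REFDB_ID</th>'
    '<th>SVN URL</th></tr></thead><tbody><tr><td>3.7.3</td>'
    '<td><p>3710002</p><p>3710003 <br/>3710005 <br/>3710007 <br/>'
    '3710009<br/>3710011</p></td>'
    '<td><p><a href="http://svnurl.com">http://svnurl1.com&#160;</a></p>'
    '<p><a href="http://svnurl2.com">http://svnurl2.com</a></p></td>'
    '</tr></tbody></table>'
)


def rows_for_version(html, version):
    """Return the rows whose first cell is exactly the given version."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [row for row in parser.rows if row and row[0] == [version]]


for row in rows_for_version(SAMPLE, '3.7.3'):
    # One line per row: cells separated by ' | ', values within a cell by ', '.
    print(' | '.join(', '.join(cell) for cell in row))
```

Because every text fragment is kept separately, the URLs do not need to be split on '.com' afterwards, which would only work for hosts ending in that suffix. The sketch ignores nested tables and matches the version against the first cell only.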
    

    【Discussion】:
