Accessing commented HTML lines with BeautifulSoup

【Question title】: Accessing commented HTML lines with BeautifulSoup
【Posted】: 2017-07-15 23:00:02
【Question description】:

I am trying to scrape statistics from this particular web page: https://www.sports-reference.com/cfb/schools/louisville/2016/gamelog/

However, when I look at the HTML source, the table for the "Defensive Game Log" appears to be commented out (it starts with <!-- and ends with -->).

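As a result, with BeautifulSoup4 the following code only scrapes the offensive data, which is not commented out; the commented-out defensive data is skipped.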

from urllib.request import Request,urlopen
from bs4 import BeautifulSoup
import re

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link.read(), "lxml")


tables = soup.find_all(['th', 'tr'])
my_table = tables[0]
rows = my_table.findChildren(['tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print(value)

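I am curious whether there is any solution, inside or outside BeautifulSoup4, that can collect all of the defensive values into a list, the same way the offensive data is stored. Thanks!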

Note that I added the following code, which originated from here, to the solution given below:

data = []

table = defensive_log
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

【Question discussion】:

What do you mean by "commented out"?

Tags: python-3.x beautifulsoup

【Solution 1】:

The Comment object will give you what you want:

from urllib.request import Request,urlopen
from bs4 import BeautifulSoup, Comment

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link, "lxml")

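# Collect every HTML comment node in the document
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
defensive_log = None  # stays None if no comment contains a table
for comment in comments:
    # Re-parse the comment text as HTML so its tags become searchable
    comment_soup = BeautifulSoup(str(comment), 'lxml')
    defensive_log = comment_soup.find('table')  # search as an ordinary tag
    if defensive_log:
        break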
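
The same idea can be tried offline before pointing it at the live site. The following is only a minimal sketch: the sample HTML, the table ids ("offense", "defense") and the helper names find_commented_table and table_rows are hypothetical stand-ins, not taken from the real page, and the built-in "html.parser" is used here instead of "lxml" so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: one visible table and one table
# hidden inside an HTML comment (ids and values are made up).
SAMPLE_HTML = """
<div id="all_offense">
  <table id="offense"><tbody>
    <tr><th>1</th><td>2016-09-03</td><td>35</td></tr>
  </tbody></table>
</div>
<div id="all_defense">
  <!--
  <table id="defense"><tbody>
    <tr><th>1</th><td>2016-09-03</td><td>14</td></tr>
    <tr><th>2</th><td>2016-09-10</td><td>28</td></tr>
  </tbody></table>
  -->
</div>
"""


def find_commented_table(html, table_id):
    """Return the <table> with the given id found inside an HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment nodes are strings, so they are matched with string=...
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Re-parse the comment body so its markup becomes real tags
        fragment = BeautifulSoup(str(comment), "html.parser")
        table = fragment.find("table", id=table_id)
        if table is not None:
            return table
    return None


def table_rows(table):
    """Collect the text of every <td> cell, one list per row, skipping rows without <td>."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


defense = find_commented_table(SAMPLE_HTML, "defense")
data = table_rows(defense) if defense is not None else []
print(data)
```

Looking the table up by id rather than taking the first table found in any comment is a design choice for pages that hide several tables in comments; whether the real page exposes a usable id is an assumption to check against its source.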

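【Discussion】:

@Storm, any feedback? Did my solution help?
Sorry it took so long to get back to you. I have been moving and have finally returned to this project. I am running it now to try to incorporate it.
I added the following code from here. It lets me put the data into a table. I put the final code in the question above.
Sorry, I don't fully understand you. So you mean you had already built a workaround, but are now trying to reach the goal my way?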