How to print this list into a DataFrame - Python/BeautifulSoup

【Question title】: How to print this list into a DataFrame - Python/BeautifulSoup
【Posted】: 2021-02-04 20:18:35
【Question】:

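This code prints every row of the table on the website given below.

However, the output also contains the HTML tags. Essentially, I want to get all the rows into a DataFrame that I can then put into Excel.

.text doesn't work because I'm using find_all, which I need because some tag names are repeated.

How can I remove the unwanted tags and then put the list into a DataFrame that replicates the website?

Thanks.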

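import requests
from bs4 import BeautifulSoup
import pandas as pd

# Page being scraped (the same URL appears in the answer below)
url = 'https://sitc.sitcancer.org/2020/abstracts/titles/'

productlinks = []
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
content = soup.find_all('tr')  # every table row on the page
for item in content:
    title = item.find_all('td')  # list of <td> tags for this row
    print(title)  # prints the tags themselves, not just their text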
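
A note on why .text fails in the code above: find_all returns a list of tags, and .text / get_text() exists on each individual tag, not on the list. The following is only a minimal sketch of stripping the tags cell by cell and building a DataFrame. The inline HTML snippet and its column names are made up for illustration and stand in for the real page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the downloaded page (r.content in the question).
html = """
<table>
  <tr><th>#</th><th>Title</th></tr>
  <tr><td>1</td><td><a href="/a">First abstract</a></td></tr>
  <tr><td>2</td><td>Second <b>abstract</b></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:  # the header row contains <th>, not <td>
        continue
    # get_text() is called on each tag, not on the list returned by find_all;
    # the " " separator keeps words apart when a cell has nested tags
    rows.append([td.get_text(" ", strip=True) for td in cells])

headers = [th.get_text(strip=True) for th in soup.find_all("th")]
df = pd.DataFrame(rows, columns=headers)
print(df)
```

From there, df.to_csv('data.csv', index=False) would write a file that Excel can open.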

【Comments on the question】:

Would pd.read_html(), as in this answer, do the job?
I'm getting this error: raise ImportError("lxml not found, please install it") ImportError: lxml not found, please install it
Thank you, this works.

Tags: python html pandas dataframe beautifulsoup

【Solution 1】:

The simplest way is to use pandas.read_html:

import pandas as pd

url='https://sitc.sitcancer.org/2020/abstracts/titles/'
df = pd.read_html(url)[0]
print(df)
df.to_csv('data.csv', index=False)

Prints:

       #  ...                                           Keywords
0      1  ...  Adoptive immunotherapy; Monocyte/Macrophage; T...
1      2  ...  CAR T cells; Immune monitoring; Inflammation; ...
2      3  ...  Antibody; Biomarkers; Immune monitoring; T cel...
3      4  ...  Biomarkers; RNA; Solid tumors; Tumor microenvi...
4      5  ...  Antibody; B cell; Biomarkers; Immune monitorin...
..   ...  ...                                                ...
730  752  ...  Gene expression; Neoantigens; Regulatory T cel...
731  753  ...  Gene expression; Neoantigens; Regulatory T cel...
732  754  ...  Biomarkers; Chemokine; Chemotherapy; Costimula...
733  755  ...  Chemokine; Granulocyte; Myeloid cells; MDSC; T...
734  756  ...  Gene expression; Immune contexture; Immune sup...

[735 rows x 6 columns]

It also saves the table to data.csv (the original answer showed a screenshot of the file opened in LibreOffice).

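【Comments】:

I'm getting this error: raise ImportError("lxml not found, please install it") ImportError: lxml not found, please install it
@VoidS Try installing the lxml package, using python3 -m pip install lxml or a similar command.
Thanks. One last thing: could you explain the [0] after (url) on line 3?
@VoidS pd.read_html() returns a list of all the tables on the page, but in this case we want the first (and only) one, hence the [0].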
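
To illustrate the point about [0]: the sketch below feeds pd.read_html a small inline HTML string containing two tables (made up for illustration, not the real page) and shows that a list with one DataFrame per table comes back. It assumes an HTML parser backend such as lxml is installed, as discussed in the comments above.

```python
from io import StringIO
import pandas as pd

# Hypothetical page with two tables; the real page in the answer has only one.
html = """
<table>
  <tr><th>#</th><th>Title</th></tr>
  <tr><td>1</td><td>First abstract</td></tr>
</table>
<table>
  <tr><th>Keyword</th></tr>
  <tr><td>Biomarkers</td></tr>
  <tr><td>Antibody</td></tr>
</table>
"""

# read_html always returns a list, even when only one table is found
tables = pd.read_html(StringIO(html))
print(type(tables).__name__, len(tables))

# Indexing with [0] picks the first table as a DataFrame
first = tables[0]
print(first)
```

With a single table on the page, the list has one element, so [0] is still needed to get the DataFrame itself.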