合并 2 个熊猫数据框，通过网络抓取提取答案

【问题标题】：Merging 2 panda dataframes, extracted through web scraping合并 2 个熊猫数据框，通过网络抓取提取
【发布时间】：2022-11-24 18:08:20
【问题描述】：

我正在尝试建立一个关于美国大学的数据库。我一直在使用 Beautiful Soup 和 Pandas 这样做，但遇到了困难，因为每页有几个表格要废弃。为了重新合并从两个表中提取的数据，我尝试使用.merge()，但一直没有成功。

我的代码如下：

# Connecticut
url='https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Connecticut'
soup=bs(requests.get(url).text)
table = soup.find_all('table')

#Extracting a df for each table
df1 = pd.read_html(str(table))[0]
df1.rename(columns = {'Enrollment(2020)[4]': 'Enrollment', 'Founded[5]':'Founded'}, inplace = True)

df2 = pd.read_html(str(table))[1]
df2=df2.drop(['Type','Ref.'], axis=1)

df_Connecticut=df1.merge(df2, on=['School','Location','Control','Founded'])
df_Connecticut

我试过用其他状态来做，但仍然遇到同样的问题：

 Maine
url='https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Maine'
soup=bs(requests.get(url).text)
table = soup.find_all('table')

#Extracting a df for each table
df1 = pd.read_html(str(table))[0]
df1=df1.drop(['Type[a]'], axis=1)
df1.rename(columns = {'Location(s)': 'Location', 'Enrollment (2019)[b]':'Enrollment'}, inplace = True)
df1 = df1.astype({'School':'string','Location':'string','Control':'string','Enrollment':'string','Founded':'string'})

df2 = pd.read_html(str(table))[1]
df2=df2.drop(['Cite'], axis=1)
df2.rename(columns = {'Location(s)': 'Location'}, inplace = True)
df2 = df2.astype({'School':'string','Location':'string','Founded':'string','Closed':'string'})

df_Maine=df1.merge(df2,  on=['School','Location','Founded'])
df_Maine```

我是 Python 的完全初学者。

【问题讨论】：

你试过了吗连接代替合并？

标签： pandas web-scraping merge

【解决方案1】：

同意@Clegane 的观点，您可能指的是concat，因为合并表明两者中必须有可以匹配的值。

pd.concat([df1,df2], ignore_index=True)

	School	Location	Control	Carnegie Classification	Enrollment	Founded	Closed
0	Albertus Magnus College	New Haven	Private(Catholic)	Masters university	1793	1925	nan
1	Asnuntuck Community College	Enfield	Public	Associates college	2388	1969[6]	nan
...	...	...	...	...	...	...	...
50	Silvermine College of Art	New Canaan	Private	nan	nan	1951	1971
51	Wilcox College of Nursing	Middletown	Private	nan	nan	1908	1997

否则，您应该解决 how 以进行外部连接：

df_Connecticut=df1.merge(df2, on=['School','Location','Control','Founded'], how='outer')

【讨论】：

concat 没有用，因为我最终得到了 2 列学校，2 列位置......但是 how 使外部连接工作正常！非常感谢！
不确定您是否使用示例pd.concat([df1,df2], ignore_index=True)

【解决方案2】：

这是与您提到的两个州有关的完整示例。当然，您也可以在列表中包括其他州。

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
big_df = pd.DataFrame()
states = ['Maine', 'Connecticut']

for x in states:
    r = s.get(f'https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_{x}')
    soup = bs(r.text, 'html.parser')
    df = pd.read_html(str(soup))[0]
    df.columns = ['School', 'Location', 'Control', 'Type', 'Enrollment', 'Founded']
    df['State'] = x
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
print(big_df)

终端结果：

    School  Location    Control Type    Enrollment  Founded State
0   Bates College   Lewiston    Private Baccalaureate college   1964    1855    Maine
1   Beal University Bangor  Private (for-profit)    Associates college  229 1891[5] Maine
2   Bowdoin College Brunswick   Private Baccalaureate college   1969    1794    Maine
3   Central Maine Community College Auburn  Public  Associates college  3746    1963[6] Maine
4   Colby College   Waterville  Private Baccalaureate college   2150    1813[7] Maine
... ... ... ... ... ... ... ...
61  University of New Haven West Haven  Private Masters university  7587    1920    Connecticut
62  University of Saint Joseph  West Hartford   Private (Catholic)  Masters university  2975    1932    Connecticut
63  Wesleyan University Middletown  Private Baccalaureate college   3599    1831    Connecticut
64  Western Connecticut State University    Danbury Public  Masters university  6463    1903    Connecticut
65  Yale University New Haven   Private Doctoral university 14910   1701    Connecticut
66 rows × 7 columns

【讨论】：