【发布时间】:2022-11-24 18:08:20
【问题描述】:
我正在尝试建立一个关于美国大学的数据库。我一直在使用 Beautiful Soup 和 Pandas 这样做,但遇到了困难,因为每页有几个表格要废弃。为了重新合并从两个表中提取的数据,我尝试使用.merge(),但一直没有成功。
我的代码如下:
# Connecticut
url='https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Connecticut'
soup=bs(requests.get(url).text)
table = soup.find_all('table')
#Extracting a df for each table
df1 = pd.read_html(str(table))[0]
df1.rename(columns = {'Enrollment(2020)[4]': 'Enrollment', 'Founded[5]':'Founded'}, inplace = True)
df2 = pd.read_html(str(table))[1]
df2=df2.drop(['Type','Ref.'], axis=1)
df_Connecticut=df1.merge(df2, on=['School','Location','Control','Founded'])
df_Connecticut
我试过用其他状态来做,但仍然遇到同样的问题:
Maine
url='https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Maine'
soup=bs(requests.get(url).text)
table = soup.find_all('table')
#Extracting a df for each table
df1 = pd.read_html(str(table))[0]
df1=df1.drop(['Type[a]'], axis=1)
df1.rename(columns = {'Location(s)': 'Location', 'Enrollment (2019)[b]':'Enrollment'}, inplace = True)
df1 = df1.astype({'School':'string','Location':'string','Control':'string','Enrollment':'string','Founded':'string'})
df2 = pd.read_html(str(table))[1]
df2=df2.drop(['Cite'], axis=1)
df2.rename(columns = {'Location(s)': 'Location'}, inplace = True)
df2 = df2.astype({'School':'string','Location':'string','Founded':'string','Closed':'string'})
df_Maine=df1.merge(df2, on=['School','Location','Founded'])
df_Maine```
我是 Python 的完全初学者。
【问题讨论】:
-
你试过了吗连接代替合并?
标签: pandas web-scraping merge