pandas read_html ignores <br> and concatenates strings

【Question Title】: pandas read_html ignores <br> and concatenates strings
【Posted】: 2020-11-22 12:52:45
【Question Description】:

I am trying to scrape the tables from https://en.wikipedia.org/wiki/List_of_English_monarchs as follows:

 import pandas as pd
 url = "https://en.wikipedia.org/wiki/List_of_English_monarchs"
 spacer = lambda s: s.replace('\xa0', ' ').replace('[q]', ' ').replace('\u2009',' ')

 dfs = pd.read_html(url,attrs={"class":'wikitable'},converters={'Name':spacer,
                                                               'Birth':spacer,
                                                               'Marriages':spacer,
                                                               'Death':spacer})

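It works well, except that wherever the cell text contains a <br>, no space seems to be added. For example, the first item in the first column, "Name", comes out as:

'Edward the Elder26 October 899 – 17 July 924(24 years, 266 days)'
where it should be
'Edward the Elder 26 October 899 – 17 July 924 (24 years, 266 days)'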

The ultimate goal is to be able to extract the dates from that column.
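
To illustrate that goal, here is a minimal sketch of pulling the dates out of one cell with a regular expression. It is not part of the original question: the `DATE_RE` pattern and the `extract_dates` helper are hypothetical names, and the pattern assumes dates are written as "day Month year", as in the example string above.

```python
import re

# Assumed date format: "26 October 899" (day, capitalized month name, year).
DATE_RE = re.compile(r"\b(\d{1,2}) ([A-Z][a-z]+) (\d{3,4})\b")

def extract_dates(cell):
    """Return (day, month_name, year) tuples found in a cell's text."""
    return [(int(d), m, int(y)) for d, m, y in DATE_RE.findall(cell)]

spaced = "Edward the Elder 26 October 899 – 17 July 924 (24 years, 266 days)"
joined = "Edward the Elder26 October 899 – 17 July 924(24 years, 266 days)"

print(extract_dates(spaced))  # both dates are found
print(extract_dates(joined))  # the first date is fused to the name and is missed
```

With this particular pattern, the missing space after "Elder" removes the word boundary in front of "26", which is why the spacing matters for the extraction step.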

【Question Comments】:

Tags: pandas web-scraping

【Solution 1】:

Perhaps something like this:

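import pandas as pd
import requests
url = "https://en.wikipedia.org/wiki/List_of_English_monarchs"
kings = requests.get(url)
# Replace <br /> with a space before parsing so the cell text stays separated
df = pd.read_html(kings.text.replace('<br />', ' '))
# Using the first column as an example
print(df[0]['Name'])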

Output:

0                                               Alfred the Great (King of Wessex from 871) c. 886 – 26 October 899
1                                               Edward the Elder 26 October 899 – 17 July 924 (24 years, 266 days)
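
The literal `'<br />'` replacement above only catches that exact spelling of the tag. As a hedged variation, the sketch below uses a regular expression that also covers `<br>`, `<br/>`, and uppercase forms. The `BR_RE` pattern and the `br_to_space` helper are hypothetical names introduced for illustration, and the sample cell is made up to mirror the table entry discussed in the question.

```python
import re

# Matches <br>, <br/>, <br /> and their uppercase variants.
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)

def br_to_space(html):
    """Replace every <br> tag in an HTML string with a single space."""
    return BR_RE.sub(" ", html)

# Made-up sample cell, mixing two spellings of the tag.
sample = (
    "<td>Edward the Elder<br>26 October 899 – 17 July 924"
    "<br />(24 years, 266 days)</td>"
)
print(br_to_space(sample))
```

The cleaned string can then be handed to `pd.read_html`, for example as `pd.read_html(io.StringIO(br_to_space(kings.text)))`; wrapping it in `io.StringIO` avoids the deprecation warning that recent pandas versions raise for literal HTML strings.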

【Discussion】: