如上面的 cmets 所述,没有可能的名称列表会导致问题。然而,也许并不完美,但提供一些尝试......
给定一个数据框示例,例如...
Name
0 sofíagomez
1 isabelladelgado
2 luisvazquez
3 juanhernandez
4 valentinatrujillo
5 camilagutierrez
6 joséramos
7 carlossantana
代码(Python):
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(str(lastname)+'$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(str(lastname)+'$', expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
输出:
Name Firstname Lastname
0 sofíagomez sofía gomez
1 isabelladelgado isabella delgado
2 luisvazquez luis vazquez
3 juanhernandez juan hernandez
4 valentinatrujillo valentina trujillo
5 camilagutierrez camila gutierrez
6 joséramos josé ramos
7 carlossantana carlos santana
可能需要进一步清理,但可能会拆分大部分名称。