Add a new column to a pandas DataFrame based on another column

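【Question title】: Add new column to Panda dataframe based on other column
【Posted】: 2019-04-04 22:35:13
【Question description】:

I am trying to add a new column to a pandas DataFrame. The new column, df['Year_Prod'], is derived from another column, df['title'], from which I extract the year.

Sample data: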

country    designation     title
Italy      Vulkà Bianco    Nicosia 2013 Vulkà Bianco (Etna)         
Portugal   Avidagos        Quinta dos Avidagos 2011 Avidagos Red (Douro)

Code:

import re

import pandas as pd

df=pd.read_csv(r'test.csv', index_col=0)

df['Year_Prod']=re.findall('\\d+', df['title'])

print(df.head(10))

I get the following error:

  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3119, in __setitem__
    self._set_item(key, value)
  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3194, in _set_item
    value = self._sanitize_column(key, value)
  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3391, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "C:\Python37\lib\site-packages\pandas\core\series.py", line 4001, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index

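Please let me know your thoughts on this. Thank you.

【Question comments】:

Do any of your titles contain more than one number?
@G.Anderson, good question. I checked beforehand: each title contains only one occurrence.

Tags: regex python-3.x pandas dataframe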

【Solution 1】:

You can use pandas str.extract

df['Year_Prod'] = df.title.str.extract(r'(\d{4})', expand=False)

    country     designation     title                                          Year_Prod
0   Italy       Vulkà Bianco    Nicosia 2013 Vulkà Bianco (Etna)                2013
1   Portugal    Avidagos        Quinta dos Avidagos 2011 Avidagos Red (Douro)   2011
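
As a supplement to the answer above, here is a minimal sketch of how str.extract behaves when a title contains no four-digit number, and how the result could be turned into integers. The third row is a hypothetical title added for illustration, and the conversion to the nullable Int64 dtype is an assumption about what the asker might want, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "title": [
        "Nicosia 2013 Vulkà Bianco (Etna)",
        "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
        "Some Non-Vintage Sparkling (Champagne)",  # hypothetical row without a year
    ]
})

# expand=False returns a Series instead of a one-column DataFrame
years = df["title"].str.extract(r"(\d{4})", expand=False)

# Rows without a match are missing; Int64 keeps them as <NA> instead of failing
df["Year_Prod"] = pd.to_numeric(years).astype("Int64")

print(df)
```

Without the conversion, the extracted years stay as strings, which is what the output table above shows.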

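Edit: As @Paul H. suggested in the comments, your code doesn't work because re.findall expects a string, but you are passing a Series. You could do it with apply, where each row's value is passed as a string, but there is little point since str.extract is more efficient.

df.title.apply(lambda x: re.findall(r'\d{4}', x)[0])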
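
To make the point about re.findall concrete, the sketch below compares three calls: one on a single string, one on a whole Series, and one through apply. The two titles are taken from the question. As far as I know, current Python versions reject the Series with a TypeError, and the example only relies on that.

```python
import re

import pandas as pd

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
])

# re.findall handles one string at a time
single = re.findall(r"\d{4}", titles.iloc[0])

# Passing the whole Series is rejected by the re module
try:
    re.findall(r"\d{4}", titles)
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# apply calls the function once per element, so each call receives a string
years = titles.apply(lambda s: re.findall(r"\d{4}", s)[0])
print(years.tolist())
```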

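【Comments】:

It may be worth explaining that re.findall expects a single string as its second argument, but the OP passed a pandas.Series instead. The OP should also be aware that standard-library functions generally do not accept pandas objects.

【Solution 2】:

pandas also has findall

df.title.str.findall(r'\d+').str[0]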
Out[239]: 
0    2013
1    2011
Name: title, dtype: object

# To add the column (suggested by pygo):
df['Year_Prod'] = df.title.str.findall(r'\d+').str[0]
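
The question comments ask whether a title can contain more than one number. The sketch below shows how the choice of pattern matters in that case. The second title is hypothetical and is not part of the asker's data.

```python
import pandas as pd

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Bin 389 Cabernet Shiraz 2015 (South Australia)",  # hypothetical: two numbers
])

# str.findall returns a list of matches per row
all_numbers = titles.str.findall(r"\d+")

# .str[0] takes the first match, which is not necessarily the year
first_any = all_numbers.str[0]

# Requiring exactly four digits skips the shorter number
first_four = titles.str.findall(r"\d{4}").str[0]

print(first_any.tolist())
print(first_four.tolist())
```

If every title really contains exactly one number, as the asker says, both patterns give the same result.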

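【Comments】:

Excellent, @WB, added to my list :) +1. However, could you add this to the answer to complete the desired output, so that people looking for this can benefit from it: df['Year_Prod']= df.title.str.findall('\d+').str[0]
@pygo Sure :-) Added
@W-B, thanks, man :-)

【Solution 3】:

You didn't specify a delimiter; for .read_csv it defaults to ','.

You can use pd.Series.apply: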

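import re
import pandas as pd

def year_finder(x):
    # Return the first run of digits found in the string
    return re.findall(r'\d+', x)[0]

# Assumes the file is delimited by '||'. A multi-character separator is
# treated as a regular expression, so the pipes are escaped and the
# Python parsing engine is requested explicitly.
df = pd.read_csv(r'test.csv', sep=r'\|\|', engine='python', index_col=0)
df['Year_Prod'] = df["title"].apply(year_finder)

print(df.head(10))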
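
The delimiter handling can be checked without a file on disk. This is a minimal sketch that assumes, as this answer does, that the fields are separated by '||'. The inline text stands in for test.csv and is built from the sample rows in the question. Because read_csv interprets a separator longer than one character as a regular expression, the literal pipes are escaped.

```python
import io

import pandas as pd

# Stand-in for test.csv, assuming '||' as the field separator
raw = (
    "country||designation||title\n"
    "Italy||Vulkà Bianco||Nicosia 2013 Vulkà Bianco (Etna)\n"
    "Portugal||Avidagos||Quinta dos Avidagos 2011 Avidagos Red (Douro)\n"
)

# Regex separators require the Python parsing engine
df = pd.read_csv(io.StringIO(raw), sep=r"\|\|", engine="python")

print(df.columns.tolist())
print(df.shape)
```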

Edit: see @Vaishali's answer for the str.extract approach.

【Comments】:

【Solution 4】:

Another way, based on the iloc method.

>>> df['Year_Prod'] = df.iloc[:,2].str.extract(r'(\d{4})', expand=False)
>>> df
    country   designation                                          title Year_Prod
0     Italy  Vulkà Bianco               Nicosia 2013 Vulkà Bianco (Etna)      2013
1  Portugal      Avidagos  Quinta dos Avidagos 2011 Avidagos Red (Douro)      2011
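
Selecting by position works only while title remains the third column. As a hedged sketch, the position can be looked up from the column name with Index.get_loc instead of being hard-coded. The frame below is rebuilt from the sample rows for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "designation": ["Vulkà Bianco", "Avidagos"],
    "title": [
        "Nicosia 2013 Vulkà Bianco (Etna)",
        "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    ],
})

# Look up the integer position of the column by its label
pos = df.columns.get_loc("title")

by_position = df.iloc[:, pos].str.extract(r"(\d{4})", expand=False)
by_label = df["title"].str.extract(r"(\d{4})", expand=False)

# Both selections refer to the same column, so the results match
print(by_position.equals(by_label))
```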

【Comments】:

【Solution 5】:

`str.translate` instead of `regex`

def f(x):
  x = ''.join([c if c.isdigit() else ' ' for c in x])
  return x.strip().split(None, 1)[0]

df.assign(Year_Prod=df.title.map(f))

    country   designation                                          title Year_Prod
0     Italy  Vulkà Bianco               Nicosia 2013 Vulkà Bianco (Etna)      2013
1  Portugal      Avidagos  Quinta dos Avidagos 2011 Avidagos Red (Douro)      2011
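
The heading of this answer mentions str.translate, while the function above builds the cleaned string with a join. For comparison, here is one possible variant that actually uses str.translate. The DigitsOnly table and the first_number helper are hypothetical names introduced for this sketch. Unlike str.isdigit, the table keeps only the ASCII digits 0 to 9.

```python
import pandas as pd


class DigitsOnly(dict):
    """Translation table built lazily: ASCII digits stay, the rest become spaces."""

    def __missing__(self, codepoint):
        # 48..57 are the code points of '0'..'9'; 32 is a space
        return codepoint if 48 <= codepoint <= 57 else 32


TABLE = DigitsOnly()


def first_number(text):
    # Replace non-digits with spaces, then split on whitespace
    parts = text.translate(TABLE).split()
    # Return None when the title contains no digits at all
    return parts[0] if parts else None


titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
])

years = titles.map(first_number)
print(years.tolist())
```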

【Comments】:
