【Question title】: Python: Getting "list index out of range" error; I know why but don't know how to resolve this
【Posted】: 2021-10-28 07:16:20
【Question】:

I'm currently working on a data science project. The idea is to clean the data in "glassdoor_jobs.csv" and present it in a more understandable way.

import pandas as pd

df = pd.read_csv('glassdoor_jobs.csv')

#salary parsing
#Removing "-1" Ratings
#Clean up "Founded"
#state field
#Parse out job description

df['hourly'] = df['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0)
df['employer_provided'] = df['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary' in x.lower() else 0)
df = df[df['Salary Estimate'] != '-1']
Salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0])
minus_Kd = Salary.apply(lambda x: x.replace('K', '').replace('$',''))

minus_hr = minus_Kd.apply(lambda x: x.lower().replace('per hour', '').replace('employer provided salary:', ''))

df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]))
df['max_salary'] = minus_hr.apply(lambda x: int(x.split('-')[1]))

I get the error on the last line. After some digging, I found that in minus_hr some "Salary Estimate" values contain a single number rather than a range:

index Salary Estimate
0 150
1 58
2 130
3 125-150
4 110-140
5 200
6 67- 77

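And so on. Now I'm trying to figure out how to get around the "list index out of range" error and make max_salary the same as min_salary for cells that hold only one value.

I'm also trying to get the average of the minimum and maximum salary; if a cell has only one value, that value should be the average.

So in the end, something like index 0 would look like: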

index min max average
0 150 150 150

【Comments】:

    Tags: python pandas dataframe spyder


    【Solution 1】:

    You have to add a conditional statement somewhere.

    df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]) if '-' in x else int(x))
    

    The above might do, or you can define a function.

    def max_salary(cell_value):
        # use the part after the '-' for a range, otherwise the single value
        if '-' in cell_value:
            result = cell_value.split('-')[1]
        else:
            result = cell_value
        return int(result)
    
    df['max_salary'] = minus_hr.apply(max_salary)
    
    
    def avg_salary(cell_value):
        # average the two ends of a range; a single value is its own average
        if '-' in cell_value:
            salaries = [int(s) for s in cell_value.split('-')]
            avg = sum(salaries) / len(salaries)
        else:
            avg = int(cell_value)
        return avg
    
    df['avg_salary'] = minus_hr.apply(avg_salary)
    

    Swap in min_salary and repeat.
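    The three per-column functions above could also be folded into one helper that parses each cell once. This is only a sketch: parse_salary and the hand-built minus_hr are hypothetical stand-ins, using the sample values from the question. It relies on str.split always returning at least one element, so the first and last parts are safe to read even when there is no '-'.

```python
import pandas as pd

def parse_salary(cell_value):
    """Return (min, max, average) for a string like '125-150' or '150'."""
    # split() always returns at least one element, so parts[0] is safe
    parts = [int(p) for p in cell_value.split('-')]
    low, high = parts[0], parts[-1]  # a single value serves as both min and max
    return low, high, (low + high) / 2

# hypothetical stand-in for minus_hr, using the sample values from the question
minus_hr = pd.Series(['150', '58', '130', '125-150', '110-140', '200', '67- 77'])

parsed = minus_hr.apply(parse_salary)
result = pd.DataFrame(parsed.tolist(), index=minus_hr.index,
                      columns=['min_salary', 'max_salary', 'avg_salary'])
print(result)
```

    Building the frame with index=minus_hr.index keeps the rows lined up with the source series if it is later assigned back to df.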

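    【Comments】:

    • So following your first example, I got the min and max. What should I do about the average? Obviously dividing by 2 isn't possible in its current state.
    • Updated. If this works for you, could you mark it as the answer? I've never answered a coding question before :)
    • So the average salary part turned out to be easier than I thought; all I had to do was: df['average_salary'] = (df.min_salary.astype(int) + df.max_salary.astype(int))/2 But thanks for your answer. The min and max were a real headache, and you helped me out!
    • Winner! I hate it when you get stuck on an idea that should be simple but you just can't figure it out. On the other hand, it's great when you do. Cheers for the upvote :D
    【Solution 2】:

    Test the length of x.split('-') before accessing its elements.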

    salaries = x.split('-')
    if len(salaries) == 1:
        # only one salary number is given, so assign the same value to min and max 
        df['min_salary'] = df['max_salary'] = minus_hr.apply(lambda x: int(salaries[0]))
    else:
        # two salary numbers are given
        df['min_salary'] = minus_hr.apply(lambda x: int(salaries[0]))
        df['max_salary'] = minus_hr.apply(lambda x: int(salaries[1]))
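
    For the length test to work, the split has to happen per cell, inside the function passed to apply, where x is defined. A minimal sketch under that assumption follows; min_max and the small hand-built minus_hr are hypothetical names and data for illustration.

```python
import pandas as pd

def min_max(x):
    """Split one cell and test the length before indexing."""
    salaries = x.split('-')
    if len(salaries) == 1:
        # only one salary number is given, so use it for both min and max
        return int(salaries[0]), int(salaries[0])
    # two salary numbers are given
    return int(salaries[0]), int(salaries[1])

# hypothetical stand-in for minus_hr
minus_hr = pd.Series(['150', '125-150', '67- 77'])
df = pd.DataFrame({'Salary Estimate': minus_hr})

pairs = minus_hr.apply(min_max)              # one (min, max) tuple per row
df['min_salary'] = [p[0] for p in pairs]
df['max_salary'] = [p[1] for p in pairs]
print(df)
```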
    

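    【Comments】:

    • Won't the undeclared variable x in salaries = x.split('-') be a problem?
    • @ciaranhaines Ah, that's true; I didn't notice the original code was inside a lambda.
    • Yes, this looks promising, but I keep running into a problem on the first line.
    【Solution 3】:

    If you want to avoid .apply()...

    Try: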

    # extract the one or two numbers from the 'Salary Estimate' column;
    # str.extract returns one row per input row and keeps df's index
    sals = df['Salary Estimate'].str.extract(r'(?P<min_salary>\d+)(?:\D+(?P<max_salary>\d+))?')
    
    # join the extracted min/max salary columns to the original dataframe;
    # max_salary is NaN wherever only one number was found
    df = df.join(sals)
    
    # fill any nan values in the 'max_salary' column with values from the 'min_salary' column
    df['max_salary'] = df['max_salary'].fillna(df['min_salary'])
    
    # set the type of the columns to int
    df['min_salary'] = df['min_salary'].astype(int)
    df['max_salary'] = df['max_salary'].astype(int)
    
    # calculate the average
    df['average_salary'] = df.loc[:,['min_salary', 'max_salary']].mean(axis=1).astype(int)
    
    # see what you've got
    print(df)
    

    Or without a regular expression:

    # split each estimate on the first '-' into two columns, keeping df's index;
    # rows without a '-' get a missing value in the second column
    sals = df['Salary Estimate'].str.split('-', n=1, expand=True)
    df['min_salary'] = sals[0]
    df['max_salary'] = sals[1]
    
    # fill any missing values in the 'max_salary' column with values from the 'min_salary' column
    df['max_salary'] = df['max_salary'].fillna(df['min_salary'])
    
    # set the type of the columns to int
    df['min_salary'] = df['min_salary'].astype(int)
    df['max_salary'] = df['max_salary'].astype(int)
    
    # calculate the average
    df['average_salary'] = df.loc[:,['min_salary', 'max_salary']].mean(axis=1).astype(int)
    
    # see what you've got
    print(df)
    

    Output:

      Salary Estimate  min_salary  max_salary  average_salary
    0             150         150         150             150
    1              58          58          58              58
    2             130         130         130             130
    3         125-150         125         150             137
    4         110-140         110         140             125
    5             200         200         200             200
    6          67- 77          67          77              72
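
    A small self-contained check of this kind of extraction, as a sketch only: it assumes a few sample values from the question plus one '-1' row (added here for illustration) that is filtered out first, as in the question's code. It uses Series.str.extract, which returns one row per input row on the original index, so the joined columns stay aligned even though the filter leaves a gap in the index.

```python
import pandas as pd

# sample values from the question, plus one '-1' row (an assumption for this check)
df = pd.DataFrame({'Salary Estimate': ['150', '-1', '58', '125-150', '67- 77']})

# dropping the '-1' row leaves a gap in the index (0, 2, 3, 4)
df = df[df['Salary Estimate'] != '-1']

# one row per input row, on the same index; max_salary is NaN for single values
sals = df['Salary Estimate'].str.extract(
    r'(?P<min_salary>\d+)(?:\D+(?P<max_salary>\d+))?')
sals['max_salary'] = sals['max_salary'].fillna(sals['min_salary'])
sals = sals.astype(int)

df = df.join(sals)
df['average_salary'] = df[['min_salary', 'max_salary']].mean(axis=1)
print(df)
```

    Here the average is left as a float, so 125-150 gives 137.5 rather than the truncated 137 shown in the table above.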
    

    【Comments】:
