【Posted】: 2018-11-16 21:12:59
【Question】:
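I have a dataset. Using pandas, I have converted some of its columns into dummy and categorical variables. Now I would like to know how to run a multiple linear regression in Python (I am using statsmodels). Are there any caveats, or do I have to indicate somehow in my code that these variables are dummy/categorical? Or is transforming the variables enough, so that I only need to run the regression as model = sm.OLS(y, X).fit()?
My code is as follows: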
import pandas as pd

datos = pd.read_csv("datos_2.csv")  # load the raw data
df = pd.DataFrame(datos)            # working copy that receives the new columns
print(df)
I get:
Age  Gender  Wage    Job             Classification
32   Male    450000  Professor       High
28   Male    500000  Administrative  High
40   Female  20000   Professor       Low
47   Male    70000   Assistant       Medium
50   Female  345000  Professor       Medium
27   Female  156000  Assistant       Low
56   Male    432000  Administrative  Low
43   Female  100000  Administrative  Low
Then I encode them: 1 = Male, 0 = Female; and 1 = Professor, 2 = Administrative, 3 = Assistant:
df['Sex_male']=df.Gender.map({'Female':0,'Male':1})
df['Job_index']=df.Job.map({'Professor':1,'Administrative':2,'Assistant':3})
print(df)
and get this:
Age  Gender  Wage    Job             Classification  Sex_male  Job_index
32   Male    450000  Professor       High            1         1
28   Male    500000  Administrative  High            1         2
40   Female  20000   Professor       Low             0         1
47   Male    70000   Assistant       Medium          1         3
50   Female  345000  Professor       Medium          0         1
27   Female  156000  Assistant       Low             0         3
56   Male    432000  Administrative  Low             1         2
43   Female  100000  Administrative  Low             0         2
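For comparison, here is a minimal sketch of a different encoding: one 0/1 indicator column per job category instead of a single integer code. Coding Job as 1/2/3 makes a regression treat it as a number with equal steps between jobs, whereas indicator columns do not impose that ordering. The sketch assumes the same eight rows shown above, typed inline so that it runs on its own; the names `dummies` and `X` are illustrative only.

```python
import pandas as pd

# Same eight rows as in the printed table, entered by hand for a self-contained example.
df = pd.DataFrame({
    "Age": [32, 28, 40, 47, 50, 27, 56, 43],
    "Gender": ["Male", "Male", "Female", "Male",
               "Female", "Female", "Male", "Female"],
    "Wage": [450000, 500000, 20000, 70000, 345000, 156000, 432000, 100000],
    "Job": ["Professor", "Administrative", "Professor", "Assistant",
            "Professor", "Assistant", "Administrative", "Administrative"],
})

# One 0/1 column per category level. drop_first=True omits the first
# (alphabetically sorted) level of each variable, which becomes the baseline
# and avoids perfect collinearity with the intercept.
dummies = pd.get_dummies(df[["Gender", "Job"]], drop_first=True, dtype=int)

# Combine the numeric predictor with the indicator columns.
X = pd.concat([df[["Age"]], dummies], axis=1)
print(X.columns.tolist())
# ['Age', 'Gender_Male', 'Job_Assistant', 'Job_Professor']
```

With this encoding, "Female" and "Administrative" are the reference levels, so each indicator's coefficient would be read as a difference from that baseline.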
Now, if I run a multiple linear regression, for example:
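import statsmodels.api as sm

y = df['Wage']                            # response variable
X = df[['Sex_male', 'Job_index', 'Age']]  # predictors, taken from df, which holds the encoded columns
X = sm.add_constant(X)                    # add the intercept column
model1 = sm.OLS(y, X).fit()
results1 = model1.summary(alpha=0.05)     # alpha sets the confidence-interval level
print(results1)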
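The results are displayed without problems, but is this correct? Or do I have to indicate somehow that the variables are dummy or categorical? Please help, I am new to Python and I want to learn. Greetings from Chile, South America.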
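One way to tell statsmodels explicitly that a column is categorical is its formula interface, where wrapping a column name in `C(...)` marks it as categorical and the indicator columns are built automatically. The following is only a sketch under the assumption that the data are the eight rows shown above (typed inline here); the variable name `model2` is illustrative, and with so few observations the estimates themselves are not meaningful.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same eight rows as in the printed table, entered by hand for a self-contained example.
df = pd.DataFrame({
    "Age": [32, 28, 40, 47, 50, 27, 56, 43],
    "Gender": ["Male", "Male", "Female", "Male",
               "Female", "Female", "Male", "Female"],
    "Wage": [450000, 500000, 20000, 70000, 345000, 156000, 432000, 100000],
    "Job": ["Professor", "Administrative", "Professor", "Assistant",
            "Professor", "Assistant", "Administrative", "Administrative"],
})

# C(...) marks a column as categorical; the formula machinery creates the
# indicator columns and adds the intercept, so add_constant is not needed.
model2 = smf.ols("Wage ~ Age + C(Gender) + C(Job)", data=df).fit()

# One intercept, one Age slope, one Gender indicator, two Job indicators.
print(model2.params)
```

Compared with passing a hand-coded `Job_index` to `sm.OLS`, this fits a separate coefficient for each job category relative to a baseline level, rather than a single slope across the codes 1, 2, 3.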
【Comments】:
Tags: python pandas linear-regression statsmodels dummy-variable