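Author: 大树
Updated: 01.14
Email: 59888745@qq.com
Data processing, machine learning
A case study of the Alibaba Tianchi 大航杯“智造扬中”电力AI大赛 (Dahang Cup "Zhizao Yangzhong" Power AI Competition)
In this post I work through the Dahang Cup "Zhizao Yangzhong" Power AI Competition case, following the usual industry workflow step by step:
- Business scenario definition: defining the core objective and describing the key scenarios.
- Business rule review: extracting the business rules and analyzing how they interact.
- Quantitative data analysis: multi-dimensional analysis and handling of anomalies.
- Model design: tailoring the model to the application scenario and tuning its parameters.
- Computation and result analysis: producing the model output and validating it against the business.
For an introduction to the competition, see: https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100066.333.2.mnbu1L&raceId=231602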
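1. Business scenario definition
a. For an introduction to the competition, see the URL above.
b. The business requirement is to analyze nearly two years of historical electricity consumption by enterprises in the high-tech zone of Yangzhong, Zhenjiang, Jiangsu,
and to use that history to predict, as accurately as possible, the consumption on each day of a coming month (October, for example).
c. Key scenarios related to consumption: this is a high-tech industrial development zone with a commuting workforce, so working days, rest days, public holidays,
and the season (summer or winter) all matter.
2. Business rule review
a. This is a typical regression problem, very similar to our traffic-forecasting work,
so let us see how to make such a prediction in a data-driven way.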
3. Quantitative data analysis
3.1 Load the data and take a first look
import numpy as np
import pandas as pd
_df = pd.read_csv("tianchi_powerdata/zhenjiang_power.csv")
_df.head()
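3.2 Data cleaning (outliers, missing values, nulls, duplicates, date formats, and so on)
pandas offers several ways to handle NA values; which one fits depends on the business case:
dropna(): for example, dropna(axis=0, how='all') drops rows in which every value is NA, and dropna(thresh=3) keeps rows with at least 3 non-NA values (recent pandas versions reject passing how and thresh together),
fillna(0) fills NA with a constant, and fillna(d.mean()) fills with the column mean,
isnull() marks NA values,
notnull() marks non-NA values,
drop_duplicates() removes duplicate rows, for example _df.drop_duplicates(['user_id', 'record_date'])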
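To make the difference between these calls concrete, here is a minimal sketch on a made-up frame. The values and the names `demo`, `no_empty_rows`, `deduped` and `filled` are invented for illustration; only the column names follow the competition file.

```python
import numpy as np
import pandas as pd

# Made-up readings: one fully empty row, one missing value,
# and one duplicated (user_id, record_date) pair
demo = pd.DataFrame({
    "record_date": ["2015-01-01", "2015-01-01", "2015-01-02", None, "2015-01-01"],
    "user_id": [1, 2, 1, np.nan, 1],
    "power_consumption": [100.0, np.nan, 120.0, np.nan, 100.0],
})

# Each call returns a new frame; demo itself is left unchanged
no_empty_rows = demo.dropna(axis=0, how="all")   # drops only the all-NA row
deduped = no_empty_rows.drop_duplicates(["user_id", "record_date"])
filled = deduped.fillna({"power_consumption": deduped["power_consumption"].mean()})

print(len(demo), len(no_empty_rows), len(deduped))
print(filled["power_consumption"].tolist())
```

Because none of these methods modifies the frame in place by default, the result has to be assigned to a name for the cleaning to take effect.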
import numpy as np
import pandas as pd
_df = pd.read_csv("tianchi_powerdata/zhenjiang_power.csv")
# dropna and drop_duplicates return new frames, so assign the results back
_df = _df.dropna(axis=0, how='all')
_df = _df.drop_duplicates(['user_id', 'record_date'])
# Parse the date column so that date attributes can be extracted later
_df['record_date'] = pd.to_datetime(_df['record_date'])
_df.head()
Building strong time-related features
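# The target is one total per day, so sum the per-user readings by date
train_df = _df.groupby('record_date', as_index=False)['power_consumption'].sum()
# Placeholder rows for every day from 2016-10-01 to 2016-10-31 (the month to predict)
test_df = pd.DataFrame({'record_date': pd.date_range('2016-10-01', periods=31, freq='D')})
test_df['power_consumption'] = 0.0
# Stack the history and the placeholders so that features are built once for both
total_df = pd.concat([train_df, test_df], ignore_index=True)
total_df.tail()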
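Since the submission needs a single figure per day, the per-user readings can be summed by date before the placeholder rows are appended. The following is a small sketch of that idea on invented numbers; the frame names `readings`, `daily`, `future` and `combined` are hypothetical.

```python
import pandas as pd

# Made-up per-user readings for two days (values are illustrative only)
readings = pd.DataFrame({
    "record_date": pd.to_datetime(["2016-09-29", "2016-09-29", "2016-09-30", "2016-09-30"]),
    "user_id": [1, 2, 1, 2],
    "power_consumption": [10.0, 20.0, 30.0, 40.0],
})

# One row per day: the sum over all users
daily = readings.groupby("record_date", as_index=False)["power_consumption"].sum()

# Placeholder rows for the days to predict, appended below the history
future = pd.DataFrame({"record_date": pd.date_range("2016-10-01", periods=3, freq="D")})
future["power_consumption"] = 0.0
combined = pd.concat([daily, future], ignore_index=True)
print(combined)
```

Passing `ignore_index=True` gives the combined frame a fresh 0..n-1 index, which avoids duplicate index labels after stacking.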
# Time-related features
total_df['day_of_week'] = total_df['record_date'].dt.dayofweek  # Monday=0 ... Sunday=6
total_df['day_of_month'] = total_df['record_date'].dt.day
total_df['day_of_year'] = total_df['record_date'].dt.dayofyear
total_df['month_of_year'] = total_df['record_date'].dt.month
total_df['year'] = total_df['record_date'].dt.year
# Working day or weekend: consumption on Saturdays and Sundays clearly differs from working days
total_df['holiday'] = 0
total_df['holiday_sat'] = 0
total_df['holiday_sun'] = 0
# Weekend flags
total_df.loc[total_df.day_of_week == 5, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 5, 'holiday_sat'] = 1
total_df.loc[total_df.day_of_week == 6, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 6, 'holiday_sun'] = 1
# Week of the month, 1 to 4 (days 29 to 31 are counted in week 4)
def week_of_month(day):
    if day in range(1, 8): return 1
    if day in range(8, 15): return 2
    if day in range(15, 22): return 3
    if day in range(22, 32): return 4
total_df['week_of_month'] = total_df['day_of_month'].apply(week_of_month)
total_df.head()
# Early, middle or late part of the month: some enterprises schedule work by these ten-day periods,
# which may also affect consumption
def period_of_month(day):
    if day in range(1, 11): return 1
    if day in range(11, 21): return 2
    if day in range(21, 32): return 3
total_df['period_of_month'] = total_df['day_of_month'].apply(period_of_month)
total_df.head()
# First or second half of the month
def period2_of_month(day):
    if day in range(1, 16): return 1
    if day in range(16, 32): return 2
total_df['period2_of_month'] = total_df['day_of_month'].apply(period2_of_month)
total_df.head()
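The three day-of-month buckets can also be written with integer arithmetic instead of chains of range checks. This is only an alternative sketch; `half_of_month` is a name introduced here for the first-half/second-half flag, and the boundaries are the ones used in the functions above.

```python
def week_of_month(day):
    """Week 1..4 of the month; days 29-31 are folded into week 4."""
    return min((day - 1) // 7 + 1, 4)


def period_of_month(day):
    """1 = days 1-10, 2 = days 11-20, 3 = days 21-31."""
    return min((day - 1) // 10 + 1, 3)


def half_of_month(day):
    """1 = days 1-15, 2 = days 16-31."""
    return 1 if day <= 15 else 2


# Print the buckets at each boundary day
for d in (1, 7, 8, 15, 16, 21, 22, 31):
    print(d, week_of_month(d), period_of_month(d), half_of_month(d))
```

The `min(...)` caps are what keep days 29 to 31 in week 4 and day 31 in period 3.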
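# Public-holiday flag, filled in by hand from the calendar. Public holidays have a very large effect on consumption:
# most enterprises close, so consumption drops sharply.
# Note: only the October 2016 National Day holiday is listed, so no training row is flagged;
# add the holidays that fall inside the training period for the model to learn from this feature.
l_festival = ['2016-10-01', '2016-10-02', '2016-10-03', '2016-10-04', '2016-10-05', '2016-10-06', '2016-10-07']
def day_of_festival(day):
    # day is a Timestamp; compare its YYYY-MM-DD text against the list
    return 1 if day.strftime('%Y-%m-%d') in l_festival else 0
total_df['festival_pc'] = 0
# Apply to record_date (the date column), not to the festival column itself
total_df['festival'] = total_df['record_date'].apply(day_of_festival)
total_df.head(20)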
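A holiday flag like this depends on comparing dates in a consistent representation. The sketch below, on a small invented date range, formats each timestamp as `YYYY-MM-DD` text and then tests membership in a hand-written list; the names `festival_days` and `days` are hypothetical.

```python
import pandas as pd

# National Day holiday 2016, written out by hand
festival_days = ["2016-10-01", "2016-10-02", "2016-10-03", "2016-10-04",
                 "2016-10-05", "2016-10-06", "2016-10-07"]

# A short range that straddles the holiday
days = pd.DataFrame({"record_date": pd.date_range("2016-09-29", "2016-10-09", freq="D")})

# Format each timestamp as text, then test membership in the list
as_text = days["record_date"].dt.strftime("%Y-%m-%d")
days["festival"] = as_text.isin(festival_days).astype(int)
print(days)
```

Keeping both sides of the comparison as strings (or both as timestamps) is the point of the sketch; mixing the two is an easy way to end up with a column of zeros.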
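# Feature columns available so far:
# date
# power consumption
# day of week
# day of month
# day of year
# month of year
# year
# weekend flags
# week of month
# period of month (early, middle or late)
# first or second half of month
# public-holiday flag
col_names = total_df.columns.values
col_names
# Confirm that the data has no missing values
counts = {}
for name in col_names:
    count = total_df[name].isnull().sum()
    counts[name] = [count]
is_null_filds = pd.DataFrame(counts)
is_null_filds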
# 4. Model design
Covers tailoring the model to the application scenario and setting its parameters.
Separating the training and test sets
We split the data by date into a training set and a test set for the modeling that follows.
## Everything outside October 2016 is the training set
is_test = (total_df.year == 2016) & (total_df.month_of_year == 10)
train_X = total_df[~is_test]
test_X = total_df[is_test]
train_y = train_X.power_consumption
train_X = train_X.drop(['power_consumption', 'record_date', 'year'], axis=1)
test_X = test_X.drop(['power_consumption', 'record_date', 'year'], axis=1)
train_X.head()
train_X.shape
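# 5. Modeling and tuning: use grid search with cross-validation to find the best parameters
# DecisionTree
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
# A float max_features is a fraction of the features; 1.0 means all of them
# (the integer 1 would mean a single feature per split)
param_grid = {'max_features': [0.7, 0.8, 0.9, 1.0],
              'max_depth': [3, 5, 7, 9, 12]
             }
dt = DecisionTreeRegressor()
grid = GridSearchCV(dt, param_grid=param_grid, cv=5, n_jobs=8, refit=True)
grid.fit(train_X, train_y)
best_dt_reg = grid.best_estimator_
print(best_dt_reg)
# R^2 on the training set shows how closely the model fits data it has already seen;
# it is not an estimate of accuracy on unseen days
print(best_dt_reg.score(train_X, train_y))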
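The score on the training set tends to be optimistic, so it can help to look at the cross-validated score that `GridSearchCV` records as `best_score_`. The sketch below uses synthetic data (a weekend dip plus noise), not the competition data, and all variable names are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: lower consumption at weekends plus random noise (illustrative only)
rng = np.random.RandomState(0)
day_of_week = rng.randint(0, 7, size=300)
day_of_month = rng.randint(1, 32, size=300)
X = np.column_stack([day_of_week, day_of_month])
y = 1000.0 - 300.0 * (day_of_week >= 5) + rng.normal(0.0, 50.0, size=300)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [3, 5, 7, 9, 12]},
    cv=5,
)
grid.fit(X, y)

train_r2 = grid.best_estimator_.score(X, y)  # fit quality on data the model has seen
cv_r2 = grid.best_score_                     # mean R^2 over the 5 held-out folds
print(grid.best_params_, round(train_r2, 3), round(cv_r2, 3))
```

For time-ordered data such as daily consumption, a time-aware splitter may be a better choice than plain k-fold; that is a design decision this sketch does not make.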
# Predict
from datetime import datetime
# Convert a date to the YYYYMMDD format required for submission
def dataprocess(t):
    t = str(t)[0:10]
    time = datetime.strptime(t, '%Y-%m-%d')
    res = time.strftime('%Y%m%d')
    return res
# The 31 days of October 2016
commit_df = pd.DataFrame({'predict_date': pd.date_range('2016-10-01', periods=31, freq='D')})
# Give the test frame exactly the training columns, in the same order, with no missing values
test_X = test_X[train_X.columns].fillna(0)
# Predict with the model
prediction = best_dt_reg.predict(test_X)
commit_df['predict_power_consumption'] = prediction.astype('int')
commit_df['predict_date'] = commit_df['predict_date'].apply(dataprocess)
commit_df.head()
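The date conversion for the submission file comes down to one `strftime` call. As a small standard-library sketch (the names `start`, `october` and `predict_dates` are invented here), the 31 submission dates can be produced like this:

```python
from datetime import date, timedelta

start = date(2016, 10, 1)
# 31 consecutive days starting on 1 October 2016
october = [start + timedelta(days=i) for i in range(31)]
# strftime formats a date object directly; no string slicing or re-parsing is needed
predict_dates = [d.strftime("%Y%m%d") for d in october]
print(predict_dates[0], predict_dates[-1], len(predict_dates))
```

On a pandas datetime column the equivalent is the `.dt.strftime('%Y%m%d')` accessor.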
RandomForest: an ensemble of trees
# RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
### Number of trees, depth of each tree (usually no more than 10), and the share of features used when building a tree (how many depends on the data), plus sampling
param_grid = {
    'n_estimators': [5, 8, 10, 15, 20, 50, 100, 200],
    'max_depth': [3, 5, 7, 9],
    'max_features': [0.6, 0.7, 0.8, 0.9],
}
rf = RandomForestRegressor()
grid = GridSearchCV(rf, param_grid=param_grid, cv=3, n_jobs=8, refit=True)
grid.fit(train_X, train_y)
breg = grid.best_estimator_
print(breg)
print(breg.score(train_X, train_y))
Predicting with the model
from datetime import datetime
# Convert a date to the YYYYMMDD format required for submission
def dataprocess(t):
    t = str(t)[0:10]
    time = datetime.strptime(t, '%Y-%m-%d')
    res = time.strftime('%Y%m%d')
    return res
# Give the test frame exactly the training columns, in the same order, with no missing values
test_X = test_X[train_X.columns].fillna(0)
commit_df = pd.DataFrame({'predict_date': pd.date_range('2016-10-01', periods=31, freq='D')})
# Predict with the model
prediction = breg.predict(test_X)
commit_df['predict_power_consumption'] = prediction.astype('int')
commit_df['predict_date'] = commit_df['predict_date'].apply(dataprocess)
commit_df.head()
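Summary:
This consumption-forecasting example shows that analyzing the business data and extracting features before modeling matter a great deal, because they directly determine how accurate the predictions can be. Good feature extraction is only possible with a thorough and accurate understanding of the business scenario; combined with a suitable model, it is what produces good results.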
Complete code
import numpy as np
import pandas as pd
_df = pd.read_csv("tianchi_powerdata/zhenjiang_power.csv")
_df['record_date'] = pd.to_datetime(_df['record_date'])
_df.head()
# Sum the per-user readings into one total per day
train_df = _df[['record_date', 'power_consumption']].groupby(by='record_date').agg('sum')
train_df = train_df.reset_index()
train_df.head()
# Placeholder rows for every day from 2016-10-01 to 2016-10-31
test_df = pd.DataFrame({'record_date': pd.date_range('2016-10-01', periods=31, freq='D')})
test_df['power_consumption'] = 0.0
test_df
# Concatenate the daily totals (not the raw per-user rows) with the placeholder rows
total_df = pd.concat([train_df, test_df], ignore_index=True)
total_df.tail()
# Time-related features
total_df['day_of_week'] = total_df['record_date'].dt.dayofweek  # Monday=0 ... Sunday=6
total_df['day_of_month'] = total_df['record_date'].dt.day
total_df['day_of_year'] = total_df['record_date'].dt.dayofyear
total_df['month_of_year'] = total_df['record_date'].dt.month
total_df['year'] = total_df['record_date'].dt.year
# Working day or weekend: consumption on Saturdays and Sundays clearly differs from working days
total_df['holiday'] = 0
total_df['holiday_sat'] = 0
total_df['holiday_sun'] = 0
# Weekend flags
total_df.loc[total_df.day_of_week == 5, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 5, 'holiday_sat'] = 1
total_df.loc[total_df.day_of_week == 6, 'holiday'] = 1
total_df.loc[total_df.day_of_week == 6, 'holiday_sun'] = 1
# Week of the month, 1 to 4 (days 29 to 31 are counted in week 4)
def week_of_month(day):
    if day in range(1, 8): return 1
    if day in range(8, 15): return 2
    if day in range(15, 22): return 3
    if day in range(22, 32): return 4
total_df['week_of_month'] = total_df['day_of_month'].apply(week_of_month)
total_df.head()
# Early, middle or late part of the month
def period_of_month(day):
    if day in range(1, 11): return 1
    if day in range(11, 21): return 2
    if day in range(21, 32): return 3
total_df['period_of_month'] = total_df['day_of_month'].apply(period_of_month)
total_df.head()
# First or second half of the month
def period2_of_month(day):
    if day in range(1, 16): return 1
    if day in range(16, 32): return 2
total_df['period2_of_month'] = total_df['day_of_month'].apply(period2_of_month)
total_df.head()
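# Public-holiday flag, filled in by hand from the calendar. Public holidays have a very large effect on consumption:
# most enterprises close, so consumption drops sharply.
# Note: only the October 2016 National Day holiday is listed, so no training row is flagged;
# add the holidays that fall inside the training period for the model to learn from this feature.
l_festival = ['2016-10-01', '2016-10-02', '2016-10-03', '2016-10-04', '2016-10-05', '2016-10-06', '2016-10-07']
def day_of_festival(day):
    # day is a Timestamp; compare its YYYY-MM-DD text against the list
    return 1 if day.strftime('%Y-%m-%d') in l_festival else 0
total_df['festival_pc'] = 0
# Apply to record_date (the date column), not to the festival column itself
total_df['festival'] = total_df['record_date'].apply(day_of_festival)
total_df.head(20)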
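# Feature columns available so far:
# date
# power consumption
# day of week
# day of month
# day of year
# month of year
# year
# weekend flags
# week of month
# period of month (early, middle or late)
# first or second half of month
# public-holiday flag
col_names = total_df.columns.values
col_names
# Confirm that the data has no missing values
counts = {}
for name in col_names:
    count = total_df[name].isnull().sum()
    counts[name] = [count]
is_null_filds = pd.DataFrame(counts)
is_null_filds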
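# One-hot encoding: categorical features often get special treatment during feature engineering.
# For the day-of-week feature, start from a zero vector of length 7, [0,0,0,0,0,0,0], and set one position:
# Monday becomes [1,0,0,0,0,0,0]
# Tuesday becomes [0,1,0,0,0,0,0]
# Wednesday becomes [0,0,1,0,0,0,0]
# Thursday becomes [0,0,0,1,0,0,0]
# and so on...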
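The encoding described in these comments is not applied anywhere in the code, so here is a minimal standard-library sketch of it. The function name `one_hot_weekday` is invented for the illustration; on a DataFrame column, `pd.get_dummies` is a common way to get the same result.

```python
from datetime import date


def one_hot_weekday(d):
    """Return a length-7 vector with a 1 at the weekday position (Monday = index 0)."""
    vec = [0] * 7
    vec[d.weekday()] = 1
    return vec


print(one_hot_weekday(date(2016, 10, 3)))  # 3 October 2016 was a Monday
print(one_hot_weekday(date(2016, 10, 4)))  # the following Tuesday
```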
# Tree-based modeling: tree models are among the most widely used machine learning algorithms in industry.
# Training learns the best decision paths from the training set, and the leaf node at the end of each path holds the predicted value.
# 1. Separate the training and test sets
## Everything outside October 2016 is the training set
is_test = (total_df.year == 2016) & (total_df.month_of_year == 10)
train_X = total_df[~is_test]
test_X = total_df[is_test]
train_y = train_X.power_consumption
train_X = train_X.drop(['power_consumption', 'record_date', 'year'], axis=1)
test_X = test_X.drop(['power_consumption', 'record_date', 'year'], axis=1)
train_X.head()
# Modeling and tuning: use grid search with cross-validation to find the best parameters
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
# A float max_features is a fraction of the features; 1.0 means all of them
param_grid = {'max_features': [0.7, 0.8, 0.9, 1.0],
              'max_depth': [3, 5, 7, 9, 12]
             }
dt = DecisionTreeRegressor()
grid = GridSearchCV(dt, param_grid=param_grid, cv=5, n_jobs=8, refit=True)
grid.fit(train_X, train_y)
best_dt_reg = grid.best_estimator_
print(best_dt_reg)
print(best_dt_reg.score(train_X, train_y))
from datetime import datetime
# Convert a date to the YYYYMMDD format required for submission
def dataprocess(t):
    t = str(t)[0:10]
    time = datetime.strptime(t, '%Y-%m-%d')
    res = time.strftime('%Y%m%d')
    return res
# The 31 days of October 2016
commit_df = pd.DataFrame({'predict_date': pd.date_range('2016-10-01', periods=31, freq='D')})
# Give the test frame exactly the training columns, in the same order, with no missing values
test_X = test_X[train_X.columns].fillna(0)
# Predict with the model
prediction = best_dt_reg.predict(test_X)
commit_df['predict_power_consumption'] = prediction.astype('int')
commit_df['predict_date'] = commit_df['predict_date'].apply(dataprocess)
commit_df.head()
%matplotlib inline
import matplotlib.pyplot as plt
print("Feature ranking:")
# Take the names from the training frame so that they always match the fitted model
feature_names = np.array(train_X.columns)
feature_importances = best_dt_reg.feature_importances_
# Indices of the features, most important first
indices = np.argsort(feature_importances)[::-1]
for f in indices:
    print("feature %s (%f)" % (feature_names[f], feature_importances[f]))
plt.figure(figsize=(20, 8))
plt.title("Feature importances")
plt.bar(range(len(feature_importances)), feature_importances[indices],
        color="b", align="center")
plt.xticks(range(len(feature_importances)), feature_names[indices])
plt.xlim([-1, train_X.shape[1]])
plt.show()
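Remarks:
Model design, in outline:
load data
cross-validation
classer (the chosen estimator)
model = classer.fit(x, y)
predict = model.transform(x)  # Spark ML style; with sklearn this step is model.predict(x)
predict.filter()
predict.count()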
sklearn:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
Regression problems predict a continuous value, as in the consumption forecast above. Candidate regressors:
DecisionTreeRegressor()
RandomForestRegressor()
xgb.XGBRegressor()
Each can be tuned with GridSearchCV(model, param_grid, n_jobs=8).
# Decision tree
param_grid = {'max_features': [0.7, 0.8, 0.9, 1.0], 'max_depth': [3, 5, 7, 9, 12]}
dt = DecisionTreeRegressor()
grid = GridSearchCV(dt, param_grid=param_grid, cv=5, n_jobs=8, refit=True)
grid.fit(train_X, train_y)
best_dt_reg = grid.best_estimator_
best_dt_reg.predict(test_X)
# Random forest (the same grid is reused here)
rf = RandomForestRegressor()
grid = GridSearchCV(rf, param_grid=param_grid, cv=3, n_jobs=8, refit=True)
grid.fit(train_X, train_y)
best_rf_reg = grid.best_estimator_
best_rf_reg.score(train_X, train_y)
best_rf_reg.predict(test_X)
# Regression with the xgboost regressor
param_grid = {
    'max_depth': [3, 4, 5, 7, 9],
    'n_estimators': [20, 40, 50, 80, 100, 200, 400, 800, 1000, 1200],
    'learning_rate': [0.05, 0.1, 0.2, 0.3],
    'subsample': [0.8, 1],
    'colsample_bylevel': [0.8, 1],
}
xgb_model = xgb.XGBRegressor()
# Fit on the training data
rgs = GridSearchCV(xgb_model, param_grid, n_jobs=8)
rgs.fit(train_X, train_y)
print(rgs.best_score_)
print(rgs.best_params_)
rgs.predict(test_X)
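Before launching a grid like the xgboost one, it is worth counting how many models it implies. The following standard-library sketch multiplies out the grid; `cv_folds = 5` is an assumption made for the illustration and should be replaced by whatever value is passed as `cv`.

```python
from itertools import product

param_grid = {
    "max_depth": [3, 4, 5, 7, 9],
    "n_estimators": [20, 40, 50, 80, 100, 200, 400, 800, 1000, 1200],
    "learning_rate": [0.05, 0.1, 0.2, 0.3],
    "subsample": [0.8, 1],
    "colsample_bylevel": [0.8, 1],
}

# Every combination of values is one candidate model
candidates = list(product(*param_grid.values()))
cv_folds = 5  # assumed number of folds; use the value passed as cv=
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)
```

With 800 candidates and several folds each, the search runs into thousands of fits, some of them with more than a thousand trees, so trimming the grid or using a randomized search may be worth considering.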
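LogisticRegression (logistic regression): used for classification, primarily binary; multi-class problems can be handled with the one-vs-rest approach. An advantage is that every output comes with a probability for the corresponding class.
GaussianNB (naive Bayes): performs well on multi-class classification problems.
kNN (k-nearest neighbors): often used as one part of a more complex classification algorithm, with its estimate serving as a feature of the sample.
DecisionTree (classification and regression trees, CART): suitable for multi-class classification.
SVM (support vector machine): used for classification problems.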