Deploying a machine learning model with one-hot encoded features: answers

【Question title】: Deploy machine learning model with one hot encoded features
【Posted】: 2019-03-07 17:52:28
【Question】:

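I have trained an xgboost classifier with categorical features, which I one-hot encoded beforehand. For example, I have a categorical feature "Year" that takes values between 2014 and 2018. When one-hot encoded (OHEd), it yields 5 binary features: Year_2014, Year_2015, Year_2016, Year_2017, Year_2018. What happens if I predict on a sample with Year=2019, given that the Year_2019 feature does not exist?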

More generally, what is a robust way to transform data in order to make predictions on new samples?

【Comments on the question】:

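Why don't you actually try it and report here any problems you run into? Questions like your "more generally" part are arguably off-topic for SO, which is about actual coding problems...
The predict function will fail. As for the second part of the question, there is no straightforward answer, but you will find good discussions on SO and other SE sites. Here is one: stackoverflow.com/questions/51505295/….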

Tags: machine-learning deployment production one-hot-encoding

【Answer 1】:

A binary feature is evaluated as follows:

if(year != ${year value}){
  // Enter "left" branch
} else {
  // Enter "right" branch
}

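An unseen category level is therefore sent to the "left" branch: for Year=2019 every Year_* indicator is 0, so the sample never matches a specific year value.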
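To make that rule concrete, here is a minimal pure-Python sketch. It is a toy illustration, not xgboost's internals: `KNOWN_YEARS`, `one_hot_year` and `route` are hypothetical names introduced here, and the only assumption is the rule stated above (indicator 0 goes "left", indicator 1 goes "right").

```python
# Toy model of a split on a one-hot indicator; not xgboost's actual code.
KNOWN_YEARS = [2014, 2015, 2016, 2017, 2018]  # levels seen during training


def one_hot_year(year):
    """Return the indicator columns for one sample; an unseen year gives all zeros."""
    return {f"Year_{y}": int(year == y) for y in KNOWN_YEARS}


def route(sample, feature):
    """Mimic a single split on a binary indicator: 1 goes right, 0 goes left."""
    return "right" if sample[feature] == 1 else "left"


seen = one_hot_year(2016)    # a level that existed in training
unseen = one_hot_year(2019)  # a level that did not

routes_seen = {name: route(seen, name) for name in seen}
routes_unseen = {name: route(unseen, name) for name in unseen}

print(routes_seen)
print(routes_unseen)
```

In this sketch a 2019 sample is indistinguishable from "none of the known years" at every split, which may or may not be the behaviour you want in production.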

【Comments】:

【Answer 2】:

import pandas as pd

# During training, suppose year takes the values below
df = pd.DataFrame([2014, 2015, 2016, 2017, 2018], columns=['year'])
data = pd.get_dummies(df, columns=['year'])
data.head()

# At prediction time, suppose the input for year is 2018
known_categories = ['2014', '2015', '2016', '2017', '2018']
year_type = pd.Series(['2018'])
# Fixing the categories keeps the dummy columns the same on every call;
# a value outside known_categories (e.g. '2019') becomes NaN and encodes as all zeros
year_type = pd.Categorical(year_type, categories=known_categories)
pd.get_dummies(year_type)
# The column names do not matter; only the values fed to the model matter
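A related way to get the same effect for a whole DataFrame is to remember the training columns and reindex the prediction-time dummies against them. This is only a sketch under stated assumptions: pandas is installed, and `train_columns` and `encode_for_prediction` are hypothetical names that do not come from the answer above.

```python
import pandas as pd

# Training time: encode and remember the exact column layout the model saw.
train = pd.DataFrame({"year": [2014, 2015, 2016, 2017, 2018]})
train_encoded = pd.get_dummies(train, columns=["year"])
train_columns = list(train_encoded.columns)  # persist this alongside the model


def encode_for_prediction(new_rows):
    """One-hot encode new rows and align them to the training columns.

    Columns for unseen levels (e.g. year_2019) are dropped, and training
    columns missing from the new data are added and filled with 0.
    """
    encoded = pd.get_dummies(new_rows, columns=["year"])
    return encoded.reindex(columns=train_columns, fill_value=0).astype(int)


new = pd.DataFrame({"year": [2018, 2019]})
aligned = encode_for_prediction(new)
print(aligned)
```

With this alignment the 2019 row reaches the model as all zeros instead of causing a column mismatch, so logging or flagging unseen levels separately may still be worthwhile.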

【Comments】:

The community encourages adding explanations alongside code, rather than purely code-based answers (see here)