【Posted】: 2020-05-28 02:58:39
【Question】:
I am building a OneHotEncoder from the complete file.
import logging

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def buildOneHotEncoder(training_file_name, categoricalCols):
    # sparse=False returns a dense array (renamed to sparse_output in scikit-learn 1.2)
    one_hot_encoder = OneHotEncoder(sparse=False)
    df = pd.read_csv(training_file_name, skiprows=0, header=0)
    df = df[categoricalCols]
    df = removeNaN(df, categoricalCols)
    logging.info(str(df.columns))
    one_hot_encoder.fit(df)
    return one_hot_encoder

def removeNaN(df, categoricalCols):
    # Replace any NaN values with CONSTANT_FILLER (defined elsewhere, not shown)
    for col in categoricalCols:
        df[[col]] = df[[col]].fillna(value=CONSTANT_FILLER)
    return df
Now I use the same encoder while processing the same file in chunks:
for chunk in pd.read_csv(training_file_name, chunksize=CHUNKSIZE):
    ....
    INPUT = chunk[categoricalCols]
    INPUT = removeNaN(INPUT, categoricalCols)
    one_hot_encoded = one_hot_encoder.transform(INPUT)
    ....
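It gives me the error "ValueError: Found unknown categories ['missing'] in column 2 during transform".
I cannot process the whole file at once, because the memory is needed during the training iterations in order to use all cores.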
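To narrow down which values trigger this error, a small diagnostic can compare each column of a chunk against the fitted encoder's `categories_` attribute. This is only a sketch: `find_unknown_categories` is a hypothetical helper, and the two toy frames stand in for the real file.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def find_unknown_categories(encoder, frame):
    """Return {column name: values in frame that the fitted encoder never saw}."""
    unknown = {}
    # categories_ holds one array of known values per fitted column, in column order.
    for col, known in zip(frame.columns, encoder.categories_):
        unseen = set(frame[col].unique()) - set(known)
        if unseen:
            unknown[col] = sorted(unseen, key=str)
    return unknown


# Toy data standing in for the real file (hypothetical values).
fit_df = pd.DataFrame({"color": ["red", "blue"], "size": ["S", "M"]})
chunk_df = pd.DataFrame({"color": ["red", "missing"], "size": ["M", "S"]})

encoder = OneHotEncoder().fit(fit_df)
report = find_unknown_categories(encoder, chunk_df)
print(report)
```

Running this on each chunk right before `transform` would show whether the filler value is really absent from what the encoder saw for that column during `fit`.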
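Given the memory constraint, one possible workaround is to gather the categories chunk by chunk and pass them to `OneHotEncoder` through its `categories` parameter, adding the filler to every column so it can never be unknown. The sketch below makes several assumptions: `build_encoder_in_chunks` and `FILLER` are hypothetical names, `FILLER` is guessed from the error message to be `"missing"`, and every categorical column is read as a string.

```python
import io

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

FILLER = "missing"  # assumed value of CONSTANT_FILLER, guessed from the error message


def build_encoder_in_chunks(csv_source, categorical_cols, chunksize):
    """Fit a OneHotEncoder from per-column categories gathered chunk by chunk."""
    # Seed every column with the filler so it is always a known category.
    seen = {col: {FILLER} for col in categorical_cols}
    reader = pd.read_csv(csv_source, usecols=categorical_cols, dtype=str,
                         chunksize=chunksize)
    for chunk in reader:
        for col in categorical_cols:
            seen[col].update(chunk[col].fillna(FILLER).unique())
    categories = [sorted(seen[col]) for col in categorical_cols]
    encoder = OneHotEncoder(categories=categories)
    # fit() must still be called once; with explicit categories, a single
    # row in the right column order is enough.
    template = pd.DataFrame([[FILLER] * len(categorical_cols)],
                            columns=categorical_cols)
    return encoder.fit(template)


# Toy CSV standing in for the training file (hypothetical data).
csv_text = "a,b,c\nx,1,\ny,,u\nx,2,v\n"
encoder = build_encoder_in_chunks(io.StringIO(csv_text), ["a", "c"], chunksize=2)

chunk = pd.DataFrame({"a": ["y", FILLER], "c": [FILLER, "v"]})
encoded = encoder.transform(chunk).toarray()  # transform returns a sparse matrix by default
```

The trade-off is that every column gets a filler indicator even if it has no missing values, which adds one output feature per such column.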
【Discussion】:
Tags: scikit-learn categorical-data one-hot-encoding feature-engineering