Dask categorize() won't work after using .loc

【Question title】: Dask categorize() won't work after using .loc
【Posted】: 2018-12-27 20:42:05
【Question】:

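I'm having a serious problem with dask (dask version: 1.00, pandas version: 0.23.3). I am trying to load a dask dataframe from a CSV file, filter the results into two separate dataframes, and perform operations on both.

However, after splitting the dataframe and trying to set the category columns as "known", they remain "unknown". As a result I cannot continue with my operations, which require the category columns to be "known".

Note: as suggested, I have created a minimal example that uses pandas instead of read_csv().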

import pandas as pd
import dask.dataframe as dd

# Specify dtypes
b_dtypes = {
    'symbol': 'category',
    'price': 'float64',
}

i_dtypes = {
    'symbol': 'category',
    'price': 'object'
}

# Specify a function to quickly set dtypes
def to_dtypes(df, dtypes):
    for column, dtype in dtypes.items():
        if column in df.columns:
            df[column] = df.loc[:, column].astype(dtype)
    return df

# Set up our test data
data = [
    ['B', 'IBN', '9.9800'],
    ['B', 'PAY', '21.5000'],
    ['I', 'PAY', 'seventeen'],
    ['I', 'SPY', 'ten']
]

# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')

# Convert into dask
df = dd.from_pandas(pdf, npartitions=3)

#
## At this point 'df' simulates what I get when I read the mixed-type CSV file via dask
#

# Split the dataframe by the 'type' column
b_df = df.loc[df['type'] == 'B', :]
i_df = df.loc[df['type'] == 'I', :]

# Convert columns into our intended dtypes
b_df = to_dtypes(b_df, b_dtypes)
i_df = to_dtypes(i_df, i_dtypes)

# Let's convert our 'symbol' column to known categories
b_df = b_df.categorize(columns=['symbol'])
i_df['symbol'] = i_df['symbol'].cat.as_known()

# Is our symbol column known now?
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)

#
## print() returns 'False' for both, this makes me want to kill myself.
## (Please help...)
#

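Update: If I change the 'npartitions' argument to 1, print() returns True in both cases. So this appears to be an issue with the partitions containing different categories. However, loading both dataframes into only two partitions is not feasible, so is there a way to tell dask to do some kind of re-sorting so that the categories are consistent across partitions?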
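To picture what "partitions containing different categories" means, here is a minimal sketch in plain pandas. It assumes each partition can be thought of as an independent pandas object; the names part_a and part_b are hypothetical stand-ins for two partitions and are not part of the original example. It only illustrates the hypothesis in the update and does not claim to reproduce dask's internals.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two hypothetical "partitions" of the symbol column, converted independently
part_a = pd.Series(["IBN", "PAY"]).astype("category")
part_b = pd.Series(["PAY", "SPY"]).astype("category")

# Each piece only knows the values it has seen itself
cats_a = list(part_a.cat.categories)  # ['IBN', 'PAY']
cats_b = list(part_b.cat.categories)  # ['PAY', 'SPY']

# A single consistent category set requires looking at all of the data
merged = union_categoricals([part_a, part_b])
all_cats = list(merged.categories)  # ['IBN', 'PAY', 'SPY']

print(cats_a, cats_b, all_cats)
```

This is why making categories "known" across a whole dask dataframe generally needs a pass over every partition.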

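【Comments】:

Hi Jones, welcome to SO. I hope you find an answer to your question. It would be a good idea to read How to Ask and to produce an mcve.
Good idea. Sorry, I'm a bit new to posting on StackExchange. I've added a minimal example.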

Tags: python pandas dataframe dask

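【Answer 1】:

The answer to your question is essentially contained in the docs. I am referring to the part of the code commented with # categorize requires computation, and results in known categoricals. I'll expand on it here, because it looks to me like you are misusing loc.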

import pandas as pd
import dask.dataframe as dd

# Set up our test data
data = [['B', 'IBN', '9.9800'],
        ['B', 'PAY', '21.5000'],
        ['I', 'PAY', 'seventeen'],
        ['I', 'SPY', 'ten']
       ]

# Create pandas dataframe
pdf = pd.DataFrame(data, columns=['type', 'symbol', 'price'], dtype='object')

# Convert into dask
ddf = dd.from_pandas(pdf, npartitions=3)

# Split the dataframe by the 'type' column
# reset_index is not necessary
b_df = ddf[ddf["type"] == "B"].reset_index(drop=True)
i_df = ddf[ddf["type"] == "I"].reset_index(drop=True)

# Convert columns into our intended dtypes
b_df = b_df.categorize(columns=['symbol'])
b_df["price"] = b_df["price"].astype('float64')
i_df = i_df.categorize(columns=['symbol'])

# Is our symbol column known now? YES
print(b_df['symbol'].cat.known, flush=True)
print(i_df['symbol'].cat.known, flush=True)
