Separating a dataframe into 3 new dataframes: answers

【Question title】: Separating a dataframe into 3 new dataframes
【Posted】: 2021-06-03 11:47:49
【Question description】:

My goal is to first split a dataframe by 3 categories, and then create 3 new dataframes, one for each category. Here is my code.

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

train_path = tf.keras.utils.get_file(
    "iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file(
    "iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")

train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)

train.pop('SepalWidth')
train.pop('PetalWidth')

flower0 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])
flower1 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])
flower2 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])

for row in range(len(train)):
    species = train.iloc[row]['Species']
    info = train.iloc[row]
    info.pop('Species')

    if species == 0.0:
        flower0.append(info)
    elif species == 1.0:
        flower1.append(info)
    else:
        flower2.append(info)

print(flower0)

plt.scatter(flower0.pop('SepalLength'), flower0.pop('PetalLength'), color='Red')
plt.scatter(flower1.pop('SepalLength'), flower1.pop('PetalLength'), color='Blue')
plt.scatter(flower2.pop('SepalLength'), flower2.pop('PetalLength'), color='Green')
plt.show()

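I am very new to machine learning and data engineering, so I wanted to visualize what my data looks like on a scatter plot. Since I can't plot this data in 4 dimensions (I have 4 columns: sepal width/length and petal width/length), I decided to plot only 2 of them: sepal length and petal length. I used the .pop() method to remove the columns I don't need, and then got stuck on this block of code.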

flower0 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])
flower1 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])
flower2 = pd.DataFrame(columns=['SepalLength', 'PetalLength'])

for row in range(len(train)):
    species = train.iloc[row]['Species']
    info = train.iloc[row]
    info.pop('Species')

    if species == 0.0:
        flower0.append(info)
    elif species == 1.0:
        flower1.append(info)
    else:
        flower2.append(info)

print(flower0)

plt.scatter(flower0.pop('SepalLength'), flower0.pop('PetalLength'), color='Red')
plt.scatter(flower1.pop('SepalLength'), flower1.pop('PetalLength'), color='Blue')
plt.scatter(flower2.pop('SepalLength'), flower2.pop('PetalLength'), color='Green')
plt.show()

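Here I create 3 empty dataframes with the 2 columns I want to use later as the plot axes, and loop over the full dataset in a for loop. The loop sorts the rows by species and appends each one to the corresponding dataframe. The append does not seem to work, because when I print one of the new dataframes I get: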

Empty DataFrame
Columns: [SepalLength, PetalLength]
Index: []

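Does anyone know how I should add these rows to a specific new dataframe? Thank you very much in advance!

A side question, if you want brownie points: is this the best way to display a scatter plot? I looked online, and it said the best approach is to plot the data as separate scatter sets so that I can change the color of each group independently. My whole goal is just to see each flower's petal length and sepal length visually, in different colors.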

【Question discussion】:

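Reduce your question to what you are actually asking. Provide a minimal reproducible example, with the emphasis on minimal. That said, you are probably looking for a simple groupby. Reading some pandas tutorials might help you. pandas.pydata.org/pandas-docs/stable/user_guide/10min.html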
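The groupby suggestion in the comment above could look roughly like the following. This is only a sketch: the small `train` frame is made-up stand-in data rather than the real iris file, and the `frames` dictionary is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Made-up stand-in for the iris training frame (illustrative values only).
train = pd.DataFrame({
    "SepalLength": [5.1, 6.4, 5.9, 4.9, 6.9],
    "PetalLength": [1.4, 5.6, 4.2, 1.5, 5.1],
    "Species": [0, 2, 1, 0, 2],
})

# groupby yields (key, sub-frame) pairs; keep only the two plotting columns.
frames = {
    species: group[["SepalLength", "PetalLength"]]
    for species, group in train.groupby("Species")
}

flower0, flower1, flower2 = frames[0], frames[1], frames[2]
print(flower0)
```

Each value in `frames` is an ordinary dataframe, so it can be passed to plotting code the same way as the hand-built `flower0`, `flower1`, and `flower2`.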

Tags: python pandas dataframe tensorflow matplotlib

【Solution 1】:

I don't think you need a for loop here; in general, iterating over a dataframe with a for loop is very inefficient for large datasets.

Just drop the for loop and replace the definitions of flower0, flower1, and flower2 with loc-based selections.

# change each definition to the selection you want, using loc with a boolean mask
flower0 = train.loc[train.Species==0.0][['SepalLength', 'PetalLength']]
flower1 = train.loc[train.Species==1.0][['SepalLength', 'PetalLength']]
flower2 = train.loc[train.Species>1 ][['SepalLength', 'PetalLength']]

# drop the for loop
plt.scatter(flower0.pop('SepalLength'), flower0.pop('PetalLength'), color='Red')
plt.scatter(flower1.pop('SepalLength'), flower1.pop('PetalLength'), color='Blue')
plt.scatter(flower2.pop('SepalLength'), flower2.pop('PetalLength'), color='Green')
plt.show()
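On the side question about colors: the plotting calls above use pop, which removes the columns from each frame as it plots them. A possible variant that leaves the data intact and adds a legend is sketched below. The `train` frame is made-up stand-in data, the `colors` mapping is a hypothetical helper, and the species labels assume the numeric codes index into the SPECIES list from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

SPECIES = ["Setosa", "Versicolor", "Virginica"]

# Made-up stand-in for the iris training frame (illustrative values only).
train = pd.DataFrame({
    "SepalLength": [5.1, 6.4, 5.9, 4.9, 6.9],
    "PetalLength": [1.4, 5.6, 4.2, 1.5, 5.1],
    "Species": [0, 2, 1, 0, 2],
})

colors = {0: "red", 1: "blue", 2: "green"}  # hypothetical code -> color mapping

fig, ax = plt.subplots()
for code, color in colors.items():
    # Boolean mask selects one species; column indexing does not mutate train.
    subset = train.loc[train["Species"] == code, ["SepalLength", "PetalLength"]]
    ax.scatter(subset["SepalLength"], subset["PetalLength"],
               color=color, label=SPECIES[code])

ax.set_xlabel("SepalLength")
ax.set_ylabel("PetalLength")
ax.legend()
# In an interactive session, drop the Agg line above and call plt.show() here.
```

One scatter call per group keeps each color and legend entry independent, which matches what the question describes wanting.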

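In any case, I believe you are getting an empty data frame because you are trying to "append" a Series object (info = train.iloc[row]) to a data frame. Note also that DataFrame.append returned a new frame instead of modifying the original, so calling it without assigning the result leaves flower0 empty. To append a Series to an existing data frame, use df = pd.concat([df, s.to_frame().T])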
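If a row-by-row loop is still wanted, the concat pattern above might be applied as in this sketch. The `train` frame is made-up stand-in data and `pieces` is a hypothetical name. The one-row frames are collected in a list and concatenated once, which avoids rebuilding the frame on every iteration. DataFrame.append is not used because it was removed in pandas 2.0.

```python
import pandas as pd

# Made-up stand-in for the iris training frame (illustrative values only).
train = pd.DataFrame({
    "SepalLength": [5.1, 6.4, 5.9, 4.9, 6.9],
    "PetalLength": [1.4, 5.6, 4.2, 1.5, 5.1],
    "Species": [0, 2, 1, 0, 2],
})

pieces = []
for i in range(len(train)):
    row = train.iloc[i]  # a Series holding one row
    if row["Species"] == 0:
        s = row[["SepalLength", "PetalLength"]]
        # to_frame().T turns the Series into a one-row DataFrame.
        pieces.append(s.to_frame().T)

# concat returns a new frame, so the result has to be assigned.
flower0 = pd.concat(pieces)
print(flower0)
```

The boolean-mask selection from the answer remains the simpler option; this only shows what a working version of the original loop could look like.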

【Discussion】: