Breaking down a large dataset into an organized index

【Question title】: Breaking down large dataset into organized index
【Posted】: 2015-02-18 23:11:45
【Question】:

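I am trying to create an index dictionary keyed by shape_id from a dataset I have (see below). I realize I could use loops (and have tried to), but I have a hunch that pandas has a bulk method for this that is not computationally expensive.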

Possible solutions: groupby, str.findall, str.extract

The dictionary should be structured like this:

{shape_id: [shape_pt_sequence, [shape_pt_lat,shape_pt_lon]]}

This is the code I have so far:

import pandas as pd

# readability assignments for shapes.csv
shapes = pd.read_csv('csv/shapes.csv')
shapes_shape_id = shapes['shape_id']
shapes_shape_id_index = list(set(shapes_shape_id))
shapes_shape_pt_sequence = shapes['shape_pt_sequence']
shapes_shape_pt_lat = shapes['shape_pt_lat']
shapes_shape_pt_lon = shapes['shape_pt_lon']

shapes_tuple = []

# add shape index to final dict
for i in range(len(shapes_shape_id_index)):
    shapes_tuple.append([shapes_shape_id_index[i]])

print(shapes_tuple)

Here is a LINK to a gist of shapes.csv.

This is the empty shape_id index:

[[20992], [20993], [20994], [20995], [20996], [20997], [20998], [20999], [21000], [21001], [21002], [21003], [21004], [21005], [21006], [21007], [21008], [21009], [21010], [21011], [21012], [21013], [21014], [21015], [21016], [21017], [21018], [21019], [21020], [21021], [21022], [21023], [21026], [21027], [21028], [21029], [21030], [21031], [21032], [21033], [21034], [21035], [21036], [21037], [21038], [21039], [21040], [21041], [21042], [21043], [21044], [21045], [21046], [21047], [21048], [21049], [21050], [21051], [21052], [21053], [21054], [21055], [21056], [21057], [21058], [21059], [21060], [21061], [21062], [21063], [21064], [21065], [21066], [21067], [21068], [21069], [21070], [21071], [21072], [21073], [21074], [21075], [21076], [21077], [21078], [21079], [21080], [21081], [21082], [21083], [21084], [21085], [21086], [21087], [21088], [21089], [20958], [20959], [20960], [20961], [20962], [20963], [20964], [20965], [20966], [20967], [20968], [20969], [20970], [20971], [20972], [20973], [20974], [20975], [20976], [20977], [20978], [20979], [20980], [20981], [20982], [20983], [20984], [20985], [20986], [20987], [20988], [20989], [20990], [20991]]

shapes.csv looks like this:

shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0577683,-123.0873313,1,0
20958,44.0577163,-123.087073,2,0
20958,44.0576286,-123.0867103,3,0
20958,44.0574258,-123.086641,4,0
20958,44.0571421,-123.0866518,5,0
20958,44.0568706,-123.086653,6,0
20958,44.0566161,-123.0867028,7,0
20958,44.0565641,-123.0869733,8,0
20958,44.0565503,-123.0872603,9,0
20958,44.0565536,-123.087631,10,0
20958,44.0565439,-123.0879283,11,0
20958,44.0564661,-123.087894,12,0
20958,44.0565124,-123.0881793,13,0
20958,44.0565181,-123.0884921,14,0
20958,44.0565331,-123.0888668,15,0
20958,44.0565406,-123.0892323,16,0
20958,44.0565406,-123.0896295,17,0
20958,44.0563515,-123.0897096,18,0
20958,44.056073,-123.0897108,19,0
20958,44.0558501,-123.0897,20,0
20958,44.0558358,-123.0897016,21,0
20958,44.0556489,-123.0896861,22,0
20958,44.0554398,-123.0896781,23,0
20958,44.0552033,-123.0896776,24,0
20958,44.0549253,-123.089692,25,0
20958,44.0546778,-123.0897281,26,0
20958,44.0546578,-123.0897326,27,0
20958,44.0546338,-123.0896965,28,0
20958,44.0543988,-123.0896838,29,0
20958,44.0543536,-123.0899543,30,0
20958,44.0543628,-123.0903496,31,0
20958,44.0543668,-123.0906733,32,0
20958,44.0543718,-123.0910178,33,0

For example, in shapes.csv the maximum shape_pt_sequence value for 20958 is 72, the maximum for 20960 is 400, and so on.

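【Comments】:

What is your desired output? What is stops.csv? You don't use it in your code.
Stops was a typo; it should be shapes.csv. It is corrected now. The output I'm trying to get has the format [shape_id:[shape_pt_sequence, [shape_pt_lat,shape_pt_lon]]]
Is this post tl;dr? I can slim it down.
[shape_id:[shape_pt_sequence, [shape_pt_lat,shape_pt_lon]]] is not a valid Python structure. Can you give a real example, and explain why you want this form and how you are going to use it? Something like this: {20928:[[1, 2, 3, 4, ...], [[44.0577683, ...], [-123.0873313]]], 20960:...}
Your data structure is redundant, because you can access the list of points by index: for example, print data[20958] gives [(44.0577683,-123.0873313),(44.0577163,-123.087073),...] and print data[20958][1] gives (44.0577163,-123.087073).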

Tags: python csv dictionary pandas

【Solution 1】:

I don't know why you need a structure like [shape_id:[shape_pt_sequence, [shape_pt_lat,shape_pt_lon]]]; it is not very useful for selecting data. You can use a MultiIndex instead:

shapes = pd.read_csv('shapes.csv')
shapes.set_index(["shape_id", "shape_pt_sequence"], inplace=True)

Then select all the data for 20958:

print(shapes.loc[20958])

Select a single point:

print(shapes.loc[(20958, 45)])

Select the data for 20958 whose shape_pt_sequence lies in a range:

print(shapes.loc[(20958, slice(45, 48)), :])

Select the data whose shape_pt_sequence is one of [45, 48]:

print(shapes.loc[(20958, [45, 48]), :])
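
The four selections can be tried without the full file. The following is a minimal sketch, assuming a handful of inline rows in the shapes.csv layout; the 20959 rows and their coordinates are invented for illustration, and the sequence numbers are adjusted to fit the small sample. The sort_index() call is an addition: label slicing on a MultiIndex expects a sorted index, and the order of the real file is not known here.

```python
import io
import pandas as pd

# A few rows in the shapes.csv layout. The 20959 rows are invented
# for illustration and do not come from the real file.
csv_text = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0577683,-123.0873313,1,0
20958,44.0577163,-123.087073,2,0
20958,44.0576286,-123.0867103,3,0
20958,44.0574258,-123.086641,4,0
20959,44.1,-123.1,1,0
20959,44.2,-123.2,2,0
"""

shapes = pd.read_csv(io.StringIO(csv_text))
# Sorting keeps the MultiIndex lexsorted, which label slicing relies on.
shapes = shapes.set_index(["shape_id", "shape_pt_sequence"]).sort_index()

all_points = shapes.loc[20958]                  # every row of one shape
one_point = shapes.loc[(20958, 2)]              # a single row, returned as a Series
in_range = shapes.loc[(20958, slice(2, 3)), :]  # sequences 2 through 3, both included
picked = shapes.loc[(20958, [1, 4]), :]         # only sequences 1 and 4

print(all_points)
print(one_point["shape_pt_lat"])
print(in_range)
print(picked)
```

Note that slice(2, 3) is label-based, so both endpoints are included, unlike positional Python slices.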

If you really want that form, here is the code:

shapes = pd.read_csv('shapes.csv')

def f(df):
    return [df.shape_pt_sequence.tolist(), [df.shape_pt_lat.tolist(), df.shape_pt_lon.tolist()]]

res = shapes.groupby("shape_id").apply(f).to_dict()
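
To see what the resulting dictionary looks like, here is a small sketch on inline data. It swaps groupby(...).apply(f) for a dict comprehension over the groups, which should build the same nested lists; that swap, the sample rows, and the invented 20959 coordinates are assumptions for illustration, not part of the answer above.

```python
import io
import pandas as pd

# A few rows in the shapes.csv layout. The 20959 rows are invented
# for illustration and do not come from the real file.
csv_text = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0577683,-123.0873313,1,0
20958,44.0577163,-123.087073,2,0
20959,44.1,-123.1,1,0
20959,44.2,-123.2,2,0
"""

shapes = pd.read_csv(io.StringIO(csv_text))

# Build {shape_id: [sequences, [latitudes, longitudes]]}, one entry per group.
res = {
    shape_id: [
        group["shape_pt_sequence"].tolist(),
        [group["shape_pt_lat"].tolist(), group["shape_pt_lon"].tolist()],
    ]
    for shape_id, group in shapes.groupby("shape_id")
}

print(res[20959])
```

Each value holds the sequence numbers first, followed by a pair of parallel latitude and longitude lists.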

【Comments】:

Thanks for the variations in your answer. MultiIndex solved the problem for me! I will look into indexing now, too!

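【Solution 2】:

Assuming your REAL task is not to validate the data file, reading the file and filling an appropriate data structure with a loop is not clunky, not at all...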

with open('shapes.csv') as f:  # the with block closes f when it ends
    next(f)  # skip the header row
    lines = [line.strip().split(',') for line in f]

data = {}
item = 0  # shape_id of the group currently being filled
for i, lat, lon, seq, stop in lines:
    i = int(i)
    if i != item:
        # a new shape_id starts: open a fresh list of points
        item = i
        data[item] = [(float(lat), float(lon))]
    else:
        data[item].append((float(lat), float(lon)))

You don't need the stop flag from your data file, nor do you need to store the index explicitly for each coordinate pair.
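
As an illustration of why the explicit index is unnecessary, here is a hedged variant of the same loop that uses the standard csv module and a defaultdict instead of manual splitting. The inline rows stand in for shapes.csv, and the 20959 rows are invented. The sketch assumes the rows of each shape appear in shape_pt_sequence order starting at 1, as in the excerpt of the file shown in the question.

```python
import csv
import io
from collections import defaultdict

# A few rows in the shapes.csv layout. The 20959 rows are invented
# for illustration and do not come from the real file.
csv_text = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0577683,-123.0873313,1,0
20958,44.0577163,-123.087073,2,0
20958,44.0576286,-123.0867103,3,0
20959,44.1,-123.1,1,0
20959,44.2,-123.2,2,0
"""

# Map each shape_id to its list of (lat, lon) points, in file order.
data = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    point = (float(row["shape_pt_lat"]), float(row["shape_pt_lon"]))
    data[int(row["shape_id"])].append(point)

# List position n - 1 holds the point whose shape_pt_sequence is n.
print(data[20958][1])
```

Because the points are appended in file order, the list position already plays the role of shape_pt_sequence.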

【Comments】: