【Question title】: Parsing a CSV file to store as a dictionary with nested array values. Best approach?
【Posted】: 2015-08-15 19:09:56
【Question】:

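I'm trying to take this CSV file, parse it, and store it as a dictionary (apologies if I misuse any terminology; I'm still learning). The first element of each row is my key, and the rest become the values, stored as a nested array.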

targets_value,11.4,10.5,10,10.8,8.3,10.1,10.7,13.1
targets,Cbf1,Sfp1,Ino2,Opi1,Cst6,Stp1,Met31,Ino4
one,"9.6,6.3,7.9,11.4,5.5",N,"8.4,8.1,8.1,8.4,5.9,5.9",5.4,5.1,"8.1,8.3",N,N
two,"7.0,11.4,7.0","4.8,5.3,7.0,8.1,9.0,6.1,4.6,5.0,4.6","6.3,5.9,5.9",N,"4.3,4.8",N,N,N
three,"6.0,9.7,11.4,6.8",N,"11.8,6.3,5.9,5.9,9.5","5.4,8.4","5.1,5.1,4.3,4.8,5.1",N,N,11.8
four,"9.7,11.4,11.4,11.4",4.6,"6.2,7.9,5.9,5.9,6.3","5.6,5.5","4.8,4.8,8.3,5.1,4.3",N,7.9,N
five,7.9,N,"8.1,8.4",N,"4.3,8.3,4.3,4.3",N,N,N
six,"5.7,11.4,9.7,5.5,9.7,9.7","4.4,7.0,7.7,7.5,6.9,4.9,4.6,4.9,4.6","7.9,5.9,5.9,5.9,5.9,6.3",6.7,"5.1,4.8",N,7.9,N
seven,"6.3,11.4","5.2,4.7","6.3,6.0",N,"8.3,4.3,4.8,4.3,5.1","9.8,9.5",N,8.4
eight,"11.4,11.4,5.9","4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9","6.3,6.3,5.9,5.9,6.6,6.6","5.3,5.2,7.0","8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1","9.2,7.4","9.4,9.3,7.9",N
nine,"9.7,9.7,11.4,9.7","5.2,4.6,5.5,6.5,4.5,4.6,5.5","6.3,5.9,5.9,9.5,6.5",N,"4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8",8.0,8.6,N
ten,"9.7,9.7,9.7,11.4,7.9","5.2,4.6,5.5,6.5,4.5,4.6,5.5","6.3,5.9,5.9,9.5,6.5",5.7,"4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8",8.0,8.6,N
YPL250C_Icy2,"11.4,6.1,11.4",N,"6.3,6.0,6.6,7.0,10.0,6.5,9.5,7.0,10.0",7.1,"4.3,4.3",9.2,"10.7,9.5",N
,,,,,,,,
,,,,,,,,

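The problem is that, in each row, some columns are quoted because the cell holds multiple values, while other columns hold a single entry and are unquoted. Cells with no value contain an N. As a result, quoted and unquoted fields, and numeric and non-numeric values, are mixed together.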

I'd like the output to look like this:

{'eight': ['11.4,11.4,5.9', '4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9', '6.3,6.3,5.9,5.9,6.6,6.6', '5.3,5.2,7.0', '8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1', '9.2,7.4', '9.4,9.3,7.9', 'N'], 

'ten': ['9.7,9.7,9.7,11.4,7.9', '5.2,4.6,5.5,6.5,4.5,4.6,5.5', '6.3,5.9,5.9,9.5,6.5', '5.7', '4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8', '8.0', '8.6', 'N'], 

'nine': ['9.7,9.7,11.4,9.7', '5.2,4.6,5.5,6.5,4.5,4.6,5.5', '6.3,5.9,5.9,9.5,6.5', 'N', '4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8', '8.0', '8.6', 'N']
}

I wrote a script to clean the data up and store it, but I'm not sure whether my script is "too long for no reason". Any suggestions?

import re

motif_dict = {}
with open(filename, "r") as file:
    data = file.readlines()
    for line in data:
        if ',,,,,,,,' in line:
            continue
        else:
            quoted_holder = re.findall(r'"(\d.*?\d)"' , line)
            #reverses the order of the elements contained in the array
            quoted_holder = quoted_holder[::-1]
            new_line = re.sub(r'"\d.*?\d"', 'h', line).split(',')
            for position,element in enumerate(new_line):
                if element == 'h':
                    new_line[position] = quoted_holder.pop()
        motif_dict[new_line[0]] = new_line[1:]

【Comments on the question】:

    Tags: python regex csv dictionary


    【Solution 1】:

    There is a csv module in the standard library that makes working with CSV files much easier. In your case, your code becomes

    import csv
    
    with open("motif.csv","r",newline="") as fp:
        reader = csv.reader(fp)
        data = {row[0]: row[1:] for row in reader if row and row[0]}
    

    The condition if row and row[0] lets us skip empty rows and rows whose first element is empty. This produces (line breaks added)

    >>> data["eight"]
    ['11.4,11.4,5.9', '4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9', 
     '6.3,6.3,5.9,5.9,6.6,6.6', '5.3,5.2,7.0',
     '8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1', 
     '9.2,7.4', '9.4,9.3,7.9', 'N']
    >>> data["ten"]
    ['9.7,9.7,9.7,11.4,7.9', '5.2,4.6,5.5,6.5,4.5,4.6,5.5',
     '6.3,5.9,5.9,9.5,6.5', '5.7', '4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8', 
     '8.0', '8.6', 'N']
    

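    In practice, for processing, I think you'd want to replace 'N' with None or some other object as a missing-value marker, and make each value a list of floats (even if it has only one element), but that's up to you.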
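A minimal sketch of the conversion suggested above, assuming that 'N' is the only missing-value marker and that the targets row should stay as text. The names parse_cell, load_motifs, and text_rows are hypothetical helpers introduced for illustration, and a few rows from the question are inlined so the example runs on its own.

```python
import csv
import io

# A few rows copied from the question's file, inlined so the sketch is self-contained.
SAMPLE = '''targets,Cbf1,Sfp1,Ino2,Opi1,Cst6,Stp1,Met31,Ino4
one,"9.6,6.3,7.9,11.4,5.5",N,"8.4,8.1,8.1,8.4,5.9,5.9",5.4,5.1,"8.1,8.3",N,N
five,7.9,N,"8.1,8.4",N,"4.3,8.3,4.3,4.3",N,N,N
,,,,,,,,
'''


def parse_cell(cell, missing="N"):
    """Hypothetical helper: 'N' -> None, '8.1,8.3' -> [8.1, 8.3], '5.4' -> [5.4]."""
    if cell == missing:
        return None
    return [float(x) for x in cell.split(",")]


def load_motifs(fp, text_rows=("targets",)):
    """Hypothetical loader: maps each row's first field to its parsed cells."""
    data = {}
    for row in csv.reader(fp):
        # Skip empty rows and rows whose first field is empty.
        if not row or not row[0]:
            continue
        key, cells = row[0], row[1:]
        if key in text_rows:
            # Assumption: rows listed in text_rows hold names, not numbers.
            data[key] = cells
        else:
            data[key] = [parse_cell(c) for c in cells]
    return data


data = load_motifs(io.StringIO(SAMPLE))
print(data["five"])
```

Using None for missing cells keeps every row at eight entries, so positions still line up with the targets row. With a real file, the same function could be called on the handle returned by open(..., newline="").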

    【Comments】:

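    • (strikes table) This reminds me of Raymond Hettinger's examples. It's just that beautiful.
    • Wow, thank you so much! I didn't know a module existed for this; it will save me a lot of time. Sorry, one more question: is it possible to use the module to make the values tuples instead of arrays? That way I could do calculations with them without worrying about matching up the wrong elements, since each one has 8 elements that correspond to another list of 8 values.
    • Wow, thanks, Melville. I found his "Transforming Code into Beautiful, Idiomatic Python" video. Awesome, really beautiful.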
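On the tuple question raised in the comments: csv.reader yields each row as a list, so one possible approach is to wrap the slice in tuple() while building the dictionary. The sketch below is an illustration rather than part of the original answer. It inlines three rows from the question and uses the built-in zip to pair each cell with its target by position; the name by_target is hypothetical.

```python
import csv
import io

# Three rows copied from the question's file, inlined so the sketch is self-contained.
SAMPLE = '''targets_value,11.4,10.5,10,10.8,8.3,10.1,10.7,13.1
targets,Cbf1,Sfp1,Ino2,Opi1,Cst6,Stp1,Met31,Ino4
five,7.9,N,"8.1,8.4",N,"4.3,8.3,4.3,4.3",N,N,N
'''

reader = csv.reader(io.StringIO(SAMPLE))
# tuple(...) makes each row's values immutable, so positions cannot shift by accident.
data = {row[0]: tuple(row[1:]) for row in reader if row and row[0]}

# zip pairs items by position, so each cell is matched with the target in the same column.
by_target = dict(zip(data["targets"], data["five"]))
print(by_target["Ino2"])
```

Tuples and lists both preserve order, so the alignment with the eight targets holds either way; the tuple only guards against accidental modification.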