【Question title】: splitting csv file in individual columns based on range of values in python
【Posted】: 2019-02-03 19:25:48
【Question】:

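I want to split the values given as a range in a CSV column, adding a new row of data for each number in the range while keeping the data in all the other columns matched.

It is important that I can keep the data in the other columns (Job ID) for any number within the range (x-y), so the resulting CSV will obviously be much longer than the original.

I want my output CSV to have a separate row for each number in the ranges 26-29, 66-67, and so on. So I would like an output CSV file in which, for example:

Job ID 21879 is represented 4 times, once each for 26, 27, 28 and 29.

I would like to do this before the following steps of my script, but I am stuck at the moment.

The rest of the script splits the date values on (/), assigns them to new columns, and concatenates them with the page number field. It is this page number field that I want to split into the individual numbers of the range shown.

The resulting list from this script contains only the values needed from the Job ID column, with the concatenated date and page fields in the second column. That part works fine; it is the final CSV file in which I need each number of a given range represented as an individual number.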
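
The concatenation step described above can be sketched as follows. This is a minimal illustration, not the script itself: the helper name `make_key` is hypothetical, and it assumes dates are in DD/MM/YYYY form and that leading zeros are stripped from each part before joining, as the script below does with `lstrip('0')`.

```python
def make_key(date_str, page):
    """Join day, month, year and page into one string (hypothetical helper).

    Assumes date_str looks like 'DD/MM/YYYY'. Surrounding whitespace and
    leading zeros are stripped from every part before joining.
    """
    parts = date_str.split('/') + [page]
    return ''.join(part.strip().lstrip('0') for part in parts)


print(make_key('28/06/2018', '26'))
```

Note that a key built this way is not guaranteed to be unique, because the parts are joined without a separator.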

Thank you for any help with splitting these ranges of values and maintaining the other data fields.

A subset of my input data is as follows:

Job ID  Job summary Link    Locality    Received    Job status  Asset   Date       Page No
21879   Addition    Documents Link  CBD 15/06/2018  Completed   Water   28/06/2018  26-29
21878   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21877   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21876   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21875   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21874   Addition    Documents Link  CBD 28/06/2018  Completed   Water   26/07/2018  42-43
21873   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  
21872   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  66-67
21871   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  07-08
21870   Addition    Documents Link  CBD 27/06/2018  Completed   Water   28/06/2018  59
21869   Addition    Documents Link  CBD 27/06/2018  Completed   Water   28/06/2018  58
21868   Addition    Documents Link  CBD 26/06/2018  Completed   Water       
21867   Addition    Documents Link  CBD 26/06/2018  Completed   Water       

My desired output is:

Job ID  Job summary Link    Locality    Received    Job status  Asset   Date       Page No
21879   Addition    Documents Link  CBD 15/06/2018  Completed   Water   28/06/2018  26
21879   Addition    Documents Link  CBD 15/06/2018  Completed   Water   28/06/2018  27  
21879   Addition    Documents Link  CBD 15/06/2018  Completed   Water   28/06/2018  28  
21879   Addition    Documents Link  CBD 15/06/2018  Completed   Water   28/06/2018  29  
21878   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21877   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21876   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21875   Addition    Documents Link  CBD 28/06/2018  Completed   Water       
21874   Addition    Documents Link  CBD 28/06/2018  Completed   Water   26/07/2018  42
21874   Addition    Documents Link  CBD 28/06/2018  Completed   Water   26/07/2018  43
21873   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  
21872   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  66
21872   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  67
21871   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  07
21871   Addition    Documents Link  CBD 27/06/2018  Completed   Water   26/07/2018  08
21870   Addition    Documents Link  CBD 27/06/2018  Completed   Water   28/06/2018  59
21869   Addition    Documents Link  CBD 27/06/2018  Completed   Water   28/06/2018  58
21868   Addition    Documents Link  CBD 26/06/2018  Completed   Water       
21867   Addition    Documents Link  CBD 26/06/2018  Completed   Water       

The current script is:

import os
import csv
with open('CSV_File.csv','r') as csvinput:  
    with open('temp__spreadsheet_cache_1.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        for row in csv.reader(csvinput):
            if row[7] == "Date":
                writer.writerow(row+["day"])
            else:
                writer.writerow(row+row[4].split('/'))
with open('temp__spreadsheet_cache_1.csv','r') as csvinput:
    with open('temp__spreadsheet_cache_2.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        for row in csv.reader(csvinput):
            if row[7] == "Date":
                writer.writerow(row+["month"])
            else:
                writer.writerow(row+row[4].split('/'))
with open('temp__spreadsheet_cache_2.csv','r') as csvinput:
    with open('temp__spreadsheet_cache_3.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        for row in csv.reader(csvinput):
            if row[7] == "Date":
                writer.writerow(row+["year"])
            else:
                writer.writerow(row+row[4].split('/'))
with open('temp__spreadsheet_cache_3.csv','r') as csvinput:
    with open('temp__spreadsheet_cache_4.csv', 'w') as csvoutput:
        writer = csv.writer(csvoutput)
        for row in csv.reader(csvinput):
            if row[7] == "Date":
                writer.writerow(row+["Concatenation"])
            else:
                writer.writerow(row+row[4].split('/'))
#---Using Current output (temp__spreadsheet_cache_4.csv) to create new list--
blank =[]
with open (r'temp__spreadsheet_cache_4.csv', 'r') as NEW_CSV:
    csvReader = csv.reader(NEW_CSV, delimiter=',', quotechar='"')
    header = csvReader.next()
    JobIndex = header.index("Job ID")
    PageIndex = header.index("Page No")
    DayIndex = header.index("day")
    MonthIndex = header.index("month")
    YearIndex = header.index("year")
    Summary = header.index("Job summary")
    StatusIndex = header.index("Job status")
    class_1 = header.index("Asset")
    for row in csvReader:
        Page = row[PageIndex]
        Day = row[DayIndex]
        Month = row[MonthIndex]
        Year = row[YearIndex]
        JobID = row[JobIndex]
        To_be_overridden_concat = row[PageIndex]
        Type = row[Summary]
        Status = row[StatusIndex]
        waterclass = row[class_1]
        if waterclass == 'Water'  
          blank.append([JobID,Day,Month,Year,Page,To_be_overridden_concat])
str(blank)
for column in blank:
    column[1] = column[1].lstrip('0')
    column[2] = column[2].lstrip('0')
    column[3] = column[3].lstrip('0')
    column[4] = column[4].lstrip('0')
for column in blank:
    column[0] = column[0].lstrip()
    column[1] = column[1].lstrip()
    column[2] = column[2].lstrip()
    column[3] = column[3].lstrip() 
    column[4] = column[4].lstrip()
for column in blank:
    column[0] = column[0].rstrip()
    column[1] = column[1].rstrip()
    column[2] = column[2].rstrip()
    column[3] = column[3].rstrip()
    column[4] = column[4].rstrip()
    column[5] = column[1]+column[2]+column[3]+column[4]
##os.remove("temp__spreadsheet_cache_4.csv")
os.remove("temp__spreadsheet_cache_3.csv")
os.remove("temp__spreadsheet_cache_2.csv")
os.remove("temp__spreadsheet_cache_1.csv")
for row in blank:
    del row[1:5]
print blank[0:10]

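【Discussion】:

  • Could you share the real content/structure of your input? The fact that you create the csv reader without specifying a delimiter suggests that your input uses commas as the delimiter, not whitespace as the example above suggests. Specifically, I would like to know whether there are empty cells when no date and page number are given, i.e. whether the line has two trailing commas.
  • Also, apart from the duplicated rows, your desired sample output looks structurally identical to your input, but your code drops many of the original columns while adding some new ones. So which of the two is your desired output? And also... :) The line if waterclass == 'Water' is missing a colon at the end, which brings me to the question: do you only want to do this when Asset is Water in your input?
  • Sorry, I am new to the forum. Apparently you have to have posted a certain number of times before you can add images/screenshots of the data, but the delimiter is on line 38. I was a bit unsure how to do this without adding a screenshot, so I tried to adjust only for a visual representation of the data in the csv file. The Excel data was just copied into a .py file here. The blank cells are fine and cause no problems. The part I am stuck on actually comes before the code provided; I was just trying to give context by adding what I had done so far. Right! Copy error, it should be ....'Water': Thanks in advance!
  • If the code provided is not directly related to your question, it is better not to post it. It only confuses people, as it confused me. You should include only what is necessary to understand and answer your question. Have you tried anything to get from the sample input to the sample output?
  • Copy/pasting the sample data directly (from a text editor) is better than trying to format it for the question. Use the edit button to make any changes to improve your question.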

Tags: python csv


【Solution 1】:

First I need to assume that you have a standard CSV file, with commas separating the fields, for example:

Job ID,Job summary,Link,Locality,Received,Job status,Asset,Date,Page No
21879,Addition,Documents,Link,CBD,15/06/2018,Completed,Water,28/06/2018,26-29
21878,Addition,Documents,Link,CBD,28/06/2018,Completed,Water,,
21874,Addition,Documents,Link,CBD,28/06/2018,Completed,Water,26/07/2018,42-43
21873,Addition,Documents,Link,CBD,27/06/2018,Completed,Water,26/07/2018,1

If that is the case, your data could be fixed as follows:

from datetime import datetime
import csv

fieldnames = ["Job ID", "Job summary", "Link", "Locality", "ReceivedDay", "ReceivedMonth", "ReceivedYear", "Job status", "Asset", "Day", "Month", "Year", "Page No"]

with open("CSV_File.csv", "rb") as f_input, open("output.csv", "wb") as f_output:
    csv_input = csv.reader(f_input)
    next(csv_input) # skip the header

    csv_output = csv.writer(f_output)
    csv_output.writerow(fieldnames)

    for row in csv_input:
        date_received = row[5].split('/')

        if len(row[8]):
            date = row[8].split('/')
        else:
            date = ["", "", ""]

        if row[9].find('-') != -1:
            pages = map(int, row[9].split("-"))

            for page in range(pages[0], pages[1] + 1):
                output_row = row[:5] +  date_received + row[6:8] + date + [page]
                csv_output.writerow(output_row)
        else:
            output_row = row[:5] +  date_received + row[6:8] + date + [row[9]]
            csv_output.writerow(output_row)

This would give you an output file starting:

Job ID,Job summary,Link,Locality,ReceivedDay,ReceivedMonth,ReceivedYear,Job status,Asset,Day,Month,Year,Page No
21879,Addition,Documents,Link,CBD,15,06,2018,Completed,Water,28,06,2018,26
21879,Addition,Documents,Link,CBD,15,06,2018,Completed,Water,28,06,2018,27
21879,Addition,Documents,Link,CBD,15,06,2018,Completed,Water,28,06,2018,28
21879,Addition,Documents,Link,CBD,15,06,2018,Completed,Water,28,06,2018,29
21878,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,,,,
21877,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,,,,
21876,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,,,,
21875,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,,,,
21874,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,26,07,2018,42
21874,Addition,Documents,Link,CBD,28,06,2018,Completed,Water,26,07,2018,43

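It works by first skipping the input header and writing a suitable output header. It assumes that the received date is always present. split('/') is used to split each date into three parts. If the page number contains a - character, split('-') is used to get the two parts, which are then converted into two integers, and one row is written for every page from the first to the last inclusive.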
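
The two split steps can be tried in isolation. This is a small sketch that assumes well-formed values; `expand_pages` is a hypothetical name, and unlike the code above it keeps the pages as zero-padded strings, so that `07-08` yields `07` and `08` as in the desired output in the question.

```python
def expand_pages(page_field):
    """Expand 'A-B' into every page from A to B inclusive (hypothetical helper).

    A single page ('59') or an empty field ('') comes back unchanged as a
    one-element list. Zero padding is kept by padding each page to the
    width of the start value.
    """
    if '-' not in page_field:
        return [page_field]
    start, end = page_field.split('-')
    width = len(start)
    return [str(page).zfill(width) for page in range(int(start), int(end) + 1)]


# Splitting a date into its three parts.
day, month, year = '28/06/2018'.split('/')
print(day, month, year)

# Expanding page ranges.
print(expand_pages('26-29'))
print(expand_pages('07-08'))
```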

The output row is created by combining parts of the input row with the parts of the two dates.
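
For Python 3, the same idea could be written roughly as below. This is a sketch under assumptions rather than a drop-in replacement: it reads from an in-memory sample instead of `CSV_File.csv`, the function name `expand_rows` is made up, and it assumes the ten-field layout of the sample rows above (received date at index 5, date at index 8, page number at index 9). In Python 3, `map` returns an iterator, so the two page bounds are unpacked directly instead of being indexed.

```python
import csv
import io

# In-memory stand-in for the input file (header already removed).
SAMPLE = """\
21879,Addition,Documents,Link,CBD,15/06/2018,Completed,Water,28/06/2018,26-29
21878,Addition,Documents,Link,CBD,28/06/2018,Completed,Water,,
21873,Addition,Documents,Link,CBD,27/06/2018,Completed,Water,26/07/2018,1
"""


def expand_rows(rows):
    """Yield one output row per page number (hypothetical helper)."""
    for row in rows:
        date_received = row[5].split('/')
        # An empty date becomes three empty fields so the columns stay aligned.
        date = row[8].split('/') if row[8] else ["", "", ""]
        if '-' in row[9]:
            first, last = map(int, row[9].split('-'))
            pages = range(first, last + 1)
        else:
            pages = [row[9]]
        for page in pages:
            yield row[:5] + date_received + row[6:8] + date + [page]


output = io.StringIO()
writer = csv.writer(output, lineterminator='\n')
writer.writerows(expand_rows(csv.reader(io.StringIO(SAMPLE))))
print(output.getvalue())
```

When working with real files in Python 3, the csv module documentation recommends opening them in text mode with `newline=''` rather than with the `"rb"` and `"wb"` modes used for Python 2.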

【Discussion】:
