Python read data into array or list: answers

【Question title】: Python read data into array or list
【Posted】: 2014-04-07 18:39:33
【Question description】:

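I need to read data from a text file, manipulate it, and store it all in arrays, lists, or some other data structure so that I can tabulate it and plot it using matplotlib.

I intend to have an input statement that stores a treatment date and time. The treatment datetime will be subtracted from each date and time in the input file to give the time since treatment (in minutes or hours).

First, the input file I am working with is formatted as follows: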

!05/04/2014
@1332
Contact Angle (deg)     106.87
Contact Angle Left (deg)    106.90
Contact Angle Right (deg)   106.85
Wetting Tension (mN/m)      -21.13
Wetting Tension Left (mN/m) -21.16
Wetting Tension Right (mN/m)    -21.11
Base Tilt Angle (deg)       0.64
Base (mm)           1.7001
Base Area (mm2)         2.2702
Height (mm)         1.1174
Sessile Volume (ul)     2.1499
Sessile Surface Area (mm2)  6.3842
Contrast (cts)          255
Sharpness (cts)         186
Black Peak (cts)        0
White Peak (cts)        255
Edge Threshold (cts)        105
Base Left X (mm)        2.435
Base Right X (mm)       4.135
Base Y (mm)         3.801
RMS Fit Error (mm)      2.201E-3

@1333
Contact Angle (deg)     105.42
Contact Angle Left (deg)    106.04
Contact Angle Right (deg)   104.80
Wetting Tension (mN/m)      -19.36
Wetting Tension Left (mN/m) -20.12
Wetting Tension Right (mN/m)    -18.59
Base Tilt Angle (deg)       0.33
Base (mm)           1.6619
Base Area (mm2)         2.1691
Height (mm)         0.9837
Sessile Volume (ul)     1.6893
Sessile Surface Area (mm2)  5.3962
Contrast (cts)          255
Sharpness (cts)         190
Black Peak (cts)        0
White Peak (cts)        255
Edge Threshold (cts)        105
Base Left X (mm)        2.397
Base Right X (mm)       4.040
Base Y (mm)         3.753
RMS Fit Error (mm)      3.546E-3

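In the file, each new date begins with "!" and is in the format shown (dd/mm/yyyy).

The table should contain the datetime from the input file, the contact angle, and finally the number of minutes since treatment.

The code below extracts the relevant information from the text file and writes it to another file, but I don't know what the best way to store the information is.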

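# the accumulators must exist before the first "+=" inside the loop
current_date = ""
measure_time = ""

with open(infile) as f, open(outfile, 'w') as f2: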
    for line in f:
        if line.split():
            if line.split()[0][0] == '!':
                for i in range(1,11):
                    current_date += (line.split()[0][i])
                f2.write(current_date[:2] + ' ' + current_date[3:5] + ' ' + current_date[6:] + '\n')
            current_date = ""
            if line.split()[0][0] == '@':
                for i in range(0,5):
                    measure_time += (line.split()[0][i])
                f2.write(measure_time[1:3] + ":" + measure_time[3:] + '\n')
            if line.split()[0] == "Contact" and line.split()[2] == "(deg)":
                contact_angle = line.split()[-1].strip()
                f2.write("Contact Angle (deg): " + contact_angle + '\n\n')
            measure_time = ""
        else:
            continue

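I have also been playing with datetime, and have some code that calculates the time since treatment from a single input, but I need it to apply to every date and time in the input file.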

from datetime import datetime
import numpy as np

dt = input("Enter treatment date and time in format: dd mm yyyy hh:mm\n")
#dt = '27 03 2014 12:06'

dob = datetime.strptime(dt,'%d %m %Y %H:%M')



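b = datetime(2014,3,27,16,22,0)   # example measurement time
c = b-dob                         # timedelta since treatment
# total_seconds() includes whole days; .seconds alone would drop them
print(c.total_seconds())
print(c.total_seconds()/60)
print(c.total_seconds()//3600)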

Finally, I would like to plot contact angle against time since treatment using matplotlib.

If anyone could help me with this, I would be very grateful.

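【Question discussion】:

You did not state your problem. What is your question?
Sorry if the question is not clear. I can extract the relevant data to a file, but I want to be able to tabulate and plot the points above (datetime, contact angle, time since treatment). I don't know how to do that, or how to store the data in Python for manipulation.
There is hardly a question in your post, and the general rule on SO is one question at a time. Give me a moment and I will give you an example of how to parse data similar to yours (but please don't expect a complete solution).
One thing: are you sure the data in the file is separated by spaces? If there are tabs, parsing this file would be easier.
Thanks very much. The gap between the label and the value is a tab.

Tags: python arrays numpy matplotlib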

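【Solution 1】:

Here is how to parse such a file. Everything is stored in a dictionary of dictionaries (turtles all the way down :). The primary keys are the IDs (@smth).

The alternative would be to store by date, with each item being a list of dictionaries by ID. But that is easiest to do with collections.defaultdict, which might confuse you a bit. So the solution below may not be the best, but it should be easier for you to understand.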

data = {}

date = ID = None

# datafile is assumed to be an open file object (or any iterable of lines)
for line in datafile:
    if line.lstrip().startswith('!'):
        date = line.strip()[1:]  # drop the leading '!'
        print(date, line)  # debug output
    elif line.lstrip().startswith('@'):
        ID = line.strip()[1:]  # drop the leading '@'
        data[ID] = {}
        data[ID]['date'] = date
    elif line.strip(): # line not all whitespace
        if not ID:
            continue # we skip until we get next ID
        try:
            words = line.split()
            value = float(words[-1]) # last word
            unit = words[-2].lstrip('(').rstrip(')')
            item = {'value': value, 'unit': unit}
            key = ' '.join(words[:-2])  # e.g. 'Contact Angle'
            data[ID][key] = item
        except (ValueError, IndexError) as err:
            print("Could not parse this line:")
            print(line)
            continue
    else: # if 'empty' line
        ID = None

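I encourage you to analyze this line by line, looking up the methods in https://docs.python.org/2/. If you get really stuck, ask in the comments and someone can give you a link to a more specific page. GL.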
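
To get from this dictionary to the time since treatment that the question asks about, one option is to combine each entry's date string with its ID (the question's own code formats the digits after "@" as HH:MM) and subtract the treatment datetime. This is only a minimal sketch, assuming data has the shape built above; the helper name minutes_since_treatment, the sample values and the treatment time are made up for illustration.

```python
from datetime import datetime

def minutes_since_treatment(data, treatment):
    """Return sorted (minutes, contact_angle) pairs from the dict of dicts."""
    rows = []
    for hhmm, entry in data.items():
        # entry['date'] looks like '05/04/2014', hhmm looks like '1332'
        measured = datetime.strptime(entry['date'] + ' ' + hhmm, '%d/%m/%Y %H%M')
        minutes = (measured - treatment).total_seconds() / 60
        rows.append((minutes, entry['Contact Angle']['value']))
    return sorted(rows)

# Hypothetical sample in the same shape as the parsed dictionary
sample = {
    '1332': {'date': '05/04/2014', 'Contact Angle': {'value': 106.87, 'unit': 'deg'}},
    '1333': {'date': '05/04/2014', 'Contact Angle': {'value': 105.42, 'unit': 'deg'}},
}
treatment = datetime(2014, 4, 5, 12, 0)  # assumed treatment time
rows = minutes_since_treatment(sample, treatment)
print(rows)
```

Because the dictionary is keyed by ID alone, two measurements taken at the same HHMM on different dates would overwrite each other; keying by (date, ID) would avoid that.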

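【Discussion】:

I have gone through the code you posted and can understand most of what is going on. I was wondering now whether you could help me plot contact angle against time since treatment using matplotlib in an iPython environment?

【Solution 2】:

You clearly have records, so your data is best organized that way.

Each record in your example starts with @ (followed by, I assume, the measurement index). Each of these records has one additional field: the date listed at the top.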

from datetime import datetime

records = []
msr_date = None
# f is assumed to be the open, tab-separated input file
for line in f:
    kv = line.strip().split('\t')
    if kv[0].startswith('!'):
        # a date line applies to every record until the next date line
        msr_date = datetime.strptime(kv[0][1:], "%d/%m/%Y")
    elif kv[0].startswith('@'):
        record = {'measurement_date': msr_date}  # date is known when the record starts
        for n in range(21):  # each record has 21 quantity lines
            kv = next(f).strip().split('\t')
            quantity = kv[0].split('(')[0].strip()
            value = float(kv[1])
            record[quantity] = value
        records.append(record)
    # anything else is an empty line and is skipped

In the end you get a list of dictionaries, records. You can then loop over these to store them in a more efficient form, for example using numpy structured arrays.

# treatment_date is assumed to be a datetime defined earlier; np is numpy
arr = np.array([ (d['Contact Angle'], d['measurement_date'], d['measurement_date'] - treatment_date)
    for d in records ], dtype=[
    ('contact_angle', 'f4'),
    ('msr_date', 'datetime64[us]'),
    ('lapse_time', 'timedelta64[us]')])

Note that you will have to look up whether datetime64 is the format you need (see this SO question).

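With the final arr, everything sits neatly in "columns" that you can still access by name. For example, you can plot

plt.plot(arr['lapse_time'], arr['contact_angle'])

but you have to tell matplotlib to use timedelta values for its independent variable, as shown here for example.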
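
One way to sidestep the timedelta axis handling is to convert the elapsed times to plain floats before plotting. The following is a minimal sketch, assuming arr is a structured array with lapse_time and contact_angle fields like the one above; the helper names and the sample values are hypothetical, and matplotlib is imported only inside the plotting helper.

```python
import numpy as np

def lapse_minutes(arr):
    """Convert the timedelta64 'lapse_time' field to float minutes."""
    return arr['lapse_time'] / np.timedelta64(1, 'm')

def plot_contact_angle(arr):
    """Plot contact angle against minutes since treatment."""
    import matplotlib.pyplot as plt  # imported lazily, only needed for plotting
    plt.plot(lapse_minutes(arr), arr['contact_angle'], 'o-')
    plt.xlabel('Time since treatment (min)')
    plt.ylabel('Contact angle (deg)')
    plt.show()

# Hypothetical sample with the two fields used here
sample = np.array(
    [(106.87, np.timedelta64(92, 'm')), (105.42, np.timedelta64(93, 'm'))],
    dtype=[('contact_angle', 'f4'), ('lapse_time', 'timedelta64[m]')])
minutes = lapse_minutes(sample)
print(minutes)
```

Calling plot_contact_angle(sample) would then draw the curve on an ordinary numeric axis.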

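【Discussion】:

I haven't tried your solution yet, but I am writing this just for clarity. @1200 (for example) is just notation for a measurement taken at 12:00. The four digits after the "@" are the time in 24-hour format. Also, a date line (!07/04/2014) is added for each new date. Multiple measurements can be taken at different times on the same day.
Sorry, I got an error. On line 24, value = float(kv[1]) - ValueError: could not convert string to float.
Is there also a tab between the unit and the quantity? In that case you would have to use kv[2] instead of kv[1] everywhere.
I would also like to point out that I think @m.wasowski's solution for parsing the file is more elegant, because it gracefully keeps track of the HHMM specification (which he calls ID) and takes special care of unparsable content.
@Matthew If you get an IndexError, it is because you are trying to access an array element that doesn't exist. You can find out what is wrong (and learn how Python works) by adding print(repr(kv[1])) before that line. You should see all the numbers returned in single quotes. Using the data you provided, with the whitespace between unit and value changed to a single tab, I don't get this error, so please debug or edit the first post to reflect the actual file. However, I would recommend parsing the data with the program provided by @m-wasowski.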