Extracting timeseries tables from string into dictionary [closed]: answers

【Question title】: Extracting timeseries tables from string into dictionary [closed]
【Posted】: 2019-05-27 03:52:21
【Question description】:

I have a text file containing several time series, as shown below:

Elect Price 
(Jenkins 1989)

1960 6.64784
1961 6.95902
1962 6.8534
1963 6.95924
1964 6.77416
1965 6.96237
1966 6.94241
1967 6.50688
1968 5.72611
1969 5.45512
1970 5.2703
1971 5.75105
1972 5.26886
1973 5.06676
1975 6.14003
1976 5.44883
1977 6.49034
1978 7.17429
1979 7.87244
1980 9.20048
1981 7.35384
1982 6.44922
1983 5.44273
1984 4.3131
1985 5.27546
1986 4.99998
1987 5.78054
1988 5.65552

Hydro Electricity 
(Guyol 1969; Energy Information Administration 1995)

1958 5.74306e+009
1959 5.90702e+009
1960 6.40238e+009
1961 6.77396e+009
1962 7.12661e+009
1963 7.47073e+009
1964 7.72361e+009
1980 1.62e+010
1985 1.85e+010
1986 1.88e+010
1987 1.89e+010
1988 1.96e+010
1989 1.95e+010
1990 2.02e+010
1991 2.05e+010
1992 2.04e+010
1993 2.12e+010

Nuclear Electricity
(Guyol 1969; Energy Information Administration 1995)

1958 4.43664e+006
1959 1.34129e+007
1960 2.56183e+007
1961 4.09594e+007
1962 6.09336e+007
1963 1.09025e+008
1964 1.59522e+008
1980 6.40598e+009
1985 1.33e+010
1986 1.42e+010
1987 1.55e+010
1988 1.68e+010
1989 1.73e+010
1990 1.77e+010
1991 1.86e+010
1992 1.88e+010
1993 1.95e+010

I have loaded it as a single string, and I would like to know the best way to convert it into a dictionary of the following form:

{('Elect Price', '(Jenkins 1989)'): [(1960, 6.64784), (1961, 6.95902), (1962, 6.8534), ...], ...}

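My first instinct was to iterate over the string line by line, check each line against several regular expressions, and go from there, but I would also have to include logic for what to do after a variable name is matched, then a citation, then data, and so on.

Is there a better way to do this? Perhaps using some kind of template to extract the variable names, citations, and data? I'm sure this is a fairly common task, so I assume there are more standard methods/tools for solving it.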
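
One way to read the "template" idea above is a single multi-line regular expression that describes a whole block (name line, citation line, blank line, data rows) and is applied with re.finditer. The following is only a sketch under stated assumptions: the names BLOCK_RX and parse_blocks are hypothetical, newlines are assumed to be "\n", and every block is assumed to follow exactly the layout shown in the sample data.

```python
import re

# Hypothetical template for one block; assumes "\n" line endings.
BLOCK_RX = re.compile(
    r"^(?P<variable>[^\n(][^\n]*)\n"        # variable name line (does not start with "(")
    r"(?P<citation>\([^\n]*\))[ \t]*\n"     # citation line wrapped in parentheses
    r"[ \t]*\n"                             # blank separator line
    r"(?P<rows>(?:\d+[ \t]+\S+[ \t]*(?:\n|\Z))+)",  # consecutive "year value" lines
    re.MULTILINE,
)


def parse_blocks(text):
    """Return {(variable, citation): [(year, value), ...]} for every block found."""
    result = {}
    for m in BLOCK_RX.finditer(text):
        key = (m.group("variable").strip(), m.group("citation"))
        rows = []
        for line in m.group("rows").splitlines():
            year, value = line.split()
            rows.append((int(year), float(value)))
        result[key] = rows
    return result


# Shortened sample in the same layout as the data above.
sample = (
    "Elect Price \n(Jenkins 1989)\n\n1960 6.64784\n1961 6.95902\n\n"
    "Hydro Electricity \n(Guyol 1969; Energy Information Administration 1995)\n\n"
    "1958 5.74306e+009\n"
)
parsed = parse_blocks(sample)
print(parsed)
```

Blocks that deviate from the assumed layout (for example a missing citation line) are silently skipped by this sketch, so it would need extra checks before being used on less regular files.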

【Question discussion】:

Tags: python regex string time-series

【Solution 1】:

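You can do this with the built-in string method split. First, split on two consecutive newlines. Then iterate over the resulting list in chunks of two to format each piece separately, using split again to break on single newlines. The specific formatting should be simple, but tedious.

Something like this: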

def parse_input(s):
    # strip leading/trailing whitespace (e.g. a final newline), then
    # split by two consecutive newlines
    s = s.strip().split("\n\n")

    out = {}
    for i in range(0, len(s), 2):  # iterate in chunks of two.
        # split key by newline, remove extra spaces, and convert to tuple
        key = tuple(map(lambda x: x.strip(), s[i].split("\n")))
        # split value by newline, split each line by space, and evaluate
        # each piece of data with the builtin 'eval' function.
        value = list(map(lambda x: tuple(map(eval, x.split())), s[i + 1].split("\n")))
        out[key] = value
    return out

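Since I'm new to stackoverflow, please let me know how I can improve my answer.

【Discussion】:

I'm not sure why you used the eval function here, but I appreciate the answer. I eventually came up with something that works, but I'm still interested in seeing what others might come up with.
I just wanted a simple way to parse the first number as an int and the second number as a float.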
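
For reference, the int/float conversion mentioned in this exchange can also be done without eval by calling int and float directly on the two fields. This is a minimal sketch, not part of the original answer; the helper name parse_row is hypothetical, and each line is assumed to hold exactly two whitespace-separated fields.

```python
def parse_row(line):
    """Convert a line such as '1960 6.64784' into an (int, float) tuple."""
    year, value = line.split()  # assumes exactly two fields per line
    return int(year), float(value)


# Two rows taken from the sample data in the question.
rows = [parse_row(line) for line in "1958 5.74306e+009\n1959 5.90702e+009".split("\n")]
print(rows)
```

Unlike eval, int and float only accept numeric text, so a malformed line raises a ValueError instead of being executed as code.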

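【Solution 2】:

I eventually found a great website on parsing data stored in a similar format. I wasn't sure how to parse multi-line data with regular expressions. I didn't phrase the question that way because I didn't want to limit it to that approach, but here is what I came up with using that website: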

import re

# the order matters: _parse_line returns the first regex that matches
rx_dict = {'data': re.compile(r'^(\d+)\s'),
           'citation': re.compile(r'^(?P<citation>\(.+\))'),
           'variable': re.compile(r'^(?P<variable>[\w\s]+)$')}


def _parse_line(line):
    """
    Do a regex search against all defined regexes and
    return the key and match result of the first matching regex

    """

    for key, rx in rx_dict.items():
        match = rx.search(line)
        if match:
            return key, match
    # if there are no matches
    return None, None


def parse_file(filepath):
    """
    Parse text at given filepath

    Parameters
    ----------
    filepath : str
        Filepath for file_object to be parsed

    Returns
    -------
    data : dict
        Parsed data

    """

    data = {}  # create an empty dict to collect the data
    # open the file and read through it line by line
    with open(filepath, 'r') as file_object:
        line = file_object.readline()
        while line:
            # skip blank lines between blocks
            if not line.strip():
                line = file_object.readline()
                continue
            # at each line check for a match with a regex
            key, match = _parse_line(line)

            # extract variable name
            if key == 'variable':
                variable = match.group('variable').strip()

            # extract citation
            elif key == 'citation':
                citation = match.group('citation').strip()

            # identify beginning of data
            elif key == 'data':
                data[(variable, citation)] = [[], []]
                # read each line of the table until a blank line (or end of file)
                while line.strip():
                    # extract year and value; split() tolerates repeated whitespace
                    year, value = line.split()[:2]
                    data[(variable, citation)][0].append(int(year))
                    data[(variable, citation)][1].append(float(value))

                    line = file_object.readline()

            line = file_object.readline()

    return data


if __name__ == "__main__":
    filepath = "data_txt.txt"

    data = parse_file(filepath)

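This approach tests a set of regular expressions against each line of the text to determine whether it contains a variable name, a citation, or data. Once data is found, each line is read and processed until a blank line is reached. This gave me something close to the desired result, except that I chose to store the data in a list of lists rather than a list of tuples.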
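
If the list-of-tuples form from the question is needed after all, the two parallel lists can be paired up with zip. A minimal sketch, assuming the {key: [years, values]} layout described above; the helper name to_pairs is hypothetical.

```python
def to_pairs(data):
    """Turn {key: [years, values]} into {key: [(year, value), ...]}."""
    return {key: list(zip(years, values)) for key, (years, values) in data.items()}


# Small example in the list-of-lists layout described above.
example = {('Elect Price', '(Jenkins 1989)'): [[1960, 1961], [6.64784, 6.95902]]}
pairs = to_pairs(example)
print(pairs)
```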

【Discussion】: