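Extract data from lines of a text file: answers

【Question title】: Extract data from lines of a text file
【Posted】: 2012-11-11 08:45:33
【Question description】:

I need to extract data from the lines of a text file. The data is name and scoring information, formatted like this: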

Shyvana - 12/4/5 - Loss - 2012-11-22
Fizz - 12/4/5 - Win - 2012-11-22
Miss Fortune - 12/4/3 - Win - 2012-11-22

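The file is generated by another part of my small Python program. It asks the user for a name, checks the name against a list of names to make sure it is valid, then asks for kills, deaths, assists, and whether they won or lost. After asking for confirmation, it writes that data to the file as a new line, with the date appended at the end. The code that prepares the data: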

data = "%s - %s/%s/%s - %s - %s\n" % (
        champname, kills, deaths, assists, winloss, timestamp)

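Basically, I want to read that data back in another part of the program, show it to the user, and do calculations with it, such as the average for a specific name over time.

I'm new to Python and not very experienced with programming in general, so most of the string splitting and formatting examples I've found are too cryptic for me to adapt to what I need here. Can anyone help? I could format the written data differently so that finding the tokens is simpler, but I'd like it to stay simple in the file itself.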

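【Comments on the question】:

What data structure do you want to store it in when you read it back?
Oh wow, thank you all so much. This splitting business finally makes sense! I'll try a few of these and see what works best for me. Thanks, and happy Thanksgiving!

Tags: python string file split extract

【Solution 1】:

The following reads everything into a dictionary keyed by player name. The value associated with each player is itself a dictionary that acts as a record, with named fields whose values have been converted to formats suitable for further processing.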

info = {}
with open('scoring_info.txt') as input_file:
    for line in input_file:
        player, stats, outcome, date = (
            item.strip() for item in line.split('-', 3))
        stats = dict(zip(('kills', 'deaths', 'assists'),
                          map(int, stats.split('/'))))
        date = tuple(map(int, date.split('-')))
        info[player] = dict(zip(('stats', 'outcome', 'date'),
                                (stats, outcome, date)))

print('info:')
for player, record in info.items():
    print('  player %r:' % player)
    for field, value in record.items():
        print('    %s: %s' % (field, value))

# sample usage
player = 'Fizz'
print('\n%s had %s kills in the game' % (player, info[player]['stats']['kills']))

Output:

info:
  player 'Shyvana':
    date: (2012, 11, 22)
    outcome: Loss
    stats: {'assists': 5, 'kills': 12, 'deaths': 4}
  player 'Miss Fortune':
    date: (2012, 11, 22)
    outcome: Win
    stats: {'assists': 3, 'kills': 12, 'deaths': 4}
  player 'Fizz':
    date: (2012, 11, 22)
    outcome: Win
    stats: {'assists': 5, 'kills': 12, 'deaths': 4}

Fizz had 12 kills in the game

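Alternatively, rather than holding most of the data in dictionaries, which makes nested field access a bit awkward (info[player]['stats']['kills']), you could use a slightly more advanced "generic" class to hold them, which would let you write info2[player].stats.kills instead.

To illustrate, here is almost the same thing using a class I've named Struct, because it is somewhat like the C language's struct data type: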

class Struct(object):
    """ Generic container object """
    def __init__(self, **kwds): # keyword args define attribute names and values
        self.__dict__.update(**kwds)

info2 = {}
with open('scoring_info.txt') as input_file:
    for line in input_file:
        player, stats, outcome, date = (
            item.strip() for item in line.split('-', 3))
        stats = dict(zip(('kills', 'deaths', 'assists'),
                          map(int, stats.split('/'))))
        victory = (outcome.lower() == 'win') # change to boolean T/F
        date = dict(zip(('year','month','day'), map(int, date.split('-'))))
        info2[player] = Struct(champ_name=player, stats=Struct(**stats),
                               victory=victory, date=Struct(**date))
print('info2:')
for rec in info2.values():
    print('  player %r:' % rec.champ_name)
    print('    stats: kills=%s, deaths=%s, assists=%s' % (
          rec.stats.kills, rec.stats.deaths, rec.stats.assists))
    print('    victorious: %s' % rec.victory)
    print('    date: %d-%02d-%02d' % (rec.date.year, rec.date.month, rec.date.day))

# sample usage
player = 'Fizz'
print('\n%s had %s kills in the game' % (player, info2[player].stats.kills))

Output:

info2:
  player 'Shyvana':
    stats: kills=12, deaths=4, assists=5
    victorious: False
    date: 2012-11-22
  player 'Miss Fortune':
    stats: kills=12, deaths=4, assists=3
    victorious: True
    date: 2012-11-22
  player 'Fizz':
    stats: kills=12, deaths=4, assists=5
    victorious: True
    date: 2012-11-22

Fizz had 12 kills in the game
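
Note that both versions above keep only one record per champion, because the entry for a name is overwritten each time that name appears again in the file. For the per-name averages mentioned in the question, one possible sketch is to collect a list of games per name instead. The sample lines, the Game namedtuple, and the parse_games and average_stats helpers are illustrative assumptions, not part of the original answer:

```python
from collections import defaultdict, namedtuple

# Hypothetical record type for one game; the field names are assumptions.
Game = namedtuple('Game', 'kills deaths assists victory date')

def parse_games(lines):
    """Group parsed games by champion name, keeping every game."""
    games = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        player, stats, outcome, date = (
            item.strip() for item in line.split('-', 3))
        kills, deaths, assists = map(int, stats.split('/'))
        games[player].append(
            Game(kills, deaths, assists, outcome.lower() == 'win', date))
    return games

def average_stats(games, player):
    """Return the mean kills, deaths and assists for one champion."""
    records = games[player]  # assumes the champion has at least one game
    count = float(len(records))
    return (sum(g.kills for g in records) / count,
            sum(g.deaths for g in records) / count,
            sum(g.assists for g in records) / count)

# Made-up sample data in the question's format (normally read from the file).
sample = [
    "Fizz - 12/4/5 - Win - 2012-11-22",
    "Fizz - 6/2/9 - Loss - 2012-11-23",
    "Miss Fortune - 12/4/3 - Win - 2012-11-22",
]
games = parse_games(sample)
print('Fizz averages: %.1f/%.1f/%.1f' % average_stats(games, 'Fizz'))
```

Passing an open file object instead of the sample list should work the same way, since both yield one line per iteration.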

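【Discussion】:

This looks promising, and it works on my file. How can I get a specific stat for one player? The tutorial book I'm reading doesn't go into dictionary syntax in depth. For example, how would I print "Fizz had", kills, "in the game"?
@Kassandra: That would be print('Fizz had %s kills in the game' % info['Fizz']['stats']['kills']). There are other ways to structure the data, such as using one or more custom classes, or possibly built-in classes such as namedtuple from the collections module. Those would let you write info['Fizz'].stats.kills.
Oh wow, that sounds great. I'll give it a try and see if I can get what I want. I didn't realize I could do it this way; I had been trying to rework my file into a whole new set of functions to handle it, when I can just set a few variables. The namedtuple notation looks nice too, so I'll try that as well. Thanks again!
@Kassandra: I hesitated to put the Struct idea in my updated answer because it might be too advanced for someone new to Python. But the alternatives are multiple custom classes and/or namedtuples, which means more code, so I figured it was worth the risk. Its code is short because classes are implemented internally using dictionaries.
I seem to be having some trouble integrating a new field into this dictionary. I need it to include a gameid key and take 00523 from the line [00523] Lulu - 6/1/19 - Win - 2012-11-23. I'm not sure whether a dictionary allows this, or whether it is even what I want, but I think I want stats to be a kind of "subfolder" of gameid, so that the dictionary is structured like champion > gameid > stats > kills, deaths, assists, result, date. I'm not sure whether that can be done with this dictionary.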
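
The nested layout asked about in the last comment (champion > gameid > stats) can be expressed with dictionaries inside dictionaries. A rough sketch, assuming every line starts with a bracketed game id as in the comment's example; the parse_with_gameid helper, the info3 name, and the regular expression are illustrative and not from the original answer:

```python
import re

# Assumed line layout: "[gameid] name - k/d/a - result - yyyy-mm-dd"
LINE_RE = re.compile(
    r'^\[(\d+)\]\s*(.+?) - (\d+)/(\d+)/(\d+) - (\w+) - (\S+)$')

def parse_with_gameid(lines):
    """Build a champion -> gameid -> stats nested dictionary."""
    info = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        gameid, champ, kills, deaths, assists, result, date = match.groups()
        # setdefault creates the inner dictionary the first time a champion
        # appears; gameid stays a string so leading zeros are preserved
        info.setdefault(champ, {})[gameid] = {
            'kills': int(kills), 'deaths': int(deaths),
            'assists': int(assists), 'result': result, 'date': date,
        }
    return info

info3 = parse_with_gameid(["[00523] Lulu - 6/1/19 - Win - 2012-11-23"])
print(info3['Lulu']['00523']['assists'])
```

With this shape, every game of a champion is kept under its own id, so nothing is overwritten when the same name appears twice.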

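【Solution 2】:

You want to use split(' - ') to get the parts, and then possibly split again to get the numbers:

for line in yourfile:
    data = line.strip().split(' - ')   # strip() drops the trailing newline
    nums = [int(x) for x in data[1].split('/')]

That should give you everything you need in data[] and nums[]. Alternatively, you could use the re module and write a regular expression for it, but this format doesn't seem complex enough to need that.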
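
For comparison, the re-based alternative mentioned above could look roughly like this. The pattern is an assumption based on the three sample lines in the question and would need adjusting if the file format changes:

```python
import re

# One named group per field; the name may contain spaces ("Miss Fortune").
pattern = re.compile(
    r'(?P<name>.+?) - (?P<kills>\d+)/(?P<deaths>\d+)/(?P<assists>\d+)'
    r' - (?P<result>Win|Loss) - (?P<date>\d{4}-\d{2}-\d{2})')

match = pattern.match("Miss Fortune - 12/4/3 - Win - 2012-11-22")
if match:
    fields = match.groupdict()
    nums = [int(fields[key]) for key in ('kills', 'deaths', 'assists')]
    print(fields['name'], nums, fields['result'], fields['date'])
```

The main thing the regular expression adds over split is validation: a malformed line simply fails to match instead of raising an error during unpacking.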

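【Discussion】:

【Solution 3】:

There are two ways to read the data from your example text file.

First method

You can use Python's csv module and specify - as the delimiter. Note that the delimiter must be a single character, and the hyphens inside the date are treated as delimiters as well, so the fields need some cleanup afterwards.

See http://www.doughellmann.com/PyMOTW/csv/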
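
A minimal sketch of the csv approach, using in-memory sample lines in place of the real file. Because the delimiter is a bare hyphen, the spaces around each field are stripped afterwards and the date pieces are joined back together. It also assumes that no champion name contains a hyphen:

```python
import csv
import io

# Stand-in for the real file; io.StringIO behaves like an open text file.
sample = io.StringIO(
    "Shyvana - 12/4/5 - Loss - 2012-11-22\n"
    "Miss Fortune - 12/4/3 - Win - 2012-11-22\n")

rows = []
for row in csv.reader(sample, delimiter='-'):
    fields = [field.strip() for field in row]
    champname, score, winloss = fields[:3]
    timestamp = '-'.join(fields[3:])  # put the date back together
    rows.append((champname, score, winloss, timestamp))

print(rows)
```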

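Second method

Alternatively, if you don't want to use the csv module, you can simply use the split method after reading each line of the file as a string.

with open('myTextFile.txt', "r") as f:
    lines = f.readlines()

for line in lines:
    # words is a list of the strings on a line, delimited by " - ".
    # Splitting on " - " (with spaces) leaves the hyphens in the date alone.
    words = line.strip().split(" - ")

So in the example above, champname is the first item in the words list, that is, words[0].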

【Discussion】:

I hadn't finished writing it earlier.

【Solution 4】:

# Iterates over the lines in the file.
for line in open('data_file.txt'):
    # Strips the trailing newline, then splits the line into four elements
    # separated by ' - '. Each element is then unpacked to the correct
    # variable name.
    champname, score, winloss, timestamp = line.strip().split(' - ')

    # Since 'score' holds the string with the three values joined,
    # we need to split them again, this time using a slash as separator.
    # This results in a list of strings, so we apply the 'int' function
    # to each of them to convert to integer. This list of integers is
    # then unpacked into the kills, deaths and assists variables
    kills, deaths, assists = map(int, score.split('/'))

    # Now you are free to use the variables for whatever you want. Since
    # kills, deaths and assists are integers, you can add, multiply and
    # average them easily.

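【Discussion】:

I'm trying this, but I don't think I'm using it correctly. I'm trying to do 'info = "Miss Fortune - 12/4/3 - Win - 2012-11-22" for item in info: champname, score, winloss, timestamp = item.split(" - ") print champname'
If you want to test with a single line, use for line in ["Miss Fortune - 12/4/3 - Win - 2012-11-22"]: so that the string is inside a list rather than a bare string. Otherwise the loop reads individual characters and tries to extract the information from each of them.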
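
To make the difference concrete: iterating over a string yields one character at a time, while iterating over a list containing that string yields the whole line. A small illustration (the variable names are arbitrary):

```python
info = "Miss Fortune - 12/4/3 - Win - 2012-11-22"

# Looping over the string itself produces single characters.
first_items = [item for item in info][:4]
print(first_items)  # ['M', 'i', 's', 's']

# Wrapping the string in a list produces the full line once.
for line in [info]:
    champname, score, winloss, timestamp = line.split(" - ")
    print(champname)  # Miss Fortune
```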

【Solution 5】:

First, split the line into its pieces of data

>>> name, score, result, date = "Fizz - 12/4/5 - Win - 2012-11-22".split(' - ')
>>> name
'Fizz'
>>> score
'12/4/5'
>>> result
'Win'
>>> date
'2012-11-22'

Second, parse the score

>>> k,d,a = map(int, score.split('/'))
>>> k,d,a
(12, 4, 5)

Finally, convert the date string into a date object

>>> from datetime import datetime
>>> datetime.strptime(date, '%Y-%m-%d').date()
datetime.date(2012, 11, 22)

Now you have all the pieces parsed and normalized into proper data types.
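
Put together, the three steps could be wrapped in one helper. This is only a sketch; the parse_line name and the layout of the returned tuple are assumptions for illustration:

```python
from datetime import datetime

def parse_line(line):
    """Split one record into name, (k, d, a), result and a date object."""
    name, score, result, date = line.strip().split(' - ')
    kills, deaths, assists = map(int, score.split('/'))
    # %m is the month directive; %M would mean minutes
    played_on = datetime.strptime(date, '%Y-%m-%d').date()
    return name, (kills, deaths, assists), result, played_on

print(parse_line("Fizz - 12/4/5 - Win - 2012-11-22\n"))
```

Returning a real date object rather than a string makes it possible to sort games chronologically or filter them by date range later on.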

【Discussion】: