Applying Lambda to Recode (Tricky) Strings to Numbers

【Question title】: Applying Lambda to Recode (tricky) Strings to Numbers
【Posted】: 2017-07-14 06:02:36
【Question】:

I have a large dataset of NFL scenarios, but for the sake of illustration I have reduced it to a list of 2 observations. Like this:

data = [[scenario1],[scenario2]]

Here is what the dataset consists of:

data[0][0]
>>"It is second down and 3. The ball is on your opponent's 5 yardline. There is 3 seconds left in the fourth quarter. You are down by 3 points."

data[1][0]
>>"It is first down and 10. The ball is on your 20 yardline. There is 7 minutes left in the third quarter. You are down by 10 points."

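I can't build any model with data in a string format like this. So I'd like to recode the scenarios into new columns (or whatever you'd like to call them) as quantitative values. I figure I should set up the data frame first: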

down = 0
yards = 0
yardline = 0
seconds = 0
quarter = 0
points = 0

data = [[scenario1, down, yards, yardline, seconds, quarter, points], [scenario2, down, yards, yardline, seconds, quarter, points]]

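Now the tricky part: how do I fill the new columns with information from the scenario column? It's tricky because, for example, if the word "opponent" appears in the second sentence, the yardline has to be computed as 100 minus whatever the yardline number is. In the scenario1 variable above, that would be 100-5=95.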

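At first I thought I should separate out all the numbers and throw away the words, but as noted above, some of the words are actually needed to assign the quantitative values correctly. I have never written a lambda this nuanced. Or is lambda not the right approach? I'm open to any and all suggestions.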

To reinforce, here is what I would like to see (if I feed in scenario1):

data[0][1:]
>>2,3,95,3,4,-3

Thanks

【Question comments】:

Tags: string list python-3.x lambda

【Solution 1】:

lambda isn't the way you want to go here. Python's re module is your friend :)

from re import search

def getScenarioData(scenario):
    data = []

    ordinals_to_nums = {'first':1, 'second':2, 'third':3, 'fourth':4}
    numerals_to_nums = {
        'zero':0, 'one':1, 'two':2, 'three':3, 'four':4,
        'five':5, 'six':6, 'seven':7, 'eight':8, 'nine':9
    }

    # Downs
    match = search('(first|second|third|fourth) down and', scenario)
    if match:
        raw_downs = match.group(1)
        downs = ordinals_to_nums[raw_downs]
        data.append(downs)

    # Yards
    match = search(r'down and (\S+)\.', scenario)
    if match:
        raw_yards = match.group(1)
        data.append(int(raw_yards))

    # Yardline
    # The first group is None unless the sentence says "opponent's"
    match = search(r"(opponent's)? (\S+) yardline", scenario)
    if match:
        raw_yardline = match.groups()
        yardline = 100-int(raw_yardline[1]) if raw_yardline[0] else int(raw_yardline[1])
        data.append(yardline)

    # Seconds
    match = search(r'(\S+) (seconds|minutes) left', scenario)
    if match:
        raw_secs = match.groups()
        multiplier = 1 if raw_secs[1] == 'seconds' else 60
        data.append(int(raw_secs[0]) * multiplier)

    # Quarter
    match = search(r'(\S+) quarter', scenario)
    if match:
        raw_quarter = match.group(1)
        quarter = ordinals_to_nums[raw_quarter]
        data.append(quarter)

    # Points
    match = search(r'(up|down) by (\S+) points', scenario)
    if match:
        direction, raw_points = match.groups()
        # "up" means leading (positive), "down" means trailing (negative)
        polarity = 1 if direction == 'up' else -1
        data.append(int(raw_points) * polarity)

    return data
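
The yardline step is the one that depends on a word as well as a number, so it may help to look at it in isolation. The snippet below is only a sketch: the helper name parse_yardline is made up for illustration, and the two sentences are taken from the sample scenarios in the question.

```python
import re

def parse_yardline(sentence):
    """Return the yardline on the 0-100 scale from the question, or None if absent."""
    match = re.search(r"(opponent's)? (\S+) yardline", sentence)
    if match is None:
        return None
    on_opponent_side, number = match.groups()
    # The first group is None when "opponent's" does not appear before the number
    return 100 - int(number) if on_opponent_side else int(number)

print(parse_yardline("The ball is on your opponent's 5 yardline."))  # 95
print(parse_yardline("The ball is on your 20 yardline."))            # 20
print(parse_yardline("You are down by 3 points."))                   # None
```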

Personally, I find it a bit odd to store your data like [[scenario, <scenario_data>], ...], but to add the data to each scenario:

for s in data:
    s.extend(getScenarioData(s[0]))

I'd recommend using a list of dicts instead, since indexing like data[0][3] can get confusing a month or two down the line:

def getScenarioData(scenario):
    # instead of data = []
    data = {'scenario':scenario}

    # instead of data.append(downs)
    data['downs'] = downs

    ...

scenarios = ['...', '...']
data = [getScenarioData(s) for s in scenarios]
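
For illustration, a fuller version of the dict-based idea might look like the sketch below. The name scenario_to_dict, the key names, and the stricter \d+ patterns are choices made for this sketch, not part of the answer above. A key is simply left out when its sentence is missing from the scenario.

```python
import re

ORDINALS = {'first': 1, 'second': 2, 'third': 3, 'fourth': 4}

def scenario_to_dict(scenario):
    """Parse one scenario string into a dict; fields with no match are omitted."""
    data = {'scenario': scenario}

    match = re.search(r'(first|second|third|fourth) down and (\d+)\.', scenario)
    if match:
        data['downs'] = ORDINALS[match.group(1)]
        data['yards'] = int(match.group(2))

    match = re.search(r"(opponent's )?(\d+) yardline", scenario)
    if match:
        number = int(match.group(2))
        # Flip the scale when the ball is on the opponent's side
        data['yardline'] = 100 - number if match.group(1) else number

    match = re.search(r'(\d+) (seconds|minutes) left', scenario)
    if match:
        factor = 60 if match.group(2) == 'minutes' else 1
        data['seconds'] = int(match.group(1)) * factor

    match = re.search(r'(first|second|third|fourth) quarter', scenario)
    if match:
        data['quarter'] = ORDINALS[match.group(1)]

    match = re.search(r'(up|down) by (\d+) points', scenario)
    if match:
        sign = 1 if match.group(1) == 'up' else -1
        data['points'] = sign * int(match.group(2))

    return data

scenarios = [
    "It is second down and 3. The ball is on your opponent's 5 yardline. "
    "There is 3 seconds left in the fourth quarter. You are down by 3 points.",
    "It is first down and 10. The ball is on your 20 yardline. "
    "There is 7 minutes left in the third quarter. You are down by 10 points.",
]
data = [scenario_to_dict(s) for s in scenarios]
print(data[0]['yardline'])  # 95
print(data[1]['seconds'])   # 420
```

Named keys make it obvious which field is which, and a missing sentence no longer shifts the positions of the remaining values.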

EDIT: When you want to get a value out of the dict, use the get method to avoid raising a KeyError, since get defaults to None if the key isn't found:

for s in data:
    print(s.get('quarter'))
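
If None is not a convenient placeholder, get also accepts a second argument to use as the fallback. A small sketch, where the row dict is a made-up example of a scenario that only mentions points:

```python
row = {'scenario': 'You are down by 3 points, what do you do?', 'points': -3}

print(row.get('quarter'))     # None: the key is missing, but no KeyError is raised
print(row.get('quarter', 0))  # 0: explicit fallback value
print(row.get('points', 0))   # -3: the key exists, so the fallback is ignored
```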

【Discussion】:

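Wow, this is so useful!! I'll try the different data structures and see which works better. I'm starting to see it your way with the list of dicts.
Glad I could help :)
I'm sure it works, but I'm getting the error 'NoneType' object has no attribute 'group'. After looking through the data, I found that some scenarios are very simple, such as "You are down by 3 points, what do you do?" Should I use try/except for that? Or is there another way to default to 0 or NA? I noticed that some of your recodes have an if; maybe I should put one on all of them?
I wasn't sure how strict the syntax of the scenarios was, so I left it error-prone. I'll edit the answer to account for no match being found in each search call.
Awesome, now I see the power of search. Thanks for the get tip as well; it works perfectly now. You're the man!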
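
The error discussed in the comments appears because search returns None when the pattern is not found, and None has no group method. One possible way to handle that without try/except is a small wrapper; the name first_group and its default parameter are assumptions made for this sketch.

```python
import re

def first_group(pattern, text, default=None):
    """Return group 1 of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else default

short = "You are down by 3 points, what do you do?"
print(re.search(r'(\S+) quarter', short))          # None
print(first_group(r'(\S+) quarter', short, 'na'))  # na
print(first_group(r'by (\d+) points', short))      # 3
```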