How do you unwrap hyphen separated number ranges in a Pandas DataFrame column? (Answers)

【Question title】: How do you unwrap hyphen separated number ranges in a Pandas DataFrame column?
【Posted】: 2020-07-29 08:44:37
【Question description】:

I have a Pandas DataFrame with a column that contains comma-separated numbers, &-separated numbers, and hyphen-separated number ranges...

Title   LLFCs     Red     Amber   Green
a       15, 18    11.65   2.86    1.89
b       16 & 19   9.08    2.93    1.53
c       112-114   6.45    2.54    1.64

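I want each 'LLFC' value to have its own row, which means the numbers implied by the hyphen (113 in this example) must be expanded as well. My ideal result would look like this...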

Title   LLFCs     Red     Amber   Green
a       15        11.65   2.86    1.89
a       18        11.65   2.86    1.89
b       16        9.08    2.93    1.53
b       19        9.08    2.93    1.53
c       112       6.45    2.54    1.64
c       113       6.45    2.54    1.64
c       114       6.45    2.54    1.64

I currently have the following lines, which do everything I need except unwrapping the hyphenated values...

data1 = data1.assign(LLFCs=data1['LLFCs'].str.replace('-',', '))
data1 = data1.assign(LLFCs=data1['LLFCs'].str.replace(' & ',', '))
data1 = data1.assign(LLFCs=data1['LLFCs'].str.split(', ')).explode('LLFCs')

This code produces the following...

Title   LLFCs     Red     Amber   Green
a       15        11.65   2.86    1.89
a       18        11.65   2.86    1.89
b       16        9.08    2.93    1.53
b       19        9.08    2.93    1.53
c       112       6.45    2.54    1.64
c       114       6.45    2.54    1.64

This obviously does not include the values implied by the hyphen (113 is missing). Can anyone help me solve this?

【Comments】:

So in the hyphen case you need to expand the range?

Tags: python pandas dataframe split hyphen

【Solution 1】:

Inspired by this answer: numeric string to range

import re
import pandas as pd

data = '''Title   LLFCs     Red     Amber   Green
a       15, 18    11.65   2.86    1.89
b       16 & 19    9.08    2.93    1.53
c       112-114   6.45    2.54    1.64'''
# Columns are separated by two or more spaces, so "15, 18" stays in one cell
arr = [re.split(r"[ ][ ]+", l) for l in data.split("\n")]
df = pd.DataFrame(arr[1:], columns=arr[0])

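def f(x):
    # Drop all spaces so "15, 18" and "16 & 19" become "15,18" and "16&19"
    x = x.replace(" ", "")
    result = []
    for part in x.split(','):
        if "-" in part:
            # Hyphen: expand the inclusive range, e.g. "112-114" -> 112, 113, 114
            a, b = part.split("-")
            a, b = int(a), int(b)
            result.extend(range(a, b + 1))
        elif "&" in part:
            # Ampersand: keep both numbers as they are
            a, b = part.split("&")
            result += [int(a), int(b)]
        else:
            # Plain number
            a = int(part)
            result.append(a)
    return result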

df = df.assign(LLFCs=lambda x: [f(curr) for curr in x["LLFCs"]]).explode("LLFCs")
print(df.to_string(index=False))

Output

Title LLFCs    Red Amber Green
    a    15  11.65  2.86  1.89
    a    18  11.65  2.86  1.89
    b    16   9.08  2.93  1.53
    b    19   9.08  2.93  1.53
    c   112   6.45  2.54  1.64
    c   113   6.45  2.54  1.64
    c   114   6.45  2.54  1.64
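A slightly more compact variant of the same idea, offered only as a sketch: it splits each cell on commas and ampersands with one regular expression and then expands any hyphenated part. The function name `expand_llfcs` and the trimmed-down sample frame are assumptions made for illustration, not part of the original answer. Unlike `f` above, it also accepts more than two numbers joined by `&` and cells that mix separators.

```python
import re
import pandas as pd

def expand_llfcs(cell):
    """Expand a cell such as '15, 18', '16 & 19' or '112-114' into a list of ints."""
    values = []
    # Split on commas and ampersands; the pattern also swallows surrounding spaces.
    for part in re.split(r"\s*[,&]\s*", cell.strip()):
        if "-" in part:
            # A hyphen marks an inclusive range, e.g. '112-114' -> 112, 113, 114.
            start, end = (int(p) for p in part.split("-"))
            values.extend(range(start, end + 1))
        else:
            values.append(int(part))
    return values

# Hypothetical sample frame mirroring the question (only one numeric column kept).
df = pd.DataFrame({
    "Title": ["a", "b", "c"],
    "LLFCs": ["15, 18", "16 & 19", "112-114"],
    "Red": [11.65, 9.08, 6.45],
})
result = df.assign(LLFCs=df["LLFCs"].map(expand_llfcs)).explode("LLFCs")
print(result.to_string(index=False))
```

This assumes every range is written as `start-end` with non-negative integers; a negative number such as `-5` would be misread as a range and raise an error.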

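【Comments】:

【Solution 2】:

Split the LLFCs column and iterate over the results. If the delimiter is -, create a range of numbers; otherwise keep the two numbers as they are. After that you can explode: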

# Note that the delimiter pattern is wrapped in parentheses;
# this keeps the delimiter as part of the split result.
# Each cell must contain exactly one delimiter for the unpacking to work.
df['LLFCs'] = [tuple(range(int(first), int(last) + 1))
               if delimiter == "-" else (int(first), int(last))
               for first, delimiter, last
               in df.LLFCs.str.split("([,&-])")
              ]
df


  Title            LLFCs    Red  Amber  Green
0     a         (15, 18)  11.65   2.86   1.89
1     b         (16, 19)   9.08   2.93   1.53
2     c  (112, 113, 114)   6.45   2.54   1.64

Now you can explode:

df.explode("LLFCs")

  Title LLFCs    Red  Amber  Green
0     a    15  11.65   2.86   1.89
0     a    18  11.65   2.86   1.89
1     b    16   9.08   2.93   1.53
1     b    19   9.08   2.93   1.53
2     c   112   6.45   2.54   1.64
2     c   113   6.45   2.54   1.64
2     c   114   6.45   2.54   1.64
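As the repeated index values above show, `explode` keeps the original row labels, and the exploded column is left with `object` dtype. A possible follow-up step, sketched here on a small hypothetical frame (the names `exploded` and `tidy` are assumptions for illustration), casts the column to integers and renumbers the rows:

```python
import pandas as pd

# Hypothetical frame in the state reached just before explode.
df = pd.DataFrame({
    "Title": ["a", "c"],
    "LLFCs": [(15, 18), (112, 113, 114)],
})

exploded = df.explode("LLFCs")
print(exploded["LLFCs"].dtype)  # object: explode does not infer a numeric dtype

# Cast to integers and replace the repeated index with a fresh 0..n-1 range.
tidy = exploded.astype({"LLFCs": "int64"}).reset_index(drop=True)
print(tidy)
```

Whether to renumber the rows is a design choice: keeping the repeated index makes it easy to trace each row back to its source record, while a fresh index is safer for later label-based lookups.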

【Comments】: