【Question Title】: Extract Float Values from Strings with Conditions in Dataframe
【Posted】: 2022-01-10 03:41:16
【Question】:

I have the following dataframe of people and their test results:

df = pd.DataFrame({'name': 'A B C D E F G H I J'.split(),
                   'result': [0.30, '<0.30', '0.20', 1.20, 'less than 0.30', 'less than 0.25', 1.26, 'test 1.29', 'less than 3.30', 'more than 0.40']})

print(df)

  name          result
0    A             0.3
1    B           <0.30
2    C            0.20
3    D             1.2
4    E  less than 0.30
5    F  less than 0.25
6    G            1.26
7    H       test 1.29
8    I  less than 3.30
9    J  more than 0.40

I need to extract the float values from the result column, so I applied the following code:

df['result'] = df['result'].str.extract(r'(\d+.\d+)').astype('float')

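However, there is a problem. My threshold is 0.30, which means I need to keep the rows whose result is 0.30 or greater. By this logic, results such as `less than 0.30` and `<0.30` should be omitted. After the extraction those rows hold a plain 0.30, which is why the filter df[df['result'] >= 0.30] does not do what I want: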

 name   result
0   A   0.3
1   B   0.30 # should be omitted as it's less than 0.30
3   D   1.2
4   E   0.30 # should be omitted as it's less than 0.30
6   G   1.26
7   H   1.29
8   I   3.3  
9   J   0.40

Desired output:

  name  result
0   A   0.30
3   D   1.20
6   G   1.26
7   H   1.29
8   I   3.30
9   J   0.40

What is the smartest way to do this? Any suggestions would be appreciated. Thanks!

【Comments】:

  • If you want to keep values strictly greater than 0.30, then you should use "greater than" (>) rather than "greater than or equal to" (>=), don't you think?

Tags: python regex pandas string data-manipulation


【Solution 1】:

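I suggest transforming the data before applying any other logic. To do that, I would add two new columns, min and max. This gives you a couple of benefits:

  1. It is non-destructive, which makes debugging easier.
  2. Converting `less than`, `<`, and so on to a normalized form early makes the downstream logic simpler. If a new "rule" shows up in the data (for example, someone types `greater than`), this prevents a large-scale refactor.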
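
As a rough illustration of the second point, the sketch below keeps every spelling of a comparison in one lookup table. The names `CANONICAL_OPERATORS` and `normalize_operator` are hypothetical and are not part of the code in this answer; the idea is only that supporting a new phrase such as `greater than` becomes a one-line change.

```python
# Hypothetical lookup table: every spelling of a comparison maps to one symbol.
CANONICAL_OPERATORS = {
    'less than': '<',
    '<': '<',
    'more than': '>',
    'greater than': '>',  # a newly discovered "rule" is a one-line addition
    '>': '>',
}


def normalize_operator(prefix):
    """Return '<', '>', or '=' for the text found in front of a number."""
    # Collapse case and repeated whitespace before the lookup.
    key = ' '.join(str(prefix).lower().split())
    # Unknown or empty prefixes are treated as "equal to" in this sketch.
    return CANONICAL_OPERATORS.get(key, '=')


for prefix in ['Less  than', 'greater than', '']:
    print(repr(prefix), '->', normalize_operator(prefix))
```

Treating unknown prefixes as "equal to" is an assumption made here for brevity; raising an error or logging a warning may suit real data better.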

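The regex does a lot of the heavy lifting. If you can find a parser that handles the greater/less text, it would likely be more robust than the simple regex I wrote.

It looks more complicated than it really is; most of it is just named groups: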

Regex Demo

Regex explanation

(?P<operator>                            - named group for the text prefix
 (?P<lessOrGreater>less|more)\s+than     - looks for `less than` or `more than`
 |\<|\>                                  - OR `<` or `>`
 |.+                                     - Otherwise capture all characters prior to the num
)?                                       - the operator named group is optional (e.g. when it's just a number)
\s*                                      - optional whitespace between operator and number
(?P<num>\d+\.\d+)                        - capture number. Can't handle `,` yet.
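
To see what each named group captures, here is a small sketch that writes the same pattern with `re.VERBOSE` and prints the groups for a few inputs. The `PATTERN` constant and `parse` helper are hypothetical names introduced only for this illustration.

```python
import re

# The pattern explained above, spelled out with re.VERBOSE so it can carry comments.
PATTERN = re.compile(r'''
    (?P<operator>
        (?P<lessOrGreater>less|more)\s+than   # "less than" or "more than"
        | < | >                               # or a bare symbol
        | .+                                  # or any other leading text
    )?                                        # the whole prefix is optional
    \s*
    (?P<num>\d+\.\d+)                         # the decimal number
''', re.VERBOSE)


def parse(text):
    """Return (operator, lessOrGreater, num) for the first match in text."""
    match = PATTERN.search(text)
    return match.group('operator', 'lessOrGreater', 'num')


for sample in ['<0.30', 'less than 0.25', 'more than 0.40', 'test 1.29', '0.3', '12.5']:
    print(repr(sample), '->', parse(sample))
```

One caveat this surfaces: because the `.+` branch is tried before the optional group is skipped, a bare number with two or more integer digits is split, so `12.5` is reported as operator `1` and number `2.5`. The sample data only has single-digit integer parts, so it is unaffected, but replacing `.+` with something like `\D+` is probably safer for wider inputs.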

Python code

Python/Pandas isn't something I write every day, so I'm sure there are better ways to write this, but below is a simple solution that should help you.

import pandas as pd
import re


def addMinMax(df):
    # Optional prefix ("less than", "more than", "<", ">", or any other text), then a decimal number.
    simpleParser = re.compile(r'(?P<operator>(?P<lessOrGreater>less|more)\s+than|<|>|.+)?\s*(?P<num>\d+\.\d+)')
    min_results = []  # named so the built-in min()/max() are not shadowed
    max_results = []
    abs_min = 0
    abs_max = 1  # could also use null if you prefer unbounded.
    increments = 0.001  # using this to move anything less/greater one step away.

    for index, row in df.iterrows():
        result = str(row['result'])

        match = simpleParser.search(result)

        # no decimal number found at all, so leave both bounds empty
        if match is None:
            min_results.append(None)
            max_results.append(None)
            continue

        operator = match.group('operator')
        less_or_greater = match.group('lessOrGreater')
        num = float(match.group('num'))

        # no operator means there was no text prior to the number, so it should be equal to
        if operator is None:
            min_results.append(num)
            max_results.append(num)

        # check for less
        elif less_or_greater == 'less' or operator == '<':
            min_results.append(abs_min)
            max_results.append(num - increments)

        # check for greater (the symbol here must be '>', not '<')
        elif less_or_greater == 'more' or operator == '>':
            min_results.append(num + increments)
            max_results.append(abs_max)

        # if we're not sure, assume equal but print a warning.
        else:
            print(f'UNKNOWN INPUT: `{result}`. Assuming equal to `{num}`.')
            min_results.append(num)
            max_results.append(num)

    df['min_result'] = min_results
    df['max_result'] = max_results


df = pd.DataFrame({'name': 'A B C D E F G H I J'.split(),
                   'result': [0.30, '<0.30', '0.20', 1.20, 'less than 0.30', 'less than 0.25', 1.26, 'test 1.29',
                              'less than 3.30', 'more than 0.40']})

addMinMax(df)

print('\nEntire Dataset with new columns:\n')
print(df)

print('\n\nFilter to items greater than or equal to .3:\n')
print(df[df['min_result'] >= 0.30])

Sample output

UNKNOWN INPUT: `test 1.29`. Assuming equal to `1.29`.

Entire Dataset with new columns:

  name          result  min_result  max_result
0    A             0.3       0.300       0.300
1    B           <0.30       0.000       0.299
2    C            0.20       0.200       0.200
3    D             1.2       1.200       1.200
4    E  less than 0.30       0.000       0.299
5    F  less than 0.25       0.000       0.249
6    G            1.26       1.260       1.260
7    H       test 1.29       1.290       1.290
8    I  less than 3.30       0.000       3.299
9    J  more than 0.40       0.401       1.000


Filter to items greater than or equal to .3:

  name          result  min_result  max_result
0    A             0.3       0.300        0.30
3    D             1.2       1.200        1.20
6    G            1.26       1.260        1.26
7    H       test 1.29       1.290        1.29
9    J  more than 0.40       0.401        1.00


Good luck with your problem.
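
For comparison, here is a vectorized sketch that skips the row loop and reproduces the desired output listed in the question, which keeps `less than 3.30`; the min/max filter above drops that row because its lower bound is 0. The rule used here is an assumption read off that desired output: keep values above the threshold, and keep values equal to it only when they are not marked as "less". The names `THRESHOLD`, `parts`, `is_less`, and `filtered` are illustrative only.

```python
import pandas as pd

THRESHOLD = 0.30

df = pd.DataFrame({'name': 'A B C D E F G H I J'.split(),
                   'result': [0.30, '<0.30', '0.20', 1.20, 'less than 0.30', 'less than 0.25',
                              1.26, 'test 1.29', 'less than 3.30', 'more than 0.40']})

# Cast to str first: the .str accessor yields NaN for cells that are real floats.
text = df['result'].astype(str)

# \D*? only accepts non-digits as prefix, so multi-digit numbers stay in one piece.
parts = text.str.extract(r'^(?P<prefix>\D*?)\s*(?P<num>\d+(?:\.\d+)?)')

value = parts['num'].astype(float)
is_less = parts['prefix'].str.strip().isin(['<', 'less than'])

# Keep values above the threshold, or exactly at it when not flagged as "less".
keep = (value > THRESHOLD) | ((value == THRESHOLD) & ~is_less)

filtered = df.loc[keep, ['name']].assign(result=value[keep])
print(filtered)
```

Whether `less than 3.30` should survive the filter depends on how the bound is interpreted, so it is worth choosing between the two behaviours deliberately.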

【Comments】:
