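I'd recommend transforming the data before applying any other logic. To do that, I would add two new columns, a min and a max. This gives you a few benefits:
- It's non-destructive, which makes debugging easier.
- By converting less than, <, etc. into a normalized form early on, downstream logic becomes simpler. It also avoids a large-scale refactor if a new "rule" shows up in the data (for example, someone types greater than).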
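To illustrate the second point, here is a minimal sketch of normalizing operator text in one place. The names `OPERATOR_ALIASES` and `normalize_operator` are hypothetical and not part of the solution in this answer; the idea is only that a new phrasing such as `greater than` becomes a one-line addition rather than a change to downstream logic.

```python
from typing import Optional

# Hypothetical alias table: every spelling of an operator maps to one canonical token.
OPERATOR_ALIASES = {
    '<': 'lt',
    'less than': 'lt',
    '>': 'gt',
    'more than': 'gt',
    'greater than': 'gt',  # a new "rule" is a one-line addition here
}


def normalize_operator(text: Optional[str]) -> str:
    """Return 'lt', 'gt', 'eq' (no operator text) or 'unknown'."""
    if text is None or not text.strip():
        return 'eq'
    # Collapse repeated whitespace and ignore case before the lookup.
    key = ' '.join(text.lower().split())
    return OPERATOR_ALIASES.get(key, 'unknown')


for raw in ['<', 'Less   than', 'greater than', None, 'test']:
    print(repr(raw), '->', normalize_operator(raw))
```

Everything after this step can branch on three or four canonical tokens instead of on raw text.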
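The regex does a lot of the heavy lifting. If you can find a parser that handles greater/less text, it will likely be more robust than the simple regex I wrote.
It looks more complicated than it really is; most of it is just named groups: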
Regex Demo
Regex explanation
(?P<operator> - named group for the text prefix
(?P<lessOrGreater>less|more)\s+than - looks for `less than` or `more than`
|\<|\> - OR `<` or `>`
|.+ - Otherwise capture all characters prior to the num
)? - the operator named group is optional (e.g. when it's just a number)
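\s* - optional whitespace between operator and number
(?P<num>\d+\.\d+) - capture number. Requires a decimal point and can't handle `,` yet.
Caveat: because `.+` can also match digits, a bare number with two or more integer digits (e.g. `12.34`) is parsed as operator `1` and num `2.34`.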
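To see what the named groups capture, here is a small sketch that runs the pattern over a few inputs, including a multi-digit value such as `12.34`, which the `.+` branch splits into operator `1` and number `2.34`. The helper `parse` and the `stricter_parser` variant (which swaps `.+` for `\D+`) are illustrative assumptions of mine, not part of the original pattern, and the variant is only one possible adjustment.

```python
import re

# The pattern from the explanation above, written as a raw string.
simple_parser = re.compile(
    r'(?P<operator>(?P<lessOrGreater>less|more)\s+than|\<|\>|.+)?\s*(?P<num>\d+\.\d+)'
)

# Assumed variant: the fallback branch only accepts non-digit characters,
# so it can no longer swallow the leading digits of the number itself.
stricter_parser = re.compile(
    r'(?P<operator>(?P<lessOrGreater>less|more)\s+than|\<|\>|\D+)?\s*(?P<num>\d+\.\d+)'
)


def parse(text, parser=simple_parser):
    """Return (operator, lessOrGreater, num) or None when no number is found."""
    match = parser.search(text)
    if match is None:
        return None
    return match.group('operator'), match.group('lessOrGreater'), match.group('num')


for sample in ['0.30', '<0.30', 'less than 0.25', 'test 1.29', '12.34', 'n/a']:
    print(repr(sample), '->', parse(sample), '| stricter:', parse(sample, stricter_parser))
```

Note that `search` returns `None` for text without a decimal number (such as `n/a`), so callers need to handle that case.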
Python code
I don't write Python/Pandas every day, so I'm sure there are better ways to write this, but below is a simple solution that should help you.
import re

import pandas as pd


def addMinMax(df):
    simpleParser = re.compile(r'(?P<operator>(?P<lessOrGreater>less|more)\s+than|\<|\>|.+)?\s*(?P<num>\d+\.\d+)')
    # named min_values/max_values so the built-in min() and max() are not shadowed
    min_values = []
    max_values = []
    abs_min = 0
    abs_max = 1  # could also use None if you prefer unbounded.
    increments = 0.001  # using this to move anything less/greater one step away.

    for index, row in df.iterrows():
        result = str(row['result'])
        match = simpleParser.search(result)

        # no decimal number found at all, so leave both bounds empty
        if match is None:
            print(f'NO NUMBER FOUND: `{result}`. Leaving min/max empty.')
            min_values.append(None)
            max_values.append(None)
            continue

        operator = match.group('operator')
        less_or_greater = match.group('lessOrGreater')
        num = float(match.group('num'))

        # no operator means there was no text prior to the number, so it should be equal to
        if operator is None:
            min_values.append(num)
            max_values.append(num)
        # check for less
        elif less_or_greater == 'less' or operator == '<':
            min_values.append(abs_min)
            max_values.append(num - increments)
        # check for greater
        elif less_or_greater == 'more' or operator == '>':
            min_values.append(num + increments)
            max_values.append(abs_max)
        # if we're not sure, assume equal but print a warning.
        else:
            print(f'UNKNOWN INPUT: `{result}`. Assuming equal to `{num}`.')
            min_values.append(num)
            max_values.append(num)

    df['min_result'] = min_values
    df['max_result'] = max_values


df = pd.DataFrame({'name': 'A B C D E F G H I J'.split(),
                   'result': [0.30, '<0.30', '0.20', 1.20, 'less than 0.30', 'less than 0.25', 1.26, 'test 1.29',
                              'less than 3.30', 'more than 0.40']})

addMinMax(df)

print('\nEntire Dataset with new columns:\n')
print(df)

print('\n\nFilter to items greater than or equal to .3:\n')
print(df[df['min_result'] >= 0.30])
Sample output
UNKNOWN INPUT: `test 1.29`. Assuming equal to `1.29`.

Entire Dataset with new columns:

   name          result  min_result  max_result
0     A             0.3       0.300       0.300
1     B           <0.30       0.000       0.299
2     C            0.20       0.200       0.200
3     D             1.2       1.200       1.200
4     E  less than 0.30       0.000       0.299
5     F  less than 0.25       0.000       0.249
6     G            1.26       1.260       1.260
7     H       test 1.29       1.290       1.290
8     I  less than 3.30       0.000       3.299
9     J  more than 0.40       0.401       1.000


Filter to items greater than or equal to .3:

   name          result  min_result  max_result
0     A             0.3       0.300        0.30
3     D             1.2       1.200        1.20
6     G            1.26       1.260        1.26
7     H       test 1.29       1.290        1.29
9     J  more than 0.40       0.401        1.00

Process finished with exit code 0
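As one possible "better way", the same min/max logic can be sketched without a row-by-row loop using `Series.str.extract` and `numpy.where`. This is a minimal sketch under assumptions: the function name `add_min_max_vectorized` and its keyword parameters are hypothetical, it returns a copy instead of modifying the frame in place, and unlike the loop it prints no warning for unknown prefixes (they are silently treated as equal). Rows with no decimal number end up as NaN.

```python
import numpy as np
import pandas as pd

# Same pattern as in the loop-based solution; named groups become column names.
PATTERN = r'(?P<operator>(?P<lessOrGreater>less|more)\s+than|\<|\>|.+)?\s*(?P<num>\d+\.\d+)'


def add_min_max_vectorized(df, abs_min=0.0, abs_max=1.0, increment=0.001):
    """Return a copy of df with min_result/max_result columns (hypothetical helper)."""
    # One column per named group: operator, lessOrGreater, num.
    parts = df['result'].astype(str).str.extract(PATTERN)
    num = pd.to_numeric(parts['num'])

    # isin() yields plain booleans even where a group did not match.
    is_less = parts['lessOrGreater'].isin(['less']) | parts['operator'].isin(['<'])
    is_more = parts['lessOrGreater'].isin(['more']) | parts['operator'].isin(['>'])

    out = df.copy()
    out['min_result'] = np.where(is_less, abs_min, np.where(is_more, num + increment, num))
    out['max_result'] = np.where(is_less, num - increment, np.where(is_more, abs_max, num))
    return out


sample = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                       'result': [0.30, '<0.30', 'more than 0.40', '>0.50']})
print(add_min_max_vectorized(sample))
```

For small datasets the loop is perfectly fine; the vectorized form mainly pays off on large frames.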
Good luck with your problem.