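This is a good question. I tried using the pandas Series.str accessor, but I couldn't see how to handle this in a vectorized way, because the number of hierarchy levels can be arbitrarily large.
Here is a simple for-loop approach. It may be slow if your data is very large, but at least it works.
Use this wrapper function: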
from collections import OrderedDict
import re

def get_index(data):
    # Nested mapping: each level label maps to an OrderedDict of its children
    hierarchy = OrderedDict()
    # Insert the missing dot after the leading letter, e.g. A01 -> A.01
    data = re.sub(r"([A-Z])(\d)", r"\1.\2", data)
    results = []
    for line in data.splitlines():
        if line:
            this_hier = hierarchy
            indices = []
            for hier in line.split("."):
                if hier:
                    if hier not in this_hier:
                        this_hier.update({hier: OrderedDict()})
                    # Remove `+1` if you'd like indices start from 0
                    indices.append(list(this_hier.keys()).index(hier) + 1)
                    this_hier = this_hier[hier]
            results.append(".".join(map(str, indices)))
        else:
            # Keep empty lines so results stay aligned with the input
            results.append("")
    return results
And with your data:
data = """A
A01
A01.111
A01.236
A01.236.249
A01.236.500
A01.378
A01.378.100
A01.378.610
A01.378.610.050
A01.378.610.100
B
B01
B01.043
B01.043.075
B01.043.075.189
B01.043.075.189.250
B01.043.075.189.250.150
B01.043.075.189.250.150.160
B01.043.075.189.250.150.160.170
B01.043.075.189.250.250
B01.044
B01.043.076
B01.043.075.190
B01.043.075.189.251
B01.043.075.189.250.151
B01.043.075.189.250.150.161
B01.043.075.189.250.150.160.171
B01.043.075.189.250.251
B01.045
"""
You can compute and print the results like this:
indices = get_index(data)
for text, idx in zip(data.splitlines(), indices):
    print(f"{text:<40}{idx}")
The output will be:
A                                       1
A01                                     1.1
A01.111                                 1.1.1
A01.236                                 1.1.2
A01.236.249                             1.1.2.1
A01.236.500                             1.1.2.2
A01.378                                 1.1.3
A01.378.100                             1.1.3.1
A01.378.610                             1.1.3.2
A01.378.610.050                         1.1.3.2.1
A01.378.610.100                         1.1.3.2.2
B                                       2
B01                                     2.1
B01.043                                 2.1.1
B01.043.075                             2.1.1.1
B01.043.075.189                         2.1.1.1.1
B01.043.075.189.250                     2.1.1.1.1.1
B01.043.075.189.250.150                 2.1.1.1.1.1.1
B01.043.075.189.250.150.160             2.1.1.1.1.1.1.1
B01.043.075.189.250.150.160.170         2.1.1.1.1.1.1.1.1
B01.043.075.189.250.250                 2.1.1.1.1.1.2
B01.044                                 2.1.2
B01.043.076                             2.1.1.2
B01.043.075.190                         2.1.1.1.2
B01.043.075.189.251                     2.1.1.1.1.2
B01.043.075.189.250.151                 2.1.1.1.1.1.3
B01.043.075.189.250.150.161             2.1.1.1.1.1.1.2
B01.043.075.189.250.150.160.171         2.1.1.1.1.1.1.1.2
B01.043.075.189.250.251                 2.1.1.1.1.1.4
B01.045                                 2.1.3
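If the loop does turn out to be slow on large inputs, one likely cost is the list(...).index(...) call, which rescans all siblings on every lookup. The following is only a sketch of one possible variant, not part of the original answer: the name get_index_fast and the start parameter are made up for illustration, and it assumes the same input format as above. It stores each label's position when the label is first seen, so every level costs a single dictionary lookup.

```python
import re

def get_index_fast(data, start=1):
    """Hypothetical variant: same numbering, constant-time lookup per level."""
    # Each node maps a level label to a (position, children) pair.
    root = {}
    # Same normalization as before: A01 -> A.01
    data = re.sub(r"([A-Z])(\d)", r"\1.\2", data)
    results = []
    for line in data.splitlines():
        node = root
        indices = []
        for label in line.split("."):
            if not label:
                continue
            if label not in node:
                # Record the position once, when the label first appears.
                node[label] = (len(node) + start, {})
            position, children = node[label]
            indices.append(position)
            node = children
        # An empty line yields an empty string, keeping results aligned.
        results.append(".".join(map(str, indices)))
    return results

sample = "A\nA01\nA01.111\nA01.236\nA01.236.249\nB\nB01"
for text, idx in zip(sample.splitlines(), get_index_fast(sample)):
    print(f"{text:<40}{idx}")
```

Pass start=0 if you would like indices to start from 0.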
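Some notes:
- OrderedDict from collections is used to guarantee that keys keep their insertion order, which the index lookup depends on (a plain dict only guarantees this from Python 3.7 onward).
- The hierarchy levels in the OP's data are a bit unnatural: every level is separated by a dot (.), except the first, which has no separator, e.g. A01. So I suggest either adding a dot there yourself (e.g. A.01) or having the code add it automatically. That is why my code contains the line data = re.sub(r"([A-Z])(\d)", r"\1.\2", data). It keeps the rest of the code clean.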
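To see what that substitution does in isolation, here is a small sketch. The helper name normalize is made up for this illustration; the pattern is the one from the answer. Note that it matches any capital letter directly followed by a digit, wherever it occurs in the string, so it assumes labels of that shape only appear at the start of a code.

```python
import re

def normalize(code):
    # Insert a dot between a capital letter and the digit right after it.
    return re.sub(r"([A-Z])(\d)", r"\1.\2", code)

print(normalize("A01.236.249"))  # A.01.236.249
print(normalize("B"))            # B (no digit follows, so nothing changes)
print(normalize("A.01"))         # A.01 (already separated, left as is)
```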