My headers are in the first column of my txt file. I want to create a Pandas DF

【Question title】: My headers are in the first column of my txt file. I want to create a Pandas DF
【Posted】: 2021-03-29 07:49:50
【Question description】:

Sample data from the text file

[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email = michael.toole@123.ie
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo@sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee@email.com
[User]

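I was wondering if someone could help me; my sample dataset is shown above. What I want to do (please tell me if there is a more efficient way) is loop through the first column and, wherever one of a list of unique IDs appears (e.g. first_name, last_name, role), append the value from the corresponding row to that ID's list, repeating this for every unique ID, so that I am left with the following. I have read about multi-indexing and am not sure whether it would be a better solution, but I could not get it to work (I am very new to Python).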

[Image of the desired output, not reproduced here]

# Define a list of selected persons
selectedList = textFile
# Define a list of searching person
searchList = ['uid']
# Define an empty list
foundList = []

# Iterate each element from the selected list
for index, sList in enumerate(textFile):
  # Match the element with the element of searchList
  if sList in searchList:
    # Store the value in foundList if the match is found
    foundList.append(selectedList[index])

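【Question comments】:

What you show is not a text file but an image of a spreadsheet. I cannot guess the format of the TEXT file from it, so I cannot help. Please show the file contents as copyable text in the question itself.
Added the sample data as text.

Tags: python dataframe jupyter-notebook

【Solution 1】:

You have a text file in which each record starts with a [User] line and the data lines are in key=value format. I know of no module that can process it automatically, but it is easy to parse by hand. The code could be: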

import pandas as pd

with open('file.txt') as fd:
    data = []                          # a list of records
    for line in fd:
        line = line.strip()            # strip end of line
        if not line:                   # skip blank lines
            continue
        if line == '[User]':           # new record
            row = {}                   # row will be a key: value dict
            data.append(row)
        else:
            k, v = line.split('=', 1)  # split on the first = character only
            row[k] = v                 # key and value keep any surrounding spaces

df = pd.DataFrame(data)                # list of key: value dicts => dataframe

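With the sample data shown, we get the following. The email column appears twice because the first record's line is written as "email = ...", so its key is 'email ' with a trailing space: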

  employeeNo last_name first_name language                 email     department            role                 email Location
0        123     Toole    Michael  english   michael.toole@123.ie     Marketing  Marketing Lead                   NaN      NaN
1        456   Ronaldo       Juan  Spanish                    NaN  Data Science       Team Lead   juan.ronaldo@sms.ie    Spain
2        998       Lee     Damian  english                    NaN           NaN             NaN  damian.lee@email.com      NaN
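
The duplicated email column above comes from whitespace around the = sign. A minimal sketch of the same hand-written parser with keys and values stripped is shown here. The helper name parse_records, its marker parameter, and the shortened inline sample are assumptions made for illustration, not part of the original answer.

```python
import io
import pandas as pd

def parse_records(lines, marker="[User]"):
    """Turn marker-delimited key=value lines into a list of dicts (hypothetical helper)."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                          # ignore blank lines
        if line == marker:
            current = {}                      # start a new record
            records.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)   # split on the first = only
            # strip both sides so 'email ' and 'email' become the same key
            current[key.strip()] = value.strip()
    return [r for r in records if r]          # drop empty records, e.g. after a trailing marker

# Shortened sample with the same quirks as the question's data
sample = """[User]
employeeNo=123
email = michael.toole@123.ie
[User]
employeeNo=456
last_name= Ronaldo
email=juan.ronaldo@sms.ie
Location=Spain
[User]
"""

df = pd.DataFrame(parse_records(io.StringIO(sample)))
print(df)
```

With a real file, the same function could be fed an open file handle instead of the io.StringIO object, since both yield one line per iteration.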

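【Comments】:

【Solution 2】:

I am sure there is a more optimal way to do this, but the idea is to get a list of the unique row names, extract the values for each name in a loop, and combine them into a new data frame. Finally, the data frame is updated with the desired column names.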

import io
import pandas as pd

data = '''
[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email=michael.toole@123.ie
department=Marketing
role="Marketing Lead"
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo@sms.ie
department="Data Science"
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee@email.com
[User]
'''

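# Each key=value line becomes a row (columns 0 and 1); lines starting with '[' are skipped as comments
df = pd.read_csv(io.StringIO(data), sep='=', comment='[', header=None)

new_cols = df[0].unique()              # unique keys, in order of first appearance
new_df = pd.DataFrame()
for col in new_cols:
    # values for this key, renumbered from 0 so the columns line up by position
    tmp = df[df[0] == col].reset_index(drop=True)
    new_df = pd.concat([new_df, tmp[1]], axis=1)
new_df.columns = new_cols
new_df['User'] = None                  # placeholder column for the [User] marker
new_df = new_df[['User','employeeNo','last_name','first_name','language','email','department','role','Location']]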

new_df
   User employeeNo last_name first_name language                 email    department            role Location
0  None        123     Toole    Michael  english  michael.toole@123.ie     Marketing  Marketing Lead    Spain
1  None        456   Ronaldo       Juan  Spanish   juan.ronaldo@sms.ie  Data Science       Team Lead      NaN
2  None        998       Lee     Damian  english  damian.lee@email.com           NaN             NaN      NaN
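
In the output above, Spain appears in row 0 although that Location belongs to the second record, because the values are lined up by position rather than by record. One possible way to keep each value with its own record while still using read_csv is sketched below. The column names key, value and record, and the shortened sample, are assumptions for illustration. The sketch also assumes that no value contains an = character, since read_csv would then see too many fields.

```python
import io
import pandas as pd

sample = """[User]
employeeNo=123
role=Marketing Lead
[User]
employeeNo=456
role=Team Lead
Location=Spain
[User]
employeeNo=998
[User]
"""

# Keep the [User] lines this time: naming both columns lets read_csv accept
# the one-field marker lines (their value is NaN). dtype=str avoids type guessing.
raw = pd.read_csv(io.StringIO(sample), sep="=", header=None,
                  names=["key", "value"], dtype=str)

# Every marker line starts a new record, so a running count of markers
# gives each key=value row the id of the record it belongs to.
raw["record"] = (raw["key"] == "[User]").cumsum()

pairs = raw[raw["key"] != "[User]"].copy()
pairs["key"] = pairs["key"].str.strip()      # tolerate 'email = x' style spacing
pairs["value"] = pairs["value"].str.strip()

# One row per record, one column per key; keys missing from a record become NaN
wide = pairs.pivot(index="record", columns="key", values="value")
print(wide)
```

Naming the columns explicitly also avoids having to refer to them as 0 and 1.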

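【Comments】:

Thanks @r-beginners. This looks like what I need. I am currently getting an error: ValueError: Length mismatch: Expected axis has 0 elements, new values have 19 elements. I think this may be because of new_df = pd.DataFrame(index=[0,1]). Why do we set an index here when creating the new DF? Just trying to understand the logic. Many thanks.
The column names of my data frame are 0 and 1, so I use tmp[1] to get the data column. You need to specify it by the name of your own data column.
There is no need to set an index on the initial data frame. That was a leftover from when I was writing the code.
tmp[1] needs to be changed to textFile['your data column name'].
Thank you very much r-beginners, the code helped me a lot! The only thing I found is that the headers did not change for me (all my headers read "data"). I will play around with the code and see if I can fix that; any suggestions would be much appreciated! Thanks again.

【Solution 3】:

Rewritten after testing showed that the previous version shifted values between records.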

import pandas as pd
# Revised from the previous answer: ensures key/value pairs stay within the same
# record. The previous version assumed the first record had all the expected keys
# and inadvertently assigned the (Location) value of the second record to the
# first record, which did not have a Location key.
# This version should perform better: it deals with one single df
# and uses pandas' own pivot() function.

textFile = 'file.txt'
marker = '[User]'  # line that starts each record

# Decoration, enabling a check and balance: how many users are we processing?
with open(textFile, 'r') as fp:
    userCount = fp.read().count(marker)  # sample has 4 [User] entries, but only three actual records
print('User Count {}'.format(userCount))

# Collect one [file line number, record number, raw line] entry per line
allData = []
userSeq = 0

# Iterate through the file, tagging every line with the record (userSeq) it belongs to
with open(textFile, 'r') as fp:
    for fileLineSeq, line in enumerate(fp):
        if marker in line:
            userSeq += 1  # a new [User] line starts the next record
        allData.append([fileLineSeq, userSeq, line])

df = pd.DataFrame(allData, columns=['FileRow', 'UserSeq', 'KeyValue'])
userSeparators = df[df['KeyValue'] == marker + '\n'].index  # locate the [User] lines
df.drop(userSeparators, inplace=True)  # remove the [User] lines
df = df.replace(' = ', '=', regex=True)  # input data is dirty: normalise ' = ' to '='
df = df.replace('\n', '', regex=True)  # remove the newline kept on each line read from the file

# print(df)  # test as necessary here

# split the KeyValue column into two, at the first '=' only
df[['Key', 'Value']] = df.KeyValue.str.split('=', n=1, expand=True)
# very powerful function: convert the long key/value list into a table
df = df.pivot(index='UserSeq', columns='Key', values='Value')
print(df)

Result

User Count 4
Key     Location    department                 email employeeNo first_name language last_name            role
UserSeq                                                                                                      
1            NaN     Marketing  michael.toole@123.ie        123    Michael  english     Toole  Marketing Lead
2          Spain  Data Science   juan.ronaldo@sms.ie        456       Juan  Spanish   Ronaldo       Team Lead
3            NaN           NaN  damian.lee@email.com        998     Damian  english       Lee             NaN
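
In the result above, pivot() has sorted the columns alphabetically, so Location comes first. If the order in which the keys first appear in the file is preferred, one option is to reindex the columns afterwards. The sketch below assumes a small hand-made frame (long_df, a hypothetical name) shaped like the key/value frame that exists just before the pivot() call.

```python
import pandas as pd

# Hypothetical long-format frame: one row per key/value pair, tagged by record
long_df = pd.DataFrame({
    "UserSeq": [1, 1, 2, 2, 2],
    "Key": ["employeeNo", "last_name", "employeeNo", "last_name", "Location"],
    "Value": ["123", "Toole", "456", "Ronaldo", "Spain"],
})

wide = long_df.pivot(index="UserSeq", columns="Key", values="Value")

# Keys in order of first appearance, used to undo pivot()'s alphabetical sort
first_seen = long_df["Key"].drop_duplicates().tolist()
wide = wide.reindex(columns=first_seen)

wide.columns.name = None             # drop the 'Key' label above the header
wide = wide.reset_index(drop=True)   # plain 0..n-1 index instead of UserSeq
print(wide)
```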

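【Comments】:

Hi @irnerd, I would like to come back to this if I may. Note that the first user has no Location attribute, so I should get NaN, but the code iterates down the list and picks up the next Location value it can find (which actually belongs to the second user). Is there a way to stop that from happening?
Hi @sqlworrier, apologies for the delay. You may have solved this already, but if not, see the new answer posted on the same date as this comment.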