【Question title】: My headers are in the first column of my txt file. I want to create a Pandas DF
【Posted】: 2021-03-29 07:49:50
【Question】:

Sample data from the text file

[User]
employeeNo=123
last_name=Toole
first_name=Michael
language=english
email = michael.toole@123.ie
department=Marketing
role=Marketing Lead
[User]
employeeNo=456
last_name= Ronaldo
first_name=Juan
language=Spanish
email=juan.ronaldo@sms.ie
department=Data Science
role=Team Lead
Location=Spain
[User]
employeeNo=998
last_name=Lee
first_name=Damian
language=english
email=damian.lee@email.com
[User]

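I was wondering if someone could help me; you can see my sample dataset above. What I would like to do (please tell me if there is a more efficient way) is loop through the first column and, wherever one of a list of unique IDs appears (e.g. first_name, last_name, role), append the value from the corresponding row to that ID's list, repeating this for each unique ID, so that I am left with the following. I have read about multi-indexing and I am not sure whether it would be a better solution, but I could not get it to work (I am very new to Python).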

[image: desired output table]

# Define a list of selected persons
selectedList = textFile
# Define a list of searching person
searchList = ['uid']
# Define an empty list
foundList = []

# Iterate each element from the selected list
for index, sList in enumerate(textFile):
  # Match the element with the element of searchList
  if sList in searchList:
    # Store the value in foundList if the match is found
    foundList.append(selectedList[index])

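【Comments】:

  • What you have shown is not a text file but an image of a spreadsheet. I cannot guess the format of the TEXT file from it, so I cannot help you. Please show the file content as copyable text, in the question itself.
  • Added the sample data as a text file.

Tags: python dataframe jupyter-notebook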


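【Solution 1】:

You have a text file in which every record starts with a [User] line and the data lines have a key=value format. I know of no module that can process it automatically, but it is easy to parse by hand. The code could be: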

import pandas as pd

data = []                                  # a list of records
row = None                                 # current record; None until the first [User]
with open('file.txt') as fd:
    for line in fd:
        line = line.strip()                # strip whitespace and the end of line
        if not line:                       # skip blank lines
            continue
        if line == '[User]':               # new record
            row = {}                       # row will be a key: value dict
            data.append(row)
        elif row is not None and '=' in line:
            k, v = line.split('=', 1)      # split on the first = character
            row[k.strip()] = v.strip()     # 'email = x' and 'email=x' give the same key

data = [row for row in data if row]        # drop empty records, e.g. a trailing [User]
df = pd.DataFrame(data)                    # list of key: value dicts => dataframe

With the sample data shown, we get:

  employeeNo last_name first_name language                 email    department            role Location
0        123     Toole    Michael  english  michael.toole@123.ie     Marketing  Marketing Lead      NaN
1        456   Ronaldo       Juan  Spanish   juan.ronaldo@sms.ie  Data Science       Team Lead    Spain
2        998       Lee     Damian  english  damian.lee@email.com           NaN             NaN      NaN
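
To illustrate the last step (list of key: value dicts => dataframe): pandas builds the columns from the union of the dictionary keys, and a record that lacks a key gets NaN in that column. A minimal sketch with two hand-written records; the name `records` and the trimmed values are only for illustration:

```python
import pandas as pd

# Two illustrative records; the second one has no 'role' key
records = [
    {'employeeNo': '123', 'role': 'Marketing Lead'},
    {'employeeNo': '998'},
]

demo = pd.DataFrame(records)                  # columns are the union of all keys
missing_role = demo['role'].isna().tolist()   # True where a record had no 'role'

print(list(demo.columns))
print(missing_role)
```

This is why records with different sets of keys can be collected in one list without padding them first.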

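【Comments】:

    【Solution 2】:

    I am sure there is a more optimal way to do this, but the idea is to get a list of the unique row names, extract each one in a loop, and combine them into a new dataframe. Finally, the result is updated with the desired column names.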

    import pandas as pd
    import numpy as np
    import io
    
    data = '''
    [User]
    employeeNo=123
    last_name=Toole
    first_name=Michael
    language=english
    email=michael.toole@123.ie
    department=Marketing
    role="Marketing Lead"
    [User]
    employeeNo=456
    last_name= Ronaldo
    first_name=Juan
    language=Spanish
    email=juan.ronaldo@sms.ie
    department="Data Science"
    role=Team Lead
    Location=Spain
    [User]
    employeeNo=998
    last_name=Lee
    first_name=Damian
    language=english
    email=damian.lee@email.com
    [User]
    '''
    
    df = pd.read_csv(io.StringIO(data), sep='=', comment='[', header=None)
    
    new_cols = df[0].unique()
    new_df = pd.DataFrame()
    for col in new_cols:
        tmp = df[df[0] == col]
        tmp.reset_index(inplace=True)
        new_df = pd.concat([new_df, tmp[1]], axis=1)
    new_df.columns = new_cols
    new_df['User'] = None
    new_df = new_df[['User','employeeNo','last_name','first_name','language','email','department','role','Location']]
    
    new_df
       User employeeNo last_name first_name language                 email    department            role Location
    0  None        123     Toole    Michael  english  michael.toole@123.ie     Marketing  Marketing Lead    Spain
    1  None        456   Ronaldo       Juan  Spanish   juan.ronaldo@sms.ie  Data Science       Team Lead      NaN
    2  None        998       Lee     Damian  english  damian.lee@email.com           NaN             NaN      NaN
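
One thing to watch in the output above: Spain ends up in row 0, although Location belongs to the second record in the sample data. The loop lines values up by their position within each key, so a key that is missing from an earlier record shifts later values upward. A possible variant, sketched under the assumption that every record starts with a [User] line, keeps those marker lines and uses a running count of them as a record id. The names `sample`, `rec` and `wide`, and the shortened input, are mine and not part of the answer:

```python
import io
import pandas as pd

# Shortened, illustrative version of the sample data
sample = '''[User]
employeeNo=123
last_name=Toole
[User]
employeeNo=456
last_name= Ronaldo
Location=Spain
[User]
employeeNo=998
[User]
'''

# No comment='[' here: the [User] lines are kept as record separators.
# A [User] line has no '=' so its 'value' field is read as NaN.
df = pd.read_csv(io.StringIO(sample), sep='=', header=None,
                 names=['key', 'value'], dtype=str)

is_marker = df['key'] == '[User]'
df['rec'] = is_marker.cumsum()            # same id for all lines of one record
df = df[~is_marker].copy()                # drop the marker lines themselves
df['key'] = df['key'].str.strip()
df['value'] = df['value'].str.strip()

# One row per record, one column per key; absent keys become NaN
wide = df.pivot(index='rec', columns='key', values='value')
print(wide)
```

Because every value carries its record id, a key missing from one record can no longer pull a value in from another record.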
    

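    【Comments】:

    • Thanks @r-beginners. This looks like what I need. I am currently getting an error: ValueError: Length mismatch: Expected axis has 0 elements, new values have 19 elements. I think it might be because of new_df = pd.DataFrame(index=[0,1])? Why do we index here when creating the new DF? Just trying to understand the logic here, many thanks.
    • The column names of my dataframe are 0 and 1, so I use tmp[1] to get the data column. You need to specify it by your own data column name.
    • There is no need to index the initial dataframe. That was a leftover from writing the code.
    • tmp[1] needs to be changed to textFile['your data column name']
    • Thanks very much, r-beginners, the code helped me a lot! The only thing I found is that the headers did not change for me (I have "data" in all of the headers). I will play around with the code and see if I can fix that; any suggestions would be much appreciated! Thanks again.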
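
On the column-name point in the comments above: with header=None, read_csv labels the columns with the integers 0 and 1, which is why the answer indexes tmp[1]. A small sketch with a hypothetical two-line input, showing those default labels and how names= assigns explicit ones instead (the labels `key` and `value` are my own choice):

```python
import io
import pandas as pd

text = 'employeeNo=123\nlast_name=Toole\n'   # hypothetical two-line input

# header=None: pandas assigns the integer column labels 0 and 1
by_position = pd.read_csv(io.StringIO(text), sep='=', header=None)
default_labels = list(by_position.columns)

# names=[...]: the same data with explicit column labels
by_name = pd.read_csv(io.StringIO(text), sep='=', header=None,
                      names=['key', 'value'])
keys = by_name['key'].tolist()

print(default_labels)
print(keys)
```

If a dataframe was loaded some other way, printing its columns first shows which label to use in place of 0 and 1.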
    【Solution 3】:

    Rewritten after testing showed that the previous version shifted values into the wrong records

    import pandas as pd
    # Revised from previous answer - ensures key value pairs are contained to the same
    # record - previous version assumed the first record had all the expected keys -
    # inadvertently assigned (Location) value of second record to the first record
    # which did not have a Location key
    # This version should perform better - only dealing with one single df
    #  - and using pandas own pivot() function

    textFile = 'file.txt'
    marker = '[User]'  # record separator (named marker so the builtin filter() is not shadowed)

    # Decoration - enabling a check and balance - how many users are we processing?
    with open(textFile, 'r') as textFileOpened:
        initialRead = textFileOpened.read()
    userCount = initialRead.count(marker)  # sample has 4 [User] entries - but only three actual unique records
    print('User Count {}'.format(userCount))

    # Collect one [FileRow, UserSeq, KeyValue] entry per line of the file
    allData = []
    userSeq = 0

    # Iterate through file - assign record key and [userSeq] Key to each pair
    with open(textFile, 'r') as fp:
        for fileLineSeq, line in enumerate(fp):
            if marker in line:
                userSeq = userSeq + 1  # Ensures each key value pair is grouped
            allData.append([fileLineSeq, userSeq, line])

    df = pd.DataFrame(allData, columns=['FileRow', 'UserSeq', 'KeyValue'])

    # Input data dirty - cleaning up: remove newlines and surrounding whitespace
    df['KeyValue'] = df['KeyValue'].str.strip()
    # Keep key=value lines only - this removes the [User] records and any blank lines
    df = df[df['KeyValue'].str.contains('=', regex=False)].copy()

    # print(df) # Test as necessary here

    # split KeyValue column into two, on the first '=' only
    df[['Key', 'Value']] = df['KeyValue'].str.split('=', n=1, expand=True)
    df['Key'] = df['Key'].str.strip()      # 'email = x' and 'email=x' give the same key
    df['Value'] = df['Value'].str.strip()
    # very powerful function - convert to table
    df = df.pivot(index='UserSeq', columns='Key', values='Value')
    print(df)
    

    Result

    User Count 4
    Key     Location    department                 email employeeNo first_name language last_name            role
    UserSeq                                                                                                      
    1            NaN     Marketing  michael.toole@123.ie        123    Michael  english     Toole  Marketing Lead
    2          Spain  Data Science   juan.ronaldo@sms.ie        456       Juan  Spanish   Ronaldo       Team Lead
    3            NaN           NaN  damian.lee@email.com        998     Damian  english       Lee             NaN
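
For readers new to pivot(), here is a small sketch of what that last step does, using a hand-written long table rather than the file (the frame `long_df` and its values are illustrative only). Each distinct Key becomes a column, each UserSeq becomes a row, and combinations that never occur are filled with NaN. The last two statements are optional tidying and are not part of the answer above:

```python
import pandas as pd

# Long format: one row per (record, key, value) triple
long_df = pd.DataFrame({
    'UserSeq': [1, 1, 2, 2, 2],
    'Key': ['employeeNo', 'last_name', 'employeeNo', 'last_name', 'Location'],
    'Value': ['123', 'Toole', '456', 'Ronaldo', 'Spain'],
})

# Wide format: one row per UserSeq, one column per Key
wide = long_df.pivot(index='UserSeq', columns='Key', values='Value')

wide.columns.name = None             # drop the 'Key' label above the columns
wide = wide.reset_index(drop=True)   # replace UserSeq with a plain 0..n-1 index
print(wide)
```

pivot() raises an error if the same Key occurs twice within one UserSeq, which is a useful check that the grouping by [User] marker worked.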
    

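    【Comments】:

    • Hi @irnerd, I would like to come back to this if I may. Note that the first user has no Location attribute, so I should get NaN, but the code iterates down the list and grabs the second Location value it can find (which is actually associated with the second user). Is there a way to stop this from happening?
    • Hi @sqlworrier - apologies for the delay - you may have solved this already, but if not, see the new answer posted on the same date as this comment.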