【Question Title】: Find duplicate values in text file via python
【Posted】: 2019-04-22 20:24:07
【Question】:

I'm looking for a Pythonic way to find duplicate values in a text file.

1||mike||jones||38||first street||2018-05-01
2||michale||jones||38||8th street||2018-05-01
3||mich||jones||38||9th street||2018-05-01
4||mitchel||jones||38||10th street||2018-05-01
1||mike||jones||38||first street||2018-12-01

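I'm trying to find duplicates in the id column and keep the most recent one. Do I just loop over the ids, append each one to a list, and then check whether the value is already in the list?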

【Comments】:

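  • What do you mean by keeping the most recent? Do you want to remove all duplicates except the latest one? Please show the final text file you want to end up with.
  • Maybe this question will help.
  • Sorry, by most recent I mean using the date in the last column. For example, column 0 is the id. The file contains a duplicate id 1, but in this case the last entry has a newer date than the first id 1, so I need that instance rather than the first one. Of course it could be anywhere in the file; it is not always the last entry.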

Tags: python-3.x


【Solution 1】:
import pandas as pd

#Write the sample data from the question to a file
with open("sample.txt", "w") as f:
    f.write("1||mike||jones||38||first street||2018-05-01\n2||michale||jones||38||8th street||2018-05-01\n3||mich||jones||38||9th street||2018-05-01\n4||mitchel||jones||38||10th street||2018-05-01\n1||mike||jones||38||first street||2018-12-01")

#Read the "||"-delimited file; parse_dates gives the date field a datetime64 dtype
#(read_csv does not accept a datetime dtype directly, and a regex separator needs engine="python")
columns = ["id", "firstName", "lastName", "age", "address", "applicationDate"]
tbl = pd.read_csv("sample.txt", sep=r"\|\|", engine="python", names=columns,
                  dtype={"id": int, "firstName": str, "lastName": str, "age": int, "address": str},
                  parse_dates=["applicationDate"])


#Note:
#Records with id=2,3,4 are distinct based on address.
#Only the record with id=1 is a duplicate, so the source system already takes care of identifying duplicate registrations.
#We only need to identify duplicates by id and pick the most recent record by application date (no need to re-implement any duplicate-identification logic).


for record_id in sorted(set(tbl["id"])):
    #Rank the rows with this id by applicationDate, newest first.
    #method="first" breaks ties so that exactly one row gets rank 1.
    dateRank = tbl.loc[tbl["id"] == record_id, "applicationDate"].rank(ascending=False, method="first")

    #Note: rank orders values according to the column's dtype.
    #That is why "applicationDate" needs to be datetime64:
    #ranking in descending order then gives the most recent record rank == 1.

    #Get the index label of the most recent record in the original DataFrame
    recentRowIndex = dateRank.loc[dateRank == 1].index[0]

    print(tbl.loc[recentRowIndex])


#Note: Update the code inside the for loop as convenient to write the final result set to a file, another DataFrame, or a database.
#You can run this code directly and check the result set.
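
The per-id loop above can also be expressed without an explicit loop. The following is only a sketch of one alternative, assuming the same column names as in this answer; the helper name `latest_per_id` and the in-memory `SAMPLE` string are made up for illustration. It uses `groupby(...).idxmax()` to find, for each id, the index label of the row with the newest `applicationDate`.

```python
import io

import pandas as pd

# Sample data from the question, kept in memory instead of a file (illustrative only)
SAMPLE = (
    "1||mike||jones||38||first street||2018-05-01\n"
    "2||michale||jones||38||8th street||2018-05-01\n"
    "3||mich||jones||38||9th street||2018-05-01\n"
    "4||mitchel||jones||38||10th street||2018-05-01\n"
    "1||mike||jones||38||first street||2018-12-01\n"
)

COLUMNS = ["id", "firstName", "lastName", "age", "address", "applicationDate"]


def latest_per_id(source):
    """Return one row per id: the row with the newest applicationDate."""
    tbl = pd.read_csv(
        source,
        sep=r"\|\|",          # regex separator, so the python engine is required
        engine="python",
        names=COLUMNS,
        parse_dates=["applicationDate"],
    )
    # For each id, the index label of the row holding the maximum date
    newest = tbl.groupby("id")["applicationDate"].idxmax()
    return tbl.loc[newest].reset_index(drop=True)


result = latest_per_id(io.StringIO(SAMPLE))
print(result)
```

If two rows for the same id share the newest date, `idxmax` picks the first of them, so exactly one row per id is returned.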

【Comments】:

    【Solution 2】:

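    Pandas is a very powerful library for performing analysis operations in very few lines of code.

    Basically, pandas is an open-source Python package that provides many tools for data analysis. Some of its basic advantages and uses are listed below:

    1. It can present data in a way that is suitable for data analysis.
    2. The package contains a variety of convenient methods for filtering data.
    3. It has a variety of utilities for performing input/output operations.

    Implementing your use case with pandas:

    First, install pandas with pip install pandas.

    i/p > a text file containing the input data in the given format

    o/p > the required output as a text file in csv format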

    
    import pandas as pd

    headers = ["id", "first_name", "last_name", "age", "address", "date"]
    with open("input") as file:     # Read input
        # Read the "||"-delimited data into a data frame, parsing the date column
        data_frame = pd.read_csv(file, sep='[|][|]', names=headers, header=None, parse_dates=['date'],
                                 engine="python")
    data_frame.sort_values("date", ascending=False, inplace=True)     # Sort newest first
    data_frame.drop_duplicates(subset="id", inplace=True)     # Keep the first (newest) row per id
    data_frame.to_csv('output', sep=',', header=False, index=False)   # Write output without header row or index column
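
For comparison, the idea from the question (checking whether an id has already been seen) can be sketched without pandas. This is a minimal sketch, not part of either answer: it assumes the id is the first field, the date is the last field in ISO format (YYYY-MM-DD), and the function name `latest_records` is hypothetical. A dict keyed by id replaces the list, so each lookup is a single step and the newer record can overwrite the older one.

```python
from datetime import date


def latest_records(lines):
    """Map each id to the fields of its row with the newest date."""
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("||")
        record_id = fields[0]
        record_date = date.fromisoformat(fields[-1])
        kept = latest.get(record_id)
        # Keep this row if the id is new or this row is more recent
        if kept is None or record_date > date.fromisoformat(kept[-1]):
            latest[record_id] = fields
    return latest


sample = [
    "1||mike||jones||38||first street||2018-05-01",
    "2||michale||jones||38||8th street||2018-05-01",
    "3||mich||jones||38||9th street||2018-05-01",
    "4||mitchel||jones||38||10th street||2018-05-01",
    "1||mike||jones||38||first street||2018-12-01",
]

result = latest_records(sample)
for fields in result.values():
    print("||".join(fields))
```

To run it against a file, pass the open file object instead of the list, since iterating a file yields its lines.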
    
    

    【Comments】:
