Find duplicate values in a text file via Python: answers

【Question title】: Find duplicate values in text file via python
【Posted】: 2019-04-22 20:24:07
【Question description】:

Looking for a Pythonic way to find duplicate values in a text file.

1||mike||jones||38||first street||2018-05-01
2||michale||jones||38||8th street||2018-05-01
3||mich||jones||38||9th street||2018-05-01
4||mitchel||jones||38||10th street||2018-05-01
1||mike||jones||38||first street||2018-12-01

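I am trying to find duplicates in the id column and keep the most recent one. Do I just loop over the rows, insert each id into a list, and then check whether the value is already in the list?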

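【Question comments】:

What do you mean by keeping the most recent? Do you want to remove all duplicates except the latest one? Please provide the final text file you want to get.
Maybe this question will help.
Sorry, by most recent I mean using the date in the last column. For example, column 0 is the id. The file has a duplicate id 1, and in this case the last entry has a newer date than the first id 1, so I need that instance rather than the first one. Of course it could be anywhere in the file; it is not always the last entry.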
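
The requirement clarified in this comment (for each id, keep the row whose last-column date is newest, wherever it sits in the file) can be sketched in plain Python with a dict keyed by id instead of a list. This is only a minimal sketch: the names `SAMPLE` and `latest_per_id` are hypothetical, and it assumes every row ends with a date in `YYYY-MM-DD` form, as in the sample data.

```python
from datetime import date

# Sample rows copied from the question (hypothetical constant name).
SAMPLE = """\
1||mike||jones||38||first street||2018-05-01
2||michale||jones||38||8th street||2018-05-01
3||mich||jones||38||9th street||2018-05-01
4||mitchel||jones||38||10th street||2018-05-01
1||mike||jones||38||first street||2018-12-01
"""


def latest_per_id(lines, sep="||"):
    """Return {id: fields}, keeping for each id the row with the newest date.

    Hypothetical helper; assumes column 0 is the id and the last column
    is an ISO date (YYYY-MM-DD).
    """
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        record_id = fields[0]
        record_date = date.fromisoformat(fields[-1])
        kept = latest.get(record_id)
        # Store the row if the id is new or this row has a newer date.
        if kept is None or record_date > date.fromisoformat(kept[-1]):
            latest[record_id] = fields
    return latest


result = latest_per_id(SAMPLE.splitlines())
for fields in result.values():
    print("||".join(fields))
```

A dict lookup is constant time per row, whereas checking membership in a list rescans the list for every row; the dict also lets the stored row be replaced when a newer date turns up later in the file.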

Tags: python-3.x

【Solution 1】:

import pandas as pd

# Write the sample data from the question to a file
with open("sample.txt", "w") as f:
    f.write("1||mike||jones||38||first street||2018-05-01\n2||michale||jones||38||8th street||2018-05-01\n3||mich||jones||38||9th street||2018-05-01\n4||mitchel||jones||38||10th street||2018-05-01\n1||mike||jones||38||first street||2018-12-01")

# Read the "||"-delimited file. The separator is a regex, so it needs a raw string and the python engine.
# np.int and np.str no longer exist in current NumPy, so the built-in int and str are used instead.
# read_csv does not accept a datetime64 dtype; the date field is parsed with parse_dates.
tbl = pd.read_csv("sample.txt", sep=r"\|\|", engine="python",
                  names=("id", "firstName", "lastName", "age", "address", "applicationDate"),
                  dtype={"id": int, "firstName": str, "lastName": str, "age": int, "address": str},
                  parse_dates=["applicationDate"])


# Note:
# Records with id 2, 3 and 4 are distinct (they differ by address).
# Only the record with id 1 is duplicated, so the source system already takes care of identifying duplicate registrations.
# We therefore only need to find duplicates by id and pick the most recent record by application date
# (no need to re-implement any duplicate-identification logic).


for record_id in set(tbl["id"]):
    # Rank the application dates of the rows with this id in descending order.
    # method="first" breaks ties, so exactly one row gets rank 1 even if two dates are equal.
    date_rank = tbl.loc[tbl["id"] == record_id, "applicationDate"].rank(ascending=False, method="first")

    # Note: rank orders values according to the column's data type.
    # The "applicationDate" column must therefore be a datetime type, not a string,
    # so that ranking in descending order gives the most recent record rank 1.

    # Get the index label of the most recent record in the original DataFrame
    recent_row_index = date_rank.loc[date_rank == 1].index[0]

    # The value is an index label, so use .loc rather than .iloc
    print(tbl.loc[recent_row_index])


# Note: adapt the body of the for loop to write the final result set to a file, another DataFrame, or a database.
# The code can be run as-is to check the result set.
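
The per-id loop above can also be written as a single `groupby` that collects the result into a DataFrame. The following is a minimal sketch rather than the answer author's method: it assumes the same column names, reads the sample rows from an in-memory string (the variable `raw` is hypothetical) so it needs no file on disk, and uses `idxmax` to find the index label of the newest date within each id.

```python
import io

import pandas as pd

# Sample rows from the question, held in memory (hypothetical variable name).
raw = (
    "1||mike||jones||38||first street||2018-05-01\n"
    "2||michale||jones||38||8th street||2018-05-01\n"
    "3||mich||jones||38||9th street||2018-05-01\n"
    "4||mitchel||jones||38||10th street||2018-05-01\n"
    "1||mike||jones||38||first street||2018-12-01\n"
)

columns = ["id", "firstName", "lastName", "age", "address", "applicationDate"]
tbl = pd.read_csv(
    io.StringIO(raw),
    sep=r"\|\|",          # regex separator, requires the python engine
    engine="python",
    names=columns,
    parse_dates=["applicationDate"],
)

# For each id, idxmax returns the index label of the row with the latest date.
latest_index = tbl.groupby("id")["applicationDate"].idxmax()

# Select those rows and renumber them from 0.
latest_rows = tbl.loc[latest_index].reset_index(drop=True)
print(latest_rows)
```

If two rows for the same id share the same latest date, `idxmax` returns the first of them, so exactly one row per id is kept.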

【Comments】:

【Solution 2】:

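Pandas is a powerful library that can carry out this kind of analysis in very few lines of code.

pandas is an open-source Python package that provides many tools for data analysis. Some of its basic advantages and uses are listed below:

It represents data in a form suited to data analysis.
The package contains several convenient methods for filtering data.
It has a variety of utilities for input/output operations.

Implementing your use case with pandas

First install pandas with pip install pandas

Input: a text file containing the input data in the given format

Output: a text file in CSV format containing the required result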


import pandas as pd

with open("input") as file:     # Read input
    headers = ["id", "first_name", "last_name", "age", "address", "date"]
    # Read the "||"-delimited data into a data frame; the regex separator requires the python engine
    data_frame = pd.read_csv(file, sep='[|][|]', names=headers, header=None, parse_dates=['date'],
                             engine="python")
    data_frame.sort_values("date", ascending=False, inplace=True)     # Sort by date, newest first
    data_frame.drop_duplicates(subset="id", inplace=True)     # Keep the first (newest) row for each id
    # Generate the output; index=False leaves out the row index, which is not part of the input data
    data_frame.to_csv('output', sep=',', header=False, index=False)
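
The output written above is comma-separated. If the result should keep the original `||` delimiter instead, `to_csv` cannot be used directly, because its `sep` argument must be a single character. A minimal sketch of one workaround follows. It is an assumption-laden illustration, not part of the original answer: the variables `raw` and `output_text` are hypothetical, every column is read as a string so the fields are written back unchanged, and sorting relies on ISO `YYYY-MM-DD` dates ordering the same way as text and as dates.

```python
import io

import pandas as pd

# Sample rows from the question, held in memory (hypothetical variable name).
raw = (
    "1||mike||jones||38||first street||2018-05-01\n"
    "2||michale||jones||38||8th street||2018-05-01\n"
    "3||mich||jones||38||9th street||2018-05-01\n"
    "4||mitchel||jones||38||10th street||2018-05-01\n"
    "1||mike||jones||38||first street||2018-12-01\n"
)

headers = ["id", "first_name", "last_name", "age", "address", "date"]

# dtype=str keeps every field exactly as it appears in the input.
data_frame = pd.read_csv(io.StringIO(raw), sep="[|][|]", names=headers,
                         engine="python", dtype=str)

# Newest date first, then keep the first (newest) row for each id.
data_frame = data_frame.sort_values("date", ascending=False)
data_frame = data_frame.drop_duplicates(subset="id")

# Restore the order in which the surviving rows appeared in the input.
data_frame = data_frame.sort_index()

# Join the fields of each row with the original two-character delimiter.
lines = ["||".join(row) for row in data_frame.itertuples(index=False)]
output_text = "\n".join(lines) + "\n"
print(output_text, end="")
```

The resulting text can then be written out with an ordinary `open(..., "w")` call in place of `to_csv`.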

【Comments】: