【Question title】: How to read CSV files line by line and remove rows with mismatched datatypes using pyspark/python?
【Posted】: 2022-09-22 02:15:51
【Question】:

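My CSV files contain bad records. I already have the inferred (final) datatype for each column of these files. Whenever a value does not match the inferred datatype of its column, the file should be read line by line and the mismatched record (row) removed.

My requirements are the following steps:

1. Read the CSV line by line.
2. Find the datatype of each column and compare it with the inferred column datatype.
3. If there is a mismatch, remove the record (row); otherwise keep all good rows in a valid_record_csv file.

Inferred (final) column datatypes to compare the files against: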
name: string
date: date
phone: long
col1: integer
col2: double
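
The steps and the datatype list above could be sketched with only the Python standard library, roughly as follows. This is a minimal sketch under several assumptions, not a definitive implementation: the file is comma-delimited, its header names match the keys of the hypothetical `SCHEMA` dict, dates look like year-month-day, a string column accepts any text, and blank cells are accepted (the sample output keeps a row whose col1 is blank). The names `SCHEMA`, `row_is_valid`, and `filter_valid_rows` are made up for illustration.

```python
import csv
import io
from datetime import datetime


def is_date(text):
    """True if text parses as a year-month-day date (assumed format)."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def is_int(text):
    """True if text parses as an integer (also used for the long column)."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def is_float(text):
    """True if float() accepts text; note float() also accepts 'nan' and 'inf'."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Hypothetical schema: column name -> checker. A string column accepts any text.
SCHEMA = {
    "name": lambda text: True,
    "date": is_date,
    "phone": is_int,
    "col1": is_int,
    "col2": is_float,
}


def row_is_valid(row, schema=SCHEMA):
    """Check one parsed record against the schema."""
    if None in row:  # the line had more fields than the header
        return False
    for column, check in schema.items():
        cell = row.get(column)
        if cell is None:  # the line had fewer fields than the header
            return False
        cell = cell.strip()
        if cell and not check(cell):  # blank cells are accepted (assumption)
            return False
    return True


def filter_valid_rows(source, sink, schema=SCHEMA):
    """Copy the header and every valid record from source to sink; return the count kept."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:  # records are read one at a time
        if row_is_valid(row, schema):
            writer.writerow(row)
            kept += 1
    return kept


if __name__ == "__main__":
    # Tiny in-memory demo; real files would be opened with newline="".
    demo_in = io.StringIO("name,date,phone,col1,col2\nBB,GRENADA,BBCC,,PASS\n")
    demo_out = io.StringIO()
    print(filter_valid_rows(demo_in, demo_out), "row(s) kept")
```

With files on disk, both handles would be opened with `newline=""`, as the `csv` module documentation recommends. To reject blank cells instead, the `if cell and not check(cell)` test can be changed to `if not cell or not check(cell)`.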

SAMPLE INPUT csv file

Name   date         phone         col1     col2
124    PANAMA       440894563              0.9800
BB     2022-9-23    449035667              9.08765
BB     GRENADA       BBCC                  PASS

SAMPLE OUTPUT csv file:
One column (col1) is empty/blank even though its inferred (final) datatype is integer. Another column (col2) has the double datatype.

NAME   DATE       PHONE       col1    col2
BB     2022      449035667            9.08765
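  • Do you need this in pyspark only? Would you accept a solution that uses pandas?
  • @khaledkoubaa Pandas is fine too.
  • Does this answer your question? Drop rows in spark which dont follow schema
  • @Chris Thanks for the link! But I want a solution in pyspark or python/pandas.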

Tags: python python-3.x pyspark apache-spark-sql


【Solution 1】:

Pandas solution:

import re
import pandas as pd

#0. read the csv file (supposing you have a csv file named 'INPUT.csv');
#   dtype=str keeps every cell as text, so the checks below see the raw values

df = pd.read_csv('INPUT.csv', dtype=str)

df
    Name    date        phone       col1    col2
0   124     PANAMA      440894563   PASS    901
1   BB      2022-9-23   449035667   520     9.08765
2   BB      GRENADA     BBCC        0.536   PASS

#1. deal with columns (cleaning):
df['date'] = pd.to_datetime(df['date'], errors='coerce') # date column: keep only valid dates
df['phone'] = pd.to_numeric(df['phone'], errors='coerce').astype(pd.Int64Dtype()) # phone column: keep only valid integers
df['Name'] = df['Name'].map(lambda x: x if isinstance(x, str) and re.match(r"^[a-zA-Z\s]+$", x) else None) # Name column: keep only strings that contain nothing but letters and spaces

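#a. cleaning pure integers
def f_integer(x):
    text = str(x)  # also handles missing values and cells pandas already parsed as numbers
    if '.' in text:
        return None  # a decimal point means it is not a pure integer
    try:
        return int(text)
    except ValueError:
        return None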

df['col1'] = df['col1'].map(f_integer).astype(pd.Int64Dtype())

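#b. cleaning pure doubles
def f_double(x):
    text = str(x)  # also handles missing values and cells pandas already parsed as numbers
    if '.' not in text:
        return None  # a pure double must contain a decimal point
    try:
        return float(text)  # 0.0 is a valid double and is kept
    except ValueError:
        return None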
df['col2'] = df['col2'].map(f_double)

#-->
    Name    date        phone       col1    col2
0   None    NaT         440894563   <NA>    NaN
1   BB      2022-09-23  449035667   520     9.08765
2   BB      NaT         <NA>        <NA>    NaN

#2. remove the record(row) with mismatch (assign the result back, otherwise df is unchanged):
df = df.loc[df['Name'].notna() & df['date'].notna() & df['phone'].notna() & df['col1'].notna() & df['col2'].notna()]

#-->
df
    Name    date        phone       col1    col2
1   BB      2022-09-23  449035667   520     9.08765

#3. save the cleaned dataframe to csv
df.to_csv('OUTPUT.csv', index=False)
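
For the "pure integer" and "pure double" checks, a regex-based variant is one possible alternative to the try/except helpers. The sketch below is only an illustration with hypothetical names (`parse_pure_int`, `parse_pure_double`). It assumes each cell arrives as a string, that anything else (such as a missing value) should map to None, and that a double must contain a decimal point.

```python
import re

# Optional sign followed by digits only.
_INT_RE = re.compile(r"[+-]?\d+")
# Optional sign, digits with a mandatory decimal point, optional exponent.
_DOUBLE_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?")


def parse_pure_int(cell):
    """Return the cell as int if it is integer text, else None."""
    if isinstance(cell, str) and _INT_RE.fullmatch(cell.strip()):
        return int(cell.strip())
    return None


def parse_pure_double(cell):
    """Return the cell as float if it is decimal text with a point, else None."""
    if isinstance(cell, str) and _DOUBLE_RE.fullmatch(cell.strip()):
        return float(cell.strip())
    return None
```

Both functions take a single value, so they could be passed to `Series.map` in place of `f_integer` and `f_double`.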

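【Discussion】:

  • Hi @khaledkoubaa, thank you very much for your support! I tried to do the same for some double and integer datatypes. Could you help me with that? Please see the requirements. I need to remove the rows where those datatypes do not match as well.
  • I have updated the question above with two more datatypes, including an empty column. Could you help with this use case?
  • col1 has to be a pure integer, right?
  • You are right. It should be the integer datatype.
  • @Anos I updated the answer, let me know your comments.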