Answers: Python apply a lambda function to a CSV file (large files)

【Question title】: Python apply a lambda function into a csv file (Large files)
【Posted】: 2021-05-30 06:07:42
【Question】:

I want to use Python to apply this function, hideEmail, to a specific column of my CSV file (a large file).

Example function:

import re

def hideEmail(email):
    # hide email: replace every character except '@' and '.' with 'x'
    text = re.sub(r'[^@.]', 'x', email)
    return text
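
As a quick illustration of what the function does, here is a minimal sketch that repeats the definition together with the `re` import it needs. Every character other than `@` and `.` becomes `x`, so the shape of the address stays visible while its content is hidden.

```python
import re

def hideEmail(email):
    # Replace every character except '@' and '.' with 'x'
    return re.sub(r'[^@.]', 'x', email)

masked = hideEmail('test@test.com')
print(masked)  # xxxx@xxxx.xxx
```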

CSV file (large file, > 1 GB):

    id;Name;firstName;email;profession
    100;toto;tata;test@test.com;developer
    101;titi;tete;test@test.com;doctor
    ..
    ..

【Question comments】:

Tags: python pandas dataframe csv

【Solution 1】:

It is hard to be specific without seeing the dataframe, but you can try:

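import pandas as pd  # import pandas

# read the data; the sample file in the question is semicolon-separated
df = pd.read_csv('enter_file_path_here', sep=';')

# apply hideEmail (defined in the question) to the email column
df['email'] = df['email'].apply(lambda x: hideEmail(x))
# if you want to write it back to a csv:
df.to_csv('name.csv', sep=';', index=False)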
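
Because the file in the question is larger than 1 GB, loading it in one go may use a lot of memory. One possible variation, sketched here under assumptions, is to let pandas read the file in chunks and append each processed chunk to the output. The function name `mask_column_in_chunks` and the default `chunksize` are illustrative choices, not part of the original answer.

```python
import re

import pandas as pd

def hideEmail(email):
    # Replace every character except '@' and '.' with 'x'
    return re.sub(r'[^@.]', 'x', email)

def mask_column_in_chunks(src_path, dst_path, column='email', chunksize=100_000):
    """Stream src_path through pandas in chunks, masking one column.

    column and chunksize are illustrative defaults, not values from the question.
    """
    reader = pd.read_csv(
        src_path,
        sep=';',                # the sample file is semicolon-separated
        dtype=str,              # keep every field as text
        keep_default_na=False,  # empty fields stay '' instead of becoming NaN
        chunksize=chunksize,    # yields DataFrames of at most this many rows
    )
    for n, chunk in enumerate(reader):
        chunk[column] = chunk[column].map(hideEmail)
        chunk.to_csv(
            dst_path,
            sep=';',
            index=False,
            mode='w' if n == 0 else 'a',  # overwrite on the first chunk, append afterwards
            header=(n == 0),              # write the header row only once
        )
```

A call such as `mask_column_in_chunks('in.csv', 'out.csv')` would then keep only one chunk in memory at a time; the file names are placeholders.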

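【Comments】:

The question is how to apply it to a CSV file, not to a pandas dataframe. I think you should also include how to read and write the pandas dataframe.
Right, I'll edit it accordingly :)
I think this question is about a pandas DataFrame.
There is no need for a lambda here.

【Solution 2】:

Load the CSV data into a DataFrame: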

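import pandas as pd

df = pd.read_csv(r'/path/to/csv', sep=';')

Then you can use pd.Series.str.replace directly on the email column. Pass regex=True explicitly: older pandas releases treated the pattern as a regular expression by default, but since pandas 2.0 the default is regex=False:

df['email'] = df['email'].astype(str).str.replace(r'[^@.]', 'x', regex=True)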

That said, if all you want is to modify one large CSV file, pandas may be overkill. You might have a look at sed. Here is an example:

sed -E 's/(\w+)@(\w+)/xxx@xxx/' /path/to/file.csv > /path/to/new_file.csv
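
For readers who need to stay in Python, the same stream-editing idea can be sketched with `re.sub` and a callback, so that only email-like substrings are passed to `hideEmail` and the rest of each line is left alone. The pattern `EMAIL_RE` and the helper names below are assumptions made for illustration; the pattern is deliberately simple and is not a full email validator.

```python
import re

def hideEmail(email):
    # Replace every character except '@' and '.' with 'x'
    return re.sub(r'[^@.]', 'x', email)

# Hypothetical, deliberately simple pattern for "something@something"
EMAIL_RE = re.compile(r'[\w.+-]+@[\w.-]+')

def mask_emails_in_line(line):
    """Mask only the email-like substrings of a line, leaving the rest intact."""
    return EMAIL_RE.sub(lambda m: hideEmail(m.group(0)), line)

def mask_file(src_path, dst_path):
    """Stream src_path to dst_path one line at a time, like a sed filter."""
    with open(src_path, 'r', newline='') as src, \
         open(dst_path, 'w', newline='') as dst:
        for line in src:
            dst.write(mask_emails_in_line(line))
```

Because the callback receives each match, any other Python function could be used in place of `hideEmail`, which is the flexibility a fixed sed replacement string does not offer.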

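【Comments】:

Thanks @FelipeLanza, but I have other Python functions to apply, and unfortunately they are not regex-based, so I can't use sed.
It certainly does support regular expressions. Have a look here: gnu.org/software/sed/manual/sed.html#sed-regular-expressions.

【Solution 3】:

Using pandas

You can use pandas (as described in an earlier question) to apply a function passed as an argument.

To export the resulting dataframe, use the to_csv function.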

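import re

import pandas as pd

def hideEmail(email):
    # hide email: replace every character except '@' and '.' with 'x'
    text = re.sub(r'[^@.]', 'x', email)
    return text


column_name = "email"

# sep=';' matches the semicolon-separated sample file in the question
df = pd.read_csv(r'Path of your CSV file\File Name.csv', sep=';')
df[column_name] = df[column_name].map(hideEmail)
df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv', sep=';', index=False)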

【Comments】:

【Solution 4】:

You can use the built-in map() function to map the function over every line of the file:

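import re

def hideEmail(email):
    # hide email: replace every character except '@' and '.' with 'x'
    text = re.sub(r'[^@.]', 'x', email)
    return text

with open('file.csv', 'r') as r:
    # Strip the newline first, otherwise it would be masked as 'x' too.
    # Caveat: the whole line is masked, separators included, not just one column.
    r = map(hideEmail, [line.rstrip('\n') for line in r.readlines()])

with open('file2.csv', 'w') as f:
    for line in r:
        f.write(line + '\n')

Edit (thanks to juanpa.arrivillaga for pointing this out):

r.readlines() can be replaced with just r, because a file object can be iterated line by line directly.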
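
To target a single named column while still streaming the file, one option is the standard `csv` module. The sketch below is an assumption-laden illustration, not the answer author's code: the function name `transform_columns` and the `transforms` dictionary (column name to function) are invented here, and rows are assumed to be well formed, with the same number of fields as the header.

```python
import csv
import re

def hideEmail(email):
    # Replace every character except '@' and '.' with 'x'
    return re.sub(r'[^@.]', 'x', email)

def transform_columns(src_path, dst_path, transforms, delimiter=';'):
    """Stream a delimited file row by row, applying transforms[column] to each named column.

    transforms is a dict such as {'email': hideEmail}; its name and shape are assumptions.
    """
    with open(src_path, 'r', newline='') as src, \
         open(dst_path, 'w', newline='') as dst:
        reader = csv.DictReader(src, delimiter=delimiter)
        writer = csv.DictWriter(
            dst,
            fieldnames=reader.fieldnames,  # reuse the header row of the input
            delimiter=delimiter,
            lineterminator='\n',           # the csv module defaults to '\r\n'
        )
        writer.writeheader()
        for row in reader:
            for column, func in transforms.items():
                row[column] = func(row[column])
            writer.writerow(row)
```

A call such as `transform_columns('file.csv', 'file2.csv', {'email': hideEmail})` would mask only the email column, and further column/function pairs can be added to the dictionary.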

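【Comments】:

There is no need for r.readlines(); just r = map(hideEmail, r) works.
@juanpa.arrivillaga Thanks for letting me know.
@AnnZen How do I specify the column name to apply my lambda function to?
This replaces everything in the line that is not @ or ., so this solution is definitely missing the column/field aspect of the delimited input.

【Solution 5】:

You can do it by splitting each line on the ; delimiter and masking only the email field, as follows: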

import re

def hideEmail(email):
    # hide email: replace every character except '@' and '.' with 'x'
    text = re.sub(r'[^@.]', 'x', email)
    return text


with open('path/to/csvfile', 'r') as file:
    # skip blank lines so that every row has an email field
    lines = [l.strip().split(';') for l in file.readlines() if l.strip()]

modifiedlines = [';'.join(lines[0])]   # keep the header row unchanged

for i in lines[1:]:         # iterating from index 1 as index 0 is header
    i[3] = hideEmail(i[3])       # as email field is at index 3
    modifiedlines.append(';'.join(i))     # appending modified line

with open('path/to/csvfile', 'w') as file:
    file.write('\n'.join(modifiedlines) + '\n')   # writing the lines back, one per line
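
The code above reads the whole file into memory and then overwrites the original path. For a file over 1 GB, a streaming variant may be safer. The following is only a sketch under assumptions: the helper name `mask_field_in_place` and its parameters are hypothetical, and it writes to a temporary file in the same directory, replacing the original only once every line has been processed.

```python
import os
import re
import tempfile

def hideEmail(email):
    # Replace every character except '@' and '.' with 'x'
    return re.sub(r'[^@.]', 'x', email)

def mask_field_in_place(path, field_index=3, delimiter=';'):
    """Rewrite path with one field masked, streaming through a temporary file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as tmp, open(path, 'r') as src:
            tmp.write(src.readline())              # header row is copied unchanged
            for line in src:
                fields = line.rstrip('\n').split(delimiter)
                if len(fields) > field_index:      # leave short or blank lines as they are
                    fields[field_index] = hideEmail(fields[field_index])
                tmp.write(delimiter.join(fields) + '\n')
        os.replace(tmp_path, path)                 # swap in the new file only after success
    except Exception:
        os.remove(tmp_path)                        # clean up the partial file on failure
        raise
```

If processing fails midway, the original file is left untouched. Note that `tempfile.mkstemp` creates the file with owner-only permissions, so the replaced file may not keep the permissions of the original.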

【Comments】: