How to read data from an Excel or CSV file based on data availability? (Answers)

[Question title]: How to read data from an Excel or CSV file based on data availability?
[Posted]: 2017-04-19 15:24:14
[Question description]:

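I have two kinds of files, Excel and CSV, from which I read data. Both have two permanent columns, Question and Answer, and two optional columns, Word and Replacement, which may or may not be present.

I have written separate functions for reading CSV and Excel files; the appropriate one is called based on the file extension.

Is there a way to read the optional columns (Word and Replacement) when they are present and skip them when they are not? Please see the function definitions below:

1) For CSV files: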

def read_csv_file(path):
    quesData = []
    ansData = []
    asciiIgnoreQues = []
    qWithoutPunctuation = []
    colnames = ['Question','Answer']
    data = pandas.read_csv(path, names = colnames)
    quesData = data.Question.tolist()
    ansData = data.Answer.tolist()
    qWithoutPunctuation = quesData

    qWithoutPunctuation = [''.join(c for c in s if c not in string.punctuation) for s in qWithoutPunctuation]

    for x in qWithoutPunctuation:
        asciiIgnoreQues.append(x.encode('ascii','ignore'))

    return asciiIgnoreQues, ansData, quesData

2) Function for reading Excel data:

def read_excel_file(path):
    book = open_workbook(path)
    sheet = book.sheet_by_index(0)
    quesData = []
    ansData = []
    asciiIgnoreQues = []
    qWithoutPunctuation = []

    for row in range(1, sheet.nrows):
        quesData.append(sheet.cell(row,0).value)
        ansData.append(sheet.cell(row,1).value)

    qWithoutPunctuation = quesData
    qWithoutPunctuation = [''.join(c for c in s if c not in string.punctuation) for s in qWithoutPunctuation]

    for x in qWithoutPunctuation:
        asciiIgnoreQues.append(x.encode('ascii','ignore'))

    return asciiIgnoreQues, ansData, quesData

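[Comments]:

Have you considered pandas.read_csv and pandas.read_excel? They automatically read whichever columns are present.
@tmrlvi, I do use pandas.read_csv in my CSV-reading function, but the column headers have to be supplied in colnames. What if I don't have the Word and Replacement columns?
You don't have to supply them. If you don't, pandas infers the names. Or does your data not contain a header?
My data contains a header (Question, Answer, Word, Replacement). So you're saying that if I don't supply colnames in the code, pandas will start reading from the second row?
It reads from the second row anyway, unless you pass header=None.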
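
To make the header discussion concrete, here is a minimal sketch of how read_csv treats the first row in three cases. The sample text in csv_text is invented for illustration. One detail worth checking against the pandas documentation: when names is passed without header, the first row is treated as data, so a file that already has a header row needs header=0 (or no names at all).

```python
import io
import pandas as pd

# Made-up sample: one header row and one data row.
csv_text = "Question,Answer\nWhat is 2+2?,4\n"

# No `names`: the first row is used as the header.
inferred = pd.read_csv(io.StringIO(csv_text))

# `names` without `header`: the first row is kept as a data row.
named = pd.read_csv(io.StringIO(csv_text), names=["Question", "Answer"])

# `names` with header=0: the file's own header row is replaced.
replaced = pd.read_csv(io.StringIO(csv_text), names=["Question", "Answer"], header=0)

print(len(inferred), len(named), len(replaced))  # 1 2 1
```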

Tags: python excel csv pandas

[Solution 1]:

I'm not entirely sure what you are trying to achieve, but reading and transforming your data the pandas way is done as follows:

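import string
import pandas as pd

def read_file(path, typ):
    if typ == "excel":
        # First sheet (the default); older pandas releases spelled this `sheetname`
        df = pd.read_excel(path, sheet_name=0)
    else: # Assuming "csv". You can make it explicit
        df = pd.read_csv(path)

    # Strip punctuation from every question
    qWithoutPunctuation = df["Question"].apply(lambda s: ''.join(c for c in s if c not in string.punctuation))
    # Drop non-ASCII characters (this yields bytes objects on Python 3)
    df["asciiIgnoreQues"] = qWithoutPunctuation.apply(lambda x: x.encode('ascii','ignore'))

    return df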

# Call it like this:
read_file("file1.csv","csv")
read_file("file2.xls","excel")
read_file("file2.xlsx","excel")
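
Since the question mentions choosing the reader by file extension, the typ argument could also be derived from the path. The following is only a sketch: guess_type is a hypothetical helper name, and the set of Excel extensions is an assumption rather than an exhaustive list.

```python
import os

# Assumed set of Excel extensions; extend as needed.
EXCEL_EXTENSIONS = {".xls", ".xlsx"}

def guess_type(path):
    """Return "excel" or "csv" based on the file extension (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    if ext in EXCEL_EXTENSIONS:
        return "excel"
    if ext == ".csv":
        return "csv"
    raise ValueError("Unsupported file extension: " + repr(ext))

print(guess_type("file1.csv"))   # csv
print(guess_type("file2.XLSX"))  # excel
```

The result can then be passed as the second argument, for example read_file(path, guess_type(path)).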

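If your data does not include Word and Replacement, this returns a DataFrame with the columns ["Question", "Answer", "asciiIgnoreQues"]; if it does include them, the columns are ["Question", "Answer", "Word", "Replacement", "asciiIgnoreQues"] (the original columns keep their order from the file).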
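
If later code needs to branch on whether the optional columns were present, one possible approach is to check df.columns after reading. This is a sketch under assumptions: split_optional and OPTIONAL_COLUMNS are illustrative names, the column names are taken from the question, and the sample rows are invented.

```python
import io
import pandas as pd

# Optional column names as described in the question.
OPTIONAL_COLUMNS = ["Word", "Replacement"]

def split_optional(df):
    """Return (required columns, optional columns or None). Hypothetical helper."""
    present = [c for c in OPTIONAL_COLUMNS if c in df.columns]
    required = df[["Question", "Answer"]]
    optional = df[present] if present else None
    return required, optional

# Invented sample data: one file with the optional columns, one without.
with_extra = pd.read_csv(io.StringIO("Question,Answer,Word,Replacement\nHi?,Hello,hi,hello\n"))
without_extra = pd.read_csv(io.StringIO("Question,Answer\nHi?,Hello\n"))

_, extra_a = split_optional(with_extra)
_, extra_b = split_optional(without_extra)
print(list(extra_a.columns))  # ['Word', 'Replacement']
print(extra_b)                # None
```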

Note that I used apply, which lets you run a function element-wise over an entire Series.
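
As a small illustration of apply on a Series, the punctuation-stripping step can be tried on its own. The two sample strings are made up, and strip_punctuation is simply a named version of the lambda used in the answer.

```python
import string
import pandas as pd

# Invented sample questions.
questions = pd.Series(["What's up?", "Hello, world!"])

def strip_punctuation(s):
    """Remove every character listed in string.punctuation."""
    return "".join(c for c in s if c not in string.punctuation)

# apply calls strip_punctuation once per element and returns a new Series.
cleaned = questions.apply(strip_punctuation)
print(cleaned.tolist())  # ['Whats up', 'Hello world']
```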
