【Question Title】: Getting length of columns in pandas dataframe without header
【Posted】: 2019-09-06 18:29:18
【Question】:

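I have a newbie question!

I have a pandas dataframe whose source is a comma-separated csv file. The file has no header.

I need to know the number of columns in each row, and after that I need to delete the rows whose length is greater than some value (for example, 5).

What I have: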

1,2,3,4,5,6

1,2,3

9,6,8

1,2,3,5,6

Desired output:

1,2,3

9,6,8

I searched some questions and answers, for example:

Delete rows from a pandas DataFrame based on a conditional expression involving len(string) giving KeyError

Select row using the length of list in pandas cell

How to remove a row from pandas dataframe based on the length of the column values?

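But as far as I understand, they always use some column name to do the filtering, and since the file has no header and the number of columns varies from row to row, I don't see how to do it.

Can you help?

Thanks in advance!

【Comments】:

Tags: python-3.x pandas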

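【Solution 1】:

I see three possibilities.

1. Read the file twice (the first time to count the fields, the second time to read it into pandas using the skiprows argument)
2. Read it into memory, filter out the invalid lines, and pass the result to pandas using StringIO
3. Read it into pandas with all columns (or the number of desired columns + 1), then only allow rows whose excess columns contain NaN

The following examples use the variable len_threshold, which should be set to the number of columns allowed per row, and your_file_name, which should contain the name of the csv text file.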
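To try the snippets below on the data from the question, the two variables could be set up roughly as follows. This is only a sketch: the scratch file name `ragged_sample.csv` and the value 3 for `len_threshold` are assumptions chosen to match the desired output, not something the answer prescribes.

```python
import os
import tempfile

# Sample rows from the question, one ragged record per line.
sample_rows = ["1,2,3,4,5,6", "1,2,3", "9,6,8", "1,2,3,5,6"]

# Hypothetical location: a scratch file in the system temp directory.
your_file_name = os.path.join(tempfile.gettempdir(), "ragged_sample.csv")
with open(your_file_name, "wt") as fp:
    fp.write("\n".join(sample_rows) + "\n")

# Rows may have at most this many columns (assumed value for the example).
len_threshold = 3
```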

Method 1: read the file twice

For convenience, you can use pandas to do this. Like this:

import pandas as pd

# first pass: count the separators on each line of the file
# (recent pandas versions reject sep='\n', so the lines are read directly)
with open(your_file_name, 'rt') as fp:
    counts = pd.Series([line.count(',') for line in fp])
# a line with n fields has n - 1 separators, so every line with
# len_threshold or more separators has too many fields and is skipped
rows_to_skip = counts[counts >= len_threshold].index.to_list()
# second pass: read only the remaining lines
df = pd.read_csv(your_file_name, names=list(range(len_threshold)), index_col=False, skiprows=rows_to_skip)

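Note that to apply this method you should make sure your fields don't contain the separator, because it doesn't check whether a comma is inside quoted text.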
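If quoted fields may contain commas, one way around this limitation is to count fields with Python's `csv` module instead of counting raw commas. The helper below is a sketch, not part of the original answer; the name `find_rows_to_skip` is hypothetical, and it assumes that no quoted field spans several lines, since otherwise record numbers and line numbers no longer match.

```python
import csv


def find_rows_to_skip(file_name, len_threshold):
    """Return 0-based indices of lines with more than len_threshold fields."""
    rows_to_skip = []
    with open(file_name, "rt", newline="") as fp:
        # csv.reader honours quoting, so '"a,b",2,3' counts as three fields
        for line_number, fields in enumerate(csv.reader(fp)):
            if len(fields) > len_threshold:
                rows_to_skip.append(line_number)
    return rows_to_skip
```

The returned list could then be passed as `skiprows` to `pd.read_csv`, in the same way as `rows_to_skip` in the snippet above.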

Method 2: reading into memory / variant: reading line by line into pandas

import io
import pandas as pd

string_buffer = io.StringIO()
with open(your_file_name, 'rt') as fp:
    for line in fp:
        # a line with n fields contains n - 1 separators,
        # so keep only lines with at most len_threshold fields
        if line.count(',') < len_threshold:
            string_buffer.write(line)
# "rewind" the string_buffer in order to read it from its start
string_buffer.seek(0)
df = pd.read_csv(string_buffer, names=list(range(len_threshold)), index_col=False)

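Note that, as above, to apply this method you should make sure your fields don't contain the separator, because it doesn't check whether a comma is inside quoted text. It needs more memory, so it is not suitable for very large files. However, you can also use a variant of it that, instead of writing the correct lines to a string buffer, reads them into pandas one by one with read_csv. That way you don't have to worry about type conversion either, but pandas may have trouble guessing the types correctly when it only sees one line at a time. If you already know the desired column types, you can of course pass them. The variant looks like this: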

import io
import pandas as pd

df = pd.DataFrame([], columns=range(len_threshold))
df_len = 0
with open(your_file_name, 'rt') as fp:
    for line in fp:
        # skip blank lines and lines with more than len_threshold fields
        if line.strip() and line.count(',') < len_threshold:
            # parse the single line and append it as the next row
            tmp_df = pd.read_csv(io.StringIO(line), names=list(range(len_threshold)), index_col=False)
            df.loc[df_len] = tmp_df.iloc[0]
            df_len += 1

Method 3: read into a dataframe, then filter out the incorrect rows

This is the simplest method.

import pandas as pd

# read the whole dataframe with all columns
df = pd.read_csv(your_file_name, header=None, index_col=False)
# define an indexer that considers a row good if it has
# nothing but `NaN` in the excess columns
if len(df.columns) > len_threshold:
    good_rows = df.iloc[:, len_threshold:].isna().all(axis='columns')
    # drop the rows that have values in the excess columns
    df.drop(df[~good_rows].index, inplace=True)
    # drop the excess columns themselves
    df.drop(df.columns[len_threshold:], axis='columns', inplace=True)

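So this method may also allow rows with excess field separators, as long as the extra fields are empty. In the version above it also allows rows with fewer than 3 columns. If, for example, your third column always contains something in valid rows, it is easy to exclude rows that are too short. You just change the "good_rows" line to: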

    good_rows= df.iloc[:, len_threshold:].isna().all(axis='columns') & ~df.iloc[:, 2].isna()
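As a small illustration of the caveat about empty trailing fields, the sketch below runs the same filter on inline data in which the last row has two extra, empty fields. The sample text and the names `csv_text`, `excess_is_empty` and `kept` are made up for this example, and 3 is an assumed value for `len_threshold`.

```python
import io
import pandas as pd

# Inline sample: the last row has five fields, but the two extra ones are empty.
csv_text = "1,2,3,4,5,6\n1,2,3\n1,2,3,,\n"
len_threshold = 3  # assumed value for the example

df = pd.read_csv(io.StringIO(csv_text), header=None, index_col=False)

# True for rows that hold nothing but NaN beyond the allowed columns.
excess_is_empty = df.iloc[:, len_threshold:].isna().all(axis="columns")

# Keep those rows and only the first len_threshold columns.
kept = df.loc[excess_is_empty, df.columns[:len_threshold]]
```

Here the row `1,2,3,,` survives the filter even though it contains four separators, because its extra fields are empty.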

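【Comments】:

Thanks for the multiple answers! I used Method 1, reading the file twice, and it works great!

【Solution 2】:

If you pass the argument header=None to pandas.read_csv(), the column names are integers indexed from 0. So, if you have the following "file.csv":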

1,2,3,4,5,6
1,2,3
9,6,8
1,2,3,5,6

You can read it into a DataFrame with the following code:

import pandas as pd

df = pd.read_csv("file.csv", header=None, dtype="Int64")

If you run print(df), your result will be:

   0  1  2    3    4    5
0  1  2  3    4    5    6
1  1  2  3  NaN  NaN  NaN
2  9  6  8  NaN  NaN  NaN
3  1  2  3    5    6  NaN

Now, if you want to drop all rows with five or more non-NaN values, the following code should do the trick:

for index, row in df.iterrows():
    if sum(row.notnull()) >= 5:
        df.drop(index, inplace=True)

df.dropna(axis=1, how="all", inplace=True)

If you run print(df), your new result will be:

   0  1  2
1  1  2  3
2  9  6  8
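The row-by-row loop above can also be expressed as a boolean mask, which gives the per-row length the question asks about directly. This is a sketch of an alternative rather than the answerer's code: the data is inlined through `io.StringIO` to keep it self-contained, and the names `max_len`, `row_lengths` and `filtered` are chosen here for illustration.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
csv_text = "1,2,3,4,5,6\n1,2,3\n9,6,8\n1,2,3,5,6\n"
max_len = 5  # hypothetical name: rows with this many values or more are dropped

df = pd.read_csv(io.StringIO(csv_text), header=None, dtype="Int64")

# Number of non-missing values in each row, i.e. the "length" of the row.
row_lengths = df.notna().sum(axis=1)

# Keep the short rows, then drop columns that are now entirely empty.
filtered = df[row_lengths < max_len].dropna(axis=1, how="all")
```

Unlike the loop, this does not modify `df` in place; `filtered` is a new DataFrame holding the two three-column rows.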

Now, if you want to overwrite file.csv with the longer rows removed, it's as simple as:

df.to_csv("file.csv", header=False, index=False)

【Comments】: