【Question title】: pandas merge rows inside of single dataframe
【Posted】: 2018-11-24 22:21:35
【Question】:

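I'm new to pandas and have a question I can't answer on my own. For context, this is output from a firewall. It generates millions of packets, and I'm trying to aggregate that data into a firewall rule set. The best approach I've come up with is to identify traffic by destination IP.

If the source/destination ports are ephemeral, they will change, so it's important to aggregate them into the same row. That way I can determine the port range for the rule set.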

RAW CSV:

dvc,"src_interface",transport,"src_ip","src_port","dest_ip","dest_port",direction,action,cause,count
"Firewall-1",outside,tcp,"4.4.4.4",53,"1.1.1.1",1025,outbound,allowed,"",2
"Firewall-1",outside,tcp,"4.4.4.4",53,"1.1.1.1",1026,outbound,allowed,"",2
"Firewall-1",outside,tcp,"4.4.4.4",22,"1.1.1.1",1028,outbound,allowed,"",2
"Firewall-1",outside,tcp,"3.3.3.3",22,"2.2.2.2",2200,outbound,allowed,"",2

DataFrame:

          dvc  src_interface  transport   src_ip  src_port  dest_ip  dest_port  direction   action  cause  count
0  Firewall-1        outside        tcp  4.4.4.4        53  1.1.1.1       1025   outbound  allowed    NaN      2
1  Firewall-1        outside        tcp  4.4.4.4        53  1.1.1.1       1026   outbound  allowed    NaN      2
2  Firewall-1        outside        tcp  4.4.4.4        53  1.1.1.1       1028   outbound  allowed    NaN      2
3  Firewall-1        outside        tcp  3.3.3.3        22  2.2.2.2       2200   outbound  allowed    NaN      2

How would I merge the rows that share the same dest_ip?

Code:

import glob
import pandas as pd

df = pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')], ignore_index=True)
index_cols = df.columns.tolist()
index_cols.remove('dest_ip')
df = df.groupby(index_cols, as_index=False)['dest_ip'].apply(list)
print(df)

Expected output:

Firewall-1 outside tcp 4.4.4.4 53 1.1.1.1 1025-1026,1028 outbound allowed nan 2
Firewall-1 outside tcp 3.3.3.3 22 2.2.2.2 2200 outbound allowed nan 2

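Most of the examples I found online involve joining two DataFrames, whereas I only have one. Any help would be appreciated. Thanks in advance!

【Comments】:

Tags: python pandas dataframe merge row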


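【Answer 1】:

Try this. Group by all the columns whose information you want carried over unchanged, then aggregate the differing "dest_port" values into a list: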

import pandas as pd

# Sample data taken from the question's CSV
df = pd.DataFrame([
            ["Firewall-1","outside","tcp","4.4.4.4",53,"1.1.1.1",1025,"outbound","allowed","",2],
            ["Firewall-1","outside","tcp","4.4.4.4",53,"1.1.1.1",1026,"outbound","allowed","",2],
            ["Firewall-1","outside","tcp","4.4.4.4",22,"1.1.1.1",1028,"outbound","allowed","",2],
            ["Firewall-1","outside","tcp","3.3.3.3",22,"2.2.2.2",2200,"outbound","allowed","",2]
        ],
        columns=["dvc","src_interface","transport","src_ip","src_port","dest_ip","dest_port","direction","action","cause","count"])

# Group on every column except dest_port
index_cols = df.columns.tolist()
index_cols.remove("dest_port")
# Collect the dest_port values of each group into a list
df = df.groupby(index_cols)["dest_port"].apply(list)
df = df.reset_index()

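This leaves 3 rows rather than the 2 in your desired output, because the third sample row has src_port 22 rather than 53 and therefore falls into its own group: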

          dvc  src_interface  transport   src_ip  src_port  dest_ip  direction   action  cause  count     dest_port
0  Firewall-1        outside        tcp  3.3.3.3        22  2.2.2.2   outbound  allowed             2        [2200]
1  Firewall-1        outside        tcp  4.4.4.4        22  1.1.1.1   outbound  allowed             2        [1028]
2  Firewall-1        outside        tcp  4.4.4.4        53  1.1.1.1   outbound  allowed             2  [1025, 1026]
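
The lists above still have to be turned into the range notation used in the question's expected output (1025-1026,1028). A minimal sketch of one way to do that is shown here; compress_ports is a hypothetical helper name, not part of pandas, and it assumes the ports are integers:

```python
def compress_ports(ports):
    """Collapse port numbers into a string such as '1025-1026,1028'."""
    ordered = sorted({int(p) for p in ports})
    if not ordered:
        return ""
    ranges = []
    start = prev = ordered[0]
    for port in ordered[1:]:
        if port == prev + 1:
            # Still inside a consecutive run: extend it
            prev = port
            continue
        # Gap found: close the current run and start a new one
        ranges.append((start, prev))
        start = prev = port
    ranges.append((start, prev))
    # A run of length one is written as a single port, otherwise as first-last
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(compress_ports([1025, 1026, 1028]))
```

Applied to the grouped frame, this would be df["dest_port"] = df["dest_port"].apply(compress_ports). To end up with the two rows from the question, src_port would presumably also have to be left out of index_cols, since the sample rows for 1.1.1.1 differ in src_port.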

【Comments】:

  • I think this gets me closer, but for some reason the DataFrame is empty: $ python3.7 firewallData.py Empty DataFrame Columns: [] Index: []
  • df = pd.concat([pd.read_csv(f) for f in glob.glob('*.csv')], ignore_index=True) index_cols = df.columns.tolist() index_cols.remove('dest_ip') df = df.groupby(index_cols, as_index=False)['dest_ip'].apply(list) print(df)
  • Could you add to your question a very simple DataFrame constructor that builds a test DataFrame from 3 rows of sample data? Then I can check this code.
  • Thanks again, Robert, for looking into this. I've added more detail.
  • Almost there. I need something like this: df = pd.DataFrame({'colA': ["item1", "item2", "item3", "item4"], 'colB': [...], ...}).
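
One possible explanation for the empty result reported in the comments above: pd.read_csv parses the empty "" in the cause column as NaN (the DataFrame printout in the question shows this), and groupby by default drops rows whose group key contains NaN. The hand-built frame in the answer uses empty strings instead, so it is not affected. The sketch below reproduces the effect with the question's sample rows read from an in-memory string (an assumption standing in for the real *.csv files) and avoids it with keep_default_na=False:

```python
import io
import pandas as pd

# Sample rows from the question; the real data would come from the *.csv files
RAW = '''dvc,"src_interface",transport,"src_ip","src_port","dest_ip","dest_port",direction,action,cause,count
"Firewall-1",outside,tcp,"4.4.4.4",53,"1.1.1.1",1025,outbound,allowed,"",2
"Firewall-1",outside,tcp,"4.4.4.4",53,"1.1.1.1",1026,outbound,allowed,"",2
"Firewall-1",outside,tcp,"4.4.4.4",22,"1.1.1.1",1028,outbound,allowed,"",2
"Firewall-1",outside,tcp,"3.3.3.3",22,"2.2.2.2",2200,outbound,allowed,"",2
'''

# Default parsing: the empty "cause" field becomes NaN
df_nan = pd.read_csv(io.StringIO(RAW))
key_cols = [c for c in df_nan.columns if c != "dest_port"]
# Rows with NaN in any grouping key are excluded, so no groups remain
n_default_groups = df_nan.groupby(key_cols).ngroups
print(n_default_groups)

# keep_default_na=False leaves the empty field as an empty string
df_str = pd.read_csv(io.StringIO(RAW), keep_default_na=False)
merged = df_str.groupby(key_cols)["dest_port"].agg(list).reset_index()
print(merged)
```

On pandas 1.1 or later, passing dropna=False to groupby is an alternative that keeps NaN as a group key.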
【Answer 2】:

I think the following might do what you need:

    import pandas as pd
    # Create a practice DataFrame; drop_duplicates removes rows whose 'key' value repeats, keeping the first
    df = pd.DataFrame({'key':[1,1,3,4],'color':[1,2,3,2],'house':[1,2,3,7]})
    print(df.drop_duplicates(['key']))

Original DataFrame:

    key  color  house
    1      1      1
    1      2      2
    3      3      3
    4      2      7

Output DataFrame:

    key  color  house
    1      1      1
    3      3      3
    4      2      7
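
Note that drop_duplicates keeps only the first row for each key and discards the others, whereas the question asks for the differing values to be merged. A small sketch on the same practice frame contrasts the two, using groupby(...).agg(list) as the merging counterpart (the variable names deduped and merged are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'key': [1, 1, 3, 4], 'color': [1, 2, 3, 2], 'house': [1, 2, 3, 7]})

# drop_duplicates keeps the first row per key; the second row for key 1 is lost
deduped = df.drop_duplicates(subset=['key'])

# groupby + agg(list) keeps every value, collected into one list per key
merged = df.groupby('key').agg(list).reset_index()

print(deduped)
print(merged)
```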

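【Comments】:

  • Thanks for the reply :) It looks like your sample data is somewhat different. When I try to run your code, I get this error: TypeError: 'method' object is not subscriptable
  • You get that error because this example has a syntax error. It should be: 'drop_duplicates(subset=['key'])'. In your case, the subset argument must be every column except "dest_ip".
  • Robert is right, I missed a set of parentheses. I've edited my answer.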