【Question title】: Is there a more efficient way to retrieve rows with a list column which includes values in a list? (either a subset, union, or superset)
【Posted】: 2021-09-12 20:51:14
【Question description】:

Given a pandas.DataFrame like this:

<class 'pandas.core.frame.DataFrame'>
Index: 685 entries, 7789285 to 8009947
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   sourcedId         685 non-null    string             
 1   status            685 non-null    string             
 2   dateLastModified  685 non-null    datetime64[ns, UTC]
 3   username          685 non-null    string             
 4   userIds           685 non-null    object             
 5   enabledUser       685 non-null    string             
 6   givenName         685 non-null    string             
 7   familyName        685 non-null    string             
 8   middleName        685 non-null    string             
 9   role              685 non-null    string             
 10  identifier        685 non-null    string             
 11  email             685 non-null    string             
 12  sms               685 non-null    string             
 13  phone             685 non-null    string             
 14  agents            685 non-null    object             
 15  orgs              685 non-null    object             
 16  grades            685 non-null    object             
 17  password          685 non-null    string             
dtypes: datetime64[ns, UTC](1), object(4), string(13)
memory usage: 101.7+ KB
df.head()

The 'grades' column contains lists of integers stored as strings, e.g. ['9','10']. I can filter on a single value with:

mask = df.grades.apply(lambda x: '10' in x)

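In my test dataset, which was created from a list of lists that I populated by hand, I use integer values, so the following works (?) (for the sake of argument, assume the data is a list of integers rather than a list of strings):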

gradeList = [9,10]
mask = df.grades.apply(lambda x: any(map(lambda x, y: x == y, x, gradeList)))
df[mask].head()

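I'm relatively new to Python (over the past five years I've accumulated what I'd estimate is about 6 to 8 months of Python experience, if that) and completely new to Pandas. I only have a preliminary grasp of list comprehensions and the map function.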

My intent is to be able to retrieve any record where a subset of gradesList is present in the grades column. For a single integer grade, this is done with:

mask = df.grades.apply(lambda x: grade in x)

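Rather than achieving my goal, the nested lambda and map above produced something in which the order of the terms in the query parameter (gradesList) matters. Below is the output of my test script, which operates on the test data included in the output. I've tried not to sort either list...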

--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   id         10 non-null     object
 1   email      10 non-null     object
 2   fullName   10 non-null     object
 3   jobTitles  10 non-null     object
 4   grades     10 non-null     object
dtypes: object(5)
memory usage: 528.0+ bytes
--------------------------------------------------------------------------------
          id                 email          fullName                                          jobTitles           grades
0    smithsm    smithsm@aplace.com         Stu Smith  [developer, licensed pretend nurse, worthless ...  [9, 10, 11, 12]
1   mullenjb   mullenjb@aplace.com      Jason Mullen               [printer guy, supervisor, senior it]         [11, 12]
2    swainrl    swainrl@aplace.com        Ryan Swain                      [nap taker, goof-off, goober]          [9, 10]
3  rankinsns  rankinsns@aplace.com  Nicholas Rankins                           [manual tesla autopilot]          [9, 10]
4  carlsonrm  carlsonrm@aplace.com      Ryan Carlson                     [technician, snarky so-and-so]         [10, 11]
5     ragomv     ragomv@aplace.com         Mike Rago                                  [nice guy, swole]             [10]
6    smithdl    smithdl@aplace.com       David Smith                                         [old hand]              [9]
7  kappleraj  kappleraj@aplace.com   Allison Kappler      [girl coder, definitely not prettier than me]             [11]
8   iresonss   iresonss@aplace.com      Sandy Ireson                                      [hard worker]             [12]
9  conklincc  conklincc@aplace.com     Caleb Conklin                              [millenial magnum pi]          [12, 9]
--------------------------------------------------------------------------------
query for 'developer'
        id               email   fullName                                          jobTitles           grades
0  smithsm  smithsm@aplace.com  Stu Smith  [developer, licensed pretend nurse, worthless ...  [9, 10, 11, 12]
--------------------------------------------------------------------------------
query for 11
          id                 email         fullName                                          jobTitles           grades
0    smithsm    smithsm@aplace.com        Stu Smith  [developer, licensed pretend nurse, worthless ...  [9, 10, 11, 12]
1   mullenjb   mullenjb@aplace.com     Jason Mullen               [printer guy, supervisor, senior it]         [11, 12]
4  carlsonrm  carlsonrm@aplace.com     Ryan Carlson                     [technician, snarky so-and-so]         [10, 11]
7  kappleraj  kappleraj@aplace.com  Allison Kappler      [girl coder, definitely not prettier than me]             [11]
--------------------------------------------------------------------------------
query for 10
          id                 email          fullName                                          jobTitles           grades
0    smithsm    smithsm@aplace.com         Stu Smith  [developer, licensed pretend nurse, worthless ...  [9, 10, 11, 12]
2    swainrl    swainrl@aplace.com        Ryan Swain                      [nap taker, goof-off, goober]          [9, 10]
3  rankinsns  rankinsns@aplace.com  Nicholas Rankins                                       [technician]          [9, 10]
4  carlsonrm  carlsonrm@aplace.com      Ryan Carlson                     [technician, snarky so-and-so]         [10, 11]
5     ragomv     ragomv@aplace.com         Mike Rago                                  [nice guy, swole]             [10]
--------------------------------------------------------------------------------
query for 11,12
          id                 email         fullName                                      jobTitles    grades
1   mullenjb   mullenjb@aplace.com     Jason Mullen           [printer guy, supervisor, senior it]  [11, 12]
7  kappleraj  kappleraj@aplace.com  Allison Kappler  [girl coder, definitely not prettier than me]      [11]
--------------------------------------------------------------------------------
query for 10,11
          id                 email      fullName                       jobTitles    grades
4  carlsonrm  carlsonrm@aplace.com  Ryan Carlson  [technician, snarky so-and-so]  [10, 11]
5     ragomv     ragomv@aplace.com     Mike Rago               [nice guy, swole]      [10]
--------------------------------------------------------------------------------
query for 9,10
          id                 email          fullName                                          jobTitles           grades
0    smithsm    smithsm@aplace.com         Stu Smith  [developer, licensed pretend nurse, worthless ...  [9, 10, 11, 12]
2    swainrl    swainrl@aplace.com        Ryan Swain                      [nap taker, goof-off, goober]          [9, 10]
3  rankinsns  rankinsns@aplace.com  Nicholas Rankins                                       [technician]          [9, 10]
6    smithdl    smithdl@aplace.com       David Smith                                         [old hand]              [9]
--------------------------------------------------------------------------------
query for 10,9
          id                 email       fullName                       jobTitles    grades
4  carlsonrm  carlsonrm@aplace.com   Ryan Carlson  [technician, snarky so-and-so]  [10, 11]
5     ragomv     ragomv@aplace.com      Mike Rago               [nice guy, swole]      [10]
9  conklincc  conklincc@aplace.com  Caleb Conklin           [millenial magnum pi]   [12, 9]

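Can anyone identify what's wrong (hopefully a core concept I'm missing) or point me to documentation that would help me untangle what's going on?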
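The order dependence in the output above is consistent with how map() behaves when it is given two iterables: it pairs their elements by position, much like zip(), so the lambda only compares the first grade with the first query term, the second with the second, and so on. The following is a minimal sketch of that effect using one row from the test data; the names grade_list, row and shares_any are illustrative and not from the original script.

```python
# Sketch: why the two-iterable map() depends on the order of the query terms.
grade_list = [10, 9]   # the "query for 10,9" case
row = [12, 9]          # grades of conklincc in the test data

# map() with two iterables walks them in lockstep, like zip():
pairs = list(zip(row, grade_list))   # (12, 10) and (9, 9)

# Position 1 compares 9 == 9, so the row matches the query [10, 9] ...
positional = any(map(lambda a, b: a == b, row, grade_list))

# ... but against [9, 10] no position lines up, so the same row is missed.
positional_swapped = any(map(lambda a, b: a == b, row, [9, 10]))


def shares_any(grades, wanted):
    """Order-independent test: does the row contain any wanted grade?"""
    wanted = set(wanted)
    return any(g in wanted for g in grades)
```

With a helper like shares_any, the mask could be built as df.grades.apply(lambda g: shares_any(g, gradeList)), which gives the same rows regardless of how either list is ordered.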

【Question discussion】:

    Tags: python pandas list dataframe


    【Solution 1】:

    I used a more lightweight dataframe:

    >>> df
              id           grades
    0    smithsm     [1, 9, 2, 6]  # <- 9
    1   mullenjb  [1, 5, 8, 4, 7]
    2    swainrl        [4, 2, 9]  # <- 9
    3  rankinsns           [5, 2]
    4  carlsonrm  [7, 4, 6, 3, 2]  # <- 3
    5     ragomv        [6, 1, 5]
    6    smithdl  [2, 9, 6, 7, 3]  # <- 3 & 9
    7  kappleraj        [9, 5, 8]  # <- 9
    8   iresonss  [8, 6, 7, 5, 4]
    9  conklincc           [8, 6]
    

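    How do you find the rows whose grades include any of [3, 9]?

    Explode your grades column, check whether each grade is in the list of wanted grades, then group back by the original index.

    >>> df.loc[df['grades'].explode().isin([3, 9]).groupby(level=0).any()]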
              id           grades
    0    smithsm     [1, 9, 2, 6]
    2    swainrl        [4, 2, 9]
    4  carlsonrm  [7, 4, 6, 3, 2]
    6    smithdl  [2, 9, 6, 7, 3]
    7  kappleraj        [9, 5, 8]
    

    Same as:

    >>> df.loc[df['grades'].explode() \
    ...                    .apply(lambda x: x in [3, 9]) \
    ...                    .groupby(level=0).any()]

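    【Discussion】:

    • Awesome! The best I had managed so far was to define a function taking the two lists as arguments, nest two for loops over enumerate() of each list, split the tuple iterators into index and value, append the result of each equality test to a temporary list, and then return any() on that temporary list. Your solution is much better. I'll read the documentation on explode() and try again to wrap my head around the level parameter.
    • Glad to read that it helped. If it fits your needs, don't forget to upvote and/or accept the solution.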
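The title also asks about subset and superset matching, which the thread does not cover. One possible sketch, offered as an assumption rather than as part of the accepted answer, uses Python's built-in set methods on each row's list. The helper names contains_any, contains_all and within are hypothetical.

```python
def contains_any(grades, wanted):
    """True if the row shares at least one grade with wanted."""
    return not set(wanted).isdisjoint(grades)


def contains_all(grades, wanted):
    """True if every wanted grade appears in the row (row is a superset)."""
    return set(wanted).issubset(grades)


def within(grades, wanted):
    """True if every grade in the row is one of wanted (row is a subset)."""
    return set(grades).issubset(wanted)
```

Any of these could be applied per row, for example df[df.grades.apply(lambda g: contains_all(g, [9, 10]))]. The element types must match the data: if the column holds strings such as '10', the query list needs strings as well.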