【Question title】: How to match string and arrange dataframe accordingly?
【Posted】: 2022-01-07 14:39:26
【Question description】:

Given two input DataFrames, df1 and df2:

df1:

Subcategory_Desc    Segment_Desc    Flow            Side        Row_no
APPLE               APPLE LOOSE     Apple Kanzi     Front       Row 1
APPLE               APPLE LOOSE     Apple Jazz      Front       Row 1
CITRUS              ORANGES LOOSE   Orange Navel    Front       Row 1
PEAR                PEARS LOOSE     Lemon           Right End   Row 1
AVOCADOS            AVOCADOS LOOSE  Avocado         Back        Row 1
TROPICAL FRUIT      KIWI FRUIT      Kiwi Gold       Back        Row 1
TROPICAL FRUIT      KIWI FRUIT      Kiwi Green      Left End    Row 1

df2:

Subcategory_Desc    Segment_Desc    Flow
TROPICAL FRUIT      KIWI FRUIT      5pk Kids Kiwi
APPLE               APPLE LOOSE     Apple GoldenDel
AVOCADOS            AVOCADOS LOOSE  Avocado Tray

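Scenario: the rows of df2 should be inserted into df1, subject to the following conditions:

  1. For each df2 row, look in df1 for the same Subcategory_Desc and Segment_Desc, then insert that df2 row at the end of the matching side (Front/Back), as shown in the expected output.
  2. The Row_no column also has to be taken into account: the original dataset contains n Row_no values, and only Row 1 is given here as sample data.

Expected output: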

Subcategory_Desc    Segment_Desc    Flow            Side        Row_no
APPLE               APPLE LOOSE     Apple Kanzi     Front       Row 1
APPLE               APPLE LOOSE     Apple Jazz      Front       Row 1
CITRUS              ORANGES LOOSE   Orange Navel    Front       Row 1
APPLE               APPLE LOOSE     Apple GoldenDel Front       Row 1
PEAR                PEARS LOOSE     Lemon           Right End   Row 1
AVOCADOS            AVOCADOS LOOSE  Avocado         Back        Row 1
TROPICAL FRUIT      KIWI FRUIT      Kiwi Gold       Back        Row 1
TROPICAL FRUIT      KIWI FRUIT      5pk Kids Kiwi   Back        Row 1
AVOCADOS            AVOCADOS LOOSE  Avocado Tray    Back        Row 1
TROPICAL FRUIT      KIWI FRUIT      Kiwi Green      Left End    Row 1

I am not sure what simple logic could be used for this.

【Discussion】:

  • Any ideas are welcome!

Tags: python pandas dataframe string-matching fuzzywuzzy


【Solution 1】:

So, given the following dataframes:

import pandas as pd

df1 = pd.DataFrame(
    {
        "Subcategory_Desc": {
            0: "APPLE",
            1: "APPLE",
            2: "CITRUS",
            3: "PEAR",
            4: "AVOCADOS",
            5: "TROPICAL FRUIT",
            6: "TROPICAL FRUIT",
        },
        "Segment_Desc": {
            0: "APPLE LOOSE",
            1: "APPLE LOOSE",
            2: "ORANGES LOOSE",
            3: "PEARS LOOSE",
            4: "AVOCADOS LOOSE",
            5: "KIWI FRUIT",
            6: "KIWI FRUIT",
        },
        "Flow": {
            0: "Apple Kanzi",
            1: "Apple Jazz",
            2: "Orange Navel",
            3: "Lemon",
            4: "Avocado",
            5: "Kiwi Gold",
            6: "Kiwi Green",
        },
        "Side": {
            0: "Front",
            1: "Front",
            2: "Front",
            3: "Right_End",
            4: "Back",
            5: "Back",
            6: "Left_End",
        },
        "Row_no": {
            0: "Row 1",
            1: "Row 1",
            2: "Row 1",
            3: "Row 1",
            4: "Row 1",
            5: "Row 1",
            6: "Row 1",
        },
    }
)

df2 = pd.DataFrame(
    {
        "Subcategory_Desc": {0: "TROPICAL FRUIT", 1: "APPLE", 2: "AVOCADOS"},
        "Segment_Desc": {0: "KIWI FRUIT", 1: "APPLE LOOSE", 2: "AVOCADOS LOOSE"},
        "Flow": {0: "5pk Kids Kiwi", 1: "Apple GoldenDel", 2: "Avocado Tray"},
    }
)

You can try this:

# Initialize new column
df2["idx"] = ""

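# Find the index of a matching row in df1 (the last match wins;
# the final row of df1 is never checked)
for j, row2 in df2.iterrows():
    for i, row1 in df1.iterrows():
        if i + 1 >= df1.shape[0]:
            break
        if (
            row1["Subcategory_Desc"] == row2["Subcategory_Desc"]
            and row1["Segment_Desc"] == row2["Segment_Desc"]
        ):
            # Write through df2.loc: iterrows() may yield copies, so
            # assigning to row2 is not guaranteed to update df2
            df2.loc[j, "idx"] = i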

df2 = df2.sort_values(by="idx").reset_index(drop=True)

# Starting from the matched index, find the insertion index in df1
# (one past the last row that has the same Side)
for i, idx in enumerate(df2["idx"]):
    side_of_idx = df1.loc[idx, "Side"]
    df2.loc[i, "pos"] = df1.index[df1["Side"] == side_of_idx].to_list()[-1] + 1
positions = df2["pos"].astype("int").to_list()

# Clean up df2
df2 = df2.drop(columns=["idx", "pos"])
df2["Side"] = df2["Row_no"] = ""

# Iterate on df1 to insert new rows
for i, pos in enumerate(positions):

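    # Fill missing values from the row just before the insertion point
    # (pos itself can be out of bounds when the side ends df1)
    df2.loc[i, "Side"] = df1.loc[pos - 1, "Side"]
    df2.loc[i, "Row_no"] = df1.loc[pos - 1, "Row_no"]

    # Insert row (ignore_index=True already renumbers the result)
    df1 = pd.concat(
        [df1.iloc[:pos], pd.DataFrame([df2.iloc[i]]), df1.iloc[pos:]], ignore_index=True
    )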

    # Increment next position since df1 has changed
    if i < len(positions) - 1:
        positions[i + 1] += 1

Which gives:

print(df1)
# Outputs
  Subcategory_Desc    Segment_Desc             Flow       Side Row_no
0            APPLE     APPLE LOOSE      Apple Kanzi      Front  Row 1
1            APPLE     APPLE LOOSE       Apple Jazz      Front  Row 1
2           CITRUS   ORANGES LOOSE     Orange Navel      Front  Row 1
3            APPLE     APPLE LOOSE  Apple GoldenDel      Front  Row 1
4             PEAR     PEARS LOOSE            Lemon  Right_End  Row 1
5         AVOCADOS  AVOCADOS LOOSE          Avocado       Back  Row 1
6   TROPICAL FRUIT      KIWI FRUIT        Kiwi Gold       Back  Row 1
7   TROPICAL FRUIT      KIWI FRUIT    5pk Kids Kiwi       Back  Row 1
8         AVOCADOS  AVOCADOS LOOSE     Avocado Tray       Back  Row 1
9   TROPICAL FRUIT      KIWI FRUIT       Kiwi Green   Left_End  Row 1
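
For larger inputs, the same idea can probably be expressed without explicit loops. The following is only a sketch under stated assumptions, not the method used above: the function name `insert_by_side` and the helper column `_order` are hypothetical, the first df1 row that matches on Subcategory_Desc and Segment_Desc is assumed to decide the target Side and Row_no, and df2 rows with no match are dropped. Each new row gets the position of the last df1 row in its (Row_no, Side) group, and a stable sort then places it right after that group.

```python
import pandas as pd


def insert_by_side(df1, df2, keys=("Subcategory_Desc", "Segment_Desc")):
    """Insert df2 rows at the end of the matching (Row_no, Side) group of df1."""
    keys = list(keys)

    # Assumption: the first df1 row per key pair decides Side and Row_no.
    lookup = df1.drop_duplicates(subset=keys)[keys + ["Side", "Row_no"]]
    # An inner merge keeps df2's row order and drops unmatched rows.
    new_rows = df2.merge(lookup, on=keys, how="inner")

    # Remember the original position of every df1 row.
    base = df1.reset_index(drop=True)
    base["_order"] = base.index

    # Position of the last df1 row in each (Row_no, Side) group.
    group_end = base.groupby(["Row_no", "Side"])["_order"].max().reset_index()
    new_rows = new_rows.merge(group_end, on=["Row_no", "Side"], how="left")

    # Original rows come first in the concat, so a stable sort keeps them
    # ahead of inserted rows that share the same _order value.
    combined = pd.concat([base, new_rows], ignore_index=True)
    combined = combined.sort_values("_order", kind="mergesort")
    return combined.drop(columns="_order").reset_index(drop=True)


cols = ["Subcategory_Desc", "Segment_Desc", "Flow", "Side", "Row_no"]
df1 = pd.DataFrame(
    [
        ("APPLE", "APPLE LOOSE", "Apple Kanzi", "Front", "Row 1"),
        ("APPLE", "APPLE LOOSE", "Apple Jazz", "Front", "Row 1"),
        ("CITRUS", "ORANGES LOOSE", "Orange Navel", "Front", "Row 1"),
        ("PEAR", "PEARS LOOSE", "Lemon", "Right_End", "Row 1"),
        ("AVOCADOS", "AVOCADOS LOOSE", "Avocado", "Back", "Row 1"),
        ("TROPICAL FRUIT", "KIWI FRUIT", "Kiwi Gold", "Back", "Row 1"),
        ("TROPICAL FRUIT", "KIWI FRUIT", "Kiwi Green", "Left_End", "Row 1"),
    ],
    columns=cols,
)
df2 = pd.DataFrame(
    [
        ("TROPICAL FRUIT", "KIWI FRUIT", "5pk Kids Kiwi"),
        ("APPLE", "APPLE LOOSE", "Apple GoldenDel"),
        ("AVOCADOS", "AVOCADOS LOOSE", "Avocado Tray"),
    ],
    columns=cols[:3],
)

result = insert_by_side(df1, df2)
print(result)
```

On the sample data this should give the same row order as the output above. Because the group position is computed per (Row_no, Side) pair, the sketch also covers datasets with several Row_no values, provided the first-match assumption is acceptable.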

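【Discussion】:

  • Thanks for your input, @Laurent. With one small change it works perfectly for me. The change is in how the next position is incremented. If i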