Extracting 2 values from list with for-loop (answers)

【Question title】: Extracting 2 values from list with for-loop
【Posted】: 2021-10-04 14:25:56
【Question description】:

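I have a large Excel sheet with one column that contains several different identifiers (e.g. ISBNs). I have converted the sheet to a pandas DataFrame and turned the identifier column into a list. The list entry for one row of the original column looks like this: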

'ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534'

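However, the entries are not all alike: some have an ISBN and some don't, some have more items and some fewer (5 in the example above), and the different IDs are mostly, but not always, separated by commas.

In the next step, I built a function that goes through the list items (each one a long string like the one above) and splits it into separate words, so I get something like: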

'ISBN:978-9941-30-551-1', 'Broschur :', 'GEL', '14.90', 'IDN:1215507534'

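I want to extract the ISBN and IDN values (where present) and then add a dedicated ISBN column and a dedicated IDN column to my original DataFrame, in place of the "identifier" column that holds the mixed data.

I now have the code below, which does what it is supposed to, except that the values in my dictionary are lists, so every entry in the resulting DataFrame is a list. I'm sure there must be a better way to do this, but I can't seem to think of one...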

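def find_stuff(item):
    # Split one identifier string into whitespace-separated words.
    list_of_words = item.split()
    ISBN = list()
    IDN = list()

    for word in list_of_words:

        # Collect every word that carries an ISBN, without its prefix.
        if 'ISBN' in word:
            var = word
            var = var.replace("ISBN:", "")
            ISBN.append(var)

        # Collect every word that carries an IDN, without its prefix.
        if 'IDN' in word:
            var2 = word
            var2 = var2.replace("IDN:", "")
            IDN.append(var2)

    # Both values are lists, which is why each DataFrame cell ends up holding a list.
    sum_dict = {"ISBN": ISBN, "IDN": IDN}

    return sum_dict


# id_lists is the list of identifier strings built from the DataFrame column.
output = [find_stuff(item) for item in id_lists]
print(output)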

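Any help is much appreciated :)

【Comments】:

Can you check whether my answer works for you? It should be much more efficient than looping over the text manually with a custom function. If you want a different output or advice on post-processing, please provide the expected output and your use case.

Tags: python pandas string for-loop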

【Solution 1】:

You don't need your function; just apply a regex with named groups to the original column that contains the long strings.

Let's take this example:

import pandas as pd

df = pd.DataFrame({'other_column': ['blah', 'blah'],
                   'identifier': ['ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534',
                                  'ISBN:123-4567-89-012-3 blah IDN:1234567890 other'
                                 ],
                  })

  other_column                                                    identifier
0         blah  ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534
1         blah              ISBN:123-4567-89-012-3 blah IDN:1234567890 other

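If the ISBN always comes before the IDN, you can use pandas.Series.str.extract (note the raw string, which keeps \d from being treated as a string escape):

df['identifier'].str.extract(r'(?P<ISBN>ISBN:[\d-]+).*(?P<IDN>IDN:\d+)')

Output: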

                     ISBN             IDN
0  ISBN:978-9941-30-551-1  IDN:1215507534
1  ISBN:123-4567-89-012-3  IDN:1234567890
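
To see why the ordering caveat matters, here is a minimal sketch that applies the same pattern to a row where the IDN comes first. The reversed sample string is invented for illustration and is not from the question's data. Because the pattern requires ISBN to appear before IDN, that row does not match at all and both columns come back empty.

```python
import pandas as pd

# Hypothetical sample: the second row has IDN before ISBN.
s = pd.Series([
    'ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534',
    'IDN:1234567890 blah ISBN:123-4567-89-012-3',
])

# Same pattern as above: ISBN, anything in between, then IDN.
pattern = r'(?P<ISBN>ISBN:[\d-]+).*(?P<IDN>IDN:\d+)'
ordered = s.str.extract(pattern)

# Row 0 is filled in; row 1 is missing in both columns.
print(ordered)
```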

If they might not always be in this order, use pandas.Series.str.extractall and reshape the output with groupby:

(df['identifier'].str.extractall(r'(?P<ISBN>ISBN:[\d-]+)|(?P<IDN>IDN:\d+)')
                 .groupby(level=0).first()
)
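
The intermediate result helps explain the groupby step. The sketch below (again using an invented reversed-order row) prints what extractall returns before it is collapsed: one row per match, indexed by original row and match number, with only one of the two columns filled in each row. groupby(level=0).first() then keeps the first non-null value per column for each original row.

```python
import pandas as pd

s = pd.Series([
    'ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534',
    'IDN:1234567890 blah ISBN:123-4567-89-012-3',  # hypothetical reversed order
])

pattern = r'(?P<ISBN>ISBN:[\d-]+)|(?P<IDN>IDN:\d+)'

# One row per match; the index is (original row, match number).
matches = s.str.extractall(pattern)
print(matches)

# Collapse back to one row per original row, keeping the first
# non-null value found in each column.
collapsed = matches.groupby(level=0).first()
print(collapsed)
```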

Finally, if you don't want the identifier names in the values, change the regex to r'(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))':

(df['identifier'].str.extractall(r'(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))')
                 .groupby(level=0).first()
)

Output:

                ISBN         IDN
0  978-9941-30-551-1  1215507534
1  123-4567-89-012-3  1234567890
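
The question asks for ISBN and IDN columns in place of the mixed identifier column, and mentions that some rows lack an ISBN. One possible way to finish the job is sketched below. The second and third sample rows are invented for illustration. Since extractall drops rows that have no match at all, the sketch relies on join aligning on the index to bring those rows back with missing values.

```python
import pandas as pd

df = pd.DataFrame({
    'other_column': ['blah', 'blah', 'blah'],
    'identifier': [
        'ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534',
        'IDN:1234567890 other',   # hypothetical row without an ISBN
        'no identifiers here',    # hypothetical row without any match
    ],
})

pattern = r'(?:ISBN:(?P<ISBN>[\d-]+))|(?:IDN:(?P<IDN>\d+))'
ids = df['identifier'].str.extractall(pattern).groupby(level=0).first()

# join aligns on the index, so rows absent from `ids` get missing values.
result = df.join(ids).drop(columns='identifier')
print(result)
```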

Note: if you need a dictionary as output, you can append .to_dict('index') to the end of the command. This gives you:

{0: {'ISBN': '978-9941-30-551-1', 'IDN': '1215507534'},
 1: {'ISBN': '123-4567-89-012-3', 'IDN': '1234567890'}}

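【Comments】:

Thank you very much! It looks like I'll have to get more comfortable with regex.
Well, it is certainly an investment, but it is very powerful. For example, you can check that the ISBN format is correct, and many other things...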
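
As one possible reading of "checking the ISBN", here is a minimal sketch that validates the ISBN-13 check digit in plain Python rather than with a regex. The function name is_valid_isbn13 is hypothetical, and the commenter may well have had a purely regex-based format check in mind.

```python
def is_valid_isbn13(isbn: str) -> bool:
    """Return True if `isbn` has 13 digits and a correct ISBN-13 check digit."""
    compact = isbn.replace('-', '')
    if len(compact) != 13 or not (compact.isascii() and compact.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ...; the weighted sum must be a multiple of 10.
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(compact))
    return total % 10 == 0


print(is_valid_isbn13('978-9941-30-551-1'))  # True
print(is_valid_isbn13('123-4567-89-012-3'))  # False (made-up example ISBN)
```

On a DataFrame, such a function could be applied to the extracted ISBN column with .apply, after handling missing values.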

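【Solution 2】:

Since you are working in pandas, I suggest using pandas' string methods to extract the relevant information and assign it directly to new columns. The answer below shows a few possibilities: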

import pandas as pd

df = pd.DataFrame(['ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534'], columns=['identifier'])

def retrieve_text(lst, text):
    # Return the first word containing `text`; return None if no word
    # matches (IndexError) or the cell is not a list, e.g. NaN (TypeError).
    try:
        return [i for i in lst if text in i][0]
    except (IndexError, TypeError):
        return None

df['ISBN'] = df['identifier'].str.split().apply(lambda x: retrieve_text(x, 'ISBN')) # use a custom function to filter the list
df['IDN'] = df['identifier'].str.split().apply(lambda x: retrieve_text(x, 'IDN'))
df['name'] = df['identifier'].str.split().str[1] # get by index
df['price'] = df['identifier'].str.extract(r'(\d+\.\d+)').astype('float') # use regex, no need to split the string here

Output:

                                                     identifier                    ISBN             IDN      name  price
0  ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534  ISBN:978-9941-30-551-1  IDN:1215507534  Broschur   14.9
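
The lambda passed to apply is just an anonymous one-argument function that fixes the second argument of retrieve_text. The sketch below shows three equivalent spellings. retrieve_isbn is a hypothetical name, and retrieve_text is rewritten here in a shorter form with next, which is an assumption about style rather than the answer's original code.

```python
import pandas as pd

df = pd.DataFrame(
    ['ISBN:978-9941-30-551-1 Broschur :  GEL 14.90, IDN:1215507534'],
    columns=['identifier'],
)

def retrieve_text(lst, text):
    """Return the first element of lst that contains text, or None."""
    return next((i for i in lst if text in i), None)

# Each cell becomes a list of whitespace-separated words.
tokens = df['identifier'].str.split()

# 1) Lambda: called once per cell, with x bound to that cell's list.
with_lambda = tokens.apply(lambda x: retrieve_text(x, 'ISBN'))

# 2) The same thing written as a named function.
def retrieve_isbn(x):
    return retrieve_text(x, 'ISBN')

with_named = tokens.apply(retrieve_isbn)

# 3) Let apply forward the extra positional argument itself.
with_args = tokens.apply(retrieve_text, args=('ISBN',))

print(with_lambda)
```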

【Comments】:

Thank you very much, this works perfectly! I'm not yet sure what exactly the lambda part does; I'll look into it to learn. Thanks!