【Question title】: finding pattern within csv file
【Posted】: 2017-02-18 04:44:38
【Question】:

I have a sample CSV file exported from Excel:

Receipt Name    Address      Date       Time    Items
25007   A      ABC pte ltd   4/7/2016   10:40   Cheese, Cookie, Pie
.
.
25008   B      CCC pte ltd   4/7/2016   12:40   Cheese, Cookie

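Is there a simple way to compare the "Items" column, find the most common patterns of items that people buy together, and display the top combinations? In this example, the shared pattern is Cheese, Cookie.

【Question discussion】:

  • What is the actual format of your file?
  • I think you need a more complete example. What if someone else buys Cheese and Chocolate, and another person buys only Cheese? It is not clear what you are looking for...
  • A few questions: are the products in Items comma-separated? Is the full set of products unknown in advance? Can the most common pattern appear in any order?
  • @Darryl Dan, are you looking for pairs, or what is the criterion?

Tags: python regex csv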


【Solution 1】:

Suppose that, after processing the CSV file, you end up with the following list of item strings:

>>> items=['Cheese,Cookie,Pie', 'Cheese,Cookie,Pie', 'Cake,Cookie,Cheese', 
... 'Cheese,Mousetrap,Pie', 'Cheese,Jam','Cheese','Cookie,Cheese,Mousetrap']
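
How that list gets built is not shown here. A minimal sketch, assuming the items sit in the last column, the first row is a header, and the file is truly comma-delimited; `load_items` and `SAMPLE` are hypothetical names used only for illustration, and the in-memory sample stands in for the real file. It also strips the spaces after commas seen in the question's data, so that ' Cookie' and 'Cookie' are not counted as different items:

```python
import csv
import io

# Hypothetical sample standing in for the real file. The layout (header row
# first, items in the last column) is an assumption, not taken from the asker.
SAMPLE = '''Receipt,Name,Address,Date,Time,Items
25007,A,ABC pte ltd,4/7/2016,10:40,"Cheese, Cookie, Pie"
25008,B,CCC pte ltd,4/7/2016,12:40,"Cheese, Cookie"
'''

def load_items(handle):
    """Return one normalized 'a,b,c' string per receipt row."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    result = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # Strip whitespace around each item name and drop empty names
        names = [name.strip() for name in row[-1].split(',') if name.strip()]
        result.append(','.join(names))
    return result

loaded = load_items(io.StringIO(SAMPLE))
print(loaded)  # ['Cheese,Cookie,Pie', 'Cheese,Cookie']
```

For a real file, something like `with open("test.csv", newline="") as f: items = load_items(f)` should work the same way.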

First, determine all possible pairs:

>>> from itertools import combinations
>>> all_pairs={frozenset(t) for e in items for t in combinations(e.split(','),2)}

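Then you can do this:

from collections import Counter
pair_counts=Counter()
for s in items:
    # The set of frozensets counts each pair at most once per receipt
    for pair in {frozenset(t) for t in combinations(s.split(','), 2)}:
        # Sort so the key does not depend on set iteration order
        pair_counts[tuple(sorted(pair))] += 1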

>>> pair_counts
Counter({('Cheese', 'Cookie'): 4, ('Cheese', 'Pie'): 3, ('Cookie', 'Pie'): 2, ('Cheese', 'Mousetrap'): 2, ('Cookie', 'Mousetrap'): 1, ('Cheese', 'Jam'): 1, ('Mousetrap', 'Pie'): 1, ('Cake', 'Cheese'): 1, ('Cake', 'Cookie'): 1})
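
The question asks to display the top combination. A small sketch of how that could be read off the counter, using `Counter.most_common`; the counts below are copied from the output above (only the pairs seen more than once), and `top_pair` and `top_count` are names introduced here for illustration:

```python
from collections import Counter

# Counts copied from the output above (pairs seen more than once).
pair_counts = Counter({('Cheese', 'Cookie'): 4, ('Cheese', 'Pie'): 3,
                       ('Cookie', 'Pie'): 2, ('Cheese', 'Mousetrap'): 2})

# most_common(1) returns a one-element list of (key, count) tuples
top_pair, top_count = pair_counts.most_common(1)[0]
print('{} (bought together {} times)'.format(' + '.join(top_pair), top_count))
# Cheese + Cookie (bought together 4 times)
```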

This can be extended to the more general case:

max_n=max(len(e.split(',')) for e in items)
for n in range(max_n, 1, -1):
    group_counts=Counter()
    for s in items:
        # Count each group of n items at most once per receipt
        for group in {frozenset(t) for t in combinations(s.split(','), n)}:
            group_counts[tuple(sorted(group))] += 1
    print('group length: {}, most_common: {}'.format(n, group_counts.most_common()))

Prints:

group length: 3, most_common: [(('Cheese', 'Cookie', 'Pie'), 2), (('Cheese', 'Mousetrap', 'Pie'), 1), (('Cheese', 'Cookie', 'Mousetrap'), 1), (('Cake', 'Cheese', 'Cookie'), 1)]
group length: 2, most_common: [(('Cheese', 'Cookie'), 4), (('Cheese', 'Pie'), 3), (('Cookie', 'Pie'), 2), (('Cheese', 'Mousetrap'), 2), (('Cookie', 'Mousetrap'), 1), (('Cheese', 'Jam'), 1), (('Mousetrap', 'Pie'), 1), (('Cake', 'Cheese'), 1), (('Cake', 'Cookie'), 1)]
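
Most of the groups listed above occur only once, which says little about what people buy together. One possible variation, not part of the original answer, is to wrap the loop body in a function and keep only groups seen at least `min_count` times. `frequent_groups` and `min_count` are hypothetical names, and the default threshold of 2 is an arbitrary assumption:

```python
from collections import Counter
from itertools import combinations

def frequent_groups(items, size, min_count=2):
    """Return (group, count) pairs for groups of `size` items that appear
    on at least `min_count` receipts, most frequent first."""
    counts = Counter()
    for s in items:
        # Deduplicate and sort, so each group is a canonical tuple that is
        # counted at most once per receipt
        names = sorted(set(s.split(',')))
        counts.update(combinations(names, size))
    return [(g, c) for g, c in counts.most_common() if c >= min_count]

items = ['Cheese,Cookie,Pie', 'Cheese,Cookie,Pie', 'Cake,Cookie,Cheese',
         'Cheese,Mousetrap,Pie', 'Cheese,Jam', 'Cheese', 'Cookie,Cheese,Mousetrap']

print(frequent_groups(items, 3))  # [(('Cheese', 'Cookie', 'Pie'), 2)]
print(frequent_groups(items, 2))
```

Because the names are sorted before `combinations` is called, every tuple comes out in the same order no matter how the receipt listed the items.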

【Discussion】:

    【Solution 2】:

    Assuming the values are comma-separated, you can use a frozenset of each row's items and a Counter dict to get the counts:

    from collections import Counter
    import csv
    
    with open("test.csv") as f:
        next(f)
        counts = Counter(frozenset(tuple(row[-1].split(",")))
                         for row in csv.reader(f))
        print(counts.most_common())
    

    If, based on your updated input, you want all combinations of pairs:

    from collections import Counter
    from itertools import combinations
    
    def combs(s):
        return  combinations(s.split(","), 2)
    
    import csv
    with open("test.csv") as f:
        next(f)
        counts = Counter(frozenset(t)
                         for row in csv.reader(f)
                                for t in combs(row[-1]))
        # counts -> Counter({frozenset(['Cheese', 'Cookie']): 2, frozenset(['Cheese', 'Pie']): 1, frozenset(['Cookie', 'Pie']): 1})
        print(counts.most_common())
    

    The order within a pair does not matter, because frozenset([1,2]) and frozenset([2,1]) are considered equal.
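
A short illustration of that point, with made-up values: two frozensets built in different orders compare equal and hash alike, so a Counter merges them into one key, whereas plain tuples would be split into two keys.

```python
from collections import Counter

a = frozenset(['Cheese', 'Cookie'])
b = frozenset(['Cookie', 'Cheese'])

print(a == b)              # True
print(hash(a) == hash(b))  # True

set_counts = Counter([a, b])
print(len(set_counts), set_counts[a])  # 1 2

# Tuples keep their order, so the same pair ends up under two keys
tuple_counts = Counter([('Cheese', 'Cookie'), ('Cookie', 'Cheese')])
print(len(tuple_counts))  # 2
```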

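    If you want to consider all combinations of sizes 2 to n:

    from collections import Counter
    from itertools import chain, combinations

    def combs(s):
        indiv_items = s.split(",")
        # Chain together the combinations of every size from 2 up to all items
        return chain.from_iterable(combinations(indiv_items, i) for i in range(2, len(indiv_items) + 1))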
    
    
    import csv
    
    with open("test.csv") as f:
        next(f)
        counts = Counter(frozenset(t)
                         for row in csv.reader(f)
                             for t in combs(row[-1]))
        print(counts)
        print(counts.most_common())
    

    For the input:

    Receipt,Name,Address,Date,Time,Items
    25007,A,ABC pte ltd,4/7/2016,10:40,"Cheese,Cookie,Pie"
    25008,B,CCC pte ltd,4/7/2016,12:40,"Cheese,Cookie"
    25009,B,CCC pte ltd,4/7/2016,12:40,"Cookie,Cheese,pizza"
    25010,B,CCC pte ltd,4/7/2016,12:40,"Pie,Cheese,pizza"
    

    this gives you:

    Counter({frozenset(['Cheese', 'Cookie']): 3, frozenset(['Cheese', 'pizza']): 2, frozenset(['Cheese', 'Pie']): 2, frozenset(['Cookie', 'Pie']): 1, frozenset(['Cheese', 'Cookie', 'Pie']): 1, frozenset(['Cookie', 'pizza']): 1, frozenset(['Pie', 'pizza']): 1, frozenset(['Cheese', 'Cookie', 'pizza']): 1, frozenset(['Cheese', 'Pie', 'pizza']): 1})
    [(frozenset(['Cheese', 'Cookie']), 3), (frozenset(['Cheese', 'pizza']), 2), (frozenset(['Cheese', 'Pie']), 2), (frozenset(['Cookie', 'Pie']), 1), (frozenset(['Cheese', 'Cookie', 'Pie']), 1), (frozenset(['Cookie', 'pizza']), 1), (frozenset(['Pie', 'pizza']), 1), (frozenset(['Cheese', 'Cookie', 'pizza']), 1), (frozenset(['Cheese', 'Pie', 'pizza']), 1)]
    

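    【Discussion】:

    • Apparently it only works when there are exactly 2 items; when there are 3 items of which 2 match, it does not count them toward the pattern.
    • @DarrylDan, of course not, but your sample input contained only pairs, so the answer was based on that.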