Conditional formatting for partial duplicates in Pandas and Excel: answers

【Question title】: Conditional formatting for partial duplicates in Pandas and Excel
【Posted】: 2017-12-02 13:04:38
【Question】:

I have the following csv data in a file named reviews.csv:

Movie,Reviewer,Sentence,Tag,Sentiment,Text,
Jaws,John,s1,Plot,Positive,The plot was great,
Jaws,Mary,s1,Plot,Positive,The plot was great,
Jaws,John,s2,Acting,Positive,The acting was OK,
Jaws,Mary,s2,Acting,Neutral,The acting was OK,
Jaws,John,s3,Scene,Positive,The visuals blew me away,
Jaws,Mary,s3,Effects,Positive,The visuals blew me away,
Vertigo,John,s1,Scene,Negative,The scenes were terrible,
Vertigo,Mary,s1,Acting,Negative,The scenes were terrible,
Vertigo,John,s2,Plot,Negative,The actors couldn’t make the story believable,
Vertigo,Mary,s2,Acting,Positive,The actors couldn’t make the story believable,
Vertigo,John,s3,Effects,Negative,The effects were awful,
Vertigo,Mary,s3,Effects,Negative,The effects were awful,

My goal is to convert this csv file into an Excel spreadsheet with conditional formatting. Specifically, I want to apply the following rules:

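If the "Movie", "Sentence", "Tag", and "Sentiment" values are all the same, the whole row should be green.
If the "Movie", "Sentence", and "Tag" values are the same but the "Sentiment" value differs, the row should be blue.
If the "Movie" and "Sentence" values are the same but the "Tag" value differs, the row should be red.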
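
The three rules can be read as nested duplicate checks over progressively longer column subsets. Below is a minimal sketch of that reading, assuming "the same" means that at least one other row shares those column values. The helper name `classify_rows` and the inlined sample data are illustrative and not part of the original post.

```python
import io

import numpy as np
import pandas as pd

# A few rows from reviews.csv, inlined so the sketch runs on its own.
CSV = """Movie,Reviewer,Sentence,Tag,Sentiment
Jaws,John,s1,Plot,Positive
Jaws,Mary,s1,Plot,Positive
Jaws,John,s2,Acting,Positive
Jaws,Mary,s2,Acting,Neutral
Jaws,John,s3,Scene,Positive
Jaws,Mary,s3,Effects,Positive
"""


def classify_rows(df):
    """Return a Series of 'green' / 'blue' / 'red' / '' labels, one per row."""
    same_sentence = df.duplicated(subset=["Movie", "Sentence"], keep=False)
    same_tag = df.duplicated(subset=["Movie", "Sentence", "Tag"], keep=False)
    same_all = df.duplicated(
        subset=["Movie", "Sentence", "Tag", "Sentiment"], keep=False
    )
    # np.select takes the first condition that holds, so the strictest rule goes first.
    labels = np.select(
        [same_all, same_tag, same_sentence], ["green", "blue", "red"], default=""
    )
    return pd.Series(labels, index=df.index, name="color")


df = pd.read_csv(io.StringIO(CSV))
df["color"] = classify_rows(df)
print(df[["Movie", "Reviewer", "Sentence", "color"]])
```

Rows that share nothing with any other row get an empty label, which corresponds to leaving them uncolored.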

So I want to create an Excel spreadsheet (.xlsx) that looks like this:

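I have been looking at the Pandas styling documentation and the conditional formatting tutorial for XlsxWriter, but I can't seem to put the two together. Here is what I have so far: I can read the csv into a Pandas dataframe, sort it (though I'm not sure that is necessary), and write it back out as an Excel spreadsheet. How do I do the conditional formatting, and where does that code go?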

import pandas as pd

def csv_to_xls(source_path, dest_path):
    """
    Convert a csv file to a formatted xlsx spreadsheet
    Input: path to hospital review csv file
    Output: formatted xlsx spreadsheet
    """
    #Read the source file and convert to Pandas dataframe
    df = pd.read_csv(source_path)

    #Sort by movie, then by sentence number (column names as in reviews.csv)
    df.sort_values(['Movie', 'Sentence'], ascending=[True, True], inplace=True)

    #Create the xlsx file that we'll be writing to
    orig = pd.ExcelWriter(dest_path, engine='xlsxwriter')

    #Convert the dataframe to Excel, create the sheet
    df.to_excel(orig, index=False, sheet_name='report')

    #Variables for the workbook and worksheet
    workbook = orig.book
    worksheet = orig.sheets['report']

    #Formats for exact, partial, and mismatched rows
    exact = workbook.add_format({'bg_color':'#B7F985'}) #green
    partial = workbook.add_format({'bg_color':'#D3F6F4'}) #blue
    mismatch = workbook.add_format({'bg_color':'#F6D9D3'}) #red

    #Do the conditional formatting somehow

    #close() writes the file; ExcelWriter.save() was removed in pandas 2.0
    orig.close()

【Question comments】:

Tags: python excel csv pandas

【Solution 1】:

Disclaimer: I am one of the authors of the library I am about to recommend.

This is easy to do with StyleFrame and DataFrame.duplicated:

import pandas as pd
from styleframe import StyleFrame, Styler

# Load the data from the question
df = pd.read_csv('reviews.csv')

sf = StyleFrame(df)

green = Styler(bg_color='#B7F985')
blue = Styler(bg_color='#D3F6F4')
red = Styler(bg_color='#F6D9D3')

sf.apply_style_by_indexes(sf[df.duplicated(subset=['Movie', 'Sentence'], keep=False)],
                          styler_obj=red)
sf.apply_style_by_indexes(sf[df.duplicated(subset=['Movie', 'Sentence', 'Tag'], keep=False)],
                          styler_obj=blue)
sf.apply_style_by_indexes(sf[df.duplicated(subset=['Movie', 'Sentence', 'Tag', 'Sentiment'],
                                           keep=False)],
                          styler_obj=green)

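# to_excel returns the ExcelWriter; on pandas 2.0+, where ExcelWriter.save()
# was removed, call .close() instead
sf.to_excel('test.xlsx').save()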

This produces the following output:

【Comments】:

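Just came across this while googling how to color cells in pandas. Very cool package. It takes quite a while on a dataframe of about 100k rows. Is that normal? Is there any way to speed it up?
@SCool Unfortunately, as you found, speed is a problem for larger dataframes, because StyleFrame has to iterate over every cell to apply the styles. I haven't found a faster way yet. That said, I can generate a file with 100k rows and 2 columns in a few seconds.
OK, I have 60 columns. It takes a few minutes. A useful package anyway. :)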
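
On the speed concern raised above, one possible alternative is to skip per-cell styling and set one XlsxWriter format per row, so the styling work grows with the number of rows instead of the number of cells. The sketch below is not StyleFrame's method and has not been benchmarked. The helper names `row_colors` and `write_colored` are hypothetical, and it assumes the `xlsxwriter` engine is installed.

```python
import pandas as pd

GREEN, BLUE, RED = "#B7F985", "#D3F6F4", "#F6D9D3"


def row_colors(df):
    """Return one background color (or None) per row; stricter rules overwrite looser ones."""
    colors = pd.Series([None] * len(df), index=df.index, dtype=object)
    colors.loc[df.duplicated(subset=["Movie", "Sentence"], keep=False)] = RED
    colors.loc[df.duplicated(subset=["Movie", "Sentence", "Tag"], keep=False)] = BLUE
    colors.loc[
        df.duplicated(subset=["Movie", "Sentence", "Tag", "Sentiment"], keep=False)
    ] = GREEN
    return colors.tolist()


def write_colored(df, dest_path, sheet_name="report"):
    """Write df to dest_path, coloring whole rows through XlsxWriter row formats."""
    colors = row_colors(df)
    with pd.ExcelWriter(dest_path, engine="xlsxwriter") as writer:
        df.to_excel(writer, index=False, sheet_name=sheet_name)
        workbook = writer.book
        worksheet = writer.sheets[sheet_name]
        # One format object per distinct color, reused across rows.
        formats = {c: workbook.add_format({"bg_color": c}) for c in set(colors) if c}
        for i, color in enumerate(colors):
            if color:
                # Worksheet row 0 holds the header, so data row i is row i + 1.
                worksheet.set_row(i + 1, None, formats[color])
```

A call such as `write_colored(pd.read_csv("reviews.csv"), "report.xlsx")` would produce the file. Two caveats apply: a row format colors the entire row, including empty cells to the right of the data, and it does not apply to cells that pandas already writes with their own format, such as the header or date cells.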