Excel performance on pivot-table aggregations: answers

【Question title】: Excel performance on Pivot-table aggregations
【Posted】: 2019-07-29 23:50:46
【Question】:

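In my use of Excel, I have always been amazed at how well Excel performs on the following two aggregation operations:

Date/time aggregations.
Case-insensitive aggregations.

How does Excel achieve this performance? Does it keep additional data structures for pivot-related information and aggregations? Is this documented anywhere, or where could I find out more? I have looked at the LibreOffice source code, but the actual product does not come close to Excel in aggregation/pivot performance.

I would appreciate it if someone who knows Excel could share more about the low-level aggregation behavior or structures Excel uses to achieve this performance. For example, does it store each label twice, once in its native case and once lowercased for aggregation purposes? I understand this question is broad and more conceptual than a code answer per se, but I hope the answers can serve as a good reference for ways to optimize Excel-style aggregation performance.

Following some suggestions from ARGeo, I noticed the following:

(1) There are two files related to the Pivot Cache. Definitions (field-level information):

(2) And Records (row/cell-level information):

That leaves a few questions:

How does Excel decide when to store a value as-is and when to store it as a shared item? For example, why is the value "LifeLock" in B2 (a mixed-case string) stored as-is, while the value "AZ" in F2 is stored in sharedItems (v="0")?
Is there any information on the internal C/C++ structs Excel uses in memory for its pivotCache (as opposed to the various XML documents it is stored as)?
Is there any information on how the "helper information" stored at the field level is used internally by Excel? For example, this information: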

<cacheField name="numEmps" numFmtId="0"><sharedItems containsString="0" containsBlank="1" containsNumber="1" containsInteger="1" minValue="0" maxValue="20000"/></cacheField>

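【Comments】:

This question is too broad for Stack Overflow. How a monolithic application achieves first-rate performance on some operation could take a whole series of blog posts to cover.
Why the surprise? It is software designed to do exactly this sort of thing.
In the things you noted, you are conflating how the data is serialized to and from disk with how the data structures operate in memory. The two are only very loosely related, so inspecting the serialized data tells you almost nothing about the performance of the in-memory data structures.
The big-O complexity of creating a pivot table from n rows of data and m rows+columns+filters is basically O(n*m). In some edge cases it can blow up to O(n log n * m), but only when you set up the pivot table in an unreasonable way (adding a double-valued field as a row and sorting on it). It should be fast.
@MineR Would you be willing to share a basic implementation of that algorithm using the sample data above (or an example of the O(n log n * m) case), to show the basic algorithm behind creating a pivot table?

Tags: c# c++ excel xml libreoffice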

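【Solution 1】:

Pivot table performance is based on the Pivot Cache. Official documentation on this topic is scarce, but I found some interesting posts and MS documentation.

Definition:

The Pivot Cache is a special memory area that holds the records of a pivot table.

When you create a Pivot Table, Excel takes a copy of the source data and stores it in the Pivot Cache. The Pivot Cache is held in Excel's memory. You cannot see it, but it is the data the Pivot Table references when you build it.

This enables Excel to be very responsive to changes in the Pivot Table, but it can also double the size of your file. After all, the Pivot Cache is just a copy of the source data, so it makes sense that the file size may double.

Use this link and this link as starting points for more information.

You can also read the posts Pivot Cache in Excel 101 and Excel Pivot Cache 101 to learn what the cache is and what side effects it has.

Here are some VB code snippets and examples of how to use the PivotCache object.

Here is code written in C# that creates an Excel workbook with some Pivot Tables, using a Pivot Cache of course: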

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Reflection;
using Excel = Microsoft.Office.Interop.Excel;
using System.IO;
using System.Diagnostics;
using System.Configuration;
using System.Data.SqlClient;
using System.Data;
 
namespace ConsoleApplication1 {

    class Program {
 
        static void Main(string[] args) {
 
            Excel.Application objApp;
            Excel.Workbook objBook;
            Excel.Sheets objSheets;
            Excel.Workbooks objBooks;
 
            string command = (@"SELECT * FROM dbo.Client");
 
            using (SqlConnection connection = new SqlConnection(GetConnectionStringByName("CubsPlus"))) {

                DataTable data = new DataTable();

                try {
                    connection.Open();
                }
                catch (Exception e) {
                    StackTrace st = new StackTrace(new StackFrame(true));
                    StackFrame sf = st.GetFrame(0);
                    Console.WriteLine (e.Message + "\n" + "Method" + sf.GetMethod().ToString() + "\n" + "Line" + sf.GetFileLineNumber().ToString());
                }
                try {
                    // DataTools is a helper library that is not shown in this answer.
                    data = DataTools.SQLQueries.getDataTableFromQuery(connection, command);
 
                    if (data == null) {
                        throw new ArgumentNullException();
                    }
                }
                catch (Exception e) {

                    StackTrace st = new StackTrace(new StackFrame(true));
                    StackFrame sf = st.GetFrame(0);
                    Console.WriteLine (e.Message + "\n" + "Method" + sf.GetMethod().ToString() + "\n" + "Line" + sf.GetFileLineNumber().ToString());
                }
 
                objApp = new Excel.Application();

                try {     
                    objBooks = objApp.Workbooks;
                    objBook = objApp.Workbooks.Add(Missing.Value);
                    objSheets = objBook.Worksheets;
 
                    Excel.Worksheet sheet1 = (Excel.Worksheet)objSheets[1];
                    sheet1.Name = "ACCOUNTS";
                    string message = DataTools.Excel.copyDataTableToExcelSheet(data, sheet1);

                    if (message != null) {
                        Console.WriteLine("Problem importing the data to Excel");
                        Console.WriteLine(message);
                        Console.ReadLine();
                    }
                         
                    //CREATE A PIVOT CACHE BASED ON THE EXPORTED DATA
                    Excel.PivotCache pivotCache = objBook.PivotCaches().Add(Excel.XlPivotTableSourceType.xlDatabase,sheet1.UsedRange);
 
                    Console.WriteLine(pivotCache.SourceData.ToString());
                    
                    Console.ReadLine();
 
                    //WORKSHEET FOR NEW PIVOT TABLE
                    Excel.Worksheet sheet2 = (Excel.Worksheet)objSheets[2];
                    sheet2.Name = "PIVOT1";
                    
                    //PIVOT TABLE BASED ON THE PIVOT CACHE OF EXPORTED DATA
                    Excel.PivotTables pivotTables = (Excel.PivotTables)sheet2.PivotTables(Missing.Value);
                    Excel.PivotTable pivotTable = pivotTables.Add(pivotCache, objApp.ActiveCell, "PivotTable1", Missing.Value, Missing.Value);
 
                    pivotTable.SmallGrid = false;
                    pivotTable.TableStyle = "PivotStyleLight1";
 
                    //ADDING PAGE FIELD
                    Excel.PivotField pageField = (Excel.PivotField)pivotTable.PivotFields("ParentName");
                    pageField.Orientation = Excel.XlPivotFieldOrientation.xlPageField;
 
                    //ADDING ROW FIELD
                    Excel.PivotField rowField = (Excel.PivotField)pivotTable.PivotFields("State");
                    rowField.Orientation = Excel.XlPivotFieldOrientation.xlRowField;
 
                    //ADDING DATA FIELD
                    pivotTable.AddDataField(pivotTable.PivotFields("SetupDate"), "average setup date", Excel.XlConsolidationFunction.xlAverage);
 
                    ExcelSaveAs(objApp, objBook, @"J:\WBK");
 
                    objApp.Quit();
                }     
                catch (Exception e) {

                    objApp.Quit();
                    Console.WriteLine(e.Message);
                    Console.ReadLine();
                }
            }
        }
 
        static string ExcelSaveAs(Excel.Application objApp, Excel.Workbook objBook, string path) {
            try {
                objApp.DisplayAlerts = false;
                objBook.SaveAs(path, Excel.XlFileFormat.xlExcel7, Missing.Value, Missing.Value, Missing.Value, Missing.Value, Excel.XlSaveAsAccessMode.xlNoChange, Missing.Value, Missing.Value, Missing.Value, Missing.Value, Missing.Value);
                objApp.DisplayAlerts = true;
                return null;
            }
            catch (Exception e) {
                StackTrace st = new StackTrace(new StackFrame(true));
                StackFrame sf = st.GetFrame(0);
                return (e.Message + "\n" + "Method" + sf.GetMethod().ToString() + "\n" + "Line" + sf.GetFileLineNumber().ToString());
            }
        }
        static string GetConnectionStringByName(string name) {
            //ASSUME FAILURE
            string returnValue = null;
 
            //Look for the name in the connectionStrings section
            ConnectionStringSettings settings = ConfigurationManager.ConnectionStrings[name];
 
            // If found, return the connection string
            if (settings != null) {
                returnValue = settings.ConnectionString;
            }
            return returnValue;
        }
    }
}

Here is code written in VB that creates a new Pivot Cache for the selected Pivot Table:

Sub SelPTNewCache()

    Dim wsTemp As Worksheet
    Dim pt As PivotTable
    
    On Error Resume Next
    Set pt = ActiveCell.PivotTable
    
    If pt Is Nothing Then
        MsgBox "Active cell is not in a pivot table"
    Else
        Set wsTemp = Worksheets.Add
        
        ActiveWorkbook.PivotCaches.Create( _
            SourceType:=xlDatabase, _
            SourceData:=pt.SourceData).CreatePivotTable _
            TableDestination:=wsTemp.Range("A3"), _
            TableName:="PivotTableTemp"
        
        pt.CacheIndex = wsTemp.PivotTables(1).CacheIndex
        
        Application.DisplayAlerts = False
        wsTemp.Delete
        Application.DisplayAlerts = True
    End If
    
exitHandler:
        Set pt = Nothing

End Sub

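1. Your asd.js file contains the following elements:

– s represents a string value

– n represents a numeric value

– d represents a date value

– x represents an index value

– v represents the value itself

So let's translate into human language the data contained in cell F2 of this table: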

<x v="0"/>

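The value 0 is the zero index into the array of strings that stores the US state abbreviations. The first entry in that array gives us Arizona (AZ). I don't know why the cell in the next row contains lowercase az while all the others contain uppercase AZ, but I'm sure it has nothing to do with the Shared Record.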
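To make the index lookup concrete, here is a minimal Python sketch of dictionary encoding, the general technique that `<x v="0"/>` reflects: each distinct value is stored once and every row holds a small integer index. The function name `encode_column` and the decision to fold case with `str.casefold()` are assumptions made for illustration; nothing here is confirmed to be what Excel does internally.

```python
def encode_column(values):
    """Dictionary-encode a column, treating strings case-insensitively.

    Returns (shared_items, indices): shared_items keeps the first spelling
    seen for each distinct value, indices holds one integer per row.
    """
    shared_items = []   # distinct values, in first-seen order
    position = {}       # folded key -> index into shared_items
    indices = []
    for value in values:
        # Fold case so "AZ" and "az" map to the same key (an assumption).
        key = value.casefold() if isinstance(value, str) else value
        if key not in position:
            position[key] = len(shared_items)
            shared_items.append(value)
        indices.append(position[key])
    return shared_items, indices


states = ["AZ", "az", "CA", "AZ", "NY", "ca"]
shared, idx = encode_column(states)
print(shared)  # ['AZ', 'CA', 'NY']
print(idx)     # [0, 0, 1, 0, 2, 1]
```

Once a column is encoded this way, grouping compares integers rather than strings, which is one plausible reason case-insensitive aggregation can be cheap.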

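2. I did not find any useful information about the internal C/C++ structs Excel uses in memory for its pivotCache.

And finally:

3. Here is a LINK with useful information about the "helper information" from your third additional question.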
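The attributes in the question's cacheField example (containsString, containsBlank, containsNumber, containsInteger, minValue, maxValue) look like a summary that can be gathered in one pass over a column. The Python sketch below shows how such flags could be computed; the function name `summarize_field` and the use of None for a blank cell are assumptions, and how Excel actually uses these flags is not covered by the sources here.

```python
def summarize_field(values):
    """Compute sharedItems-style summary flags for one column in one pass.

    None stands for a blank cell. The attribute names mirror the XML in
    the question; the logic is an illustrative guess, not Excel's code.
    """
    summary = {
        "containsString": False,
        "containsBlank": False,
        "containsNumber": False,
        "containsInteger": True,   # stays True only if every number is whole
        "minValue": None,
        "maxValue": None,
    }
    for value in values:
        if value is None:
            summary["containsBlank"] = True
        elif isinstance(value, str):
            summary["containsString"] = True
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            summary["containsNumber"] = True
            if isinstance(value, float) and not value.is_integer():
                summary["containsInteger"] = False
            if summary["minValue"] is None or value < summary["minValue"]:
                summary["minValue"] = value
            if summary["maxValue"] is None or value > summary["maxValue"]:
                summary["maxValue"] = value
    if not summary["containsNumber"]:
        # No numbers at all, so the "all integers" flag does not apply.
        summary["containsInteger"] = False
    return summary


num_emps = [50, None, 20000, 0, 375]
print(summarize_field(num_emps))
```

For this sample column the result has containsBlank, containsNumber and containsInteger set, with minValue 0 and maxValue 20000, which matches the shape of the numEmps field in the question. A reader of the cache could use such flags to pick a compact numeric representation or to skip string handling for the field, though whether Excel does so is an assumption.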

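P.S.

About Big O notation.

Big O notation is used in computer science to describe the performance or complexity of an algorithm. It gives an upper bound on growth, most often quoted for the worst case, and can describe either the execution time an algorithm requires or the space it uses (in memory or on disk). It measures complexity as a function of input size.

O(1) describes an algorithm that always takes the same time regardless of the size of the input data set.
O(N) describes an algorithm whose running time grows linearly, in direct proportion to the size of the input data set.
O(N*N) describes an algorithm whose running time is proportional to the square of the size of the input data set.
T(N) = O(log N) describes an algorithm that runs in logarithmic time. Logarithmic-time algorithms are common in operations on balanced binary trees and in binary search.

Good general-purpose sorting algorithms, however, run in O(N log N). One example is merge sort, which splits an array into two halves, sorts each half by calling itself recursively, and then merges the results back into a single array.

Here is a snippet of abstract C# code showing how an O(N log N) algorithm works (roughly the same approach can be used when creating a pivot table):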

public static int[] MergeSort(int[] inputItems, int lowerBound, int upperBound) {
    if (lowerBound < upperBound) {
        int middle = (lowerBound + upperBound) / 2;
        MergeSort(inputItems, lowerBound, middle);
        MergeSort(inputItems, middle + 1, upperBound);
 
        int[] leftArray = new int[middle - lowerBound + 1];
        int[] rightArray = new int[upperBound - middle];
 
        Array.Copy(inputItems, lowerBound, leftArray, 0, middle - lowerBound + 1);
        Array.Copy(inputItems, middle + 1, rightArray, 0, upperBound - middle);
 
        int i = 0;
        int j = 0;
        for (int count = lowerBound; count < upperBound + 1; count++) {
            if (i == leftArray.Length) {
                inputItems[count] = rightArray[j];
                j++;
            }
            else if (j == rightArray.Length) {
                inputItems[count] = leftArray[i];
                i++;
            }
            else if (leftArray[i] <= rightArray[j]) {
                inputItems[count] = leftArray[i];
                i++;
            }
            else {
                inputItems[count] = rightArray[j];
                j++;
            }
        }
    }
    return inputItems;
}
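One of the comments on the question estimates pivot-table creation at O(n*m) for n rows and m fields. A minimal Python sketch of a single-pass hash aggregation shows where that bound comes from. The function name `pivot_sum` and the sample rows are hypothetical, and this is not claimed to be Excel's algorithm.

```python
from collections import defaultdict


def pivot_sum(rows, group_fields, value_field):
    """Sum value_field for each distinct combination of group_fields.

    Each row costs O(m) to build its key (m = len(group_fields)) plus an
    amortized O(1) dictionary update, so n rows cost O(n * m) overall.
    """
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[field] for field in group_fields)
        totals[key] += row[value_field]
    return dict(totals)


# Hypothetical sample data, loosely modeled on the question's columns.
rows = [
    {"State": "AZ", "ParentName": "LifeLock", "numEmps": 50},
    {"State": "AZ", "ParentName": "Acme", "numEmps": 20},
    {"State": "CA", "ParentName": "Acme", "numEmps": 5},
    {"State": "AZ", "ParentName": "LifeLock", "numEmps": 25},
]

by_state = pivot_sum(rows, ["State"], "numEmps")
print(by_state)  # {('AZ',): 95.0, ('CA',): 5.0}

by_both = pivot_sum(rows, ["State", "ParentName"], "numEmps")
# Sorting the row labels for display adds an O(k log k) step for k groups.
for key in sorted(by_both):
    print(key, by_both[key])
```

The aggregation itself needs no sorting. A log factor only appears when the resulting labels are sorted for display, and it becomes significant when nearly every row forms its own group, as with the double-valued row field described in the comments.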

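【Comments】:

Cheers for the answer. It is very helpful and points me in the right direction. Is it possible to see what the actual pivot cache is in Excel?
David, can you tell me what exactly you mean by ...what the actual pivot cache is...?
Sure, let me update the question. Apologies for the slow follow-up.
Wow, this is a very comprehensive and excellent answer. Could you give a very short C# example of an O(n*m) algorithm, such as the one mentioned in one of the comments about generating a pivot table cache, to complete the answer? I will award the bounty.
Thanks a lot, David! I have been working on a macOS app and have had no spare time for any other work. Thanks for the suggestion.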

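【Solution 2】:

Contrary to popular belief, pivot tables are not just an Excel feature. They exist in many applications that handle tabular, structured numerical data. A pivot table is the interactive result of two general concepts: visualization and aggregation of data by category.
Pivot tables are always linked to the data they are derived from.
When you create a pivot table, Excel builds a special memory cache in the background that contains your data. This pivot cache stores a copy of the data in the source data range.
Pivot tables share a pivot cache if they reference the same source data range. This helps reduce file size and spares us from refreshing each pivot table that shares that source data range separately.

The relationship between pivot tables and pivot caches can get complicated, especially because the pivot cache is stored in the background and there is no direct way to see which pivot tables in the workbook share a pivot cache.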

Anatomy of Spreadsheet File

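PivotCache Class: defines the PivotCache class. When the object is serialized out as XML, its qualified name is x:pivotCache.
PivotCache Members (Excel): represents the memory cache for a pivot table.
Represents a base class from which all elements in an Office Open XML document derive.
The OpenXML specification is a large and complicated beast.
cacheField (PivotCache Field): represents a single field in the pivot cache. This definition contains information about the field, such as its source, its data type, and its position within a level or hierarchy. The sharedItems element stores additional information about the data in this field. If there are no shared items, the values are stored directly in the pivotCacheRecords part.
SharedItems Class: defines the SharedItems class. When the object is serialized out as XML, its qualified name is x:sharedItems.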
How to create pivot table in C++
How do update pivot table data from C# code
Use C++ to show memory use in an Excel Pivot table
How to create excel pivot table from C++ (ole/com without mfc)
How to Create Pivot Table in Excel in C#.NET Code
How to Export Data to One Worksheet and Create Pivot Table in Another Based on the Data
How to add PivotTables and Slicers to MS Excel programmatically
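
The cacheField and sharedItems entries above say that a record either holds a value directly or refers to the field's shared items. A minimal Python sketch using `xml.etree.ElementTree` shows how such records could be resolved. The XML fragments are simplified and hand-written for illustration (namespaces and most attributes are omitted, and the value "Acme" is made up), and the function name `decode_records` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragments (namespaces omitted for brevity).
DEFINITION = """
<cacheFields>
  <cacheField name="ParentName"><sharedItems/></cacheField>
  <cacheField name="State">
    <sharedItems><s v="AZ"/><s v="CA"/></sharedItems>
  </cacheField>
</cacheFields>
"""

RECORDS = """
<pivotCacheRecords>
  <r><s v="LifeLock"/><x v="0"/></r>
  <r><s v="Acme"/><x v="1"/></r>
</pivotCacheRecords>
"""


def decode_records(definition_xml, records_xml):
    """Return rows as dicts, resolving <x v="i"/> through sharedItems."""
    fields = []
    for cache_field in ET.fromstring(definition_xml):
        shared = cache_field.find("sharedItems")
        items = [item.get("v") for item in shared] if shared is not None else []
        fields.append((cache_field.get("name"), items))

    rows = []
    for record in ET.fromstring(records_xml):
        row = {}
        for (name, items), cell in zip(fields, record):
            if cell.tag == "x":
                # Index into this field's shared items.
                row[name] = items[int(cell.get("v"))]
            else:
                # Value stored directly in the record (s, n, d, ...).
                row[name] = cell.get("v")
        rows.append(row)
    return rows


for row in decode_records(DEFINITION, RECORDS):
    print(row)
```

A real workbook keeps these parts in separate pivotCacheDefinition and pivotCacheRecords files inside the .xlsx package and qualifies every element with the SpreadsheetML namespace, so the tag comparisons would need to account for that.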

【Comments】: