OpenXML gets slower per row with larger files? Answers

[Question title]: OpenXML gets slower per row with larger files?
[Posted]: 2012-07-05 19:41:38
[Question]:

I am building an application in VB.NET that reads rows from an Excel file and loads them into a DataTable.

dtRow = dataTable.NewRow()
Dim startTime As DateTime = DateTime.Now

dtRow("name") = suppliers.CellValue("A", rowCount)
' SNIP - just more string retrieval
dtRow("statistics") = suppliers.CellValue("P", rowCount)

dataTable.Rows.Add(dtRow)

Dim endTime As DateTime = DateTime.Now

' Note: "fffffff" prints the fractional seconds in 100 ns ticks, so 0510051 means about 51 ms.
Debug.Print(String.Format("Time elapsed to retrieve '{0}': {1} ms", rowCount, (endTime - startTime).ToString("fffffff")))

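CellValue is my own creation, but it is a small function and I have timed it. It is quite fast.

However, when I open a 10,000-row Excel file (filled with the same data), the time to process each row goes up.

3,000 rows: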

Time elapsed to retrieve '2': 0510051 ms
Time elapsed to retrieve '3': 0500050 ms
Time elapsed to retrieve '4': 0340034 ms
Time elapsed to retrieve '5': 0350035 ms
Time elapsed to retrieve '6': 0340034 ms
Time elapsed to retrieve '7': 0340034 ms
Time elapsed to retrieve '8': 0350035 ms

6,000 rows:

Time elapsed to retrieve '2': 0710071 ms
Time elapsed to retrieve '3': 0760076 ms
Time elapsed to retrieve '4': 0620062 ms
Time elapsed to retrieve '5': 0670067 ms
Time elapsed to retrieve '6': 0750075 ms
Time elapsed to retrieve '7': 0750075 ms
Time elapsed to retrieve '8': 0700070 ms

10,000 rows:

Time elapsed to retrieve '2': 0920092 ms
Time elapsed to retrieve '3': 0920092 ms
Time elapsed to retrieve '4': 1790179 ms
Time elapsed to retrieve '5': 1810181 ms
Time elapsed to retrieve '6': 1930193 ms
Time elapsed to retrieve '7': 2240224 ms
Time elapsed to retrieve '8': 1820182 ms

Why does this happen? Can I fix it?

Edit: suppliers is a class I created to handle Excel files, with this constructor:

Public Sub New(ByVal doc As SpreadsheetDocument, ByVal sheetName As String)
    pWorkbookPart = doc.WorkbookPart

    Dim sheet As Sheet = pWorkbookPart.Workbook.Descendants(Of Sheet).Where(Function(s) s.Name = sheetName).FirstOrDefault()

    pWorksheetPart = CType(pWorkbookPart.GetPartById(sheet.Id), WorksheetPart)

    pSharedStringTable = pWorkbookPart.GetPartsOfType(Of SharedStringTablePart).FirstOrDefault()
End Sub

CellValue:

Public Function CellValue(ByVal column As String, ByVal row As Integer) As String
    Dim cellAddress As String = column & row
    Dim cell As Cell = pWorksheetPart.Worksheet.Descendants(Of Cell).Where(Function(c) c.CellReference = cellAddress).FirstOrDefault()

    Dim index As Integer
    Dim returnValue As String

    If cell IsNot Nothing Then
        If cell.DataType IsNot Nothing Then
            index = Integer.Parse(cell.InnerText)
            returnValue = pSharedStringTable.SharedStringTable.ElementAt(index).InnerText
        Else
            returnValue = CStr(cell.InnerText)
        End If
    End If

    Return returnValue
End Function

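[Comments on the question]:

Sounds like memory pressure is causing garbage collection to take up more wall-clock time.
What is the type of suppliers?
@sixlettervariables Any suggestions? Should I do the processing without DataRows first and then put everything into the DataTable? Would that fix anything?
@JohnKoerner I have added it to the question.
Can you add the definition of CellValue on that class?

Tags: .net vb.net openxml openxml-sdk

[Solution 1]:

If your string table gets very large, one possible problem is that ElementAt may not be optimized for the SharedStringTable and has to walk it from the start on every call. Since this table is static with respect to your processing, I suggest pulling that part out and storing it in a List(Of String) or an array: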

' Use this instead of pSharedStringTable
Dim sharedStringTable As New List(Of String)

' Initialize your string table once, up front
sharedStringTable.AddRange( _
    From item In pSharedStringTable.SharedStringTable _
    Select item.InnerText)

' Now sharedStringTable.ElementAt(index) takes the fast indexed path for lists,
' or you can simply use sharedStringTable(index)

Another possible problem is the repeated linear search for cells by reference. Instead, you should convert the cells into a dictionary:

Dim cells As New Dictionary(Of String, Cell)
For Each c As Cell In pWorksheetPart.Worksheet.Descendants(Of Cell)()
    ' Key each cell by its reference, e.g. "A2"
    cells.Add(c.CellReference.Value, c)
Next c
' Only one pass over the worksheet's cells using this method

In each case you are trading memory for time, and in both cases I think the trade works in your favor:

' Revised lookup using data structures optimized for common access
Dim cell As Cell = Nothing
If cells.TryGetValue(cellAddress, cell) Then
    If cell.DataType IsNot Nothing Then
        index = Integer.Parse(cell.InnerText)
        returnValue = sharedStringTable(index)
    Else
        returnValue = CStr(cell.InnerText)
    End If
End If
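
For illustration, the two caches could be combined into a single reader class. This is only a sketch, not the poster's actual code: the class name CachedSheetReader and its field names are made up here, and it assumes the Open XML SDK (DocumentFormat.OpenXml) is referenced. Unlike the lookup in the question, it also checks that DataType is SharedString before consulting the string table.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Spreadsheet

' Hypothetical class name; a sketch of the caching idea, not a drop-in replacement.
Public Class CachedSheetReader
    Private ReadOnly sharedStrings As New List(Of String)
    Private ReadOnly cellsByReference As New Dictionary(Of String, Cell)

    Public Sub New(ByVal doc As SpreadsheetDocument, ByVal sheetName As String)
        Dim wbPart As WorkbookPart = doc.WorkbookPart
        Dim sheet As Sheet = wbPart.Workbook.Descendants(Of Sheet)().FirstOrDefault(Function(s) s.Name.Value = sheetName)
        If sheet Is Nothing Then Throw New ArgumentException("Sheet not found: " & sheetName)
        Dim wsPart As WorksheetPart = CType(wbPart.GetPartById(sheet.Id.Value), WorksheetPart)

        ' Copy the shared strings once so later lookups are indexed, not scanned.
        Dim stringPart As SharedStringTablePart = wbPart.GetPartsOfType(Of SharedStringTablePart)().FirstOrDefault()
        If stringPart IsNot Nothing Then
            For Each item As SharedStringItem In stringPart.SharedStringTable.Elements(Of SharedStringItem)()
                sharedStrings.Add(item.InnerText)
            Next
        End If

        ' Walk the worksheet once and key every cell by its reference ("A2", "P17", ...).
        For Each c As Cell In wsPart.Worksheet.Descendants(Of Cell)()
            If c.CellReference IsNot Nothing Then cellsByReference(c.CellReference.Value) = c
        Next
    End Sub

    ' Returns the cell text, or Nothing when the cell does not exist.
    Public Function CellValue(ByVal column As String, ByVal row As Integer) As String
        Dim c As Cell = Nothing
        If Not cellsByReference.TryGetValue(column & row.ToString(), c) Then Return Nothing
        If c.DataType IsNot Nothing AndAlso c.DataType.Value = CellValues.SharedString Then
            Return sharedStrings(Integer.Parse(c.InnerText))
        End If
        Return c.InnerText
    End Function
End Class
```

The constructor pays for one full pass over the sheet, after which each CellValue call is a dictionary lookup plus, for shared strings, a list index.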

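[Comments]:

While this is a good idea, I don't think it's the answer, because my whole Excel document is the same row pasted over and over. It has maybe 10 unique strings. Still, it would be worth implementing in the real application.
@DavidB: I ended up adding a suggestion about the cell lookup problem.

[Solution 2]:

This line looks suspicious: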

Dim cell As Cell = pWorksheetPart.Worksheet.Descendants(Of Cell).Where(Function(c) c.CellReference = cellAddress).FirstOrDefault()

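The .Where() condition is evaluated against every cell in the spreadsheet. As the number of rows grows, the number of cell-address comparisons grows with it (rows x columns). Even though a single cell reference comparison is very cheap, it adds up quickly.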
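
To get a feel for how that cost scales, here is a small stand-alone sketch that uses plain strings in place of real cells. Everything in it (the module name, the 16 columns A to P, the choice of the last cell as the target) is an assumption made for illustration. It times a worst-case linear scan against a dictionary lookup at the three sheet sizes from the question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Diagnostics
Imports System.Linq

Module LookupCostSketch
    ' Builds references A1..P<rows> in row-major order, like cells in a worksheet.
    Function BuildReferences(ByVal rows As Integer) As List(Of String)
        Dim refs As New List(Of String)
        For r As Integer = 1 To rows
            For Each col As Char In "ABCDEFGHIJKLMNOP"
                refs.Add(col & r.ToString())
            Next
        Next
        Return refs
    End Function

    Sub Main()
        For Each rows As Integer In {3000, 6000, 10000}
            Dim refs As List(Of String) = BuildReferences(rows)
            ' The last cell is the worst case for a scan: every reference is compared.
            Dim target As String = "P" & rows.ToString()

            Dim sw As Stopwatch = Stopwatch.StartNew()
            Dim found As String = refs.Where(Function(s) s = target).FirstOrDefault()
            sw.Stop()
            Dim scanTicks As Long = sw.ElapsedTicks

            ' Build the index once (not timed), then time a single lookup.
            Dim lookup As New Dictionary(Of String, Integer)
            For i As Integer = 0 To refs.Count - 1
                lookup(refs(i)) = i
            Next
            sw.Restart()
            Dim position As Integer = lookup(target)
            sw.Stop()

            Console.WriteLine("{0} rows: scan {1} ticks, dictionary {2} ticks", rows, scanTicks, sw.ElapsedTicks)
        Next
    End Sub
End Module
```

The exact numbers depend on the machine, but the scan time should grow roughly in step with the row count while the dictionary lookup stays about flat.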

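If the OpenXML or Workbook classes shown do not provide convenient x,y cell addressing, you may have to build your own index. Walk all the cells once and add them to your own list of column lists, then you can index directly by x,y: x is the index into the outer list that selects a column list, and y is the index into that column list that selects the cell.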
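
One way the list of column lists could look is sketched below. The module and function names (CellIndexSketch, ParseReference, BuildIndex, Lookup) are hypothetical, the indexes are zero-based by choice, and the code assumes the Open XML SDK is referenced and that cell references have the usual letters-then-digits form such as "AB12".

```vbnet
Imports System
Imports System.Collections.Generic
Imports DocumentFormat.OpenXml.Spreadsheet

Public Module CellIndexSketch
    ' Splits a reference such as "AB12" into zero-based column (x) and row (y) indexes.
    Public Sub ParseReference(ByVal reference As String, ByRef x As Integer, ByRef y As Integer)
        Dim i As Integer = 0
        x = 0
        ' The leading letters are a base-26 column name: A=1 ... Z=26, AA=27.
        While i < reference.Length AndAlso Char.IsLetter(reference(i))
            x = x * 26 + (Convert.ToInt32(Char.ToUpperInvariant(reference(i))) - Convert.ToInt32("A"c) + 1)
            i += 1
        End While
        x -= 1
        y = Integer.Parse(reference.Substring(i)) - 1
    End Sub

    ' One pass over the cells; the outer list is columns, each inner list is that column's rows.
    Public Function BuildIndex(ByVal cells As IEnumerable(Of Cell)) As List(Of List(Of Cell))
        Dim columns As New List(Of List(Of Cell))
        For Each c As Cell In cells
            If c.CellReference Is Nothing Then Continue For
            Dim x, y As Integer
            ParseReference(c.CellReference.Value, x, y)
            ' Grow the outer list until column x exists.
            While columns.Count <= x
                columns.Add(New List(Of Cell))
            End While
            ' Pad the column with Nothing until row y exists.
            Dim column As List(Of Cell) = columns(x)
            While column.Count <= y
                column.Add(Nothing)
            End While
            column(y) = c
        Next
        Return columns
    End Function

    ' Returns the cell at (x, y), or Nothing when it is outside the indexed area or empty.
    Public Function Lookup(ByVal index As List(Of List(Of Cell)), ByVal x As Integer, ByVal y As Integer) As Cell
        If x < 0 OrElse x >= index.Count Then Return Nothing
        If y < 0 OrElse y >= index(x).Count Then Return Nothing
        Return index(x)(y)
    End Function
End Module
```

With this in place, BuildIndex(pWorksheetPart.Worksheet.Descendants(Of Cell)()) would be called once, and Lookup(index, 0, 1) would then return cell A2. Padding with Nothing costs some memory on sparse sheets; a dictionary keyed by reference avoids that at the price of hashing each lookup.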

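[Comments]:

I'll do some testing and come back with an answer about how long that statement takes.
It looks that way: the lookup is much faster with 3,000 rows than with 10,000 rows.
Are you reading the spreadsheet file yourself, or connecting to an instance of the Excel program and getting the data through Excel?
The expression in the .Where() clause will be executed for every cell in the spreadsheet. Is there another way to get a column of data?
If you mean using the PIAs, I'm not; I'm using the OpenXML SDK (opening and reading the file myself). Most likely I'll work on a solution now.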