【问题标题】:Handle large delimited text files in VBA在 VBA 中处理大型分隔文本文件
【发布时间】:2012-03-22 11:03:25
【问题描述】:

使用 VBA,我需要将当前位于分隔文本文件中的数据(如数百列数万行)“反透视”为规范化形式(四列数百万行);也就是说,生成的表格将包含针对每个单元格的列:

  • 识别原始表/文件;
  • 在原始表格中识别单元格的行;
  • 在原始表格中识别单元格的列;
  • 包含该单元格的值。

对于如何有效地完成这项任务的任何想法,我一般都会感激不尽。

到目前为止,我已经考虑使用 ADODB 来构造一个构建输出表的 SELECT INTO ... UNION ... 查询,但遗憾的是,默认文本文件提供程序被限制为 255 列(有哪些不是吗?)。

Sébastien Lorion 构建了一个很棒的Fast CSV Reader,我很想使用它,但我不知道如何在 VBA 中使用它——感谢任何想法(我认为它没有被编译为导​​出 COM接口,我没有重新编译它的工具)。就此而言,Microsoft 还提供了一个 TextFieldParser 类,但我不知道是否/如何从 VBA 中使用它。

另一种方法可能是让 Excel >=2007 打开源文件并从那里构造输出表,但直观上“感觉”好像会产生相当大的开销...

【问题讨论】:

  • 你需要多久做一次(一次?几次?)?完成任务最多需要多长时间?文件大小(MB)是多少?
  • @assylias:每个输入文件将执行一次任务,但是会不时出现新的输入文件(具有不同的列!)——它们的数据必须附加到同一个聚合输出表;很难量化任务应该花费多长时间,但它应该对交互式用户有相当的响应 - 所以我们说不超过 20 秒?文件大小为 10MB。
  • 这其中的哪一部分你被难住了?它只是将 CSV 文件中的每一行解析为一个值数组吗?
  • 您是按位置还是按标题值来识别列?
  • @TimWilliams:通过标题值(在第一行中给出);让我难过的是如何使用现有的库而不是尝试重新发明轮子。这不就是普通的好习惯吗?

标签: excel vba csv


【解决方案1】:

已编译但未测试

Sub UnpivotFile(sPath As String)

    Const DELIM As String = ","
    Const QUOTE As String = """"

    Dim FSO As New FileSystemObject
    Dim arrHeader
    Dim arrContent
    Dim lb As Integer, ub As Integer
    Dim x As Integer
    Dim inData As Boolean
    Dim l As String, fName As String
    Dim fIn As Scripting.TextStream
    Dim fOut As Scripting.TextStream
    Dim tmp As String
    Dim lineNum As Long

    fName = FSO.GetFileName(sPath)

    Set fIn = FSO.OpenTextFile(sPath, ForReading)
    Set fOut = FSO.OpenTextFile(sPath & "_out", ForWriting)
    lineNum = 0

    Do While Not fIn.AtEndOfStream

        l = fIn.ReadLine
        lineNum = lineNum + 1
        arrContent = ParseLineToArray(l, DELIM, QUOTE)

        If Not inData Then
            arrHeader = arrContent
            lb = LBound(arrHeader)
            ub = UBound(arrHeader)
            inData = True
        Else
            For x = lb To ub
                fOut.WriteLine Join(Array(fName, lineNum, _
                               QID(arrHeader(x), DELIM, QUOTE), _
                               QID(arrContent(x), DELIM, QUOTE)), DELIM)

            Next x
        End If
    Loop
    fIn.Close
    fOut.Close
End Sub

'quote if delimiter found
Function QID(s, d As String, q As String)
    QID = IIf(InStr(s, d) > -1, q & s & q, s)
End Function


'Split a string into an array based on a Delimiter and a Text Identifier
Private Function ParseLineToArray(sInput As String, m_Delim As String, _
                                  m_TextIdentifier As String) As Variant
   'Dim vArr As Variant
   Dim sArr() As String
   Dim bInText As Boolean
   Dim i As Long, n As Long
   Dim sTemp As String, tmp As String

   If sInput = "" Or InStr(1, sInput, m_Delim) = 0 Then
      'zero length string, or delimiter not present
      'dump all input into single-element array (minus Text Identifier)
      ReDim sArr(0)
      sArr(0) = Replace(sInput, m_TextIdentifier, "")
      ParseLineToArray = sArr()
   Else
      If InStr(1, sInput, m_TextIdentifier) = 0 Then
         'no text identifier so just split and return
         sArr() = Split(sInput, m_Delim)
         ParseLineToArray = sArr()
      Else
         'found the text identifier, so do it the long way
         bInText = False
         sTemp = ""
         n = 0

         For i = 1 To Len(sInput)
            tmp = Mid(sInput, i, 1)
            If tmp = m_TextIdentifier Then
               'just toggle the flag - don't add to string
               bInText = Not bInText
            Else
               If tmp = m_Delim Then
                  If Not bInText Then
                     'delimiter not within quoted text, so add next array member
                     ReDim Preserve sArr(n)
                     sArr(n) = sTemp
                     sTemp = ""
                     n = n + 1
                  Else
                     sTemp = sTemp & tmp
                  End If
               Else
                  sTemp = sTemp & tmp
               End If           'character is a delimiter
            End If              'character is a quote marker
         Next i

         ReDim Preserve sArr(n)
         sArr(n) = sTemp

         ParseLineToArray = sArr()
      End If 'has any quoted text
   End If 'parseable

End Function

【讨论】:

  • 谢谢,蒂姆 - 但我仍然厌恶构建自己的解析器:为什么要重新发明轮子并遭受其他人已经克服的陷阱?例如,我认为您的解决方案不能处理带引号的行分隔符......库已经编写、测试并广泛部署用于读取/写入分隔文本文件 - 肯定有一种方法可以从 VBA 中使用它们?!
  • 什么是带引号的行分隔符?您是否希望行由换行序列(\n、\n\r\ 或 \r)以外的其他内容分隔?
  • 当 (\n, \n\r 或 \r) 出现在 "..." 中时会发生什么?在我遇到的大多数分隔文本格式中(当然在此应用程序中使用的格式),这些字符应该被视为引号中字段值的一部分,而不是行分隔符。此外,我认为您的方法不能正确处理包含在字段中的文本标识符标记......很公平,我可以调整这些问题,但是我没有考虑过的其他边缘情况呢?其他人已经在经过全面测试的库中遇到并解决了边缘案例。
  • 同意 - 处理所有这些情况很麻烦。如果您必须能够处理这些问题,那么只需在 Excel 2007+ 中打开它们似乎是显而易见的解决方案。
【解决方案2】:

这应该足够快(在我的机器上,一个 18MB 的文件需要 8 秒,但我只复制数据,我不对其进行重组 - 如果您不进行计算而只对内容进行重新排序,那么您应该得到相同的结果种表现)。即使行数/列数不适合电子表格,它也能正常工作。

TODO:它有点长,但您应该能够 (a) 复制粘贴 (b) 更改文件名和 (c) 修改操作数据函数以满足您的需要。其余代码是一堆可重用的实用程序函数,您不需要更改。

我不确定使用 VBA 是否可以更快 - 如果您需要更快,您应该考虑另一种语言。通常,Java 或 C# 中的相同代码会更短,因为它们已经有标准库来读取/写入文件等,而且速度也会更快。

Option Explicit

Public Sub doIt()
    Dim sourceFile As String
    Dim destinationFile As String
    Dim data As Variant
    Dim result As Variant

    sourceFile = "xxxxxxx"
    destinationFile = "xxxxxxx"

    data = getDataFromFile(sourceFile, ",")
    If Not isArrayEmpty(data) Then
       result = manipulateData(data)
       writeToCsv result, destinationFile, ","
    Else
       MsgBox ("Empty file")
    End If
End Sub

Function manipulateData(sourceData As Variant) As Variant
    Dim result As Variant
    Dim i As Long
    Dim j As Long
    Dim k As Long
    Dim m As Long

    'redim the result array to the right size - here I only copy so same size as source
    ReDim result(1 To UBound(sourceData, 1), 1 To UBound(sourceData, 2)) As Variant

    For i = LBound(sourceData, 1) To UBound(sourceData, 1)
        For j = LBound(sourceData, 2) To UBound(sourceData, 2)
            k = i 'k to be defined - here I only copy data
            m = j 'm to be defined - here I only copy data
            result(k, m) = sourceData(i, j)
        Next j
    Next i

    manipulateData = result
End Function

Public Sub writeToCsv(parData As Variant, parFileName As String, parDelimiter As String)

    If getArrayNumberOfDimensions(parData) <> 2 Then Exit Sub

    Dim i As Long
    Dim j As Long
    Dim fileNum As Long
    Dim locLine As String
    Dim locCsvString As String

    fileNum = FreeFile
    If Dir(parFileName) <> "" Then Kill (parFileName)
    Open parFileName For Binary Lock Read Write As #fileNum

    For i = LBound(parData, 1) To UBound(parData, 1)
      locLine = ""
      For j = LBound(parData, 2) To UBound(parData, 2)
        If IsError(parData(i, j)) Then
          locLine = locLine & "#N/A" & parDelimiter
        Else
          locLine = locLine & parData(i, j) & parDelimiter
        End If
      Next j
      locLine = Left(locLine, Len(locLine) - 1)
      If i <> UBound(parData, 1) Then locLine = locLine & vbCrLf
      Put #fileNum, , locLine
    Next i

error_handler:
    Close #fileNum

End Sub

Public Function isArrayEmpty(parArray As Variant) As Boolean
'Returns false if not an array or dynamic array that has not been initialised (ReDim) or has been erased (Erase)

  If IsArray(parArray) = False Then isArrayEmpty = True
  On Error Resume Next
  If UBound(parArray) < LBound(parArray) Then isArrayEmpty = True: Exit Function Else: isArrayEmpty = False

End Function

Public Function getArrayNumberOfDimensions(parArray As Variant) As Long
'Returns the number of dimension of an array - 0 for an empty array.

    Dim i As Long
    Dim errorCheck As Long

    If isArrayEmpty(parArray) Then Exit Function 'returns 0

    On Error GoTo FinalDimension
    'Visual Basic for Applications arrays can have up to 60000 dimensions
    For i = 1 To 60001
        errorCheck = LBound(parArray, i)
    Next i

    'Not supposed to happen
    getArrayNumberOfDimensions = 0
    Exit Function

FinalDimension:
    getArrayNumberOfDimensions = i - 1

End Function

Private Function getDataFromFile(parFileName As String, parDelimiter As String, Optional parExcludeCharacter As String = "") As Variant
'parFileName is supposed to be a delimited file (csv...)
'parDelimiter is the delimiter, "," for example in a comma delimited file
'Returns an empty array if file is empty or can't be opened
'number of columns based on the line with the largest number of columns, not on the first line
'parExcludeCharacter: sometimes csv files have quotes around strings: "XXX" - if parExcludeCharacter = """" then removes the quotes


  Dim locLinesList() As Variant
  Dim locData As Variant
  Dim i As Long
  Dim j As Long
  Dim locNumRows As Long
  Dim locNumCols As Long
  Dim fso As Variant
  Dim ts As Variant
  Const REDIM_STEP = 10000

  Set fso = CreateObject("Scripting.FileSystemObject")

  On Error GoTo error_open_file
  Set ts = fso.OpenTextFile(parFileName)
  On Error GoTo unhandled_error

  'Counts the number of lines and the largest number of columns
  ReDim locLinesList(1 To 1) As Variant
  i = 0
  Do While Not ts.AtEndOfStream
    If i Mod REDIM_STEP = 0 Then
      ReDim Preserve locLinesList(1 To UBound(locLinesList, 1) + REDIM_STEP) As Variant
    End If
    locLinesList(i + 1) = Split(ts.ReadLine, parDelimiter)
    j = UBound(locLinesList(i + 1), 1) 'number of columns
    If locNumCols < j Then locNumCols = j
    If j = 13 Then
      j = j
    End If
    i = i + 1
  Loop

  ts.Close

  locNumRows = i

  If locNumRows = 0 Then Exit Function 'Empty file

  ReDim locData(1 To locNumRows, 1 To locNumCols + 1) As Variant

  'Copies the file into an array
  If parExcludeCharacter <> "" Then

    For i = 1 To locNumRows
      For j = 0 To UBound(locLinesList(i), 1)
        If Left(locLinesList(i)(j), 1) = parExcludeCharacter Then
          If Right(locLinesList(i)(j), 1) = parExcludeCharacter Then
            locLinesList(i)(j) = Mid(locLinesList(i)(j), 2, Len(locLinesList(i)(j)) - 2)       'If locTempArray = "", Mid returns ""
          Else
            locLinesList(i)(j) = Right(locLinesList(i)(j), Len(locLinesList(i)(j)) - 1)
          End If
        ElseIf Right(locLinesList(i)(j), 1) = parExcludeCharacter Then
          locLinesList(i)(j) = Left(locLinesList(i)(j), Len(locLinesList(i)(j)) - 1)
        End If
        locData(i, j + 1) = locLinesList(i)(j)
      Next j
    Next i

  Else

    For i = 1 To locNumRows
      For j = 0 To UBound(locLinesList(i), 1)
        locData(i, j + 1) = locLinesList(i)(j)
      Next j
    Next i

  End If

  getDataFromFile = locData

  Exit Function

error_open_file:                 'returns empty variant
unhandled_error:                 'returns empty variant

End Function

【讨论】:

  • 天哪。虽然我很欣赏您为实现自己的解析器所做的努力,但我认为这不是一个特别好的解决方案。一方面,字段可能用引号括起来以逃避字段/行分隔符的使用。另一方面,您正在将整个文件加载到内存中(这对于我的目的来说似乎不是特别必要);一般来说,我宁愿依赖现有的、经过测试的、受支持的库,也不愿尝试重新发明轮子,并带来所有的陷阱!不过,还是谢谢你。
  • @eggyal 我明白你的意思。 FWIW 该代码已在生产中使用了一段时间,没有任何问题。现在它已在特定条件下使用,如果您有不同的条件(例如引号内有逗号),它可能不会达到您的预期。
【解决方案3】:

我决定在 VB.NET 中围绕 TextFieldParser 构建一个小型的 COM 感知包装器。不理想,但目前我能想到的最好的。

【讨论】:

  • 请发布代码。超过 255 列的 csv 文件并不常见,但大多数开发人员都会遇到一次。
【解决方案4】:

我过去曾亲自使用 CSV Reader 来解析巨大的 CSV 文件(最大 1 GB)。性能和简单性令人难以置信。我强烈建议您使用它。

既然您说您使用的是 VB.NET,我建议您构建一个引用 CSV Reader 的简单控制台应用程序。此控制台应用程序会将 csv 文件的路径作为命令行参数来“取消透视”。然后,在 VBA 中,您可以使用 VBA.Shell 来运行您的控制台应用程序,并将 CSV 文件的路径作为参数提供给它。

【讨论】:

    猜你喜欢
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2016-09-02
    • 1970-01-01
    • 1970-01-01
    • 2015-10-10
    相关资源
    最近更新 更多