【Question title】: Efficiently append .csv files together (VB.NET)
【Posted】: 2019-11-20 19:31:24
【Question description】:

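I have a question about a piece of Visual Basic code that I would like to make more efficient. What I am trying to do is the following:

  • I have a folder containing 100 .csv files (comma-separated), each with about 5,000 rows and about 200 columns. The column order may differ from file to file, and some columns are missing from some files.
  • My goal is to create one big .csv file that combines all 100 .csv files, restricted to a selection of columns that I specify in advance.
  • I proceed as follows:

    1. Create an array storing the names of the columns I want in the final "big .csv".
    2. Loop through all the files in the folder. For each file:
    3. For each line in the file, use the Split function to create an array containing all the values of that line.
    4. Create a mapping array that stores, for each column name selected in step 1, the position of that column in the file (done only for the first line of each file).
    5. Write the header to the output file (the "big .csv"); this is done only once.
    6. Write every line of every file to the same big file, picking the values according to the column positions.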
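
Steps 1 and 4 above can be sketched as follows. This is only an illustration, not the code from the question: the module, the function BuildColumnMap, and its parameter names are hypothetical, and it assumes a plain comma-separated header with no quoted fields. Indexing the header in a Dictionary avoids a nested loop over header fields and wanted columns.

```vbnet
Imports System
Imports System.Collections.Generic

Module ColumnMappingSketch

    ' Hypothetical helper: returns, for each wanted column, its zero-based
    ' position in headerLine, or -1 when the file does not contain that column.
    Function BuildColumnMap(headerLine As String, wantedColumns As String()) As Integer()
        Dim headerFields As String() = headerLine.Split(","c)

        ' Index the header once so that each lookup is a dictionary hit.
        ' If a name appears twice, the last position wins.
        Dim positions As New Dictionary(Of String, Integer)(StringComparer.Ordinal)
        For i As Integer = 0 To headerFields.Length - 1
            positions(headerFields(i).Trim()) = i
        Next

        Dim map(wantedColumns.Length - 1) As Integer
        For j As Integer = 0 To wantedColumns.Length - 1
            Dim pos As Integer
            If positions.TryGetValue(wantedColumns(j), pos) Then
                map(j) = pos
            Else
                map(j) = -1
            End If
        Next
        Return map
    End Function

    Sub Main()
        Dim map As Integer() = BuildColumnMap("id,age,name", New String() {"name", "salary", "id"})
        For Each p As Integer In map
            Console.WriteLine(p) ' prints 2, then -1, then 0
        Next
    End Sub

End Module
```

The map is computed once per file (from its header line) and can then be reused for every data line of that file.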

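So this process works fine and I get the result I want, but it is slow... (On my computer it takes about 40 minutes to process ~200 files which, once appended, contain 500,000 rows and 200 columns. A colleague managed a similar process, appending all the files with the data.table package in R, and was able to perform the same append on the same .csv tables in 5-10 minutes on the same computer.) I was wondering whether there is a better option than going through the files "cell by cell". Can I identify the columns I do not want in the source files and drop them entirely? Is there a function that appends files together without reading each cell and writing it back?

Edit: Alternatively, is there another, more efficient programming language (Python? PowerShell?) for this kind of file manipulation?

Edit 2: Added more details on why I consider it slow.

Edit 3: The code relevant to the question, as requested in the comments: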

Public Module Public_Variables

    'Initialize technical parameters
    Public Enable_SQL_Upload As String = "Yes"
    Public Enable_CSV_Output As String = "Yes"
    Public Enable_Runlog As String = "Yes"
    Public MPF_Type As String

    'Initialize Path and Folder location
    Public Path As String '= "L:\Prophet\1902_Analysis\results\RUN_200" ' "S:\Users\Jonathan\12_Project_Space\Input" '"K:\Prophet\1809\Model_B2_LS_Analysis\results\RUN_31" 

    'Initialize parameters for Actuarial SQL database connection
    Public SQLServer As String '= "MELAIPWBAT01"
    Public SQLDataBase As String '= "VALDATA"
    Public SQLTableName As String '= "RPT_1902_" & Right(Path, 7) '"aMPF_Retail_1808"

    Public HeaderScopeFileName As String

    Public ValidFilesFileName As String = "S:\Users\Jonathan\12_Project_Space\Tables\ValidFiles.txt"
    Public InputFileName As String = "S:\Users\Jonathan\12_Project_Space\Tables\Input.txt"

    Public tsrl As String = Now.Year.ToString() + Now.Month.ToString() + Now.Day.ToString() + "_" + Now.Hour.ToString() + "h" + Now.Minute.ToString() + "m" + Now.Second.ToString() + "s"
    Public RunLogFileName As String = "\\Melaipwbat01\c$\Users\AACT064\Desktop\SQL_CSV_BULK_INSERT\RunLog" & tsrl & ".csv"
    Public RunLogFile As IO.StreamWriter = My.Computer.FileSystem.OpenTextFileWriter(RunLogFileName, False)

    Public HeaderFile1 As String '= "K:\Prophet\1903\MPF\MPF_SNAPSHOT\C_TROC.rpt" ' "K:\Prophet\1903\Model_B2_LS_Analysis\results\RUN_200\C_TROC.rpt" '"K:\Prophet\1809\Model_B2_LS_Analysis\results\RUN_31\C_DIO0.rpt" 
    Public HeaderFile2 As String '= "K:\Prophet\1901\Model_GRP\results\RUN_16\CORS_0.rpt" '"K:\Prophet\1809\Model_B2_LS_Analysis\results\RUN_31\C_DIO0.rpt" 

    Public ValidFilesFile As New System.IO.StreamReader(ValidFilesFileName)


    'UserForm Design
    'Public UserFormHeight_Start As Integer = 800
    'Public UserFormWidth_Start As Integer = 800
    Public UserFormHeight_ProgressBarExtension As Integer = 0
    Public UserFormWidth_ProgressBarExtension As Integer = 0

    'Initialize Misc.
    Public Input_Array(,) As String
    Public TextLine As String
    Public TextLineSplit() As String
    Public ValidFilesArray(,) As String
    Public RecordCount As Integer = 0
    Public BodyString As String = ""
    Public SqlCommandText1 As String = ""
    Public SqlCommandText2 As String = ""

    'Initialize IS variables
    Public Is_RPT_Name As Integer = 1
    Public Is_RPT_Type As Integer = 2
    Public Is_RPT_Valid As Integer = 3

    Public Is_Name As Integer = 1
    Public Is_Type As String = 2

    Public Is_Not_found As Integer = -1
    Public Is_RetailDCS As Integer = 0
    Public Is_GroupDCS As Integer = 1
    Public Is_MPF_Type As Integer

    Public Is_Input_Header As Integer = 1
    Public Is_Input_Path As Integer = 2
    Public Is_Input_Server As Integer = 3
    Public Is_Input_Database As Integer = 4
    Public Is_Input_TableName As Integer = 5
    Public Is_Input_HeaderFileRetailDCS As Integer = 6
    Public Is_Input_HeaderFileGroupDCS As Integer = 7

    'Initialize temp variables
    Public temp_Valid As Integer

    'Initialize the header of the SQL table that is created from that application
    Public HeaderScope(,) As String
    Public HeaderMapId(,) As String
    Public HeaderStringSQL As String = ""
    Public HeaderStringCSV As String = ""
    Public MappingFound As Boolean

    'Initialization for the files looping
    Public temp_file As Integer = 1
    Public temp_line As Integer
    Public temp_headerfile As String
    Public FileSize As Integer
    Public TimerCounter As Integer = 0

End Module

Public Class UF_UserForm

    Private Sub UF_UserForm_Load(sender As Object, e As EventArgs) Handles MyBase.Load

        'Me.Height = UserFormHeight_Start
        'Me.Width = UserFormWidth_Start

        TB_Input.Text = InputFileName

        CB_SQLUpload.Text = Enable_SQL_Upload
        CB_CSVOutput.Text = Enable_CSV_Output
        CB_Runlog.Text = Enable_Runlog

        GB_Progress.Visible = False
    End Sub

    Private Sub B_Run_Click_1(sender As Object, e As EventArgs) Handles B_Run.Click

        'Disable the button, switch to 'Progress' tab
        B_Run.Enabled = False
        TabControl1.SelectedIndex = 1

        'Start the Timer
        Timer1.Interval = 1000
        TimerCounter = 0
        Timer1.Start()

        'Initialize the parameters with the Text Box values
        TB_Server.Enabled = False
        TB_Database.Enabled = False
        TB_TableName.Enabled = False
        TB_Path.Enabled = False
        TB_FileType.Enabled = False
        CB_SQLUpload.Enabled = False
        CB_CSVOutput.Enabled = False
        CB_Runlog.Enabled = False

        Enable_SQL_Upload = CB_SQLUpload.Text
        Enable_CSV_Output = CB_CSVOutput.Text
        Enable_Runlog = CB_Runlog.Text

        'Extract the inputs from the input.txt file
        Dim temp_Input As Integer
        Dim InputFirstCol As String
        Dim InputSecCol As String
        Dim Nb_Of_Runs As Integer
        Dim EndOfLoop As Boolean

        InputFileName = TB_Input.Text
        Dim InputFile As New System.IO.StreamReader(InputFileName)

        temp_Input = 0
        EndOfLoop = False
        Do While InputFile.Peek() <> -1 And EndOfLoop = False
            TextLine = InputFile.ReadLine()
            TextLineSplit = TextLine.Split(",")
            InputFirstCol = TextLineSplit(0)

            If InputFirstCol <> "#" And InputFirstCol <> "" And InputFirstCol <> "--End--" Then
                InputSecCol = TextLineSplit(1)
            Else
                InputSecCol = ""
            End If

            If InputFirstCol = "--End--" Then
                EndOfLoop = True
            Else
                If InputFirstCol = "#" Then
                    temp_Input = temp_Input + 1
                ElseIf InputFirstCol = "HeaderScope" Then
                    ReDim Preserve Input_Array(7, temp_Input)
                    Input_Array(Is_Input_Header, temp_Input) = InputSecCol
                ElseIf InputFirstCol = "Path" Then
                    Input_Array(Is_Input_Path, temp_Input) = InputSecCol
                ElseIf InputFirstCol = "Server" Then
                    Input_Array(Is_Input_Server, temp_Input) = InputSecCol
                ElseIf InputFirstCol = "Database" Then
                    Input_Array(Is_Input_Database, temp_Input) = InputSecCol
                ElseIf InputFirstCol = "TableName" Then
                    Input_Array(Is_Input_TableName, temp_Input) = InputSecCol
                ElseIf InputFirstCol = "HeaderFileRetailDCS" Then
                    Input_Array(Is_Input_HeaderFileRetailDCS, temp_Input) = InputSecCol
                ElseIf InputFirstCol = "HeaderFileGroupDCS" Then
                    Input_Array(Is_Input_HeaderFileGroupDCS, temp_Input) = InputSecCol
                End If
            End If
        Loop

        Nb_Of_Runs = temp_Input

        'Create an array to store the timer per run
        Dim Timer_Array(Nb_Of_Runs) As String

        'Let's start the loop for each run
        For temp_Run = 1 To Nb_Of_Runs

            'Initialize date stamp variables and create the date stamp (called ts)
            Dim now As DateTime = DateTime.Now
            Dim ts As String = now.Year.ToString() + now.Month.ToString() + now.Day.ToString() + "_" + now.Hour.ToString() + "h" + now.Minute.ToString() + "m" + now.Second.ToString() + "s"
            Dim temp_full_count As Integer = 0
            Dim File_Count As Integer = 0
            Dim temp_count As Integer = 0

            'Open the .csv file
            Dim OutputCSVFileName As String = "\\Melaipwbat01\c$\Users\AACT064\Desktop\SQL_CSV_BULK_INSERT\SQL_Upload" & ts & ".csv" ' "S:\Users\Jonathan\12_Project_Space\Output\RPT_Files" & ts & ".csv"
            Dim SQLUploadFileName As String = Replace(OutputCSVFileName, "\\Melaipwbat01\c$", "C:")
            Dim SQLQueriesFileName As String = "\\Melaipwbat01\c$\Users\AACT064\Desktop\SQL_CSV_BULK_INSERT\SQL_Queries" & ts & ".csv"

            Dim outFile As IO.StreamWriter = My.Computer.FileSystem.OpenTextFileWriter(OutputCSVFileName, False)
            Dim SQLQueriesFile As IO.StreamWriter = My.Computer.FileSystem.OpenTextFileWriter(SQLQueriesFileName, False)

            'Store the inputs from the Input_Array
            SQLServer = Input_Array(Is_Input_Server, temp_Run)
            TB_Server.Text = Input_Array(Is_Input_Server, temp_Run)

            SQLDataBase = Input_Array(Is_Input_Database, temp_Run)
            TB_Database.Text = Input_Array(Is_Input_Database, temp_Run)

            SQLTableName = Input_Array(Is_Input_TableName, temp_Run)
            TB_TableName.Text = Input_Array(Is_Input_TableName, temp_Run)

            Path = Input_Array(Is_Input_Path, temp_Run)
            TB_Path.Text = Input_Array(Is_Input_Path, temp_Run)

            HeaderScopeFileName = Input_Array(Is_Input_Header, temp_Run)
            TB_FileType.Text = Input_Array(Is_Input_Header, temp_Run)

            HeaderFile1 = Input_Array(Is_Input_HeaderFileRetailDCS, temp_Run)
            TB_HeaderRetailDCS.Text = Input_Array(Is_Input_HeaderFileRetailDCS, temp_Run)

            HeaderFile2 = Input_Array(Is_Input_HeaderFileGroupDCS, temp_Run)
            TB_HeaderGroupDCS.Text = Input_Array(Is_Input_HeaderFileGroupDCS, temp_Run)

            'Open the folder location and store all the file paths into files()
            Dim files() As String = IO.Directory.GetFiles(Path)

            'Open the Header scope .txt file
            Dim HeaderScopeFile As New System.IO.StreamReader(HeaderScopeFileName)

            'Initialize the variable ValidFilesArray
            temp_Valid = 1
            Do While ValidFilesFile.Peek() <> -1
                TextLine = ValidFilesFile.ReadLine()
                TextLineSplit = TextLine.Split(", ")

                ReDim Preserve ValidFilesArray(3, temp_Valid)
                ValidFilesArray(Is_RPT_Name, temp_Valid) = TextLineSplit(Is_RPT_Name - 1)
                ValidFilesArray(Is_RPT_Type, temp_Valid) = TextLineSplit(Is_RPT_Type - 1)
                ValidFilesArray(Is_RPT_Valid, temp_Valid) = TextLineSplit(Is_RPT_Valid - 1)

                temp_Valid = temp_Valid + 1
            Loop

            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            '' Display the Progress Group Box ''''''''''''''''''''''''''''''''''''''''''''''''
            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            L_ProgressPC.Text = "Initialisation"
            ProgressBar.Value = 0
            GB_Progress.Visible = True
            'Me.Height = UserFormHeight_Start + UserFormHeight_ProgressBarExtension
            'Me.Width = UserFormWidth_Start + UserFormWidth_ProgressBarExtension

            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            '' Nb of Files '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            ' The goal of this piece of code aims at calculating the number of .rpt files
            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

            TB_Runlog.Text = "Checking number of .rpt files..." & Environment.NewLine & TB_Runlog.Text
            For Each file As String In files
                If CheckValidRPTFile(file, False) = True Then
                    File_Count = File_Count + 1
                    L_NbOfFiles.Text = File_Count
                    L_NbOfRuns.Text = temp_Run & "/" & Nb_Of_Runs
                End If
                Application.DoEvents()
            Next
            TB_Runlog.Text = "... " & " Run number " & temp_Run & Environment.NewLine & TB_Runlog.Text
            TB_Runlog.Text = "... " & File_Count & " rpt files founds" & Environment.NewLine & TB_Runlog.Text
            TB_Runlog.Text = "---------------------------------------" & Environment.NewLine & TB_Runlog.Text

            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            ' The key is to define the header with the field names and the field types
            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            Dim HeaderNbOfField As Integer = 0
            Do While HeaderScopeFile.Peek() <> -1

                HeaderNbOfField = HeaderNbOfField + 1

                'We split the line into an array using the comma delimiter
                TextLine = HeaderScopeFile.ReadLine()
                TextLineSplit = TextLine.Split(", ")

                ReDim Preserve HeaderScope(2, HeaderNbOfField)
                HeaderScope(Is_Name, HeaderNbOfField) = TextLineSplit(0)
                HeaderScope(Is_Type, HeaderNbOfField) = TextLineSplit(1)

            Loop

            'That array stores the position of a given field in the file
            'It is important to initialise the array to -1
            'When HeaderMapId is assigned further down, a value that remains "-1" means the field was not assigned and thus not found.
            'We will then assign a default value for those not found fields
            ReDim HeaderMapId(1, HeaderNbOfField)
            For temp_headermapini = 0 To HeaderNbOfField
                HeaderMapId(Is_RetailDCS, temp_headermapini) = Is_Not_found
                HeaderMapId(Is_GroupDCS, temp_headermapini) = Is_Not_found
            Next

            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            '' Header ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            ' The goal of this piece of code is to populate the variable HeaderMapId
            ' HeaderMapId stores the position of each field defined in the HeaderScope variable
            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            L_ProgressPC.Text = "Find position of each field in the flat file"

            'We use HeaderFile as the base to define the position of each field
            'The application will function properly only if every .rpt file in the folder have same header as HeaderFile

            For temp_header = 0 To 1

                'We loop through 2 different type of MPFs
                'Typically coming from Retail DCS and Group DCS
                If temp_header = Is_RetailDCS Then
                    temp_headerfile = HeaderFile1
                ElseIf temp_header = Is_GroupDCS Then
                    temp_headerfile = HeaderFile2
                Else
                    temp_headerfile = "" 'Error
                End If

                Dim objReaderHeader As New System.IO.StreamReader(temp_headerfile)

                'We read line by line until the end of the file
                Do While objReaderHeader.Peek() <> -1

                    TextLine = objReaderHeader.ReadLine()

                    'We only care about the header, which starts with the character "!" in prophet .rpt files
                    If Strings.Left(TextLine, 1) = "!" Then

                        Dim temp_scope As Integer
                        Dim temp_scope_id As Integer

                        'We split the line in an array delimited by comma
                        TextLineSplit = TextLine.Split(", ")

                        'We loop through the array
                        'Once a field matches one of the fields defined in HeaderScope, we store the position of that field in HeaderMapId
                        temp_scope_id = 1
                        For Each s As String In TextLineSplit
                            For temp_scope = 1 To UBound(HeaderScope, 2)
                                If HeaderScope(Is_Name, temp_scope) = s Then
                                    HeaderMapId(temp_header, temp_scope) = temp_scope_id
                                End If
                            Next
                            temp_scope_id = temp_scope_id + 1
                        Next

                    End If
                Loop

            Next

            '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            '' Header''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            'Create the query for SQL table creation
            Dim temp_field As Integer
            For temp_field = 1 To HeaderNbOfField
                If temp_field = 1 Then
                    HeaderStringSQL = "(Prophet_Name varchar(255),"
                    HeaderStringCSV = "Prophet_Name,"
                End If
                'We replace the bracket by underscore to avoid crashes when creating the SQL table
                HeaderStringSQL = HeaderStringSQL & Replace(Replace(HeaderScope(Is_Name, temp_field), "(", "_"), ")", "_") & " " & HeaderScope(Is_Type, temp_field)
                HeaderStringCSV = HeaderStringCSV & Replace(Replace(HeaderScope(Is_Name, temp_field), "(", "_"), ")", "_")

                If temp_field <> HeaderNbOfField Then
                    HeaderStringSQL = HeaderStringSQL & ","
                    HeaderStringCSV = HeaderStringCSV & ","
                Else
                    HeaderStringSQL = HeaderStringSQL & ")"
                    HeaderStringCSV = HeaderStringCSV & ","
                End If
            Next

            If Enable_CSV_Output = "Yes" Then
                'Remove brackets and single quotes
                outFile.WriteLine(Replace(Replace(Replace(HeaderStringCSV, ")", ""), "(", ""), "'", ""))
            End If

            '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            '' Body '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            ' We loop through all the files and pick the information we need based on HeaderMapId
            '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

            'We loop through each file in the folder location
            For Each file As String In files

                Dim objReader As New System.IO.StreamReader(file)
                Dim fileInfo As New IO.FileInfo(file)

                'We only loop through the valid .rpt files
                If CheckValidRPTFile(file, True) = True Then

                    'Count the number of files we go through
                    temp_count = temp_count + 1

                    'Count the number of lines in the file (called FileSize)
                    'Dim objReaderLineCOunt As New System.IO.StreamReader(file)
                    'FileSize = 0
                    'Do While objReaderLineCOunt.Peek() <> -1
                    'TextLine = objReaderLineCOunt.ReadLine()
                    'FileSize = FileSize + 1
                    'Loop
                    FileSize = 1000000

                    'temp_line = 1
                    ''We loop through line by line for a given file
                    'Do While objReader.Peek() <> -1

                    Dim TextLines() As String = System.IO.File.ReadAllLines(file)
                    For Each TextLine2 In TextLines

                        'Update the Progress Bar
                        temp_full_count = temp_full_count + 1

                        If temp_full_count Mod 200 = 0 Then
                            ProgressBar.Value = Int(100 * temp_count / File_Count)
                            L_ProgressPC.Text = "Processing file " & fileInfo.Name & " " & temp_count & "/" & File_Count & " - line " & temp_line & "/" & FileSize
                            L_RecordsProcessed.Text = temp_full_count
                            Application.DoEvents()
                        End If

                        'We split the line into an array using the comma delimiter
                        'TextLine = objReader.ReadLine()
                        TextLineSplit = TextLine2.Split(", ")

                        'Skip line that are not actual prophet records (skip header and first few lines)
                        If Strings.Left(TextLine2, 1) = "*" Then

                            'We loop through the number of field we wish to extract for the file
                            For temp_field = 1 To HeaderNbOfField

                                If temp_field = 1 Then
                                    BodyString = "('" & Strings.Left(fileInfo.Name, Len(fileInfo.Name) - 4) & "',"
                                End If

                                If MPF_Type = "RetailDCS" Then
                                    Is_MPF_Type = Is_RetailDCS
                                ElseIf MPF_Type = "GroupDCS" Then
                                    Is_MPF_Type = Is_GroupDCS
                                Else
                                    MsgBox("Is_MPF_Type value is nor recognized")
                                    End
                                End If

                                'The array HeaderMapId tells us where to pick the information from the file
                                'This assumes that each file in the folder have same header as the 'HeaderFile'
                                If HeaderMapId(Is_MPF_Type, temp_field) = Is_Not_found Then
                                    BodyString = BodyString & "98766789"
                                Else
                                    BodyString = BodyString & TextLineSplit(HeaderMapId(Is_MPF_Type, temp_field) - 1)
                                End If


                                If temp_field <> HeaderNbOfField Then
                                    BodyString = BodyString & ","
                                Else
                                    BodyString = BodyString & ")"
                                End If
                            Next

                            'We replace double quotes with single quotes
                            BodyString = Replace(BodyString, """", "'")

                            'This Line is to add records to the .csv file
                            If Enable_CSV_Output = "Yes" Then
                                'Remove brackets and single quotes
                                outFile.WriteLine(Replace(Replace(Replace(BodyString, ")", ""), "(", ""), "'", ""))
                            End If

                        End If

                        temp_line = temp_line + 1

                    Next

                    TB_Runlog.Text = "Completed: " & fileInfo.Name & Environment.NewLine & TB_Runlog.Text
                    temp_file = temp_file + 1
                End If

            Next

            outFile.Close()

            ProgressBar.Value = Int(100 * temp_count / File_Count)
            L_RecordsProcessed.Text = temp_full_count
            Application.DoEvents()

            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            '' Upload to SQL '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            ' The goal of this code is to create the SQL table
            ' And push the .csv created into the SQL table using BULK INSERT function
            ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
            L_ProgressPC.Text = "Creating SQL Table"
            Application.DoEvents()

            If Enable_SQL_Upload = "Yes" Then

                'First query is to create the table
                SqlCommandText1 = "CREATE TABLE [" & SQLDataBase & "].[analysis]." & SQLTableName & "_" & ts & " " & HeaderStringSQL

                'Second query is to populate the data
                SqlCommandText2 = "BULK INSERT [" & SQLDataBase & "].[analysis]." & SQLTableName & "_" & ts & " " & " FROM '" & SQLUploadFileName & "' WITH ( FIELDTERMINATOR = ',', ROWTERMINATOR = '\n' , FIRSTROW=2)"

                SQLQueriesFile.WriteLine(SqlCommandText1)
                SQLQueriesFile.WriteLine(SqlCommandText2)
                SQLQueriesFile.Close()

                Using connection As New SqlConnection("Data Source=" & SQLServer & ";Integrated Security=True;Connection Timeout=2000;Initial Catalog=" & SQLDataBase & ";")
                    connection.Open()

                    Dim command As New SqlCommand(SqlCommandText1, connection)
                    command.ExecuteNonQuery()

                    Dim command2 = New SqlCommand(SqlCommandText2, connection)
                    command2.CommandTimeout = 2000
                    command2.ExecuteNonQuery()

                    connection.Close()
                End Using

            End If

            Timer_Array(temp_Run) = L_Timer.Text

        Next 'temp_Run

        RunLogFile.Close()
        Timer1.Stop()

        L_ProgressPC.Text = "Job completed"

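【Comments】:

  • Please explain what you mean by very slow and what you expect. Also, please show us your code; there may be ways to speed it up.
  • Thanks for the feedback. I have edited the original post and added the information you asked for.

Tags: vb.net csv append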


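【Solution 1】:

One way to speed up your program is to minimize the number of disk accesses. Right now you are reading each file twice, line by line. Each file most likely fits in memory, so you can read all of its lines into memory once and then process them. This will be much faster.

Something like: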

'We only loop through the valid .rpt files
If CheckValidRPTFile(file, True) = True Then

    ''Count the number of files we go through
    'temp_count = temp_count + 1

    ''Count the number of lines in the file (called FileSize)
    'Dim objReaderLineCOunt As New System.IO.StreamReader(file)
    'FileSize = 0
    'Do While objReaderLineCOunt.Peek() <> -1
    '    TextLine = objReaderLineCOunt.ReadLine()
    '    FileSize = FileSize + 1
    'Loop

    'temp_line = 1
    ''We loop through line by line for a given file
    'Do While objReader.Peek() <> -1

    Dim TextLines() As String = System.IO.File.ReadAllLines(file)
    For Each TextLine In TextLines

        'We split into an array using the comma delimiter
        'TextLine = objReader.ReadLine()
        TextLineSplit = TextLine.Split(", ")

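I looked at your updated code. It has a few additional performance issues.

  1. Reading and writing files on a network share is inefficient, especially with many back-and-forth accesses, because everything goes over the network rather than to a local drive.

  2. The least efficient part of your program is probably the loop over the 200 fields of every line of every file. Assuming an average of 5,000 lines per file, that means 200 x 5,000 x 284 = 284 million iterations!

An efficient way to access files on a network share is to read the whole file into memory with System.IO.File.ReadAllLines() or System.IO.File.ReadAllText() and then process its contents. Likewise, to write to a network share, build the file contents in memory (where possible) with a StringBuilder or a List(Of String), then write the whole file with System.IO.File.WriteAllText() or System.IO.File.WriteAllLines(). This should be the preferred way to access files on a network share for best performance.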
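
As a rough sketch of that read-once, write-once pattern (the module and the CopyRecordLines helper are made up for illustration, and the temporary files merely stand in for the network paths): the source file is read in one call, the output is accumulated in a StringBuilder, and the result is written in one call.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module BatchedIoSketch

    ' Hypothetical helper: copies every line starting with "*" from
    ' sourcePath to targetPath, touching each file exactly once.
    Sub CopyRecordLines(sourcePath As String, targetPath As String)
        ' One read: the whole source file is pulled into memory.
        Dim lines As String() = File.ReadAllLines(sourcePath)

        Dim output As New StringBuilder()
        For Each line As String In lines
            If line.StartsWith("*", StringComparison.Ordinal) Then
                output.AppendLine(line)
            End If
        Next

        ' One write: the accumulated text goes out in a single call.
        File.WriteAllText(targetPath, output.ToString())
    End Sub

    Sub Main()
        ' Temporary local files stand in for the real input and output paths.
        Dim src As String = Path.GetTempFileName()
        Dim dst As String = Path.GetTempFileName()
        File.WriteAllLines(src, New String() {"!header", "*,1,2", "note", "*,3,4"})

        CopyRecordLines(src, dst)
        Console.WriteLine(File.ReadAllLines(dst).Length) ' prints 2
    End Sub

End Module
```

If the combined output is too large to hold comfortably in a single string, writing one chunk per input file through a single buffered StreamWriter is a possible alternative.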

For the second performance issue, the loop over the fields can be simplified as follows.

Dim BodyStringBuilder As New StringBuilder("")
BodyStringBuilder.Append("('" & Strings.Left(fileInfo.Name, Len(fileInfo.Name) - 4) & "',")

If MPF_Type = "RetailDCS" Then
    Is_MPF_Type = Is_RetailDCS
ElseIf MPF_Type = "GroupDCS" Then
    Is_MPF_Type = Is_GroupDCS
Else
    MsgBox("Is_MPF_Type value is nor recognized")
    End
End If

'Skip line that are not actual prophet records (skip header and first few lines)
If Strings.Left(TextLine2, 1) = "*" Then

    'We loop through the number of field we wish to extract for the file
    For temp_field = 1 To HeaderNbOfField

        'The array HeaderMapId tells us where to pick the information from the file
        'This assumes that each file in the folder have same header as the 'HeaderFile'
        If HeaderMapId(Is_MPF_Type, temp_field) = Is_Not_found Then
            BodyStringBuilder.Append("98766789,")
        Else
            BodyStringBuilder.Append(TextLineSplit(HeaderMapId(Is_MPF_Type, temp_field) - 1) & ",")
        End If
    Next

    ' Replace last "," by ")".
    BodyStringBuilder.Remove(BodyStringBuilder.Length - 1, 1).Append(")")

    'We replace double quotes with single quotes
    BodyString = Replace(BodyStringBuilder.ToString(), """", "'")
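
The per-line work could perhaps be reduced further. In the question's code HeaderMapId is declared as a String array while it stores integer positions, so each comparison and each index lookup may involve an implicit conversion, and the MPF_Type check does not change between fields. The sketch below is one possible shape under those assumptions, not a measured fix: ProjectRow, columnMap, and missingValue are hypothetical names, and the positions are zero-based integers computed once per file.

```vbnet
Imports System
Imports System.Text

Module RowProjectionSketch

    ' Hypothetical helper: builds one output row from the split input fields.
    ' columnMap holds zero-based source positions, or -1 for a missing column.
    Function ProjectRow(fields As String(), columnMap As Integer(), missingValue As String) As String
        Dim row As New StringBuilder()
        For i As Integer = 0 To columnMap.Length - 1
            If i > 0 Then row.Append(","c)

            Dim pos As Integer = columnMap(i)
            If pos >= 0 AndAlso pos < fields.Length Then
                row.Append(fields(pos))
            Else
                ' Column absent from this file (or line too short): use the default.
                row.Append(missingValue)
            End If
        Next
        Return row.ToString()
    End Function

    Sub Main()
        Dim fields As String() = "*,10,20,30".Split(","c)
        Console.WriteLine(ProjectRow(fields, New Integer() {3, -1, 1}, "98766789"))
        ' prints 30,98766789,10
    End Sub

End Module
```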

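【Comments】:

  • Hi RobertBaron, I implemented your solution, but the result went the other way. With my initial build I could process 500,000 rows in 40 minutes; with your updated code the same 500,000 rows take 50 minutes.
  • @user2833411 - Hmm, that is strange. Can you post your updated code?
  • I have updated the original post with the new code. Thanks for your help!
  • @user2833411 - I looked at your updated code. (1) A minor point: you should remove the line Dim objReader As New System.IO.StreamReader(file) since it is not needed. (2) Given that you saw no improvement, disk access is probably not the performance problem. You should also post the TextLineSplit and HeaderMapId functions; that may be where the performance problem lies. Also, how big is each file? How many fields does each file have? How much memory do you have?
  • Thanks again for taking the time to look at this. This time I have updated the original post with the complete code. The 284 files I process add up to 1.2 GB, with each file between 30 KB and 15 MB. They all contain 200 fields and anywhere from 3 to about 10,000 rows. Memory is 992 GB DDR3 (I am using a server dedicated to data analysis).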