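【Posted】: 2014-05-09 15:44:19
【Question】:
I'm looking for some advice on how to make this function faster.
The function is meant to run through a delimited text file (with CRLF line endings) and remove any carriage returns or line feeds that appear inside a data row.
For example, a file -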
A|B|C|D
A|B|C|D
A|B|
C|D
A|B|C|D
would become -
A|B|C|D
A|B|C|D
A|B|C|D
A|B|C|D
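The function seems to work well, but now that we have started processing large files the performance is far too slow. For example, 800,000 rows take 3 seconds, while 130 million rows take over an hour....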
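To put those two timings side by side, here is a rough sanity check of what linear scaling would predict. This is only a sketch; the class and method names (`ScalingCheck`, `LinearEstimateSeconds`) are made up for illustration and are not part of the function in question.

```csharp
using System;

static class ScalingCheck
{
    // Time the large file would take if run time grew linearly with the row count.
    public static double LinearEstimateSeconds(long smallRows, double smallSeconds, long largeRows)
    {
        return smallSeconds * largeRows / smallRows;
    }
}
```

Calling `ScalingCheck.LinearEstimateSeconds(800000, 3.0, 130000000)` gives 487.5 seconds, a little over 8 minutes, so a run of more than an hour is at least about 7 times slower than straight-line scaling from the small file would suggest.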
The code is -
using System.Collections;
using System.IO;

private void CleanDelimitedFile(string readFilePath, string writeFilePath, string delimiter, string problemFilePath, string rejectsFilePath, int estimateNumberOfRows)
{
    ArrayList rejects = new ArrayList();
    ArrayList problems = new ArrayList();
    int safeSameLengthBreak = 0;
    int numberOfLinesSameLength = 0;
    int lineCount = 0;
    int maxCount = 0;
    string previousLine = string.Empty;
    string currentLine = string.Empty;

    // determine after how many rows with the same number of delimiter chars we can safely
    // say that we have found the expected length of a row (to save reading the full file twice)
    if (estimateNumberOfRows > 100000000)
        safeSameLengthBreak = estimateNumberOfRows / 200; // set the safe check limit as 0.5% of the file (minimum of 500,000)
    else if (estimateNumberOfRows > 10000000)
        safeSameLengthBreak = estimateNumberOfRows / 50; // set the safe check limit as 2% of the file (minimum of 200,000)
    else
        safeSameLengthBreak = 50000; // set the safe check limit as 50,000 (if there are fewer than 50,000 rows this won't be required anyway)

    // first pass: find the expected number of delimiter chars per row
    using (var reader = new StreamReader(readFilePath))
    {
        // check the file is still being read
        while (!reader.EndOfStream)
        {
            // increment the line count (for debugging)
            lineCount += 1;
            // get the current line
            currentLine = reader.ReadLine();
            // count the delimiter chars in the current line
            int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);
            // if the number is higher than the previous maximum set the new maximum
            if (maxCount < chars)
            {
                maxCount = chars;
                // the maximum has changed, reset the number of lines in a row with the same delimiter count
                numberOfLinesSameLength = 0;
            }
            else
            {
                // the maximum has not changed, add to the number of lines in a row with the same delimiter count
                numberOfLinesSameLength += 1;
            }
            // is the number of lines parsed in a row with the same number of delimiter chars above the safe limit? If so break the loop
            if (numberOfLinesSameLength > safeSameLengthBreak)
            {
                break;
            }
        }
    }

    // reset the line count
    lineCount = 0;

    // second pass: open a writer for the duration of the next read
    using (var writer = new StreamWriter(writeFilePath))
    {
        using (var reader = new StreamReader(readFilePath))
        {
            // check the file is still being read
            while (!reader.EndOfStream)
            {
                // increment the line count (for debugging)
                lineCount += 1;
                // get the current line
                currentLine = reader.ReadLine();
                // count the delimiter chars in the current line
                int chars = (currentLine.Length - currentLine.Replace(delimiter, "").Length);
                // check the number of chars in the line matches the required number
                if (chars == maxCount)
                {
                    // write line
                    writer.WriteLine(currentLine);
                    // clear the previous line variable as this was a valid write
                    previousLine = string.Empty;
                }
                else
                {
                    // add the line to problems
                    problems.Add(currentLine);
                    // append the new line to the previous line
                    previousLine += currentLine;
                    // count the delimiter chars in the new appended previous line
                    int newPreviousChars = (previousLine.Length - previousLine.Replace(delimiter, "").Length);
                    // check the number of chars in the previous appended line matches the required number
                    if (newPreviousChars == maxCount)
                    {
                        // write line
                        writer.WriteLine(previousLine);
                        // clear the previous line as this was a valid write
                        previousLine = string.Empty;
                    }
                    else if (newPreviousChars > maxCount)
                    {
                        // the number of delimiter chars in the appended line is higher than the file maximum, add to rejects
                        rejects.Add(previousLine);
                        // clear the previous line and move on
                        previousLine = string.Empty;
                    }
                }
            }
        }
    }

    // rename the original file as _original
    System.IO.File.Move(readFilePath, readFilePath.Replace(".txt", "") + "_Original.txt");
    // rename the new file as the original file name
    System.IO.File.Move(writeFilePath, readFilePath);

    // Write rejects
    using (var rejectWriter = new StreamWriter(rejectsFilePath))
    {
        // loop through the reject list and write each reject row to the rejects file
        foreach (string reject in rejects)
        {
            rejectWriter.WriteLine(reject);
        }
    }

    // Write problems
    using (var problemWriter = new StreamWriter(problemFilePath))
    {
        // loop through the problem list and write each problem row to the problem file
        foreach (string problem in problems)
        {
            problemWriter.WriteLine(problem);
        }
    }
}
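For comparison, the per-line count that the function gets from `currentLine.Length - currentLine.Replace(delimiter, "").Length` could also be computed without building a second string for every row. The following is only a sketch of that idea, not a measured fix: `DelimiterCounter.Count` is a hypothetical helper, and it returns the number of occurrences, which matches the `Replace`-based figure only when the delimiter is a single character such as `|` (for a longer delimiter the original expression yields occurrences times the delimiter length).

```csharp
using System;

static class DelimiterCounter
{
    // Counts non-overlapping occurrences of delimiter in line
    // without allocating a copy of the line.
    public static int Count(string line, string delimiter)
    {
        if (string.IsNullOrEmpty(delimiter))
            return 0;

        int count = 0;
        int index = 0;
        // Ordinal comparison mirrors what string.Replace(string, string) does.
        while ((index = line.IndexOf(delimiter, index, StringComparison.Ordinal)) >= 0)
        {
            count++;
            index += delimiter.Length; // continue searching after this match
        }
        return count;
    }
}
```

With the sample rows from the question, `DelimiterCounter.Count("A|B|C|D", "|")` is 3 and `DelimiterCounter.Count("C|D", "|")` is 1. Whether this changes the overall run time would have to be timed, since the file is being read across a network.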
Any pointers would be greatly appreciated.
Thanks in advance.
【Discussion】:
-
Throw ArrayList in the trash. You have a list of strings, so it should be List<string>. I assume you are running into memory problems with that many rows.
-
Are you reading and writing local files, or files across a network?
-
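I do need to switch from ArrayList to a generic list, but hardly any rows end up in those lists anyway (fewer than 10), so I don't think that is the problem. It is currently being run across a network (I don't have enough space locally to copy this file).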
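For what the switch to a generic list might look like, here is a minimal sketch. `RejectLog.WriteAll` is a hypothetical helper, not part of the posted function; it takes a `TextWriter` so that the same loop can serve both the rejects file and the problems file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class RejectLog
{
    // Writes each collected line to the writer, one per row.
    // List<string> gives typed access, so no cast from object is needed.
    public static void WriteAll(TextWriter writer, List<string> lines)
    {
        foreach (string line in lines)
        {
            writer.WriteLine(line);
        }
    }
}
```

In the function above, `rejects` and `problems` would then be declared as `new List<string>()` and each writer block would reduce to a single `RejectLog.WriteAll(rejectWriter, rejects);` call. With fewer than 10 entries this is a tidiness change rather than a performance one.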
-
Does your memory usage stay low over the course of the long run?
-
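"It is currently being run across a network": sending those lines one at a time over the network sounds like a candidate for the bottleneck. Have you measured a (smaller) local version against the network version?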
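One way to take that measurement is to time a bare read of the same sample file from each location, with no delimiter counting or writing involved. This is a sketch only: `ReadTimer.TimeRead` is a hypothetical helper, and the paths passed to it would be whatever local and network copies are available.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class ReadTimer
{
    // Reads every line of the file without processing it and returns the line count;
    // elapsed receives how long the bare read took.
    public static long TimeRead(string path, out TimeSpan elapsed)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        long lines = 0;
        using (var reader = new StreamReader(path))
        {
            while (reader.ReadLine() != null)
            {
                lines++;
            }
        }
        stopwatch.Stop();
        elapsed = stopwatch.Elapsed;
        return lines;
    }
}
```

Running `ReadTimer.TimeRead(path, out elapsed)` once against a local copy of a small sample and once against the network location gives two `elapsed` values to compare. If the network read alone accounts for most of the run time, changes to the string handling are unlikely to help much.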