[Posted]: 2015-10-04 06:16:50
[Question]:
I wrote a Perl script that processes a large number of CSV files and prints a summary. It takes 0.8326 seconds to complete.
my $opname = $ARGV[0];
my @files = `find . -name "*${opname}*.csv" -mtime -10 -type f`;
my %hash;
foreach my $file (@files) {
    chomp $file;
    my $time = $file;
    $time =~ s/.*\~(.*?)\..*/$1/;
    open(IN, $file) or print "Can't open $file\n";
    while (<IN>) {
        my $line = $_;
        chomp $line;
        my $severity = (split(",", $line))[6];
        next if $severity =~ m/NORMAL/i;
        $hash{$time}{$severity}++;
    }
    close(IN);
}
foreach my $time (sort {$b <=> $a} keys %hash) {
    foreach my $severity ( keys %{$hash{$time}} ) {
        print $time . ',' . $severity . ',' . $hash{$time}{$severity} . "\n";
    }
}
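Now I am writing the same logic in Java, but it takes 2600 ms (2.6 seconds) to complete. My question is: why does Java take so long, and how can I get the same speed as Perl? Note: I am ignoring VM initialization and class-loading time.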
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
public class MonitoringFileReader {
    static Map<String, Map<String,Integer>> store= new TreeMap<String, Map<String,Integer>>();
    static String opname;
    public static void testRead(String filepath) throws IOException
    {
        File file = new File(filepath);
        FileFilter fileFilter= new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                // TODO Auto-generated method stub
                int timediffinhr=(int) ((System.currentTimeMillis()-pathname.lastModified())/86400000);
                if(timediffinhr<10 && pathname.getName().endsWith(".csv")&& pathname.getName().contains(opname)){
                    return true;
                }
                else
                    return false;
            }
        };
        File[] listoffiles= file.listFiles(fileFilter);
        long time= System.currentTimeMillis();
        for(File mf:listoffiles){
            String timestamp=mf.getName().split("~")[5].replace(".csv", "");
            BufferedReader br= new BufferedReader(new FileReader(mf),1024*500);
            String line;
            Map<String,Integer> tmp=store.containsKey(timestamp)?store.get(timestamp):new HashMap<String, Integer>();
            while((line=br.readLine())!=null)
            {
                String severity=line.split(",")[6];
                if(!severity.equals("NORMAL"))
                {
                    tmp.put(severity, tmp.containsKey(severity)?tmp.get(severity)+1:1);
                }
            }
            store.put(timestamp, tmp);
        }
        time=System.currentTimeMillis()-time;
        System.out.println(time+"ms");
        System.out.println(store);
    }
    public static void main(String[] args) throws IOException
    {
        opname = args[0];
        long time= System.currentTimeMillis();
        testRead("./SMF/data/analyser/archive");
        time=System.currentTimeMillis()-time;
        System.out.println(time+"ms");
    }
}
Input file name format: A~B~C~D~E~20150715080000.csv. There are about 500 files, each roughly 1 MB, with contents like:
A,B,C,D,E,F,CRITICAL,G
A,B,C,D,E,F,NORMAL,G
A,B,C,D,E,F,INFO,G
A,B,C,D,E,F,MEDIUM,G
A,B,C,D,E,F,CRITICAL,G
Java version: 1.7
/////////////////////////////////////////////////////////////////////////////////////////////////
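Following the comments below, I replaced split with a regex, and performance improved a lot. I now run the read in a loop, and after 3-10 iterations the performance is entirely acceptable.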
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class MonitoringFileReader {
    static Map<String, Map<String,Integer>> store= new HashMap<String, Map<String,Integer>>();
    static String opname="Etis_Egypt";
    static Pattern pattern1=Pattern.compile("(\\d+\\.)");
    static Pattern pattern2=Pattern.compile("(?:\"([^\"]*)\"|([^,]*))(?:[,])");
    static long currentsystime=System.currentTimeMillis();
    public static void testRead(String filepath) throws IOException
    {
        File file = new File(filepath);
        FileFilter fileFilter= new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                // TODO Auto-generated method stub
                int timediffinhr=(int) ((currentsystime-pathname.lastModified())/86400000);
                if(timediffinhr<10 && pathname.getName().endsWith(".csv")&& pathname.getName().contains(opname)){
                    return true;
                }
                else
                    return false;
            }
        };
        File[] listoffiles= file.listFiles(fileFilter);
        long time= System.currentTimeMillis();
        for(File mf:listoffiles){
            Matcher matcher=pattern1.matcher(mf.getName());
            matcher.find();
            //String timestamp=mf.getName().split("~")[5].replace(".csv", "");
            String timestamp=matcher.group();
            BufferedReader br= new BufferedReader(new FileReader(mf));
            String line;
            Map<String,Integer> tmp=store.containsKey(timestamp)?store.get(timestamp):new HashMap<String, Integer>();
            while((line=br.readLine())!=null)
            {
                matcher=pattern2.matcher(line);
                matcher.find();matcher.find();matcher.find();matcher.find();matcher.find();matcher.find();matcher.find();
                //String severity=line.split(",")[6];
                String severity=matcher.group();
                if(!severity.equals("NORMAL"))
                {
                    tmp.put(severity, tmp.containsKey(severity)?tmp.get(severity)+1:1);
                }
            }
            br.close();
            store.put(timestamp, tmp);
        }
        time=System.currentTimeMillis()-time;
        //System.out.println(time+"ms");
        //System.out.println(store);
    }
    public static void main(String[] args) throws IOException
    {
        //opname = args[0];
        for(int i=0;i<20;i++){
            long time= System.currentTimeMillis();
            testRead("./SMF/data/analyser/archive");
            time=System.currentTimeMillis()-time;
            System.out.println("Time taken for "+i+" is "+time+"ms");
        }
    }
}
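For comparison, the seventh field can also be pulled out without split or a regex by walking comma positions with String.indexOf. The following is only a minimal sketch under stated assumptions: the class and method names (SeverityFieldScanner, fieldAt, countSeverities) are hypothetical and not part of the program above, and it assumes fields never contain quoted commas. Whether it is faster on the real data set has not been measured here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original program.
class SeverityFieldScanner {

    // Returns the field at zero-based position `index` of a comma-separated
    // line, or null if the line has too few fields.
    // Assumption: no field contains a quoted comma.
    static String fieldAt(String line, int index) {
        int start = 0;
        for (int i = 0; i < index; i++) {
            int comma = line.indexOf(',', start);
            if (comma < 0) {
                return null; // fewer than index + 1 fields
            }
            start = comma + 1;
        }
        int end = line.indexOf(',', start);
        return end < 0 ? line.substring(start) : line.substring(start, end);
    }

    // Counts every severity except NORMAL, mirroring the loop in testRead.
    static Map<String, Integer> countSeverities(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            String severity = fieldAt(line, 6);
            if (severity == null || severity.equals("NORMAL")) {
                continue;
            }
            Integer old = counts.get(severity);
            counts.put(severity, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "A,B,C,D,E,F,CRITICAL,G",
                "A,B,C,D,E,F,NORMAL,G",
                "A,B,C,D,E,F,INFO,G",
                "A,B,C,D,E,F,CRITICAL,G");
        System.out.println(countSeverities(sample));
    }
}
```

Unlike split, this allocates only the one substring that is actually needed per line, and unlike the quoted-field regex it does no backtracking, but it also gives up the regex's handling of quoted values.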
But now I have another question.
Look at the results when running on a small data set:
**Time taken for 0 is 218ms
Time taken for 1 is 134ms
Time taken for 2 is 127ms**
Time taken for 3 is 98ms
Time taken for 4 is 90ms
Time taken for 5 is 77ms
Time taken for 6 is 71ms
Time taken for 7 is 72ms
Time taken for 8 is 62ms
Time taken for 9 is 57ms
Time taken for 10 is 53ms
Time taken for 11 is 58ms
Time taken for 12 is 59ms
Time taken for 13 is 46ms
Time taken for 14 is 44ms
Time taken for 15 is 45ms
Time taken for 16 is 53ms
Time taken for 17 is 45ms
Time taken for 18 is 61ms
Time taken for 19 is 42ms
The first few iterations take longer, and then the time drops. Why?
Thanks,
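A pattern like the one in these timings is consistent with JVM warm-up: HotSpot starts out interpreting bytecode and compiles frequently executed methods to native code while the program runs, and classes are loaded on first use. Operating-system file caching may also make the first pass over the files slower. The sketch below is a hypothetical stand-alone demo (WarmupDemo, countCommas, and buildInput are made-up names, and the comma-counting loop is only a stand-in for the parsing work); it times the same in-memory workload repeatedly so that disk I/O is excluded. Actual numbers depend on the machine and JVM, so no output is shown.

```java
// Hypothetical demo for illustration; not part of the original program.
class WarmupDemo {

    // CPU-bound stand-in for the parsing loop: counts commas in the text.
    static int countCommas(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == ',') {
                count++;
            }
        }
        return count;
    }

    // Builds an in-memory input so that file caching does not affect timings.
    static String buildInput(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("A,B,C,D,E,F,CRITICAL,G\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = buildInput(200000);
        // Run the identical workload several times and time each run.
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            int commas = countCommas(input);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("Run " + i + ": " + commas + " commas in " + micros + " us");
        }
    }
}
```

If the early runs are slower here as well, that points at the JVM rather than the disk. For measurements meant to be compared with another language, the usual approach is to discard the first several iterations or to use a benchmarking harness.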
[Comments]:
-
The same goes for the Perl version. metacpan.org/pod/Text::CSV will be much safer than your own implementation.
-
Perl is essentially a language specialized for text processing; it was designed with text processing in mind.
-
There is a lot you can do to make the Perl code run faster!
-
OK, that makes more sense. See my answer. Your Java is far from good; if you are interested in improving the code, post it on Code Review. Please clean it up first.
Tags: java performance perl