【Posted】: 2014-06-24 13:55:28
【Question】:
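Can anyone help me with a regular expression? I have a Java program that reads .csv files in order to load them into a database.
It currently uses Pattern csvPattern = Pattern.compile("\\s*(\"[^\"]*\"|[^|]*)\\s*,?");
But when I read the file line by line with matcher = csvPattern.matcher(line); I only get empty values.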
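To see where the empty values may come from, here is a minimal sketch that runs this pattern over one line in the pipe-delimited format. It assumes the program loops with matcher.find() and reads group(1), which is an assumption about the loader and not something stated in the post; the class and method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CurrentPatternDemo {
    // The pattern quoted in the question, unchanged.
    static final Pattern CSV_PATTERN =
            Pattern.compile("\\s*(\"[^\"]*\"|[^|]*)\\s*,?");

    // Collects group(1) of every match, assuming a find() loop (hypothetical usage).
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        Matcher matcher = CSV_PATTERN.matcher(line);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Each token is printed in brackets so that empty ones are visible.
        for (String token : tokens("0|ALGERIA|0| haggle|")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Outside a quoted field, nothing in the pattern consumes a '|', so each time the matcher reaches one it can only produce a zero-length match. The real fields therefore come back interleaved with empty strings.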
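The files have the following format (many lines, some of which contain commas; '|' is the delimiter and also terminates every line):
Excerpt from the first file: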
0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |
Second:
|Customer#000000001|IVhzIApeRb ot,c,E|15|25-989-741-2988|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e|
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-768-687-3665|121.65|AUTOMOBILE|l accounts. blithely ironic theodolites integrate boldly: caref|
3|Customer#000000003|MG9kdTD2WBHm|1|11-719-748-3364|7498.12|AUTOMOBILE| deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov|
4|Customer#000000004|XxVSJsLAGtn|4|14-128-190-5944|2866.83|MACHINERY| requests. final, regular ideas sleep final accou|
Third:
5|Supplier#000000005|Gcdm2rJRzl5qlTVzc|11|21-151-690-3663|-283.84|. slyly regular pinto bea|
6|Supplier#000000006|tQxuVm7s7CnK|14|24-696-997-4969|1365.79|final accounts. regular dolphins use against the furiously ironic decoys. |
7|Supplier#000000007|s,4TicNGB4uO6PaSqNBUq|23|33-990-965-2201|6820.35|s unwind silently furiously regular courts. final requests are deposits. requests wake quietly blit|
8|Supplier#000000008|9Sq4bBH2FQEmaFOocY45sRTxo6yuoG|17|27-498-742-3860|7627.85|al pinto beans. asymptotes haggl|
9|Supplier#000000009|1KhUgZegwM3ua7dsYmekYBsK|10|20-403-398-8662|5302.37|s. unusual, even requests along the furiously regular pac|
Fourth:
1|2|3325|771.64|, even theodolites. regular, final theodolites eat after the carefully pending foxes. furiously regular deposits sleep slyly. carefully bold realms above the ironic dependencies haggle careful|
1|2502|8076|993.49|ven ideas. quickly even packages print. pending multipliers must have to are fluff|
1|5002|3956|337.09|after the fluffily ironic deposits? blithely special dependencies integrate furiously even excuses. blithely silent theodolites could have to haggle pending, express requests; fu|
1|7502|4069|357.84|al, regular dependencies serve carefully after the quickly final pinto beans. furiously even deposits sleep quickly final, silent pinto beans. fluffily reg|
Fifth:
1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
Sixth:
134823|saddle midnight thistle honeydew lime|Manufacturer#4|Brand#43|STANDARD BURNISHED BRASS|44|WRAP CAN|1857.82|ges. furiously ir|
134824|coral red indian thistle sandy|Manufacturer#5|Brand#55|PROMO BURNISHED COPPER|29|LG JAR|1858.82|final p|
134825|saddle purple orchid cornsilk medium|Manufacturer#4|Brand#44|PROMO POLISHED NICKEL|21|LG CASE|1859.82|nal accounts us|
134826|turquoise sky lime cornsilk peach|Manufacturer#1|Brand#11|SMALL BURNISHED TIN|25|SM CAN|1860.82| haggle|
Seventh:
0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
1|AMERICA|hs use ironic, even requests. s|
Eighth:
4|136777|O|32151.78|1995-10-11|5-LOW|Clerk#000000124|0|sits. slyly regular warthogs cajole. regular, regular theodolites acro|
5|44485|F|144659.20|1994-07-30|5-LOW|Clerk#000000925|0|quickly. bold deposits sleep slyly. packages use slyly|
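(The csv files were created with the DBGen tool that TPC provides for TPC-H, in case you are wondering.)
I hope you understand what I need and can help me with this. Thanks a lot!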
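EDIT: Using String.split("|"); does of course look obvious, but the problem is that the program I am working with is quite complex and relies on regex.Pattern and regex.Matcher in various places. Since I am not very familiar with the program or with Java itself, the only solution for me is to keep the given code and replace the regular expression with one that works for my files.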
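As a side note on the split idea: String.split takes a regular expression, and an unescaped | is the alternation operator, so it matches the empty string and the line is cut between characters instead of at the pipes. A minimal sketch of a literal-pipe split follows; the class and method names are hypothetical, and the sample lines are shortened from the excerpts above.

```java
import java.util.regex.Pattern;

class SplitDemo {
    // Pattern.quote makes "|" a literal; "\\|" would work as well.
    static final Pattern PIPE = Pattern.compile(Pattern.quote("|"));

    // Splits one DBGen-style line into its fields. Pattern.split drops
    // trailing empty strings, which removes the empty piece that follows
    // the final '|' terminator.
    static String[] fields(String line) {
        return PIPE.split(line);
    }

    public static void main(String[] args) {
        for (String field : fields("0|ALGERIA|0| haggle|")) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Commas inside a field are untouched because only the pipe is treated as a delimiter. One caveat of this sketch: genuinely empty fields at the end of a line would be dropped together with the terminator.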
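EDIT2: The thing is that I am trying to use the TPC-H implementation from OLTP-Bench: https://github.com/ben-reilly/oltpbench/blob/master/src/com/oltpbenchmark/benchmarks/tpch/TPCHLoader.java#L347
The line in question is 347. It is a complete implementation of the TPC-H database benchmark, but without a data generator. So I use the dbgen tool provided by TPC to generate the csv files. Unfortunately, I have not been able to get in touch with the developers.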
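Since the loader is built around Pattern and Matcher, one possible replacement is a pattern in which every match is exactly one field plus the '|' that terminates it. This is only a sketch: it assumes the loader iterates with matcher.find() and reads group(1) (not checked against TPCHLoader.java), and that fields never contain '|' or quoted text, as in the excerpts above. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeFieldMatcher {
    // group(1) is the field text; the trailing \| consumes the terminator,
    // so the next find() starts at the beginning of the next field.
    static final Pattern FIELD = Pattern.compile("([^|]*)\\|");

    // Returns the fields of one pipe-terminated line (assumed find() loop).
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher matcher = FIELD.matcher(line);
        while (matcher.find()) {
            fields.add(matcher.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        for (String field : parse("1|2|3325|771.64|, even theodolites. regular, final|")) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Because each match consumes its terminating '|', no zero-length matches occur between fields, and commas inside a field are left alone. Leading and trailing spaces are kept, so call trim() on each field if the loader expects trimmed values.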
【Comments】:
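-
Is there a reason you are using a regular expression instead of String.split()?
-
Is there a reason you are not just using a CSV parser?
-
Can you explain which part of the split requires a regular expression? For example, it is quite common for tools to export CSV files without escaping values properly, and then you end up in a mess where you cannot do a straightforward split on the delimiter alone.
-
No, sorry. I can't, because I don't really know what I'm doing :( I have updated the post.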