【Posted】: 2018-02-22 01:49:27
【Question】:
I have a PDF of about 700 pages. I am using the pdf-reader gem to convert the PDF to a string like this:
require 'pdf/reader'
filename = File.expand_path(File.dirname(__FILE__)) + "/InvoiceS.pdf"
puts("PDF to Import/Convert to Text: #{filename}")
string = ""
PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    string << page.text
  end
end
I only want a certain range of pages, so I slice out that range:
z = string.index('Page: 3 of 776')
y = string.index('Page: 74 of 776')
string1 = string[z..y]
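Slicing the joined string by footer text depends on the exact footer wording. A minimal alternative sketch, assuming the footer numbers match the physical page order, selects the pages first and only then extracts their text. `text_for_pages` is a hypothetical helper name, and `FakePage` only stands in for real page objects so the example runs without a PDF file:

```ruby
# Hypothetical helper: join the text of pages first_page..last_page
# (1-based, inclusive). `pages` is assumed to be an Array of objects that
# respond to #text, such as the array returned by reader.pages.
def text_for_pages(pages, first_page, last_page)
  selected = pages[(first_page - 1)..(last_page - 1)] || []
  selected.map(&:text).join("\n")
end

# Stand-in page objects (an assumption for this demo, not part of pdf-reader).
FakePage = Struct.new(:text)
sample_pages = (1..5).map { |n| FakePage.new("body #{n}\nPage: #{n} of 5") }

puts text_for_pages(sample_pages, 2, 3)
```

Inside the `PDF::Reader.open` block this might be called as `text_for_pages(reader.pages, 4, 74)`, which covers roughly the same range as the `index` calls above.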
Before any processing, the string looks like this:
830 N BROADWAY ST LOUIS MO 63147
1 314 381-3292 $43.19 $0.00 $0.00 $2.14 $45.33
2 314 382-2158 $43.19 $0.00 $0.00 $2.14 $45.33
3 314 385-9527 $43.19 $0.00 $0.00 $2.14 $45.33
4 314 385-9537 $48.69 $0.00 $0.00 $2.57 $51.26
Total $178.26 $0.00 $0.00 $8.99 $187.25
87 WESTERN AV UNIT 3A SOUTH PORTLAND ME 04106
1 207 773-3801 $31.19 $0.00 $0.00 $4.47 $35.66
2 207 773-3803 $36.69 $0.00 $0.00 $5.17 $41.86
3 207 773-3804 $31.19 $0.00 $0.00 $4.47 $35.66
4 207 773-8969 $35.81 $0.00 $0.00 $4.04 $39.85
5 207 773-8970 $31.19 $0.00 $0.00 $4.47 $35.66
Total $166.07 $0.00 $0.00 $22.62 $188.69
85 GERMANTOWN PKE PLYMOUTH PA 19462
1 484 322-0448 $53.19 $0.00 $0.00 $7.96 $61.15
2 484 322-0482 $53.19 $0.00 $0.00 $7.96 $61.15
3 484 322-0483 $53.19 $0.00 $0.00 $7.96 $61.15
4 484 322-0486 $47.19 $0.00 $0.00 $8.65 $55.84
5 484 322-0489 $47.19 $0.00 $0.00 $8.65 $55.84
6 610 275-3898 $53.19 $0.00 $0.00 $7.96 $61.15
Total $307.14 $0.00 $0.00 $49.14 $356.28
855 GULF FRWY HOUSTON TX 77017
1 723 910-0683 $46.69 $2.63 $0.00 $6.33 $55.65
2 713 910-0697 $41.19 $0.00 $0.00 $5.35 $46.54
3 520 297-3721 $0.00 $0.00 ($17.85) ($1.29) ($19.14)
4 520 297-5004 $32.19 $0.00 $0.00 $3.65 $35.84
5 520 297-5079 $32.19 $0.00 $0.00 $3.65 $35.84
6 520 297-9889 $0.00 $0.00 ($15.87) ($1.60) ($17.47)
7 520 297-9893 $0.00 $0.00 ($15.87) ($1.60) ($17.47)
Page: 69 of 776
I clean up the string:
string2 = string1.squeeze(' ')
string3 = string2.gsub(/\n+/, "\n")
string4 = string3.gsub("\n ", "\n")
s = string4.gsub("Page:", "\nPage:")
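The four passes above can also be written as one chained expression, which avoids the numbered intermediate variables. This is a minimal sketch with the same substitutions in the same order; `clean_text` is a hypothetical helper name:

```ruby
# Hypothetical helper: applies the same four cleanup steps in one chain.
def clean_text(raw)
  raw.squeeze(' ')              # collapse runs of spaces
     .gsub(/\n+/, "\n")         # collapse blank lines
     .gsub("\n ", "\n")         # drop a leading space on each line
     .gsub('Page:', "\nPage:")  # put a blank line before each page footer
end

s = clean_text("1  314 381-3292   $43.19\n\n\n 2 314 382-2158 $43.19\nPage: 3 of 776")
puts s
```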
The new string:
830 N BROADWAY ST LOUIS MO 63147
1 314 381-3292 $43.19 $0.00 $0.00 $2.14 $45.33
2 314 382-2158 $43.19 $0.00 $0.00 $2.14 $45.33
3 314 385-9527 $43.19 $0.00 $0.00 $2.14 $45.33
4 314 385-9537 $48.69 $0.00 $0.00 $2.57 $51.26
Total $178.26 $0.00 $0.00 $8.99 $187.25
87 WESTERN AV UNIT 3A SOUTH PORTLAND ME 04106
1 207 773-3801 $31.19 $0.00 $0.00 $4.47 $35.66
2 207 773-3803 $36.69 $0.00 $0.00 $5.17 $41.86
3 207 773-3804 $31.19 $0.00 $0.00 $4.47 $35.66
4 207 773-8969 $35.81 $0.00 $0.00 $4.04 $39.85
5 207 773-8970 $31.19 $0.00 $0.00 $4.47 $35.66
Total $166.07 $0.00 $0.00 $22.62 $188.69
85 GERMANTOWN PKE PLYMOUTH PA 19462
1 484 322-0448 $53.19 $0.00 $0.00 $7.96 $61.15
2 484 322-0482 $53.19 $0.00 $0.00 $7.96 $61.15
3 484 322-0483 $53.19 $0.00 $0.00 $7.96 $61.15
4 484 322-0486 $47.19 $0.00 $0.00 $8.65 $55.84
5 484 322-0489 $47.19 $0.00 $0.00 $8.65 $55.84
6 610 275-3898 $53.19 $0.00 $0.00 $7.96 $61.15
Total $307.14 $0.00 $0.00 $49.14 $356.28
855 GULF FRWY HOUSTON TX 77017
1 723 910-0683 $46.69 $2.63 $0.00 $6.33 $55.65
2 713 910-0697 $41.19 $0.00 $0.00 $5.35 $46.54
3 520 297-3721 $0.00 $0.00 ($17.85) ($1.29) ($19.14)
4 520 297-5004 $32.19 $0.00 $0.00 $3.65 $35.84
5 520 297-5079 $32.19 $0.00 $0.00 $3.65 $35.84
6 520 297-9889 $0.00 $0.00 ($15.87) ($1.60) ($17.47)
7 520 297-9893 $0.00 $0.00 ($15.87) ($1.60) ($17.47)
Page: 69 of 776
Now I want to parse this and create a CSV:
i = []
j = []
k = []
l = ""
f = false
g = false
num = 0
c = 0
start = Time.now
s.to_enum(:scan,/(\n)/i).map do
  i.push $`.size
end
finish = Time.now
puts("Indices Found!... in #{finish - start} seconds.")
start = Time.now
# THIS FOR LOOP PARSES THE DATA
for x in 0..i.size-1
  if s[i[x]+1]!~ /\D/
    if s[i[x]+2] == " " or s[i[x]+2]!~ /\D/
      if s[i[x]+2] == " " and s[i[x]+3] != " " then f = true; y = 3 elsif s[i[x]+2] != " " and s[i[x]+3] == " " then f = true; y = 4 end
    end
    if f
      if s[i[x]+y-1] == " " and s[i[x]+y] != " " and s[i[x]+y+1] != " " and s[i[x]+y+2] != " " and s[i[x]+y+3] == " " then g = true end
      f = false
    end
    if g
      j.push(s[i[x]+y..i[x+1]])
      m = j[num].tr('- (', '')
      k.push(m.split("$"))
      g = false
      num+=1
    end
  end
end
finish = Time.now; puts("Data Parsed!... in #{finish - start} seconds.")
# THIS FOR LOOP ACCOUNTS FOR NEGATIVE VALUES WHICH ARE IN (PARENTHESES) IN THE TEXT
for x in 0...k.size
  for y in 0...k[x].size
    if k[x][y].to_s.include? ")"
      m = k[x][y].tr(')','')
      m.prepend('-')
      k[x][y] = m
      l << k[x][y]
      if y != 5 then l << "," end
    else
      l << k[x][y]
      if y != 5 then l << "," end
    end
  end
end
# puts(l) # Prints the final csv in the terminal
puts("Extracted #{6*num} cells of data from a #{s.length} character file...")
The final string looks like this:
3143813292,43.19,0.00,0.00,2.14,45.33
3143822158,43.19,0.00,0.00,2.14,45.33
3143859527,43.19,0.00,0.00,2.14,45.33
3143859537,48.69,0.00,0.00,2.57,51.26
2077733801,31.19,0.00,0.00,4.47,35.66
2077733803,36.69,0.00,0.00,5.17,41.86
2077733804,31.19,0.00,0.00,4.47,35.66
2077738969,35.81,0.00,0.00,4.04,39.85
2077738970,31.19,0.00,0.00,4.47,35.66
4843220448,53.19,0.00,0.00,7.96,61.15
4843220482,53.19,0.00,0.00,7.96,61.15
4843220483,53.19,0.00,0.00,7.96,61.15
4843220486,47.19,0.00,0.00,8.65,55.84
4843220489,47.19,0.00,0.00,8.65,55.84
6102753898,53.19,0.00,0.00,7.96,61.15
7239100683,46.69,2.63,0.00,6.33,55.65
7139100697,41.19,0.00,0.00,5.35,46.54
5202973721,0.00,0.00,-17.85,-1.29,-19.14
5202975004,32.19,0.00,0.00,3.65,35.84
5202975079,32.19,0.00,0.00,3.65,35.84
5202979889,0.00,0.00,-15.87,-1.60,-17.47
5202979893,0.00,0.00,-15.87,-1.60,-17.47
Is there a way to simplify this?
Keep in mind that the real string output is much larger than what I pasted here.
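One possible simplification, offered as a sketch rather than a definitive answer, is to match each data line with a regular expression instead of walking character indices. It assumes every data row has the shape shown in the sample (row number, 3-digit area code, `NNN-NNNN`, then exactly five dollar amounts, with negatives in parentheses). `ROW`, `AMOUNT`, `amount_to_s`, and `parse_rows` are hypothetical names:

```ruby
# Assumed row shape: "<n> <area> <NNN>-<NNNN> <five amounts>".
ROW    = /\A\s*\d+\s+(\d{3})\s+(\d{3})-(\d{4})\s+(.+)\z/
AMOUNT = /\(?\$[\d,]+\.\d{2}\)?/

# "$43.19" becomes "43.19"; "($17.85)" becomes "-17.85".
def amount_to_s(token)
  digits = token.delete('$(),')
  token.start_with?('(') ? "-#{digits}" : digits
end

# Returns one array per data row; address, Total, and Page lines are
# skipped because they do not match ROW or lack exactly five amounts.
def parse_rows(text)
  text.each_line.map do |line|
    match = ROW.match(line.strip)
    next unless match
    amounts = match[4].scan(AMOUNT)
    next unless amounts.size == 5
    [match[1] + match[2] + match[3], *amounts.map { |a| amount_to_s(a) }]
  end.compact
end

sample = <<~TEXT
  830 N BROADWAY ST LOUIS MO 63147
  1 314 381-3292 $43.19 $0.00 $0.00 $2.14 $45.33
  Total $178.26 $0.00 $0.00 $8.99 $187.25
  3 520 297-3721 $0.00 $0.00 ($17.85) ($1.29) ($19.14)
  Page: 69 of 776
TEXT

rows = parse_rows(sample)
puts rows.map { |row| row.join(',') }
```

Because the pattern tolerates extra whitespace, this approach may not need the separate cleanup pass, and it makes a single pass over the lines regardless of input size.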
I am also still working out how to use the final string to write a CSV file into the same folder I read the PDF from.
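For writing the file, Ruby's standard `csv` library can take arrays of fields directly, so the comma-joined string may not be needed at all. The sketch below assumes the parsed data is available as an array of row arrays; `write_csv_next_to` is a hypothetical helper, and the temporary directory only exists so the demo leaves no files behind:

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: writes rows to csv_name in the same directory as
# source_path and returns the path of the file it created.
def write_csv_next_to(source_path, rows, csv_name)
  target = File.join(File.dirname(File.expand_path(source_path)), csv_name)
  CSV.open(target, 'w') do |csv|
    rows.each { |row| csv << row }
  end
  target
end

Dir.mktmpdir do |dir|
  pdf_path = File.join(dir, 'InvoiceS.pdf') # does not need to exist for this demo
  rows = [
    ['3143813292', '43.19', '0.00', '0.00', '2.14', '45.33'],
    ['5202973721', '0.00', '0.00', '-17.85', '-1.29', '-19.14']
  ]
  out = write_csv_next_to(pdf_path, rows, 'InvoiceS.csv')
  puts File.read(out)
end
```

In the script above, the `filename` variable that already holds the PDF path could be passed as `source_path`.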
Please point out any bad practices, since I am new to Ruby and want to learn.
【Comments】:
- What is your input and expected output?
Tags: ruby string performance parsing