【Question title】: Ruby streamline parsing process
【Posted】: 2018-02-22 01:49:27
【Question】:

I have a PDF of about 700 pages. I am using this gem to convert the PDF to a string like this:

require 'pdf/reader'

filename = File.expand_path(File.dirname(__FILE__)) + "/InvoiceS.pdf"
puts("PDF to Import/Convert to Text: #{filename}")
string = ""

PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    string << page.text
  end
end

I only want a certain range of pages, so I slice out that range:

z = string.index('Page: 3 of 776')
y = string.index('Page: 74 of 776')
string1 = string[z..y]
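
One thing to watch in the slice above: `String#index` returns nil when a marker is missing, and `string[z..y]` also includes the first character of the end marker. A minimal sketch of a guarded version, assuming a hypothetical helper named `slice_between` (not part of the original code):

```ruby
# Hypothetical helper: return the text from start_marker up to (but not
# including) end_marker, or nil when either marker cannot be found.
def slice_between(text, start_marker, end_marker)
  from = text.index(start_marker)
  return nil unless from

  # Search for the end marker only after the start marker.
  to = text.index(end_marker, from)
  return nil unless to

  text[from...to] # three dots: exclude the end marker itself
end
```

With the markers from the question this would be called as `slice_between(string, 'Page: 3 of 776', 'Page: 74 of 776')`, and the caller can check for nil before continuing.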

Before any processing, the string looks like this:

830 N BROADWAY ST LOUIS MO 63147

                                     1 314 381-3292           $43.19            $0.00            $0.00            $2.14           $45.33

                                     2 314 382-2158           $43.19            $0.00            $0.00            $2.14           $45.33


                                     3 314 385-9527           $43.19            $0.00            $0.00            $2.14           $45.33

                                     4 314 385-9537           $48.69            $0.00            $0.00            $2.57           $51.26

                                       Total                 $178.26            $0.00            $0.00            $8.99          $187.25


87 WESTERN AV UNIT 3A SOUTH PORTLAND ME 04106


                                     1 207 773-3801           $31.19            $0.00            $0.00            $4.47           $35.66

                                     2 207 773-3803           $36.69            $0.00            $0.00            $5.17           $41.86

                                     3 207 773-3804           $31.19            $0.00            $0.00            $4.47           $35.66

                                     4 207 773-8969           $35.81            $0.00            $0.00            $4.04           $39.85

                                     5 207 773-8970           $31.19            $0.00            $0.00            $4.47           $35.66


                                       Total                 $166.07            $0.00            $0.00           $22.62          $188.69


85 GERMANTOWN PKE PLYMOUTH PA 19462


                                     1 484 322-0448           $53.19            $0.00            $0.00            $7.96           $61.15

                                     2 484 322-0482           $53.19            $0.00            $0.00            $7.96           $61.15

                                     3 484 322-0483           $53.19            $0.00            $0.00            $7.96           $61.15

                                     4 484 322-0486           $47.19            $0.00            $0.00            $8.65           $55.84

                                     5 484 322-0489           $47.19            $0.00            $0.00            $8.65           $55.84

                                     6 610 275-3898           $53.19            $0.00            $0.00            $7.96           $61.15


                                       Total                 $307.14            $0.00            $0.00           $49.14          $356.28


855 GULF FRWY HOUSTON TX 77017

                                     1 723 910-0683           $46.69            $2.63            $0.00            $6.33           $55.65


                                     2 713 910-0697           $41.19            $0.00            $0.00            $5.35           $46.54

                                     3 520 297-3721             $0.00            $0.00          ($17.85)          ($1.29)         ($19.14)

                                     4 520 297-5004            $32.19            $0.00            $0.00            $3.65           $35.84

                                     5 520 297-5079            $32.19            $0.00            $0.00            $3.65           $35.84

                                     6 520 297-9889             $0.00            $0.00          ($15.87)          ($1.60)         ($17.47)

                                     7 520 297-9893             $0.00            $0.00          ($15.87)          ($1.60)         ($17.47)
                                                                                                                                   Page: 69 of 776

I clean up the string:

string2 = string1.squeeze(' ')
string3 = string2.gsub(/\n+/, "\n")
string4 = string3.gsub("\n ", "\n")
s = string4.gsub("Page:", "\nPage:")

The new string:

830 N BROADWAY ST LOUIS MO 63147
1 314 381-3292 $43.19 $0.00 $0.00 $2.14 $45.33
2 314 382-2158 $43.19 $0.00 $0.00 $2.14 $45.33
3 314 385-9527 $43.19 $0.00 $0.00 $2.14 $45.33
4 314 385-9537 $48.69 $0.00 $0.00 $2.57 $51.26
Total $178.26 $0.00 $0.00 $8.99 $187.25
87 WESTERN AV UNIT 3A SOUTH PORTLAND ME 04106
1 207 773-3801 $31.19 $0.00 $0.00 $4.47 $35.66
2 207 773-3803 $36.69 $0.00 $0.00 $5.17 $41.86
3 207 773-3804 $31.19 $0.00 $0.00 $4.47 $35.66
4 207 773-8969 $35.81 $0.00 $0.00 $4.04 $39.85
5 207 773-8970 $31.19 $0.00 $0.00 $4.47 $35.66
Total $166.07 $0.00 $0.00 $22.62 $188.69
85 GERMANTOWN PKE PLYMOUTH PA 19462
1 484 322-0448 $53.19 $0.00 $0.00 $7.96 $61.15
2 484 322-0482 $53.19 $0.00 $0.00 $7.96 $61.15
3 484 322-0483 $53.19 $0.00 $0.00 $7.96 $61.15
4 484 322-0486 $47.19 $0.00 $0.00 $8.65 $55.84
5 484 322-0489 $47.19 $0.00 $0.00 $8.65 $55.84
6 610 275-3898 $53.19 $0.00 $0.00 $7.96 $61.15
Total $307.14 $0.00 $0.00 $49.14 $356.28
855 GULF FRWY HOUSTON TX 77017
1 723 910-0683 $46.69 $2.63 $0.00 $6.33 $55.65
2 713 910-0697 $41.19 $0.00 $0.00 $5.35 $46.54 
3 520 297-3721 $0.00 $0.00 ($17.85) ($1.29) ($19.14)
4 520 297-5004 $32.19 $0.00 $0.00 $3.65 $35.84
5 520 297-5079 $32.19 $0.00 $0.00 $3.65 $35.84
6 520 297-9889 $0.00 $0.00 ($15.87) ($1.60) ($17.47)
7 520 297-9893 $0.00 $0.00 ($15.87) ($1.60) ($17.47)
Page: 69 of 776

Now I want to parse this and create a CSV:

i = []
j = []
k = []
l = ""
f = false
g = false
num = 0
c = 0

start = Time.now

s.to_enum(:scan,/(\n)/i).map do
  i.push $`.size
end

finish = Time.now

puts("Indices Found!... in #{finish - start} seconds.")

start = Time.now
# THIS FOR LOOP PARSES THE DATA
for x in 0..i.size-1
  if s[i[x]+1]!~ /\D/
    if s[i[x]+2] == " " or s[i[x]+2]!~ /\D/
      if s[i[x]+2] == " " and s[i[x]+3] != " " then f = true; y = 3 elsif s[i[x]+2] != " " and s[i[x]+3] == " " then f = true; y = 4 end
    end

    if f
      if s[i[x]+y-1] == " " and s[i[x]+y] != " " and s[i[x]+y+1] != " " and s[i[x]+y+2] != " " and s[i[x]+y+3] == " " then g = true end
      f = false
    end

    if g
      j.push(s[i[x]+y..i[x+1]])
      m = j[num].tr('- (', '')
      k.push(m.split("$"))
      g = false
      num+=1
    end
  end
end

finish = Time.now; puts("Data Parsed!... in #{finish - start} seconds.")

# THIS FOR LOOP ACCOUNTS FOR NEGATIVE VALUES WHICH ARE IN (PARENTHESES) IN THE TEXT
for x in 0...k.size
  for y in 0...k[x].size
    if k[x][y].to_s.include? ")"
      m = k[x][y].tr(')','')
      m.prepend('-')
      k[x][y] = m
      l << k[x][y]
      if y != 5 then l << "," end
    else
      l << k[x][y]
      if y != 5 then l << "," end
    end
  end
end



# puts(l) # Prints the final csv in the terminal
puts("Extracted #{6*num} cells of data from a #{s.length} character file...")

The final string looks like this:

3143813292,43.19,0.00,0.00,2.14,45.33
3143822158,43.19,0.00,0.00,2.14,45.33
3143859527,43.19,0.00,0.00,2.14,45.33
3143859537,48.69,0.00,0.00,2.57,51.26
2077733801,31.19,0.00,0.00,4.47,35.66
2077733803,36.69,0.00,0.00,5.17,41.86
2077733804,31.19,0.00,0.00,4.47,35.66
2077738969,35.81,0.00,0.00,4.04,39.85
2077738970,31.19,0.00,0.00,4.47,35.66
4843220448,53.19,0.00,0.00,7.96,61.15
4843220482,53.19,0.00,0.00,7.96,61.15
4843220483,53.19,0.00,0.00,7.96,61.15
4843220486,47.19,0.00,0.00,8.65,55.84
4843220489,47.19,0.00,0.00,8.65,55.84
6102753898,53.19,0.00,0.00,7.96,61.15
7239100683,46.69,2.63,0.00,6.33,55.65
7139100697,41.19,0.00,0.00,5.35,46.54
5202973721,0.00,0.00,-17.85,-1.29,-19.14
5202975004,32.19,0.00,0.00,3.65,35.84
5202975079,32.19,0.00,0.00,3.65,35.84
5202979889,0.00,0.00,-15.87,-1.60,-17.47
5202979893,0.00,0.00,-15.87,-1.60,-17.47

Is there a way to streamline this?

Keep in mind that the real string is much larger than what I pasted here.

I am also still working out how to write the final string to a CSV file in the same folder that I read the PDF from.
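
For the CSV-file step just mentioned, one possible approach is to derive the output path from the PDF path and write the rows with `File.write`. This is only a sketch: `csv_path_for` and `write_rows` are hypothetical names, and joining with commas is safe only on the assumption that the cells contain no commas or quotes (true for the digits-only data shown here; Ruby's `csv` library would handle quoting otherwise).

```ruby
# Hypothetical helper: "/some/dir/InvoiceS.pdf" -> "/some/dir/InvoiceS.csv".
def csv_path_for(pdf_path)
  File.join(File.dirname(pdf_path), File.basename(pdf_path, '.*') + '.csv')
end

# Hypothetical helper: rows is an array of arrays of plain strings.
# Assumes no cell contains a comma, quote, or newline.
def write_rows(pdf_path, rows)
  path = csv_path_for(pdf_path)
  File.write(path, rows.map { |row| row.join(',') }.join("\n") + "\n")
  path # return the path so the caller can report where the file went
end
```

In the question's script this could be called as `write_rows(filename, k)` once `k` holds the cleaned rows.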

Please point out any bad practices, since I am new to Ruby and want to learn.

【Comments】:

  • What are your input and expected output?

Tags: ruby string performance parsing


【Answer 1】:
string[z..y].
  squeeze(' ').
  gsub(/\n+/, "\n").
  gsub("\n ", "\n").
  gsub("Page:", "\nPage:")

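【Comments】:

  • While this code may answer the question, providing additional context about why and/or how it answers the question improves its long-term value.
  • @rollstuhlfahrer This question has no long-term value, and neither does the answer. Feel free to provide whatever you think is valuable.
  • @mudasobwa I have updated my question; hopefully this helps.
【Answer 2】: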

You can chain all of these, because each call returns the resulting string after performing its operation:

s = string[z..y].squeeze(' ').gsub(/\n+/, "\n").gsub("\n ", "\n").gsub("Page:", "\nPage:")

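You can probably make a few small optimizations. Your original two gsub calls mean:

  1. Replace every run of multiple newlines with a single newline.
  2. Replace every newline followed by a space with a plain newline.

You could change them to: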

gsub(/\n+\s*/,"\n")

Meaning: replace every run of newlines (one or more), together with any whitespace that follows it, with a single newline.

Or

gsub(/\n+/,"\n").gsub(/^\s*|\s*$/,'')

Meaning: replace every run of newlines (one or more) with a single newline, then remove all whitespace at the start and end of each line.
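
To compare the two variants side by side, here is a small sketch run on a made-up sample shaped like the extracted text in the question. The variable names `sample`, `one_pass`, and `two_pass` are illustrative only.

```ruby
# Made-up sample with blank lines and indented rows, like the PDF text.
sample = "830 N BROADWAY\n\n      1 314 381-3292   $43.19\n\n\n      2 314 382-2158   $43.19\n"

# Collapse runs of spaces first, as in the question.
squeezed = sample.squeeze(' ')

# Variant 1: newlines plus any following whitespace become one newline.
one_pass = squeezed.gsub(/\n+\s*/, "\n")

# Variant 2: collapse newlines, then strip whitespace at line starts and ends.
two_pass = squeezed.gsub(/\n+/, "\n").gsub(/^\s*|\s*$/, '')

puts one_pass
puts two_pass
```

Both variants should yield the same lines for this sample. They may differ in whether a trailing newline survives, so it is safer to compare them line by line than byte for byte.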

Answer after the question was edited:

This is a bit more compact, and from eyeballing the results they look the same as yours:

s = string.squeeze(' ').gsub(/\n+\s*/,"\n").gsub("Page:", "\nPage:")


csv = []
s.split("\n").each do |line|
  tmp = line.chomp.split.map { |i|  i.gsub(/^\(/,'-').gsub(/\)/,'').gsub('$','') }
  next unless tmp.size == 8
  csv << "#{tmp[1..2].join.gsub('-','')},#{tmp[3..-1].join(',')}"
end
puts csv.join("\n")

Result:

3143813292,43.19,0.00,0.00,2.14,45.33
3143822158,43.19,0.00,0.00,2.14,45.33
3143859527,43.19,0.00,0.00,2.14,45.33
3143859537,48.69,0.00,0.00,2.57,51.26
2077733801,31.19,0.00,0.00,4.47,35.66
2077733803,36.69,0.00,0.00,5.17,41.86
2077733804,31.19,0.00,0.00,4.47,35.66
2077738969,35.81,0.00,0.00,4.04,39.85
2077738970,31.19,0.00,0.00,4.47,35.66
4843220448,53.19,0.00,0.00,7.96,61.15
4843220482,53.19,0.00,0.00,7.96,61.15
4843220483,53.19,0.00,0.00,7.96,61.15
4843220486,47.19,0.00,0.00,8.65,55.84
4843220489,47.19,0.00,0.00,8.65,55.84
6102753898,53.19,0.00,0.00,7.96,61.15
7239100683,46.69,2.63,0.00,6.33,55.65
7139100697,41.19,0.00,0.00,5.35,46.54
5202973721,0.00,0.00,-17.85,-1.29,-19.14
5202975004,32.19,0.00,0.00,3.65,35.84
5202975079,32.19,0.00,0.00,3.65,35.84
5202979889,0.00,0.00,-15.87,-1.60,-17.47
5202979893,0.00,0.00,-15.87,-1.60,-17.47
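
The `tmp.size == 8` check keeps any line that happens to split into eight tokens, so an address line of exactly eight words would slip through. If that turns out to matter for the real data, one alternative is to match the row shape explicitly with a regular expression. The sketch below is an assumption about the row format based only on the sample shown (a row number, a 3-digit area code, a 3-4 phone number, then five dollar amounts); `ROW` and `parse_row` are hypothetical names.

```ruby
# Assumed row shape: "<n> <area> <3>-<4>" followed by five amounts, each
# either "$1.23" or "($1.23)" for a negative value.
ROW = /\A\d+ (\d{3}) (\d{3})-(\d{4})((?: \(?\$[\d,.]+\)?){5})\z/

# Hypothetical helper: returns [phone, amount1..amount5] as strings,
# or nil when the line is not a data row (address, Total, Page, ...).
def parse_row(line)
  m = ROW.match(line.strip)
  return nil unless m

  amounts = m[4].split.map do |a|
    sign = a.start_with?('(') ? '-' : '' # parentheses mark negatives
    sign + a.delete('()$,')
  end
  [m[1] + m[2] + m[3], *amounts]
end
```

It could then replace the body of the loop, for example `csv = s.split("\n").map { |line| parse_row(line) }.compact.map { |row| row.join(',') }`.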

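【Comments】:

  • I have updated my question; is there any other feedback you would like to give before I mark this as the answer?
  • @Lyres Looking at it now
  • I pick out the rows to keep by checking tmp.size; if the real data has more complex cases, you can add them to the "next" condition.
  • s is a single string containing every line. split("\n") turns it into an array by splitting on each newline. line.chomp.split removes the trailing newline and, by default, splits the line into an array on whitespace. map runs several gsub calls on each element of the resulting array to format it the way you need.
  • Try to avoid multiple logical conditions in unless, because negating them comes down to De Morgan's laws and can be confusing. Try next if ( tmp.size != 8 && !tmp[0].match(/\d/) ). Personally I prefer match, and I think I have run into problems with multiple logical conditions in an inline if in the past, so to be safe I always add parentheses (probably not needed).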