【Question Title】: In Ruby, how do I deal with non-UTF-8 characters in PDF content?
【Posted】: 2016-09-23 15:20:11
【Question】:

I'm using Rails 4.2.7. I'm downloading PDF content from the web and writing it to disk, like so…

    res1 = Net::HTTP.SOCKSProxy('127.0.0.1', 50001).start(uri.host, uri.port) do |http|
      puts "launching #{uri}"
      resp = http.get(uri)
      status = resp.code
      content = resp.body
      content_type = resp['content-type']
      content_encoding = resp['content-encoding']
    end
…
  if content_type == 'application/pdf' || content_type.include?('application/x-javascript')
    File.open(file_location, "w") { |file| file.write content }

I've noticed that for some content I get the following error:

Error during processing: "\xC2" from ASCII-8BIT to UTF-8
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `write'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `block in pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `open'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:76:in `process_race_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_race_finder_service.rb:75:in `process_race_link'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:29:in `block in process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `each'
/Users/davea/Documents/workspace/myproject/app/services/abstract_race_finder_service.rb:28:in `process_data'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:18:in `block in run_all_crawlers'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each'

I tried to work around this by replacing the invalid characters, like so…

File.open(file_location, "w") { |file| file.write content }
content.encode('UTF-8', :invalid => :replace, :undef => :replace)

Then I get the error

error: PDF malformed, expected 'endstream' but found 0 instead

when I try to read the PDF file. Does anyone know a better way to handle downloaded PDF documents that won't corrupt them?

【Question Comments】:

    Tags: ruby-on-rails ruby pdf encoding utf-8


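    【Solution 1】:

    I think the simplest solution is to use IO.binwrite, which opens the file in binary mode so the bytes are written without any encoding conversion: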

    File.binwrite(file_location, content)
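
To illustrate the binary write above, here is a minimal sketch. It assumes the response body arrives as a binary (ASCII-8BIT) string, which is what the error message in the question suggests. The helper name `save_pdf`, the sample bytes, and the temporary path are hypothetical and not part of the original code.

```ruby
require 'tmpdir'

# Hypothetical helper: writes the bytes untouched and returns the byte count.
def save_pdf(file_location, content)
  File.binwrite(file_location, content)
end

# Stand-in for resp.body: a binary string containing bytes that are not valid UTF-8.
sample = "%PDF-1.4\n\xC2\xA9\xFF\x00endstream".b

Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample.pdf') # hypothetical location
  save_pdf(path, sample)
  # Read the file back in binary mode and compare byte for byte.
  puts File.binread(path) == sample
end
```

Opening the file with mode `"wb"` in `File.open` should behave the same way, since both avoid transcoding on write.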
    

    If the files you receive may be in different encodings, the above may fail; in that case I would try

    content.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
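
As a rough sketch of what this conversion does, the example below reinterprets raw bytes as ISO-8859-1 and transcodes them to UTF-8. Every byte value is defined in ISO-8859-1, so the conversion does not raise, but bytes above 0x7F become two-byte sequences, which means the result is no longer byte-identical to the input (this matters for binary formats). The helper name `latin1_to_utf8` and the sample string are hypothetical.

```ruby
# Hypothetical helper: treats the raw bytes as ISO-8859-1 text and returns a UTF-8 copy.
# dup is used because force_encoding modifies its receiver in place.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

raw  = "caf\xE9".b           # "café" as ISO-8859-1 bytes
text = latin1_to_utf8(raw)

puts text.encoding           # the result is tagged as UTF-8
puts text.valid_encoding?    # and is valid UTF-8
puts raw.bytesize            # 4 bytes before
puts text.bytesize           # 5 bytes after: 0xE9 became a two-byte sequence
```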
    

    【Comments】:
