Answers: Problem with a PDF download that I cannot open

【Question title】: Problem with PDF download that I cannot open
【Posted】: 2020-11-19 08:36:32
【Question description】:

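I am writing a script that uses https://case.law/docs/site_features/api to extract text from legal cases. I have already written the search and create-xlsx methods, which work well, but I am struggling to open an online PDF link, write it to a temporary file (wb), read it and extract the data (the core text), and then close it. The end goal is to use the content of these cases for NLP.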

I have prepared a function (see below) to download the file:

import urllib3

def download_file(file_id):
    http = urllib3.PoolManager()
    folder_path = "path_to_my_desktop"  # placeholder for the target folder
    file_download = "https://cite.case.law/xxxxxx.pdf"  # placeholder URL
    file_content = http.request('GET', file_download)
    # With the default preload_content=True the body is already available as .data
    with open(folder_path + file_id + '.pdf', 'wb') as file_local:
        file_local.write(file_content.data)

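The script runs fine: it downloads the file and creates it on my desktop. But when I try to open the file manually from the desktop, Acrobat Reader gives me the following message: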

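Adobe Acrobat Reader could not open "file_id.pdf" because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded).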

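I thought the library was the problem, so I tried Requests / xlsxwriter / urllib3... (example below; I also tried reading the file from the script to check whether Adobe was the problem, but apparently it is not).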

import re
import requests

# Download the PDF from the search results
URL = "https://cite.case.law/xxxxxx.pdf"
r = requests.get(URL, stream=True)
# path_to_desktop and pdf_name are placeholders for the real folder and file name
with open(path_to_desktop + pdf_name + '.pdf', 'w') as f:
    f.write(r.text)

# Open the downloaded file and strip everything matching '<[^<]+?>' for easier reading
with open('C:/Users/amallet/Desktop/r.pdf', 'r') as ff:
    data_read = ff.read()
    stripped = re.sub('<[^<]+?>', '', data_read)
    print(stripped)

The output is:

document.getElementById('next').value = document.location.toString();
document.getElementById('not-a-bot-form').submit();

Using 'wb' and 'rb' instead (and with the stripping step removed), the script is:

# test_case_pdf presumably holds the URL of the PDF
r = requests.get(test_case_pdf, stream=True)
with open('C:/Users/amallet/Desktop/r.pdf', 'wb') as f:
    f.write(r.content)

with open('C:/Users/amallet/Desktop/r.pdf', 'rb') as ff:
    data_read = ff.read()
    print(data_read)

The output is:

<html>
<head>
<noscript>
<meta http-equiv="Refresh" content="0;URL=?no_js=1&next=/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf" />
</noscript>
</head>
<body>
<form method="post" id="not-a-bot-form">
<input type="hidden" name="csrfmiddlewaretoken" value="5awGW0F4A1b7Y6bxrYBaA6GIvqx4Tf6DnK0qEMLVoJBLoA3ZqOrpMZdUXDQ7ehOz">
<input type="hidden" name="not_a_bot" value="yes">
<input type="hidden" name="next" value="/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf" id="next">
</form>
<script>
document.getElementById(\'next\').value = document.location.toString();
document.getElementById(\'not-a-bot-form\').submit();
</script>
<a href="?no_js=1&next=/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf">Click here to continue</a>
</body>
</html>

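None of these approaches worked. The PDF is not password protected, and trying other websites did not work either.

So I am wondering whether I have another problem that is unrelated to the code itself.

Let me know if you need more information.

Thanks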

【Question comments】:

Tags: python pdf python-requests urllib3 data-collection

【Solution 1】:

It looks like the web server is not giving you a PDF at all, but a web page designed to stop bots from downloading data from the site.
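One quick way to confirm this is to look at the first bytes of the response before saving it: a real PDF starts with the `%PDF-` signature, whereas the page shown in the question starts with `<html>`. A minimal sketch, where `looks_like_pdf` and `save_pdf_or_fail` are hypothetical helper names and not part of Requests or urllib3:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF signature '%PDF-'."""
    # Tolerate leading whitespace before the signature.
    return data.lstrip()[:5] == b"%PDF-"


def save_pdf_or_fail(data: bytes, path: str) -> None:
    """Write data to path only if it looks like a PDF; otherwise raise."""
    if not looks_like_pdf(data):
        preview = data[:60]
        raise ValueError(f"Response is not a PDF, it starts with: {preview!r}")
    with open(path, "wb") as f:
        f.write(data)
```

With the code from the question this could be called as `save_pdf_or_fail(r.content, 'C:/Users/amallet/Desktop/r.pdf')`; it would raise an error instead of silently writing the HTML page to disk under a `.pdf` name.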

There is nothing wrong with your code, but if you still want to do this, you will have to deal with the site's bot protection.
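A first step is simply to understand what the protection page asks for. The sketch below only inspects such a page: it lists the hidden form fields using Python's standard `html.parser` module and does not submit anything. `HiddenFieldCollector` and `hidden_fields` are hypothetical names, and the sample HTML is a shortened, made-up version of the page shown in the question:

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("type") == "hidden" and "name" in attr:
            self.fields[attr["name"]] = attr.get("value", "")


def hidden_fields(html: str) -> dict:
    """Return the hidden form fields found in an HTML document."""
    parser = HiddenFieldCollector()
    parser.feed(html)
    return parser.fields


# Shortened, made-up sample modelled on the page in the question.
sample = """
<form method="post" id="not-a-bot-form">
<input type="hidden" name="not_a_bot" value="yes">
<input type="hidden" name="next" value="/pdf/example.pdf" id="next">
</form>
"""
print(hidden_fields(sample))
# → {'not_a_bot': 'yes', 'next': '/pdf/example.pdf'}
```

Whether it is acceptable to go further than inspection depends on the site's terms; checking whether the site offers a documented way to fetch documents is likely the safer route.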

【Comments】: