【Posted】: 2020-05-08 14:32:33
【Question】:
I would like to create a Dockerfile that installs all the components needed to run python-tika in a Docker container.
This is my Dockerfile so far:
###Get python
FROM python:3
RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas
RUN mkdir scripts
ADD runner.py /scripts/
CMD [ "python", "./scripts/runner.py" ]
I build the image and run it:
docker build -t docker-tika .
docker run docker-tika
But it fails with the following errors:
[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika
2020-05-08 13:49:52,528 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 13:50:09,742 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 13:50:10,133 [MainThread ] [ERROR] Unable to run java; is it installed?
2020-05-08 13:50:10,134 [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
2020-05-08 13:50:10,271 [MainThread ] [ERROR] Unable to run java; is it installed?
2020-05-08 13:50:10,271 [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
The runner.py script is as follows:
import tika
tika.initVM()
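I understand that two things are needed:
1. Download the tika-server jar.
2. Call initVM() in the Python script, which starts tika-server in the background.
I don't know what is missing from the Dockerfile. Any help is appreciated!
I also updated the Dockerfile to install Java, but it still complains about Java: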
### 1. Get Linux
FROM alpine:3.7
### 2. Get Java via the package manager
RUN apk update \
&& apk upgrade \
&& apk add --no-cache bash \
&& apk add --no-cache --virtual=build-dependencies unzip \
&& apk add --no-cache curl \
&& apk add --no-cache openjdk8-jre
ENV JAVA_HOME=/opt/java/openjdk \
PATH="/opt/java/openjdk/bin:$PATH"
### 3. Get Python
FROM python:3
RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas
RUN mkdir scripts
RUN mkdir pdfs
RUN mkdir output
ADD runner2.py /scripts/
ADD sample.pdf .
CMD [ "python", "./scripts/runner2.py" ]
cat runner2.py:
#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('sample.pdf')
print(parsed["metadata"])
print(parsed["content"])
[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika
2020-05-08 14:40:23,183 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 14:41:00,480 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 14:41:02,324 [MainThread ] [ERROR] Unable to run java; is it installed?
2020-05-08 14:41:02,324 [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
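The "Unable to run java; is it installed?" line suggests that no `java` executable is visible on the PATH of the final image. Note that the second Dockerfile contains two FROM lines: in a multi-stage build only the last stage ends up in the resulting image, so the openjdk8-jre package installed in the alpine stage is not carried over into the python:3 stage. A minimal sketch for checking this from inside the container follows; `find_java`, `java_version`, and `describe_java` are hypothetical helper names, not part of tika.

```python
import shutil
import subprocess


def find_java(path=None):
    """Return the location of the `java` executable, or None if it is absent.

    `path` is an optional search path; None means "use the PATH variable".
    """
    return shutil.which("java", path=path)


def java_version(java_path):
    """Return the text printed by `java -version` (the JVM writes it to stderr)."""
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return (result.stderr or result.stdout).strip()


def describe_java(path=None):
    """Summarize in one string whether a JVM can be started."""
    java_path = find_java(path)
    if java_path is None:
        return "java not found on PATH"
    return f"java found at {java_path}: {java_version(java_path)}"


if __name__ == "__main__":
    print(describe_java())
```

Running this script with `docker run` in place of runner2.py would show whether the image contains a JVM at all, independently of tika.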
【Comments】:
-
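Tika is a Java application, so your image needs to have a JVM installed. You would probably be better off running the Tika server in a separate container and pointing the Python tika module at it via the appropriate environment variable. The Tika team appears to provide an apache/tika image.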
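The separate-container approach could be sketched roughly as below. This assumes that tika-python reads the `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT` environment variables when the module is first imported, and that a Tika server is reachable at `http://tika:9998`. The host name `tika` is an assumption (for example, a container of that name on the same Docker network), and `configure_tika_client` and `parse_file` are hypothetical helper names.

```python
import os

# Assumption: a Tika server container is reachable under the host name "tika".
# 9998 is the port the Tika server listens on by default.
DEFAULT_ENDPOINT = "http://tika:9998"


def configure_tika_client(endpoint=DEFAULT_ENDPOINT):
    """Set the environment so tika-python acts as a REST client only.

    This has to run before the first `import tika`, because the module
    reads these variables at import time.
    """
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint
    return {
        name: os.environ[name]
        for name in ("TIKA_CLIENT_ONLY", "TIKA_SERVER_ENDPOINT")
    }


def parse_file(path, endpoint=DEFAULT_ENDPOINT):
    """Parse a document through the remote Tika server."""
    configure_tika_client(endpoint)
    # Imported late so the settings above are in place when tika loads.
    from tika import parser

    # The endpoint is also passed explicitly in case tika was imported earlier.
    return parser.from_file(path, serverEndpoint=endpoint)
```

With this in place, runner2.py would call `parse_file('sample.pdf')` and print `parsed["metadata"]` and `parsed["content"]` as before, and the Python image would no longer need Java or the downloaded tika-server jar.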
Tags: python docker apache-tika