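I had a go at this myself. It's pretty much my first time using pytesseract, so take it for what it's worth :o)...
Looking at the page source, the table image can be fetched from this url: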
https://www.netztransparenz.de/DesktopModules/LotesNetztransparenz/ImageCharts/EpexChartImageHandler.ashx?date=2021/08/23&type=1
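Towards the end of the url (though not right at the end) there is a date: date=2021/08/23. If this is changed, the time period of the data changes and you get a new image. So the following code is purely a check that tries to access three images: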
import urllib.request
from time import sleep
main_url = 'https://www.netztransparenz.de/DesktopModules/LotesNetztransparenz/ImageCharts/EpexChartImageHandler.ashx?date='
tail_url = '&type=1'
dates = ['2021/08/21', '2021/08/22', '2021/08/23']
for date in dates:
    r = urllib.request.urlopen(main_url+date+tail_url)
    print(r.getcode())
    sleep(1)
>> 200 #'200' is 'OK'
>> 200
>> 200
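To extend that check to a longer run of dates, the urls could be generated rather than typed out, and the request could be guarded so that only a 200 response is passed on. This is a minimal sketch, not a tested scraper: `build_urls` and `fetch_image_bytes` are hypothetical helper names, and the ten-second timeout is an arbitrary assumption.

```python
import datetime
import urllib.request

MAIN_URL = ('https://www.netztransparenz.de/DesktopModules/LotesNetztransparenz/'
            'ImageCharts/EpexChartImageHandler.ashx?date=')
TAIL_URL = '&type=1'


def build_urls(start, days):
    """Return one image url per day, starting at `start` (a datetime.date)."""
    urls = []
    for offset in range(days):
        day = start + datetime.timedelta(days=offset)
        # the url carries the date as YYYY/MM/DD, as in the example above
        urls.append(MAIN_URL + day.strftime('%Y/%m/%d') + TAIL_URL)
    return urls


def fetch_image_bytes(url, timeout=10):
    """Return the raw image bytes, or None if the request fails or is not 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if response.getcode() != 200:
                return None
            return response.read()
    except OSError:
        # urllib.error.URLError, HTTPError and socket timeouts are all OSError subclasses
        return None


# the same three dates as in the check above
urls = build_urls(datetime.date(2021, 8, 21), 3)
```

Each url in `urls` could then be passed to `fetch_image_bytes`, keeping the `sleep(1)` between requests so the server is not hammered.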
Anyway, for a single url, the following code gets close:
# https://pypi.org/project/opencv-python/
import cv2
import urllib.request #I think (?) cv2 wants/prefers urllib instead of requests
import numpy as np
# sudo apt-get update
# sudo apt install tesseract-ocr
# sudo apt install libtesseract-dev
# ...or find for Mac/Windows. Then...
# https://pypi.org/project/pytesseract/
import pytesseract
import pandas as pd
from io import StringIO
# first part of the url
main_url = 'https://www.netztransparenz.de/DesktopModules/LotesNetztransparenz/ImageCharts/EpexChartImageHandler.ashx?date='
# end part of the url
tail_url = '&type=1'
# date part - the code could be expanded to loop over different dates,
# get a frame for each and concat them together. For the demo let's get just one
date = '2021/08/23'
# construct a valid url. If looping, with 'date' above being a list of dates, this method would make more sense
url = main_url + date + tail_url
# this probably wants a try except around it and proceed only on an OK/200 response
url_response = urllib.request.urlopen(url)
# download the image
img = cv2.imdecode(np.array(bytearray(url_response.read()), dtype=np.uint8), -1)
# get the image into text
text = pytesseract.image_to_string(img)
# create a rough frame and clean up
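# header=0 drops the first OCR line; names=['a'] names the single column that is used below
df = pd.read_csv(StringIO(text), sep=';', header=0, names=['a'])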
df = df[:-2]
# df['a'].str.rsplit(n=7, expand=True)
df['Stunden'] = df['a'].str.extract(r'((?:\d{2}:?\d{2}) - (?:\d{2}:?\d{2}))\s')
df = df.set_index(df['Stunden'])
df['a'] = df['a'].str.replace(r'((?:\d{2}:?\d{2}) - (?:\d{2}:?\d{2}))\s', '', regex=True)
df = df['a'].str.rsplit(n=7, expand=True)
df = df.reset_index()
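# create a dict for column renaming
# (real dates are used so that a week spanning a month boundary still gets valid labels)
from datetime import datetime, timedelta
last_day = datetime.strptime(date, '%Y/%m/%d')
weekdates = {}
count = 0
for i in range(7, -1, -1):
    d = last_day - timedelta(days=i)
    weekdates[count] = str(d.year) + '-' + str(d.month) + '-' + str(d.day)
    count += 1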
# rename columns
df = df.rename(columns=weekdates)
# output frame
display(df)
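That looks pretty good. However, if you look closely, it is not perfect...
If the frame has only a few small glitches from the image processing, you can correct them manually with lines like these: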
df.loc[12, '2021-8-20'] = '9,911'
df.loc[17, 'Stunden'] = '17:00 - 18:00'
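However, if there are more issues (and there probably will be), it would be better to replace the dots and commas entirely and/or try a broader cleaning strategy. Alternatively, look at improving the processing of the image (if you get stuck, you could post another question and get answers from people who know more about the pytesseract and/or cv2 modules).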
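As one possible cleaning strategy, the separators could be dropped altogether and the decimal point re-inserted by position. This is only a sketch under a loud assumption: that every value in the table has exactly three decimal places, as '9,911' suggests. `clean_cell` is a hypothetical helper, not part of pandas or pytesseract.

```python
import re


def clean_cell(value, decimals=3):
    """Turn an OCR'd cell such as '9.911' or '9,9 11' into a float.

    ASSUMPTION: every value has exactly `decimals` decimal places, so the
    separator can be removed and put back by position. Cells containing
    anything other than digits, dots, commas, spaces and a leading minus
    return None, so they can be checked by hand instead of guessed at.
    """
    stripped = re.sub(r'[.,\s]', '', str(value))
    match = re.fullmatch(r'(-?)(\d+)', stripped)
    if match is None or len(match.group(2)) <= decimals:
        return None
    number = int(match.group(2)) / 10 ** decimals
    return -number if match.group(1) else number
```

It could be applied to one day column at a time, for example with `df['2021-8-20'].map(clean_cell)`. Any `None` left in the result marks a cell that still needs a manual fix like the ones above.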
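As an aside: you can tell that the problem lies in the image processing, because inspecting the text variable shows what was originally extracted before it was put into the data frame...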
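To make that inspection a little quicker, the raw lines could be filtered for rows that do not look like a table row. This is a rough sketch, assuming each good row starts with an hour range in the same form the regex above expects and is followed by eight values (the eight day columns). `suspicious_lines` is a hypothetical helper name.

```python
import re

# same hour-range shape as the regex used to build the 'Stunden' column
HOURS = re.compile(r'^\d{2}:?\d{2} - \d{2}:?\d{2}\s+')


def suspicious_lines(text, expected_values=8):
    """Return (line_number, line) pairs that do not look like a clean table row."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines carry no data
        match = HOURS.match(line)
        values = line[match.end():].split() if match else []
        if match is None or len(values) != expected_values:
            flagged.append((number, line))
    return flagged
```

Calling `suspicious_lines(text)` on the OCR output will also flag the header and any footer lines, which is expected. The rows of interest are the ones in between.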