@ewwink 找到了退出 unid 的方法,但无法退出价格。我试图在这个答案中提取价格。
目标 div sn-p:
<div mp="2" id="line_e3724364" class="estoque-linha primeiro"><div class="e-col1"><a href="b/?p=e3724364" target="_blank"><img title="Rayearth Games" src="//www.lmcorp.com.br/arquivos/up/ecom/comparador/155937.jpg"></a></div><div class="e-col9-mobile"><div class="e-mob-edicao"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="19"></div><div class="e-mob-edicao-lbl"><p>Amonkhet</p></div><div class="e-mob-preco e-mob-preco-desconto"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div></div><div class="e-col2"><a href="./?view=cards/search&card=ed=akh" class="ed-simb"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="21"></a><font class="nomeedicao"><a href="./?view=cards/search&card=ed=akh" class="ed-simb">Amonkhet</a></font></div><div class="e-col3"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div>
<div class="e-col4 e-col4-offmktplace">
<img src="https://www.lmcorp.com.br/arquivos/img/bandeiras/pten.gif" title="Português/Inglês"> <font class="azul" onclick="cardQualidade(3);">SP</font>
</div>
<div class="e-col5 e-col5-offmktplace "><div class="cIiVr lHfXpZ mZkHz"> </div> <div class="imgnum-unid"> unid</div></div><div class="e-col8 e-col8-offmktplace "><div><a target="_blank" href="b/?p=e3724364" class="goto" title="Visitar Loja">Ir à loja</a></div></div></div>
如果我们仔细观察,我们可以,
for item in soup.findAll('div', {"id": re.compile('^line')}):
print(re.findall("R\$ (.*?)</div>", str(item), re.DOTALL))
输出[截断]:
['10,00</s></font><br/>R$ 8,00', '10,00</s></font><br/>R$ 8,00']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,75</s></font><br/>R$ 8,78', '9,75</s></font><br/>R$ 8,78']
[]
[]
它提取主要块,我们会得到价格。
但这也会跳过多个项目。
要获取所有数据,我们可以使用 OCR API 和 Selenium 来完成此操作。我们可以使用以下 sn-p 捕获感兴趣的元素:
from selenium import webdriver
from PIL import Image
from io import BytesIO
fox = webdriver.Firefox()
fox.get('https://ligamagic.com.br/?view=cards%2Fsearch&card=Hapatra%2C+Vizier+of+Poisons')
#element = fox.find_element_by_id('line_e3724364')
element = fox.find_elements_by_tag_name('s')
location = element.location
size = element.size
png = fox.get_screenshot_as_png() # saves screenshot of entire page
fox.quit()
im = Image.open(BytesIO(png)) # uses PIL library to open image in memory
left = location['x']
top = location['y']
right = location['x'] + size['width']
bottom = location['y'] + size['height']
im = im.crop((left, top, right, bottom)) # defines crop points
im.save('screenshot.png') # saves new cropped image
得到了https://stackoverflow.com/a/15870708的帮助。
我们可以像上面那样使用re.findall() 进行迭代以保存所有图像。在我们拥有所有图像之后,我们可以使用 OCR Space 来提取文本数据。这是一个快速的 sn-p :
import requests
def ocr_space_file(filename, overlay=False, api_key='api_key', language='eng'):
payload = {'isOverlayRequired': overlay,
'apikey': api_key,
'language': language,
}
with open(filename, 'rb') as f:
r = requests.post('https://api.ocr.space/parse/image',
files={filename: f},
data=payload,
)
return r.content.decode()
e = ocr_space_file(filename='1.png')
print(e) # prints JSON
1.png:
来自 ocr.space 的 JSON 响应:
{"ParsedResults":[{"TextOverlay":{"Lines":[],"HasOverlay":false,"Message":"Text overlay is not provided as it is not requested"},"TextOrientation":"0","FileParseExitCode":1,"ParsedText":"RS 0',85 \r\n","ErrorMessage":"","ErrorDetails":""}],"OCRExitCode":1,"IsErroredOnProcessing":false,"ProcessingTimeInMilliseconds":"1996","SearchablePDFURL":"Searchable PDF not generated as it was not requested."}
它给了我们,"ParsedText" : "RS 0',85 \r\n"。