【Posted】:2019-06-11 15:35:49
【Question】:
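I am working with this URL (http://www.ancient-hebrew.org/m/dictionary/1000.html).
I am trying to build a dictionary for each Hebrew word entry.
All I have right now just prints each piece of data I am trying to collect. However, I am stuck on how to iterate over every word on the page and build a dictionary for it. My code is below.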
from bs4 import BeautifulSoup
import re

raw_html = open('/Users/gansaikhanshur/TESTING/webScraping/1000.html').read()
# lxml is faster. If you don't have it, pip install lxml
html = BeautifulSoup(raw_html, 'lxml')

# outputs: "http://www.ancient-hebrew.org/files/heb-anc-sm-beyt.jpg"
# (the dot is escaped so that only a literal ".jpg" suffix matches)
images = html.find_all('img', src=re.compile(r'\.jpg$'))
for image in images:
    image = re.sub(
        r"\.\./\.\./", "http://www.ancient-hebrew.org/", image['src'])
    # print(image)

# outputs: "unicode_hebrew_text"
fonts = html.find_all('font', face="arial", size="+1")
for f in fonts:
    f = f.string.strip()
    print(f)

# outputs: "http://www.ancient-hebrew.org/m/dictionary/audio/998.mp3"
mp3links = html.find_all('a', href=re.compile(r'\.mp3$'))
for mp3 in mp3links:
    mp3 = "http://www.ancient-hebrew.org/m/dictionary/" + \
        mp3['href'].replace("\t", '')
    # print(mp3)
So in our HTML file we have, for example,
<!--501-1000--> <A Name= 505 ></A> <IMG SRC="../../files/heb-anc-sm-pey.jpg"><IMG SRC="../../files/heb-anc-sm-lamed.jpg"><IMG SRC="../../files/heb-anc-sm-aleph.jpg"> <Font face="arial" size="+1"> אֶלֶף </Font> e-leph <BR> Thousand <BR> Ten times one hundred in amount or number. <BR>Strong's Number: 505 <BR><A HREF="audio/ 505 .mp3"><IMG SRC="../../files/icon_audio.gif" width="25" height="25" border="0"></A><BR> <A HREF=../ahlb/aleph.html#505><Font color=A50000><B>AHLB</B></Font></A> <HR>
<A Name= 517 ></A> <IMG SRC="../../files/heb-anc-sm-mem.jpg"><IMG SRC="../../files/heb-anc-sm-aleph.jpg"> <Font face="arial" size="+1"> אֵם </Font> eym <BR> Mother <BR> A female parent. Maternal tenderness or affection. One who fulfills the role of a mother. <BR>Strong's Number: 517 <BR><A HREF="audio/ 517 .mp3"><IMG SRC="../../files/icon_audio.gif" width="25" height="25" border="0"></A><BR> <A HREF=../ahlb/aleph.html#517><Font color=A50000><B>AHLB</B></Font></A> <HR>
<A Name= 518 ></A> <IMG SRC="../../files/heb-anc-sm-mem.jpg"><IMG SRC="../../files/heb-anc-sm-yud.jpg"><IMG SRC="../../files/heb-anc-sm-aleph.jpg"> <Font face="arial" size="+1"> אִם </Font> eem <BR> If <BR> Allowing that; on condition that. A desire to bind two ideas together. <BR>Strong's Number: 518 <BR><A HREF="audio/ 518 .mp3"><IMG SRC="../../files/icon_audio.gif" width="25" height="25" border="0"></A><BR> <A HREF=../ahlb/aleph.html#518><Font color=A50000><B>AHLB</B></Font></A> <HR>
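I want to iterate over each of these entries, but they only start at line 100 of the file. I want this to work for every file similar to this one, so I cannot hard-code any line numbers. I downloaded the HTML with wget.
Or would it be easier to use XPath?
So in the end, I want something like the following.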
{dict_1: [img1, img2, img3], hebrewTxt: hebrewtxt, pronunciation: pronunciation, audio_file: audiofile}
{dict_2: [img1, img2, img3, img4], hebrewTxt: hebrewtxt, pronunciation: pronunciation, audio_file: audiofile}
{dict3... and so on
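One possible way to get there is sketched below. It rests on an assumption taken from the excerpt above: every entry ends with an `<HR>` tag and contains exactly one `<Font face="arial" size="+1">` element. Splitting the raw HTML on `<HR>` and skipping chunks without that font tag means no line numbers are needed. The names `parse_entries`, `BASE_URL`, and `SAMPLE_HTML` are hypothetical, the key `images` stands in for `dict_1`, `dict_2`, and so on, and the stdlib `html.parser` backend is used so that lxml is not required.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed URL of the downloaded page; relative links are resolved against it.
BASE_URL = "http://www.ancient-hebrew.org/m/dictionary/1000.html"

# Two entries copied from the excerpt in the question, used only as demo input.
SAMPLE_HTML = """<!--501-1000--> <A Name= 505 ></A> <IMG SRC="../../files/heb-anc-sm-pey.jpg"><IMG SRC="../../files/heb-anc-sm-lamed.jpg"><IMG SRC="../../files/heb-anc-sm-aleph.jpg"> <Font face="arial" size="+1"> אֶלֶף </Font> e-leph <BR> Thousand <BR> Ten times one hundred in amount or number. <BR>Strong's Number: 505 <BR><A HREF="audio/ 505 .mp3"><IMG SRC="../../files/icon_audio.gif" width="25" height="25" border="0"></A><BR> <A HREF=../ahlb/aleph.html#505><Font color=A50000><B>AHLB</B></Font></A> <HR>
<A Name= 517 ></A> <IMG SRC="../../files/heb-anc-sm-mem.jpg"><IMG SRC="../../files/heb-anc-sm-aleph.jpg"> <Font face="arial" size="+1"> אֵם </Font> eym <BR> Mother <BR> A female parent. Maternal tenderness or affection. One who fulfills the role of a mother. <BR>Strong's Number: 517 <BR><A HREF="audio/ 517 .mp3"><IMG SRC="../../files/icon_audio.gif" width="25" height="25" border="0"></A><BR> <A HREF=../ahlb/aleph.html#517><Font color=A50000><B>AHLB</B></Font></A> <HR>
"""


def parse_entries(raw_html, base_url=BASE_URL):
    """Return one dict per dictionary entry found in raw_html."""
    entries = []
    # Assumption: each entry is terminated by an <HR> tag.
    for chunk in re.split(r"<hr\s*/?>", raw_html, flags=re.IGNORECASE):
        soup = BeautifulSoup(chunk, "html.parser")
        font = soup.find("font", face="arial", size="+1")
        if font is None:
            # Page header, footer, or trailing whitespace: not an entry.
            continue

        # Letter images end in .jpg; the audio icon is a .gif and is skipped.
        images = [
            urljoin(base_url, img["src"])
            for img in soup.find_all("img", src=re.compile(r"\.jpg$"))
        ]

        # The transliteration is the bare text right after the <Font> tag.
        sibling = font.next_sibling
        pronunciation = sibling.strip() if isinstance(sibling, str) else None

        # The href contains stray whitespace ("audio/ 505 .mp3"); remove it.
        link = soup.find("a", href=re.compile(r"\.mp3$"))
        audio = None
        if link is not None:
            audio = urljoin(base_url, re.sub(r"\s+", "", link["href"]))

        entries.append({
            "images": images,
            "hebrewTxt": font.get_text(strip=True),
            "pronunciation": pronunciation,
            "audio_file": audio,
        })
    return entries


entries = parse_entries(SAMPLE_HTML)
for entry in entries:
    print(entry)
```

For a file downloaded with wget, the same function could be called on the file's contents instead of `SAMPLE_HTML`. If some pages do not separate entries with `<HR>`, the split pattern would need to change.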
【Discussion】:
Tags: python html python-3.x web-scraping beautifulsoup