【Question title】: How to construct a dictionary for each line in an HTML file
【Posted】: 2019-06-11 15:35:49
【Question】:

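I'm working with this URL (http://www.ancient-hebrew.org/m/dictionary/1000.html).

I'm trying to build a dictionary for each Hebrew word entry.

What I have so far only prints each of the pieces I'm trying to collect. However, I'm stuck on how to iterate over each word on the page and build a dictionary for it. My code is below.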

from bs4 import BeautifulSoup
import re

raw_html = open('/Users/gansaikhanshur/TESTING/webScraping/1000.html').read()
# lxml is faster. If you don't have it, pip install lxml
html = BeautifulSoup(raw_html, 'lxml')

# outputs: "http://www.ancient-hebrew.org/files/heb-anc-sm-beyt.jpg"
images = html.find_all('img', src=re.compile(r'\.jpg$'))
for image in images:
    # Escape the dots so only a literal leading "../../" is replaced
    image = re.sub(
        r"^\.\./\.\./", "http://www.ancient-hebrew.org/", image['src'])
    # print(image)

# outputs: "unicode_hebrew_text"
fonts = html.find_all('font', face="arial", size="+1")
for f in fonts:
    f = f.string.strip()
    print(f)

# outputs: "http://www.ancient-hebrew.org/m/dictionary/audio/998.mp3"
mp3links = html.find_all('a', href=re.compile(r'\.mp3$'))
for mp3 in mp3links:
    mp3 = "http://www.ancient-hebrew.org/m/dictionary/" + \
        mp3['href'].replace("\t", '')
    # print(mp3)

So in our HTML file we have, for example:

<!--501-1000--> <A Name=    505 ></A>   <IMG SRC="../../files/heb-anc-sm-pey.jpg"><IMG SRC="../../files/heb-anc-sm-lamed.jpg"><IMG SRC="../../files/heb-anc-sm-aleph.jpg">   <Font face="arial" size="+1">  &#1488;&#1462;&#1500;&#1462;&#1507; </Font>     e-leph  <BR>    Thousand    <BR>    Ten times one hundred in amount or number.  <BR>Strong's Number:    505 <BR><A HREF="audio/ 505 .mp3"><IMG SRC="../../files/icon_audio.gif"  width="25" height="25" border="0"></A><BR> <A HREF=../ahlb/aleph.html#505><Font color=A50000><B>AHLB</B></Font></A>    <HR>
    <A Name=    517 ></A>   <IMG SRC="../../files/heb-anc-sm-mem.jpg"><IMG SRC="../../files/heb-anc-sm-aleph.jpg">   <Font face="arial" size="+1">  &#1488;&#1461;&#1501;   </Font>     eym <BR>    Mother  <BR>    A female parent. Maternal tenderness or affection. One who fulfills the role of a mother.   <BR>Strong's Number:    517 <BR><A HREF="audio/ 517 .mp3"><IMG SRC="../../files/icon_audio.gif"  width="25" height="25" border="0"></A><BR> <A HREF=../ahlb/aleph.html#517><Font color=A50000><B>AHLB</B></Font></A>    <HR>
    <A Name=    518 ></A>   <IMG SRC="../../files/heb-anc-sm-mem.jpg"><IMG SRC="../../files/heb-anc-sm-yud.jpg"><IMG SRC="../../files/heb-anc-sm-aleph.jpg">     <Font face="arial" size="+1">  &#1488;&#1460;&#1501;   </Font>     eem <BR>    If  <BR>    Allowing that; on condition that. A desire to bind two ideas together.  <BR>Strong's Number:    518 <BR><A HREF="audio/ 518 .mp3"><IMG SRC="../../files/icon_audio.gif"  width="25" height="25" border="0"></A><BR> <A HREF=../ahlb/aleph.html#518><Font color=A50000><B>AHLB</B></Font></A>    <HR>

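I want to iterate over each of these entries, but they only start around line 100. I want this to work for every file similar to this one, so I can't hard-code any line numbers. I downloaded the HTML with wget.

Or would it be easier to use XPath?

In the end, I'd like something like the following.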

{dict_1: [img1, img2, img3], hebrewTxt: hebrewtxt, pronunciation: pronunciation, audio_file: audiofile}
{dict_2: [img1, img2, img3, img4], hebrewTxt: hebrewtxt, pronunciation: pronunciation, audio_file: audiofile}
{dict3... and so on

【Comments】:

    Tags: python html python-3.x web-scraping beautifulsoup


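    【Solution 1】:

    As far as I can tell, (almost) every line is one group of img, mp3, font, and so on.
    So I think you can parse the HTML line by line and extract the information you need on the fly.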
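
One way to make the line-by-line idea independent of line numbers is to keep only the lines that contain an entry anchor such as <A Name=    505 ></A>. The sketch below uses only the standard library. The names iter_entry_lines and ENTRY_ANCHOR and the abridged sample lines are illustrative assumptions, and it presumes each entry sits on a single line, as in the excerpt from the question.

```python
import re

# Assumption: every entry line contains an anchor like <A Name=    505 ></A>.
ENTRY_ANCHOR = re.compile(r'<a\s+name=\s*"?(\d+)"?\s*>', re.IGNORECASE)


def iter_entry_lines(lines):
    """Yield (entry_id, line) for each line that contains an entry anchor."""
    for line in lines:
        match = ENTRY_ANCHOR.search(line)
        if match:
            yield match.group(1), line


# Abridged sample: two entry lines surrounded by non-entry lines.
sample = [
    "<html><head><title>Dictionary</title></head>",
    '<!--501-1000--> <A Name=    505 ></A>   <IMG SRC="../../files/heb-anc-sm-pey.jpg"> e-leph <HR>',
    '    <A Name=    517 ></A>   <IMG SRC="../../files/heb-anc-sm-mem.jpg"> eym <HR>',
    "</body></html>",
]

entries = list(iter_entry_lines(sample))
print([entry_id for entry_id, _ in entries])  # prints ['505', '517']
```

Filtering on the anchor rather than on a line offset means the same loop should work on other pages with the same layout, wherever the entries happen to start.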

    For simplicity, I have only written functions that extract the image source link (src) and the media link (mp3).

    from bs4 import BeautifulSoup
    import re
    
    def getsrc(text):
        """Return the first .jpg src link found in text, or None if there is none."""
        # [^"]* keeps the capture inside a single quoted attribute value
        src = re.search(r'src="([^"]*\.jpg)"', text or '')
        return src.group(1) if src else None


    def getmp3(text):
        """Return the first .mp3 href link found in text, or None if there is none."""
        mp3 = re.search(r'href="([^"]*\.mp3)"', text or '')
        return mp3.group(1) if mp3 else None
    
    
    # ---------------
    
    raw_html = open('./page.html').readlines()
    
    for line in raw_html:
        html = BeautifulSoup(line, 'lxml')
    
        # Image
        img = str(html.find('img'))
        src = getsrc(img)
    
        # Mp3 link
        a = str(html.find_all('a'))
        mp3 = getmp3(a)
    
        dictionary = {
            'src':src,
            'media': mp3
        }
    
        print(dictionary)
    
    

    The output of this snippet looks like this:

    {'src': './page_files/heb-anc-sm-hey.jpg', 'media': 'http://www.ancient-hebrew.org/m/dictionary/audio/998.mp3'}
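
The same per-line approach can be extended to the other fields the question asks for (all images, the Hebrew text, the pronunciation, and the audio file). What follows is a minimal sketch, not the original answer's code. The function name parse_entry, the constant BASE_URL, and the abridged sample line are assumptions made for illustration. It uses the built-in "html.parser" backend so it runs without lxml, and it assumes the transliteration is the bare text directly after the closing </Font> tag, as in the excerpt from the question.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: relative links are resolved against the page named in the question.
BASE_URL = "http://www.ancient-hebrew.org/m/dictionary/1000.html"


def parse_entry(line):
    """Return a dict for one entry line, or None if the line holds no entry."""
    soup = BeautifulSoup(line, "html.parser")
    font = soup.find("font", face="arial")
    if font is None:
        return None  # header, footer, or any other non-entry line

    # Only the letter images end in .jpg; the audio icon is a .gif.
    images = [urljoin(BASE_URL, img["src"])
              for img in soup.find_all("img", src=re.compile(r"\.jpg$"))]

    # The transliteration is assumed to be the text node right after </Font>.
    following = font.next_sibling
    pronunciation = following.strip() if isinstance(following, str) else None

    # The hrefs contain stray whitespace ("audio/ 517 .mp3"), so remove it.
    link = soup.find("a", href=re.compile(r"\.mp3$"))
    audio = urljoin(BASE_URL, re.sub(r"\s+", "", link["href"])) if link else None

    return {
        "images": images,
        "hebrewTxt": font.get_text(strip=True),
        "pronunciation": pronunciation,
        "audio_file": audio,
    }


# Abridged copy of the entry for 517 from the question.
sample = (
    '<A Name=    517 ></A>   <IMG SRC="../../files/heb-anc-sm-mem.jpg">'
    '<IMG SRC="../../files/heb-anc-sm-aleph.jpg">   <Font face="arial" size="+1">'
    '  &#1488;&#1461;&#1501;   </Font>     eym <BR>    Mother  <BR>'
    '<A HREF="audio/ 517 .mp3"><IMG SRC="../../files/icon_audio.gif"></A><BR> <HR>'
)

entry = parse_entry(sample)
print(entry["pronunciation"], entry["audio_file"])
# prints: eym http://www.ancient-hebrew.org/m/dictionary/audio/517.mp3
```

To process a whole file, one could call parse_entry on every line and keep the results that are not None, which avoids depending on the line number where the entries begin.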
    

    【Comments】:
