【Question title】: parsing weird structure of the HTML using python - second ver
【Posted】: 2015-04-20 05:31:03
【Question】:

I asked a question yesterday and it was cleared up, but now I have a trickier problem.

First, here is the HTML structure I want to parse:

<body>
    <div id="links">
        <a href='url1'>apple-explain</a>
        <blackquote>
            <a href='url1'>link-1</a>
        </blackquote>
    </div>
    <div id="info">
        <p>apple</p></div>

    <div id="links">
        <a href='batch_url1'>bear-explain</a>
        <blackquote>
            <a href='url2'>link-1</a>
            <a href='url3'>link-2</a>
        </blackquote>
    </div>
    <div id="info">
        <p>bear</p></div>

    <div id="links">
        <a href='url4'>cat-explain</a>
        <blackquote>
            <a href='url4'>link-1</a>
        </blackquote>
    </div>
    <div id="info">
        <p>cat</p></div>

    <div id="links">
        <a href='batchurl2'>duck-explain</a>
        <blackquote>
            <a href='url5'>link-1</a>
            <a href='url6'>link-2</a>
            <a href='url7'>link-3</a>
        </blackquote>
    </div>
    <div id="info">
        <p>duck</p></div>

    <div id="links">
        <a href='url8'>egg-explain</a></div>
        <blackquote>
            <a href='url8'>link-1</a>
        </blackquote>
    </div>
    <div id="info">
        <p>egg</p></div>
    #etc
</body>

It looks a bit long, but the structure is simple:

<div id="links">
    <a href=url>some explain</a>
    <blackquote>
        <a href=url>link number</a>
    </blackquote></div>
<div id="info">
    <p>info keyword</p></div>

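My goal is to "grab all the URLs, remove duplicates, and match them with the info keyword".

For example, the apple section has two <a> tags, but they share the same href. The bear section has three <a> tags with three different hrefs: one directly inside the <div> and two inside the <blackquote>.

I want to build clean tuples and print them.

The tuples are: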

(apple, url1)
(bear, [batch_url1, url2, url3])
etc...

The printed output should look like this:

url1 = apple
batch_url1 = bear
url2 = bear
url3 = bear
etc

Here is my code:

soup = BeautifulSoup("""that HTML""")
url_list = soup.find_all('div', attrs={'id': 'links'})
info_list = soup.find_all('div', attrs={'id': 'links'})

for url, info in zip(url_list, info_list):
    for temp in url.find_all():
        infokeyword = info.text
        urls = temp.attrs['href']

zipped = zip(infokeyword, urls)
d=len(infokeyword)
for n in range(0, d+1):
    print(str(infokeyword[n]) + " = " + str(urls[n]))

When I run it, the result is:

Traceback (most recent call last):
File "D:/Users/Hyungsoo/PycharmProjects/untitled1/zx.py", line 59, in <module>
urls = temp.attrs['href']
KeyError: 'href'

How can I get the output I described?

    Tags: html parsing python-3.x beautifulsoup


    【Solution 1】:

    To get the distinct URLs, you can use collections.defaultdict with set as the default_factory:

    In [72]: from collections import defaultdict
    
    In [73]: from bs4 import BeautifulSoup
    
    In [74]: soup = BeautifulSoup("""<body>
       ....:     <div id="links">
       ....:         <a href='url1'>apple-explain</a>
       ....:         <blackquote>
       ....:             <a href='url1'>link-1</a>
       ....:         </blackquote>
       ....:     </div>
       ....:     <div id="info">
       ....:         <p>apple</p></div>
       ....: 
       ....:     <div id="links">
       ....:         <a href='batch_url1'>bear-explain</a>
       ....:         <blackquote>
       ....:             <a href='url2'>link-1</a>
       ....:             <a href='url3'>link-2</a>
       ....:         </blackquote>
       ....:     </div>
       ....:     <div id="info">
       ....:         <p>bear</p></div>
       ....: 
       ....:     <div id="links">
       ....:         <a href='url4'>cat-explain</a>
       ....:         <blackquote>
       ....:             <a href='url4'>link-1</a>
       ....:         </blackquote>
       ....:     </div>
       ....:     <div id="info">
       ....:         <p>cat</p></div>
       ....: 
       ....:     <div id="links">
       ....:         <a href='batchurl2'>duck-explain</a>
       ....:         <blackquote>
       ....:             <a href='url5'>link-1</a>
       ....:             <a href='url6'>link-2</a>
       ....:             <a href='url7'>link-3</a>
       ....:         </blackquote>
       ....:     </div>
       ....:     <div id="info">
       ....:         <p>duck</p></div>
       ....: 
       ....:     <div id="links">
       ....:         <a href='url8'>egg-explain</a></div>
       ....:         <blackquote>
       ....:             <a href='url8'>link-1</a>
       ....:         </blackquote>
       ....:     </div>
       ....:     <div id="info">
       ....:         <p>egg</p></div>
       ....:     #etc
       ....: </body>""")
    
    In [75]: distinct_url = defaultdict(set)
    
    In [76]: links = soup.select('div#links')
    
    In [77]: infos = soup.select('div#info p')
    
    In [78]: for k, v in zip(links, infos):
       ....:     for l in k.find_all('a'):
       ....:         distinct_url[v.text].add(l.attrs['href'])
       ....:         
    
    In [79]: distinct_url
    Out[79]: defaultdict(<class 'set'>, {'apple': {'url1'}, 'duck': {'url5', 'url7', 'url6', 'batchurl2'}, 'bear': {'batch_url1', 'url3', 'url2'}, 'cat': {'url4'}, 'egg': {'url8'}})
    
    In [80]: for info, lks in distinct_url.items():
       ....:     for lk in lks:
       ....:         print(info, lk)
       ....:         
    apple url1
    duck url5
    duck url7
    duck url6
    duck batchurl2
    bear batch_url1
    bear url3
    bear url2
    cat url4
    egg url8 
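
A set removes the duplicates but does not keep the URLs in document order, while the question's desired output lists them in order (url1, batch_url1, url2, url3). The following is a minimal sketch of one order-preserving alternative, not part of the original answer. The function name unique_url_pairs and the sections list are hypothetical; sections stands in for the hrefs you would collect per section in the zip loop above.

```python
def unique_url_pairs(sections):
    """Return (url, keyword) pairs in first-seen order, without duplicates.

    `sections` is a hypothetical list of (keyword, hrefs) tuples, one per
    <div id="links"> / <div id="info"> pair, with hrefs in document order.
    """
    pairs = []
    for keyword, hrefs in sections:
        # dict.fromkeys keeps the first occurrence of each href and,
        # unlike set, preserves insertion order (guaranteed in Python 3.7+).
        for url in dict.fromkeys(hrefs):
            pairs.append((url, keyword))
    return pairs


# Hrefs as they appear in the first two sections of the question's HTML.
sections = [
    ("apple", ["url1", "url1"]),
    ("bear", ["batch_url1", "url2", "url3"]),
]

for url, keyword in unique_url_pairs(sections):
    print(f"{url} = {keyword}")
# url1 = apple
# batch_url1 = bear
# url2 = bear
# url3 = bear
```

With the links and infos lists from the session above, each section's hrefs could be gathered as [a['href'] for a in k.find_all('a', href=True)]. The href=True filter also avoids the KeyError from the question: find_all() with no arguments returns every descendant tag, including <blackquote>, which has no href attribute.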
    
