【Question title】: Parsing and matching an unusual HTML structure using Python
【Posted】: 2015-04-19 05:21:10
【Question】:

First, I want to parse this HTML:

<body>
    <div id="contents">
        <div id="links">
            <a href='url1'>link-1</a></div>
        <div id="info">
            <p>apple</p></div>
        <div id="links">
            <a href='url2'>link-2</a>
            <a href='url3'>link-3</a></div>
        <div id="info">
            <p>bear</p></div>
        <div id="links">
            <a href='url4'>link-4</a></div>
        <div id="info">
            <p>cat</p></div>
        <div id="links">
            <a href='url5'>link-5</a>
            <a href='url6'>link-6</a>
            <a href='url7'>link-7</a></div>
        <div id="info">
            <p>duck</p></div>
        <div id="links">
            <a href='url8'>link-8</a></div>
        <div id="info">
            <p>egg</p></div>
        #etc
    </div>
</body>

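My goal is to collect all the links and all the info entries and match them up. However, there are 8 links and only 5 info entries, so pairing them one-to-one by position does not work.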

from urllib.parse import urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup


def link_collect(soup):
    # Collect the href of every <a> tag whose href contains "url".
    tempaddress = []
    for r in soup.find_all("a"):
        href = r.get("href")  # None when the tag has no href attribute
        if href and "url" in href:
            tempaddress.append(href)
    # Rebuild each address from scheme, host and path (drops query and fragment).
    link_list = []
    for clearing in tempaddress:
        cleared = urlparse(str(clearing))
        link_list.append(cleared.scheme + "://" + cleared.netloc + cleared.path)
    return link_list


def info_collect(soup):
    # Collect the text of every element with id="info".
    return [r.get_text() for r in soup.find_all(id="info")]


targetpage = "http://address"
opening = urlopen(targetpage)
soup = BeautifulSoup(opening.read(), "html.parser")
link = link_collect(soup)
info = info_collect(soup)
# Link n is paired with info n purely by position, which causes the mismatch.
for n in range(len(info)):
    print(str(link[n]) + " = " + str(info[n]))

When I run it, the result looks like this:

url1 = apple
url2 = bear
url3 = cat
url4 = duck
url5 = egg
Error : url 6, 7, 8 can't match

However, this is the result I want:

url1 = apple
url2 = bear
url3 = bear
url4 = cat
url5 = duck
etc

How can I achieve this?

【Discussion】:

    Tags: python html parsing python-3.x beautifulsoup


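    【Answer 1】:

    You need to use the find_all() method to get the parent div elements of the a tags. Then use the zip function to iterate over the (div, info) pairs, and call find_all() again on each div to get all of its links.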

    In [85]: from bs4 import BeautifulSoup
    
    In [86]: soup = BeautifulSoup("""<body>
        <div id="contents">
            <div id="links">
                <a href='url1'>link-1</a></div>
            <div id="info">
                <p>apple</p></div>
            <div id="links">
                <a href='url2'>link-2</a>
                <a href='url3'>link-3</a></div>
            <div id="info">
                <p>bear</p></div>
            <div id="links">
                <a href='url4'>link-4</a></div>
            <div id="info">
                <p>cat</p></div>
            <div id="links">
                <a href='url5'>link-5</a>
                <a href='url6'>link-6</a>
                <a href='url7'>link-7</a></div>
            <div id="info">
                <p>duck</p></div>
            <div id="links">
                <a href='url8'>link-8</a></div>
            <div id="info">
                <p>egg</p></div>
            #etc
        </div>
    </body>""")
    
    In [87]: links = soup.find_all('div', attrs={'id': 'links'})
    
    In [88]: infos = soup.find_all('div', attrs={'id': 'info'})
    
    In [157]: for lk, inf in zip(links, infos):
       .....:     for tag in lk.find_all('a'):
       .....:         print(inf.text, tag.attrs['href'])
       .....:         
    
    apple url1
    
    bear url2
    
    bear url3
    
    cat url4
    
    duck url5
    
    duck url6
    
    duck url7
    
    egg url8
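
The zip approach above relies on every links div having exactly one info div, in the same order. As a minimal sketch of an alternative, assuming each info div directly follows its links div as a sibling (as in the question's HTML), find_next_sibling() can look up the matching info for each links div. The function name pair_links_with_info and the shortened HTML sample are made up for illustration; get_text(strip=True) is used to drop the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's HTML (illustrative sample only).
HTML = """
<div id="contents">
    <div id="links"><a href='url1'>link-1</a></div>
    <div id="info"><p>apple</p></div>
    <div id="links"><a href='url2'>link-2</a><a href='url3'>link-3</a></div>
    <div id="info"><p>bear</p></div>
</div>
"""


def pair_links_with_info(soup):
    """Return (href, info_text) tuples, one per link (hypothetical helper)."""
    pairs = []
    for links_div in soup.find_all("div", id="links"):
        # The info block is assumed to be the next sibling div with id="info".
        info_div = links_div.find_next_sibling("div", id="info")
        if info_div is None:
            continue  # no info block after this group of links
        info_text = info_div.get_text(strip=True)
        # href=True skips <a> tags that have no href attribute.
        for a in links_div.find_all("a", href=True):
            pairs.append((a["href"], info_text))
    return pairs


soup = BeautifulSoup(HTML, "html.parser")
for url, info in pair_links_with_info(soup):
    print(url, "=", info)
```

With this sample, url2 and url3 are both paired with bear, which is the grouping the question asks for.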
    

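    【Comments】:

    • Thanks, Michael. I just ran this code. In my case I am not using IDLE; anyway, when I print(d) on line 91 it only gives me {'apple '}, just one entry. One more thing: I don't want the "link-n" text, what I want is the "urls". How do I extract them?
    • Thanks a lot. I have one more question: when I run it now, I get an error on print(inf.text, tag.attrs['href']), KeyError: 'href'
    • @HyungsooKim Please try again. It should work now.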
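
A KeyError: 'href' like the one reported in the comments is what BeautifulSoup raises when an a tag has no href attribute. Whether that was the cause on the asker's real page is an assumption. The sketch below uses a made-up snippet to show two ways around it: tag.get("href"), which returns None for a missing attribute, and find_all("a", href=True), which only returns tags that have an href.

```python
from bs4 import BeautifulSoup

# Made-up sample: the first <a> has no href attribute.
html = """
<div id="links">
    <a name="top">anchor without href</a>
    <a href="url1">link-1</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# tag["href"] or tag.attrs["href"] would raise KeyError on the first <a>;
# tag.get("href") returns None instead.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True restricts the search to tags that actually have an href attribute.
safe_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(all_hrefs)
print(safe_hrefs)
```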