【Question title】: Iterating over URLs fails to find the correct href in Python using BeautifulSoup
【Posted】: 2020-08-05 21:51:54
【Question description】:

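My code iterates over a website. It loops through 52 listing pages and collects the link to each article.

It then visits each of those URLs and tries to get the link to the English translation. If you look at the Mongolian site, there is an "Орчуулга" section in the top-right corner with "English" underneath it; that is the link to the English translation.

However, my code fails to grab the link to the English translation and returns the wrong URL. Below is the sample output for the first article.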

1
{'https://mn.usembassy.gov/mn/2020-naadam-mn/': 'https://mn.usembassy.gov/mn/sitemap-mn/'}

The expected output for the first article is

1
{'https://mn.usembassy.gov/mn/2020-naadam-mn/': 'https://mn.usembassy.gov/2020-naadam/'}

Here is my code

import requests
from bs4 import BeautifulSoup


url = 'https://mn.usembassy.gov/mn/news-events-mn/page/{page}/'

urls = []
for page in range(1, 53):
    print(str(page) + "/52")
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
    for h in soup.find_all('h2'):
        a = h.find('a')
        urls.append(a.attrs['href'])

print(urls)

i = 0
bilingual_dict = {}
for url in urls:
    i += 1
    print(i)
    soup = BeautifulSoup(requests.get(url.format(page=url)).content, 'html.parser')
    for div in soup.find_all('div', class_='translations_sidebar'):
        for ul in soup.find_all('ul'):
            for li in ul.find_all('li'):
                a = li.find('a')
    bilingual_dict[url] = a['href']
    print(bilingual_dict)
print(bilingual_dict)

【Question comments】:

    Tags: python python-3.x beautifulsoup


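    【Solution 1】:

    This script prints the link to the English translation by selecting the anchor whose hreflang attribute is "en":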

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://mn.usembassy.gov/mn/2020-naadam-mn/'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
    link = soup.select_one('a[hreflang="en"]')
    print(link['href'])
    

    Prints:

    https://mn.usembassy.gov/2020-naadam/
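
To see why the loop in the question returns an unrelated link, here is a minimal offline sketch. The HTML below is a hypothetical, simplified stand-in for an article page (the real markup is not shown in the question), and the example.com URLs are placeholders. In the question's code the inner search starts from `soup` rather than `div`, so it walks every `<ul>` on the page and `a` ends up holding the last list link found, which is consistent with the sitemap URL in the reported output.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page: a translations sidebar plus a footer list.
HTML = """
<html><body>
  <div class="translations_sidebar">
    <ul><li><a hreflang="en" href="https://example.com/article/">English</a></li></ul>
  </div>
  <ul class="footer">
    <li><a href="https://example.com/mn/contact/">Contact</a></li>
    <li><a href="https://example.com/mn/sitemap-mn/">Sitemap</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Pattern from the question: the inner search uses `soup`, not `div`,
# so every <ul> on the page is visited and `a` is overwritten each time.
a = None
for div in soup.find_all("div", class_="translations_sidebar"):
    for ul in soup.find_all("ul"):
        for li in ul.find_all("li"):
            a = li.find("a")
last_link = a["href"]

# Selector from the answer: match the anchor by its hreflang attribute.
link = soup.select_one('a[hreflang="en"]')
english_link = link["href"] if link else None

# Alternative: keep the question's approach but search inside the sidebar only.
sidebar = soup.find("div", class_="translations_sidebar")
scoped = sidebar.find("a") if sidebar else None
scoped_link = scoped["href"] if scoped else None

print(last_link)     # the footer sitemap link, not the translation
print(english_link)  # the English translation link
print(scoped_link)   # same link, found by scoping the search
```

Either fix works on this sample; the `hreflang` selector is shorter and does not depend on where the sidebar sits in the page.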
    

    Full code (articles without an English translation link are set to None):

    import requests
    from bs4 import BeautifulSoup
    from pprint import pprint
    
    url = 'https://mn.usembassy.gov/mn/news-events-mn/page/{page}/'
    
    urls = []
    for page in range(1, 53):
        print('Page {}...'.format(page))
        soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
        for h in soup.find_all('h2'):
            a = h.find('a')
            urls.append(a.attrs['href'])
    
    pprint(urls)
    
    bilingual_dict = {}
    for url in urls:
        print(url)
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        link = soup.select_one('a[hreflang="en"]')
        bilingual_dict[url] = link['href'] if link else None
    
    pprint(bilingual_dict)
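
The full script above raises an error if an `<h2>` on a listing page has no link inside it. The following is one possible sketch, not part of the original answer, that splits the parsing into small functions working on HTML strings so they can be tried without network access. The function names, the `fetch` parameter, and the example.com pages are all hypothetical.

```python
from bs4 import BeautifulSoup


def extract_article_links(listing_html):
    """Return the href of the first link inside each <h2> that has one."""
    soup = BeautifulSoup(listing_html, "html.parser")
    links = []
    for h2 in soup.find_all("h2"):
        a = h2.find("a", href=True)  # None for headings without a link
        if a is not None:
            links.append(a["href"])
    return links


def find_english_link(article_html):
    """Return the href of the hreflang="en" anchor, or None if absent."""
    soup = BeautifulSoup(article_html, "html.parser")
    link = soup.select_one('a[hreflang="en"][href]')
    return link["href"] if link else None


def build_bilingual_dict(urls, fetch):
    """Map each URL to its English counterpart; fetch(url) returns HTML text."""
    return {url: find_english_link(fetch(url)) for url in urls}


# Hypothetical stand-ins for a listing page and two article pages.
LISTING = (
    '<h2><a href="https://example.com/mn/a-mn/">A</a></h2>'
    "<h2>No link here</h2>"
    '<h2><a href="https://example.com/mn/b-mn/">B</a></h2>'
)
PAGES = {
    "https://example.com/mn/a-mn/": '<a hreflang="en" href="https://example.com/a/">English</a>',
    "https://example.com/mn/b-mn/": "<p>No translation</p>",
}

urls = extract_article_links(LISTING)
result = build_bilingual_dict(urls, PAGES.__getitem__)
print(result)
```

Against the real site, `fetch` could be something like `lambda u: session.get(u, timeout=10).text` with a shared `requests.Session()`, which reuses the connection across the article requests; that wiring is an assumption and is not tested here.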
    

    【Comments】:
