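【Question title】: Redirecting to a new URL to parse through
【Posted】: 2018-12-10 21:26:22
【Question description】:

I'm currently building a program that parses Wikipedia to display a country's mountains on a map.

I've been able to find the URL of interest, but I can't redirect to the new URL (where all the data I need lives).

Any and all suggestions are greatly appreciated, including ones that use other libraries!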

import requests
from bs4 import BeautifulSoup
from  csv import writer
import urllib3

#Requests country name from user
user_input=input('Enter Country:')
fist_letter=user_input[0:1].upper()
country=fist_letter+user_input[1:] #takes the country name and capitalizes the first letter

#Request response for wikipedia parse
response=requests.get('https://en.wikipedia.org/wiki/Category:Lists_of_mountains_by_country')
bs=BeautifulSoup(response.text,'html.parser')

#country query
for content in bs.find_all(class_='mw-category')[1]:
    category_letter=content.find('h3')

    #Locates target category to find the country of interest
    if fist_letter in category_letter:
        country_lists=category_letter.find_next_sibling('ul')

        #Locates the country of interest from the lists of countries in target category
        target=country_lists.find('li',text="List of mountains in "+str(country))

        #Grabs the link which will redirect to the page containing the list of
        #mountains for the country of interest.
        target_link=target.find('a')
        link=target_link.get('href')
        new_link='https://enwikipedia.org'+link

        #Attempts to redirect to the target link
        new_response=requests.get(new_link)
        BS=BeautifulSoup(new_response.text,'html.parser')
        mountain_list=content.find('tbody')
        print(mountain_list)

    else:
        pass

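【Question comments】:

  • Shouldn't https://enwikipedia.org be https://en.wikipedia.org? In any case, it would be easier to just append the country name: https://en.wikipedia.org/wiki/Category:Lists_of_mountains_of_COUNTRYNAME
  • Wow, yes, that could be it. I'll give it a try and see how it goes! Thanks!
  • You're welcome @jamil. Would you accept my comment as an answer?
  • Yes, of course! P.S. I don't have enough reputation to vote yet...

Tags: python url beautifulsoup python-requests html-parsing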


【Solution 1】:

Shouldn't https://enwikipedia.org be https://en.wikipedia.org?

In any case, it would be easier to just append the country name:

https://en.wikipedia.org/wiki/Category:Lists_of_mountains_of_**COUNTRYNAME**
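
A minimal sketch of the direct-URL idea, not a definitive implementation. It assumes the target pages are titled either List_of_mountains_in_<Country> (the title the question's code searches for) or List_of_mountains_of_<Country> (an assumed alternative spelling). The helper name candidate_urls is hypothetical.

```python
BASE = "https://en.wikipedia.org/wiki/"


def candidate_urls(country):
    """Return possible list-page URLs for a country (assumed title patterns)."""
    # Wikipedia page titles use underscores where the visible title has spaces.
    name = country.strip().replace(" ", "_")
    # Only the first letter is forced to upper case; the rest is kept as typed.
    name = name[:1].upper() + name[1:]
    return [BASE + "List_of_mountains_" + prep + "_" + name for prep in ("in", "of")]


for url in candidate_urls("poland"):
    print(url)
```

Each candidate can then be requested in turn, keeping the first one that responds successfully.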

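【Comments】:

  • That's a good idea, but some URLs use a different syntax, for example mountains_of_New_Zealand and mountains_in_Poland. I suppose the OP could try both.
  • Yes, I also noticed a problem with the "Canada" query. I spent a few hours trying to find a way to search for the user's country name in the text between the tags, without success. I'll try again, and if that fails too I may post here again. Thanks for your comment!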
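
One way to sidestep the in/of difference is to match on the visible link text instead of the URL. The sketch below is only an illustration under stated assumptions: it uses the standard library's HTMLParser rather than BeautifulSoup so it runs without extra packages, the sample HTML is made up, and the names LinkCollector and find_country_link are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears while an <a href=...> is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def find_country_link(html, country):
    """Return the href of the first 'List of mountains ...' link ending in the country name."""
    collector = LinkCollector()
    collector.feed(html)
    needle = " " + country.strip().lower()
    for href, text in collector.links:
        lowered = text.strip().lower()
        if lowered.startswith("list of mountains") and lowered.endswith(needle):
            return href
    return None


# Made-up sample that mimics the two title spellings mentioned above.
sample = (
    '<ul><li><a href="/wiki/List_of_mountains_in_Poland">List of mountains in Poland</a></li>'
    '<li><a href="/wiki/List_of_mountains_of_New_Zealand">List of mountains of New Zealand</a></li></ul>'
)
print(find_country_link(sample, "new zealand"))
```

Matching on the end of the link text means the preposition does not matter, and a short name such as "Niger" does not accidentally match "Nigeria".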
【Solution 2】:

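I like to parse HTML with Python's string split() and find(). Splitting with a single cut gives you a left part and a right part, and you can pick either one with plain index syntax, for example: html_str.split('<a href="', 1)[1]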
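
To make the single-cut split concrete, here is a small sketch on a made-up HTML fragment (the sample string is an assumption, not real Wikipedia output):

```python
html_str = '<li><a href="/wiki/List_of_mountains_in_Uganda" title="List of mountains in Uganda">'

# maxsplit=1 cuts at the first match only: [text before, text after].
left, right = html_str.split('<a href="', 1)

# A second single cut at the closing quote isolates the URL path.
href = right.split('"', 1)[0]
print(href)  # /wiki/List_of_mountains_in_Uganda
```

Note that the two-name unpacking raises ValueError when the marker is missing, so real code would likely check with find() first.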

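Anyway, once the code has split out the correct URL, just fetch and parse that page in the same way. Oh, and it's probably worth checking for HTTP errors.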
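
A hedged sketch of the HTTP error check: requests' Response.raise_for_status() raises an exception on 4xx/5xx responses, so an error page never gets parsed as if it were the list. The helper name fetch_html and its get parameter are hypothetical; the parameter exists only so the function can be exercised without network access.

```python
def fetch_html(url, get=None):
    """Fetch a page and raise on HTTP 4xx/5xx instead of returning an error page."""
    if get is None:
        import requests  # imported lazily so the helper can be tested offline
        get = requests.get
    response = get(url, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```

Calling fetch_html(found_url) in place of a bare requests.get() keeps the rest of the parsing code unchanged.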

import requests
import urllib3

#Requests country name from user
user_input = input('Enter Country:')
country = user_input.strip().lower().capitalize()

#Request response for wikipedia parse
response = requests.get('https://en.wikipedia.org/wiki/Category:Lists_of_mountains_by_country')
response_body = str( response.content, "utf-8" )

# Find the "By Country" section in the HTML result
# This section begins at the Title "Lists of mountains by country"
country_section = response_body.split( 'Pages in category "Lists of mountains by country"' )[1]
search_term = "in_" + country

if ( country_section.find( search_term ) != -1 ):
    # each country URL begins "<li><a href="/wiki/List_of_mountains_..."
    country_urls = country_section.split('<li><a href="')
    for url in country_urls:
        if ( url.find( search_term ) != -1 ):
            # The URL ends "..._in_Uganda" title="List o..."
            # Split off the Right-Side text
            found_url = "https://en.wikipedia.org" + url.split('" title=')[0]
            print( "DEBUG: URL Is [" + found_url + "]" )

            ## Now fetch the country-url
            response = requests.get( found_url )
            response_body = str( response.content, "utf-8" )
            ### TODO - process mountain list
else:
    print( "That country [" + country + "] does not have an entry" )

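【Comments】:

  • I've noticed that some people store the body of the HTML text in a variable. Is it more efficient to parse it that way? Also, thanks for your answer!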