【Question Title】: Python: Web Scraping Weird Output
【Posted】: 2021-01-04 13:28:41
【Question】:
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
import requests

url = 'https://en.wikisource.org/wiki/Main_Page'
r = requests.get(url)

Soup = BeautifulSoup(r.text, "html5lib")
List = Soup.find("div",class_="enws-mainpage-widget-content", id="enws-mainpage-newtexts-content").find_all('a')
ebooks=[]
i=0
for ebook in List:
    x=ebook.get('title')
    for ch in x:
        if(ch==":"):
            x=""
    if x!="":
        ebooks.append(x)
        i=i+1
        

inputnumber=0
while inputnumber<len(ebooks):
    print(inputnumber+1, " - ", ebooks[inputnumber])
    inputnumber=inputnumber+1
input=int(input("Please select a book: "))
selectedbook = Soup.find("a", title=ebooks[input-1])
print(selectedbook['title'])
url1 = "https://en.wikisource.org/"+selectedbook['href']
print(url1)
r1 = requests.get(url1)
Soup1 = BeautifulSoup(r1.text, "html5lib")
List1 = Soup.find("div", class_="prp-pages-output")
print(List1)

This is my code. In the last part I want to get the paragraphs from the page's HTML. But this is the output I get:

1  -  The Center of the Web
2  -  Bobby Bumps Starts a Lodge
3  -  May (Mácha)
4  -  Animal Life and the World of Nature/1903/06/Notes and Comments
5  -  The Czechoslovak Review/Volume 2/No Compromise
6  -  She's All the World to Me
7  -  Their One Love
Please select a book: 4
Animal Life and the World of Nature/1903/06/Notes and Comments
https://en.wikisource.org//wiki/Animal_Life_and_the_World_of_Nature/1903/06/Notes_and_Comments
None

Why does List1 return None? It shouldn't. Can anyone tell me what I did wrong?

【Comments】:

    Tags: python web web-scraping


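    【Solution 1】:

    I guess you just typed Soup where you meant Soup1, so the last search ran against the main page instead of the book page. Also, I think you want more than one element when you look up a list of items, so I switched to the find_all() function.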
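To illustrate both points without hitting the network, here is a minimal sketch. The two HTML strings are made up for the example and only stand in for the main page and a book page; it is an assumption that a real book page can hold more than one `prp-pages-output` div. The sketch uses the built-in `html.parser` instead of `html5lib` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up stand-ins for the main page and a book page (not real Wikisource markup).
MAIN_HTML = '<div id="newtexts"><a href="/wiki/Book" title="Book">Book</a></div>'
BOOK_HTML = (
    '<div class="prp-pages-output"><p>First paragraph.</p></div>'
    '<div class="prp-pages-output"><p>Second paragraph.</p></div>'
)

main_soup = BeautifulSoup(MAIN_HTML, "html.parser")
book_soup = BeautifulSoup(BOOK_HTML, "html.parser")

# Searching the wrong document: the main page has no such div, so find() returns None.
wrong = main_soup.find("div", class_="prp-pages-output")
print(wrong)

# find() returns only the first match; find_all() returns every match as a list.
first = book_soup.find("div", class_="prp-pages-output")
every = book_soup.find_all("div", class_="prp-pages-output")

# Collect the paragraph text from all matching divs.
paragraphs = [p.get_text(strip=True) for div in every for p in div.find_all("p")]
print(len(every), paragraphs)
```

Because `find()` signals "no match" with `None` rather than an exception, a search against the wrong soup object fails silently, which is what the `None` in the question's output points to.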

    from bs4 import BeautifulSoup
    import requests
    
    url = "https://en.wikisource.org/wiki/Main_Page"
    r = requests.get(url)
    
    Soup = BeautifulSoup(r.text, "html5lib")
    List = Soup.find(
        "div", class_="enws-mainpage-widget-content", id="enws-mainpage-newtexts-content"
    ).find_all("a")
    ebooks = []
    for ebook in List:
        x = ebook.get("title")
        # Skip links without a title and titles that contain a colon
        if x and ":" not in x:
            ebooks.append(x)
    
    
    # Print a numbered menu, starting at 1
    for number, title in enumerate(ebooks, start=1):
        print(number, " - ", title)
    # Use a separate name so the built-in input() is not overwritten
    choice = int(input("Please select a book: "))
    selectedbook = Soup.find("a", title=ebooks[choice - 1])
    print(selectedbook["title"])
    # The href already starts with "/", so the base URL has no trailing slash
    url1 = "https://en.wikisource.org" + selectedbook["href"]
    print(url1)
    r1 = requests.get(url1)
    Soup1 = BeautifulSoup(r1.text, "html5lib")
    List1 = Soup1.find_all("div", class_="prp-pages-output")
    print(List1)
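Two smaller details in the script can be sketched with the standard library alone: the character loop that drops titles containing a colon, and the doubled slash in the printed URL. The helper names `keep_titles` and `absolute_url` and the sample titles are hypothetical, introduced only for illustration. Skipping links that have no title at all is an added assumption, since `ebook.get("title")` can return `None`.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikisource.org/"


def keep_titles(titles):
    """Drop missing titles and titles that contain a colon."""
    return [t for t in titles if t and ":" not in t]


def absolute_url(href):
    """Join a site-relative href to the base URL without doubling the slash."""
    return urljoin(BASE_URL, href)


# Made-up sample data: one normal title, one missing title, one with a colon.
sample = ["The Center of the Web", None, "Author:Example", "Their One Love"]
print(keep_titles(sample))
print(absolute_url("/wiki/Main_Page"))
```

`urljoin` treats an href that starts with `/` as relative to the site root, so it works whether or not the base URL ends with a slash.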
    

    【Discussion】:
