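【Question title】: Why is Beautiful Soup returning duplicate results?
【Posted】: 2021-12-08 17:02:28
【Question】:

I am building a project that scrapes the Indeed website. It was working fine, but when I ran it today, without having made any changes, it no longer returns the whole page of results and instead shows only the first result, repeated. Can someone help me fix this?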

from tkinter import *
import random
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import requests


html_text = requests.get('https://www.ign.com/').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find('section',class_='right')
#print(html_text)


driver = webdriver.Chrome(executable_path='/Users/Miscellaneous/PycharmProjects/RecursivePractice/chromedriver')
url= "https://www.indeed.com/jobs?q=developer&l=Westbury%2C%20NY&vjk=0b0cbe29e5f86422"
driver.maximize_window()
driver.get(url)

time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
officials = soup.findAll("a",{"class":"tapItem"})

for official in officials:
  jobTitle = soup.find('h2',{'class': 'jobTitle'}).text
  companyName = soup.find('div',{'class': 'comapny_location'})
  location = soup.find('div',{'class': 'companyLocation'}).text
  salary = soup.find('div',{'class': 'salary-snippet'})
  actualSalary = salary.find('span').text
  summary = soup.find('div',{'class': 'job-snippet'}).text

print('Title: ' + str(jobTitle) + '\nCompany Name: ' + str(companyName) + '\nLocation: ' + str(location)
      + '\nSalary: ' + str(actualSalary) + "\nSummary: " + str(summary))
#print(str(official))
print(' ')


driver.quit()

【Question comments】:

    Tags: python html selenium web beautifulsoup


    【Solution 1】:

    Try this:

    from tkinter import *
    import random
    import urllib.request
    from bs4 import BeautifulSoup
    from selenium import webdriver
    import time
    import pandas as pd
    import requests
    
    
    html_text = requests.get('https://www.ign.com/').text
    soup = BeautifulSoup(html_text, 'lxml')
    jobs = soup.find('section',class_='right')
    
    
    driver = webdriver.Chrome(executable_path='/Users/Miscellaneous/PycharmProjects/RecursivePractice/chromedriver')
    url= "https://www.indeed.com/jobs?q=developer&l=Westbury%2C%20NY&vjk=0b0cbe29e5f86422"
    driver.maximize_window()
    driver.get(url)
    
    time.sleep(5)
    content = driver.page_source.encode('utf-8').strip()
    soup = BeautifulSoup(content,"html.parser")
    officials = soup.findAll("a",{"class":"tapItem"})
    
    for i in range(len(officials)):
        jobTitle = soup.findAll('h2',{'class': 'jobTitle'})[i].text
    
        companyName = soup.findAll('div',{'class': 'comapny_location'})[i].text if len(soup.findAll('div',{'class': 'comapny_location'})) > i else "NULL"
        location = soup.findAll('div',{'class': 'companyLocation'})[i].text if len(soup.findAll('div',{'class': 'companyLocation'})) > i else "NULL"
        salary = soup.findAll('div',{'class': 'salary-snippet'})[i].text if len(soup.findAll('div',{'class': 'salary-snippet'})) > i else "NULL"
        actualSalary = salary.find('span')
        summary = soup.findAll('div',{'class': 'job-snippet'})[i].text if len(soup.findAll('div',{'class': 'job-snippet'})) > i else "NULL"
    
        print('Title: ' + str(jobTitle) + '\nCompany Name: ' + str(companyName) + '\nLocation: ' + str(location)
            + '\nSalary: ' + str(actualSalary) + "\nSummary: " + str(summary))
        print(' ')
    
    driver.quit()
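
A note on why the question's loop repeats the first result: every lookup inside it is `soup.find(...)`, which searches the whole document and returns the first match on each iteration, no matter which `official` the loop is currently on. Calling `find` on each card instead confines the search to that card. The following is a minimal sketch of that idea, not a verified fix for the live site. The markup in `SAMPLE_HTML` is a made-up stand-in (the real Indeed page is not reproduced here), the helper names `text_or_default` and `parse_cards` are hypothetical, and only the class names are taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; class names are the ones used in the question.
SAMPLE_HTML = """
<a class="tapItem">
  <h2 class="jobTitle">Python Developer</h2>
  <div class="companyLocation">Westbury, NY</div>
  <div class="salary-snippet"><span>$90,000 a year</span></div>
  <div class="job-snippet">Build scrapers.</div>
</a>
<a class="tapItem">
  <h2 class="jobTitle">Java Developer</h2>
  <div class="companyLocation">Garden City, NY</div>
  <div class="job-snippet">Maintain services.</div>
</a>
"""


def text_or_default(parent, name, class_name, default="NULL"):
    """Search only inside `parent`; return its text, or `default` if absent."""
    node = parent.find(name, {"class": class_name})
    return node.get_text(strip=True) if node is not None else default


def parse_cards(html):
    """Return one dict per result card, with every lookup scoped to that card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("a", {"class": "tapItem"}):
        rows.append({
            "title": text_or_default(card, "h2", "jobTitle"),
            "location": text_or_default(card, "div", "companyLocation"),
            "salary": text_or_default(card, "div", "salary-snippet"),
            "summary": text_or_default(card, "div", "job-snippet"),
        })
    return rows


for row in parse_cards(SAMPLE_HTML):
    print(row)
```

Because each field is looked up inside its own card, a card with no salary gets "NULL" for that field without affecting the other cards, and no `[i]` indexing is needed.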
    

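    【Discussion】:

    • That mostly worked, thank you so much! Could you explain why you added the extra "if" expressions? I only follow your answer up to the "[i].text" in the middle of each line, and I still don't know why it was returning duplicates in the first place. Please and thank you :)
    • Also, why didn't you have to add the extra part to "jobTitle", and why does "actualSalary" not work now? Beautiful Soup has been so frustrating lately, haha.
    • To answer your first question: the added if expression keeps the program from running into an index out of range error. It first checks whether the list has enough elements and only then retrieves the item; otherwise it returns "NULL". In your earlier code, the print statement was not inside your loop, which is why it printed only the last retrieved value. To answer your second question: actualSalary has no [i] index.
    • I actually indented the print statement incorrectly when posting it on SO; in my actual program it is inside the loop. My biggest issue is that my original approach used to work perfectly well. It feels like the duplicate results problem literally came out of nowhere, even though I hadn't changed anything :(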
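
Two details of the indexed approach discussed in these comments can be checked in isolation. First, once `.text` has been applied, `salary` is a plain `str`, so `salary.find('span')` is Python's `str.find` and returns an integer position rather than a tag. Second, page-wide lists indexed by `i` can pair a value with the wrong listing whenever some listings lack that field. The snippet below is a small sketch of both points. Its markup is invented for illustration, and only the class names come from the question.

```python
from bs4 import BeautifulSoup

# Invented markup: the first listing has no salary, the second one does.
SAMPLE_HTML = """
<a class="tapItem"><h2 class="jobTitle">First job</h2></a>
<a class="tapItem"><h2 class="jobTitle">Second job</h2>
  <div class="salary-snippet"><span>$50 an hour</span></div></a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
titles = soup.find_all("h2", {"class": "jobTitle"})
salaries = soup.find_all("div", {"class": "salary-snippet"})

# 1. After .text the value is a str, so .find is str.find, not Tag.find.
salary_text = salaries[0].text
str_find_result = salary_text.find("span")  # -1: the text has no "span" in it

# 2. Index 0 of the page-wide salary list belongs to the second listing,
#    yet indexing by position pairs it with the first title.
misaligned_pair = (titles[0].text, salaries[0].text.strip())

# Keeping the Tag (no .text yet) lets Tag.find locate the nested <span>.
span = salaries[0].find("span")
actual_salary = span.text if span is not None else "NULL"

print(str_find_result)
print(misaligned_pair)
print(actual_salary)
```

Searching inside each `tapItem` element, rather than indexing page-wide lists, avoids the pairing problem shown in the second step.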