Why is Beautiful Soup returning duplicate results? (Answers)

【Question title】: Why is Beautiful Soup returning duplicate results?
【Posted】: 2021-12-08 17:02:28
【Question description】:

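I am building a project that scrapes the Indeed website. It was working fine, but when I ran it today, without having made any changes, it stopped returning the whole page of results and now just shows the first result repeated. Can someone help me fix this?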

from tkinter import *
import random
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import requests


html_text = requests.get('https://www.ign.com/').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find('section',class_='right')
#print(html_text)


driver = webdriver.Chrome(executable_path='/Users/Miscellaneous/PycharmProjects/RecursivePractice/chromedriver')
url= "https://www.indeed.com/jobs?q=developer&l=Westbury%2C%20NY&vjk=0b0cbe29e5f86422"
driver.maximize_window()
driver.get(url)

time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
officials = soup.findAll("a",{"class":"tapItem"})

for official in officials:
  jobTitle = soup.find('h2',{'class': 'jobTitle'}).text
  companyName = soup.find('div',{'class': 'comapny_location'})
  location = soup.find('div',{'class': 'companyLocation'}).text
  salary = soup.find('div',{'class': 'salary-snippet'})
  actualSalary = salary.find('span').text
  summary = soup.find('div',{'class': 'job-snippet'}).text

print('Title: ' + str(jobTitle) + '\nCompany Name: ' + str(companyName) + '\nLocation: ' + str(location)
      + '\nSalary: ' + str(actualSalary) + "\nSummary: " + str(summary))
#print(str(official))
print(' ')


driver.quit()

【Question comments】:

Tags: python html selenium web beautifulsoup

【Solution 1】:

Try this

from tkinter import *
import random
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import requests


html_text = requests.get('https://www.ign.com/').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find('section',class_='right')


driver = webdriver.Chrome(executable_path='/Users/Miscellaneous/PycharmProjects/RecursivePractice/chromedriver')
url= "https://www.indeed.com/jobs?q=developer&l=Westbury%2C%20NY&vjk=0b0cbe29e5f86422"
driver.maximize_window()
driver.get(url)

time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
officials = soup.findAll("a",{"class":"tapItem"})

for i in range(len(officials)):
    jobTitle = soup.findAll('h2',{'class': 'jobTitle'})[i].text

    companyName = soup.findAll('div',{'class': 'comapny_location'})[i].text if len(soup.findAll('div',{'class': 'comapny_location'})) > i else "NULL"
    location = soup.findAll('div',{'class': 'companyLocation'})[i].text if len(soup.findAll('div',{'class': 'companyLocation'})) > i else "NULL"
    salary = soup.findAll('div',{'class': 'salary-snippet'})[i].text if len(soup.findAll('div',{'class': 'salary-snippet'})) > i else "NULL"
    actualSalary = salary.find('span')
    summary = soup.findAll('div',{'class': 'job-snippet'})[i].text if len(soup.findAll('div',{'class': 'job-snippet'})) > i else "NULL"

    print('Title: ' + str(jobTitle) + '\nCompany Name: ' + str(companyName) + '\nLocation: ' + str(location)
        + '\nSalary: ' + str(actualSalary) + "\nSummary: " + str(summary))
    print(' ')

driver.quit()
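
As an alternative to indexing into page-wide lists, here is a minimal sketch that searches inside each result card instead of the whole document. Calling `soup.find(...)` inside the loop always returns the first match on the page, which would explain the same job being repeated; calling `find` on the card itself limits the search to that card. The sample markup, `text_of`, and `parse_cards` are hypothetical and only reuse the class names from the question; the real Indeed page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in the question;
# the real Indeed page structure may differ.
SAMPLE_HTML = """
<a class="tapItem">
  <h2 class="jobTitle">Python Developer</h2>
  <div class="companyLocation">Westbury, NY</div>
  <div class="salary-snippet"><span>$90,000 a year</span></div>
  <div class="job-snippet">Build scrapers.</div>
</a>
<a class="tapItem">
  <h2 class="jobTitle">Web Developer</h2>
  <div class="companyLocation">Garden City, NY</div>
  <div class="job-snippet">Maintain websites.</div>
</a>
"""


def text_of(parent, name, class_name, default="NULL"):
    """Return the stripped text of the first match inside parent, or default."""
    tag = parent.find(name, {"class": class_name})
    return tag.get_text(strip=True) if tag is not None else default


def parse_cards(html):
    """Build one dict per job card, searching only within that card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("a", {"class": "tapItem"}):
        # Keep the Tag (not its text) so the nested <span> can still be searched.
        salary = card.find("div", {"class": "salary-snippet"})
        span = salary.find("span") if salary is not None else None
        rows.append({
            "title": text_of(card, "h2", "jobTitle"),
            "location": text_of(card, "div", "companyLocation"),
            "salary": span.get_text(strip=True) if span is not None else "NULL",
            "summary": text_of(card, "div", "job-snippet"),
        })
    return rows


for row in parse_cards(SAMPLE_HTML):
    print(row)
```

Because every lookup starts from `card`, a job without a salary gets "NULL" for its own row rather than borrowing a value from another card.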

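【Comments】:

That mostly worked, thank you so much! Could you explain why you added the extra "if" expressions? I only follow your answer up to the "[i].text" in the middle of each line, and I still don't know why it returned duplicates in the first place. Please and thank you :)
Also, why didn't you have to add the extra check to "jobTitle", and why does "actualSalary" no longer work? Beautiful Soup has been so frustrating lately, haha.
To answer your first question: the additional if expressions keep the program from running into an index out of range error. Each line now first checks whether the list has enough elements before retrieving one, and returns "NULL" otherwise. In your earlier code, the print statements were not inside your loop, which is why it printed only the last value retrieved. To answer your second question: actualSalary has no [i] index.
I actually indented the print statements incorrectly when posting to SO; they are inside the loop in my actual program. My biggest issue is that my original approach used to work fine. I haven't changed anything, and the duplicate results problem literally came out of nowhere :(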
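
On the "if" expressions and the actualSalary question in the comments above, a small sketch may help. It assumes, as in the answer's code, that `salary` is already a plain string (either the `.text` of a tag or "NULL") by the time `.find('span')` is called on it. The helper `get_or_default` and the sample list are hypothetical.

```python
def get_or_default(items, index, default="NULL"):
    """Return items[index] if the list is long enough, else default."""
    return items[index] if len(items) > index else default


# Pretend the page has two job cards but only one of them lists a salary.
salaries = ["$90,000 a year"]

first = get_or_default(salaries, 0)   # list is long enough: real value
second = get_or_default(salaries, 1)  # index 1 is missing: falls back to "NULL"

# Both values are plain strings, so .find here is str.find: it returns a
# character offset, or -1 when the substring is absent, never a tag.
position = first.find("span")

print(first, second, position)
```

This also hints at a limitation of indexing separate page-wide lists: when some cards lack a salary, entry `i` of the salary list does not necessarily belong to card `i`.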