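【Question title】: Scraping full post Indeed with Selenium
【Posted】: 2021-08-11 12:07:16
【Question description】:

I'm trying to get a Python scraper working but I can't manage it, and a little help would be very useful, as I'm still a beginner. The code runs, but it stops and exports only a single job to my CSV. Which job it exports seems to be random, and no error is raised. Could someone with more experience give me some pointers? Thanks in advance.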

from selenium import webdriver
import pandas as pd 
from bs4 import BeautifulSoup

options = webdriver.FirefoxOptions()
driver = webdriver.Firefox()
driver.maximize_window()


df = pd.DataFrame(columns=["Title","Location","Company","Salary","Sponsored","Description"])

for i in range(25):
    driver.get('https://www.indeed.co.in/jobs?q=artificial%20intelligence&l=India&start='+str(i))
    jobs = []
    driver.implicitly_wait(20)
    

    for job in driver.find_elements_by_class_name('result'):

        soup = BeautifulSoup(job.get_attribute('innerHTML'),'html.parser')
        
        try:
            title = soup.find("a",class_="jobtitle").text.replace("\n","").strip()
            
        except:
            title = 'None'

        try:
            location = soup.find(class_="location").text
        except:
            location = 'None'

        try:
            company = soup.find(class_="company").text.replace("\n","").strip()
        except:
            company = 'None'

        try:
            salary = soup.find(class_="salary").text.replace("\n","").strip()
        except:
            salary = 'None'

        try:
            sponsored = soup.find(class_="sponsoredGray").text
            sponsored = "Sponsored"
        except:
            sponsored = "Organic"
                
        
sum_div = job.find_element_by_class_name('summary')

try:    
              sum_div.click()
except:
             close_button = driver.find_elements_by_class_name('popover-x-button-close')[0]
             close_button.click()
             sum_div.click()            
driver.implicitly_wait(2)
try:            
    job_desc = driver.find_element_by_css_selector('div#vjs-desc').text
    print(job_desc)
except:
    job_desc = 'None'   

df = df.append({'Title':title,'Location':location,"Company":company,"Salary":salary,
                        "Sponsored":sponsored,"Description":job_desc},ignore_index=True)


df.to_csv(r"C:\Users\Desktop\Python\Newtest.csv",index=False)

【Question comments】:

  • This looks like an indentation problem. The code in my answer gave me a CSV file with 1931 rows.

Tags: python selenium beautifulsoup webdriver selenium-firefoxdriver


【Solution 1】:

This looks like a simple indentation problem. Part of your code runs outside the for loop.

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup

from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)


df = pd.DataFrame(columns=["Title","Location","Company","Salary","Sponsored","Description"])

for i in range(0,50,10):
    driver.get('https://www.indeed.co.in/jobs?q=artificial%20intelligence&l=India&start='+str(i))
    jobs = []
    driver.implicitly_wait(20)
    

    for job in driver.find_elements_by_class_name('result'):

        soup = BeautifulSoup(job.get_attribute('innerHTML'),'html.parser')
        
        try:
            title = soup.find("a",class_="jobtitle").text.replace("\n","").strip()
            
        except:
            title = 'None'

        try:
            location = soup.find(class_="location").text
        except:
            location = 'None'

        try:
            company = soup.find(class_="company").text.replace("\n","").strip()
        except:
            company = 'None'

        try:
            salary = soup.find(class_="salary").text.replace("\n","").strip()
        except:
            salary = 'None'

        try:
            sponsored = soup.find(class_="sponsoredGray").text
            sponsored = "Sponsored"
        except:
            sponsored = "Organic"


        sum_div = job.find_element_by_class_name('summary')

        try:
            sum_div.click()
        except:
            close_button = driver.find_elements_by_class_name('popover-x-button-close')[0]
            close_button.click()
            sum_div.click()
        driver.implicitly_wait(2)
        try:            
            job_desc = driver.find_element_by_css_selector('div#vjs-desc').text
            print(job_desc)
        except:
            job_desc = 'None'   

        df = df.append({'Title':title,'Location':location,"Company":company,"Salary":salary,
                                "Sponsored":sponsored,"Description":job_desc},ignore_index=True)

df.to_csv("test.csv",index=False)

I used Chrome instead of Firefox, but I don't think that is where the problem lies. All I did was indent your code correctly.
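To make the effect of the indentation concrete, here is a minimal sketch that needs no browser. The job list and both function names are hypothetical stand-ins, not part of the original code. It only illustrates why a statement dedented out of the loop body runs once, with whatever values the last iteration left behind.

```python
# Hypothetical stand-in data; no browser or network is needed.
jobs = ["Data Scientist", "ML Engineer", "AI Researcher"]


def collect_dedented(items):
    """Mimics the question: the append sits outside the loop."""
    rows = []
    title = None
    for item in items:
        title = item.upper()
    # Dedented: runs once, after the loop, with the last value only.
    rows.append(title)
    return rows


def collect_indented(items):
    """Mimics the fix: the append sits inside the loop body."""
    rows = []
    for item in items:
        title = item.upper()
        # Indented: runs once per item.
        rows.append(title)
    return rows


print(len(collect_dedented(jobs)))  # a single row
print(len(collect_indented(jobs)))  # one row per job
```

This matches the symptom in the question: the script finishes without an error, but the CSV contains only one job.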

Also, using a bare except without naming a specific exception is not a good idea. See: Why is "except: pass" a bad programming practice?

【Comments】:
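As a rough illustration of that advice, the sketch below catches only AttributeError, which is what Python raises when `.text` is accessed on None (BeautifulSoup's `find` returns None when nothing matches). The helper name `text_or_default` is hypothetical, and `SimpleNamespace` stands in for a parsed element so the example runs without bs4.

```python
from types import SimpleNamespace


def text_or_default(node, default="None"):
    """Return the cleaned text of node, or default when node is None."""
    try:
        return node.text.replace("\n", "").strip()
    except AttributeError:
        # Raised when node is None, i.e. the element was not found.
        return default


# SimpleNamespace is a stand-in for an element with a .text attribute.
found = SimpleNamespace(text="\nData Scientist\n")
print(text_or_default(found))
print(text_or_default(None))
```

Unlike a bare except, this lets unrelated failures such as a typo in a variable name or a KeyboardInterrupt surface instead of being silently turned into 'None'.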

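  • Thanks for your help. I tried your code in Chrome and it works fine, but in Firefox the problem persists. Now I get "TabError: inconsistent use of tabs and spaces in indentation" at the try line: try sum_div.click(). I kept changing the spacing, but to no avail.
  • That error means you used 4 spaces in some places and 1 tab in others. If you go through the code and change every run of 4 spaces to a tab, the error will go away.
  • @DariusFlorea If this solved your problem or answered your question, please consider marking the answer as accepted. (It keeps the community maintained.)
  • I finally solved it. Thanks @Christopher Holder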
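The TabError discussed in the comments above can be reproduced and repaired without an editor. The sketch below is only an illustration, and both helper names are hypothetical. Note that it normalizes in the opposite direction from the comment, converting leading tabs to spaces, which is the style PEP 8 prefers. Either direction works as long as the file is consistent.

```python
def has_tab_error(source):
    """Return True if compiling source raises TabError."""
    try:
        compile(source, "<snippet>", "exec")
    except TabError:
        return True
    return False


def tabs_to_spaces(source, width=4):
    """Expand tabs in the leading whitespace of every line."""
    fixed = []
    for line in source.splitlines(keepends=True):
        body = line.lstrip(" \t")
        prefix = line[: len(line) - len(body)]
        fixed.append(prefix.expandtabs(width) + body)
    return "".join(fixed)


# One line indented with 4 spaces, the next with a tab.
mixed = "for i in range(2):\n    x = i\n\ty = i\n"

print(has_tab_error(mixed))
print(has_tab_error(tabs_to_spaces(mixed)))
```

Most editors offer the same conversion as a built-in command, which is usually more convenient than retyping the indentation by hand.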