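【Posted】:2021-06-06 01:01:05
【Question】:
I am currently building a web scraper that takes a URL as input, finds the page, scrapes it, and returns the results as CSV. The scraper works for a single URL at a time. Unfortunately, whenever it writes a new row to the results CSV, it also appends the previous URL's scraped results to every column. I need a loop that essentially creates new class variables on each pass so that this does not happen. Something like this: take a list of URLs, then create a unique class instance for each one.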
links = ['www.SomeLink1.com','www.Somelink2.com','www.SomeLink3.com']
person1 = Person('www.SomeLink1.com', driver = driver, close_on_complete = False)
person2 = Person('www.Somelink2.com', driver = driver, close_on_complete = False)
person3 = Person('www.SomeLink3.com', driver = driver, close_on_complete = False)
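I do not have access to the source code, so I cannot add a new method such as "person1.reset()".
Here is the original code I am using to scrape multiple pages: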
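A minimal sketch of the "one instance per URL" pattern described above. `StubPerson` is a hypothetical stand-in for `Person` so the snippet runs without a browser or network; its constructor only mirrors the three arguments used in this question and is not the library's real implementation.

```python
class StubPerson:
    """Hypothetical stand-in for linkedin_scraper.Person (no network access)."""

    def __init__(self, linkedin_url, driver=None, close_on_complete=False):
        self.linkedin_url = linkedin_url
        self.driver = driver
        self.close_on_complete = close_on_complete


links = ['www.SomeLink1.com', 'www.Somelink2.com', 'www.SomeLink3.com']

# One independent instance per URL, keyed by the URL it was built from.
people = {link: StubPerson(link, driver=None, close_on_complete=False)
          for link in links}

for link, person in people.items():
    print(link, id(person))
```

Separate instances only help if the class does not keep state that is shared between instances, so this pattern may not be enough on its own.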
# Import libraries
from linkedin_scraper import Person, actions
from selenium import webdriver
import csv
import os
import pandas as pd
import numpy as np
import smtplib
# Read-in list of contacts:
contacts = pd.read_csv("/Users/Desktop/ContactList.csv")
names = contacts['contact_name'].tolist()
urls = contacts['contact_url'].tolist()
# turn contacts list into dictionary just in case
contact_dict = {names[i]: urls[i] for i in range(len(names))}
print(contact_dict)
# automatically login to LinkedIn
driver = webdriver.Chrome('/Users/Downloads/chromedriver')
email = os.environ.get('LINKEDIN_USER')
password = os.environ.get('LINKEDIN_PASS')
actions.login(driver, email, password)
# create general field names
fields = ['name', 'about', 'job_title', 'location', 'company',
          'education', 'accomplishments', 'linkedin_url']
with open('ScrapeResults.csv', 'w') as f:
    # using csv.writer method from CSV package
    write = csv.writer(f)
    write.writerow(fields)
# Loop-through urls to scrape multiple pages at once
for individual, link in contact_dict.items():
    ## assign ##
    the_name = individual
    the_link = link
    # scrape each person's url:
    person = Person(the_link, driver=driver, close_on_complete=False)
    # rows to be written... only index for lists?
    rows = [[person.name, person.about, person.job_title, person.location, person.company,
             person.educations, person.accomplishments, person.linkedin_url]]
    # write
    with open('ScrapeResults.csv', 'a') as f:
        # using csv.writer method from CSV package
        write = csv.writer(f)
        write.writerows(rows)
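As an aside on the CSV handling above, the header and the data rows could also be written in a single pass instead of reopening the file on every iteration. The sketch below assumes the rows have already been collected into a list; `write_results` is a hypothetical helper, the sample rows are made-up data, and the file goes to a temporary directory rather than the real `ScrapeResults.csv`.

```python
import csv
import os
import tempfile


def write_results(path, fields, rows):
    """Write the header and all rows in one pass; the with-block closes the file."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(fields)
        writer.writerows(rows)


fields = ['name', 'linkedin_url']
rows = [['Alice Example', 'www.SomeLink1.com'],   # made-up sample data
        ['Bob Example', 'www.Somelink2.com']]

out_path = os.path.join(tempfile.mkdtemp(), 'ScrapeResults.csv')
write_results(out_path, fields, rows)

# Read the file back to confirm what was written.
with open(out_path, newline='') as f:
    written = list(csv.reader(f))
print(written)
```

Passing `newline=''` to `open` is what the `csv` module documentation recommends, and leaving a `with` block closes the file, so no explicit `f.close()` is needed.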
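【Comments】:
-
Or is there some way to reset person = Person(the_link, driver=driver, close_on_complete=False) inside this for loop without actually editing the package's source code?
-
Alternatively, your rows variable may be accumulating results from previous runs, so every time you write rows you are writing the accumulated results.
-
@RazzleShazl Yes, that is exactly what is happening. On each iteration of the for loop, the results accumulate/append in the class attributes.
-
I think the results accumulate in the driver, which in turn produces the accumulated results in rows.
-
Out of curiosity, could you switch to close_on_complete=True? I don't know what it does, but it seems like it might help reset Person.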
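One way this kind of accumulation can arise in Python is a constructor with list-valued default arguments, because the default list is created once and then shared by every instance that does not override it. The sketch below illustrates that mechanism with a hypothetical `LeakyPerson` class. Whether `linkedin_scraper.Person` actually behaves this way is an assumption that would need to be checked against the installed version.

```python
class LeakyPerson:
    """Hypothetical class, NOT the real linkedin_scraper.Person."""

    def __init__(self, url, educations=[]):  # one default list shared by every call
        self.url = url
        self.educations = educations

    def scrape(self):
        # Stand-in for real scraping: record one item per page.
        self.educations.append('school from ' + self.url)


urls = ['www.SomeLink1.com', 'www.Somelink2.com']

# Relying on the default: each new instance sees the previous results.
leaky_rows = []
for url in urls:
    p = LeakyPerson(url)
    p.scrape()
    leaky_rows.append(list(p.educations))

# Passing a fresh list explicitly: each instance starts empty.
clean_rows = []
for url in urls:
    p = LeakyPerson(url, educations=[])
    p.scrape()
    clean_rows.append(list(p.educations))

print(leaky_rows)
print(clean_rows)
```

If the real class does work like this, passing fresh empty lists for the affected keyword arguments on every call would avoid the shared state without editing the package source.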
Tags: python loops class object web-scraping