How do I scrape Twitter usernames using Selenium properly? (Answers)

【Question title】: How do I scrape Twitter usernames using Selenium properly?
【Posted】: 2022-01-22 09:02:44
【Question】:

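I'm trying to scrape Twitter followers, but my code also picks up unnecessary links that are not profile pages (Twitter accounts).

The code below opens the followers page of the Twitter account you want to scrape, locates the profile links with an XPath locator, and scrolls down gradually to collect all current followers.

Here is my code: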

def extract_followers_func():
    driver.get("https://twitter.com/Username/followers")
    sleep(5)
    # Collect the href of every link inside the followers timeline
    with open("scrapedlist.txt", "a") as file:
        for twusernames in driver.find_elements(By.XPATH, '//div[@aria-label="Timeline: Followers"]//a[@role="link"]'):
            file.write(twusernames.get_property('href'))
            file.write("\n")
    sleep(5)
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        sleep(5)
        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
        # Collect the links that became visible after this scroll
        with open("scrapedlist.txt", "a") as file:
            for twusernames in driver.find_elements(By.XPATH, '//div[@aria-label="Timeline: Followers"]//a[@role="link"]'):
                file.write(twusernames.get_property('href'))
                file.write("\n")

What would be a more efficient way to do this? I only want the usernames, not all the unnecessary links.

Full code:

import datetime
import threading
import time
import tkinter as tk
from time import sleep

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("start-maximized")

root = tk.Tk()

app_width = 300
app_height = 320

screen_width = root.winfo_screenwidth()
screen_height = root.winfo_screenheight()

x = (screen_width / 2) - (app_width / 2)
y = (screen_height / 2) - (app_height / 2)

root.geometry(f'{app_width}x{app_height}+{int(x)}+{int(y)}')

# Path to the local chromedriver binary (raw string so the backslashes are taken literally)
ser = Service(r"C:\Program Files (x86)\chromedriver.exe")
driver = webdriver.Chrome(service=ser, options=options)
wait = WebDriverWait(driver, 50)

testbtn_txt = tk.StringVar()
testbtn = tk.Button(root, textvariable=testbtn_txt, command=lambda:extract_followers_func(), font="Arial", bg="#808080", fg="white", height=1, width=10)
testbtn_txt.set("Test")
testbtn.grid(row=10, column=0, columnspan=2, pady=5, padx=5)


def extract_followers_func():
    driver.get("https://twitter.com/Username/followers")
    sleep(5)
    # Collect profile links only; the XPath's inner literals use double quotes
    # so they do not end the single-quoted Python string
    with open("scrapedlist.txt", "a") as file:
        for twusernames in driver.find_elements(By.XPATH, '//div[@aria-label="Timeline: Followers"]//a[@role="link" and not(@aria-hidden) and not(contains(@href,"search")) and not(contains(@href,"Live")) and not(@rel)]'):
            file.write(twusernames.get_property('href'))
            file.write("\n")
    sleep(5)
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        sleep(5)
        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
        # Collect the profile links that became visible after this scroll
        with open("scrapedlist.txt", "a") as file:
            for twusernames in driver.find_elements(By.XPATH, '//div[@aria-label="Timeline: Followers"]//a[@role="link" and not(@aria-hidden) and not(contains(@href,"search")) and not(contains(@href,"Live")) and not(@rel)]'):
                file.write(twusernames.get_property('href'))
                file.write("\n")



root.mainloop()

【Question comments】:

Tags: python selenium twitter

【Solution 1】:

You're almost there!
You just need to fine-tune the locator.
So, instead of

'//div[@aria-label="Timeline: Followers"]//a[@role="link"]'

you should use

'//div[@aria-label="Timeline: Followers"]//a[@role="link" and not(@aria-hidden) and not(contains(@href,"search")) and not(contains(@href,"Live")) and not(@rel)]'
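
Since the question asks for usernames rather than links, the collected href values can also be filtered on the Python side. The following is only a minimal sketch under stated assumptions: `username_from_href`, `collect_usernames`, and the `RESERVED_PATHS` set are hypothetical names that are not part of the original answer, and the sketch assumes a profile URL has the form https://twitter.com/<name> with exactly one path segment.

```python
from urllib.parse import urlparse

# Hypothetical set of first path segments that are site pages, not profiles.
# Illustrative only; extend it with whatever shows up in your own results.
RESERVED_PATHS = {"search", "i", "home", "explore", "hashtag", "settings"}


def username_from_href(href):
    """Return the username if href looks like a profile URL, else None."""
    if not href:
        return None
    parsed = urlparse(href)
    if parsed.netloc.lower() not in ("twitter.com", "www.twitter.com"):
        return None
    # Assumption: a profile URL has exactly one path segment.
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) != 1 or segments[0].lower() in RESERVED_PATHS:
        return None
    return segments[0]


def collect_usernames(hrefs):
    """Keep the first occurrence of each username, preserving order."""
    seen = {}
    for href in hrefs:
        name = username_from_href(href)
        if name is not None:
            seen.setdefault(name, None)
    return list(seen)
```

With Selenium, the input could be the href of each element matched by the locator above, for example `collect_usernames(a.get_attribute("href") for a in driver.find_elements(By.XPATH, locator))`, where `locator` holds the XPath string. Deduplicating this way also avoids writing the same follower again after every scroll.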

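【Comments】:

I based my answer on what I see on my own Twitter followers page. If the updated answer (just updated) still gives you irrelevant links, I would need your credentials to see what is actually happening on your account's followers page.
OK, but my Twitter followers page has no such elements. To give you a correct locator, I need to see a page that contains those elements.
OK, please see the updated locator.
What is not closed, and where? What is expected, and where?
OK, what exactly went wrong?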
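
The error message behind the "not closed" comments is not shown in the thread. One possible cause, offered here only as an assumption, is the quoting of the XPath: if the string literals inside the XPath use the same quote character as the surrounding Python string, the Python string ends early. The sketch below checks both variants with the built-in `compile`; the `compiles` helper and the shortened sample XPaths are hypothetical, while `LOCATOR` is the locator from the answer split across lines.

```python
# XPath from the answer, written as adjacent string literals for readability.
# The inner literals use double quotes, the Python string uses single quotes.
LOCATOR = (
    '//div[@aria-label="Timeline: Followers"]'
    '//a[@role="link" and not(@aria-hidden)'
    ' and not(contains(@href,"search"))'
    ' and not(contains(@href,"Live")) and not(@rel)]'
)


def compiles(source):
    """Return True if source is syntactically valid Python."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


# Inner single quotes end the outer single-quoted string early.
broken = "xpath = '//a[not(contains(@href,'search'))]'"
# Double quotes inside a single-quoted string are fine.
fixed = "xpath = '//a[not(contains(@href,\"search\"))]'"

print(compiles(broken))
print(compiles(fixed))
```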