【Question Title】: How to bypass googletagmanager while scraping
【Posted】: 2020-11-04 03:32:48
【Question Description】:

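Since the website added the googletagmanager script, I can no longer get what I need. With the code below I am scraping the `src` of an iframe, but now every row contains "www.googletagmanager.com", so I don't know how to handle it. Thanks.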

[HTML][1]

[What the CSV file looks like now][2]

import csv

import pandas as pd  # only needed by the commented-out lines below
import requests
from bs4 import BeautifulSoup


# Header row of the output CSV
data_list = ["LINKI", "GOWNO", "JAJCO"]

with open('innovators.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    writer.writerow(data_list)
    for i in range(0, 50):
        #df = pd.read_csv("C:\\Users\\Lukasz\\Desktop\\PROJEKTY PYTHON\\W TRAKCIE\\bf3_strona2.csv")
        #url = "https://bf3.space/" + df['LINKS'][i]
        url = 'https://bf3.space/a-Byu6am3P'
        response = requests.get(url)
        data = response.text
        soup = BeautifulSoup(data, 'lxml')
        # find() returns only the first <iframe> in the page,
        # which is apparently the googletagmanager one now
        rows = soup.find('iframe')
        q = rows.get('src')
        writer.writerow([q])


[1]: https://i.stack.imgur.com/Ogq0N.png
[2]: https://i.stack.imgur.com/3JYqc.png

【Question Discussion】:

    Tags: python python-3.x csv beautifulsoup


    【Solution 1】:

    You can pass a lambda as the `src` filter to soup.find().

    For example:

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://bf3.space/a-Byu6am3P'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
    # `s and ...` skips <iframe> tags without a src (the filter then receives None)
    print(soup.find('iframe', src=lambda s: s and 'googletagmanager.com' not in s))
    

    This prints the first non-googletagmanager <iframe> tag:

    <iframe align="center" frameborder="0" height="1500" src="https://ven-way.x.yupoo.com/albums/83591895?uid=1" style="margin: 10px 0;padding: 0px 0px; border:none" width="100%"></iframe>
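
To connect this back to the CSV loop in the question, here is a minimal sketch of how the same filter might be wrapped in a helper and used for each row. The names `GTM_HOST`, `first_content_iframe_src`, `write_iframe_sources` and the column headers are hypothetical, chosen only for illustration; the sketch also assumes the pages are plain HTML that `requests` can fetch without JavaScript.

```python
import csv

import requests
from bs4 import BeautifulSoup

GTM_HOST = "googletagmanager.com"  # hypothetical constant name


def first_content_iframe_src(html):
    """Return the src of the first <iframe> that is not a googletagmanager frame, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # The filter receives None for an <iframe> without a src, hence the `s and` guard.
    tag = soup.find("iframe", src=lambda s: s and GTM_HOST not in s)
    return tag["src"] if tag else None


def write_iframe_sources(urls, out_path):
    """Fetch each URL and write one CSV row per page: the URL and the iframe src found."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "iframe_src"])  # hypothetical column names
        for url in urls:
            response = requests.get(url, timeout=10)
            # Write an empty cell when no suitable iframe exists on the page.
            writer.writerow([url, first_content_iframe_src(response.text) or ""])
```

It could then be called as `write_iframe_sources(['https://bf3.space/a-Byu6am3P'], 'innovators.csv')`. Keeping the parsing in its own function means it can be checked against a saved HTML string without hitting the network.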
    

    【Discussion】:
