【Question Title】:Getting URL from text file using BeautifulSoup
【Posted】:2020-02-21 09:27:06
【Question】:

How do I get URLs from a .txt file for use with BeautifulSoup? I am new to web scraping. I want to scrape multiple pages, and I need to pull those pages from a txt file.

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

chrome_driver_path = r'C:\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver_path)


urls = r'C:\chromedriver_win32\asin.txt'
url = ('https://www.amazon.com/dp/'+urls)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')

stock = soup.find(id='availability').get_text()

stok_kontrol = pd.DataFrame(  {  'Url': [url], 'Stok Durumu': [stock] })
stok_kontrol.to_csv('stok-kontrol.csv', encoding='utf-8-sig')


print(stok_kontrol)

This text file contains the Amazon ASIN numbers:

C:\chromedriver_win32\asin.txt

File contents:

B00004SU18

B07L9178GQ

B01M35N6CZ

【Comments】:

    Tags: python selenium web-scraping beautifulsoup


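    【Solution 1】:

    This gets each product URL and whether the product is in stock,
    prints that information to the console, and then
    saves it to the file "stok-kontrol.csv".

    Tested on: Python 3.7.4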

    import pandas as pd
    from bs4 import BeautifulSoup
    from selenium import webdriver
    import re
    
    chrome_driver_path = r'C:\chromedriver_win32\chromedriver.exe'
    driver = webdriver.Chrome(executable_path=chrome_driver_path)
    
    # Checks whether each product in the array is in stock on www.amazon.com
    # Returns a list of dictionaries with keys ['asin','instock','url']
    def IsProductsInStock(array_of_ASINs):
        results = []
        for asin in array_of_ASINs:
            url = 'https://www.amazon.com/dp/'+str(asin)
            driver.get(url)
            soup = BeautifulSoup(driver.page_source, 'lxml')
    
            stock = soup.find(id='availability').get_text().strip()
    
            isInStock = False
            if('In Stock' in stock): 
                # If 'In Stock' is the text of 'availability' element
                isInStock=True
            else: 
                # If Not, extract the number from it, if any, and see if it's in stock.
                tmp = re.search(re.compile('[0-9]+'), stock)
                if( tmp is not None and int(tmp[0]) > 0):
                    isInStock = True
    
            results.append({"asin": asin, "instock": isInStock, "url": url})
        return results
    
    # Saves the product information to 'toFile'
    # Returns a pandas.core.frame.DataFrame object, with the product info ['url', 'instock'] as columns
    # inStockDict MUST be either a Dictionary, or a 'list' of Dictionaries with, ['asin','instock','url'] keys
    def SaveProductInStockInformation(inStockDict, toFile):
        if(isinstance(inStockDict, dict)):
            stok_kontrol = pd.DataFrame(  {  'Url': [inStockDict['url']], 'Stok Durumu': [inStockDict['instock']]  } )
        elif(isinstance(inStockDict, list)):
            stocksSimple = []
            for stock in inStockDict:
                stocksSimple.append([stock['url'], stock['instock']])
            stok_kontrol = pd.DataFrame(stocksSimple, columns=['Url', 'Stok Durumu'])
        else:
            raise Exception("inStockDict param must be either a dictionary or a 'list' of dictionaries with ['asin','instock','url'] keys!")
    
        stok_kontrol.to_csv(toFile, encoding='utf-8-sig')
        return stok_kontrol
    
    # Get ASINs from file (the with-block closes the file automatically)
    with open(r'C:\chromedriver_win32\asin.txt', 'r') as f:
        urls = f.read().split()
    
    # Get a list of Dictionaries containing all the products information
    stocks = IsProductsInStock(urls)
    
    # Save and Print the ['url', 'instock'] information
    print( SaveProductInStockInformation(stocks, 'stok-kontrol.csv') )
    
    
    # Remove if you need to use the driver later on in the program
    driver.close() 
    

    Result: (file 'stok-kontrol.csv')

    ,Url,Stok Durumu
    0,https://www.amazon.com/dp/B00004SU18,True
    1,https://www.amazon.com/dp/B07L9178GQ,True
    2,https://www.amazon.com/dp/B01M35N6CZ,True
    

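    【Comments】:

    • I just have one problem. When I have only 10 left in stock, the code says it is out of stock. Can you suggest a fix? Thanks.
    • @caca Fixed that bug. Just copy the updated function "IsProductsInStock" and make sure to add "import re".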
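
The fix mentioned in the comments above can be looked at in isolation. The following is a minimal sketch, not the answer's own code: `is_in_stock` is a hypothetical helper name, and the sample availability strings are illustrative assumptions rather than text captured from Amazon.

```python
import re


def is_in_stock(availability_text):
    """Return True if the availability text suggests the product is in stock.

    Mirrors the logic of IsProductsInStock: accept the literal 'In Stock',
    otherwise look for a positive number such as 'Only 10 left in stock'.
    """
    text = availability_text.strip()
    if 'In Stock' in text:
        return True
    match = re.search(r'[0-9]+', text)
    return match is not None and int(match.group(0)) > 0


# Sample strings are assumptions for illustration only.
samples = [
    'In Stock.',
    'Only 10 left in stock - order soon.',
    'Currently unavailable.',
]
for sample in samples:
    print(sample, '->', is_in_stock(sample))
```

Keeping the text check separate from the Selenium loop makes it possible to try other availability phrasings without opening a browser.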
    【Solution 2】:

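    If I understand the question correctly, you just need to pass each ASIN number into the URL to tell BeautifulSoup what to scrape. That is a simple file operation: iterate over the file to get the numbers and pass each one to BeautifulSoup to scrape.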

    urls = r'C:\chromedriver_win32\asin.txt'
    rows = []
    with open(urls, 'r') as f:
        for line in f:
            asin = line.strip()  # drop the trailing newline
            if not asin:
                continue  # skip blank lines
            url = 'https://www.amazon.com/dp/' + asin
            driver.get(url)
            soup = BeautifulSoup(driver.page_source, 'lxml')
            stock = soup.find(id='availability').get_text().strip()
            rows.append({'Url': url, 'Stok Durumu': stock})

    # Write once after the loop so earlier rows are not overwritten
    stok_kontrol = pd.DataFrame(rows)
    stok_kontrol.to_csv('stok-kontrol.csv', encoding='utf-8-sig')
    print(stok_kontrol)
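
The file-handling part of this approach can be tried without a browser. The following is a minimal sketch, assuming one ASIN per line with possible blank lines in between; `read_asins` and `build_urls` are hypothetical helper names that do not appear in either answer.

```python
def read_asins(path):
    """Read ASINs from a text file, one per line, skipping blank lines."""
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def build_urls(asins, base='https://www.amazon.com/dp/'):
    """Build one product URL per ASIN."""
    return [base + asin for asin in asins]


# Example call (path taken from the question):
# urls = build_urls(read_asins(r'C:\chromedriver_win32\asin.txt'))
```

Each URL returned by `build_urls` can then be passed to `driver.get` in the scraping loop.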
    

    【Comments】:
