Scraping multiple pages into a list with beautifulsoup: answers

【Question title】: Scraping multiple pages into list with beautifulsoup
【Posted】: 2017-11-22 22:24:47
【Question description】:

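I wrote a scraper in Python using beautifulsoup4 that iterates over multiple pages of cryptocurrency values and returns the opening, highest, and closing values. The scraping part works fine, but I can't get it to save all of the currencies into my lists; only the last one gets added.

Can anyone help me understand how to save all of them? I have spent hours searching and can't seem to find a relevant answer. The code is as follows: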

no_space = name_15.str.replace('\s+', '-')

#lists out the pages to scrape 
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")

    main_table = soup.find('tbody')

    date=[]
    open_p=[]
    high_p=[]
    low_p=[]
    close_p=[]

    table = []

    for row in main_table.find_all('td'):
        table_pull = row.find_all_previous('td') #other find methods aren't returning what I need, but this works just fine

    table = [p.text.strip() for p in table_pull]

    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7] 
    close_p = table[204:0:-7]

    df=pd.DataFrame(date,columns=['Date'])
    df['Open']=list(map(float,open_p))
    df['High']=list(map(float,high_p))
    df['Low']=list(map(float,low_p))
    df['Close']=list(map(float,close_p))
    print(df)

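【Question discussion】:

In the for loop, you overwrite the table_pull variable on every iteration, so only the last row is processed. You have to indent the code that follows the loop so that it runs inside the loop, and make sure you append to the dataframe rather than assign to it (the df[...] = list(... lines).
Thanks @hoefling and @rahlf23! I need to do some more work to rewrite those parts, but now that I know where I went wrong it should be much easier.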
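
The overwriting problem described in the first comment can be reproduced without any scraping. The following is a minimal sketch with made-up rows (the names `rows`, `last_only`, and `collected` are illustrative and do not come from the question):

```python
# Made-up stand-in for the rows a scraper would yield.
rows = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]

# Overwriting: every pass rebinds the same name, so only the last row survives.
for row in rows:
    last_only = row

# Accumulating: create the container once, before the loop, and append inside it.
collected = []
for row in rows:
    collected.append(row)

print(last_only)
print(collected)
```

The same pattern applies to anything built inside a loop, including a DataFrame: if the name is assigned on each pass, earlier results are lost unless they are appended to a container created before the loop.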

Tags: python beautifulsoup

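【Solution 1】:

Simply put, it looks like you are accessing all of the 'td' elements and then trying to access the previous elements of that list, which is unnecessary. Also, as @hoefling pointed out, you keep overwriting your variable inside the loop, which is why only the last element ends up in your list (in other words, only the last iteration of the loop sets the value of that variable; everything before it is overwritten). Apologies, I cannot test this at the moment because of a firewall on my machine. Try the following: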

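import pandas as pd
import urllib3 as lib  # assumption: the question never shows what `lib` is; PoolManager suggests urllib3
from bs4 import BeautifulSoup

# name_15 is the asker's pandas Series of currency names (not shown in the question)
no_space = name_15.str.replace(r'\s+', '-', regex=True)

http = lib.PoolManager()  # one pool can be reused for every request
frames = []  # one DataFrame per currency, so nothing gets overwritten

# lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")

    main_table = soup.find('tbody')

    # find_all returns the cells in document order, so find_all_previous is not needed
    table = [p.text.strip() for p in main_table.find_all('td')]

    # You will need to re-think these indices here to get the info you want.
    # Assuming seven cells per row (Date, Open, High, Low, Close, Volume, Market Cap),
    # a stride of 7 picks out one column at a time:
    date = table[0::7]
    open_p = table[1::7]
    high_p = table[2::7]
    low_p = table[3::7]
    close_p = table[4::7]

    df = pd.DataFrame(date, columns=['Date'])
    df['Open'] = list(map(float, open_p))
    df['High'] = list(map(float, high_p))
    df['Low'] = list(map(float, low_p))
    df['Close'] = list(map(float, close_p))
    df['Currency'] = n
    frames.append(df)  # keep this currency instead of replacing the previous one

# combine every currency into a single DataFrame
all_prices = pd.concat(frames, ignore_index=True)
print(all_prices)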
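
Since the slice indices depend on the exact number of cells in the page, an alternative is to read the table one `tr` at a time and to keep one DataFrame per currency in a list. The sketch below is only an illustration under stated assumptions: the HTML in `PAGES` is made up, the column order (Date, Open, High, Low, Close, Volume, Market Cap) is assumed rather than confirmed by the question, and `parse_history`, `PAGES`, and `all_prices` are hypothetical names. It uses the built-in `html.parser` so that it runs without `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up pages standing in for two currencies' historical-data tables.
# Assumed column order: Date, Open, High, Low, Close, Volume, Market Cap.
PAGES = {
    "coin-a": """<table><tbody>
        <tr><td>Day 2</td><td>2.0</td><td>2.5</td><td>1.5</td><td>2.2</td><td>100</td><td>1000</td></tr>
        <tr><td>Day 1</td><td>1.0</td><td>1.5</td><td>0.5</td><td>1.2</td><td>100</td><td>1000</td></tr>
    </tbody></table>""",
    "coin-b": """<table><tbody>
        <tr><td>Day 2</td><td>20.0</td><td>20.5</td><td>19.5</td><td>20.2</td><td>100</td><td>1000</td></tr>
        <tr><td>Day 1</td><td>10.0</td><td>10.5</td><td>9.5</td><td>10.2</td><td>100</td><td>1000</td></tr>
    </tbody></table>""",
}


def parse_history(html, currency):
    """Build one DataFrame for a single currency from its page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tr in soup.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Only the first five cells are used; volume and market cap are ignored.
        date, open_p, high_p, low_p, close_p = cells[:5]
        records.append({
            "Currency": currency,
            "Date": date,
            "Open": float(open_p),
            "High": float(high_p),
            "Low": float(low_p),
            "Close": float(close_p),
        })
    return pd.DataFrame(records)


# The list is created once, outside the loop, so each currency is kept.
frames = []
for currency, html in PAGES.items():
    frames.append(parse_history(html, currency))

# Stack the per-currency frames into one table.
all_prices = pd.concat(frames, ignore_index=True)
print(all_prices)
```

In the real scraper, the `html` argument would be `response.data` from each request. Working row by row avoids hard-coded offsets such as 208, so the code keeps working if the page shows a different number of days.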

【Discussion】: