【Question title】: Extracting certain columns out of a table with BeautifulSoup
【Posted】: 2019-11-15 14:55:03
【Question description】:

Hi, I'm trying to use the HTML table on this eBay page to determine the dates on which an item was purchased: https://offer.ebay.com/ws/eBayISAPI.dll?ViewBidsLogin&item=173653442617&rt=nc&_trksid=p2047675.l2564

My Python code:

import requests
from bs4 import BeautifulSoup

# The page linked above
item_link = 'https://offer.ebay.com/ws/eBayISAPI.dll?ViewBidsLogin&item=173653442617&rt=nc&_trksid=p2047675.l2564'

def soup_creator(url):
    # Downloads the eBay page for processing
    res = requests.get(url)
    # Raises an exception if there's an error downloading the page
    res.raise_for_status()
    # Creates a BeautifulSoup object for HTML parsing
    return BeautifulSoup(res.text, 'lxml')

soup = soup_creator(item_link)
purchases = soup.find('div', attrs={'class': 'BHbidSecBorderGrey'})
purchases = purchases.findAll('tr', attrs={'bgcolor': '#ffffff'})
for purchase in purchases:
    date = purchase.findAll("td", {"align": "left"})
    date = date[2].get_text()
    print(purchase)

When I run it, the print statement outputs nothing, which I take to mean that nothing was found. I'd like it to print something like this:

Jul-02-19 18:22:28 PDT
Jun-27-19 16:12:59 PDT
Jun-23-19 06:46:23 PDT
...

【Comments】:

  • Shouldn't you be printing date rather than purchase?

Tags: python html xml web-scraping beautifulsoup


【Solution 1】:

pandas:

With pandas this is very simple: index the right table and slice out the column.

import pandas as pd

table = pd.read_html('https://offer.ebay.com/ws/eBayISAPI.dll?ViewBidsLogin&item=173653442617&rt=nc&_trksid=p2047675.l2564')[4]
table['Date of Purchase']
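
The hard-coded index [4] depends on how many tables the page happens to contain. As a hedged alternative, the match argument of read_html can pick tables by the text they contain. This is a minimal sketch on invented sample markup (the rows, the first table and the column names other than 'Date of Purchase' are made up for illustration), and it assumes pandas plus an HTML parser such as lxml are installed:

```python
from io import StringIO

import pandas as pd

# Invented sample markup; the real eBay page has more tables and columns.
SAMPLE_HTML = """
<table>
  <tr><th>Seller</th><th>Feedback</th></tr>
  <tr><td>shop_1</td><td>99.8%</td></tr>
</table>
<table>
  <tr><th>User ID</th><th>Price</th><th>Date of Purchase</th></tr>
  <tr><td>a***b</td><td>US $54.08</td><td>Jul-02-19 18:22:28 PDT</td></tr>
  <tr><td>c***d</td><td>US $54.08</td><td>Jun-27-19 16:12:59 PDT</td></tr>
</table>
"""

# Keep only the tables whose text matches the header we care about.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Date of Purchase")

# Select the column by name and turn it into a plain Python list.
dates = tables[0]["Date of Purchase"].tolist()
print(dates)
```

If no table matches, read_html raises a ValueError, so a renamed header fails loudly rather than silently returning the wrong table.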

bs4 method 1:

If you know the column number, you can use nth-of-type with that number on the table of interest.

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://offer.ebay.com/ws/eBayISAPI.dll?ViewBidsLogin&item=173653442617&rt=nc&_trksid=p2047675.l2564')
soup = bs(r.content, 'lxml')
#if column # is known 
purchases = [item.text for item in soup.select('table[width] td:nth-of-type(5)')]
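
To try the selector without hitting the network, here is a small offline sketch. The sample HTML is invented (the real page has the date in the fifth column, this sample has it in the third), and column_text is a hypothetical helper name, not part of bs4:

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for the eBay purchase-history table.
SAMPLE_HTML = """
<table width="100%">
  <tr><th>User ID</th><th>Price</th><th>Date of Purchase</th></tr>
  <tr><td>a***b</td><td>US $54.08</td><td>Jul-02-19 18:22:28 PDT</td></tr>
  <tr><td>c***d</td><td>US $54.08</td><td>Jun-27-19 16:12:59 PDT</td></tr>
</table>
"""


def column_text(html, column_number):
    """Return the stripped text of the n-th <td> (1-based) in every row."""
    soup = BeautifulSoup(html, "html.parser")
    # nth-of-type counts only sibling elements with the same tag name,
    # so the <th> cells in the header row are never selected.
    selector = f"table[width] td:nth-of-type({column_number})"
    return [cell.get_text(strip=True) for cell in soup.select(selector)]


dates = column_text(SAMPLE_HTML, 3)
print(dates)
```

Because nth-of-type counts per tag name, a row that mixes th and td cells would shift the position, which is worth checking against the real markup.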

bs4 method 2 (less ideal; for when the column number is not known):

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://offer.ebay.com/ws/eBayISAPI.dll?ViewBidsLogin&item=173653442617&rt=nc&_trksid=p2047675.l2564')
soup = bs(r.content, 'lxml')
#if column # not known
headers = [item.text.strip() for item in soup.select('table[width] th')]
desired_header = 'Date of Purchase'

if desired_header in headers: 
    print([item.text for item in soup.select('table[width] td:nth-of-type(' + str(headers.index(desired_header) + 1) + ')')])
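
One caveat with collecting headers via 'table[width] th': if several tables on the page carry a width attribute, their headers end up in one list and the computed position can be off. A hedged variation is to find the header cell first and then stay inside its own table. The sample HTML and the helper name column_by_header below are invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented sample markup with two tables that both have a width attribute.
SAMPLE_HTML = """
<table width="100%">
  <tr><th>Seller</th><th>Feedback</th></tr>
  <tr><td>shop_1</td><td>99.8%</td></tr>
</table>
<table width="100%">
  <tr><th>User ID</th><th>Price</th><th>Date of Purchase</th></tr>
  <tr><td>a***b</td><td>US $54.08</td><td>Jul-02-19 18:22:28 PDT</td></tr>
  <tr><td>c***d</td><td>US $54.08</td><td>Jun-27-19 16:12:59 PDT</td></tr>
</table>
"""


def column_by_header(html, header_text):
    """Return the cell texts of the column whose <th> equals header_text."""
    soup = BeautifulSoup(html, "html.parser")
    for th in soup.find_all("th"):
        if th.get_text(strip=True) != header_text:
            continue
        header_row = th.find_parent("tr")
        table = th.find_parent("table")
        # Position of the header cell within its own row (0-based).
        siblings = header_row.find_all("th", recursive=False)
        position = next(i for i, cell in enumerate(siblings) if cell is th)
        values = []
        for row in table.find_all("tr"):
            cells = row.find_all("td", recursive=False)
            # Skip the header row and any row that is too short.
            if len(cells) > position:
                values.append(cells[position].get_text(strip=True))
        return values
    return []  # header not found anywhere


dates = column_by_header(SAMPLE_HTML, "Date of Purchase")
print(dates)
```

A missing header returns an empty list here, so the caller can decide whether that should be treated as an error.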

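【Discussion】:

【Solution 2】:

I used list unpacking and slicing to pick the right cells out of each table row, then extracted their text. Slicing the list of columns with [2:5] does the trick.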

import requests
from bs4 import BeautifulSoup
import re

def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url)
    res.raise_for_status()
    return BeautifulSoup(res.text, 'html.parser')

def extract_purchases(soup: BeautifulSoup) -> list:
    # string= is the current name of the older text= argument in bs4
    table = soup.find('th', string=re.compile('Date of Purchase')).find_parent('table')
    purchases = []
    for row in table.find_all('tr')[1:]:
        price_cell, qty_cell, date_cell = row.find_all('td')[2:5]
        p = {
            'price': price_cell.text.strip(),
            'quantity': qty_cell.text.strip(),
            'date': date_cell.text.strip()
        }
        purchases.append(p)
    return purchases

if __name__ == '__main__':
    url = 'https://offer.ebay.com/ws/eBayISAPI.dll?ViewBidsLogin&item=173653442617&rt=nc&_trksid=p2047675.l2564'
    soup = make_soup(url)
    purchases = extract_purchases(soup)

    from pprint import pprint
    pprint(purchases)

Output:

[{'date': 'Jul-02-19 18:22:28 PDT', 'price': 'US $54.08', 'quantity': '1'},
 {'date': 'Jun-27-19 16:12:59 PDT', 'price': 'US $54.08', 'quantity': '1'},
 {'date': 'Jun-23-19 06:46:23 PDT', 'price': 'US $54.08', 'quantity': '1'},
 {'date': 'Jun-20-19 09:14:07 PDT', 'price': 'US $54.08', 'quantity': '1'},
 {'date': 'May-23-19 09:48:59 PDT', 'price': 'US $63.04', 'quantity': '1'},
 {'date': 'May-20-19 06:05:24 PDT', 'price': 'US $63.04', 'quantity': '1'},
 {'date': 'May-17-19 13:10:38 PDT', 'price': 'US $63.04', 'quantity': '1'},
 {'date': 'May-04-19 17:11:32 PDT', 'price': 'US $55.36', 'quantity': '1'},
 {'date': 'Apr-24-19 15:27:42 PDT', 'price': 'US $55.36', 'quantity': '1'},
 {'date': 'Apr-07-19 17:03:05 PDT', 'price': 'US $54.08', 'quantity': '1'},
 {'date': 'Apr-06-19 21:20:17 PDT', 'price': 'US $54.08', 'quantity': '1'},
 {'date': 'Apr-06-19 13:29:45 PDT', 'price': 'US $54.08', 'quantity': '1'},
 {'date': 'Apr-05-19 14:42:23 PDT', 'price': 'US $54.08', 'quantity': '1'},
 {'date': 'Apr-03-19 21:37:14 PDT', 'price': 'US $54.08', 'quantity': '1'},
 {'date': 'Apr-02-19 18:23:45 PDT', 'price': 'US $54.08', 'quantity': '1'},
 {'date': 'Mar-31-19 06:01:36 PDT', 'price': 'US $54.08', 'quantity': '1'},
 {'date': 'Mar-25-19 14:37:27 PDT', 'price': 'US $56.64', 'quantity': '1'},
 {'date': 'Feb-12-19 10:57:22 PST', 'price': 'US $53.94', 'quantity': '1'}]
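
The three-way unpacking raises a ValueError on any row with fewer than five cells, and the header lookup fails if the th is missing. A defensive sketch of the same slicing idea follows. The sample HTML is invented (only the positions of price, quantity and date follow the answer above), and extract_purchases_safe is a hypothetical name:

```python
import re

from bs4 import BeautifulSoup

# Invented sample markup; columns 3 to 5 hold price, quantity and date.
SAMPLE_HTML = """
<table>
  <tr><th>User ID</th><th>Variation</th><th>Price</th><th>Quantity</th><th>Date of Purchase</th></tr>
  <tr><td>a***b</td><td>-</td><td>US $54.08</td><td>1</td><td>Jul-02-19 18:22:28 PDT</td></tr>
  <tr><td colspan="5">separator row</td></tr>
  <tr><td>c***d</td><td>-</td><td>US $54.08</td><td>1</td><td>Jun-27-19 16:12:59 PDT</td></tr>
</table>
"""


def extract_purchases_safe(html):
    """Like the slicing approach above, but tolerant of odd rows."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("th", string=re.compile("Date of Purchase"))
    if header is None:
        return []  # table not present, e.g. a login page was served instead
    table = header.find_parent("table")
    purchases = []
    for row in table.find_all("tr")[1:]:
        cells = row.find_all("td")
        if len(cells) < 5:
            continue  # spacer or separator row; nothing to unpack
        price, quantity, date = (c.get_text(strip=True) for c in cells[2:5])
        purchases.append({"price": price, "quantity": quantity, "date": date})
    return purchases


purchases = extract_purchases_safe(SAMPLE_HTML)
print(purchases)
```

Skipping short rows silently is a design choice; logging them instead may be preferable while the page layout is still being explored.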

【Discussion】:

  • This is crazy. I've been trying for hours; how did you do it so fast? Thank you so much.