【Question title】: Obtaining column from wikipedia table using beautifulsoup
【Posted】: 2015-01-03 12:35:22
【Question description】:
source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.text)
tables = soup.find_all("table")

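I am trying to get the list of song names from the "List of singles" table in Taylor Swift's discography.

The table has no unique class or ID. The only distinctive thing I can think of is the caption tag around "List of singles...":

List of singles as main artist, with selected chart positions, sales figures and certifications

I tried: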

table = soup.find_all("caption")

But it returns nothing. I assume caption is not a recognized tag in bs4?

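【Comments】:

    Tags: python python-3.x beautifulsoup html-parsing


    【Solution 1】:

    Actually, this has nothing to do with findAll() versus find_all(). findAll() was the name used in BeautifulSoup3 and was kept in BeautifulSoup4 for compatibility reasons. Quoting the bs4 source code: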

    def find_all(self, name=None, attrs={}, recursive=True, text=None,
                 limit=None, **kwargs):
        generator = self.descendants
        if not recursive:
            generator = self.children
        return self._find_all(name, attrs, text, limit, generator, **kwargs)
    
    findAll = find_all       # BS3
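
To make the point above concrete, here is a minimal offline sketch. The HTML snippet is a hypothetical stand-in for the real page, and `html.parser` is chosen only so that no extra parser needs to be installed. It suggests that `caption` is parsed like any other tag and that both method names return the same result:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup; the real Wikipedia page is far larger.
html = """
<table>
  <caption>List of singles as main artist</caption>
  <tr><th scope="row">"Tim McGraw"</th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Look the caption up under both the bs4 name and the BS3-era alias.
captions_new = soup.find_all("caption")
captions_old = soup.findAll("caption")

print(len(captions_new), captions_new == captions_old)
print(captions_new[0].get_text())
```

If a lookup like this comes back empty on the live page, the cause is more likely to be the fetched markup or the parser in use than the method name.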
    

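    There is also a better way to get the list of singles. It relies on the span element with id="Singles", which marks the start of the Singles section. Use find_next_sibling() to get the first table after the span tag's parent, then get all th elements with scope="row":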

    from bs4 import BeautifulSoup
    import requests
    
    
    source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
    soup = BeautifulSoup(source_code.content)
    
    table = soup.find('span', id='Singles').parent.find_next_sibling('table')
    for single in table.find_all('th', scope='row'):
        print(single.text)
    

    Prints:

    "Tim McGraw"
    "Teardrops on My Guitar"
    "Our Song"
    "Picture to Burn"
    "Should've Said No"
    "Change"
    "Love Story"
    "White Horse"
    "You Belong with Me"
    "Fifteen"
    "Fearless"
    "Today Was a Fairytale"
    "Mine"
    "Back to December"
    "Mean"
    "The Story of Us"
    "Sparks Fly"
    "Ours"
    "Safe & Sound"
    (featuring The Civil Wars)
    "Long Live"
    (featuring Paula Fernandes)
    "Eyes Open"
    "We Are Never Ever Getting Back Together"
    "Ronan"
    "Begin Again"
    "I Knew You Were Trouble"
    "22"
    "Highway Don't Care"
    (with Tim McGraw)
    "Red"
    "Everything Has Changed"
    (featuring Ed Sheeran)
    "Sweeter Than Fiction"
    "The Last Time"
    (featuring Gary Lightbody)
    "Shake It Off"
    "Blank Space"
    

    【Comments】:
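
As a follow-up to Solution 1, here is an offline sketch of the same span, parent, next-sibling-table chain. The markup in `SAMPLE_HTML` is a hypothetical imitation of the layout the answer describes, and the helper name `single_titles` is made up for illustration. It also keeps only the first line of each header cell, on the assumption that "(featuring ...)" notes follow the title on a separate line as in the output above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in Solution 1.
SAMPLE_HTML = """
<h2><span id="Singles">Singles</span></h2>
<p>Intro text.</p>
<table>
  <caption>List of singles as main artist</caption>
  <tr><th scope="col">Title</th><th scope="col">Year</th></tr>
  <tr><th scope="row">"Tim McGraw"</th><td>2006</td></tr>
  <tr><th scope="row">"Safe &amp; Sound"<br/>
(featuring The Civil Wars)</th><td>2011</td></tr>
</table>
"""


def single_titles(html):
    """Return the song titles from the first table after the Singles heading."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("span", id="Singles").parent
    table = heading.find_next_sibling("table")
    titles = []
    for cell in table.find_all("th", scope="row"):
        # Keep only the first line so "(featuring ...)" notes are dropped.
        first_line = cell.get_text().strip().splitlines()[0]
        titles.append(first_line.strip().strip('"'))
    return titles


print(single_titles(SAMPLE_HTML))
```

The live page's markup may differ from this sample, so the selectors are worth checking against the current HTML before relying on them.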

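      【Solution 2】:

      Here is a complete example that solves the "Taylor Swift problem". First find the caption containing the text "List of singles" and move up to its parent object. Then iterate over the items that hold the text you are looking for: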

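      # Stop at the first caption that mentions the singles table.
      for caption in soup.findAll("caption"):
          if "List of singles" in caption.text:
              break
      
      # The caption's parent is the table itself.
      table = caption.parent
      for item in table.findAll("th", {"scope":"row"}):
          print(item.text)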
      

      This gives:

      "Tim McGraw"
      "Teardrops on My Guitar"
      "Our Song"
      "Picture to Burn"
      "Should've Said No"
      "Change"
      "Love Story"
      "White Horse"
      "You Belong with Me"
      "Fifteen"
      "Fearless"
      "Today Was a Fairytale"
      ...
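
The loop in Solution 2 leaves `caption` pointing at the last caption on the page when nothing matches. A hedged variant that returns `None` in that case could look like the sketch below. The helper name `table_by_caption` and the sample markup are assumptions made for illustration, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two captioned tables.
SAMPLE_HTML = """
<table>
  <caption>List of studio albums</caption>
  <tr><th scope="row">Fearless</th></tr>
</table>
<table>
  <caption>List of singles as main artist</caption>
  <tr><th scope="col">Title</th></tr>
  <tr><th scope="row">"Our Song"</th></tr>
  <tr><th scope="row">"Change"</th></tr>
</table>
"""


def table_by_caption(soup, needle):
    """Return the first table whose caption contains needle, else None."""
    for caption in soup.find_all("caption"):
        if needle in caption.get_text():
            return caption.find_parent("table")
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = table_by_caption(soup, "List of singles")
rows = [th.get_text(strip=True) for th in table.find_all("th", scope="row")]
missing = table_by_caption(soup, "List of music videos")

print(rows)
print(missing)
```

Returning `None` lets the caller tell "no such table" apart from a real match instead of silently reading the wrong table.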
      

      【Comments】:
