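【Question title】: Unable to get all the data including links from a tr tag
【Posted】: 2018-01-01 17:30:03
【Question】:

I've written a Python script to extract data from some HTML elements in a table. The sample below is a single tr row picked from that table. My goal is to get the data inside the elements with class fn, including the href links. My attempt so far parses the text of those elements but not the links. How can I change the script below to get the links from that class as well? Thanks in advance for any solution.

This is what I have tried so far: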

from bs4 import BeautifulSoup

content="""
<tr>
    <td align="center">1964</td>
    <td><span class="sortkey">Townes, Charles Hard</span><span class="vcard"><span class="fn"><a href="/wiki/Charles_Hard_Townes" class="mw-redirect" title="Charles Hard Townes">Charles Hard Townes</a></span></span>;<br>
    <span class="sortkey">Basov, Nikolay</span><span class="vcard"><span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></span>;<br>
    <span class="sortkey">Prokhorov, Alexander</span><span class="vcard"><span class="fn"><a href="/wiki/Alexander_Prokhorov" title="Alexander Prokhorov">Alexander Prokhorov</a></span></span></td>
    <td><span class="sortkey">Hodgkin, Dorothy</span><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td>
    <td><span class="sortkey">Bloch, Konrad Emil</span><span class="vcard"><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span></span>;<br>
    <span class="sortkey">Lynen, Feodor Felix Konrad</span><span class="vcard"><span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" class="mw-redirect" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></span></td>
    <td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup></td>
    <td><span class="sortkey">King, Jr., Martin Luther</span><span class="vcard"><span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span></span></td>
    <td align="center">—</td>
</tr>
"""
soup = BeautifulSoup(content,"lxml")
for items in soup.select('tr'):
    item_name = [item.text for item in items.select(".fn a")]
    print(item_name)

My current output:

['Charles Hard Townes', 'Nikolay Basov', 'Alexander Prokhorov', 'Dorothy Hodgkin', 'Konrad Emil Bloch', 'Feodor Felix Konrad Lynen', 'Jean-Paul Sartre', 'Martin Luther King, Jr.']

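To reiterate: my expected output is all the data from the fn class, including the href links.

【Comments】:

  • It would help if you included your expected output. Including the href as text without the tag doesn't make much sense. Maybe you want the HTML instead? An example would make it clear.
  • Thank you all for such helpful solutions. +1 for everyone. Thanks again.

Tags: python python-3.x web-scraping beautifulsoup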


【Solution 1】:

You can use either bs4 or a regular expression:

bs4:

from bs4 import BeautifulSoup as soup
s = soup(content, 'lxml')  # content is the HTML string from the question
# Take text and href from the same tag so the pairs stay aligned; this matches every link, including footnote [D]
new_data = [(a.text, a['href']) for a in s.find_all('a', href=True)]
print(new_data)

Output:

[('Charles Hard Townes', '/wiki/Charles_Hard_Townes'), ('Nikolay Basov', '/wiki/Nikolay_Basov'), ('Alexander Prokhorov', '/wiki/Alexander_Prokhorov'), ('Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin'), ('Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch'), ('Feodor Felix Konrad Lynen', '/wiki/Feodor_Felix_Konrad_Lynen'), ('Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre'), ('[D]', '#endnote_Note1D'), ('Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.')]

Regular expression:

import re
# Match href and title inside the same <a ...> tag so each pair stays aligned;
# the footnote link [D] has no title attribute, so it is skipped.
final_data = re.findall(r'<a href="([^"]*)"[^>]*?title="([^"]*)">', content)
print(final_data)

Output:

[('/wiki/Charles_Hard_Townes', 'Charles Hard Townes'), ('/wiki/Nikolay_Basov', 'Nikolay Basov'), ('/wiki/Alexander_Prokhorov', 'Alexander Prokhorov'), ('/wiki/Dorothy_Hodgkin', 'Dorothy Hodgkin'), ('/wiki/Konrad_Emil_Bloch', 'Konrad Emil Bloch'), ('/wiki/Feodor_Felix_Konrad_Lynen', 'Feodor Felix Konrad Lynen'), ('/wiki/Jean-Paul_Sartre', 'Jean-Paul Sartre'), ('/wiki/Martin_Luther_King,_Jr.', 'Martin Luther King, Jr.')]
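
The bs4 result above includes the footnote link [D] because find_all('a') matches every link in the row. If only the names inside the fn spans are wanted, one option is to scope the selector to that class. The following is a minimal sketch, not part of the original answer; the html string is a shortened stand-in for content (an assumption made to keep the example small), and it uses the built-in "html.parser" instead of "lxml" so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the `content` string from the question
# (assumption: a single cell is enough to show the difference).
html = (
    '<tr><td><span class="vcard"><span class="fn">'
    '<a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a>'
    '</span></span>'
    '<sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup>'
    '</td></tr>'
)

soup = BeautifulSoup(html, "html.parser")

# Every link in the row, footnote included.
every_link = [(a.get_text(), a["href"]) for a in soup.find_all("a", href=True)]

# Only links nested inside an element with class "fn".
name_links = [(a.get_text(), a["href"]) for a in soup.select(".fn a[href]")]

print(every_link)
print(name_links)
```

With this input, every_link holds two pairs (the name and the footnote), while name_links holds only the name.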

【Comments】:

【Solution 2】:

This modified code gets both the href and the text:

    from bs4 import BeautifulSoup
    
    content="""
    <tr>
        <td align="center">1964</td>
        <td><span class="sortkey">Townes, Charles Hard</span><span class="vcard"><span class="fn"><a href="/wiki/Charles_Hard_Townes" class="mw-redirect" title="Charles Hard Townes">Charles Hard Townes</a></span></span>;<br>
        <span class="sortkey">Basov, Nikolay</span><span class="vcard"><span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></span>;<br>
        <span class="sortkey">Prokhorov, Alexander</span><span class="vcard"><span class="fn"><a href="/wiki/Alexander_Prokhorov" title="Alexander Prokhorov">Alexander Prokhorov</a></span></span></td>
        <td><span class="sortkey">Hodgkin, Dorothy</span><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td>
        <td><span class="sortkey">Bloch, Konrad Emil</span><span class="vcard"><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span></span>;<br>
        <span class="sortkey">Lynen, Feodor Felix Konrad</span><span class="vcard"><span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" class="mw-redirect" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></span></td>
        <td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup></td>
        <td><span class="sortkey">King, Jr., Martin Luther</span><span class="vcard"><span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span></span></td>
        <td align="center">—</td>
    </tr>
    """
    soup = BeautifulSoup(content,"lxml")
    for items in soup.select('tr'):
        item_name = [[item.text,item.get('href')] for item in items.select(".fn a")]
        print(item_name)
    

Output:

    [['Charles Hard Townes', '/wiki/Charles_Hard_Townes'], ['Nikolay Basov', '/wiki/Nikolay_Basov'], ['Alexander Prokhorov', '/wiki/Alexander_Prokhorov'], ['Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin'], ['Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch'], ['Feodor Felix Konrad Lynen', '/wiki/Feodor_Felix_Konrad_Lynen'], ['Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre'], ['Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.']]
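
A commenter on the question asked whether the HTML itself was wanted rather than the bare href text. If that is the case, a tag can be turned back into markup instead of being split into text and href. The sketch below is only an illustration under that reading of the question; the html string is a shortened stand-in for content, and the list names are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the `content` string from the question (assumption).
html = (
    '<td><span class="vcard"><span class="fn">'
    '<a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a>'
    '</span></span></td>'
)

soup = BeautifulSoup(html, "html.parser")

inner_html = []  # markup inside each <span class="fn">
outer_html = []  # markup including the <span class="fn"> wrapper itself

for fn in soup.select(".fn"):
    inner_html.append(fn.decode_contents())
    outer_html.append(str(fn))

print(inner_html)
print(outer_html)
```

Here decode_contents() returns the children of the tag as a string, while str(tag) also includes the tag itself.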
    

【Comments】:

【Solution 3】:

Slightly simpler: there is no need to select the table rows separately.

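      from bs4 import BeautifulSoup

      soup = BeautifulSoup(content, "lxml")  # content is the HTML string from the question
      # A descendant selector reaches the links directly, without looping over rows
      links = soup.select('tr .fn a')
      for link in links:
          print(link.attrs['href'])
          print(link.text)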
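
The href values in this thread are site-relative paths such as /wiki/Nikolay_Basov, so they cannot be requested as they stand. If full URLs are needed, urllib.parse.urljoin from the standard library can resolve them against a base address. This is a sketch only: BASE_URL is an assumption (the thread never states which site the table comes from), and the html string is a shortened stand-in for content.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: the page address is not given in the thread; replace as needed.
BASE_URL = "https://en.wikipedia.org/"

# Shortened stand-in for the `content` string from the question (assumption).
html = (
    '<tr><td><span class="fn">'
    '<a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a>'
    '</span></td></tr>'
)

soup = BeautifulSoup(html, "html.parser")

# Same selector idea as above, with each relative href resolved against BASE_URL.
absolute_links = [
    (a.get_text(), urljoin(BASE_URL, a["href"]))
    for a in soup.select("tr .fn a[href]")
]

print(absolute_links)
```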
      

【Comments】:

【Solution 4】:

You can try bs4 instead of a regular expression:

        from bs4 import BeautifulSoup
        
        content="""
        <tr>
            <td align="center">1964</td>
            <td><span class="sortkey">Townes, Charles Hard</span><span class="vcard"><span class="fn"><a href="/wiki/Charles_Hard_Townes" class="mw-redirect" title="Charles Hard Townes">Charles Hard Townes</a></span></span>;<br>
            <span class="sortkey">Basov, Nikolay</span><span class="vcard"><span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></span>;<br>
            <span class="sortkey">Prokhorov, Alexander</span><span class="vcard"><span class="fn"><a href="/wiki/Alexander_Prokhorov" title="Alexander Prokhorov">Alexander Prokhorov</a></span></span></td>
            <td><span class="sortkey">Hodgkin, Dorothy</span><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td>
            <td><span class="sortkey">Bloch, Konrad Emil</span><span class="vcard"><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span></span>;<br>
            <span class="sortkey">Lynen, Feodor Felix Konrad</span><span class="vcard"><span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" class="mw-redirect" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></span></td>
            <td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup></td>
            <td><span class="sortkey">King, Jr., Martin Luther</span><span class="vcard"><span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span></span></td>
            <td align="center">—</td>
        </tr>
        """
        
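        soup = BeautifulSoup(content, "lxml")
        for td in soup.find_all('td'):
            # find_all (rather than find) returns every link in the cell, not just the first;
            # title=True skips the footnote link [D], which has no title attribute
            for a in td.find_all('a', title=True):
                print((a['title'], a['href']))

Output:

        ('Charles Hard Townes', '/wiki/Charles_Hard_Townes')
        ('Nikolay Basov', '/wiki/Nikolay_Basov')
        ('Alexander Prokhorov', '/wiki/Alexander_Prokhorov')
        ('Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin')
        ('Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch')
        ('Feodor Felix Konrad Lynen', '/wiki/Feodor_Felix_Konrad_Lynen')
        ('Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre')
        ('Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.')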
        

【Comments】:
