【Question title】: beautifulsoup selecting certain values only
【Posted】: 2015-09-23 02:48:27
【Question】:

Here is part of the page source.

   <tr>
    <td>
      <a href="/docdollars/doctors/pid/36602">
        <h6>Jane</h6>
      </a>
         Allopathic & Osteopathic Physicians/Internal Medicine
    </td>
    <td>
      <p>NY Medical Ctr<br>New York City, 
      <a href="/docdollars/states/NY">NY</a>
      </p>
    </td>                
  </tr>
  <tr>
    <td>
      <a href="/docdollars/doctors/pid/1091514">
        <h6>Greg</h6>
      </a>
         Allopathic & Osteopathic Physicians/Family Medicine
    </td>
    <td>
      <p>57950 NYC<br>New York City, 
      <a href="/docdollars/states/NY">NY</a>
      </p>
    </td>
  </tr>

I would like the scraped data to look like this:

Jane, Allopathic & Osteopathic Physicians/Internal Medicine, NY Medical Ctr, New York City, NY 
Greg, Allopathic & Osteopathic Physicians/Family Medicine, 57950 NYC, New York City, NY

My code (below) only partly works (see the comments in the code).

for i in item.find_all('tr'):
    print i.find('a').find('h6').text  #working fine
    print i.find('td').next_sibling.next_sibling.find('p').text.strip()  # this needs revision
    print i.find('td').text.strip()  # this needs revision

Thanks in advance for any suggestions!

【Comments】:

    Tags: python python-2.7 web-scraping beautifulsoup


    【Solution 1】:

    Focus on finding the <h6> elements with a CSS selector, then look up the accompanying information from there:

    for header in soup.select('tr td a h6'):
        name = header.get_text(strip=True)
        practice = header.parent.find_next_sibling(text=True).strip()
        address = header.find_parent('td').find_next_sibling('td').get_text(' ', strip=True)
        print name, practice, address
    

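    This finds every h6 element nested inside a <tr><td><a> wrapper. From each one we step back up to its parent (the <a> link) and take the next text node that follows it, which holds the specialty. We also look up the enclosing <td> and move to the next <td>, which contains the rest of the text.

    Assuming your sample input is parsed into a variable named soup, this produces: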

    >>> for header in soup.select('tr td a h6'):
    ...     name = header.get_text(strip=True)
    ...     practice = header.parent.find_next_sibling(text=True).strip()
    ...     address = header.find_parent('td').find_next_sibling('td').get_text(' ', strip=True)
    ...     print name, practice, address
    ... 
    Jane Allopathic & Osteopathic Physicians/Internal Medicine NY Medical Ctr New York City, NY
    Greg Allopathic & Osteopathic Physicians/Family Medicine 57950 NYC New York City, NY
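
The output above is space-separated, while the question asks for comma-separated fields. Below is a minimal Python 3 sketch of one way to get that format with the same navigation. The names SAMPLE_HTML and extract_rows are made up for illustration, the sample is a copy of the question's rows with the ampersand written as &amp;, and the sketch assumes the built-in html.parser backend. It uses string=True, the newer spelling of text=True in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

# Copy of the two sample rows from the question ("&" escaped as "&amp;").
SAMPLE_HTML = """
<tr>
  <td>
    <a href="/docdollars/doctors/pid/36602">
      <h6>Jane</h6>
    </a>
       Allopathic &amp; Osteopathic Physicians/Internal Medicine
  </td>
  <td>
    <p>NY Medical Ctr<br>New York City,
    <a href="/docdollars/states/NY">NY</a>
    </p>
  </td>
</tr>
<tr>
  <td>
    <a href="/docdollars/doctors/pid/1091514">
      <h6>Greg</h6>
    </a>
       Allopathic &amp; Osteopathic Physicians/Family Medicine
  </td>
  <td>
    <p>57950 NYC<br>New York City,
    <a href="/docdollars/states/NY">NY</a>
    </p>
  </td>
</tr>
"""


def extract_rows(html):
    """Return one comma-separated string per <tr> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for header in soup.select("tr td a h6"):
        name = header.get_text(strip=True)
        # The specialty is the bare text node that follows the enclosing <a>.
        practice = header.parent.find_next_sibling(string=True).strip()
        # The second cell holds the practice name, city and state.
        cell = header.find_parent("td").find_next_sibling("td")
        # stripped_strings yields each non-empty text fragment, whitespace
        # trimmed; drop the trailing comma after the city so that the join
        # below does not double it.
        parts = [fragment.rstrip(",") for fragment in cell.stripped_strings]
        rows.append(", ".join([name, practice] + parts))
    return rows


for line in extract_rows(SAMPLE_HTML):
    print(line)
```

Splitting the address cell with stripped_strings instead of get_text(' ', strip=True) keeps the practice name, city and state as separate fields, so each can be joined with a comma or written to a CSV file.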
    

    【Comments】:
