【Question title】: Writing XPath for selecting the description
【Posted】: 2014-10-16 06:08:44
【Question】:

I want to extract the description from an HTML page.

My div contains the following data:

  <div class="container page_op-detail">
 <form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded" action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21">
 <span id="ajax-view-state-page-container" style="display: none">
 <p> Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.</p>
 <p>
    <strong>Responsibilities</strong>
  </p>
  <ul>
     <li> Ownership and oversight of full-cycle accounts payable responsibilities including but not limited to, invoice processing, maintaining vendor records, running payment reports according to payment schedules, reconciling vendor statements)</li>
     <li> Identify and implement process improvements and automation in appropriate areas throughout the AP cycle</li>
     <li> Provide excellent customer service to vendors and employees by researching and resolving inquiries in a timely manner</li>
     <li> Assist with month-end activities, accruals, reconciliation, preparing 1099s, and audit support</li>
   <li> Assist with ad-hoc requests</li>
  </ul>
 <p>
    <strong>Qualifications</strong>
 </p>
  <ul>
     <li> AA/AS degree or equivalent experience in accounting</li>
     <li> Three years or more of related experience</li>
     <li> Full cycle accounts payable knowledge</li>
  </ul>
  <p class="type-centered">
       Data is more organised...!!!
   </p>
  <p class="type-centered apply-button">
  </div>

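Here I only need the data from the <p> tags. I do not want the Responsibilities and Qualifications data from

<p>Responsibilities</p><ul> ... </ul>
<p>Qualifications</p><ul> .. </ul>

That part is not required; please exclude it in the XPath.

I am using the following code: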

sel.xpath(
        'description',
        '//div[@class="container page_op-detail"][not(descendant-or-self::p/strong[contains(text(), "Qualifications")]/../ul[1])]'
    ).extract()

This does not work. Please help me write an XPath that excludes these items. How should an XPath be written for this kind of query?

Expected output:

<div class="container page_op-detail">
 <form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded" action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21">
 <span id="ajax-view-state-page-container" style="display: none">
 <p> Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.</p>

  <p class="type-centered">
       Data is more organised...!!!
   </p>
  <p class="type-centered apply-button">
  </div>

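【Comments】:

  • What is your expected output?
  • The expected output should exclude the Responsibilities ... and Qualifications ... sections, so I want to remove those lists.
  • For readability, it is better to include your expected output in the question rather than in the comments.
  • The closing tags for form and span, as well as for the last p, are missing. Your input is not well-formed.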

Tags: xpath web-scraping scrapy web-crawler screen-scraping


【Solution 1】:

Assuming the form and span tags are empty elements, you can try this XPath:

//div[@class='container page_op-detail']/*[not(self::p[normalize-space(.)='Responsibilities'])
                                         and not(self::ul[preceding-sibling::p[normalize-space(.)='Responsibilities']])
                                         and not(self::ul[preceding-sibling::p[normalize-space(.)='Qualifications']])
                                         and not(self::p[normalize-space(.)='Qualifications'])]
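
The rule this expression encodes (keep every child of the div except the two heading paragraphs and the lists that follow them) can also be sketched in plain Python. This is only an illustration under stated assumptions: it uses the standard-library `xml.etree.ElementTree`, whose limited XPath subset cannot express `not(self::...)`, so the filtering is done in a loop instead. The sample markup is a shortened, well-formed stand-in for the page, and the names `HTML`, `SECTION_HEADINGS`, and `kept_children` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the page fragment (assumed sample data).
HTML = """
<div class="container page_op-detail">
  <form id="f"></form>
  <span id="ajax-view-state-page-container"></span>
  <p> Solving the world's hardest problems ... </p>
  <p><strong>Responsibilities</strong></p>
  <ul><li> Assist with ad-hoc requests</li></ul>
  <p><strong>Qualifications</strong></p>
  <ul><li> Full cycle accounts payable knowledge</li></ul>
  <p class="type-centered">Data is more organised...!!!</p>
  <p class="type-centered apply-button"></p>
</div>
"""

SECTION_HEADINGS = {"Responsibilities", "Qualifications"}


def kept_children(div):
    """Return the div's children minus section headings and the lists after them."""
    kept = []
    seen_heading = False
    for child in div:
        # Collapse whitespace in the full text content, similar to normalize-space(.)
        text = " ".join("".join(child.itertext()).split())
        if child.tag == "p" and text in SECTION_HEADINGS:
            seen_heading = True
            continue
        # Mirrors the preceding-sibling test: drop any ul that follows a heading.
        if child.tag == "ul" and seen_heading:
            continue
        kept.append(child)
    return kept


root = ET.fromstring(HTML)  # the div itself is the root element here
description = kept_children(root)
print([(el.tag, el.get("class")) for el in description])
```

In this sample the form, the span, and the three remaining paragraphs survive, while both heading paragraphs and both lists are dropped.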

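【Comments】:

  • Thanks for your answer... but it still does not work. It returns a completely empty list. Any alternative solution?
  • I assumed the form and span tags are empty elements. Please fix your input. Are they ancestors of the p and ul elements?
  • This is how I used it: "//div[@class='container page_op-detail']/*[not(self::p[normalize-space(.)='Responsibilities']) and not(self::ul) and not(self::p[normalize-space(.)='Qualifications']) and not(self::ul)]"
  • I have tested it with xpathtester.com. Please see this (xpathtester.com/xpath/8d692a09ac707d9c6891af086c472bfe)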
【Solution 2】:

First of all, your HTML is missing several closing tags, including </form>, </p>, </span>, etc. I assume the following HTML is the corrected version:

<div class="container page_op-detail">
<form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded"         action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21"></form>
<span id="ajax-view-state-page-container" style="display: none"></span>
<p> Solving the world’s hardest problems ... </p>
<p>
<strong>Responsibilities</strong>
</p>
<ul>
 <li> Ownership and oversight of full-cycle .....</li>
 <li> Identify and implement process improvements ...</li>
 <li> Provide excellent customer service to vendors ... </li>
 <li> Assist with month-end activities, accruals, ...</li>
<li> Assist with ad-hoc requests</li>
</ul>
<p>
<strong>Qualifications</strong>
</p>
<ul>
 <li> AA/AS degree or equivalent experience in accounting</li>
 <li> Three years or more of related experience</li>
 <li> Full cycle accounts payable knowledge</li>
</ul>
<p class="type-centered">
   Data is more organised...!!!
</p>
<p class="type-centered apply-button"></p>
</div>

The first <p> tag can be extracted with:

//div[@class="container page_op-detail"]/p[1]/text()

The next <p> tag you need can be extracted with:

//div[@class="container page_op-detail"]/p[@class="type-centered"]/text()
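
As a rough check of these two expressions outside Scrapy, the sketch below runs equivalent paths with the standard-library `xml.etree.ElementTree`. Some assumptions apply: ElementTree has no `text()` step, so the `.text` attribute is read instead; the fragment is wrapped in a `<body>` element so that `.//div` can find the div; and the markup is a shortened stand-in for the corrected HTML above. The names `PAGE` and `DIV` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the corrected HTML, wrapped in <body> (assumed sample data).
PAGE = """
<body>
<div class="container page_op-detail">
  <p> Solving the world's hardest problems ... </p>
  <p><strong>Responsibilities</strong></p>
  <ul><li> Assist with ad-hoc requests</li></ul>
  <p class="type-centered">
     Data is more organised...!!!
  </p>
  <p class="type-centered apply-button"></p>
</div>
</body>
"""

body = ET.fromstring(PAGE)
DIV = './/div[@class="container page_op-detail"]'

# First <p> child of the div.
first_p = body.find(DIV + "/p[1]")
# Only paragraphs whose class attribute is exactly "type-centered".
centered = body.findall(DIV + '/p[@class="type-centered"]')

description = [first_p.text.strip()] + [p.text.strip() for p in centered]
print(description)
```

Because `@class="type-centered"` is an exact string comparison, the paragraph with class `type-centered apply-button` is not matched, which is what the expected output in the question asks for.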

You can then use an ItemLoader to append both extractions to the same item field, 'description', as shown in the Scrapy documentation example or as follows:

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    # Note: the 'name' field is used twice; both results are collected into it.
    l.add_xpath('name', '//div[@class="product_title"]')
    return l.load_item()
