【Question Title】: Scrapy. Extract html from div without wrapping parent tag
【Posted】: 2013-03-15 08:07:39
【Question】:

I use Scrapy to crawl a website.

I want to extract the content of a particular div.

<div class="short-description">
{some mess with text, <br>, other html tags, etc}
</div>

loader.add_xpath('short_description', "//div[@class='short-description']/div")

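With that code I get what I need, but the result includes the wrapping HTML (<div class="short-description">...</div>).

How do I get rid of that parent HTML tag?

Note: selectors like text() and node() can't help me, because my div contains <br>, <p>, other divs, whitespace, etc., and I need to keep them.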

【Comments】:

    Tags: python html xpath css-selectors scrapy


    【Solution 1】:

    Try using node() together with Join():

    loader.get_xpath('//div[@class="short-description"]/node()', Join())
    

    The result looks like:

    >>> from scrapy.contrib.loader import XPathItemLoader
    >>> from scrapy.contrib.loader.processor import Join
    >>> from scrapy.http import HtmlResponse
    >>>
    >>> body = """
    ...     <html>
    ...         <div class="short-description">
    ...             {some mess with text, <br>, other html tags, etc}
    ...             <div>
    ...                 <p>{some mess with text, <br>, other html tags, etc}</p>
    ...             </div>
    ...             <p>{some mess with text, <br>, other html tags, etc}</p>
    ...         </div>
    ...     </html>
    ... """
    >>> response = HtmlResponse(url='http://example.com/', body=body)
    >>>
    >>> loader = XPathItemLoader(response=response)
    >>>
    >>> print loader.get_xpath('//div[@class="short-description"]/node()', Join())
    
                {some mess with text,  <br> , other html tags, etc}
                 <div>
                    <p>{some mess with text, <br>, other html tags, etc}</p>
                </div>
                 <p>{some mess with text, <br>, other html tags, etc}</p>
    >>>
    >>> loader.get_xpath('//div[@class="short-description"]/node()', Join())
    u'\n            {some mess with text,  <br> , other html tags, etc}\n
       <div>\n         <p>{some mess with text, <br>, other html tags, etc}</p>\n
       </div> \n     <p>{some mess with text, <br>, other html tags, etc}</p> \n'
    

    【Comments】:

      【Solution 2】:
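      from scrapy.selector import HtmlXPathSelector  # legacy selector API from the Scrapy version used in this thread
      # `response` is the Response object passed to the spider callback.
      hxs = HtmlXPathSelector(response)
      # text() returns only the div's direct text nodes; child tags such as <br> and <p> are dropped.
      for text in hxs.select("//div[@class='short-description']/text()").extract():
          print(text)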
      

      【Comments】:
