【Title】: How can I select from only one table with Web::Scraper?
【Posted】: 2015-10-16 06:50:13
【Question】:

I want to extract only the text under the heading Node Object Methods from a web page. The relevant HTML is:

<h2>Node Object Properties</h2>
<p>The &quot;DOM&quot; column indicates in which DOM Level the property was introduced.</p>

<table class="reference">
<tr>
<th width="23%" align="left">Property</th>
<th width="71%" align="left">Description</th>
<th style="text-align:center;">DOM</th>
</tr>
<tr>
    <td><a href="prop_node_attributes.asp">attributes</a></td>
    <td>Returns a collection of a node's attributes</td>
    <td style="text-align:center;">1</td>
</tr>

<tr>
    <td><a href="prop_node_baseuri.asp">baseURI</a></td>
    <td>Returns the absolute base URI of a node</td>
    <td style="text-align:center;">3</td>
</tr>
<tr>
    <td><a href="prop_node_childnodes.asp">childNodes</a></td>
    <td>Returns a NodeList of child nodes for a node</td>
    <td style="text-align:center;">1</td>
</tr>
<tr>
    <td><a href="prop_node_firstchild.asp">firstChild</a></td>
    <td>Returns the first child of a node</td>
    <td style="text-align:center;">1</td>
</tr>
<tr>
    <td><a href="prop_node_lastchild.asp">lastChild</a></td>
    <td>Returns the last child of a node</td>
    <td style="text-align:center;">1</td>
</tr>
<tr>
    <td><a href="prop_node_localname.asp">localName</a></td>
    <td>Returns the local part of the name of a node</td>
    <td style="text-align:center;">2</td>
</tr>
<tr>
    <td><a href="prop_node_namespaceuri.asp">namespaceURI</a></td>
    <td>Returns the namespace URI of a node</td>
    <td style="text-align:center;">2</td>
</tr>
<tr>
    <td><a href="prop_node_nextsibling.asp">nextSibling</a></td>
    <td>Returns the next node at the same node tree level</td>
    <td style="text-align:center;">1</td>
</tr>
<tr>
    <td><a href="prop_node_nodename.asp">nodeName</a></td>
    <td>Returns the name of a node, depending on its type</td>
    <td style="text-align:center;">1</td>
</tr>
<tr>
    <td><a href="prop_node_nodetype.asp">nodeType</a></td>
    <td>Returns the type of a node</td>
    <td style="text-align:center;">1</td>
</tr>
<tr>
    <td><a href="prop_node_nodevalue.asp">nodeValue</a></td>
    <td>Sets or returns the value of a node, depending on its 
    type</td>
    <td style="text-align:center;">1</td>
</tr>
<tr>
    <td><a href="prop_node_ownerdocument.asp">ownerDocument</a></td>
    <td>Returns the root element (document object) for a node</td>
    <td style="text-align:center;">2</td>
</tr>
<tr>
    <td><a href="prop_node_parentnode.asp">parentNode</a></td>
    <td>Returns the parent node of a node</td>
    <td style="text-align:center;">1</td>
</tr>
<tr>
    <td><a href="prop_node_prefix.asp">prefix</a></td>
    <td>Sets or returns the namespace prefix of a node</td>
    <td style="text-align:center;">2</td>
</tr>
<tr>
    <td><a href="prop_node_previoussibling.asp">previousSibling</a></td>
    <td>Returns the previous node at the same node tree level</td>
    <td style="text-align:center;">1</td>
</tr>
<tr>
    <td><a href="prop_node_textcontent.asp">textContent</a></td>
    <td>Sets or returns the textual content of a node and its 
    descendants</td>
    <td style="text-align:center;">3</td>
</tr>
</table>

<h2>Node Object Methods</h2>
<p>The &quot;DOM&quot; column indicates in which DOM Level the method was introduced.</p>
<table class="reference">
<tr>
<th width="33%" align="left">Method</th>
<th width="61%" align="left">Description</th>
<th style="text-align:center;">DOM</th>
</tr>
<tr>
    <td><a href="met_node_appendchild.asp">appendChild()</a></td>
    <td>Adds a new child node, to the specified node, as the last child node</td>
    <td style="text-align:center;">1 </td>
</tr>
<tr>
    <td><a href="met_node_clonenode.asp">cloneNode()</a></td>
    <td>Clones a node</td>
    <td style="text-align:center;">1 </td>
</tr>
<tr>
    <td><a href="met_node_comparedocumentposition.asp">compareDocumentPosition()</a></td>
    <td>Compares the document position of two nodes</td>
    <td style="text-align:center;">1 </td>
</tr>
<tr>
    <td>getFeature(<span class="parameter">feature</span>,<span class="parameter">version</span>)</td>
    <td>Returns a DOM object which implements the specialized APIs 
    of the specified feature and version</td>
    <td style="text-align:center;">3 </td>
</tr>
<tr>
    <td>getUserData(<span class="parameter">key</span>)</td>
    <td>Returns the object associated to a key on a this node. The 
    object must first have been set to this node by calling setUserData with the 
    same key</td>
    <td style="text-align:center;">3 </td>
</tr>
<tr>
    <td><a href="met_node_hasattributes.asp">hasAttributes()</a></td>
    <td>Returns true if a node has any attributes, otherwise it 
    returns false</td>
    <td style="text-align:center;">2 </td>
</tr>
<tr>
    <td><a href="met_node_haschildnodes.asp">hasChildNodes()</a></td>
    <td>Returns true if a node has any child nodes, otherwise it 
    returns false</td>
    <td style="text-align:center;">1 </td>
</tr>
<tr>
    <td><a href="met_node_insertbefore.asp">insertBefore()</a></td>
    <td>Inserts a new child node before a specified, existing, child node</td>
    <td style="text-align:center;">1 </td>
</tr>
</table>

If I write the following in Perl:

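use Web::Scraper;

my $data = scraper {
    process "table.reference > tr > td > a", 'renners[]' => 'TEXT';
};

# $html is assumed to hold the page source shown above
my $res2 = $data->scrape($html);

for my $i (0 .. $#{ $res2->{renners} }) {
    print $res2->{renners}[$i], "\n";
}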

I get the text of the <a> tags from both tables, i.e.:

attributes
baseURI
.
.
.
.
insertBefore()

whereas I need the text of the <a> tags for the Node Object Methods only, i.e.:

appendChild()
.
.
.
insertBefore()

In short, I want to print only the Node Object Methods. What should I change in my code?

【Comments】:

    Tags: css html perl css-selectors


    【Answer 1】:

    Web::Scraper can use the nth-of-type pseudo-class to pick the right table. There are two tables with the same class, so you can say table.reference:nth-of-type(2):

    use v5.22;
    
    use feature qw(postderef);
    no warnings qw(experimental::postderef);
    
    
    use Web::Scraper;
    
    my $html = do { local $/; <DATA> };
    
    my $methods = scraper {
        process "table.reference:nth-of-type(2) > tr > td > a", 'renners[]' => 'TEXT';
        };
    my $res = $methods->scrape( $html );
    
    say join "\n", $res->{renners}->@*;
    

    Here it is with Mojo::DOM:

    use v5.10;    # enables say
    use Mojo::DOM;
    
    my $html = do { local $/; <DATA> };
    
    my $dom = Mojo::DOM->new( $html );
    
    say $dom
        ->find( 'table.reference:nth-of-type(2) > tr > td > a' )
        ->map( 'text' )
        ->join( "\n" );
    

    I tried to find a selector solution that could match on the text of the h2, but my selector kung fu is weak here.
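
One way to key on the h2 text without a pure selector is to find the heading in Perl and then walk to the table that follows it. The sketch below is only an illustration under assumptions: the HTML is a trimmed-down stand-in for the page in the question, and the variable names ($heading, $table, @methods) are made up for this example. It relies on Mojo::Collection's first with a callback and on Mojo::DOM's following method.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Trimmed-down stand-in for the page in the question (assumption:
# two sibling tables with the same class, each preceded by an h2).
my $html = q{
<h2>Node Object Properties</h2>
<table class="reference">
<tr><th>Property</th></tr>
<tr><td><a href="prop_node_attributes.asp">attributes</a></td></tr>
</table>
<h2>Node Object Methods</h2>
<p>The "DOM" column indicates the DOM Level.</p>
<table class="reference">
<tr><th>Method</th></tr>
<tr><td><a href="met_node_appendchild.asp">appendChild()</a></td></tr>
<tr><td><a href="met_node_clonenode.asp">cloneNode()</a></td></tr>
</table>
};

my $dom = Mojo::DOM->new($html);

# Pick the heading by its text instead of by its position.
my $heading = $dom->find('h2')->first(sub { $_->text eq 'Node Object Methods' });

# following('table') collects the sibling tables that come after the
# heading; the first one is the table that belongs to it.
my $table = $heading->following('table')->first;

# Same row/cell/link path as in the selectors above.
my @methods = $table->find('tr > td > a')->map('text')->each;

say for @methods;
```

With the stand-in HTML this prints appendChild() and cloneNode(). Unlike nth-of-type(2), it keeps working if another table is added earlier on the page, as long as the heading text stays the same.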

    【Comments】:

      【Answer 2】:

      Web::Query offers a solution almost identical to the Mojo::DOM one proposed by brian d foy.

      use v5.10;    # enables say
      use Web::Query;
      
      my $html = do { local $/; <DATA> };
      
      wq($html)
          ->find('table.reference:nth-of-type(2) > tr > td > a')
          ->each(sub {
              my ($i, $e) = @_;
              say $e->text();
          });
      

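      However, Mojo::DOM appears to be the more robust library. To get Web::Query to match its selector correctly, I had to edit the input provided in the question and add a root node wrapping everything else.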

      __DATA__
      <html>
      ...
      </html>
      

      【Comments】:

        【Answer 3】:

        You can use XPath to extract the data from the first table after the heading Node Object Methods, like this:

        use v5.10;    # enables say
        use Web::Scraper;
        
        my $html = do { local $/; <DATA> };
        
        my $methods = scraper {
            process '//h2[.="Node Object Methods"]/following-sibling::table[1]//tr/td[1]', 
                'renners[]' => 'TEXT';
        };  
        my $res = $methods->scrape( $html );
        
        say join "\n", @{ $res->{renners} };
        

        The output will be

        appendChild()
        cloneNode()
        compareDocumentPosition()
        getFeature(feature,version)
        getUserData(key)
        hasAttributes()
        hasChildNodes()
        insertBefore()
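
The output above includes methods that are not links, such as getUserData(key). If only the text of the <a> tags is wanted, as the question literally asks, the same XPath can be extended with a final /a step. This is a sketch on a shortened stand-in for the page (an assumption, not the full HTML from the question), and the names $links and @linked are chosen just for this example.

```perl
use strict;
use warnings;
use feature 'say';
use Web::Scraper;

# Shortened stand-in for the page in the question (assumption).
my $html = q{<html><body>
<h2>Node Object Properties</h2>
<table class="reference">
<tr><td><a href="prop_node_attributes.asp">attributes</a></td><td>Returns a collection of a node's attributes</td></tr>
</table>
<h2>Node Object Methods</h2>
<table class="reference">
<tr><th>Method</th><th>Description</th></tr>
<tr><td><a href="met_node_appendchild.asp">appendChild()</a></td><td>Adds a new child node</td></tr>
<tr><td>getUserData(<span class="parameter">key</span>)</td><td>Returns the object associated to a key</td></tr>
<tr><td><a href="met_node_insertbefore.asp">insertBefore()</a></td><td>Inserts a new child node</td></tr>
</table>
</body></html>};

my $links = scraper {
    # Same path as above, narrowed to links inside the first cell of each row.
    process '//h2[.="Node Object Methods"]/following-sibling::table[1]//tr/td[1]/a',
        'renners[]' => 'TEXT';
};
my $res = $links->scrape($html);

# Fall back to an empty list if nothing matched.
my @linked = @{ $res->{renners} || [] };

say for @linked;
```

On this stand-in input the getUserData(key) row is skipped because its first cell contains no link.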
        

        【Comments】:
