【Question title】: Right syntax in HTML scraping
【Posted】: 2018-04-20 08:24:01
【Question】:

I have HTML that changes dynamically:

<tbody>
' ------------------- Block 1 ----------------------
   <tr class="table-row">
      <td class="cell">
         <div>18/4/2018</div>
      </td>
      <td class="cell">
         <div>
            <form id="idc" method="post" action=""> ' id is dynamic so can't use it
               <div style=""><input type="hidden" name="idc_hf_0" id="idc_hf_0" /></div> ' id and name are dynamic so can't use them
               Download all invoice documents as ZIP-file
               <span>
               <a class="icon zipdownload" title="Download all invoice documents as ZIP-file" href=""></a>
               </span>
               <span class="has-explanation">
               <a class="helper" href="javascript:;" title="The zip-file contains only PDF files of Tax/Fee statements and the Fleet Invoice with all annexes if available.">
               <span class="icon question" id="table-header-explanation"></span>
               </a>
               </span>
            </form>
         </div>
      </td>
      <td class="cell">
         <div>
            <a class="" title="View &gt;&gt;" href="">View &gt;&gt;</a>
         </div>
      </td>
   </tr>
 ' ################### Block1 END #######################
 
 ' ------------------- Block 2 ----------------------
   <tr class="table-row">
      <td class="cell">
         <div>13/4/2018</div> ' need this
      </td>
      <td class="cell">
         <div>
            <form id="idd" method="post" action="">
               <div style=""><input type="hidden" name="idd_hf_0" id="idd_hf_0" /></div>
               <div>
                  <span>Collective Payment Order</span> (<span>2018-500421707</span>)
                  <span>
                  <span class="invisible"> | </span><span>
                  <a class="Download" title="Download" href="">English</a>
                  </span>
                  </span>
               </div>
               <div>
                  <span>Tax/Fee CSV list</span> <span>
                  <a class="icon csv" title="Download" href=""></a>  ' need this  HREF1
                  </span>
               </div>
               <div>
                  <span>Detailed Trip CSV list</span> <span>
                  <a class="icon csv" title="Download" href=""></a> ' need this HREF2
                  </span>
               </div>
               Download all invoice documents as ZIP-file
               <span>
               <a class="icon zipdownload" title="Download all invoice documents as ZIP-file" href=""></a>
               </span>
               <span class="has-explanation">
               <a class="helper" href="javascript:;" title="The zip-file contains only PDF files of Tax/Fee statements and the Fleet Invoice with all annexes if available.">
               <span class="icon question" id="table-header-explanation"></span>
               </a>
               </span>
            </form>
         </div>
      </td>
      <td class="cell">
         <div>
            <a class="" title="View &gt;&gt;" href="">View &gt;&gt;</a>
         </div>
      </td>
   </tr>
  ' ################### Block2 END #######################
  
</tbody>

So there are two kinds of block, and they appear dynamically. The structure could also look like this:

Block1
Block1
Block2
Block1
Block2
Block2
Block2
Block1

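From these blocks I need to get:

  1. The count of Block2
  2. The date of each Block2
  3. HREF1 from class="icon csv"
  4. HREF2 from class="icon csv"

What distinguishes Block 1 from Block 2 is that Block 1 has no class="icon csv" link (the <span>Tax/Fee CSV list</span> <span> part).

I'm confused about how to use the getElement... methods. I tried this: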

Set IeDoc = IeApp.Document
    With IeDoc
        Set IeTbody = .getElementsByTagName("tbody").getElementsByClassName("table-row")
        d = IeTbody.legth
        For Each stEl In IeTbody
            
        Next stEl

    End With

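But I get the error "Object doesn't support this property or method". Would it be better to use querySelector? How do I get the links?

Logically it should be something like this: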

Set IeDoc = IeApp.Document
    With IeDoc
        Set Blocks = .getElementsByTagName("tbody")

    For Each block In Blocks
        Set hasClass = .getElementsByClassName("table-row").getElementsByClassName("cell")(1).getElementsByClassName("icon csv")
        if not hasClass is nothing then
            b.Date = Blocks(block).getElementsByClassName("table-row").getElementsByClassName("cell")(0).getElementsByTagName("div")(0).innerText()
            b.Href1 = Blocks(block).getElementsByClassName("table-row").getElementsByClassName("cell")(1).getElementsByClassName("icon csv")(0)
            b.Href2 = Blocks(block).getElementsByClassName("table-row").getElementsByClassName("cell")(1).getElementsByClassName("icon csv")(1)
        end if
    Next block

End With
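
A note on the error above: getElementsByTagName and getElementsByClassName return collections rather than single elements, so chaining a second getElementsBy... call directly onto the result asks the collection for a method it does not have. That is a likely cause of "Object doesn't support this property or method". Picking one element out of the collection first (for example with .Item(0)) avoids it. The sketch below is one possible way to walk the rows, not a tested solution for the real page. It assumes IeDoc is the loaded document from the question and that the page renders in IE9+ document mode, where getElementsByClassName is available. The procedure name PrintBlock2Rows is made up for illustration.

```vba
' Sketch only: assumes IeDoc is the loaded document (IeApp.Document) and that the
' page renders in IE9+ document mode, where getElementsByClassName is available.
Public Sub PrintBlock2Rows(ByVal IeDoc As Object)
    Dim rows As Object, csvLinks As Object
    Dim i As Long, countOfBlock2 As Long

    ' A collection of every element with class "table-row" in the document
    Set rows = IeDoc.getElementsByClassName("table-row")

    For i = 0 To rows.Length - 1
        ' Elements carrying both classes "icon" and "csv" inside this row
        Set csvLinks = rows.Item(i).getElementsByClassName("icon csv")

        ' Per the question, only Block 2 rows contain the two CSV links
        If csvLinks.Length >= 2 Then
            countOfBlock2 = countOfBlock2 + 1
            ' The first <div> in the row holds the date
            Debug.Print rows.Item(i).getElementsByTagName("div").Item(0).innerText
            Debug.Print csvLinks.Item(0).href   ' HREF1
            Debug.Print csvLinks.Item(1).href   ' HREF2
        End If
    Next i

    Debug.Print "count of block2 = " & countOfBlock2
End Sub
```

It would be called as PrintBlock2Rows IeApp.Document once the page has finished loading. Note that .href returns the resolved URL; getAttribute("href") could be used instead if the raw attribute text is wanted.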

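【Question comments】:

  • Do you have a URL you can share, or is it internal or behind a login?
  • It requires a login.
  • How do you tell Block 1 from Block 2?
  • Block 1 has no class="icon csv" link (the <span>Tax/Fee CSV list</span> <span> part).
  • Also, is there a reliable way to tell where any given block starts and ends? I'm asking in order to work out whether a parsing function could be written to parse the response text.

Tags: vba excel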


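【Solution 1】:

This is not very robust, but it uses a regular expression and parses the HTML you provided. A lookbehind would help when splitting on the dates with the regex, but I can't work that out at the moment. For now I have adapted a regex function by @FlorentB:
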
Public Matches As Object ' the regex matches (the date <div>s), filled in by SplitRe
' Or add in Tools > References > VBScript Reg Exp for early binding
Public Sub testing()
    Dim str As String, countOfBlock2 As Long, arr() As String, i As Long
    str = Range("A1") 'I am reading in from sheet but this would be your response text
    arr = SplitRe(str, "\<div>[\d]+[\/-][\d]+[\/-][\d]+\<\/div>") 'look behind would help

    ' arr(0) is the text before the first date, so the blocks start at index 1;
    ' arr(i) is the text that follows the date held in Matches(i - 1)
    For i = LBound(arr) + 1 To UBound(arr)

        If InStr(1, arr(i), "class=""icon csv""") > 0 Then
            countOfBlock2 = countOfBlock2 + 1 ' "Block 2"
            Debug.Print Replace(Replace(Matches(i - 1), "<div>", ""), "</div>", "") 'dates from Block 2
            Debug.Print Split(Split(arr(i), """icon csv"" title=""Download"" href=")(1), "></a>")(0) 'HREF1
            Debug.Print Split(Split(arr(i), """icon csv"" title=""Download"" href=")(2), "></a>")(0) 'HREF2
        End If

    Next i

    Debug.Print "count of block2 = " & countOfBlock2

End Sub

'https://stackoverflow.com/questions/28107005/splitting-string-in-vba-using-regex?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
Public Function SplitRe(Text As String, Pattern As String, Optional IgnoreCase As Boolean) As String()
    Static re As Object

    If re Is Nothing Then
        Set re = CreateObject("VBScript.RegExp")
        re.Global = True
        re.MultiLine = True
    End If

    re.IgnoreCase = IgnoreCase
    re.Pattern = Pattern
    ' Replace every match with a sentinel character, then split on that sentinel
    SplitRe = Strings.Split(re.Replace(Text, ChrW(-1)), ChrW(-1))

    ' Keep the matches themselves (the dates) for the caller
    Set Matches = re.Execute(Text)

End Function

Output:

【Discussion】:

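  • It only works on text copied into Range("A1"), not on the response text. I also don't understand what arr = SplitRe(str, "\<div>[\d]+[\/-][\d]+[\/-][\d]+\<\/div>") actually does, since the regex is there to find the dates.
  • It splits the text using the date pattern as the delimiter, on the assumption that each block has one date between div tags. So it looks at the text in A1 and splits it on the dd/mm/yyyy date pattern to produce an array in which each position should represent one block.
  • If it helps, an upvote would be appreciated, but if it hasn't solved your problem, please don't accept it as the answer. We can try to debug why it doesn't work end to end, but in the meantime someone else may come up with a better solution.
  • It's because the response text has more <div> tags; I just need to cut out the inner text of <tbody>.
  • So do an initial split on the tbody tag, take the first element of that array, and work with that.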
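
The last two comments suggest narrowing the response text down to the table body before splitting on the dates. A minimal sketch of that step follows, assuming the response contains one <tbody> ... </tbody> pair of interest. ExtractTbody is a hypothetical helper name, not part of the answer above. It uses InStr and Mid$ rather than Split so that a missing closing tag does not raise an error.

```vba
' Sketch only: returns the text between the first <tbody ...> and the next </tbody>.
' ExtractTbody is an illustrative name, not part of the original answer.
Public Function ExtractTbody(ByVal responseText As String) As String
    Dim openPos As Long, startPos As Long, endPos As Long

    openPos = InStr(1, responseText, "<tbody", vbTextCompare)
    If openPos = 0 Then Exit Function                 ' no table body: returns ""

    startPos = InStr(openPos, responseText, ">")      ' end of the opening tag
    If startPos = 0 Then Exit Function
    startPos = startPos + 1

    endPos = InStr(startPos, responseText, "</tbody>", vbTextCompare)
    If endPos = 0 Then endPos = Len(responseText) + 1 ' no closing tag: take the rest

    ExtractTbody = Mid$(responseText, startPos, endPos - startPos)
End Function
```

The result could then be passed to the regex split in place of the sheet value, for example arr = SplitRe(ExtractTbody(responseText), "\<div>[\d]+[\/-][\d]+[\/-][\d]+\<\/div>"), where responseText stands for whatever string holds the page source.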