【Question title】: How to get at this data
【Posted】: 2015-02-19 08:31:30
【Question description】:

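I want to scrape the three highlighted, bordered items from the HTML sample below. I have also highlighted some markup that looks useful.

How would you do it?

Solution

OK, so this was not a great question, and I am honestly surprised it did not get more downvotes! Oh well, here are some breadcrumbs for anyone else.

Three of the four pieces of information I wanted are the inner text of span elements with known ids (for example, "yfs_l10_gm150220c00036500" holds the $0.83), so the helper below seems like a decent, direct approach: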

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetSpanTextForId
'
' Returns the inner text from a span element known by the passed id
'
' param doc:     the source HTMLDocument
' param spanId:  the id of the span element to read
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetSpanTextForId(ByRef doc As HTMLDocument, ByVal spanId As String) As Double
'   Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine        As String
    sRoutine = cModule & ".GetSpanTextForId"
     
    CheckArgNotNothing doc, "doc"
    CheckArgNotBadString spanId, "spanId"
'   Procedure
    Dim oSpan As HTMLSpanElement
    Set oSpan = doc.getElementById(spanId)
    Check Not oSpan Is Nothing, "Could not find span with id: " & Bracket(spanId)
    GetSpanTextForId = CDbl(oSpan.innerText)   ' explicit String -> Double conversion
    
    Exit Function

ErrHandler:
    Select Case DspErrMsg(sRoutine)
         Case Is = vbAbort:  Stop: Resume    'Debug mode - Trace
         Case Is = vbRetry:  Resume          'Try again
         Case Is = vbIgnore:                 'End routine
     End Select


End Function
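
For readers working in JavaScript (the language used in the answer further down) rather than VBA, the same lookup-by-id idea might be sketched as follows. This is only an illustration, not the author's code: the name getSpanNumberForId is hypothetical, `doc` is assumed to be any object that exposes getElementById (a browser `document` or a test stub), and US-style number formatting is assumed.

```javascript
// Hypothetical helper: returns the numeric text of the element with the given id.
// `doc` is anything exposing getElementById (a browser `document`, or a stub).
function getSpanNumberForId(doc, spanId) {
  if (!doc || typeof doc.getElementById !== "function") {
    throw new TypeError("doc must expose getElementById");
  }
  var span = doc.getElementById(spanId);
  if (!span) {
    throw new Error("Could not find span with id: [" + spanId + "]");
  }
  // Assumption: US-style formatting, so commas are thousands separators.
  var text = String(span.textContent).replace(/,/g, "").trim();
  var value = Number(text);
  if (text === "" || isNaN(value)) {
    throw new Error("Span [" + spanId + "] does not hold a number: " + text);
  }
  return value;
}
```

In a browser this would be called as getSpanNumberForId(document, "yfs_l10_gm150220c00036500"). Throwing on a missing element mirrors the Check call in the VBA helper, so a changed page layout fails loudly instead of returning a misleading value.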

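The only item not directly available from a span is the Open Interest, which is part of a table that is the second child of an element with an id. The methods below find the cell containing the label text (i.e. "Open Interest:") and return the text of the cell immediately after it, which holds the value I want.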

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetOpenInterest
'
' The latest available Open Interest.
'
' param doc:     the source HTMLDocument
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetOpenInterest(ByRef doc As HTMLDocument) As Long
    Dim tbl As IHTMLTable
    Set tbl = GetSummaryDataTable(doc, 1)
    Dim k As Integer
    k = mWebScrapeHelpers.GetCellNumberForTextStartingWith(tbl, "Open Interest:")
    ' -1 means the label cell was not found
    Check k >= 0, "Could not find the 'Open Interest:' cell"
    ' Long rather than Integer: open interest can exceed 32,767
    GetOpenInterest = CLng(mWebScrapeHelpers.GetCellTextFromCellNumber(tbl, k + 1))
End Function


Function GetCellNumberForTextStartingWith(ByRef tbl As IHTMLTable, ByRef s As String) As Integer
'   Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine        As String
    sRoutine = cModule & ".GetCellNumberForTextStartingWith"
    
    CheckArgNotNothing tbl, "tbl"
    
'   Procedure
    Dim tblCell As HTMLTableCell
    Dim k As Integer

    For Each tblCell In tbl.Cells
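        ' Match cells whose text starts with s (the original pattern "*" & s matched text ending with s)
        If Trim$(tblCell.innerText) Like (s & "*") Then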
            GetCellNumberForTextStartingWith = k
            Exit Function
        End If
        k = k + 1
    Next
    
    ' if we got here it was not found so
    GetCellNumberForTextStartingWith = -1
    Exit Function

ErrHandler:
    Select Case DspErrMsg(sRoutine)
         Case Is = vbAbort:  Stop: Resume    'Debug mode - Trace
         Case Is = vbRetry:  Resume          'Try again
         Case Is = vbIgnore:                 'End routine
     End Select
     
End Function

Function GetCellTextFromCellNumber(ByRef tbl As IHTMLTable, ByRef nbr As Integer) As String
'   Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine        As String
    sRoutine = cModule & ".GetCellTextFromCellNumber"
    
    CheckArgNotNothing tbl, "tbl"
    Check tbl.Cells.Length > 0, "table is empty"
    ' Cells is zero-based, so the last valid index is Length - 1
    Check nbr >= 0 And nbr < tbl.Cells.Length, "table only has " & tbl.Cells.Length & " cells; can't get cell number " & nbr
    
'   Procedure
    GetCellTextFromCellNumber = tbl.Cells(nbr).innerText
    Exit Function

ErrHandler:
    Select Case DspErrMsg(sRoutine)
         Case Is = vbAbort:  Stop: Resume    'Debug mode - Trace
         Case Is = vbRetry:  Resume          'Try again
         Case Is = vbIgnore:                 'End routine
     End Select


End Function
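
The "find the label cell, then read the next cell" step can also be sketched in JavaScript. This is a rough illustration under stated assumptions, not the author's code: getTextOfCellAfterLabel and parseCount are hypothetical names, `cells` is assumed to be any array-like of objects with a textContent property (for example the result of table.querySelectorAll("th, td")), and counts are assumed to use commas as thousands separators.

```javascript
// Hypothetical helper: finds the first cell whose trimmed text starts with
// `label` and returns the trimmed text of the cell that follows it.
// Returns null when the label is missing or is the last cell.
function getTextOfCellAfterLabel(cells, label) {
  for (var i = 0; i < cells.length - 1; i++) {
    var text = String(cells[i].textContent).trim();
    if (text.indexOf(label) === 0) {
      return String(cells[i + 1].textContent).trim();
    }
  }
  return null;
}

// Hypothetical helper: converts a count such as "11,579" to a number.
// Assumption: commas are thousands separators (US-style formatting).
function parseCount(text) {
  var digits = String(text).replace(/,/g, "").trim();
  if (!/^\d+$/.test(digits)) {
    throw new Error("Not a whole number: " + text);
  }
  return parseInt(digits, 10);
}
```

Returning null for a missing label keeps the "not found" case distinct from a real cell, which avoids the off-by-one risk of adding 1 to a -1 sentinel index.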

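These methods work fine, but there seem to be many different approaches that would also work, including the regex parsing approach suggested as an answer. RedShift's excellent link was the most helpful for analyzing the HTML and coming up with a strategy.

Cheers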

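【Question comments】:

  • I would start by searching for "excel vba web scraping".
  • Have you looked at something like this?
  • There are plenty of examples online of scraping Yahoo Finance, and probably some here too. I suggest you search for a few suitable ones using yahoo.com.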

Tags: html excel web-scraping vba


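【Answer 1】:

I would probably first use an XML parser to get the text content (or something like xmlString.replace(/<[^>]+>/g, "") to replace all tags with an empty string), and then use the following regular expressions to extract the information you need: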

/-OPR\s+(\d+\.\d+)/
/Bid:\s+(\d+\.\d+)/
/Ask:\s+(\d+\.\d+)/
/Open Interest:\s+(\d+,\d+)/

This process can easily be done in nodejs or any other language that supports regular expressions.
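
As a minimal sketch of how the tag stripping and the four patterns above might be combined in Node.js: the function name extractQuote and the result keys (last, bid, ask, openInterest) are hypothetical, and the input is assumed to be the page's HTML as a string with whitespace between the label and value elements, as in the sample markup.

```javascript
// Hypothetical helper: strips tags from an HTML string and applies the
// four patterns from the answer. Returns null for any pattern not found.
function extractQuote(html) {
  // Tags are replaced with "" (not " ") so that "<span>-</span>OPR"
  // collapses to "-OPR", which the first pattern relies on.
  var text = html.replace(/<[^>]+>/g, "");

  var patterns = {
    last: /-OPR\s+(\d+\.\d+)/,
    bid: /Bid:\s+(\d+\.\d+)/,
    ask: /Ask:\s+(\d+\.\d+)/,
    openInterest: /Open Interest:\s+(\d+,\d+)/
  };

  var result = {};
  Object.keys(patterns).forEach(function (name) {
    var match = text.match(patterns[name]);
    // Record null rather than throwing when a pattern is not found.
    result[name] = match ? match[1] : null;
  });
  return result;
}
```

Two limitations follow from the patterns as written: the \s+ requires whitespace between label and value, so markup like <th>Bid:</th><td>0.76</td> on a single line would not match, and the Open Interest pattern requires a comma, so values below 1,000 would be missed.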


Live demo:

  • Waits 1 second, then removes the tags.
  • Waits one more second, then finds all the patterns and builds a table.

var wait = true; // Set to false to execute instantly.

var elem = document.getElementById("parsingStuff");
var str = elem.textContent;

var keywords = ["-OPR", "Bid:", "Ask:", "Open Interest:"];
var output = {};
var timeout = 0;

if (wait) timeout = 1000;

setTimeout(function() { // Removing tags.
  elem.innerHTML = elem.textContent;
}, timeout);

if (wait) timeout = 2000;

setTimeout(function() { // Looking for patterns.
  for (var i = 0; i < keywords.length; i++) {
    var match = str.match(RegExp(keywords[i] + "\\s+(\\d+[\\.,]\\d+)"));
    output[keywords[i]] = match ? match[1] : ""; // Empty if the pattern is not found.
  }

  // Creating basic table of found data.
  elem.innerHTML = "";
  var table = document.createElement("table");
  for (var k in output) {
    var tr = document.createElement("tr");
    var th = document.createElement("th");
    var td = document.createElement("td");

    th.style.border = "1px solid gray";
    td.style.border = "1px solid gray";

    th.textContent = k;
    td.textContent = output[k];

    tr.appendChild(th);
    tr.appendChild(td);

    table.appendChild(tr);
  }
  elem.appendChild(table);
}, timeout);
<div id="parsingStuff">
  <div class="yfi_rt_quote_summary" id="yfi_rt_quote_summary">
    <div class="hd">
      <div class="title">
        <h2>GM Feb 2015 36.500 call (GM150220C00036500)</h2>
        <span class="rtq_exch">
        <span class="rtq_dash">-</span>OPR
        </span>
        <span class="wl_sign"></span>
      </div>
    </div>
    <div class="yfi_rt_quote_summary_rt_top sigfig_promo_1">
      <div>
        <span class="time_rtq_ticker">

        <span id="yfs_l10_gm150220c00036500">0.83</span>
        </span>
      </div>
    </div>
  </div>
  <div class="yui-u first yfi-start-content">
    <div class="yfi_quote_summary">
      <div id="yfi_quote_summary_data" class="rtq_table">
        <table id="table1">
          <tr>
            <th scope="row" width="48%">Bid:</th>
            <td class="yfnc_tabledata1">

              <span id="yfs_b00_gm150220c00036500">0.76</span>
            </td>
          </tr>
          <tr>
            <th scope="row" width="48%">Ask:</th>
            <td class="yfnc_tabledata1">

              <span id="yfs_a00_gm150220c00036500">0.90</span>
            </td>
          </tr>
        </table>
        <table id="table2">
          <tr>
            <th scope="row" width="48%">Open Interest:</th>

            <td class="yfnc_tabledata1">11,579</td>
          </tr>
        </table>
      </div>
    </div>
  </div>
</div>

【Answer comments】:
