【发布时间】:2015-02-19 08:31:30
【问题描述】:
我希望从下面的 html 示例中抓取突出显示和加边框的三个项目。我还强调了一些看起来很有用的标记。
你会怎么做?
解决方案
好吧,所以这不是一个好问题,我真的很惊讶它没有得到更多的反对!哦,好吧,这里有一些面包屑给别人。
我想要的四项信息中的三项是具有已知 id 的 span 元素的内部文本(即,“yfs_l10_gm150220c00036500”为 0.83 美元),所以我下面的帮助程序类似乎是一个体面和直接的镜头:
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetSpanTextForId
'
' Returns the inner text from a span element known by the passed id
'
' param doc: the source HTMLDocument
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetSpanTextForId(ByRef doc As HTMLDocument, ByVal spanId As String) As Double
' Error Handling
On Error GoTo ErrHandler
Dim sRoutine As String
sRoutine = cModule & ".GetSpanTextForId"
CheckArgNotNothing doc, "doc"
CheckArgNotBadString spanId, "spanId"
' Procedure
Dim oSpan As HTMLSpanElement
Set oSpan = doc.getElementById(spanId)
Check Not oSpan Is Nothing, "Could not find span with id: " & Bracket(spanId)
GetSpanTextForId = oSpan.innerText
Exit Function
ErrHandler:
Select Case DspErrMsg(sRoutine)
Case Is = vbAbort: Stop: Resume 'Debug mode - Trace
Case Is = vbRetry: Resume 'Try again
Case Is = vbIgnore: 'End routine
End Select
End Function
跨度不直接知道的唯一项目是 OpenInterest,它是表的一部分,是具有 id 的元素的第二个子元素。以下方法返回紧跟在单元格后面的单元格,其中包含我想要的文本(即“未平仓合约”)
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetOpenInterest
'
' The latest available Open Interest.
'
' param doc: the source HTMLDocument
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetOpenInterest(ByRef doc As HTMLDocument) As Integer
Dim tbl As IHTMLTable
Set tbl = GetSummaryDataTable(doc, 1)
Dim k As Integer
k = mWebScrapeHelpers.GetCellNumberForTextStartingWith(tbl, "Open Interest:")
GetOpenInterest = CInt(mWebScrapeHelpers.GetCellTextFromCellNumber(tbl, k + 1))
End Function
Function GetCellNumberForTextStartingWith(ByRef tbl As IHTMLTable, ByRef s As String) As Integer
' Error Handling
On Error GoTo ErrHandler
Dim sRoutine As String
sRoutine = cModule & ".GetCellNumberForTextStartingWith"
CheckArgNotNothing tbl, "tbl"
' Procedure
Dim tblCell As HTMLTableCell
Dim k As Integer
For Each tblCell In tbl.Cells
If tblCell.innerText Like ("*" & s) Then
GetCellNumberForTextStartingWith = k
Exit Function
End If
k = k + 1
Next
' if we got here it was not found so
GetCellNumberForTextStartingWith = -1
Exit Function
ErrHandler:
Select Case DspErrMsg(sRoutine)
Case Is = vbAbort: Stop: Resume 'Debug mode - Trace
Case Is = vbRetry: Resume 'Try again
Case Is = vbIgnore: 'End routine
End Select
End Function
Function GetCellTextFromCellNumber(ByRef tbl As IHTMLTable, ByRef nbr As Integer) As String
' Error Handling
On Error GoTo ErrHandler
Dim sRoutine As String
sRoutine = cModule & ".GetCellNumberForTextStartingWith"
CheckArgNotNothing tbl, "tbl"
Check tbl.Cells.Length > 0, "table is empty"
Check tbl.Cells.Length >= nbr, "table only has " & tbl.Cells.Length & " cells; can't get cell number " & nbr
' Procedure
GetCellTextFromCellNumber = tbl.Cells(nbr).innerText
Exit Function
ErrHandler:
Select Case DspErrMsg(sRoutine)
Case Is = vbAbort: Stop: Resume 'Debug mode - Trace
Case Is = vbRetry: Resume 'Try again
Case Is = vbIgnore: 'End routine
End Select
End Function
这些方法工作正常,但似乎有很多不同的方法可以工作,包括建议作为答案的正则表达式解析方法。 RedShift 的 excellent 链接更有助于分析 html 并提出策略。
干杯
【问题讨论】:
-
我先搜索“excel vba web scraping”。
-
你看过this这样的东西吗?
-
网上有很多关于抓取雅虎财经的例子,这里可能有一些。建议你用yahoo.com找几个合适的。
标签: html excel web-scraping vba