Scraping text from a file within HTML tags: answers

【Question title】: Scraping text from file within HTML tags
【Posted】: 2012-04-03 05:25:54
【Question】:

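I have a file that I want to extract dates from. It is an HTML source file, so it is full of code and phrases I don't need. I need to extract every instance of a date contained in a specific HTML tag: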

abbr title="((this is the text I need))" data-utime="

What is the simplest way to accomplish this?

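【Comments on the question】:

This is not data mining. Retitled and retagged as web-scraping. Data mining is essentially large-scale statistical analysis; this is at most preprocessing for data mining. Oh, and why is this tagged "excel"?

Tags: excel web-scraping extract analysis text-extraction

【Answer 1】: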

If you are using Excel VBA, set a reference to the MSHTML library (Tools - References). It is listed in the References dialog as Microsoft HTML Object Library.

Sub ScrapeDateAbbr()

    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement
    Dim sFile As String, lFile As Long
    Dim sHtml As String

    'read in the file
    lFile = FreeFile
    sFile = "C:/Users/dick/Documents/My Dropbox/Excel/Testabbr.html"
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)
    'release the file handle once the contents are in memory
    Close lFile

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem

End Sub

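I assumed the file is local because you called it a source file. If you need to download it first, you will also need a reference to MSXML and this code: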

Sub ScrapeDateAbbrDownload()

    Dim xHttp As MSXML2.XMLHTTP
    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement

    Set xHttp = New MSXML2.XMLHTTP
    xHttp.Open "GET", "file:///C:/Users/dick/Documents/My%20Dropbox/Excel/Testabbr.html"
    xHttp.send

    Do
        DoEvents
    Loop Until xHttp.readyState = 4

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = xHttp.responseText

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem

End Sub

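【Comments】:

【Answer 2】:

If you are using Java, you can use Jsoup. Your question is unclear; please elaborate on exactly what you are trying to do.

【Comments】:

So basically I have a 15 MB file that was taken from a saved web page. The page contains Facebook messages, each with a date stamp, and I want to analyze the messages by date, so I need to extract every occurrence of a date. Hope that clears it up.
If the programming language doesn't matter, then I would certainly recommend JSoup.
If you prefer Python, you can also use BeautifulSoup.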
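
For anyone taking the Python route, here is a minimal sketch of the same idea as the VBA answer: collect the title of every abbr tag that also carries a data-utime attribute. It uses the standard library's html.parser rather than BeautifulSoup, purely so that it runs without installing anything. The class name, the function name, and the sample markup (including the date strings) are made up for illustration; the real page's title format may differ.

```python
from html.parser import HTMLParser


class AbbrTitleParser(HTMLParser):
    """Collect the title of every <abbr> tag that also has a data-utime attribute."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "abbr":
            return
        attr_map = dict(attrs)
        # Same filter as the VBA answer: require a non-empty data-utime,
        # and skip tags that have no title to report.
        if attr_map.get("data-utime") and attr_map.get("title"):
            self.titles.append(attr_map["title"])


def extract_abbr_titles(html_text):
    """Return the title attributes of all matching <abbr> tags, in document order."""
    parser = AbbrTitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.titles


# Hypothetical sample standing in for the saved page.
sample_html = (
    '<div>'
    '<abbr title="Monday, April 2, 2012 at 9:15pm" data-utime="1333426500">9:15pm</abbr>'
    '<abbr title="Monday, April 2, 2012 at 9:20pm" data-utime="1333426800">9:20pm</abbr>'
    '<abbr title="not a message date">x</abbr>'
    '</div>'
)

titles = extract_abbr_titles(sample_html)
for title in titles:
    print(title)
```

For the real file, read it into a string first (for example with open(...) and the file's actual encoding) and pass that string to extract_abbr_titles. A 15 MB file should fit in memory comfortably, though feed() can also be called repeatedly on chunks if that ever becomes a concern.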