【Question title】: Get The Text Out of HTML Tag via getContext - Google Apps Script - Spreadsheets
【Posted】: 2014-08-23 18:20:56
【Question】:

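I'm in a bit of a bind with this Google Apps Script. Coming from traditional JavaScript, it has been quite a challenge. I'm currently trying to pull values from Zillow, and I've succeeded with the first few items (rent value, Zestimate, school ratings), but now I need to get the school names. This has become really troublesome, and honestly I can't seem to write a .match() for what I need. I'll post some code and see whether anyone else can work it out.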

The Zillow markup I am parsing:

<ul class="nearby-schools-list">
<li class="nearby-schools-header">
    <h4 class="nearby-schools-rating">&nbsp;</h4>
    <h4 class="nearby-schools-name">&nbsp;</h4>
    <h4 class="nearby-schools-grades">Grades</h4>
    <h4 class="nearby-schools-distance">Distance</h4>
</li>
<li class="nearby-school assigned-school">
    <span class="gs-rating-badge">
        <div class="gs-rating gs-rating-8">
            <span class="gs-rating-number">8</span>
            <span class="gs-rating-subtext">out of 10</span>
        </div>
    </span>
    <span class="nearby-schools-name"> <a href="/seattle-wa/schools/salmon-bay-school-93956/" class="ga-tracked-link track-ga-event school-name notranslate" data-ga-action="School details click" data-ga-label="HDP AB Module" data-ga-category="Homes" data-ga-standard-href="true">Salmon Bay School</a> 
        <span class="assigned-label de-emph">(assigned)</span>
    </span>
    <span class="nearby-schools-grades">K-8</span>
    <span class="nearby-schools-distance">0.3 mi</span>
</li>
<li class="nearby-school assigned-school">
    <span class="gs-rating-badge">
        <div class="gs-rating gs-rating-8">
            <span class="gs-rating-number">8</span>
            <span class="gs-rating-subtext">out of 10</span>
        </div>
    </span>
    <span class="nearby-schools-name"> <a href="/seattle-wa/schools/whitman-middle-school-93939/" class="ga-tracked-link track-ga-event school-name notranslate" data-ga-action="School details click" data-ga-label="HDP AB Module" data-ga-category="Homes" data-ga-standard-href="true">Whitman Middle</a> 
        <span class="assigned-label de-emph">(assigned)</span>
    </span>
    <span class="nearby-schools-grades">6-8</span>
    <span class="nearby-schools-distance">1.4 mi</span>
</li>
<li class="nearby-school assigned-school">
    <span class="gs-rating-badge">
        <div class="gs-rating gs-rating-9">
            <span class="gs-rating-number">9</span>
            <span class="gs-rating-subtext">out of 10</span>
        </div>
    </span>
    <span class="nearby-schools-name"> <a href="/seattle-wa/schools/ballard-high-school-92363/" class="ga-tracked-link track-ga-event school-name notranslate" data-ga-action="School details click" data-ga-label="HDP AB Module" data-ga-category="Homes" data-ga-standard-href="true">Ballard High</a> 
        <span class="assigned-label de-emph">(assigned)</span>
    </span>
    <span class="nearby-schools-grades">9-12</span>
    <span class="nearby-schools-distance">0.2 mi</span>
</li>

It's a big block, but basically I'm trying to get the text out of school-name, which is a class found under ul > li > span.nearby-schools-name > a.school-name.

Here is my attempt. Nothing I have tried returns a usable result.

// get School Names
var match = contentText.match(/<a href="([^<]*)" class="ga-tracked-link track-ga-event school-name notranslate" /g);
Browser.msgBox(match);
var schoolNameArray = new Array();

while (match.length > 0) {
    var thisSchoolName = new String(schoolName.pop());
    Browser.msgBox(thisSchoolName);
    //schoolNameArray.push(thisSchoolName);
}

var schoolNames = schoolNameArray.toString().replace(/,/g, " _ ");

Quick FAQ: I tried functions from around the web that replicate getElementsByClassName, but had no luck. I also tried grabbing the href.

【Comments】:

    Tags: javascript jquery html scripting google-apps-script


    【Solution 1】:

    Here is one approach. First, get all the elements by class name:

    var elSchoolNames = document.getElementsByClassName("nearby-schools-name");
    

    What comes back is an object. If you print the variable elSchoolNames to the console with console.log('elSchoolNames: ' + elSchoolNames);, it looks like this:

    [object HTMLCollection]
    

    Inside the [object HTMLCollection] object are more objects: it is an array-like collection of objects.

    [object HTMLHeadingElement]
    [object HTMLSpanElement]
    [object HTMLSpanElement]
    [object HTMLSpanElement] 
    

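    It is important to understand that objects normally hold key:value pairs, but this one is array-like: its members are stored by position rather than under named properties. To take the child objects out of the main object, refer to them by number, because at this level they have no property names.

    You want all of the Span elements.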

    var theSpanEl = elSchoolNames[1];
    var theSpanEl2 = elSchoolNames[2];
    var theSpanEl3 = elSchoolNames[3];
    
    console.log('textContent: ' + theSpanEl.textContent);
    

    The school name is in the object's textContent property.
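
Picking the spans out one index at a time works for three schools, but a loop generalizes to any number. The following is a minimal sketch, not the answerer's code: `collectSpanText` is a hypothetical helper name, and the `elSchoolNames` object here is a hand-built stand-in for the HTMLCollection so the snippet can run outside a browser. It filters on `tagName` so the header `h4` is skipped without hard-coding index 1.

```javascript
// Stand-in for the HTMLCollection returned by getElementsByClassName.
// Hypothetical data: in a browser these would be real element objects.
var elSchoolNames = {
  0: { tagName: 'H4', textContent: '\u00a0' },
  1: { tagName: 'SPAN', textContent: ' Salmon Bay School \n        (assigned)\n    ' },
  2: { tagName: 'SPAN', textContent: ' Whitman Middle \n        (assigned)\n    ' },
  3: { tagName: 'SPAN', textContent: ' Ballard High \n        (assigned)\n    ' },
  length: 4
};

// Collect textContent from every SPAN in an array-like collection.
// Works on anything with numeric indices and a length property.
function collectSpanText(collection) {
  var texts = [];
  for (var i = 0; i < collection.length; i++) {
    if (collection[i].tagName === 'SPAN') {
      texts.push(collection[i].textContent);
    }
  }
  return texts;
}

var spanTexts = collectSpanText(elSchoolNames);
console.log(spanTexts.length); // 3
```

An indexed `for` loop is used rather than `for...in` because `for...in` on a real HTMLCollection also visits non-element properties such as `length` and `item`.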

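    How did I know what all the objects inside the first object were, and what the first Span element contained? I looped through all of the object's properties.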

    var elSchoolNames = document.getElementsByClassName("nearby-schools-name");
    console.log('namesOfSchools: ' + elSchoolNames);
    
    // Declare the loop variables with var so they do not leak as globals.
    for (var theProperty in elSchoolNames) {
        console.log('theProperties: ' + theProperty);
        console.log('each value: ' + elSchoolNames[theProperty]);
    }
    
    var theSpanEl = elSchoolNames[1];
    
    for (var spanProperty in theSpanEl) {
        console.log('theProperties: ' + spanProperty);
        console.log('each value: ' + theSpanEl[spanProperty]);
    }
    
    console.log('textContent: ' + theSpanEl.textContent);
    

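    To get the Span elements, skip the first element (the header h4) and take every element after it. Because the collection is zero-indexed, the second element is at index 1.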

    var theSpanEl = elSchoolNames[1];
    

    Now, to see what you have, print it to the console:

    console.log('textContent: ' + theSpanEl.textContent);
    

    This gives you:

    textContent:  Salmon Bay School 
        (assigned)
    

    Of course, you will want to strip the trailing (assigned) with string methods. You don't need .match() or a regular expression for that.
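
One way to do that cleanup with plain string methods is sketched below. `cleanSchoolName` is a hypothetical helper name, and the sketch assumes the label is the literal text `(assigned)` and that nothing meaningful follows it.

```javascript
// Remove the "(assigned)" label, if present, and trim surrounding whitespace.
// cleanSchoolName is a hypothetical helper name, not part of any library.
function cleanSchoolName(rawText) {
  var labelStart = rawText.indexOf('(assigned)');
  // Keep everything before the label; keep the whole string if there is none.
  var name = labelStart === -1 ? rawText : rawText.substring(0, labelStart);
  return name.trim();
}

var raw = ' Salmon Bay School \n        (assigned)\n    ';
console.log(cleanSchoolName(raw)); // Salmon Bay School
```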

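    I just realized that if you are fetching HTML content from a website that is not yours, and that content is a string, none of this will work. The code above only applies if you inject the HTML into your own page with innerHTML.

    【Comments】: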
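
For the string case (for example, text obtained from `UrlFetchApp.fetch(url).getContentText()` in Apps Script), a regular expression with a capture group is one option. The sketch below is an assumption-laden illustration, not the answerer's method: `extractSchoolNames` is a hypothetical helper, and the pattern assumes the markup shown in the question (double-quoted attributes, plain text inside the link). Note that `String.prototype.match` with the `g` flag returns whole matches and drops capture groups, so the sketch loops with `exec` instead.

```javascript
// Extract the link text of every <a> whose class list includes "school-name".
// extractSchoolNames is a hypothetical helper; the pattern is tailored to the
// markup in the question and may break if the page structure changes.
function extractSchoolNames(html) {
  var pattern = /<a\b[^>]*\bclass="[^"]*\bschool-name\b[^"]*"[^>]*>([^<]*)<\/a>/g;
  var names = [];
  var m;
  // exec() advances through the string and exposes the capture group in m[1].
  while ((m = pattern.exec(html)) !== null) {
    names.push(m[1].trim());
  }
  return names;
}

// Shortened sample of the markup from the question.
var sampleHtml =
  '<span class="nearby-schools-name"> <a href="/seattle-wa/schools/salmon-bay-school-93956/" ' +
  'class="ga-tracked-link track-ga-event school-name notranslate">Salmon Bay School</a> ' +
  '<span class="assigned-label de-emph">(assigned)</span></span>' +
  '<span class="nearby-schools-name"> <a href="/seattle-wa/schools/ballard-high-school-92363/" ' +
  'class="ga-tracked-link track-ga-event school-name notranslate">Ballard High</a></span>';

var schoolNames = extractSchoolNames(sampleHtml);
console.log(schoolNames.join(' _ ')); // Salmon Bay School _ Ballard High
```

Because the capture group takes only the anchor's text, the `(assigned)` label in the sibling span never enters the result. Regex-based HTML parsing is brittle in general, so treat this as a stopgap that fits this particular markup.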
