Error on line 1: Content is not allowed in prolog

【Question title】: Error on line 1: Content is not allowed in prolog
【Posted】: 2019-04-04 11:52:17
【Question】:

I am trying to scrape a table of price data from a website using the following code:

function scrapeData() {
// Retrieve table as a string using Parser.
var url = "https://stooq.com/q/d/?s=barc.uk&i=d";

var fromText = '<td align="center" id="t03">';
var toText = '</td>';
var content = UrlFetchApp.fetch(url).getContentText();
var scraped = Parser.data(content).from(fromText).to(toText).build();

//Parse table using XmlService.
var root = XmlService.parse(scraped).getRootElement();
}

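I adapted this approach from one I used in an earlier, similar question, but it fails on this particular URL with the following error:

Error on line 1: Content is not allowed in prolog. (line 12, file "Stooq")

Related questions discuss the parser rejecting the text passed to it, but I have not been able to apply the solutions from those questions to my own problem. Any help would be appreciated.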

【Comments】:

Tags: google-apps-script web-scraping html-parsing

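【Solution 1】:

How about this modification?

Modification points:

In this case, the retrieved HTML needs to be modified before it is parsed. For example, in the value returned by var content = UrlFetchApp.fetch(url).getContentText(), the attribute values are not enclosed in double quotes. These need to be quoted.
The header contains merged columns.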
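
As a rough illustration of the first point, the sketch below applies the same two replace() calls used in the modified script to a made-up HTML fragment. The fragment is an assumption for demonstration only; the real page markup may differ. It shows unquoted attribute values being wrapped in double quotes and the valueless nowrap attribute being dropped, since XML requires every attribute to have a quoted value.

```javascript
// Hypothetical sample fragment (not copied from the real page).
var sample = '<tr><td align=center id=t03 nowrap>Date</td><td width=50%>Open</td></tr>';

// 1. Wrap each unquoted attribute value in double quotes.
// 2. Remove the bare "nowrap" attribute, which has no value at all.
var fixed = sample
  .replace(/=([a-zA-Z0-9\%-:]+)/g, '="$1"')
  .replace(/nowrap/g, "");

console.log(fixed);
// <tr><td align="center" id="t03" >Date</td><td width="50%">Open</td></tr>
```

Note that the second replace() removes every occurrence of the string nowrap, so it would also alter cell text that happens to contain that word.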

When these points are reflected in the script, it becomes as follows.

Modified script:

function scrapeData() {
  // Retrieve table as a string using Parser.
  var url = "https://stooq.com/q/d/?s=barc.uk&i=d";
  var fromText = '#d9d9d9}</style>';
  var toText = '<table';
  var content = UrlFetchApp.fetch(url).getContentText();
  var scraped = Parser.data(content).from(fromText).to(toText).build();

  // Modify values
  scraped = scraped.replace(/=([a-zA-Z0-9\%-:]+)/g, "=\"$1\"").replace(/nowrap/g, "");

  // Parse table using XmlService.
  var root = XmlService.parse(scraped).getRootElement();

  // Retrieve header and modify it.
  var headerTr = root.getChild("thead").getChildren();
  var res = headerTr.map(function(e) {return e.getChildren().map(function(f) {return f.getValue()})});
  res[0].splice(7, 0, "Change");

  // Retrieve values.
  var valuesTr = root.getChild("tbody").getChildren();
  var values = valuesTr.map(function(e) {return e.getChildren().map(function(f) {return f.getValue()})});
  Array.prototype.push.apply(res, values);

  // Put the result to the active spreadsheet.
  var ss = SpreadsheetApp.getActiveSheet();
  ss.getRange(1, 1, res.length, res[0].length).setValues(res);
}
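
The splice() call in the script deals with the merged header column: a header cell that spans two columns produces one fewer value than each data row, and setValues() needs every row to have the same length as the target range. The sketch below shows the idea with made-up rows. The column names and values are assumptions, not data taken from the page.

```javascript
// Hypothetical header: "Change" stands for a cell that spans two data columns.
var header = ["No.", "Date", "Open", "High", "Low", "Close", "Change", "Volume"];

// Hypothetical data rows: nine cells each, one more than the header.
var rows = [
  ["1", "2019-04-03", "160.1", "162.0", "159.8", "161.5", "+0.9%", "+1.4", "1000"],
  ["2", "2019-04-02", "158.7", "160.4", "158.2", "160.1", "+0.5%", "+0.8", "1200"]
];

// Insert a second "Change" label at index 7 so the header has nine cells too.
var paddedHeader = header.slice();
paddedHeader.splice(7, 0, "Change");

// Combine header and data, then confirm the result is rectangular.
var res = [paddedHeader].concat(rows);
var rectangular = res.every(function (row) {
  return row.length === res[0].length;
});

console.log(rectangular); // true
```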

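Notes:

Before running this modified script, please install the Parser library for Google Apps Script.
This modified script is not a general solution for arbitrary URLs. It works for the URL in your question. If you want to retrieve values from other URLs, please modify the script accordingly.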
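
When adapting the script to another URL, it may help to look at the start of the extracted string before passing it to XmlService.parse(). The "Content is not allowed in prolog" error typically means the parser found something other than whitespace before the first tag, such as leftover text or a byte order mark. The helper below is a hypothetical diagnostic, not part of the Parser library or XmlService, and it is only a rough check rather than a validator.

```javascript
// Hypothetical helper: reports how a string begins, as a rough pre-parse check.
function inspectStart(text) {
  var hasBom = text.charAt(0) === "\uFEFF"; // byte order mark at the very start
  var body = hasBom ? text.substring(1) : text;
  var first = body.search(/\S/); // index of the first non-whitespace character, or -1
  return {
    hasBom: hasBom,
    startsWithTag: first !== -1 && body.charAt(first) === "<",
    preview: body.substring(0, 30)
  };
}

// Made-up inputs for illustration.
console.log(inspectStart("2019-04-03</td><td>160.1</td>").startsWithTag); // false
console.log(inspectStart("\n<table><tbody></tbody></table>").startsWithTag); // true
```

If startsWithTag is false, the from and to strings probably need adjusting for the new page.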

If this was not what you were looking for, I apologize.

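【Discussion】:

Thanks again, Tanaike, this is exactly what I wanted. Could you explain what the .replace calls are for in this case? Also, how did you decide on the "from text" and "to text" positions in the HTML?
@redbaron1981 Thank you for your reply. I'm glad your issue was resolved. Regarding your comment: 1. You can see the difference with and without replace() by changing scraped = scraped.replace(/=([a-zA-Z0-9\%-:]+)/g, "=\"$1\"").replace(/nowrap/g, "") to scraped = scraped.replace(/=([a-zA-Z0-9\%-:]+)/g, "=\"$1\""), and then to scraped = scraped, and comparing the results. 2. I chose the from and to strings by inspecting content, the value returned by var content = UrlFetchApp.fetch(url).getContentText().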