How to find the right XPath for Google Sheets?

【Question title】: How to find the right xpath for Google sheets?
【Posted】: 2021-07-02 17:25:58
【Question description】:

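I want to scrape data from a page, but I can't work out the right XPath for Google Sheets. I want to extract the number 202 from https://www.belvilla.nl/zoeken/?land=nl&rgo=frie (at the top of the page, in "202 vakantiehuizen gevonden in Friesland").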

The XPath I get for that element is: //*[@id="result-container-items"]/div[1]/div/div/div[1]/div[1]/div[1]/strong

In Google Sheets I tried =IMPORTXML(A1;"//*[@id="result-container-items"]/div[1]/div/div/div[1]/div[1]/div[1]/strong)") and a few other forms such as =IMPORTXML(A1;"//div[@class='search-numbers']"), but none of them work. For the last one I get the error "Resource at url contents exceeded maximum size." But I suspect my XPath is wrong.

Can anyone help me? Thanks!

【Question discussion】:

See stackoverflow.com/questions/64172780/… for the error you are getting from Google Sheets.

Tags: google-sheets xpath

【Solution 1】:

IMPORTXML has its limitations, especially with elements rendered by JavaScript. However, if writing a script is an option, try UrlFetchApp.fetch() in Google Apps Script.

Code:

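function fetchUrl(url) {
  var html = UrlFetchApp.fetch(url).getContentText();
  // startString and endString must be unique, or at least be the first
  // occurrences that enclose the result we want
  var startString = 'search-result-div" ';
  var endString = 'alternate-dates-filter-bar';
  // indexOf matches literally; search() would treat the string as a regex
  var startIndex = html.indexOf(startString);
  var endIndex = html.indexOf(endString, startIndex);
  if (startIndex === -1 || endIndex === -1) {
    throw new Error('Start or end marker not found in the page');
  }
  // regexes for the number and for the text that follows it
  var numbers = /strong>([^<]+)<\/strong/;
  var text = /span>([^<]+)<\/span/;
  // collapse whitespace, then combine the number and text matches
  var content = html.substring(startIndex, endIndex).replace(/\s\s+/g, ' ');
  var numberMatch = numbers.exec(content);
  var textMatch = text.exec(content);
  if (!numberMatch || !textMatch) {
    throw new Error('Expected <strong> and <span> content not found');
  }
  return (numberMatch[1] + ' ' + textMatch[1]).trim();
}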

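Notes:

The code above is specific to the content you want to fetch. If you need anything else, you will have to update how the script processes the response.
You can reuse it on other URLs to get similar values located at the XPath given in your post.
This does not use XPath.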
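
To illustrate the first two notes, the marker-and-regex logic could be pulled into a small helper so that only the markers and the pattern change from page to page. This is a rough sketch rather than part of the answer: `extractBetween` and the sample HTML fragment are hypothetical, and the real page's markup may differ.

```javascript
/**
 * Returns the first capture group of `pattern` found between two marker
 * strings in `html`, or null if a marker or the pattern is missing.
 * Hypothetical helper, for illustration only.
 */
function extractBetween(html, startMarker, endMarker, pattern) {
  var startIndex = html.indexOf(startMarker);
  if (startIndex === -1) return null;
  var endIndex = html.indexOf(endMarker, startIndex + startMarker.length);
  if (endIndex === -1) return null;
  // Collapse runs of whitespace so the pattern does not depend on formatting.
  var content = html.substring(startIndex, endIndex).replace(/\s+/g, ' ');
  var match = pattern.exec(content);
  return match ? match[1].trim() : null;
}

// Made-up fragment shaped like the markup the answer's markers imply.
var sampleHtml =
  '<div class="search-result-div" >\n' +
  '  <strong> 202 </strong>\n' +
  '  <span>vakantiehuizen gevonden</span>\n' +
  '</div><div class="alternate-dates-filter-bar"></div>';

// Extract only the number by using just the <strong> pattern.
var count = extractBetween(
  sampleHtml,
  'search-result-div" ',
  'alternate-dates-filter-bar',
  /strong>([^<]+)<\/strong/
);
```

In Apps Script, `html` would come from `UrlFetchApp.fetch(url).getContentText()`, as in the code above. Returning null instead of throwing leaves it to the caller to decide what to show when the page layout changes.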

【Discussion】:

【Solution 2】:

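Google Sheets does not support scraping elements rendered by JavaScript. You can check this by disabling JS for the given URL: whatever is still shown on the page is the content that can be imported. In your case, the value you want cannot be retrieved with IMPORTXML.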
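
A similar check can be run from a script instead of the browser: fetch the raw HTML, which is what a client that does not run JavaScript receives, and test whether the text you want is in it. This is a minimal sketch; `containsText` and `isInStaticHtml` are hypothetical names, and it assumes IMPORTXML sees roughly the same unrendered HTML that `UrlFetchApp.fetch()` returns.

```javascript
// Pure check: does the raw (unrendered) HTML contain the text at all?
function containsText(html, needle) {
  return html.indexOf(needle) !== -1;
}

// Apps Script wrapper. UrlFetchApp does not execute the page's JavaScript,
// so the response is the static HTML only.
function isInStaticHtml(url, needle) {
  var response = UrlFetchApp.fetch(url, { muteHttpExceptions: true });
  return containsText(response.getContentText(), needle);
}
```

If `isInStaticHtml(url, 'vakantiehuizen gevonden')` returns false, the value is probably injected by JavaScript and no XPath will reach it through IMPORTXML.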

【Discussion】: