Extract matching addresses using Google Sheets RE2 regular expression syntax

【Question title】: Extract matching addresses using Google Sheets Re2 regular expression syntax
【Posted】: 2020-12-01 14:24:25
【Question description】:

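I am trying to extract every cell/range address that appears in the formula of a Google Sheets cell.

Formulas can be very complex by nature. I have tried many patterns that work in online regex testers but not with Google Sheets re2.

The example below shows both problems. Maybe I am misreading the match results, but as I understand it there are 4 matches.

Formula (ignore the logic):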

=A$13:B4+$BC$12+$DE2+F2:G2

Regular expression:

((\$?[A-Z]+\$?\d+)(:(\$?[A-Z]+\$?\d+))?)

Expected result:

[A$13:B4,$BC$12,$DE2,F2:G2]

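In the online regex tester (if I am not misreading the results) it looks fine. I am not sure whether the group matches it shows are also counted as matches, since it reports "4 matches, 287 steps".

But Google Sheets returns all of the Match 1 results: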

[A$13:B4,A$13,:B4,B4]

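and ignores the other matches. So I guess the question is how to convert the regex to re2 syntax?

Update: Following player0's comments, maybe I was not clear. This is just a simple example meant to isolate it from the other problems I ran into. It is only a string containing addresses in several relative and absolute formats. What I am looking for, however, is a broader, general solution that fits any formula, which may contain functions and references to other sheets. For example: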

=(STDEVA(Sheet1!B2:B5)+sum($A$1:$A$2))*B2

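The expected result here is Sheet1!B2:B5,$A$1:$A$2,B2

This formula contains two functions and references another sheet. It still ignores named ranges and other kinds of reference a formula might contain that I cannot think of right now. Also, the square brackets [] do not matter; they are just how the result is displayed, copied from the log, since all of this is done in a script.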

【Question discussion】:

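Share a copy of your sheet together with an example of the desired output.
i.stack.imgur.com/EpwaG.png
What if you don't use any capture groups? \$?[A-Z]+\$?\d+(?::(?:\$?[A-Z]+\$?\d+))? regex101.com/r/A5yKb5/1
@player0 I guess I was not clear enough. Please see my edit.
@Thefourthbird Since I am no regex guru I don't fully understand the use of groups, but I understand that it prevents the group matches. In any case, Google Sheets still returns only the first match, A$13:B4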

Tags: regex google-sheets google-sheets-formula re2

【Solution 1】:

Try:

=INDEX(SUBSTITUTE(TEXTJOIN(",", 1, 
 IFNA(REGEXEXTRACT(SPLIT(SUBSTITUTE(FORMULATEXT(A3), "'", "♥"), 
 "+-*/^()=<>&"), 
 "(?:.+!)?[A-Z$]+\d+(?::[A-Z$](?:\d+)?)?|(?:.+!)?[A-Z$]:[A-Z$]+"))), "♥", "'"))

Or, longer:

=INDEX(SUBSTITUTE(TEXTJOIN(",", 1, 
 IFNA(IFNA(REGEXEXTRACT(SPLIT(SUBSTITUTE(FORMULATEXT(A3), "'", "♥"), 
 "+-*/^()=<>"), "(?:.+!)?[A-Z$]+\d+(?::[A-Z$](?:\d+)?)?"), 
 REGEXEXTRACT(SPLIT(SUBSTITUTE(FORMULATEXT(A3), "'", "♥"), 
 "+-*/^()=<>"), "(?:.+!)?[A-Z$]:[A-Z$]+")))), "♥", "'"))

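【Discussion】:

@OJNSim This should cover everything you need.
I see what you did; I just don't understand why it solves the problem or, better said, why the problem occurs in the first place. Is returning only the first match simply how re2 behaves?
Also, I am actually doing this in a script rather than with the built-in functions. So your suggestion is to split the whole formula text into an array and then loop over it, (trying) to match each entry? To clarify, the "straightforward" approach works neither with the built-in functions nor in a script (which is where I started).
Also, as I commented on Jan's answer, when the /g flag is removed it too returns only the first match. So maybe re2 has such a flag, and that is actually the problem?
Not sure what your follow-up question is... A regex will always return only the specified number of groups, which is why the input needs to be split, since the number of matches is variable. regex101 is a nice tool, but it is completely useless in combination with Google Sheets.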
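Since the question is about doing this in a script, the split-then-match idea from this answer can be sketched in plain JavaScript. This is only an illustration under assumptions: extractPerToken is a hypothetical helper name, and the address pattern is a simplified one written for this sketch, not the pattern from the formula above. The delimiters are the same characters passed to SPLIT.

```javascript
// Sketch only: mirrors SPLIT + REGEXEXTRACT in plain JavaScript.
// extractPerToken is a hypothetical name, not a Sheets or Apps Script API.
function extractPerToken(formulaText) {
  // Same delimiter characters as the SPLIT call: + - * / ^ ( ) = < > &
  const tokens = formulaText
    .split(/[+\-*\/^()=<>&]+/)
    .filter((token) => token.length > 0);

  // Assumed pattern: optional "Sheet!" prefix, then an A1 cell or range.
  // No capture groups and no g flag, so at most one address per token.
  const address = /(?:[\w .']+!)?\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?/;

  const found = [];
  for (const token of tokens) {
    const m = token.match(address);
    if (m) found.push(m[0]); // m[0] is the whole matched address
  }
  return found;
}

// Both formulas from the question.
console.log(extractPerToken("=A$13:B4+$BC$12+$DE2+F2:G2"));
console.log(extractPerToken("=(STDEVA(Sheet1!B2:B5)+sum($A$1:$A$2))*B2"));
```

Because each token holds at most one reference, a single non-global match per token is enough. The sketch has limits: a function name that looks like an address, such as LOG10, would be reported too, and a quoted sheet name containing one of the delimiter characters would be split apart.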

【Solution 2】:

It seems you could use

[A-Z$]+\d+(?::[A-Z$]\d+)?

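See a demo on regex101.com.

【Discussion】:

Although it removes all the group matches, it still returns only the first match, A$13:B4. I now realise it may be the /g global flag in the demo. Is there an re2 equivalent flag?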
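The difference described in the comment above can be reproduced outside Sheets. The following is a minimal sketch in plain JavaScript, not Sheets code: it runs the question's pattern once without and once with the g flag. compareFlags is a hypothetical helper name used only for this illustration.

```javascript
// Sketch only: compareFlags is a hypothetical helper, not part of any API.
function compareFlags(formulaText) {
  // The capturing pattern from the question, without and with the g flag.
  const single = /((\$?[A-Z]+\$?\d+)(:(\$?[A-Z]+\$?\d+))?)/;
  const global = /((\$?[A-Z]+\$?\d+)(:(\$?[A-Z]+\$?\d+))?)/g;

  // Without g: the first match only, followed by its capture groups.
  const first = formulaText.match(single);
  // With g: every full match, and no capture groups.
  const all = formulaText.match(global);

  return {
    firstWithGroups: first ? Array.from(first) : [],
    allMatches: all || [],
  };
}

const result = compareFlags("=A$13:B4+$BC$12+$DE2+F2:G2");
console.log(result.firstWithGroups); // [ 'A$13:B4', 'A$13:B4', 'A$13', ':B4', 'B4' ]
console.log(result.allMatches);      // [ 'A$13:B4', '$BC$12', '$DE2', 'F2:G2' ]
```

Without g, the groups of the first match come back in the same shape as the [A$13:B4,A$13,:B4,B4] result reported in the question. As far as I know, REGEXEXTRACT has no counterpart to g, so inside a cell the options are to split the input first or to move the matching into a script.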

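【Solution 3】:

By using the /g flag I found a better way that needs no splitting. However, this works in a script and not with the built-in Sheets regex functions (i.e. REGEXEXTRACT), because I could not figure out how to format a regex string in a cell so that it includes the /g flag and is still accepted by REGEXEXTRACT as a valid regular expression.

The code is as follows: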

/* Find all predecessor cells of the input range
*/
function findPredecessor(rng){
 
  var formulaText = rng.getFormula();
  
  /* addMatchesRegex
  * supports all A1Notation addresses 
  * the 2nd regex after the | operator will match all column addresses (A:A, Sheet2!b:B, etc)
  * some NamedRanges with names like NameRange1 
  * Does not support - NamedRange with names including dot, not ending with digits 
  */
  var addMatchesRegex = /(([\w .'!]+)?(\$?[A-Z]+\$?\d+\b)(:(\$?[A-Z]+\$?\d+))?)|([\w .'!]+)?[A-Z]+:[A-Z]+/gi; 
     
  // match() returns null when nothing matches, so fall back to an empty array
  var addMatches = formulaText.match(addMatchesRegex) || [];
  
  Logger.log("%s add matched: %s",addMatches.length,addMatches);
  
  /* fullMatchRegex
  *  modify addMatches to return also strings like
  * 1. SUM, IFERROR, etc - internal sheets functions.
  * 2. NamedRanges
  * 
  */
  var fullMatchRegex = /(([\w .'!]+)?([\$A-Z.\d]*)(:(\$?[A-Z]+\$?\d*))?)/gi; 
  
  // match regex with formula
  var fullMatches =  formulaText.match(fullMatchRegex) || [];
    
  Logger.log("Full matches list: %s",fullMatches);
  
  var namedRangesAdd = analyzeMatch(addMatches,fullMatches);
    
  Logger.log("%s total predecessors: %s",namedRangesAdd.length,namedRangesAdd);
}



/* This function accepts the two regex match lists
*  and returns one unique list of all predecessor addresses
*  @param {Array} addMatches - All A1 notation addresses 
*                              plus some of the NamedRanges 
*  @param {Array} fullMatches - All A1 notation addresses, all NamedRanges,
*                               other irrelevant matches
*/
function analyzeMatch(addMatches,fullMatches){

  /*Expected 
    First parameter - holds all A1Notation addresses as well as NamedRanges
    whose names have the form [A-Z]+\d+
    A NamedRange whose name includes a dot (.) or does not end with digits
    will not be on the list
    Second parameter - contains all matches of the first list, as well as all NamedRange
    names and also irrelevant matches to be filtered, like function names and empty strings 
  */
  
  //Full Matched Addresses to be returned
  var mAddresses = [];
  
  //Remove duplicate addresses
  var uniqueMatches = 
      addMatches.filter((item,index)=>addMatches.indexOf(item)===index); 
  
  //Get all named Ranges in spread sheet
  var nr = SpreadsheetApp.getActive().getNamedRanges();
  
  // Loop Named Ranges arr 
  nr.forEach(function(item){
  
    /* Check if the name of the current Named Range
    * is included in matches
    * 1. first in addMatches list
    * 2. only if not found in the wider list */
    
    var name = item.getName();
    
    //Check if in addmatches array
    var i = uniqueMatches.indexOf(name);
    
    //Build A1Notation address of current NamedRange 
    var rng = item.getRange();
    var add = "'" + rng.getSheet().getName() + "'!" + rng.getA1Notation();    
    
    if (i > -1){
      
      //Add the address of curr NamedRange to final list 
      mAddresses.push(add);
      //Remove curr NamedRange from list
      uniqueMatches.splice(i,1);
      
    }else if (fullMatches.includes(name)){
      // Name found - add the address of the 
      //              Named Range to matched Addresses list
      
      mAddresses.push(add);    
    }
    
  });
  
  //Add all left matched addresses to final list  
  mAddresses.push(...uniqueMatches);
  
  return mAddresses;
   
}

NamedRanges are a bit more complicated. This code matches, analyses, and returns a list of all predecessor addresses, including the addresses of NamedRanges.
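To make the named-range step easier to follow without a spreadsheet at hand, here is a minimal sketch of the same idea as a pure function. It is an illustration under assumptions, not the script above: resolveNamedRanges is a hypothetical name, and the namedRanges Map (name to address) stands in for what SpreadsheetApp.getActive().getNamedRanges() would provide. The sample names and addresses are made up.

```javascript
// Sketch only: resolveNamedRanges and the namedRanges Map are hypothetical
// stand-ins for analyzeMatch and the spreadsheet's named ranges.
function resolveNamedRanges(addMatches, fullMatches, namedRanges) {
  // Drop duplicate addresses while keeping first-seen order.
  const remaining = [...new Set(addMatches)];
  const resolved = [];

  for (const [name, address] of namedRanges) {
    const i = remaining.indexOf(name);
    if (i > -1) {
      // The name looked like an address (e.g. NameRange1): replace it
      // with the address of the range it refers to.
      resolved.push(address);
      remaining.splice(i, 1);
    } else if (fullMatches.includes(name)) {
      // The name only appeared in the wider match list.
      resolved.push(address);
    }
  }

  // Whatever is left are plain A1 addresses.
  return resolved.concat(remaining);
}

// Made-up sample data for illustration.
const sampleRanges = new Map([
  ["NameRange1", "'Sheet1'!A1:A5"],
  ["Tax.Rate", "'Sheet2'!C3"],
  ["Unused", "'Sheet2'!D4"],
]);
console.log(resolveNamedRanges(
  ["B2", "NameRange1", "B2"],
  ["SUM", "B2", "NameRange1", "Tax.Rate", ""],
  sampleRanges
));
```

Named ranges that are not mentioned in either list are skipped, and function names or empty strings in the wider list are never added, because only names present in the Map are looked up.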

【Discussion】: