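How to extract a JavaScript function from a JavaScript file: answers

【Question title】: How to extract a JavaScript function from a JavaScript file
【Posted】: 2011-07-05 21:17:12
【Question】:

I need to extract an entire JavaScript function from a script file. I know the name of the function, but I don't know what its contents may be. The function may be nested inside any number of closures.

I need two output values:

The entire body of the named function that I find in the input script.
The full input script with the found named function removed.

So, assume I'm looking for the findMe function in this input script: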

function() {
  function something(x,y) {
    if (x == true) {
      console.log ("Something says X is true");
      // The regex should not find this:
      console.log ("function findMe(z) { var a; }");
    }
  }
  function findMe(z) {
    if (z == true) {
      console.log ("Something says Z is true");
    }
  }
  findMe(true);
  something(false,"hello");
}();

From this, I need the following two result values:

The extracted findMe script

function findMe(z) {
  if (z == true) {
    console.log ("Something says Z is true");
  }
}

The input script with the findMe function removed

function() {
  function something(x,y) {
    if (x == true) {
      console.log ("Something says X is true");
      // The regex should not find this:
      console.log ("function findMe(z) { var a; }");
    }
  }
  findMe(true);
  something(false,"hello");
}();

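The problems I'm dealing with:

The body of the script to find can contain any valid JavaScript code. The code or regex that finds it must be able to ignore values inside strings, multiple nested block levels, and so on.
If the function definition to find appears inside a string, it should be ignored.

Any advice on how to accomplish something like this?

Update:

It looks like regex is not the right way to do this. I'm open to pointers to parsers that could help me accomplish this. I'm looking at Jison, but would love to hear about anything else.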

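【Question comments】:

Do you need to do this in JavaScript, or can you use another language (Python, for example)?
I'm parsing the JavaScript files on the server, but I'm doing it in node.js, so JavaScript would be preferable. I'm now looking at Jison as a possible solution: zaach.github.com/jison
I just updated the question so it is no longer regex-specific. Basically, I'm looking for a solution to the problem; whether or not it involves regex doesn't matter.
Maybe you should try finding the function name with a regex and then selecting the function body with a stack: parse the file from the point where you found the function name, push a "{" (or anything else) onto the stack whenever you find one, and pop a symbol off the stack whenever you find a "}". When the stack becomes empty, you have reached the end of the function body and you're done. It is certainly not efficient, nor very elegant, but it could be a solution.
That will break on any string literal containing an unmatched opening or closing brace. Or on a comment. Or on anything else that lets you leave a brace unclosed and still be valid. I don't think there's an easy solution; you just have to buckle down and write a (simple) parser.

Tags: javascript regex parsing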

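【Solution 1】:

If the script is included in your page (something you weren't clear about) and the function is publicly accessible, then you can get the source of the function with: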

functionXX.toString();

https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Function/toString
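
As a small illustration of the toString idea, here is a sketch that assumes the function is reachable at runtime. The findMe sample is borrowed from the question, and the variable name source is only for illustration. Note that this covers the extraction half only, not removing the function from the script text.

```javascript
// Sample function borrowed from the question.
function findMe(z) {
  if (z == true) {
    console.log("Something says Z is true");
  }
}

// Function.prototype.toString returns the function's source text.
// Modern engines return the exact source slice; older engines may
// normalize whitespace or drop comments.
const source = findMe.toString();
console.log(source);
```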

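Other ideas:

1) Look at the open source code that does either JS minification or JS pretty-indenting. In both cases, that code has to "understand" the JS language in order to do its work in a fault-tolerant way. I doubt it will be pure regex, as the language is a bit more complicated than that.

2) If you control the source on the server and want to modify a particular function in it, then just insert some new JS that replaces that function at runtime with your own function. That way, you let the JS compiler identify the function for you and you simply replace it with your own version.

3) For regex, here's what I have done. It is not foolproof, but it worked for some build tools I use:

I run multiple passes (using regex in Python):

Remove all comments delimited with /* and */.
Remove all quoted strings.
Now, all that's left is non-string, non-comment JavaScript, so you should be able to regex directly on your function declaration.
If you need the function source with strings and comments back in, you'll have to reconstitute it from the original, now that you know the beginning and end of the function.

Here are the regexes I use (expressed in Python's multi-line format):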

import re

reStr = r"""
    (                               # capture the non-comment portion
        "(?:\\.|[^"\\])*"           # capture double quoted strings
        |
        '(?:\\.|[^'\\])*'           # capture single quoted strings
        |
        (?:[^/\n"']|/[^/*\n"'])+    # any code besides newlines or string literals
        |
        \n                          # newline
    )
    |
    (/\*  (?:[^*]|\*[^/])*   \*/)       # /* comment */
    |
    (?://(.*)$)                     # // single line comment
    $"""    

reMultiStart = r"""         # start of a multiline comment that doesn't terminate on this line
    (
        /\*                 # /* 
        (
            [^\*]           # any character that is not a *
            |               # or
            \*[^/]          # * followed by something that is not a /
        )*                  # any number of these
    )
    $"""

reMultiEnd = r"""           # end of a multiline comment that didn't start on this line
    (
        ^                   # start of the line
        (
            [^\*]           # any character that is not a *
            |               # or
            \*+[^/]         # * followed by something that is not a /
        )*                  # any number of these
        \*/                 # followed by a */
    )
"""

regExSingleKeep = re.compile("// /")                    # lines that have single lines comments that start with "// /" are single line comments we should keep
regExMain = re.compile(reStr, re.VERBOSE)
regExMultiStart = re.compile(reMultiStart, re.VERBOSE)
regExMultiEnd = re.compile(reMultiEnd, re.VERBOSE)
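
For the node.js setting in the question, the multi-pass idea described above (blank out comments and strings, locate the declaration, then rebuild from the original text) might be sketched roughly as follows. This is not a port of the Python regexes; the names COMMENT_OR_STRING, mask and extractFunction are made up for this example, and it assumes the input contains no regex literals or template literals, which would need extra handling.

```javascript
// Matches block comments, line comments, and quoted strings.
// Assumption: no regex literals or template literals in the input.
const COMMENT_OR_STRING =
  /\/\*[\s\S]*?\*\/|\/\/[^\n]*|"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*'/g;

// Blank out comments and strings with spaces so every offset stays valid.
function mask(src) {
  return src.replace(COMMENT_OR_STRING, (m) => m.replace(/[^\n]/g, " "));
}

// Hypothetical helper: returns { extracted, remaining }, or null if not found.
function extractFunction(src, name) {
  const masked = mask(src);
  // name is assumed to be a plain identifier (no regex metacharacters).
  const decl = new RegExp("\\bfunction\\s+" + name + "\\s*\\(").exec(masked);
  if (!decl) return null;
  let depth = 0;
  for (let i = masked.indexOf("{", decl.index); i !== -1 && i < masked.length; i++) {
    if (masked[i] === "{") depth++;
    else if (masked[i] === "}" && --depth === 0) {
      // Rebuild from the original text using the offsets found in the mask.
      return {
        extracted: src.slice(decl.index, i + 1),
        remaining: src.slice(0, decl.index) + src.slice(i + 1),
      };
    }
  }
  return null; // unbalanced braces
}

const input = [
  "function outer() {",
  "  // function findMe(a) { }",
  '  console.log("function findMe(z) { var a; }");',
  "  function findMe(z) {",
  '    if (z) { console.log("}"); }',
  "  }",
  "  findMe(true);",
  "}",
].join("\n");

const result = extractFunction(input, "findMe");
console.log(result.extracted);
```

Because the mask keeps every character at its original offset, the indexes found in the masked text can be applied directly to the untouched source, so strings and comments inside the function come back intact.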

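This sounds pretty messy to me. You might be better off explaining what problem you're really trying to solve, so people can help find a more elegant solution to the actual problem.

【Comments】:

Thanks, I'm aware of that. But in this case it is being done on the server, and I'm just dealing with plain text. The JavaScript isn't included in my page.
OK, I added a few more options to my answer.
@jfriend00, I don't see the regexes handling JS regex literals anywhere. ;-)
Also, your comment handling will break on /* foo **/.
+1 for the good idea of looking at existing minifiers and other code-parsing tools. I'd also suggest looking at JSLint.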

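【Solution 2】:

I built a solution in C# using plain old string methods (no regex), and it works with nested functions as well. The underlying principle is to count braces and check for unbalanced closing braces. Caveat: this won't work when braces are part of a comment, but you can easily enhance this solution by first stripping comments from the code before parsing the function boundaries.

I first added this extension method to extract all indexes of a match in a string (source: More efficient way to get all indexes of a character in a string)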

    /// <summary>
    /// Source: https://stackoverflow.com/questions/12765819/more-efficient-way-to-get-all-indexes-of-a-character-in-a-string
    /// </summary>
    public static List<int> AllIndexesOf(this string str, string value)
    {
        if (String.IsNullOrEmpty(value))
            throw new ArgumentException("the string to find may not be empty", "value");
        List<int> indexes = new List<int>();
        for (int index = 0; ; index += value.Length)
        {
            index = str.IndexOf(value, index);
            if (index == -1)
                return indexes;
            indexes.Add(index);
        }
    }

I defined this struct for easy reference to function boundaries:

    private struct FuncLimits
    {
        public int StartIndex;
        public int EndIndex;
    }

Here's the main function where I parse the boundaries:

    public void Parse(string file)
    {
        List<FuncLimits> funcLimits = new List<FuncLimits>();

        List<int> allFuncIndices = file.AllIndexesOf("function ");
        List<int> allOpeningBraceIndices = file.AllIndexesOf("{");
        List<int> allClosingBraceIndices = file.AllIndexesOf("}");

        for (int i = 0; i < allFuncIndices.Count; i++)
        {
            int thisIndex = allFuncIndices[i];
            bool functionBoundaryFound = false;

            int testFuncIndex = i;
            int lastIndex = file.Length - 1;

            while (!functionBoundaryFound)
            {
                //find the next function index or last position if this is the last function definition
                int nextIndex = (testFuncIndex < (allFuncIndices.Count - 1)) ? allFuncIndices[testFuncIndex + 1] : lastIndex;

                var q1 = from c in allOpeningBraceIndices where c > thisIndex && c <= nextIndex select c;
                var qTemp = q1.Skip<int>(1); //skip the first element as it is the opening brace for this function

                var q2 = from c in allClosingBraceIndices where c > thisIndex && c <= nextIndex select c;

                int q1Count = qTemp.Count<int>();
                int q2Count = q2.Count<int>();

                if (q1Count == q2Count && nextIndex < lastIndex)
                    functionBoundaryFound = false; //next function is a nested function, move on to the one after this
                else if (q2Count > q1Count)
                {
                    //we found the function boundary... just need to find the closest unbalanced closing brace 
                    FuncLimits funcLim = new FuncLimits();
                    funcLim.StartIndex = q1.ElementAt<int>(0);
                    funcLim.EndIndex = q2.ElementAt<int>(q1Count);
                    funcLimits.Add(funcLim);

                    functionBoundaryFound = true;
                }
                testFuncIndex++;
            }
        }
    }

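【Comments】:

【Solution 3】:

Regexes can't do this. What you need is a tool that parses JavaScript in a compiler-accurate way, builds up a structure representing the shape of the JavaScript code, enables you to find the function you want and print it out, and enables you to remove the function definition from that structure and regenerate the remaining JavaScript text.

Our DMS Software Reengineering Toolkit can do this, using its JavaScript front end. DMS provides general parsing, abstract syntax tree building/navigating/manipulation, and prettyprinting of (valid!) source text from a modified AST. The JavaScript front end provides DMS with a compiler-accurate definition of JavaScript. You can point DMS/JavaScript at a JavaScript file (or even various kinds of dynamic HTML with embedded script tags containing JavaScript) and have it produce an AST. A DMS pattern can be used to find your function: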

  pattern find_my_function(r:type,a: arguments, b:body): declaration
     " \r my_function_name(\a) { \b } ";

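DMS can search the AST for a matching tree with the specified structure; because this is an AST match and not a string match, line breaks, whitespace, comments and other trivial differences won't fool it. [What you didn't say is what to do if you have more than one function of that name in different scopes: which one do you want?]

Having found the match, you can ask DMS to print just that matched code, which acts as your extraction step. You can also ask DMS to remove the function using a rewrite rule: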

  rule remove_my_function(r:type,a: arguments, b:body): declaration->declaration
     " \r my_function_name(\a) { \b } " -> ";";

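and then prettyprint the resulting AST. DMS will preserve all the comments properly.

What this does not do is check that removing the function doesn't break your code. After all, it may be in a scope where it directly accesses variables defined locally in that scope. Moving it to another scope now means it can't reference those variables.

To detect this problem, you need not only a parser but also a symbol table that maps identifiers in the code to their definitions and uses. The removal rule then has to add a semantic condition to check for this. DMS provides the machinery to build such a symbol table from the AST using an attribute grammar.

To fix this problem, when removing the function it may be necessary to modify the function to accept additional arguments replacing the local variables it accesses, and to modify the call sites to pass in what amounts to references to the local variables. This can be implemented with a modest-sized set of DMS rewrite rules that check the symbol tables.

So removing such a function may be a lot more complex than just moving the code.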
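
To make the scoping concern concrete without any DMS machinery, here is a plain JavaScript sketch. The names outer, prefix, findMeMoved and findMeFixed are invented for this example; it only illustrates the failure mode and the extra-argument repair described above, not how DMS itself would perform the rewrite.

```javascript
// Before: findMe is nested and reads `prefix` from the enclosing scope.
function outer() {
  const prefix = "Something says";
  function findMe(z) {
    return prefix + " Z is " + z;
  }
  return findMe(true);
}

// After a naive extraction: the same text at top level cannot see `prefix`.
function findMeMoved(z) {
  return prefix + " Z is " + z; // ReferenceError when called
}

// One possible repair: pass the captured variable as an extra argument
// and update every call site accordingly.
function findMeFixed(prefix, z) {
  return prefix + " Z is " + z;
}

let naiveError = null;
try {
  findMeMoved(true);
} catch (e) {
  naiveError = e;
}

console.log(outer());
console.log(naiveError instanceof ReferenceError);
console.log(findMeFixed("Something says", true));
```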

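【Comments】:

【Solution 4】:

I am almost afraid that regex cannot do this job. I think it is the same as trying to parse XML or HTML with regex, a topic that has already caused various religious debates on this forum.

Edit: please correct me if this is NOT the same as trying to parse XML.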
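
A quick way to see the nesting problem is to try the two obvious single-pattern attempts on a small input. This is only an illustrative sketch: the sample script is adapted from the question, and the variable names are made up. JavaScript's RegExp has no way to count matching braces, so the pattern either stops too early or runs too far.

```javascript
const script = [
  "function findMe(z) {",
  "  if (z == true) {",
  '    console.log("Something says Z is true");',
  "  }",
  "}",
  "function other() { return 1; }",
].join("\n");

// Lazy match: stops at the first "}", which closes the inner if block.
const lazy = script.match(/function findMe\([^)]*\)\s*\{[\s\S]*?\}/)[0];

// Greedy match: runs on to the last "}" in the file, swallowing other().
const greedy = script.match(/function findMe\([^)]*\)\s*\{[\s\S]*\}/)[0];

console.log(lazy);
console.log(greedy.includes("function other()"));
```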

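【Comments】:

Although this doesn't answer the question, I'm giving it +1 because this would be an abuse of regex. Regex is in no way meant for parsing input that is recursive in nature (a programming language).
That's what I'm afraid of too. I'm looking into Jison to see if it helps. zaach.github.com/jison
NP-complete, right? What does that have to do with anything? Like most languages, the JavaScript grammar is a context-free grammar. Parsing is by and large a deterministic polynomial-time operation. Furthermore, your reduction to general parsing is incorrect (it does not constitute a valid iff, since it only goes one way). While being able to parse would solve this problem, this is a sub-problem of parsing, and might be easier, possibly even solvable with regex.
@davin You say the JS grammar is a context-free grammar (hence it can be represented/written/validated by a pushdown automaton). Regexes are regular, and therefore equivalent to finite state automata. Finite state automata cannot handle context-free grammars.
@davin On parsing non-regular input with regex, granted: stackoverflow.com/questions/1732348/…

【Solution 5】:

I guess you would have to build and use a string tokenizer for this job.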

function tokenizer(str){
  var stack = []; // stack of opening tokens
  var last = ""; // most recent opening token

  // token pairs (closing -> opening): sub-blocks, strings, regex
  var matches = {
    "}":"{",
    "'":"'",
    '"':'"',
    "/":"/"
  };

  // start with the function declaration
  // (note: this finds the first textual match, even one inside a string)
  var decl = str.match(/function[ ]+findMe\([^\)]*\)[^\{]*\{/);
  if(!decl) return str; // declaration not found: nothing to remove

  // move everything before the declaration to the result
  var result = str.slice(0, decl.index);
  // everything after the declaration goes to the stream that will be parsed
  var stream = str.slice(decl.index + decl[0].length);

  // init stack with the function's opening brace
  stack.push("{");
  last = "{";

  // while still in this function
  while(stack.length > 0){

    // determine next token
    var m = stream.match(/(?:\{|\}|"|'|\/|\\)/);
    if(!m) break; // unbalanced input: give up
    var needle = m[0];

    if(needle == "\\"){
      // escape character => skip it together with the escaped character
      stream = stream.slice(m.index + 2);
      continue;

    }else if(last == matches[needle]){
      // this token closes the current one: pop the stack and reset last
      stack.pop();
      last = stack[stack.length-1];

    }else if(last == "{"){
      // we are not inside a string or regex (last would be " or ' or /)
      // so push the opening token onto the stack
      stack.push(needle);
      last = needle;
    }

    // cut away everything up to and including the token
    stream = stream.slice(m.index + 1);
  }

  // the input script with the function removed
  return result + stream;
}

Oh, I forgot the tokens for comments... but I guess you get the idea of how it works now...
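
As one possible way to cover the missing comment tokens, here is a character-by-character sketch that treats comments and string literals as opaque while counting braces. It is not the tokenizer above; findBlockEnd is a hypothetical helper name, and the sketch assumes the input has no regex literals or template literals.

```javascript
// Returns the index of the "}" that closes the "{" at openIndex, or -1.
// Assumption: no regex literals or template literals in the input.
function findBlockEnd(src, openIndex) {
  let depth = 0;
  let i = openIndex;
  while (i < src.length) {
    const ch = src[i];
    const next = src[i + 1];
    if (ch === "/" && next === "/") {
      // line comment: skip to the end of the line
      const eol = src.indexOf("\n", i);
      if (eol === -1) return -1;
      i = eol + 1;
    } else if (ch === "/" && next === "*") {
      // block comment: skip past the closing */
      const close = src.indexOf("*/", i + 2);
      if (close === -1) return -1;
      i = close + 2;
    } else if (ch === '"' || ch === "'") {
      // string literal: skip to the matching unescaped quote
      i++;
      while (i < src.length && src[i] !== ch) {
        if (src[i] === "\\") i++; // skip the escaped character
        i++;
      }
      i++; // step over the closing quote
    } else {
      if (ch === "{") depth++;
      else if (ch === "}" && --depth === 0) return i;
      i++;
    }
  }
  return -1;
}

const body = [
  "function findMe(z) {",
  "  // a stray } in a line comment",
  "  /* and a { in a block comment */",
  '  if (z) { console.log("closing } in a string"); }',
  "}",
  "findMe(true);",
].join("\n");

const end = findBlockEnd(body, body.indexOf("{"));
console.log(body.slice(0, end + 1));
```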

【Comments】: