How to find a JSON string in an HTML file

【Question Title】: How to find JSON string in HTML file
【Posted】: 2020-01-22 02:24:30
【Question Description】:

I am trying to use JavaScript to find plain-text JSON in a web page. The JSON is displayed as plain text in the browser, but it may be split across separate HTML tags. Example:

<div>
{"kty":"RSA","e":"AQAB","n":"mZT_XuM9Lwn0j7O_YNWN_f7S_J6sLxcQuWsRVBlAM3_5S5aD0yWGV78B-Gti2MrqWwuAhb_6SkBlOvEF8-UCHR_rgZhVR1qbrxvQLE_zpamGJbFU_c1Vm8hEAvMt9ZltEGFS22BHBW079ebWI3PoDdS-DJvjjtszFdnkIZpn4oav9fzz0
</div>
<div>
xIaaxp6-qQFjKXCboun5pto59eJnn-bJl1D3LloCw7rSEYQr1x5mxhIxAFVVsNGuE9fjk0ueTDcMUbFLPYn6PopDMuN0T1B2D1Y8ClItEVbVDFb-mRPz8THJ_gexJ8C20n8m-pBlpL4WyyPuY2ScDugmfG7UnBGrDmS5w"}
</div>

I have tried using this regular expression:

{"?\w+"?:[^}<]+(?:(?:(?:<\/[^>]+>)[^}<]*(?:<[^>]+>)+)*[^}<]*)*}

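The problem is that it does not work with nested JSON.

I could also use JavaScript to count the { and } characters to find where the JSON actually ends, but there must be a better option than this slow and clumsy approach.

Many thanks.

Update: maybe there is no better way to do this. Here is my current code (a bit verbose, but that may be necessary):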

// Matches the start of a JSON object: {, optional whitespace, a quoted key, then a colon
let regex = /{\s*"\w+"\s*:/g;

// Matches both opening and closing curly brackets
let brackets = /[{}]/g;

let arr0, arr;
// Find the next place where a JSON object appears to start
arr0 = regex.exec(body);
if (arr0 === null) { // Nothing found
    return Promise.resolve();
}

try {
    brackets.lastIndex = regex.lastIndex; // Just after the start of the current JSON
    let count = 1;
    // Count { and } to find the end of the JSON.
    while ((count !== 0) && ((arr = brackets.exec(body)) !== null)) {
        count += (arr[0] === "{" ? 1 : -1);
    }

    // Unless something unusual happened, the complete JSON has been found when count === 0
    let lastIdx = brackets.lastIndex;
    let json = body.substring(arr0.index, lastIdx);

    try {
        let parsed = JSON.parse(json);
        // Process the JSON here to get the original message
    } catch (error) {
        console.log(error);
    }

...

} catch (err) {
    console.log(err);
}
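
The brace-counting idea can also be written as a self-contained function. The following is only a sketch, not the asker's code: the names extractJsonObjects and findMatchingBrace are hypothetical, and it assumes the input is already tag-free text (for example taken from innerText) with the fragments joined so that no raw line break sits inside a string value. Unlike a plain /[{}]/ scan, it skips braces that occur inside string values.

```javascript
// Hypothetical helper: returns the index of the "}" that closes the "{"
// at position `open`, or -1 if the braces never balance.
function findMatchingBrace(text, open) {
  let depth = 0;
  let inString = false;
  for (let i = open; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (ch === "\\") i++;                 // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;                      // braces inside strings do not count
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      if (depth === 0) return i;
    }
  }
  return -1;
}

// Hypothetical helper: returns every top-level JSON object found in `text`.
function extractJsonObjects(text) {
  const results = [];
  let start = text.indexOf("{");
  while (start !== -1) {
    const end = findMatchingBrace(text, start);
    let parsedOk = false;
    if (end !== -1) {
      try {
        results.push(JSON.parse(text.slice(start, end + 1)));
        parsedOk = true;
      } catch (error) {
        // Balanced braces, but not valid JSON: try the next "{" instead.
      }
    }
    // After a successful parse, continue past the object; otherwise retry
    // from the next opening brace.
    start = text.indexOf("{", parsedOk ? end + 1 : start + 1);
  }
  return results;
}
```

For example, extractJsonObjects('abc {"a":{"b":"}"}} tail {"c":1}') yields two objects, the first of which is nested and contains a "}" inside a string. The scan is still linear brace counting, so it does not remove the need to count; it only makes the counting robust against braces in string values.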

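【Question Discussion】:

A general solution without constraints is hard. Maybe search for elements whose textContent starts with {, then try to evaluate it; if it does not parse, append the text of its next sibling, and so on. Don't use a regular expression.
@CertainPerformance Unfortunately, in my case the JSON does not always appear at the start of an element, but fortunately they all begin with the same element (the one I am searching for; the code above is a generalization). So for now I will still go with counting brackets...

Tags: javascript json regex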
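
The sibling-walking suggestion in the first comment could look roughly like the sketch below. It is an illustration under assumptions, not code from the thread: parseAcrossChunks is a hypothetical name, and chunks stands in for the trimmed text of consecutive sibling elements (in a browser it might be built with something like Array.from(parent.children, el => el.textContent.trim())). It only handles JSON that begins at the start of an element, which is exactly the limitation the asker points out.

```javascript
// Hypothetical helper: `chunks` is an array of strings, one per sibling
// element, already trimmed. Starting at each chunk that begins with "{",
// keep appending the following chunks until JSON.parse succeeds.
function parseAcrossChunks(chunks) {
  const found = [];
  for (let i = 0; i < chunks.length; i++) {
    if (!chunks[i].startsWith("{")) continue;
    let candidate = "";
    for (let j = i; j < chunks.length; j++) {
      candidate += chunks[j];
      try {
        found.push(JSON.parse(candidate));
        i = j; // resume the outer loop after the last chunk consumed
        break;
      } catch (error) {
        // Not complete yet: append the next sibling's text and retry.
      }
    }
  }
  return found;
}
```

With chunks such as ['hello', '{"a":"xy', 'z","b":{"c":1}}', 'bye'], the function returns a single object whose a property is "xyz". The parser itself decides where the object ends, so nested objects need no special handling, at the cost of re-parsing the growing candidate on every step.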

【Solution 1】:

That is not possible in general, but you can take the innerText of the parent element and parse it:

console.log(JSON.parse(document.getElementById('outer').innerText.replace(/\s|\n/g, '')));

<div id="outer">
<div>
{"kty":"RSA","e":"AQAB","n":"mZT_XuM9Lwn0j7O_YNWN_f7S_J6sLxcQuWsRVBlAM3_5S5aD0yWGV78B-Gti2MrqWwuAhb_6SkBlOvEF8-UCHR_rgZhVR1qbrxvQLE_zpamGJbFU_c1Vm8hEAvMt9ZltEGFS22BHBW079ebWI3PoDdS-DJvjjtszFdnkIZpn4oav9fzz0
</div>
<div>
xIaaxp6-qQFjKXCboun5pto59eJnn-bJl1D3LloCw7rSEYQr1x5mxhIxAFVVsNGuE9fjk0ueTDcMUbFLPYn6PopDMuN0T1B2D1Y8ClItEVbVDFb-mRPz8THJ_gexJ8C20n8m-pBlpL4WyyPuY2ScDugmfG7UnBGrDmS5w"}
</div>
</div>

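But this can fail in some cases, for example when a string value legitimately contains whitespace (the replace strips it) or when the parent element holds other text besides the JSON.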
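
One way the whitespace problem might be narrowed is to remove only the line breaks (and the indentation around them) that separate the fragments, instead of every whitespace character. This is a sketch under an assumption: that the fragments are split only at line breaks, as the text of adjacent block-level elements typically is. joinLines is a hypothetical name, and sample is made-up data for illustration.

```javascript
// Hypothetical helper: removes line breaks together with any whitespace
// directly around them, but keeps spaces that sit inside a line.
function joinLines(text) {
  return text.replace(/\s*[\r\n]+\s*/g, "").trim();
}

// Made-up sample: the "key" value is split across two lines, while the
// "name" value contains a space that must be preserved.
const sample = '{"name":"John Smith",\n"key":"abc\n  def"}';

// Stripping every whitespace character changes the data.
const stripped = JSON.parse(sample.replace(/\s/g, ""));

// Removing only the line breaks keeps the space in "name".
const joined = JSON.parse(joinLines(sample));

console.log(stripped.name, "|", joined.name, "|", joined.key);
```

If a value happened to contain a space exactly at the point where it was split, that space would still be lost, so this only reduces the failure cases rather than eliminating them.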

【Discussion】:

When I said I needed to filter out the HTML tags, I had clearly overlooked .innerText, but it seems I still need to count { and }.