How to remove certain comments from a json file? (/*)

【Question title】: How to remove certain comments from a json file? (/*)
【Posted】: 2021-09-03 11:04:48
【Question】:

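I have about 500 json files with comments in them. Trying to update a field in one of these files with a new value throws an error. I managed to use commentjson to strip comments of the form // some text, after which the json files update without throwing errors.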

But about 100 of the json files have comments that look like this:

  /*

   1. sometext.
        i. sometext
        ii. sometext 
   2. sometext

  */

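commentjson simply crashes whenever /* is present. If I remove the /* comment by hand and run the code, it works: the file is updated and any // comments are removed. How can I write some code that handles /* and all the text between /* and */?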

Here is my current code, which can remove // comments:

import commentjson

# i, file_name and password come from the surrounding script (not shown)
with open(f"{i['Location']}\\{file_name}", 'r') as f:
    json_info = commentjson.load(f)  # parse the file; // comments are dropped
json_info['password'] = password

with open(f"{i['location_Daily']}\\{file_name}", 'w') as f:
    commentjson.dump(json_info, f, indent=4)  # write the updated password
    print("updated")

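【Question comments】:

Why do you want to remove them?
@OlvinRoght Comments aren't even valid JSON, so most JSON parsers will choke when trying to read these files (and whoever created them should be given a stern talking-to ;)
@Iguananaut, there are JSON parsers that support the JSON5 standard.
@Iguananaut: the commentjson package does support comments: "commentjson (Comment JSON) is a Python package that helps you create JSON files with Python and JavaScript style inline comments." Just not this style.
I need to remove them because my code throws an error and does not update the file when I try to change a value in a file that contains comments. So far I have managed to handle the // comments, but /* is now causing problems.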

Tags: python json

【Solution 1】:

You can use a different library, such as json5 or pyjson5, or any other library that supports JSON5.

import json5
import pyjson5

data = '''
{
    "something": [
        ["any"],
        ["thing", "here", 10]    // This is comment 1
    ],
    /* While this
    is
    comment 2 */
    "car": [
        ["and", "another", "here"], /* Last comment */
    ]
}
'''

print(json5.loads(data))
print(pyjson5.loads(data))

Output

$ python3 script.py 
{'something': [['any'], ['thing', 'here', 10]], 'car': [['and', 'another', 'here']]}
{'something': [['any'], ['thing', 'here', 10]], 'car': [['and', 'another', 'here']]}

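【Comments】:

It's important to note that pyjson5 is much faster than json5, and reportedly much faster than the pure-Python json implementation as well. See the Performance section of the pyjson5 docs and the Known issues section of the json5 project description.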

【Solution 2】:

You have a few options:

Read the whole file into a string, then preprocess the text with a regular expression. For example:

import re
import commentjson

with open(...) as f:
    json_text = f.read()
# remove everything from '/*' to '*/', where the text in between consists of
# - a '*' character that is *not* followed by '/'
# - any character that is not '*'
without_comments = re.sub(r"/\*(?:\*(?!/)|[^*])*\*/", "", json_text)
json_info = commentjson.loads(without_comments)

Note that this approach breaks if the file also contains JSON strings with /* and */ inside them. A regular expression is not a JSON parser.
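If string values containing /* are a real concern, one way to reduce that risk is to let the regular expression match string literals and line comments as well, and delete only the block comments. The following is a minimal sketch under that assumption, not a full JSON lexer; the names `_TOKEN` and `strip_block_comments` are made up for illustration, and the standard json module is used here only to keep the example self-contained.

```python
import json
import re

# One alternation, tried left to right at each position:
#   1. a double-quoted string (with backslash escapes)  -> kept
#   2. a // or # line comment                           -> kept for commentjson
#   3. a /* ... */ block comment                        -> removed
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'
    r'|//[^\n]*'
    r'|#[^\n]*'
    r'|/\*.*?\*/',
    re.DOTALL,
)

def strip_block_comments(text):
    """Return text with /* ... */ comments removed, leaving strings intact."""
    def replace(match):
        token = match.group(0)
        return "" if token.startswith("/*") else token
    return _TOKEN.sub(replace, text)

sample = '{"a": "keep /* this */", /* drop\n this */ "b": 1}'
cleaned = strip_block_comments(sample)
print(json.loads(cleaned))  # {'a': 'keep /* this */', 'b': 1}
```

Because strings and line comments are consumed as whole tokens, a /* that appears inside either of them is never treated as the start of a block comment. The cleaned text can then be handed to commentjson.loads as in the snippet above.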

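Try updating the parser that the commentjson project uses to parse JSON. Looking at the project source code, they use the Lark parsing library, so you could patch the module with extra grammar.

I noticed that the main branch already has a grammar rule that defines multi-line comments: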

COMMENT: "/*" /(.|\\n)+?/ "*/"
       | /(#|\\/\\/)[^\\n]*/

But that isn't part of a release yet. You can, however, reuse that rule:

from commentjson import commentjson as implementation
from lark.reconstruct import Reconstructor

serialized = implementation.parser.serialize()
for tok in serialized["parser"]["lexer_conf"]["tokens"]:
    if tok["name"] != "COMMENT":
        continue
    if tok["pattern"]["value"].startswith("(#|"):
        # only supports `#` or `//` comments, add block comments
        tok["pattern"]["value"] = r'(?:/\*(?:\*(?!/)|[^*])*\*/|(#|\/\/)[^\n]*)'
    break

implementation.parser = implementation.parser.deserialize(serialized, None, None)

I used my own regular expression in that grammar update rather than the version the project uses.
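To see what that replacement pattern accepts, it can be exercised with plain `re`, outside of Lark. This is only a quick check of the regular expression itself, on a made-up sample with no comment markers inside strings; it does not show how Lark's lexer applies the pattern.

```python
import re

# The same pattern that the patch above assigns to the COMMENT token.
COMMENT = re.compile(r'(?:/\*(?:\*(?!/)|[^*])*\*/|(#|\/\/)[^\n]*)')

sample = (
    '{\n'
    '  /* block\n'
    '     comment */\n'
    '  "a": 1, // line comment\n'
    '  "b": 2  # hash comment\n'
    '}'
)

comments = [m.group(0) for m in COMMENT.finditer(sample)]
for text in comments:
    print(repr(text))
# '/* block\n     comment */'
# '// line comment'
# '# hash comment'
```

The first alternative covers block comments, including ones that span several lines or contain stray `*` characters, and the second keeps the `#` and `//` line comments working as before.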

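Find a different library to parse the input. There are several options that claim to support parsing JSON with this same syntax. I haven't tried any of them and make no claims about their usability or performance.

【Comments】:

@LewisGreen: I added code that updates the commentjson parser to support block comments directly.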