Python & JSON: ValueError: Unterminated string starting at: (Answers)

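【Question title】: Python & JSON: ValueError: Unterminated string starting at:
【Posted】: 2014-12-19 21:35:36
【Question description】:

I have read multiple StackOverflow posts about this, plus most of the top 10 Google results. Where my problem differs is that I use one Python script to create my JSON files, and the next script, run not 10 minutes later, cannot read that very file.

Short version: I generate leads for my online business. I am trying to learn Python so that I can analyze those leads better. I am scouring 2 years' worth of leads, with the intent of keeping the useful data and dropping anything personal (email addresses, names, etc.) while also saving 30,000+ leads into a few dozen files for easy access.

So my first script opens every individual lead file (30,000+) and determines the date it was captured from the timestamp in the file. It then saves that lead under the appropriate key in a dict. Once all the data has been aggregated into this dict, text files are written out using json.dumps.

The dict's structure is: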

addData['lead']['July_2013'] = { ... }

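The 'lead' key can be lead, partial, and a few others, and the 'July_2013' key is a date-based key that can be any combination of the full month name and 2013 or 2014, going back to 'February_2013'.

The full error is this:

ValueError: Unterminated string starting at: line 1 column 9997847 (char 9997846)

But I have looked at the file manually and my IDE says there are only 76,655 characters in the file. So how did it get to 9997846?

The file that fails is the 8th one to be read; the other 7, and all the files that come after it, read in fine via json.loads.

Python says there is an unterminated string, so I looked at the end of the JSON in the file that failed and it appears to be fine. I have seen mention of newlines being \n in JSON, but this string is all on one line. I have seen mention of \ versus \\, but on a quick look through the whole file I did not see any backslashes. Other files do contain \ and they read in fine. And these files were all created by json.dumps.

I cannot post the file because it still has personal information in it. Manually trying to validate the JSON of a 76,000-character file is not really viable.

Thoughts on how to debug this would be appreciated. In the meantime I am going to try to rebuild the files to see if this was just a one-off bug, but that takes a while.

Python 2.7 via Spyder & Anaconda
Windows 7 Pro

--- Edit --- As requested, I am posting the writing code here: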

from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s

import os
import json
import copy
from datetime import datetime
import time


global leadDir
global archiveDir
global aggLeads


def aggregate_individual_lead_files():
    """

    """

    # Get the aggLead global and 
    global aggLeads

    # Get all the Files with a 'lead' extension & aggregate them
    exts = [
        'lead',
        'partial',
        'inp',
        'err',
        'nobuyer',
        'prospect',
        'sent'
    ]

    for srchExt in exts:
        agg = {}
        leads = f.recursiveGlob(leadDir, '*.cd.' + srchExt)
        print "There are {} {} files to process".format(len(leads), srchExt)

        for lead in leads:
            # Get the Base Filename
            fname = f.basename(lead)
            #uniqID = st.fetchBefore('.', fname)

            #print "File: ", lead

            # Get Lead Data
            leadData = json.loads(f.file_get_contents(lead))

            agg = agg_data(leadData, agg, fname)

        aggLeads[srchExt] = copy.deepcopy(agg)

        print "Aggregate Top Lvl Keys: ", aggLeads.keys()
        print "Aggregate Next Lvl Keys: "

        for key in aggLeads:
            print "{}: ".format(key)

            for arcDate in aggLeads[key].keys():
                print "{}: {}".format(arcDate, len(aggLeads[key][arcDate]))

        # raw_input("Press Enter to continue...")


def agg_data(leadData, agg, fname=None):
    """

    """
    #print "Lead: ", leadData

    # Get the timestamp of the lead
    try:
        ts = leadData['timeStamp']
        leadData.pop('timeStamp')
    except KeyError:
        return agg

    leadDate = datetime.fromtimestamp(ts)
    arcDate = leadDate.strftime("%B_%Y")

    #print "Archive Date: ", arcDate

    try:
        agg[arcDate][ts] = leadData
    except KeyError:
        agg[arcDate] = {}
        agg[arcDate][ts] = leadData
    except TypeError:
        print "Timestamp: ", ts
        print "Lead: ", leadData
        print "Archive Date: ", arcDate
        return agg

    """
    if fname is not None:
        archive_lead(fname, arcDate)
    """

    #print "File: {} added to {}".format(fname, arcDate)

    return agg


def archive_lead(fname, arcDate):
    # Archive Path
    newArcPath = archiveDir + arcDate + '//'

    if not os.path.exists(newArcPath):
        os.makedirs(newArcPath)

    # Move the file to the archive
    os.rename(leadDir + fname, newArcPath + fname)


def reformat_old_agg_data():
    """

    """

    # Get the aggLead global and 
    global aggLeads
    aggComplete = {}
    aggPartial = {}

    oldAggFiles = f.recursiveGlob(leadDir, '*.cd.agg')
    print "There are {} old aggregate files to process".format(len(oldAggFiles))

    for agg in oldAggFiles:
        tmp = json.loads(f.file_get_contents(agg))

        for uniqId in tmp:
            leadData = tmp[uniqId]

            if leadData['isPartial'] == True:
                aggPartial = agg_data(leadData, aggPartial)
            else:
                aggComplete = agg_data(leadData, aggComplete)

    arcData = dict(aggLeads['lead'].items() + aggComplete.items())
    aggLeads['lead'] = arcData

    arcData = dict(aggLeads['partial'].items() + aggPartial.items())
    aggLeads['partial'] = arcData    


def output_agg_files():
    for ext in aggLeads:
        for arcDate in aggLeads[ext]:
            arcFile = leadDir + arcDate + '.cd.' + ext + '.agg'

            if f.file_exists(arcFile):
                tmp = json.loads(f.file_get_contents(arcFile))
            else:
                tmp = {}

            arcData = dict(tmp.items() + aggLeads[ext][arcDate].items())

            f.file_put_contents(arcFile, json.dumps(arcData))


def main():
    global leadDir
    global archiveDir
    global aggLeads

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    archiveDir = leadDir + 'archive//'
    aggLeads = {}


    # Aggregate all the old individual file
    aggregate_individual_lead_files()

    # Reformat the old aggregate files
    reformat_old_agg_data()

    # Write it all out to an aggregate file
    output_agg_files()


if __name__ == "__main__":
    main()

Here is the reading code:

from p2p.basic import files as f
from p2p.adv import strTools as st
from p2p.basic import strTools as s

import os
import json
import copy
from datetime import datetime
import time


global leadDir
global fields
global fieldTimes
global versions


def parse_agg_file(aggFile):
    global leadDir
    global fields
    global fieldTimes

    try:
        tmp = json.loads(f.file_get_contents(aggFile))
    except ValueError:
        print "{} failed the JSON load".format(aggFile)
        return False

    print "Opening: ", aggFile

    for ts in tmp:
        try:
            tmpTs = float(ts)
        except:
            print "Timestamp: ", ts
            continue

        leadData = tmp[ts]

        for field in leadData:
            if field not in fields:
                fields[field] = []

            fields[field].append(float(ts))


def determine_form_versions():
    global fieldTimes
    global versions

    # Determine all the fields and their start and stop times
    times = []
    for field in fields:
        minTs = min(fields[field])
        fieldTimes[field] = [minTs, max(fields[field])]
        times.append(minTs)
        print 'Min ts: {}'.format(minTs)

    times = set(sorted(times))
    print "Times: ", times
    print "Fields: ", fieldTimes

    versions = {}
    for ts in times:
        d = datetime.fromtimestamp(ts)
        ver = d.strftime("%d_%B_%Y")

        print "Version: ", ver

        versions[ver] = []
        for field in fields:
            if ts in fields[field]:
                versions[ver].append(field)


def main():
    global leadDir
    global fields
    global fieldTimes

    leadDir = 'D://Server Data//eagle805//emmetrics//forms//leads//'
    fields = {}
    fieldTimes = {}

    aggFiles = f.glob(leadDir + '*.lead.agg')

    for aggFile in aggFiles:
        parse_agg_file(aggFile)

    determine_form_versions()

    print "Versions: ", versions




if __name__ == "__main__":
    main()

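【Question discussion】:

You need to show us your code. If it contains personal information, remove that information.
Do you really want me to post a JSON file with 76,000 characters? The code works on every other file except this one, so I am fairly sure it has to do with one of the fields being saved. It contains personal information for thousands of people. I could hand-enter the data into a new file almost as fast as I could strip the personal data out of this one. Unless PHP or something else can open this file, the stripping would have to be done by hand. If nobody can offer pointers without the full file, then I will delete the question.
No, I am asking you to show us your code: the Python, not the JSON.
Which part, the writing or the reading? Or both? There is no personal data in the code, only in the data files.
p2p is my own package that provides Python functions with the same names and inputs as their PHP counterparts. In my experience it just makes learning a new language faster.

Tags: json python-2.7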

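【Solution 1】:

So I figured it out... I am posting this answer in case someone else makes the same mistake.

First, I found a workaround, though I was not sure why it worked. From my original code, here is my file_get_contents function: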

def file_get_contents(fname):
    if s.stripos(fname, 'http://'):
        import urllib2
        return urllib2.urlopen(fname).read(maxUrlRead)
    else:
        return open(fname).read(maxFileRead)

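I used it via:

tmp = json.loads(f.file_get_contents(aggFile))

This failed, over and over again. However, while I was trying to get Python to at least give me the JSON string so I could run it through a JSON validator, I came across mention of json.load vs json.loads. So I tried this instead: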

a = open('D://Server Data//eagle805//emmetrics//forms//leads\July_2014.cd.lead.agg')
b = json.load(a)

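While I have not tested this output in the rest of my code, this code chunk does in fact read in the file, decode the JSON, and even display the data without crashing Spyder. The variable explorer in Spyder shows that b is a dict of size 1465, which is exactly how many records it should have. The portion of text displayed from the end of the dict all looks good. So overall I have a reasonably high level of confidence that the data was parsed correctly.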
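The difference between the two calls can be sketched as follows. This is a minimal illustration, not the original script: the sample records and the temporary file are hypothetical stand-ins for one aggregate file. json.load takes an open file object and reads all of it, while json.loads takes a string, so it only sees whatever the preceding read() returned.

```python
import json
import os
import tempfile

# Hypothetical sample data standing in for one aggregate file.
records = {"1405000000.0": {"state": "CA"}, "1405000100.0": {"state": "NY"}}

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".agg", delete=False) as tmp:
    json.dump(records, tmp)
    path = tmp.name

# json.load takes an open file object and reads the whole file itself.
with open(path) as fh:
    from_load = json.load(fh)

# json.loads takes a string; here read() has no size cap, so it is complete.
with open(path) as fh:
    from_loads = json.loads(fh.read())

os.remove(path)
print(from_load == from_loads)
```

With an uncapped read() the two paths should give the same result; they only diverge when the string handed to json.loads is shorter than the file.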

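When I wrote the file_get_contents function, I had seen several recommendations to always pass a maximum number of bytes to read, so that Python would not hang on a bad return. The value of maxReadFile (written as maxFileRead in the function above) was 1E7. When I manually forced maxReadFile to 1E9, everything worked fine. It turns out the file is just under 1.2E7 bytes. So the string returned from reading the file was not the full contents of the file, and was therefore invalid JSON.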
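The failure mode can be reproduced on a small scale. The sketch below assumes a made-up list of records and a cap of 10000 characters in place of 1E7; slicing the string plays the role of read(max_read). It is only meant to show that cutting valid JSON in the middle of a string value produces this kind of error, with a reported position close to the cap.

```python
import json

# Hypothetical stand-in for the aggregate file: valid JSON on a single line.
full_text = json.dumps([{"note": "x" * 50} for _ in range(1000)])

max_read = 10000                  # plays the role of the 1E7 cap, scaled down
truncated = full_text[:max_read]  # what read(max_read) would hand back

try:
    json.loads(truncated)
    error_message = None
except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
    error_message = str(exc)

print(len(full_text), len(truncated))
print(error_message)
```

This also explains why the position in the original error (char 9997846) sits just below 1E7 even though the file itself is longer.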

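Normally I would consider this a bug, but clearly, when opening and reading a file, you need to be able to read just one chunk at a time for memory management. So I was bitten by my own short-sightedness over the value of maxReadFile. The error message was correct, but it sent me off on a wild goose chase.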
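One way to keep a size limit without silently truncating is to compare the file size against the limit and fail loudly. This is a sketch under assumptions, not the p2p package's code: load_json_file, MAX_FILE_READ and the demo file are hypothetical names introduced here for illustration.

```python
import json
import os
import tempfile

MAX_FILE_READ = 10 ** 7  # assumed cap, mirroring the 1E7 value in this answer


def load_json_file(path, max_bytes=MAX_FILE_READ):
    """Parse a JSON file, refusing oversized files instead of truncating them."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(
            "{} is {} bytes, which exceeds the {}-byte limit".format(
                path, size, max_bytes
            )
        )
    with open(path) as fh:
        return json.load(fh)


# Demo on a small temporary file (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".agg", delete=False) as tmp:
    json.dump({"1405000000.0": {"state": "CA"}}, tmp)
    demo_path = tmp.name

loaded = load_json_file(demo_path)

try:
    load_json_file(demo_path, max_bytes=5)  # deliberately tiny limit
    limit_error = None
except ValueError as exc:
    limit_error = str(exc)

os.remove(demo_path)
print(loaded)
print(limit_error)
```

The point of the design is that an oversized file produces an error naming the limit, rather than a JSON parse error that points at the data.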

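Hopefully this saves someone else some time.

【Discussion】:

How do you force maxReadFile to 1e9?
@MutluSimsek maxReadFile is just a variable, so 'maxReadFile = 1E9'. This question is 7 years old and I have not done this in a while. My apologies in advance if I am forgetting something.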

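【Solution 2】:

I ran into the same problem. It turned out that the last line of the file was incomplete, probably because the download had stopped abruptly: I saw that I had enough data and killed the process in the terminal.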
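When the writing side is under your control, one common way to avoid leaving a half-written file behind is to write to a temporary file and rename it over the final path only once the write has finished. The sketch below is an assumption-laden illustration: write_json_atomically and the demo file name are hypothetical, and it covers writing JSON from Python rather than an interrupted download tool.

```python
import json
import os
import tempfile


def write_json_atomically(path, data):
    """Write data as JSON to a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # Python 3.3+; swaps the file in one step
    except Exception:
        # Clean up the partial temp file and let the caller see the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo in a scratch directory (hypothetical file name and data).
target_dir = tempfile.mkdtemp()
target = os.path.join(target_dir, "sample.agg")
write_json_atomically(target, {"1405000000.0": {"state": "CA"}})

with open(target) as fh:
    round_trip = json.load(fh)
leftovers = [name for name in os.listdir(target_dir) if name.endswith(".tmp")]
print(round_trip, leftovers)
```

If the process is killed mid-write, the final path either holds the previous complete file or does not exist yet, so a reader should not pick up a truncated document.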

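【Discussion】:

Same here; apparently the data received from the API is sometimes incomplete.

【Solution 3】:

If anyone ends up here the way I did and you are handling JSON from a form request, check whether a Content-Length header is being set. I was getting this error because of that header. I had run the JSON through a beautifier, which made the JSON larger, and this error appeared.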
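The effect of a stale Content-Length can be sketched without any network code. The payload below is hypothetical, and the slice stands in for a receiver that reads only as many bytes as the header declares; the sketch assumes the header was computed from a compact body before the JSON was beautified.

```python
import json

payload = {"name": "example", "items": [1, 2, 3]}  # hypothetical request data

compact_body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
pretty_body = json.dumps(payload, indent=4).encode("utf-8")

stale_length = len(compact_body)  # header value computed before beautifying

# A receiver that trusts the stale header reads only this many bytes.
received = pretty_body[:stale_length]

try:
    json.loads(received.decode("utf-8"))
    stale_ok = True
except ValueError:
    stale_ok = False

# Recompute the header from the bytes that are actually sent.
headers = {
    "Content-Type": "application/json",
    "Content-Length": str(len(pretty_body)),
}
complete = json.loads(pretty_body[: int(headers["Content-Length"])].decode("utf-8"))
print(stale_ok, complete == payload)
```

Note that the length is taken from the encoded bytes, not from the Python string, since the two can differ when the JSON contains non-ASCII characters.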

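【Discussion】:

【Solution 4】:

I had the same problem when importing a JSON file that I had created myself, yet importing a different JSON file worked without changing anything in the code. The difference I found in the file I had created was that its content was all on one line.

The file had this shape because I dumped the dictionary while writing the file, like this: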

# json.dump writes the dict straight to the file on a single line
with open("sample.json", "w") as outfile:
    json.dump(dictionary, outfile)

But once I dumped the dictionary to a string on its own and then wrote that string:

# Serialize first, with indentation, then write the resulting string
json_object = json.dumps(dictionary, indent=4)

with open("sample.json", "w") as outfile:
    outfile.write(json_object)

I got a JSON file with the familiar, standard indented shape, and that JSON file can now be used and imported without problems.
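For comparison, here is a small sketch that writes the same hypothetical dictionary in both layouts and loads each file back. Both layouts are valid JSON, so when each file is written completely and closed, both should load to the same dict. If a single-line file fails to load, an incomplete or unflushed write is one possible cause worth checking; that is an assumption, not something this answer confirms.

```python
import json
import os
import tempfile

dictionary = {"name": "sample", "values": [1, 2, 3]}  # hypothetical data

workdir = tempfile.mkdtemp()
one_line_path = os.path.join(workdir, "one_line.json")
indented_path = os.path.join(workdir, "indented.json")

# Layout 1: json.dump straight to the file, everything on one line.
with open(one_line_path, "w") as outfile:
    json.dump(dictionary, outfile)

# Layout 2: json.dumps with indent=4, then write the string.
with open(indented_path, "w") as outfile:
    outfile.write(json.dumps(dictionary, indent=4))

# Both files are read only after their 'with' blocks have closed them.
with open(one_line_path) as fh:
    one_line = json.load(fh)
with open(indented_path) as fh:
    indented = json.load(fh)

print(one_line == indented == dictionary)
```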

【Discussion】:

【Solution 5】:

I ran into a similar problem; apparently the file was corrupted. What helped me understand that was doing

with open("path/to/file", 'r') as f:
    raw_data = f.read()

I then saw that the string ended abruptly. Next, I checked a small portion of the data:

index_of_the_end_of_last_record = 100 # For example
data = json.loads(raw_data[:index_of_the_end_of_last_record]+']')

I appended a closing ] because the top level of my JSON file was a list.
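On Python 3.5 and later, the decoder can report where it gave up, which saves scanning the raw string by eye. The sketch below makes several assumptions: describe_json_error and the sample raw_data are hypothetical, and the salvage step uses a naive search for the last closing brace, which would be wrong if a } appeared inside a string value.

```python
import json


def describe_json_error(text, context=30):
    """Return None if text parses, else details about where parsing failed."""
    try:
        json.loads(text)
        return None
    except json.JSONDecodeError as exc:  # available on Python 3.5+
        start = max(exc.pos - context, 0)
        return {
            "message": exc.msg,
            "pos": exc.pos,
            "total_length": len(text),
            "context": text[start:exc.pos + context],
        }


# Hypothetical truncated file contents: a list whose last record is cut off.
raw_data = '[{"id": 1, "name": "ok"}, {"id": 2, "name": "cut of'
report = describe_json_error(raw_data)
print(report)

# Salvage the complete records, mirroring the slicing approach in this answer.
index_of_the_end_of_last_record = raw_data.rfind("}") + 1
salvaged = json.loads(raw_data[:index_of_the_end_of_last_record] + "]")
print(salvaged)
```

Comparing pos with total_length is a quick check: when the failure sits near the very end of the text, a truncated file is a likely explanation.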

【Discussion】: