ndb Models are not saved in memcache when using MapReduce

【Question title】: ndb Models are not saved in memcache when using MapReduce
【Posted】: 2014-10-06 19:08:18
【Question】:

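I created two MapReduce pipelines that upload CSV files to bulk-create categories and products. Each product is tied to a category through a KeyProperty. The Category and Product models are built on ndb.Model, so based on the documentation I assumed they would be cached in Memcache automatically when retrieved from the Datastore.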

I have run these scripts on the server to upload 30 categories and then 3000 products. All of the data shows up in the Datastore as expected.

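However, the product upload does not seem to use Memcache to fetch the categories. When I check the Memcache viewer in the portal, it shows roughly 180 hits and roughly 60 misses. If I am uploading 3000 products and retrieving the category each time, shouldn't I see around 3000 hits + misses from fetching the category (i.e., Category.get_by_id(category_id))? On top of that, there should probably be another 3000 or so misses from trying to retrieve the existing product before creating a new one (the algorithm handles both entity creation and updates).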

Here is the relevant product mapping function, which takes a line from the CSV file and creates or updates a product:

import csv


def product_bulk_import_map(data):
    """Product Bulk Import map function."""

    result = {"status" : "CREATED"}
    product_data = data

    try:
        # parse input parameter tuple
        byteoffset, line_data = data

        # parse base product data
        product_data = [x for x in csv.reader([line_data])][0]
        (p_id, c_id, p_type, p_description) = product_data

        # process category
        category = Category.get_by_id(c_id)
        if category is None:
            raise Exception(product_import_error_messages["category"] % c_id)

        # store in datastore
        product = Product.get_by_id(p_id)
        if product is not None:
            result["status"] = "UPDATED"
            product.category = category.key
            product.product_type = p_type
            product.description = p_description
        else:
            product = Product(
                id = p_id,
                category = category.key,
                product_type = p_type,
                description = p_description
            )
        product.put()
        result["entity"] = product.to_dict()
    except Exception as e:
        # catch any exceptions, and note failure in output
        result["status"] = "FAILED"
        result["entity"] = str(e)

    # return results
    yield (str(product_data), result)

【Comments】:

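Could you provide information on how you store/fetch data in memcache? What keys do you use? Keep in mind that memcache does not accept special symbols (such as spaces) in keys.
Fetching is done via .get_by_id() and storing via .put(). Category IDs are simple strings ("books", "movies", etc.). Product IDs are currently just numbers (1, 2, 3...), but before launch we will probably change them to a combination of category and number (book_1, book_2, movie_1, etc.). If I need to change that, I just want something fairly simple that lets us use the CSV import to add new entries and to modify old entries that have outdated information or typos.
I think this may be caused by the context cache, which is consulted before memcache. You could try disabling it to see whether you then get a large number of memcache hits. Of course, the context cache is more efficient than memcache.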

Tags: google-app-engine mapreduce memcached google-cloud-datastore app-engine-ndb

【Solution 1】:

MapReduce intentionally disables memcache for NDB.

See mapreduce/util.py, line 373, _set_ndb_cache_policy() (as of 2015-05-01):

def _set_ndb_cache_policy():
  """Tell NDB to never cache anything in memcache or in-process.

  This ensures that entities fetched from Datastore input_readers via NDB
  will not bloat up the request memory size and Datastore Puts will avoid
  doing calls to memcache. Without this you get soft memory limit exits,
  which hurts overall throughput.
  """
  ndb_ctx = ndb.get_context()
  ndb_ctx.set_cache_policy(lambda key: False)
  ndb_ctx.set_memcache_policy(lambda key: False)

You can force get_by_id() and put() to use memcache, for example:

product = Product.get_by_id(p_id, use_memcache=True)
...
product.put(use_memcache=True)

Alternatively, if you batch your puts with mapreduce.operation, you can modify the NDB context. I don't know whether this has other undesirable effects, though:

ndb_ctx = ndb.get_context()
ndb_ctx.set_memcache_policy(lambda key: True)
...
yield operation.db.Put(product)

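As for the docstring's remark about "soft memory limit exits", I don't understand why that would happen if only memcache is enabled (i.e., without the context cache).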

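In fact, it seems you do want memcache enabled for puts; otherwise, after your mapper has modified the underlying data, your app will end up reading stale data from NDB's memcache.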
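
If turning memcache back on for every kind is a concern, the policy function can be selective, since NDB calls it with the entity's key and expects a bool. The following is only a sketch: memcache_categories_only and FakeKey are hypothetical names introduced here, and FakeKey stands in for ndb.Key so the snippet runs outside App Engine.

```python
class FakeKey(object):
    """Hypothetical stand-in for ndb.Key so the sketch runs without App Engine."""

    def __init__(self, kind, entity_id):
        self._kind = kind
        self._id = entity_id

    def kind(self):
        # ndb.Key exposes the entity kind through a kind() method as well.
        return self._kind


def memcache_categories_only(key):
    """Memcache policy: allow memcache for Category entities, skip all other kinds."""
    return key.kind() == "Category"


# Inside a mapper the policy would be installed like this (not executed here):
#   ndb.get_context().set_memcache_policy(memcache_categories_only)

category_allowed = memcache_categories_only(FakeKey("Category", "books"))
product_allowed = memcache_categories_only(FakeKey("Product", "1"))
```

With a policy like this, the 30 categories would go through memcache while the 3000 products would not. Product writes would then still bypass memcache, so the stale-data caveat above would continue to apply to products.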

【Comments】:

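One thing to keep in mind: with the cache policy set by the mapreduce library, memcache is skipped entirely when you update entities (for example, when running a migration). This means that if an entity is already in memcache and you update it with mapreduce, it is updated directly in the datastore, but the previous (stale) copy remains in memcache and is returned by any subsequent NDB get. You have to remember to use .put(use_memcache=True) to replace the stale copy, or delete the memcache key manually.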
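
The stale-copy scenario described in this comment can be reproduced with a toy model. This is not NDB code: the two dicts and the get/put helpers are hypothetical stand-ins, and the assumption that a memcache-enabled put simply invalidates the cached entry is a simplification.

```python
datastore = {}
memcache = {}


def get(key, use_memcache=True):
    """Read through memcache when allowed, otherwise go straight to the datastore."""
    if use_memcache and key in memcache:
        return memcache[key]
    value = datastore.get(key)
    if use_memcache and value is not None:
        memcache[key] = value
    return value


def put(key, value, use_memcache=True):
    """Write to the datastore; drop the cached copy only when memcache is in play."""
    datastore[key] = value
    if use_memcache:
        memcache.pop(key, None)


# A normal request handler writes and then reads the entity, which caches it.
put("product:1", "old description")
cached_before = get("product:1")

# The mapper updates the entity with memcache disabled, as MapReduce does by default.
put("product:1", "new description", use_memcache=False)
stale_read = get("product:1")

# Writing again with use_memcache=True invalidates the cached copy.
put("product:1", "new description", use_memcache=True)
fresh_read = get("product:1")
```

In this model the read after the cache-bypassing write still returns the old description, and only the memcache-aware write makes the new value visible to cached reads.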

【Solution 2】:

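As Slawek Rewaj already mentioned, this is caused by the context cache. When retrieving an entity, NDB tries the context cache first, then memcache, and finally, if the entity is found in neither, retrieves it from the datastore. The context cache is just a Python dictionary whose lifetime and visibility are limited to the current request, but MapReduce calls product_bulk_import_map() many times within a single request.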

You can find more information about the context cache here: https://cloud.google.com/appengine/docs/python/ndb/cache#incontext
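
The lookup order described above (context cache, then memcache, then datastore) can be illustrated with a toy model. ToyContext and its counters are hypothetical names for this sketch, not part of NDB, and the model assumes both cache layers are enabled, which Solution 1 says is not the case inside a MapReduce job.

```python
class ToyContext(object):
    """Toy model of the lookup order: context cache, then memcache, then datastore."""

    def __init__(self, memcache, datastore):
        self.context_cache = {}      # per-request dictionary
        self.memcache = memcache     # shared across requests
        self.datastore = datastore
        self.memcache_lookups = 0
        self.datastore_reads = 0

    def get(self, key):
        # 1. The context cache answers repeated reads within the same request.
        if key in self.context_cache:
            return self.context_cache[key]
        # 2. Only a context-cache miss reaches memcache.
        self.memcache_lookups += 1
        if key in self.memcache:
            value = self.memcache[key]
        else:
            # 3. Only a memcache miss reaches the datastore.
            self.datastore_reads += 1
            value = self.datastore.get(key)
            if value is not None:
                self.memcache[key] = value
        if value is not None:
            self.context_cache[key] = value
        return value


shared_memcache = {}
shared_datastore = {"books": "Category(books)", "movies": "Category(movies)"}

ctx = ToyContext(shared_memcache, shared_datastore)
for category_id in ["books", "movies"] * 50:   # 100 gets within one request
    ctx.get(category_id)

memcache_lookups = ctx.memcache_lookups
datastore_reads = ctx.datastore_reads
```

In this model, 100 gets for two categories reach memcache only twice, because the per-request dictionary absorbs the rest. That is why a small hit + miss count can look like a context-cache effect when both layers are enabled.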

【Comments】:

It is not caused by the context cache. See my answer below. :)