Get a range of columns from Cassandra based on TimeUUIDType using Python and the datetime module

【Question title】: Get range of columns from Cassandra based on TimeUUIDType using Python and the datetime module
【Posted】: 2013-08-03 23:59:29
【Question】:

I have a table set up like this:

{"String" : {uuid1 : "String", uuid1: "String"}, "String" : {uuid : "String"}}

Or...

Row_validation_class = UTF8Type
Default_validation_class = UTF8Type
Comparator = UUID

(It basically uses the website as the row key and dynamically generates columns whose names are TimeUUIDType values based on datetime.datetime.now() and whose values are strings.)

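I want to use Pycassa to retrieve slices of the data by row and by column. On other (smaller) tables I have done this by downloading the whole data set (or at least filtering down to one row) into an ordered dictionary that I can then compare against datetime objects.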

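I would like to use the Pycassa multiget or get_indexed_slice functions to pull out certain columns and rows. Is there something similar that allows filtering on a datetime? All of my attempts so far result in the following error message: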

TypeError: can't compare datetime.datetime to UUID

The best I have come up with so far is...

def get_number_of_visitors(site, start_date, end_date=datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S:%f")):
    pool = ConnectionPool('Logs', timeout = 2)
    col_fam = ColumnFamily(pool, 'sessions')
    result = col_fam.get(site)
    number_of_views = [(k,v) for k,v in col_fam.get(site).items() if get_posixtime(k) > datetime.datetime.strptime(str(start_date), "%Y-%m-%d %H:%M:%S:%f") and get_posixtime(k) < datetime.datetime.strptime(str(end_date), "%Y-%m-%d %H:%M:%S:%f")]
    total_unique_sessions = len(number_of_views)
    return total_unique_sessions

where get_posixtime is defined as:

def get_posixtime(uuid1):
    assert uuid1.version == 1, ValueError('only applies to type 1')
    t = uuid1.time
    t = (t - 0x01b21dd213814000L)
    t = t / 1e7
    return datetime.datetime.fromtimestamp(t)
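
For readers on Python 3, where the `L` integer suffix above is a syntax error, the same conversion can be sketched roughly as follows. The name `uuid1_to_datetime` is hypothetical, and returning a timezone-aware UTC value (rather than local time, as `fromtimestamp` does) is an assumption made here to keep comparisons unambiguous.

```python
import datetime
import uuid

# Number of 100-ns intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01b21dd213814000


def uuid1_to_datetime(u):
    """Return the timestamp embedded in a version-1 UUID as an aware UTC datetime."""
    if u.version != 1:
        raise ValueError("only applies to version 1 UUIDs")
    # u.time counts 100-ns ticks; integer division keeps microsecond precision.
    micros = (u.time - UUID_EPOCH_OFFSET) // 10
    epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
    return epoch + datetime.timedelta(microseconds=micros)
```

Calling `uuid1_to_datetime(uuid.uuid1())` should give a value close to the current UTC time, which is a quick way to check the offset constant.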

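This doesn't seem to work (it doesn't return the data I expect), and it feels like it shouldn't be necessary. I am creating the column timestamps using: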

timestamp = datetime.datetime.now()

Does anyone have any ideas? This feels like the kind of thing Pycassa (or another Python library) would support, but I can't figure out how to do it.

P.S. The table schema as described by cqlsh:

CREATE COLUMNFAMILY sessions (
  KEY text PRIMARY KEY
) WITH
  comment='' AND
  comparator='TimeUUIDType' AND
  row_cache_provider='ConcurrentLinkedHashCacheProvider' AND
  key_cache_size=200000.000000 AND
  row_cache_size=0.000000 AND
  read_repair_chance=1.000000 AND
  gc_grace_seconds=864000 AND
  default_validation=text AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  row_cache_save_period_in_seconds=0 AND
  key_cache_save_period_in_seconds=14400 AND
  replicate_on_write=True;

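P.P.S.

I know that you can specify a column range in Pycassa, but I can't guarantee that every row has entries for the start and end values of the range, so those columns may not exist.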

【Comments】:

Tags: python nosql cassandra pycassa

【Solution 1】:

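You do want to request a "slice" of columns using the column_start and column_finish parameters of get(), multiget(), get_count(), get_range(), and so on. For a TimeUUIDType comparator, pycassa actually accepts datetime instances or timestamps for these two parameters; internally it converts them to a TimeUUID-like form with a matching timestamp component. The start and finish values act only as bounds for the comparison, so no column with exactly those names has to exist in the row. The documentation has a section on working with TimeUUIDs that provides more details.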

For example, I would implement your function like this:

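import datetime

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

def get_number_of_visitors(site, start_date, end_date=None):
    """
    start_date and end_date should be datetime.datetime instances or
    timestamps like those returned from time.time().
    """
    # Resolve the default at call time, not at function definition time.
    if end_date is None:
        end_date = datetime.datetime.now()
    pool = ConnectionPool('Logs', timeout=2)
    col_fam = ColumnFamily(pool, 'sessions')
    # Count only the columns whose TimeUUID names fall inside the range.
    return col_fam.get_count(site, column_start=start_date, column_finish=end_date)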

You can use the same form with col_fam.get() or col_fam.xget() to get the actual list of visitors.
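
To make the "TimeUUID-like form with a matching timestamp component" more concrete, here is a stdlib-only sketch of how a datetime could be turned into the lowest or highest version-1 UUID carrying that timestamp. It is not pycassa's code (pycassa ships its own helper, `pycassa.util.convert_time_to_uuid`); the name `time_to_uuid_bound` is hypothetical, and treating naive datetimes as UTC is an assumption of this sketch.

```python
import datetime
import uuid

# Number of 100-ns intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01b21dd213814000
EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def time_to_uuid_bound(dt, lowest=True):
    """Build a version-1 UUID whose timestamp equals dt (hypothetical helper)."""
    if dt.tzinfo is None:
        # Assumption of this sketch: naive datetimes are interpreted as UTC.
        dt = dt.replace(tzinfo=datetime.timezone.utc)
    delta = dt - EPOCH
    micros = (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds
    ticks = micros * 10 + UUID_EPOCH_OFFSET  # 100-ns ticks since the UUID epoch

    time_low = ticks & 0xffffffff
    time_mid = (ticks >> 32) & 0xffff
    time_hi = (ticks >> 48) & 0x0fff
    if lowest:
        # Smallest clock sequence and node: sorts before real UUIDs at this time.
        clock_seq_hi, clock_seq_low, node = 0x00, 0x00, 0x000000000000
    else:
        # Largest clock sequence and node: sorts after real UUIDs at this time.
        clock_seq_hi, clock_seq_low, node = 0x3f, 0xff, 0xffffffffffff
    return uuid.UUID(
        fields=(time_low, time_mid, time_hi, clock_seq_hi, clock_seq_low, node),
        version=1,
    )
```

A lowest bound for the start and a highest bound for the end bracket every real TimeUUID generated in between, which is why neither bound needs to exist as an actual column.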

P.S. Try not to create a new ConnectionPool() for every request. If you have to, set a smaller pool size.
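
One way to follow that advice, sketched here under assumptions, is to build the pool and ColumnFamily once at startup and pass the ColumnFamily into the function. The name `count_visitors` and the `col_fam` parameter are hypothetical; the commented-out setup mirrors the calls already used in this answer and needs pycassa and a reachable cluster.

```python
import datetime

# Assumed one-time setup at application start (requires pycassa and a cluster):
#   from pycassa.pool import ConnectionPool
#   from pycassa.columnfamily import ColumnFamily
#   pool = ConnectionPool('Logs', timeout=2)
#   sessions = ColumnFamily(pool, 'sessions')


def count_visitors(col_fam, site, start_date, end_date=None):
    """Count columns for `site` in the time range using a shared ColumnFamily."""
    if end_date is None:
        end_date = datetime.datetime.now()
    return col_fam.get_count(site, column_start=start_date, column_finish=end_date)
```

Each request would then call something like `count_visitors(sessions, site, start_date)` and reuse the existing connections. Passing the ColumnFamily in also makes the function easy to exercise with a stub object in tests.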

【Comments】:

Thank you very much. Bonus points (if I could give them) for the extra advice on making it more efficient :)