Time series data: selecting a range with maxTimeuuid/minTimeuuid in Cassandra

【Question title】: Time series data, selecting range with maxTimeuuid/minTimeuuid in Cassandra
【Posted】: 2013-07-26 13:53:27
【Question】:

I recently created a keyspace and a column family in Cassandra. I have the following:

CREATE TABLE reports (
  id timeuuid PRIMARY KEY,
  report varchar
)

I want to select reports by time range, so my query is as follows:

select dateOf(id), id 
from keyspace.reports 
where token(id) > token(maxTimeuuid('2013-07-16 16:10:48+0300'));

This returns:

 dateOf(id)               | id
--------------------------+--------------------------------------
 2013-07-16 16:10:37+0300 | 1b3f6d00-ee19-11e2-8734-8d331d938752
 2013-07-16 16:10:13+0300 | 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b
 2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
 2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e

So this is wrong.

When I try the following CQL:

select dateOf(id), id from keyspace.reports 
where token(id) > token(minTimeuuid('2013-07-16 16:12:48+0300'));

 dateOf(id)               | id
--------------------------+--------------------------------------
 2013-07-16 16:10:37+0300 | 1b3f6d00-ee19-11e2-8734-8d331d938752
 2013-07-16 16:10:13+0300 | 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b
 2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
 2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e

select dateOf(id), id from keyspace.reports
where token(id) > token(minTimeuuid('2013-07-16 16:13:48+0300'));

 dateOf(id)               | id
--------------------------+--------------------------------------
 2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
 2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e

Is it random? Why doesn't it produce meaningful output?

What is the best solution for this in Cassandra?

【Question comments】:

Tags: cassandra

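【Answer 1】:

You are using the token function, which is not really useful in your context (querying between times with mintimeuuid and maxtimeuuid), and it is producing output that looks random and incorrect.

From the CQL documentation:

The TOKEN function can be used with a condition operator on the partition key column to query. The query selects rows based on the token of their partition key rather than on their value. The token of a key depends on the partitioner in use. The RandomPartitioner and Murmur3Partitioner do not yield a meaningful order.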
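
To see why a token-based comparison looks random, here is a minimal Python sketch using the four ids from the question. It orders them once by the timestamp embedded in each timeuuid and once by a hash of the id. The MD5 digest is only a stand-in for a partitioner token (an assumption for illustration, not Cassandra's actual token computation), and the helper names are hypothetical.

```python
import hashlib
import uuid
from datetime import datetime, timezone

# Number of 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

# The four ids shown in the question.
IDS = [
    "1b3f6d00-ee19-11e2-8734-8d331d938752",
    "0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b",
    "1b275870-ee19-11e2-b3f3-af3e3057c60f",
    "21f9a390-ee19-11e2-89a2-97143e6cae9e",
]


def embedded_time(text):
    """Return the UTC time stored in a version 1 UUID, truncated to whole seconds."""
    ticks = uuid.UUID(text).time - UUID_EPOCH_OFFSET
    return datetime.fromtimestamp(ticks // 10_000_000, tz=timezone.utc)


def time_order(ids):
    """Sort ids by their embedded timestamp (what a time range query needs)."""
    return sorted(ids, key=lambda text: uuid.UUID(text).time)


def hash_order(ids):
    """Sort ids by an MD5 digest of their bytes (stand-in for token order)."""
    return sorted(ids, key=lambda text: hashlib.md5(uuid.UUID(text).bytes).digest())


if __name__ == "__main__":
    print("ordered by embedded time:")
    for text in time_order(IDS):
        print(" ", embedded_time(text).isoformat(), text)
    print("ordered by hash (stand-in for token):")
    for text in hash_order(IDS):
        print(" ", embedded_time(text).isoformat(), text)
```

The hash order is unrelated to the embedded time, so a condition such as token(id) > token(...) selects rows by hash position rather than by when they were created.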

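If you want to retrieve all records between two dates, it probably makes more sense to model the data as a wide row, with one record per column rather than one record per row. For example, create the table: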

CREATE TABLE reports2 (
  reportname text,
  id timeuuid,
  report text,
  PRIMARY KEY (reportname, id)
)

populate it with data:

insert into reports2(reportname,id,report) VALUES ('report', 1b3f6d00-ee19-11e2-8734-8d331d938752, 'a');
insert into reports2(reportname,id,report) VALUES ('report', 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b, 'b');
insert into reports2(reportname,id,report) VALUES ('report', 1b275870-ee19-11e2-b3f3-af3e3057c60f, 'c');
insert into reports2(reportname,id,report) VALUES ('report', 21f9a390-ee19-11e2-89a2-97143e6cae9e, 'd');

and query it (without the token call!):

select dateOf(id),id from reports2 where reportname='report' and id>maxtimeuuid('2013-07-16 16:10:48+0300');

which returns the expected result:

 dateOf(id)               | id
--------------------------+--------------------------------------
 2013-07-16 14:10:48+0100 | 21f9a390-ee19-11e2-89a2-97143e6cae9e

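The downside of this is that all of your reports live in a single row. On the other hand, you can now store many different reports (keyed here by reportname). To get all reports named mynewreport in August 2013, you could query with: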

select dateOf(id),id from reports2 where reportname='mynewreport' and id>=mintimeuuid('2013-08-01+0300') and id<mintimeuuid('2013-09-01+0300');
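
The range predicates above can be modelled in plain Python to check which ids fall inside a given bound. This is a rough sketch under stated assumptions: timeuuids are compared only by their embedded 100 ns timestamp, minTimeuuid(t) is taken as the first tick of millisecond t, and maxTimeuuid(t) as the last tick of that millisecond. The clock-sequence and node bytes are ignored, and all helper names are hypothetical.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Number of 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
TICKS_PER_MS = 10_000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unix_ms(moment):
    """Whole milliseconds since the Unix epoch for a timezone-aware datetime."""
    return (moment - UNIX_EPOCH) // timedelta(milliseconds=1)


def min_ticks(moment):
    """Assumed model of minTimeuuid: first 100 ns tick of the millisecond."""
    return unix_ms(moment) * TICKS_PER_MS + UUID_EPOCH_OFFSET


def max_ticks(moment):
    """Assumed model of maxTimeuuid: last 100 ns tick of the millisecond."""
    return (unix_ms(moment) + 1) * TICKS_PER_MS - 1 + UUID_EPOCH_OFFSET


def ids_after(ids, moment):
    """Model of: id > maxTimeuuid(moment)."""
    return [text for text in ids if uuid.UUID(text).time > max_ticks(moment)]


def ids_between(ids, start, end):
    """Model of: id >= minTimeuuid(start) AND id < minTimeuuid(end)."""
    low, high = min_ticks(start), min_ticks(end)
    return [text for text in ids if low <= uuid.UUID(text).time < high]


PLUS3 = timezone(timedelta(hours=3))
IDS = [
    "1b3f6d00-ee19-11e2-8734-8d331d938752",
    "0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b",
    "1b275870-ee19-11e2-b3f3-af3e3057c60f",
    "21f9a390-ee19-11e2-89a2-97143e6cae9e",
]

print(ids_after(IDS, datetime(2013, 7, 16, 16, 10, 48, tzinfo=PLUS3)))
print(ids_between(IDS, datetime(2013, 8, 1, tzinfo=PLUS3), datetime(2013, 9, 1, tzinfo=PLUS3)))
```

With the four sample ids, the first call keeps only 21f9a390-ee19-11e2-89a2-97143e6cae9e, matching the single row returned by the query with maxtimeuuid. The August range is empty for these ids because all four were generated in July 2013.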

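【Comments】:

If we apply mintimeuuid to the partition key, assuming it is of type timeuuid (rather than a clustering key as in the example), would there be a performance penalty?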