【Question title】: Why is the clustering key named "clustering key" in Cassandra?
【Posted】: 2018-12-13 22:56:27
【Question】:

Table 1:

create table mylistofitems (listid int, 
  itemid int, 
  quantity int, 
  itemdesc text, 
  primary key ((listid, itemid), itemdesc));

I ran the following inserts against this table:

insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'apple', 5);
insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'banana', 10);
insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'orange', 6);
insert into mylistofitems (listid, itemid, itemdesc, quantity) values (1, 1000, 'orange', 50);

When I run select * from mylistofitems, I get the following:

 listid | itemid | itemdesc | quantity
--------+--------+----------+----------
      1 |   1000 |    apple |        5
      1 |   1000 |   banana |       10
      1 |   1000 |   orange |       50

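The second insert statement did not overwrite the first row, but the fourth insert statement did overwrite the third row.

What does "clustering key" mean in this context?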

Table 2:

create table myitems (listid int,
  itemid int, 
  idesc text, 
  qty int,
  primary key (listid, itemid));

I inserted the following records into Table 2:

insert into myitems (listid, itemid, idesc, qty) values (1, 1000, 'apple', 5);
insert into myitems (listid, itemid, idesc, qty) values (1, 1000, 'banana', 10);
insert into myitems (listid, itemid, idesc, qty) values (1, 1000, 'orange', 6);
insert into myitems (listid, itemid, idesc, qty) values (1, 1000, 'orange', 50);

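The insert queries for table-2 are exactly the same as for table-1. But when I ran select * from myitems, I was surprised to see only the last inserted row. All the other rows were lost, i.e. each insert statement overwrote the previous record.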

 listid | itemid | idesc  | qty
--------+--------+--------+-----
      1 |   1000 | orange |  50

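Question: Why does Table 2 behave differently from Table 1? What does "clustering key" mean in this context? Why is the clustering key named "clustering key"? Does it have anything to do with the Cassandra cluster?

Update to the question: I tried an update on Table 1: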

update mylistofitems set quantity = 100 where listid = 1 and itemid = 1000;

This fails with error 2200, saying some clustering keys are missing. Why is this restricted?

【Comments】:

Tags: cassandra data-modeling

【Answer 1】:

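What is a clustering key?

The clustering keys determine how the data is stored on disk, which is one of the reasons Cassandra is so efficient. Because the order of the key columns changes how the data is laid out, it is important to know how Cassandra manages them internally.

Visualise the data on disk as an array. That is effectively how Cassandra stores it. This is what the first table would look like after the first 3 queries: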

table1 =
(listid(1) - itemid(1000)) // Partition key
    itemdesc('apple') // Clustering key
        = {listid: 1, itemid: 1000, itemdesc: apple, quantity: 5}
    itemdesc('banana') // Clustering key
        = {listid: 1, itemid: 1000, itemdesc: banana, quantity: 10}
    itemdesc('orange') // Clustering key
        = {listid: 1, itemid: 1000, itemdesc: orange, quantity: 6}

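On the fourth insert, the data is traversed using each clustering key (an index, in this example) to find the final entry to overwrite. So after the fourth insert it looks like this: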

table1 =
(listid(1) - itemid(1000)) // Partition key
    itemdesc('apple') // Clustering key
        = {listid: 1, itemid: 1000, itemdesc: apple, quantity: 5}
    itemdesc('banana') // Clustering key
        = {listid: 1, itemid: 1000, itemdesc: banana, quantity: 10}
    itemdesc('orange') // Clustering key
        = {listid: 1, itemid: 1000, itemdesc: orange, quantity: 50}
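
The same picture can be sketched as a nested dictionary. This is only a toy model of the array analogy, not how Cassandra actually stores data (memtables, SSTables and write timestamps are ignored), and the `upsert` helper is a hypothetical name used for illustration:

```python
from collections import defaultdict

# Toy model: partition key -> {clustering key -> row}.
# Assumption: this mirrors the array analogy only, not real storage.
table1 = defaultdict(dict)

def upsert(listid, itemid, itemdesc, quantity):
    """Write a row; a row with the same full primary key is replaced."""
    partition_key = (listid, itemid)
    table1[partition_key][itemdesc] = {"quantity": quantity}

upsert(1, 1000, "apple", 5)
upsert(1, 1000, "banana", 10)
upsert(1, 1000, "orange", 6)
upsert(1, 1000, "orange", 50)  # same partition key and clustering key as the third write

rows = table1[(1, 1000)]
print(len(rows))                   # 3
print(rows["orange"]["quantity"])  # 50
```

The fourth write lands on the same `(partition key, clustering key)` slot as the third, which is why only three rows remain.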

Missing partition/clustering keys

Take the following queries together with my example to see how Cassandra accesses the data.

WHERE listid IN (1, 2) and itemid = 1000
result = (data[1-1000], data[2-1000])

WHERE listid = 1 AND itemid = 1000 AND itemdesc = 'apple'
result = data[1-1000]['apple']

WHERE itemdesc = 'apple'
result = data[????]['apple']

C* doesn't know which index to search for apple.
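
As a rough sketch of that lookup order, the hypothetical `select` helper below resolves the outer (partition) key first and refuses to guess when it is missing. It models only the array analogy; options such as ALLOW FILTERING or secondary indexes are deliberately left out:

```python
# Toy data in the array notation used above: data[partition key][clustering key].
data = {(1, 1000): {"apple": 5, "banana": 10, "orange": 50}}

def select(listid=None, itemid=None, itemdesc=None):
    """Resolve a lookup outer key first, as in the array analogy."""
    if listid is None or itemid is None:
        # Without the partition key there is no way to pick data[????].
        raise ValueError("partition key (listid, itemid) is required")
    partition = data.get((listid, itemid), {})
    if itemdesc is None:
        return dict(partition)  # the whole partition
    if itemdesc in partition:
        return {itemdesc: partition[itemdesc]}
    return {}

print(select(listid=1, itemid=1000, itemdesc="apple"))  # {'apple': 5}

try:
    select(itemdesc="apple")
except ValueError as err:
    print(err)  # partition key (listid, itemid) is required
```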

It's important to note that the same applies when inserting or updating data. Let's take your UPDATE query as an example.

UPDATE mylistofitems SET quantity = 100 WHERE listid = 1 AND itemid = 1000;

With this query, you are trying to do this:

`data[1-1000][????] = {listid: 1, itemid: 1000, itemdesc: ????, quantity: 100}`

C* doesn't know which index to store the data in.

You should change the query to the following:

UPDATE mylistofitems SET quantity = 100 WHERE listid = 1 AND itemid = 1000 AND itemdesc = 'orange';

In array form, that is:

`data[1-1000]['orange'] = {listid: 1, itemid: 1000, itemdesc: 'orange', quantity: 100}`
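
A minimal sketch of why the write needs the full primary key, again using the array analogy. The `update_quantity` helper and its error text are hypothetical, not Cassandra's real code or message:

```python
# Toy data: data[partition key][clustering key] = quantity.
data = {(1, 1000): {"apple": 5, "banana": 10, "orange": 50}}

def update_quantity(quantity, listid, itemid, itemdesc=None):
    """Assign a quantity; every primary key column must be given."""
    if itemdesc is None:
        # data[1-1000][????] has no slot to write to.
        raise ValueError("some clustering keys are missing: itemdesc")
    # Like an upsert, this creates the slot when it does not exist yet.
    data.setdefault((listid, itemid), {})[itemdesc] = quantity

try:
    update_quantity(100, listid=1, itemid=1000)
except ValueError as err:
    print(err)  # some clustering keys are missing: itemdesc

update_quantity(100, listid=1, itemid=1000, itemdesc="orange")
print(data[(1, 1000)]["orange"])  # 100
```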

Adding quantity as a clustering key

If you add quantity as a clustering key, the data structure will look like this:

table1 =
(listid(1) - itemid(1000)) // Partition key
    itemdesc('apple') // Clustering key
        quantity(5) // Clustering key
            = {listid: 1, itemid: 1000, itemdesc: 'apple', quantity: 5}
    itemdesc('banana') // Clustering key
        quantity(10) // Clustering key
            = {listid: 1, itemid: 1000, itemdesc: 'banana', quantity: 10}
    itemdesc('orange') // Clustering key
        quantity(6) // Clustering key
            = {listid: 1, itemid: 1000, itemdesc: 'orange', quantity: 6}
        quantity(50) // Clustering key
            = {listid: 1, itemid: 1000, itemdesc: 'orange', quantity: 50}

This would let you have multiple rows for each combination, although you still can't have multiple rows containing the same data.
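
To illustrate, here is the same toy model with `(itemdesc, quantity)` as the clustering key of a single partition. It assumes the hypothetical key layout `primary key ((listid, itemid), itemdesc, quantity)` and is only a sketch:

```python
# Toy model of the single partition (1, 1000):
# clustering key (itemdesc, quantity) -> row.
partition = {}

def upsert(itemdesc, quantity):
    """Rows are unique per (itemdesc, quantity) pair within the partition."""
    partition[(itemdesc, quantity)] = {"itemdesc": itemdesc, "quantity": quantity}

upsert("apple", 5)
upsert("banana", 10)
upsert("orange", 6)
upsert("orange", 50)  # a different clustering key, so a new row
upsert("orange", 50)  # identical key, so this overwrites rather than duplicates

print(len(partition))                                     # 4
print(sorted(q for (d, q) in partition if d == "orange"))  # [6, 50]
```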

General rules

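The partition key plus the clustering keys form the unique identifier of each row
You can't query by a key without including the preceding keys in the query
Cassandra has no separate inserts/updates - only upserts
When inserting a row, you must specify all of the keys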

【Comments】:

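Thanks Jim! That explains some of it. But I have a follow-up question: I know SSTables are immutable, but what about the memtable? In the example above, was the third record really updated/overwritten? Or was it just a blind insert with the latest timestamp, with subsequent select statements being smart enough to exclude the orange with the older timestamp? Why can't Cassandra allow another orange (if that is my intention), given that I am using an insert statement rather than an update statement? What would be the impact of allowing another orange?
In short, it sounds like the partition key doesn't have to be unique, but for a given partition key, the clustering key is always constrained to be unique. I'd like to know the design rationale behind this restriction.
AFAIK memtables are mutable. I just realised my data structure example was wrong. The last index I specified was wrong because quantity is not a clustering key. I'll update the answer. If you make quantity a clustering key, then the new row will be stored alongside the original record.
@Jinnah C* is designed this way for performance and scaling. It's common to denormalise your data into many tables that each serve a specific query. This is unlike relational databases, where you can run very dynamic queries.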

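【Answer 2】:

I want to answer my own question to close this thread. It may also help others who have the same confusion:

Basically, I had overlooked the concepts of primary key, partition key, and clustering key.

The primary key of table-1 is: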

primary key ((listid, itemid), itemdesc)

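This means listid+itemid is only the partition key, which helps locate the node that a record lands on.

Only the combination listid+itemid+itemdesc maintains actual uniqueness.

Summary: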

listid+itemid = composite partition key
listid+itemid+itemdesc = composite primary key
itemdesc = clustering key

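(The clustering key only serves to sort the items within each partition, which helps when querying on that column with various relational operators.)

In this context, clustering is nothing but grouping records by partition key and then sorting them, in ASC order by default, under each partition key. In other words, it is a group by plus an order by.

This is quite different from an RDBMS. In the RDBMS world, you apply group by and order by at retrieval time as needed. In Cassandra, the grouping and ordering happen at insert time so that retrieval is faster (depending on the query used).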
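
The "group by and order by at insert time" idea can be sketched as follows. This is a toy model with hypothetical `write` and `read_range` helpers: each partition keeps its rows sorted by clustering key, so reads and range scans need no sorting step. Real Cassandra reaches a similar result through memtables and SSTables, which are not modelled here:

```python
import bisect
from collections import defaultdict

# partition key -> list of (clustering key, quantity), always kept sorted.
partitions = defaultdict(list)

def write(partition_key, clustering_key, quantity):
    """Group by partition key, keep clustering order, upsert on an equal key."""
    rows = partitions[partition_key]
    # A 1-tuple sorts just before any (key, quantity) pair with the same key.
    i = bisect.bisect_left(rows, (clustering_key,))
    if i < len(rows) and rows[i][0] == clustering_key:
        rows[i] = (clustering_key, quantity)        # overwrite existing row
    else:
        rows.insert(i, (clustering_key, quantity))  # insert in sorted position

def read_range(partition_key, start, end):
    """Rows with start <= clustering key < end, found without sorting."""
    rows = partitions[partition_key]
    lo = bisect.bisect_left(rows, (start,))
    hi = bisect.bisect_left(rows, (end,))
    return rows[lo:hi]

# Written out of order on purpose.
write((1, 1000), "orange", 6)
write((1, 1000), "apple", 5)
write((1, 1000), "banana", 10)
write((1, 1000), "orange", 50)

print(partitions[(1, 1000)])            # [('apple', 5), ('banana', 10), ('orange', 50)]
print(read_range((1, 1000), "b", "p"))  # [('banana', 10), ('orange', 50)]
```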

Table-2 is defined with primary key (listid, itemid), which means:

listid = standalone partition key
itemid = standalone clustering key
listid + itemid = composite primary key
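
To see why the same four inserts leave three rows in table-1 but only one in table-2, here is a small sketch that keys each row by its primary key columns. The `apply_inserts` helper is hypothetical and models only the uniqueness rule, not storage or distribution:

```python
def apply_inserts(primary_key_columns, inserts):
    """Replay inserts as upserts keyed by the given primary key columns."""
    rows = {}
    for row in inserts:
        key = tuple(row[column] for column in primary_key_columns)
        rows[key] = row  # same primary key: the later write wins
    return list(rows.values())

values = [
    (1, 1000, "apple", 5),
    (1, 1000, "banana", 10),
    (1, 1000, "orange", 6),
    (1, 1000, "orange", 50),
]
table1_inserts = [dict(zip(("listid", "itemid", "itemdesc", "quantity"), v)) for v in values]
table2_inserts = [dict(zip(("listid", "itemid", "idesc", "qty"), v)) for v in values]

# table-1: primary key ((listid, itemid), itemdesc)
table1_rows = apply_inserts(("listid", "itemid", "itemdesc"), table1_inserts)
# table-2: primary key (listid, itemid)
table2_rows = apply_inserts(("listid", "itemid"), table2_inserts)

print(len(table1_rows))  # 3
print(table2_rows)       # [{'listid': 1, 'itemid': 1000, 'idesc': 'orange', 'qty': 50}]
```

In table-2 every insert carries the same (listid, itemid) pair, so each one replaces the previous row.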

【Comments】: