Database index vs. full table scan of an in-memory index: answers

【Question title】: Database index vs. full table scan of an in-memory index
【Posted】: 2018-04-12 07:19:35
【Question】:

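I want to build an application similar to Tinder, so I need to be able to list all users around me who match certain criteria (age, religion, etc.).

All users are currently stored in MongoDB, but MongoDB seems to handle this kind of query very poorly. For example, I ran: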

db.runCommand( { dropDatabase: 1 } )

db.createCollection("users"); 

db.users.createIndex( { "locs.loc" : "2dsphere" } )


function randInt(n) { return parseInt(Math.random()*n); }
function randFloat(n) { return Math.random()*n; }

for(var j=0; j<10; j++) {  
  print("Building op "+j);
  var bulkop=db.users.initializeOrderedBulkOp() ;
  for (var i = 0; i < 1000000; ++i) {
    bulkop.insert(    
      {
        locs: [
          {
            loc : { 
              type: "Point", 
              coordinates: [ randFloat(180), randFloat(90) ] 
            }
          },
          {
            loc : { 
              type: "Point", 
              coordinates: [ randFloat(180), randFloat(90) ] 
            }
          }
        ]
      }  
    )
  };
  print("Executing op "+j);
  bulkop.execute();
}

Then:

db.runCommand(
   {
     geoNear: "users",
     near: { type: "Point", coordinates: [ 73.9667, 40.78 ] },
     spherical: true,
     query: { category: "xyz" }
   }
)

It took almost 4 minutes (219873 ms) to return:

   "waitedMS" : NumberLong(0),
   "results" : [ ],
   "stats" : {
           "nscanned" : 10018218,
           "objectsLoaded" : 15000000,
           "maxDistance" : 0,
           "time" : 219873
   },
   "ok" : 1

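So I definitely have to use something else, but what? I am fairly sure I need an in-memory index such as Sphinx (that is, simply keep all records in memory and do a full scan of all rows on every query). That actually works quite well, but Sphinx is oriented toward indexing text documents, and I am not sure it fits my needs.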
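To make the "keep everything in memory and scan all rows" idea concrete, here is a minimal JavaScript sketch of such a brute-force scan. It is not how Sphinx works internally; the record shape (`id`, `lat`, `lng`, `age`), the helper names, and the mean Earth radius of 6371 km are all assumptions made for illustration.

```javascript
// Hypothetical in-memory user records; the field names are assumptions.
const EARTH_RADIUS_KM = 6371; // mean Earth radius, an approximation

const toRad = (deg) => (deg * Math.PI) / 180;

// Great-circle (haversine) distance in km between two points given in degrees.
function haversineKm(lat1, lng1, lat2, lng2) {
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

// Full scan: test every user against the attribute filter and the radius.
function findNearby(users, center, radiusKm, predicate) {
  const hits = [];
  for (const u of users) {
    if (!predicate(u)) continue; // cheap attribute check first
    const dist = haversineKm(center.lat, center.lng, u.lat, u.lng);
    if (dist <= radiusKm) hits.push({ id: u.id, dist });
  }
  return hits.sort((a, b) => a.dist - b.dist); // nearest first
}

const users = [
  { id: 1, lat: 40.78, lng: -73.9667, age: 29 }, // same point as the center
  { id: 2, lat: 40.79, lng: -73.96, age: 41 },   // close, but fails the age filter
  { id: 3, lat: 48.8566, lng: 2.3522, age: 30 }, // far away
  { id: 4, lat: 40.8, lng: -73.95, age: 33 },    // a few km away
];
const nearby = findNearby(users, { lat: 40.78, lng: -73.9667 }, 5, (u) => u.age < 35);
console.log(nearby.map((h) => h.id)); // prints [ 1, 4 ]
```

The cost is linear in the number of users for every query, so whether this is acceptable depends on data size and query rate; that trade-off is exactly what the question asks about.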

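【Comments】:

Hi Loki; have you investigated why this query runs slowly? For example, are other kinds of geospatial queries slow as well? And have you considered creating a compound index on the category and loc fields?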
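As a rough illustration of that suggestion, the sketch below builds a compound index spec and a `$geoNear` aggregation pipeline bounded by `maxDistance`. The helper name `buildNearPipeline`, the 5000 m radius, and the limit of 50 are assumptions. Note also that the question's insert script never sets a `category` field, so the documents would need one for the filter to match anything. Whether the compound index is actually chosen would have to be verified with `explain()`.

```javascript
// Index spec: the equality field first, then the geo field.
// `category` comes from the question's query; it is not present in the sample data.
const compoundIndexSpec = { category: 1, "locs.loc": "2dsphere" };

// Builds a $geoNear pipeline bounded by maxDistance (meters for GeoJSON points).
function buildNearPipeline(lng, lat, maxMeters, category, limit = 50) {
  return [
    {
      $geoNear: {
        near: { type: "Point", coordinates: [lng, lat] }, // GeoJSON order: [lng, lat]
        distanceField: "dist", // output field that receives the computed distance
        maxDistance: maxMeters,
        query: { category },
        spherical: true,
        key: "locs.loc", // which indexed geo field to use
      },
    },
    { $limit: limit },
  ];
}

const pipeline = buildNearPipeline(73.9667, 40.78, 5000, "xyz");
console.log(JSON.stringify(pipeline, null, 2));

// In the mongo shell this would be used roughly as:
//   db.users.createIndex(compoundIndexSpec)
//   db.users.aggregate(pipeline)
```

Without a distance bound, and with a filter that matches no document, the server has to keep expanding the search until every index entry has been examined, which is consistent with the `nscanned` value of about 10 million reported in the question.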

Tags: database mongodb indexing database-design sphinx

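【Solution 1】:

Searching among 1 million documents would be faster in Sphinx / Manticore. On my server (not a very powerful one) it takes about 100 ms, and the index takes about 16M of RAM and about 31M of disk space.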

mysql> select id, geodist(lat,lng,73.9667,40.78, {in=deg,out=km}) dist, lat, lng from idx where dist < 5;
+--------+----------+-----------+-----------+
| id     | dist     | lat       | lng       |
+--------+----------+-----------+-----------+
| 456688 | 4.311642 | 74.005157 | 40.793140 |
| 679960 | 2.206543 | 73.979790 | 40.726372 |
| 904809 | 3.339423 | 73.936790 | 40.783146 |
+--------+----------+-----------+-----------+
3 rows in set (0.10 sec)

mysql> select count(*) from idx;
+----------+
| count(*) |
+----------+
|  1000000 |
+----------+
1 row in set (0.04 sec)

[snikolaev@dev01 ~]$ ls -lah idx_1m.sp*
-rw------- 1 snikolaev snikolaev  16M Apr 12 05:17 idx_1m.spa
-rw------- 1 snikolaev snikolaev 6.7M Apr 12 05:17 idx_1m.spd
-rw------- 1 snikolaev snikolaev    1 Apr 12 05:17 idx_1m.spe
-rw------- 1 snikolaev snikolaev  334 Apr 12 05:17 idx_1m.sph
-rw------- 1 snikolaev snikolaev 7.8M Apr 12 05:17 idx_1m.spi
-rw------- 1 snikolaev snikolaev    0 Apr 12 05:17 idx_1m.spk
-rw------- 1 snikolaev snikolaev    0 Apr 12 05:17 idx_1m.spl
-rw------- 1 snikolaev snikolaev    0 Apr 12 05:17 idx_1m.spm
-rw------- 1 snikolaev snikolaev    1 Apr 12 05:17 idx_1m.spp
-rw------- 1 snikolaev snikolaev    1 Apr 12 05:17 idx_1m.sps

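So I don't see any problem with using Sphinx / Manticore in your case:

- If you prefer batch data loading, xmlpipe/csvpipe will let you easily load the data from MongoDB
- If you need to load data in real time, that is also possible with real-time indexes
- Performance and resource consumption are at a good level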
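For the batch-loading route, one possible approach is to flatten the user documents into CSV rows that a csvpipe source can read. The sketch below is only an illustration: the numeric `id` field is hypothetical (the question's documents do not have one), only the first location of each user is exported, and the column order (id, lat, lng) would have to match the attributes declared in the source configuration.

```javascript
// Turns user documents shaped like the question's sample data into CSV lines
// with the columns: id, lat, lng.
// Assumptions: each user has a numeric `id` (hypothetical), and only the
// first entry of `locs` is exported.
function toCsvLines(users) {
  const lines = [];
  for (const u of users) {
    const first = u.locs && u.locs[0];
    if (!first) continue; // skip users without any location
    const [lng, lat] = first.loc.coordinates; // GeoJSON stores [lng, lat]
    lines.push([u.id, lat, lng].join(","));
  }
  return lines;
}

const sample = [
  { id: 1, locs: [{ loc: { type: "Point", coordinates: [73.9667, 40.78] } }] },
  { id: 2, locs: [] },
  { id: 3, locs: [{ loc: { type: "Point", coordinates: [2.3522, 48.8566] } }] },
];
console.log(toCsvLines(sample).join("\n"));
// prints:
// 1,40.78,73.9667
// 3,48.8566,2.3522
```

Users with several locations could instead be exported as one row per location, at the cost of needing a separate unique id for each row.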

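Note that although it is not a purely in-memory solution, i.e. your data is stored on disk once indexed, the attributes (latitude and longitude in your case) are always kept in memory for better performance.

Another option (if you are looking for a more in-memory solution) is RediSearch, which can also do geo search - https://redis.io/commands/georadius I'm not an expert in it, so I can't say whether it is faster than Sphinx / Manticore.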

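【Comments】:

Thanks for pointing out Manticore! But like Sphinx, I think Manticore is mainly a full-text search engine. How do we update the index in Manticore?
@loki You probably mean updating attributes, not text. That can be done like this: mysql> update idx set lat = 75.028214, lng = 41.751846 where id = 41219; Updating by an expression is also possible. By default, updates stay in memory. The "FLUSH ATTRIBUTES" command syncs the updates to disk, or the sync can happen periodically (there is an rt_flush_period option in the config).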