Advanced Solr: Implementing WeChat Moments Search (Relationship Search)

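WeChat search includes a Moments search. A quick look shows that it searches the posts in your own Moments feed that are visible to you.
Implementing this search involves two key points:
1. Simple fuzzy search
2. Your own Moments relationship chain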

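The most intuitive implementation is to build a single Collection, store the ids of all users who can see a post when that post is indexed, and then filter directly on the id of the querying user.
This has two problems:
1. A user's friend list can be very long. A WeChat user can have up to 5,000 friends, and storing 5,000 user ids in one field of every single post is not realistic.
2. Friend relationships change all the time, and the index has to follow in real time. With the relationship data denormalized into each post, changing a single friendship means updating every historical post, which is far too much work.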
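
To make the second problem concrete, here is a minimal in-memory sketch of the denormalized layout. The `visibleTo` field and the `unfriend` helper are hypothetical names used only for illustration; they do not come from any real schema.

```python
def unfriend(posts, user_a, user_b):
    """Remove mutual visibility between two users.

    Returns the ids of the posts that had to be rewritten, which is the
    number of documents that would need re-indexing.
    """
    touched = []
    for post in posts:
        author = post["userId"]
        if author == user_a:
            other = user_b
        elif author == user_b:
            other = user_a
        else:
            continue  # post belongs to neither user, nothing to change
        if other in post["visibleTo"]:
            post["visibleTo"].remove(other)
            touched.append(post["id"])
    return touched


# Hypothetical sample data: every post carries its full viewer list.
posts = [
    {"id": "p1", "userId": 1, "visibleTo": [2, 3]},
    {"id": "p2", "userId": 1, "visibleTo": [2, 3]},
    {"id": "p3", "userId": 2, "visibleTo": [1]},
    {"id": "p4", "userId": 3, "visibleTo": [1]},
]
rewritten = unfriend(posts, 1, 3)
```

One friendship change touches every historical post of both users, so the update cost grows with the number of posts rather than with the number of relationships.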

In short, the approach above can be made to work, but it is the worst of the available options.

Here is another approach: a multi-Collection query based on Solr's Join.

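Solr's Join is similar to a multi-table query in MySQL, with some differences. Its main purpose is fast filtering of one collection by the contents of another, not returning a combined result.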

Here is the example and explanation of Join queries in SolrCloud mode from the official documentation:

Joining Across Collections

You can also specify a fromIndex parameter to join with a field from another core or collection. If running in SolrCloud mode, then the collection specified in the fromIndex parameter must have a single shard and a replica on all Solr nodes where the collection you're joining to has a replica.

Let's consider an example where you want to use a Solr join query to filter movies by directors that have won an Oscar. Specifically, imagine we have two collections with the following fields:

movies: id, title, director_id, ...

movie_directors: id, name, has_oscar, ...

To filter movies by directors that have won an Oscar using a Solr join on the movie_directors collection, you can send the following filter query to the movies collection:

fq={!join from=id fromIndex=movie_directors to=director_id}has_oscar:true

Notice that the query criteria of the filter (has_oscar:true) is based on a field in the collection specified using fromIndex. Keep in mind that you cannot return fields from the fromIndex collection using join queries, you can only use the fields for filtering results in the "to" collection (movies).

Next, let's understand how these collections need to be deployed in your cluster. Imagine the movies collection is deployed to a four node SolrCloud cluster and has two shards with a replication factor of two. Specifically, the movies collection has replicas on the following four nodes:

node 1: movies_shard1_replica1

node 2: movies_shard1_replica2

node 3: movies_shard2_replica1

node 4: movies_shard2_replica2

To use the movie_directors collection in Solr join queries with the movies collection, it needs to have a replica on each of the four nodes. In other words, movie_directors must have one shard and replication factor of four:

node 1: movie_directors_shard1_replica1

node 2: movie_directors_shard1_replica2

node 3: movie_directors_shard1_replica3

node 4: movie_directors_shard1_replica4

At query time, the JoinQParser will access the local replica of the movie_directors collection to perform the join. If a local replica is not available or active, then the query will fail. At this point, it should be clear that since you're limited to a single shard and the data must be replicated across all nodes where it is needed, this approach works better with smaller data sets where there is a one-to-many relationship between the from collection and the to collection. Moreover, if you add a replica to the to collection, then you also need to add a replica for the from collection.

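Example:
fq={!join from=id fromIndex=movie_directors to=director_id}has_oscar:true
The example shows the key points of usage:

from is the field in the from collection
fromIndex is the name of the from collection
to is the field in the main collection
the filter condition applies directly to the from collection

In other words, after the filter condition is applied, the values remaining in the id field of the from collection are mapped onto director_id in the main collection (that is, director_id must be one of the remaining id values). Because this is an fq, the main query can still be set freely.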
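
The mapping described above can be sketched in plain Python. This is only an in-memory illustration of the semantics, not how Solr implements the join; the `join_filter` helper and the sample documents are assumptions made up for this example.

```python
def join_filter(from_docs, from_query, from_field, to_docs, to_field):
    """Emulate {!join from=... fromIndex=... to=...} on lists of dicts."""
    # Step 1: apply the filter condition to the from collection.
    matched = [doc for doc in from_docs if from_query(doc)]
    # Step 2: collect the surviving values of the from field.
    keys = {doc[from_field] for doc in matched}
    # Step 3: keep main-collection docs whose to field is one of those values.
    return [doc for doc in to_docs if doc[to_field] in keys]


# Hypothetical sample data mirroring the movies / movie_directors example.
movie_directors = [
    {"id": "d1", "name": "Director A", "has_oscar": True},
    {"id": "d2", "name": "Director B", "has_oscar": False},
]
movies = [
    {"id": "m1", "title": "First", "director_id": "d1"},
    {"id": "m2", "title": "Second", "director_id": "d2"},
    {"id": "m3", "title": "Third", "director_id": "d1"},
]

oscar_movies = join_filter(
    movie_directors, lambda d: d["has_oscar"], "id", movies, "director_id"
)
```

Note that the result contains only documents from the main collection, which matches the documentation's remark that fields from the fromIndex collection cannot be returned.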

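The official documentation states that the from collection cannot be sharded, and that every node hosting the main collection must hold a complete copy of the from collection. This makes sense: the join is in effect a local lookup followed directly by filtering, which saves the network transfer time you would spend if you queried the relationships yourself. The drawbacks are that the from collection cannot be sharded and that its replicas must stay alive alongside those of the main collection.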
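
The placement rule can be expressed as a simple set check. The sketch below assumes you already have the lists of node names for each collection; how you obtain them (for example from your cluster state) is outside this example, and `nodes_missing_from_replica` is a hypothetical helper.

```python
def nodes_missing_from_replica(to_nodes, from_nodes):
    """Return nodes that host the main collection but lack a from-collection replica."""
    return sorted(set(to_nodes) - set(from_nodes))


# Layout from the quoted documentation: movies lives on four nodes.
movies_nodes = ["node1", "node2", "node3", "node4"]

# Correct deployment: movie_directors has one shard replicated to all four nodes.
ok = nodes_missing_from_replica(movies_nodes, ["node1", "node2", "node3", "node4"])

# Broken deployment: node4 has no local replica, so joins executed there would fail.
broken = nodes_missing_from_replica(movies_nodes, ["node1", "node2", "node3"])
```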

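Below is a simple relationship search built on this feature.

Collection index structure:

Post index:

Relationship index:

The post index resembles the posts published to Moments. This is the simplest possible form: post id, post content, and the id of the user who published the post.

The relationship index likewise stores the simplest Moments permission relationship: permission id, user id, and friend id. The permission field itself is omitted here; I assume that as long as a relationship record exists, the posts are visible.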

Test data:

http://pan.baidu.com/s/1qYNmIza

Create two collections, meixin_dynameic and meixin_relationship, and use the following filter query:

fq={!join from=friendId fromIndex=meixin_relationship to=userId}userId:1
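
As a rough sketch, the request parameters for a keyword search combined with this filter could be assembled as follows. The field name `content` is an assumption, since the post index schema is not shown here, and `build_moments_query` is a hypothetical helper. The keyword is passed through without escaping, so real input would need Solr query escaping first.

```python
from urllib.parse import urlencode, parse_qs


def build_moments_query(user_id, keyword, content_field="content",
                        from_index="meixin_relationship"):
    """Build the URL-encoded query string for a Moments search by one user."""
    # Filter: posts whose userId is a friendId of the querying user.
    join_fq = "{!join from=friendId fromIndex=%s to=userId}userId:%d" % (
        from_index, int(user_id))
    params = {
        "q": "%s:%s" % (content_field, keyword),  # main query stays free-form
        "fq": join_fq,
    }
    return urlencode(params)


query_string = build_moments_query(1, "solr")
decoded = parse_qs(query_string)
```

The resulting string would be appended to the select handler URL of the meixin_dynameic collection, since that is the main collection being filtered.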

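This approach solves the problem of every post having to carry redundant relationship data and be updated in real time, but the fact that the from collection cannot be sharded still needs a solution. There are two ideas:
1. Query the permission relationships yourself and filter manually (the filter list may get long)
2. An unconventional form of sharding: create multiple from collections by hashing the user id, and hash again at query time to decide which from collection to query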
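
Both ideas can be sketched as small helpers that produce the filter query. Everything named here is hypothetical: the bucket count, the `meixin_relationship_<n>` naming scheme, and the helper functions are assumptions for illustration. The second idea only works if indexing uses the same hash function to decide where each relationship document is written.

```python
import zlib

# Hypothetical number of relationship collections; not from the original setup.
NUM_BUCKETS = 4


def manual_friend_fq(friend_ids):
    """Idea 1: filter on friend ids fetched beforehand, via Solr's terms query parser."""
    return "{!terms f=userId}" + ",".join(str(f) for f in friend_ids)


def relationship_collection_for(user_id, buckets=NUM_BUCKETS,
                                prefix="meixin_relationship"):
    """Idea 2: map a user id to one of several relationship collections."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    bucket = zlib.crc32(str(user_id).encode("utf-8")) % buckets
    return "%s_%d" % (prefix, bucket)


def sharded_join_fq(user_id):
    """Join filter that targets the bucket holding this user's relationships."""
    from_index = relationship_collection_for(user_id)
    return "{!join from=friendId fromIndex=%s to=userId}userId:%s" % (
        from_index, user_id)
```

Each bucket collection is still a single shard that must be replicated to every node hosting the post collection, so hashing caps the size of each from collection rather than removing the placement requirement.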

For now I consider this only a barely adequate implementation, and I hope to find a more efficient method later.