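[Posted]: 2015-01-05 19:15:29
[Question]:
I have two shards and am trying to implement a suggester using distributed search across them (Solr 4.10.1). The suggester appears to query each shard and concatenate the result sets, which leaves duplicates. My solrconfig.xml contains the following: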
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">titleSuggester</str>
<str name="lookupimpl">AnalyzingLookupFactory</str>
<str name="lookupimpl">FreeTextSuggesterFactory</str>
<str name="dictionaryimpl">DocumentDictionaryFactory</str>
<str name="field">title_sug</str>
<str name="weightField">rank</str>
<str name="suggestAnalyzerFieldType">shingleSuggest</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
http://localhost:8983/solr/collection1/suggest?suggest.dictionary=titleSuggester&shards.qt=/suggest&shards=shard1,shard2&suggest.q=an&wt=json&indent=true
Result:
{ "responseHeader":{
"status":0,
"QTime":12}, "suggest":{"titleSuggester":{
"an":{
"numFound":10,
"suggestions":[{
"term":"an",
"weight":149,
"payload":""},
{
"term":"an",
"weight":142,
"payload":""},
{
"term":"an american",
"weight":6,
"payload":""},
{
"term":"an affair",
"weight":4,
"payload":""},
{
"term":"an 18th century",
"weight":2,
"payload":""},
{
"term":"an 18th",
"weight":2,
"payload":""},
{
"term":"an american hymn",
"weight":2,
"payload":""},
{
"term":"an 18th century drawing room",
"weight":2,
"payload":""},
{
"term":"an 18th century drawing",
"weight":2,
"payload":""},
{
"term":"an american hymn (main",
"weight":2,
"payload":""}]}}}}
As you can see above, the term "an" is returned twice, once per shard. If I run the query with distrib=false (
http://localhost:8983/solr/collection1/suggest?suggest.dictionary=titleSuggester&distrib=false&suggest.q=an&wt=json&indent=true), I get no duplicates, as expected:
{ "responseHeader":{
"status":0,
"QTime":1},
"suggest":{"titleSuggester":{
"an":{
"numFound":10,
"suggestions":[{
"term":"an",
"weight":149,
"payload":""},
{
"term":"an 18th",
"weight":2,
"payload":""},
{
"term":"an 18th century",
"weight":2,
"payload":""},
{
"term":"an 18th century drawing",
"weight":2,
"payload":""},
{
"term":"an 18th century drawing room",
"weight":2,
"payload":""},
{
"term":"an absolution take",
"weight":1,
"payload":""},
{
"term":"an absolution take her",
"weight":1,
"payload":""},
{
"term":"an absolution take her to",
"weight":1,
"payload":""},
{
"term":"an absolution take her to sea,",
"weight":1,
"payload":""},
{
"term":"an affair",
"weight":4,
"payload":""}]}}}}
Is there a way to remove the duplicate results?
[Comments]:
-
I would prefer to do this inside Solr, but if I don't get a solution we will do it on the client side.
-
Were you able to figure this out? In my case, without distrib=false I get a very high count, but with distrib=false I get the correct count.
-
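No. At first we simply filtered out the duplicates on the client. Later it became irrelevant because (for the reasons found here) I created a new core for the suggester and built it with a single shard. distrib=false does not produce duplicates because it only fetches results from one of the cores.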
-
In our case, it seems to have been caused by the Solr router strategy changing from composite to implicit.
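The client-side filtering mentioned in the comments could look roughly like the sketch below. It is not a Solr feature: the function name `dedupe_suggestions` and the choice to keep the highest weight for each duplicated term are assumptions made for illustration (summing the weights across shards would be another reasonable choice). The input is the "suggestions" list from the JSON response shown in the question.

```python
def dedupe_suggestions(suggestions, count=10):
    """Collapse entries that share a term, keeping the highest-weight one.

    `suggestions` is the list found under
    response["suggest"][dictionary][query]["suggestions"].
    Hypothetical helper; keeping the maximum weight is an assumption.
    """
    best = {}
    for entry in suggestions:
        term = entry["term"]
        # Keep the first entry for a term unless a later one has a higher weight.
        if term not in best or entry["weight"] > best[term]["weight"]:
            best[term] = entry
    # Order by weight (descending), then alphabetically for stable output.
    merged = sorted(best.values(), key=lambda e: (-e["weight"], e["term"]))
    return merged[:count]


if __name__ == "__main__":
    # First few entries of the distributed response from the question.
    sample = [
        {"term": "an", "weight": 149, "payload": ""},
        {"term": "an", "weight": 142, "payload": ""},
        {"term": "an american", "weight": 6, "payload": ""},
        {"term": "an affair", "weight": 4, "payload": ""},
    ]
    for entry in dedupe_suggestions(sample):
        print(entry["term"], entry["weight"])
```

Because duplicates are dropped after the merge, the client may end up with fewer than suggest.count entries; requesting a somewhat larger suggest.count and trimming afterwards is one way to compensate.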
Tags: solr