Stopwords not getting removed - Solr

【Question title】: Stopwords not getting removed - solr
【Posted】: 2023-03-04 14:36:01
【Question】:

I am new to Solr and have defined the following schema:

<schema name="example" version="1.5">
<fields>
    <field name="nodeId" type="string" indexed="true" stored="true" />
    <field name="_root_" type="string" indexed="true" stored="false" />
    <field name="datetime" type="string" indexed="true" stored="true"
        multiValued="true" />
    <field name="epochSecs" type="string" indexed="true" stored="true"
        multiValued="true" />
    <field name="subject" type="text_general" indexed="true"
        stored="true" />
    <field name="body" type="text_general" indexed="true"
        stored="true" />
    <field name="emailId" type="string" indexed="true"
        stored="true" />
    <field name="compliantFlag" type="boolean" indexed="true"
        stored="true" />
    <field name="_version_" type="long" indexed="true" stored="true" />
    <field name="text" type="text_general" indexed="true" stored="false"
        multiValued="true" />
    <field name="ngrams" type="myNGram" indexed="true" stored="false" required="false" />


</fields>
<uniqueKey>nodeId</uniqueKey>
<copyField source="datetime" dest="text" />
<copyField source="epochSecs" dest="text" />
<copyField source="subject" dest="text" />
<copyField source="body" dest="text" />
<copyField source="emailId" dest="text" />
<copyField source="compliantFlag" dest="text" />
<copyField source="text" dest="ngrams"/>

<types>
    <fieldType name="string" class="solr.StrField"
        sortMissingLast="true" omitNorms="true"/>
    <fieldType name="long" class="solr.TrieLongField"
        precisionStep="0" positionIncrementGap="0" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="text_general" class="solr.TextField"
        positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
            <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
    </fieldType>
    <fieldType name="myNGram" stored="false" class="solr.TextField"> 
        <analyzer type="index"> 
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/> 
            <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="5"/> 
        </analyzer> 
    </fieldType>
</types>
</schema>

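Stopwords are not being removed from the "body" field at index time.

Also, how can I use Solr's analyzers to remove special characters such as \n from a field like the following: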

\n \n\n\nThese are the numbers Smurfit has.  \n\nP

Any help is appreciated. Thanks.

【Question comments】:

Tags: solr lucene indexing information-retrieval

【Answer 1】:

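StandardTokenizer should split tokens around newlines, whitespace, and so on, and at first glance the stop filter looks like it should work. You should probably add a LowerCaseFilterFactory above the StopFilterFactory, though, so that matching is not case-sensitive. (The stop filter in the question already sets ignoreCase="true", so lowercasing mostly matters for matching the terms that remain.)

I wonder whether a related question is: what do you mean by "removed"? Analysis only affects the indexed representation of a field. It does not affect in any way the stored version that you retrieve from the index. It is meant to facilitate search, not to transform the stored text. If a filter removes the word "the", a search for "the" should no longer match anything, but you will still see the word when you retrieve the document from the index.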
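
The filter ordering suggested above can be sketched as a revised text_general type. This is only a sketch under stated assumptions: it assumes lang/stopwords_en.txt exists in the core's conf directory and contains the intended words, it points both analyzers at that same file, and it assumes the "\n" in the sample is a literal backslash followed by "n" (real newline characters are already discarded by the tokenizer, in which case the charFilter can be dropped). None of this changes the stored value of the field.

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <!-- Assumption: strips a literal backslash + "n" sequence before tokenizing. -->
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\\n" replacement=" "/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- Lowercase first, so later filters see normalized terms. -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Assumption: same stopword file as the index analyzer. -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
</fieldType>
```

Index-time analysis changes only take effect for documents indexed after the core is reloaded, so existing documents would need to be reindexed. The Analysis screen in the Solr admin UI is a convenient way to check which tokens survive each stage.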

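【Comments】:

Thanks! I can see the text with stopwords in the "body" field (Solr UI). Is this the expected behavior? When I run a query, I still see the stopwords (even though I also use an analyzer for queries). Also, when I search for "the", I still get some results. So I believe my indexing is not happening correctly. Please comment!
Well, looking more closely, you are using different stopword sets in the index and query analyzers ("lang/stopwords_en.txt" vs. "stopwords.txt"), which is very unusual. If both sets are loaded correctly and contain "the", then I would not expect the query body:the to return results.
What should I do to eliminate stopwords from the stored version? Does Solr provide any functionality for this, or do I need to do it manually?
Out of the box, as far as I know, there really isn't any way to transform the stored text through analysis. Again, that isn't what it is for. You could make use of Lucene tokenizers and filters to do it, but I think it would have to be done manually, as a preprocessing step.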
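
For the preprocessing step mentioned in the last comment, one option that stays inside Solr, but outside analysis, is an update request processor, which rewrites field values before they are indexed and stored. Below is a minimal sketch for solrconfig.xml, assuming the goal is only to collapse runs of whitespace (including newlines) in the body field. The chain name "clean-body" is hypothetical. Stripping a whole stopword list this way would need a long regex alternation, so that part is probably still better done in client code before the document is sent.

```xml
<!-- Sketch only: goes in solrconfig.xml, not in the schema. -->
<updateRequestProcessorChain name="clean-body">
    <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">body</str>
        <!-- Collapse any run of whitespace, newlines included, to one space. -->
        <str name="pattern">\s+</str>
        <str name="replacement"> </str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <!-- RunUpdateProcessorFactory must come last; it performs the actual update. -->
    <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would be selected per request with the update.chain=clean-body parameter, or made the default for the update handler. Because the value is rewritten before it is stored, the original text is not recoverable from the index afterwards.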