【Question title】: Elasticsearch count query based on a field whose value contains a filesystem path
【Posted】: 2021-12-10 01:29:54
【Question】:

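I asked this question before (here), but when I tried the solution against more data I quickly realized my mistake.

So I am back to square one, and I am asking the question again in the hope of getting more insight.

My task is still the same, but stated more precisely: get a document count filtered on several values, including a path field whose values are filesystem paths.

My sample data looks like this: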

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
        "max_score": 15.9074545,
        "hits": [
            {
                "_index": "stage-data-20210728115212095",
                "_type": "_doc",
                "_id": "fil.31c425766287497ec5a508d995d1ce36",
                "_score": 15.9074545,
                "_source": {
                    "header_action": "uploaded",
                    "partition": 7,
                    "offset": 11382619,
                    "volumeId": "vol.e144f0bc59914725528f08d995ebd8c3",
                    "lambdaLagMs": 0,
                    "id": "fil.31c425766287497ec5a508d995d1ce36",
                    "name": "sampleFile.txt",
                    "parentFolderId": "fol.6357e749063445b0c5a408d995d1ce36",
                    "volumeName": "test-vol-b2ee569932dd470788ebc70e6f15bf36",
                    "type": "text/plain",
                    "path": "/test_Folder-ed9cc1294ba841f98fa986be7ac38813/Folder1/sampleFile.txt",
                    "timeCreated": "2021-10-23T06:10:45.287Z",
                    "timeModified": "2021-10-23T06:10:45.287Z",
                    "sizeInBytes": 26,
                    "isUploaded": true,
                    "archiveStatus": "None",
                    "storageTier": "Standard",
                    "eTag": "ed6a6e795564952d4d9707e7dc91c6a6",
                    "format": "TXT",
                    "status": "Available",
                    "recordDateTime": "2021-10-23 06:10:47.268",
                    "recordTurnAroundTimeMs": 2629.375,
                    "dataType": "File"
                }
            },
            {
                "_index": "stage-data-20210728115212095",
                "_type": "_doc",
                "_id": "fil.6075863c66464a2cc5a608d995d1ce36",
                "_score": 15.500043,
                "_source": {
                    "header_action": "uploaded",
                    "partition": 15,
                    "offset": 11393012,
                    "volumeId": "vol.e144f0bc59914725528f08d995ebd8c3",
                    "lambdaLagMs": 0,
                    "id": "fil.6075863c66464a2cc5a608d995d1ce36",
                    "name": "testFile.txt",
                    "parentFolderId": "fol.230c9c8861fa40640cc808d995d1b210",
                    "volumeName": "test-vol-b2ee569932dd470788ebc70e6f15bf36",
                    "type": "text/plain",
                    "path": "/test_Folder-ed9cc1294ba841f98fa986be7ac38813/testFile.txt",
                    "timeCreated": "2021-10-23T06:10:45.286Z",
                    "timeModified": "2021-10-23T06:10:45.286Z",
                    "sizeInBytes": 23,
                    "isUploaded": true,
                    "archiveStatus": "None",
                    "storageTier": "Standard",
                    "eTag": "2b9f6fc56449eb68b4fa5c5da127c5be",
                    "format": "TXT",
                    "status": "Available",
                    "recordDateTime": "2021-10-23 06:10:47.284",
                    "recordTurnAroundTimeMs": 2628.936,
                    "dataType": "File"
                }
            },
            {
                "_index": "stage-data-20210728115212095",
                "_type": "_doc",
                "_id": "fil.27a781dc81554811576308d995d1ce3c",
                "_score": 15.500043,
                "_source": {
                    "header_action": "uploaded",
                    "partition": 6,
                    "offset": 11377991,
                    "volumeId": "vol.e144f0bc59914725528f08d995ebd8c3",
                    "lambdaLagMs": 0,
                    "id": "fil.27a781dc81554811576308d995d1ce3c",
                    "name": "smallfile.txt",
                    "parentFolderId": "fol.6ac9ecb11dae4ebd576208d995d1ce3c",
                    "volumeName": "test-vol-b2ee569932dd470788ebc70e6f15bf36",
                    "type": "text/plain",
                    "path": "/test_Folder-ed9cc1294ba841f98fa986be7ac38813/Folder1/Folder2/smallfile.txt",
                    "timeCreated": "2021-10-23T06:10:45.294Z",
                    "timeModified": "2021-10-23T06:10:45.294Z",
                    "sizeInBytes": 1249,
                    "isUploaded": true,
                    "archiveStatus": "None",
                    "storageTier": "Standard",
                    "eTag": "c6e9338f9e54e39b52dd853908a1aecd",
                    "status": "Available",
                    "recordDateTime": "2021-10-23 06:10:47.276",
                    "recordTurnAroundTimeMs": 2629.8689999999997,
                    "dataType": "File"
                }
            }
        ]
    }
}

I am trying to get the document count using the NEST C# library. Here is my sample code:

        using System;
        using Nest;

        var elasticSettings = new ConnectionSettings(new Uri("https://myelasticurl/"))
            .DefaultIndex("stage-data");

        var client = new ElasticClient(elasticSettings);
        var folderPrefix = "/test_Folder-ed9cc1294ba841f98fa986be7ac38813/Folder1/Folder2/";

        // volumeId AND dataType AND a wildcard on the analyzed "path" field
        Func<CountDescriptor<dynamic>, ICountRequest> countQueryFilter = c => c.Query(q =>
            q.Match(m => m.Field("volumeId").Query("vol.e144f0bc59914725528f08d995ebd8c3"))
            && q.Match(m => m.Field("dataType").Query("File"))
            && q.Wildcard(m => m.Field("path").Value($"{folderPrefix}*")));

        // Await inside an async method rather than blocking on .Result
        var countResponse = await client.CountAsync(countQueryFilter);
        Console.WriteLine(countResponse.Count);

Here is the mapping for the path field:

{
    "stage-data-20210728115212095": {
        "mappings": {
            "path": {
                "full_name": "path",
                "mapping": {
                    "path": {
                        "type": "text",
                        "fields": {
                            "raw": {
                                "type": "keyword"
                            },
                            "rawlower": {
                                "type": "keyword",
                                "normalizer": "lowercase"
                            },
                            "tree": {
                                "type": "text",
                                "analyzer": "path_analyzer"
                            },
                            "tree_level": {
                                "type": "token_count",
                                "store": true,
                                "analyzer": "path_level_analyzer",
                                "enable_position_increments": false
                            }
                        },
                        "analyzer": "ngram_analyzer"
                    }
                }
            }
        }
    }
}

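If I search only on volumeId and dataType, I get correct results. The query even works on the path field for datasets where the files sit directly under a root folder, e.g. /folder1/mytxt.txt. Only when the files are nested several levels deep, as in the sample above, and I search for a path such as /test_Folder-ed9cc1294ba841f98fa986be7ac38813/Folder1/Folder2/, do I get a count of 0.

At this point I am not sure whether I need to adjust the mapping of this field to make it more search friendly, as suggested here, or whether I am simply searching the wrong way.

Note that I did try the following query types for the path search:

  • Wildcard
  • Term
  • Regexp
  • Match

All of them gave the same result: 0 records returned.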
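
One way to investigate the zero counts would be to look at the terms the path field actually indexes. Wildcard, term and regexp queries are term-level: they compare the pattern with the indexed terms and do not analyze it first. The body below is a sketch of a request to the `_analyze` API of the index (for example `GET /stage-data-20210728115212095/_analyze`), which runs the sample path through the analyzer configured for the field, here `ngram_analyzer`. The configuration of that analyzer is not shown in the question, so which tokens come back is an open question.

```json
{
  "field": "path",
  "text": "/test_Folder-ed9cc1294ba841f98fa986be7ac38813/Folder1/Folder2/smallfile.txt"
}
```

Repeating the request with `"field": "path.raw"` should return the whole path as a single token, since keyword fields are not tokenized.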

Please point out what I am missing. Thanks in advance for your help.

I am using NEST 7.13.0 on .NET Core 3.1.

Regards, Vikas

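【Comments】:

  • Are you looking for an exact match on the value of the path field?
  • Hi Nishant, not really an exact match, more of a wildcard. A colleague of mine was able to find a working solution. I will post the answer soon.

Tags: c# elasticsearch nest amazon-elasticsearch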


【Solution 1】:

A colleague of mine helped with this, and the solution works well. Here is the sample code:

        using System;
        using Nest;

        var elasticSettings = new ConnectionSettings(new Uri("https://myelasticurl/"))
            .DefaultIndex("stage-data");

        var client = new ElasticClient(elasticSettings);
        var folderPrefix = "/test_Folder-ed9cc1294ba841f98fa986be7ac38813/Folder1/Folder2/";

        // volumeId AND dataType AND a prefix on the keyword sub-field "path.raw"
        Func<CountDescriptor<dynamic>, ICountRequest> countQueryFilter = c => c.Query(q =>
            q.Match(m => m.Field("volumeId").Query("vol.e144f0bc59914725528f08d995ebd8c3"))
            && q.Match(m => m.Field("dataType").Query("File"))
            && q.Prefix(m => m.Field("path.raw").Value(folderPrefix)));

        // Await inside an async method rather than blocking on .Result
        var countResponse = await client.CountAsync(countQueryFilter);
        Console.WriteLine(countResponse.Count);

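So basically I needed to use a prefix query and run it against path.raw as defined in the mapping. path.raw is the keyword sub-field, so it holds the whole path as a single unanalyzed term that a prefix can match.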
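
For reference, the request body that this NEST query should roughly correspond to can be sketched in Query DSL. NEST's `&&` operator combines queries into a `bool` query with `must` clauses; the exact JSON the client serializes may differ in detail, so treat this as an approximation. It would be sent to the `_count` endpoint of the index or alias (`stage-data` in the code above).

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "volumeId": { "query": "vol.e144f0bc59914725528f08d995ebd8c3" } } },
        { "match": { "dataType": { "query": "File" } } },
        { "prefix": { "path.raw": { "value": "/test_Folder-ed9cc1294ba841f98fa986be7ac38813/Folder1/Folder2/" } } }
      ]
    }
  }
}
```

Running the body directly, for example from Kibana Dev Tools or curl, is a quick way to check the count independently of the C# client.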

【Comments】:
