SPARQL: how to group this kind of data (answers)

[Question title]: Sparql how to group this kind of data
[Posted]: 2016-04-05 13:12:35
[Question description]:

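In case my situation is not clear from the text alone, I made a visual illustration of it.

I know that a user (whoever it is, we don't care) likes an item (i1).

We want to recommend other items:

i1 is similar to i2 according to a specific criterion (so there is a similarity value, let's call it s1)

i1 is also similar to i2 according to another criterion (so there is a similarity value, let's call it s2)

i1 is also similar to i2 according to a third criterion (so there is a similarity value, let's call it s3)

Now, i2 belongs to two classes, and each class affects the similarity by a specific weight.

My problem

I want to calculate the final similarity between i1 and i2. I have done almost everything except applying the weight of the specific class.

My problem is that this weight should not be applied once per criterion that led to selecting i2. In other words, if i2 was selected 1000 times through 1000 criteria, and i2 belongs to a specific class, then that class's weight must be applied only once, not 1000 times. If i2 belongs to two classes, each of the two weights must be applied only once, regardless of how many criteria led to selecting i2.

Now

To make it easier for you to help me, I wrote this query (it has to be fairly long to show the case), and I made it select only the required information, so you can add another layer of select on top of it.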

prefix : <http://example.org/rs#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>


select  ?item ?similarityValue ?finalWeight where {
  values ?i1 {:i1}
  ?i1 ?similaryTo ?item .
  ?similaryTo :hasValue ?similarityValue .
  optional{
    ?item :hasContextValue ?weight .
  }
  bind (if(bound(?weight), ?weight, 1) as ?finalWeight)
}

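In the result of that query, item i2 is repeated 6 times (as expected), with three different similarity values (as expected, because of the three different criteria), and with finalWeight, i.e. the weight, repeated for each criterion.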
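The row count follows from how the two patterns join: each of the three similarity values for i2 is paired with each of its two context values. Below is a minimal Python sketch of that pairing, using values copied from the sample data. The variable names are illustrative only and are not part of the query.

```python
from decimal import Decimal
from itertools import product

# Values copied from the question's sample data for item :i2.
similarities = [Decimal("0.5"), Decimal("0.6"), Decimal("0.7")]  # one per criterion
weights = [Decimal("0.1"), Decimal("0.4")]  # :hasContextValue, one per class

# Joining the similarity pattern with the OPTIONAL weight pattern
# pairs every similarity value with every weight.
rows = list(product(similarities, weights))

for similarity, weight in rows:
    print("i2", similarity, weight)

print(len(rows))  # 3 similarities x 2 weights = 6 rows
```

This is why summing ?finalWeight directly over these rows would count each weight three times, once per criterion.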

Finally

Here is the data

@prefix : <http://example.org/rs#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:i1 :similaryTo1 :i2 .
:similaryTo1 :hasValue 0.5 .
:i1 :similaryTo2 :i2 .
:similaryTo2 :hasValue 0.6 .
:i1 :similaryTo3 :i2 .
:similaryTo3 :hasValue 0.7 .
:i2 :hasContextValue 0.1 .
:i2 :hasContextValue 0.4 .
:i1 :similaryTo4 :i3 .
:similaryTo4 :hasValue 0.5 .

I hope you can help me. I would really appreciate it.

So, what I want to do

Suppose there were no weights at all. Then my query would be:

prefix : <http://example.org/rs#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select  ?item ?similarityValue  where {
  values ?i1 {:i1}
  ?i1 ?similaryTo ?item .
  ?similaryTo :hasValue ?similarityValue .

}

The result would be:

Then I aggregate per item, summing the similarity values, like this:

prefix : <http://example.org/rs#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select  ?item (SUM(?similarityValue) as ?sumSimilarities)  where {
  values ?i1 {:i1}
  ?i1 ?similaryTo ?item .
  ?similaryTo :hasValue ?similarityValue .
}
group by ?item

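The result is:

What I want is to multiply each row of this result by the sum of the two weights associated with ?item, which is (0.1 * 0.4) for i2 and (1) for i3

Note that some items do not have two weights: some have only one, and some have none. Also note that even for items that have two, the two values may be equal, so be careful if you use distinct here.

Finally, I keep saying two only as an example. In real life the number comes from a dynamic system.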
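The target numbers can be worked out by hand before writing any SPARQL. The sketch below is one reading of the requirement, assuming that the weights of an item are combined by multiplication (as the "(0.1 * 0.4)" in the text and the asker's later comment suggest) and that an item without weights gets a factor of 1. The function name `final_similarity` is hypothetical.

```python
from decimal import Decimal
from math import prod

# Similarity values per item (one per criterion) and context weights per item,
# copied from the question's sample data. :i3 has no weights.
similarities = {
    "i2": [Decimal("0.5"), Decimal("0.6"), Decimal("0.7")],
    "i3": [Decimal("0.5")],
}
weights = {"i2": [Decimal("0.1"), Decimal("0.4")]}


def final_similarity(item):
    # Assumption: weights combine by multiplication, each applied once,
    # and an item with no weights gets the neutral factor 1.
    factor = prod(weights.get(item, []), start=Decimal(1))
    return sum(similarities[item]) * factor


print(final_similarity("i2"))  # (0.5 + 0.6 + 0.7) * (0.1 * 0.4)
print(final_similarity("i3"))  # 0.5 * 1
```

The weights are kept in a list rather than a set, so two equal weights on the same item are both applied, which is the caveat about distinct raised above.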

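Update: after @Joshua Taylor's answer, I understood his sample data as:

[Discussion]:

This query is too long and hard to analyze. Just as a comment, and possibly off-topic: you can make it more compact by using property paths in place of chains of triple patterns whose intermediate nodes you don't need, e.g. instead of ?s :p1 ?o1 . ?o1 :p2 ?o2 . you can write ?s :p1/:p2 ?o - of course only if your triple store supports SPARQL 1.1
@AKSW I'll try to explain the query. As I said in the question, there are some criteria for selecting items similar to the input item. The first 4 blocks of the query only handle those criteria, so basically you are left with just one block, which adds this weight (userContextFinalValue). I can give more description if you need it. I really did reduce the code to the smallest case in which I can still show the situation.
@AniaDavid But you don't need such complex data or such a complex query to reproduce the problem. Doesn't the sample data I created in this answer to your earlier question, sparql how to group correctly this data, illustrate the problem there? Isn't this pretty much the same situation?
I think that in the part of this question headed "My problem" you are actually quite close to describing what you are trying to do. I think that if you sit down and work out the actual mathematical formula you want to implement, the query will become much simpler.
It sounds like you might want something like sumOver(property, propertyWeight * sumOver(similarItem, similarItemSimilarity)). I think that once you pin down the formula, the SPARQL query will follow directly from it.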

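Tags: sparql rdf aggregation semantic-web ontology

[Solution 1]:

Some data

First, some data we can work with. The item :a has a number of similarity links, each of which specifies an item and a reason. :a may be similar to an item for several different reasons, and there may even be duplicate similarities with the same item and reason. So far, I think this matches your use case. (Sample data in the question would make this clearer, but I think it is in line with what you have.) Then, each item has context values, and each reason has an optional weight.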

@prefix : <urn:ex:> .

:a :similarTo [ :item :b ; :reason :p ] ,
              [ :item :b ; :reason :p ] , # a duplicate
              [ :item :b ; :reason :q ] ,
              [ :item :b ; :reason :r ] ,
              [ :item :c ; :reason :p ] ,
              [ :item :c ; :reason :q ] ,
              [ :item :d ; :reason :r ] ,
              [ :item :d ; :reason :s ] .

:b :context 0.01 .
:b :context 0.02 .
:c :context 0.04 .
:d :context 0.05 .
:e :context 0.06 . # not used

:p :weight 0.1 .
:q :weight 0.3 .
:r :weight 0.5 .
# no weight for :s
:t :weight 0.9 . # not used

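It sounds like what you want to do is compute the sum of the context values of the similar items, including the context value for each occurrence, but sum the reason weights only over distinct occurrences. If that is the correct understanding, then I think you want something like the following.

Getting the reason weights

The first step is to be able to get the sum of the weights of the distinct reasons for each similar item.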

prefix : <urn:ex:>

select * where {
  values ?i { :a }

  #-- get the sum of weights of distinct reasons
  #-- for each item that is similar to ?i.
  { select ?item (sum(?weight) as ?propertyWeight) {
      #-- get the distinct properties for each ?item
      #-- along with their weights.
      { select distinct ?item ?property ?weight {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
          optional { ?property :weight ?weight_ }
          bind(if(bound(?weight_), ?weight_, 0.0) as ?weight)
        } }
    }
    group by ?item
  }
}

------------------------------
| i  | item | propertyWeight |
==============================
| :a | :b   | 0.9            |
| :a | :c   | 0.4            |
| :a | :d   | 0.5            |
------------------------------
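The table can be checked outside the triple store. The following Python sketch mirrors the nested subqueries on the sample data: de-duplicate the (item, reason) pairs first, then sum the weights per item, treating a reason without a weight as 0.0. It assumes each reason has at most one weight, as in the sample data. The function name `property_weights` is illustrative.

```python
from decimal import Decimal

# (item, reason) links of :a, copied from the sample data,
# including the duplicated (:b, :p) link.
links = [("b", "p"), ("b", "p"), ("b", "q"), ("b", "r"),
         ("c", "p"), ("c", "q"), ("d", "r"), ("d", "s")]

# Reason weights from the sample data. :s has none and :t is unused.
reason_weight = {"p": Decimal("0.1"), "q": Decimal("0.3"),
                 "r": Decimal("0.5"), "t": Decimal("0.9")}


def property_weights(links, reason_weight):
    totals = {}
    # set(links) plays the role of SELECT DISTINCT over (item, reason).
    for item, reason in set(links):
        # A reason without a weight counts as 0.0, like the bind(if(bound(...))).
        weight = reason_weight.get(reason, Decimal("0.0"))
        totals[item] = totals.get(item, Decimal(0)) + weight
    return totals


totals = property_weights(links, reason_weight)
print(totals["b"], totals["c"], totals["d"])
```

Without the de-duplication step, the repeated (:b, :p) link would add 0.1 a second time and :b would come out as 1.0 instead of 0.9.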

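Getting the context values of the items

Now you still need to compute the sum of the context values for each item, counting every occurrence. So we extend the query: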

prefix : <urn:ex:>

select * where {
  values ?i { :a }

  #-- get the sum of weights of distinct reasons
  #-- for each item that is similar to ?i.
  { select ?item (sum(?weight) as ?propertyWeight) {
      #-- get the distinct properties for each ?item
      #-- along with their weights.
      { select distinct ?item ?property ?weight {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
          optional { ?property :weight ?weight_ }
          bind(if(bound(?weight_), ?weight_, 0.0) as ?weight)
        } }
    }
    group by ?item
  }

  #-- get the sum of the context values
  #-- for each item.
  { select ?item (sum(?context_) as ?context) {
      ?item :context ?context_ .
    }
    group by ?item
  }
}

----------------------------------------
| i  | item | propertyWeight | context |
========================================
| :a | :b   | 0.9            | 0.03    |
| :a | :c   | 0.4            | 0.04    |
| :a | :d   | 0.5            | 0.05    |
----------------------------------------

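Note that it is fine to search for ?item :context ?context_ . in the second subquery without even ensuring that ?item is one of the similar items. Since the results of the two subqueries are joined, we only get results for the ?item values that the first subquery also returned.

Putting it together

Now you can add, multiply, or do whatever else you like to combine the sum of the reason weights with the sum of the context values. For instance, if you add them: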

prefix : <urn:ex:>

select ?i ?item ((?propertyWeight + ?context) as ?similarity) where {
  values ?i { :a }

  #-- get the sum of weights of distinct reasons
  #-- for each item that is similar to ?i.
  { select ?item (sum(?weight) as ?propertyWeight) {
      #-- get the distinct properties for each ?item
      #-- along with their weights.
      { select distinct ?item ?property ?weight {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
          optional { ?property :weight ?weight_ }
          bind(if(bound(?weight_), ?weight_, 0.0) as ?weight)
        } }
    }
    group by ?item
  }

  #-- get the sum of the context values
  #-- for each item.
  { select ?item (sum(?context_) as ?context) {
      ?item :context ?context_ .
    }
    group by ?item
  }
}

--------------------------
| i  | item | similarity |
==========================
| :a | :b   | 0.93       |
| :a | :c   | 0.44       |
| :a | :d   | 0.55       |
--------------------------
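As a cross-check of the table, here is a small Python sketch of the second subquery and the join, using the per-item reason weights from the earlier result and the context values from the sample data. The dictionary names are illustrative. The join keeps only items that appear in both subquery results, which is why :e drops out.

```python
from decimal import Decimal

# Result of the first subquery (sum of distinct reason weights per item).
property_weight = {"b": Decimal("0.9"), "c": Decimal("0.4"), "d": Decimal("0.5")}

# All :context triples from the sample data, including the unused :e.
contexts = [("b", Decimal("0.01")), ("b", Decimal("0.02")),
            ("c", Decimal("0.04")), ("d", Decimal("0.05")),
            ("e", Decimal("0.06"))]

# Second subquery: sum the context values per item, counting every occurrence.
context_sum = {}
for item, value in contexts:
    context_sum[item] = context_sum.get(item, Decimal(0)) + value

# Join on item, then combine the two sums by addition as in the query.
similarity = {item: property_weight[item] + context_sum[item]
              for item in property_weight if item in context_sum}

for item in sorted(similarity):
    print(item, similarity[item])
```

Swapping the `+` in the last expression for another operator corresponds to changing the projection expression in the outer select.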

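Final cleanup

Looking at that last query, two things bother me a little. The first is that we retrieve the reason weight for every solution in the innermost subquery, whereas we only need to retrieve it once per property per item. That is, we can move the optional part out of the innermost subquery into the subquery that encloses it. The second is that we have a bind that sets a variable we only use in the aggregation. We can replace it by summing coalesce(?weight,0.0), which uses ?weight if it is bound and 0.0 otherwise. After making those changes, we end up with: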

prefix : <urn:ex:>

select ?i ?item ((?propertyWeight + ?context) as ?similarity) where {
  values ?i { :a }

  #-- get the sum of weights of distinct properties
  #-- using 0.0 as the weight for a property that doesn't
  #-- actually specify a weight.
  { select ?item (sum(coalesce(?weight,0.0)) as ?propertyWeight) {

      #-- get the distinct properties for each ?item.
      { select distinct ?item ?property {
          ?i :similarTo [ :item ?item ; :reason ?property ] .
        } }

      #-- then get each property's optional weight.
      optional { ?property :weight ?weight }
    }
    group by ?item
  }

  #-- get the sum of the context values
  #-- for each item.
  { select ?item (sum(?context_) as ?context) {
      ?item :context ?context_ .
    }
    group by ?item
  }
}

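It's not a huge change, but I think it makes things a bit cleaner and easier to understand.

It has almost become my mantra at this point, but these kinds of questions are much easier to answer if sample data is provided. In this case, most of the actual mechanics of how you obtain these values in the first place don't really matter. What matters is how you aggregate them afterwards. That is why we can use very simple data, like the data I created from scratch at the beginning of this answer.

The big takeaway, though, I think, is that one of the important techniques in using SPARQL (and other query languages, I expect) is to write separate subqueries and join their results. In this case we ended up with several subqueries because we really did need to group in several different ways. This could be simpler if SPARQL provided a distinct by operator, so that we could say something like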

sum(distinct by(?property) ?weight)

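But that raises a problem: if a distinct property could have more than one weight, which of those weights would you choose? So the solution really does seem to be several subqueries, so that we can do several different kinds of grouping. This is why I was asking about the actual formula that you are trying to compute.

[Discussion]:

Hello, and thanks for your effort. First, I have to say I was wrong to write "What I want is to multiply each row of this result by the sum of the two weights associated with ?item, which is (0.1 * 0.4) for i2 and (1) for i3". I should have said multiply instead of sum, but I don't think that changes much. Second, you understood me, but the other way round. I'll explain in the next comment.
Please check the updated image. You said "It sounds like what you want to do is compute the sum of the context values of the similar items, including the context value for each occurrence, but sum the reason weights only over distinct occurrences", but what I actually want is: compute the sum of the reason values (all of them) and multiply it by each context value of the item. If you look at the image I added, I want (:r weight + :p weight + :q weight) * (:b context * :b context) = (0.5 + 0.1 + 0.3) * (0.01 * 0.02)
But please don't change (update) the answer, I'm still reading it.
@Ania If I understand you, that just means you would replace + with * in my answer.
And change the default value from 0 to 1, I guess, but I'm still reading it.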
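A quick arithmetic check of the two readings discussed in these comments, for item :b in the answer's sample data. This is only a sketch of the numbers, not a statement of which formula is intended. `asker_value` follows the expression written out in the comment (sum of reason weights times the product of the context values), while `swapped_value` is what replacing + with * in the answer's query would give, since that query sums the context values. Note that SPARQL 1.1 has SUM but no built-in product aggregate, so the first reading cannot be obtained just by changing the operator.

```python
from decimal import Decimal
from math import prod

# Weights of :r, :p and :q, and the two context values of :b,
# copied from the answer's sample data.
reason_weights = [Decimal("0.5"), Decimal("0.1"), Decimal("0.3")]
contexts_b = [Decimal("0.01"), Decimal("0.02")]

weight_sum = sum(reason_weights)  # 0.5 + 0.1 + 0.3

# Formula as written in the asker's comment: product of the context values.
asker_value = weight_sum * prod(contexts_b, start=Decimal(1))

# Answer's query with + replaced by *: sum of the context values.
swapped_value = weight_sum * sum(contexts_b)

print(asker_value, swapped_value)
```

The two values differ (0.9 * 0.0002 versus 0.9 * 0.03), so the choice between summing and multiplying the context values matters for the final ranking.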