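【Posted】: 2021-07-16 01:06:06
【Question】:
I created a cross join of a set of words to compare their similarity in Spark. However, I am trying to get rid of the duplicate entries, since the score for (word1, word2) is the same as the score for (word2, word1). My table looks like this: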
+-------+-------+-------+
| col1  | col2  | score |
+-------+-------+-------+
| word1 | word1 | 1     |
| word1 | word2 | 0.345 |
| word1 | word3 | 0.432 |
| word2 | word1 | 0.345 |
| word2 | word2 | 1     |
| word2 | word3 | 0.543 |
| word3 | word1 | 0.432 |
| word3 | word2 | 0.543 |
| word3 | word3 | 1     |
+-------+-------+-------+
Ideally, I would like a result like this, with no repeated comparisons:
+-------+-------+-------+
| col1  | col2  | score |
+-------+-------+-------+
| word1 | word1 | 1     |
| word1 | word2 | 0.345 |
| word1 | word3 | 0.432 |
| word2 | word2 | 1     |
| word2 | word3 | 0.543 |
| word3 | word3 | 1     |
+-------+-------+-------+
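One possible way to get from the first table to the second, offered as a sketch rather than a confirmed answer: because the cross join contains both orderings of every pair, keeping only the rows where col1 sorts at or before col2 leaves exactly one row per unordered pair, self-pairs included. The function names `keep_one_per_pair` and `dedupe_pairs_spark` are hypothetical, and the DataFrame is assumed to have the string columns `col1` and `col2` shown above. The plain-Python function demonstrates the rule on the sample rows; the Spark function applies the same condition with `DataFrame.filter`.

```python
# Sample rows copied from the cross-join table: (col1, col2, score).
rows = [
    ("word1", "word1", 1.0),
    ("word1", "word2", 0.345),
    ("word1", "word3", 0.432),
    ("word2", "word1", 0.345),
    ("word2", "word2", 1.0),
    ("word2", "word3", 0.543),
    ("word3", "word1", 0.432),
    ("word3", "word2", 0.543),
    ("word3", "word3", 1.0),
]


def keep_one_per_pair(rows):
    """Keep a row only when col1 sorts at or before col2 (string order)."""
    return [row for row in rows if row[0] <= row[1]]


def dedupe_pairs_spark(df):
    """Apply the same rule to a Spark DataFrame with columns col1 and col2.

    Assumes both orderings of each pair are present, as in a full cross join.
    """
    # Imported here so the plain-Python check above runs without Spark installed.
    from pyspark.sql import functions as F

    return df.filter(F.col("col1") <= F.col("col2"))


# Demonstrate the rule on the sample rows without needing a Spark session.
deduped = keep_one_per_pair(rows)
for row in deduped:
    print(row)
```

If the self-pairs are not wanted either, a strict `<` in place of `<=` would drop them. If the input might contain only one ordering of some pairs, this filter could lose them; in that case normalizing each pair with `F.least` and `F.greatest` and then calling `dropDuplicates` on the normalized columns may be a safer route.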
【Discussion】:
Tags: apache-spark pyspark cross-join