How to group URLs together? Answers

【Question title】: How to group URLs together?
【Posted】: 2021-10-18 13:18:24
【Question】:

Suppose I have a dataset like this:

url	event	timestamp
https://name/view/adsd-12455-adf1/show	view	1
https://name/close/anotherpage/12dgksdfgas-adsjasf-54a4551/	close	2
https://name/close/anotherpage/98713sdfspdf-asdaj-1/	close	3
https://name/view/45asdaj-asdasd-lf5633/show	view	6
https://name/close/anotherpage/89kkpi-fslo-s521344/	close	10
https://name/view/124sfsdf-ajsdasd-551/delete	view	20
https://name/purchase/6654asdasd-asdfd-asda12	purchase	30

Assume all my URLs follow a format like https://name/pagename/anotherpage/123456/randompage, where 123456 is a user ID embedded in the URL and is unique to each user. (Note: I use 123456, a purely numeric ID, only as an example; real user IDs can also contain letters.)

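I want to add another column to the dataset that groups together URLs referring to the same page but for different users. The table would then look like this: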

url	event	timestamp	inferred url
https://name/view/adsd-12455-adf1/show	view	1	https://name/view/*/show
https://name/close/anotherpage/12dgksdfgas-adsjasf-54a4551/	close	2	https://name/close/anotherpage/*/
https://name/close/anotherpage/98713sdfspdf-asdaj-1/	close	3	https://name/close/anotherpage/*/
https://name/view/45asdaj-asdasd-lf5633/show	view	6	https://name/view/*/show
https://name/close/anotherpage/89kkpi-fslo-s521344/	close	10	https://name/close/anotherpage/*/
https://name/view/124sfsdf-ajsdasd-551/delete	view	20	https://name/view/*/delete
https://name/purchase/6654asdasd-asdfd-asda12	purchase	30	https://name/purchase/*

How can I do this in PySpark? I need help figuring out how to approach this problem.

Edit: Steven was right that my sample data was too simple, so I changed it a bit.

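【Comments】:

Have you tried doing this with a regular expression?
Can page names contain digits? For example, could a URL be https://name/purchase_2/665412? If you cannot come up with a simple rule that identifies the user ID, you are stuck. I also think you have oversimplified your problem; you should add some real data, otherwise we may give you an incomplete solution.
@eshirvana Sadly, I am dealing with too many pages to write a regular expression for each one.
@Steven As far as I know, when a URL contains a user ID, the other page names in that URL contain no digits, but there can be URLs like name/pagename/2020/view where the number refers to a year. I am not allowed to publicly post the data I am working with.
@user16679629 Sorry, in that case I don't think anyone can help you. If you manage to define a clear rule that separates page IDs from user IDs, maybe we can do something, but for now it is not possible.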
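
As an illustration of what such an explicit rule might look like, here is a minimal plain-Python sketch. It assumes, based only on the comments above, that a path segment is a user ID when it contains at least one digit and is not a bare four-digit year. The names `is_user_id` and `infer_url` and the 1900-2099 year range are hypothetical choices for this example, not something the asker confirmed.

```python
import re

# Assumption: a "year" segment is exactly four digits in the range 1900-2099.
YEAR = re.compile(r"(19|20)\d{2}")


def is_user_id(segment: str) -> bool:
    """Hypothetical rule: the segment has a digit and is not a bare year."""
    has_digit = any(ch.isdigit() for ch in segment)
    return has_digit and YEAR.fullmatch(segment) is None


def infer_url(url: str) -> str:
    """Replace every path segment that looks like a user ID with '*'."""
    scheme, sep, rest = url.rpartition("://")
    parts = rest.split("/")
    # parts[0] is the host name; only the path segments are candidates.
    parts[1:] = ["*" if is_user_id(p) else p for p in parts[1:]]
    return scheme + sep + "/".join(parts)


for url in [
    "https://name/view/adsd-12455-adf1/show",
    "https://name/pagename/2020/view",
]:
    print(url, "->", infer_url(url))
```

A rule like this misreads a user ID that happens to be a plain four-digit number in that range, so it only helps if such IDs do not occur in the real data.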

Tags: sql dataframe apache-spark pyspark apache-spark-sql

【Solution 1】:

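If you only want a solution for the sample case, it is just a regular expression. The output below uses the question's original, purely numeric sample data: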

from pyspark.sql import functions as F

# Replace every run of digits with a literal "*".
df.withColumn("inferred_url", F.regexp_replace("url", r"\d+", "*")).show(
    truncate=False
)
+--------------------------------------+--------+---------+---------------------------------+
|url                                   |event   |timestamp|inferred_url                     |
+--------------------------------------+--------+---------+---------------------------------+
|https://name/view/124551/show         |view    |1        |https://name/view/*/show         |
|https://name/close/anotherpage/124551/|close   |2        |https://name/close/anotherpage/*/|
|https://name/close/anotherpage/987131/|close   |3        |https://name/close/anotherpage/*/|
|https://name/view/455633/show         |view    |6        |https://name/view/*/show         |
|https://name/close/anotherpage/891344/|close   |10       |https://name/close/anotherpage/*/|
|https://name/view/124551/delete       |view    |20       |https://name/view/*/delete       |
|https://anothername/purchase/665412   |purchase|30       |https://anothername/purchase/*   |
+--------------------------------------+--------+---------+---------------------------------+

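If you want to add a little complexity and assume that a user ID contains at least one digit, possibly along with other characters, you can do this: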

# Replace any path segment that contains at least one digit with "*".
df.withColumn(
    "inferred_url", F.regexp_replace("url", r"/[^/]*\d[^/]*", "/*")
).show(truncate=False)
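
To see what this segment-level pattern does without a Spark session, here is a small sketch that applies the same regular expression with Python's `re` module to a few URLs from the question. The helper name `infer_url` is made up for this illustration, and Python's regex engine stands in for the Java one that Spark uses, so treat it as a sanity check of the pattern rather than of the Spark call itself.

```python
import re

# Same idea as the Spark call: a path segment containing at least one digit.
SEGMENT_WITH_DIGIT = re.compile(r"/[^/]*\d[^/]*")


def infer_url(url: str) -> str:
    """Replace every digit-bearing path segment with '*'."""
    return SEGMENT_WITH_DIGIT.sub("/*", url)


samples = [
    "https://name/view/adsd-12455-adf1/show",
    "https://name/close/anotherpage/12dgksdfgas-adsjasf-54a4551/",
    "https://name/purchase/6654asdasd-asdfd-asda12",
]
for url in samples:
    print(url, "->", infer_url(url))
```

Note that the pattern also rewrites page names or host names that contain a digit, which is the limitation raised in the question's comments.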

A version that leaves segments containing "a2z" untouched:

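import re

from pyspark.sql import functions as F

# A path segment that contains at least one digit; compiled once, not per row.
USER_ID_RE = re.compile(r"/([^/]*\d[^/]*)")


@F.udf
def infer_url(url):
    if url is None:
        return None
    # Replace each matching segment in place, unless it contains "a2z".
    # Substituting at the match position avoids touching other parts of the URL.
    return USER_ID_RE.sub(
        lambda m: m.group(0) if "a2z" in m.group(1) else "/*", url
    )


df.withColumn("user_id", infer_url(F.col("url"))).show(truncate=False)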
+---------------------------------------+--------+---------+----------------------------------+
|url                                    |event   |timestamp|user_id                           |
+---------------------------------------+--------+---------+----------------------------------+
|https://name/view/124551/show          |view    |1        |https://name/view/*/show          |
|https://name/close/anotherpage/124551/ |close   |2        |https://name/close/anotherpage/*/ |
|https://name/close/anotherpage/987131/ |close   |3        |https://name/close/anotherpage/*/ |
|https://name/view/455633/show          |view    |6        |https://name/view/*/show          |
|https://name/close/anotherpage/891344/ |close   |10       |https://name/close/anotherpage/*/ |
|https://name/view/124551/delete        |view    |20       |https://name/view/*/delete        |
|https://anothername/purchase/665412    |purchase|30       |https://anothername/purchase/*    |
|https://anothername/purchase_a2z/665412|purchase|31       |https://anothername/purchase_a2z/*|
+---------------------------------------+--------+---------+----------------------------------+
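
The same exception can probably be expressed without a UDF by adding a negative lookahead to the pattern. The sketch below checks the idea with Python's `re` module; the pattern and the helper name `infer_url` are my own illustration. Java's regex engine, which Spark's `regexp_replace` relies on, also supports negative lookahead, so passing the same pattern to `F.regexp_replace("url", PATTERN, "/*")` should behave the same way, but that call has not been run here.

```python
import re

# A digit-bearing path segment, skipped when that segment contains "a2z".
# (?![^/]*a2z) is a negative lookahead that cannot look past the next "/".
PATTERN = r"/(?![^/]*a2z)[^/]*\d[^/]*"


def infer_url(url: str) -> str:
    """Replace digit-bearing segments with '*', except those containing 'a2z'."""
    return re.sub(PATTERN, "/*", url)


print(infer_url("https://anothername/purchase_a2z/665412"))
print(infer_url("https://name/view/124551/show"))
```

Keeping the logic in a single regular expression lets Spark evaluate it natively instead of shipping every row through a Python UDF.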

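【Comments】:

Yes, you are right. The example I posted was too simple. But thanks for your help. What should I do if the user ID contains letters as well as digits?
@user16679629 F.regexp_replace("url", r"/[^/]*\d[^/]*", "/*")
Great! This actually almost solves my problem! Thanks! Is it also possible to add an exception to the regex so that if it sees "a2z" somewhere, it does not replace the text with *? I am not very familiar with regular expressions, but if that is possible it would solve 99% of the problem (or leave what I consider a reasonable error rate).
@user16679629 Every case you add changes the code substantially... there is no generic solution, especially not with a regex.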