[Posted]: 2020-11-09 08:34:13
[Question]:
My JSON data looks like this:
{
  "uid": "a7f2e98835c1fb67e9aa9f1fbaae5e98",
  "gender": "F",
  "click": [
    {
      "url": "htp://abc.com/1.html?utm_campaign=397"
    },
    {
      "url": "htp://qaz.com/1.html?utm_campaign=397"
    }
  ]
}
I have a UDF that cleans a visit URL; for example, my_udf("htp://abc.com/1.html?utm_campaign=397") gives me abc.com
I want to get a DataFrame with the cleaned URLs:
uid gender urls
a7f2e98835c1fb67e9aa9f1fbaae5e98 F [abc.com,qaz.com]
My code:
from pyspark.sql import functions as F
from pyspark.sql.types import *
import re
from urllib.parse import urlparse
from urllib.request import urlretrieve, unquote

clean = F.udf (lambda z:my_udf(z), ArrayType(StringType()))

def my_udf(url):
    url = re.sub('(http(s)*://)+', 'http://', url)
    parsed_url = urlparse(unquote(url.strip()))
    if parsed_url.scheme not in ['http','https']: return None
    netloc = re.search("(?:www\.)?(.*)", parsed_url.netloc).group(1)
    if netloc is not None: return str(netloc.encode('utf8')).strip()
    return None

dataFrame = spark.read.json('1.json') \
    .withColumn("urls", clean(F.col("click.url"))) \
    .select( F.col("uid"), F.col("gender"), F.col("urls") ) \
    .show(3)
But I get the error:
TypeError: expected string or bytes-like object
What am I doing wrong?
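One likely cause, shown here as a minimal sketch in plain Python rather than Spark: selecting click.url on an array of structs yields an array of strings, so the UDF receives a whole list per row instead of a single URL, and re.sub rejects it. The variable names below are made up for illustration.

```python
import re

# What the UDF would receive for one row: the list of all URLs under "click",
# not a single string (values copied from the sample record in the question).
urls = [
    "htp://abc.com/1.html?utm_campaign=397",
    "htp://qaz.com/1.html?utm_campaign=397",
]

try:
    # Same call as the first line of my_udf, but with a list as input.
    re.sub('(http(s)*://)+', 'http://', urls)
    message = None
except TypeError as exc:
    message = str(exc)

# The message begins with the text reported in the question.
print(message)
```

If this is what happens in the Spark job, the function has to be applied to each element of the list rather than to the list itself.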
[Comments]:
-
There is a problem with your udf definition: you don't need the
lambda. Can you also show the source code of my_udf?
-
Added the code for my_udf.
-
Try
clean = F.udf (my_udf, ArrayType(StringType()))
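Dropping the lambda by itself may not remove the TypeError, because the column click.url is an array and the function would still be called with a list. A hedged sketch of one way around it is below: a per-URL cleaner plus a wrapper that maps it over the list. The names clean_url and clean_urls are hypothetical, the cleaner is a simplified variant of my_udf (it returns a plain str instead of the encoded bytes), and the sample URLs use a correctly spelled http scheme, which is an assumption.

```python
import re
from urllib.parse import unquote, urlparse


def clean_url(url):
    """Return the host without a leading 'www.', or None if not HTTP(S)."""
    if url is None:
        return None
    # Collapse repeated scheme prefixes such as 'http://http://'.
    url = re.sub(r'(https?://)+', 'http://', url.strip())
    parsed = urlparse(unquote(url))
    if parsed.scheme not in ('http', 'https'):
        return None
    netloc = re.sub(r'^www\.', '', parsed.netloc)
    return netloc or None


def clean_urls(urls):
    """Apply clean_url to every element of a list of URLs."""
    if urls is None:
        return None
    return [clean_url(u) for u in urls]


# With PySpark available, the wrapper could be registered like this
# (left as a comment so the sketch runs without a Spark session):
#   clean = F.udf(clean_urls, ArrayType(StringType()))
#   df = spark.read.json('1.json').withColumn("urls", clean(F.col("click.url")))

result = clean_urls([
    "http://abc.com/1.html?utm_campaign=397",
    "https://www.qaz.com/1.html?utm_campaign=397",
])
print(result)
```

Note that a URL whose scheme is neither http nor https, such as the htp:// values in the sample record, comes back as None with this check.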
Tags: apache-spark pyspark apache-spark-sql