[Posted]: 2021-11-10 16:13:30
[Question]:
I am new to PySpark and want to translate the Feature Extraction (FE) part of a Python script into PySpark. I start with a Spark dataframe, called sdf, including 2 columns A and B:
Here is an example:
| data | A | B |
|---|---|---|
| https://example1.org/path/to/file?param=42#fragment | path/to/file | param=42#fragment |
| https://example2.org/path/to/file | path/to/file | NaN |
Now I want to apply some feature engineering, extract features, and concatenate the results with sdf based on column B. So far I can do this with a pythonic script:
import re
import pandas as pd
#================================> Type <==========================================
def getType(input_value):
    if pd.isna(input_value):
        return "-"
    type_ = "-"
    if input_value.isdigit():                                    # Only numeric
        type_ = "Int"
    elif bool(re.match(r"^[a-zA-Z0-9_]+$", input_value)):        # Only letters, digits, and underscore
        type_ = "String"
    elif bool(re.match(r"^[\d+,\s]+$", input_value)):            # Only comma exists as separator "^[\d+,\s]+$"
        type_ = "Array"
    else:
        existing_separators = re.findall(r"([\+\;\,\:\=\|\\/\#\'\"\t\r\n\s])+", input_value)
        # There are one or more separators
        # when there is only one separator it is not comma (!= "^[\d+,\s]+$")
        if len(existing_separators) > 1 or (len(existing_separators) == 1 and existing_separators[0] != ","):
            type_ = "Sentence"
    return type_
#================================> Length <==========================================
#Number of characters in parameter value
getLength = lambda input_text: 0 if pd.isna(input_text) else len(input_text)
#================================> Token number <==========================================
double_separators_regex = re.compile(r"[\<\[\(\{]+[0-9a-zA-Z_\.\-]+[\}\)\]\>]+")
single_separators_regex = re.compile(r"([0-9a-zA-Z_\.\-]+)?[\+\,\;\:\=\|\\/\#\'‘’\"“”\t\r\n\s]+([0-9a-zA-Z_\.\-]+)?")
token_number = lambda input_text: 0 if pd.isna(input_text) else len(double_separators_regex.findall(input_text) + [element for pair in single_separators_regex.findall(input_text) for element in pair if element != ""])
#quick test
param_example = "url=http://news.csuyst.edu.cn/sem/resource/code/rss/rssfeed.jsp?type=list"
out = double_separators_regex.findall(param_example) + [element for pair in single_separators_regex.findall(param_example) for element in pair if element != ""]
print(out) #['url','http','news.csuyst.edu.cn','sem','resource','code','rss','rssfeed.jsp','type','list']
print(len(out)) #10
#===================================> Encoding type <============================================
import base64
def isBase64(input_value):
    try:
        # b64encode returns bytes, so decode before comparing with the str input
        return base64.b64encode(base64.b64decode(input_value)).decode("ascii") == input_value
    except Exception:
        return False
#================================> Character feature <==========================================
N = 2
n_grams = lambda input_text: 0 if pd.isna(input_text) else len(set([input_text[character_index:character_index+N] for character_index in range(len(input_text)-N+1)]))
#quick test
n_grams_example = 'zhang1997' #output = ['zh', 'ha', 'an', 'ng', 'g1', '19', '99', '97']
n_grams(n_grams_example) # 8
#frame the features
features_df = pd.DataFrame()
features_df["Type"] = df.fragment.apply(getType)
features_df["Length"] = df.fragment.apply(getLength)
features_df["Token_number"] = df.fragment.apply(token_number)
features_df["Encoding_type"] = df.fragment.apply(isBase64)
features_df["Character_feature"] = df.fragment.apply(n_grams)
features_df.columns #Index(['Type', 'Length', 'Token_number', 'Encoding_type', 'Character_feature'], dtype='object')
features_df
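Question: what is the best approach to translate the FE without converting the Spark dataframe to a Pandas dataframe via toPandas(), so that the pipeline is optimized and processed 100% in Spark?
I am happy to provide a colab notebook for quick debugging and comments.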
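To make the goal more concrete, here is a minimal sketch of one possible direction, not a definitive solution. It assumes missing values in column B arrive as Spark null (so a Python UDF receives None) and that PySpark is installed. The helper name add_features is hypothetical. Length uses the built-in length function, while Character_feature wraps a pandas-free version of n_grams in a UDF, which avoids toPandas() but still runs Python code on the executors.

```python
from typing import Optional

N = 2  # n-gram size, same value as in the pandas script


def n_grams(text: Optional[str]) -> int:
    """Count distinct character N-grams; a null input (None) counts as 0."""
    if text is None:
        return 0
    return len({text[i:i + N] for i in range(len(text) - N + 1)})


def add_features(sdf, column: str = "B"):
    """Hypothetical helper: append Length and Character_feature to a Spark dataframe."""
    # Imported lazily so the plain-Python part can be used without PySpark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    n_grams_udf = F.udf(n_grams, IntegerType())
    return (
        sdf
        # length() yields null for null input, so coalesce maps it to 0.
        .withColumn("Length", F.coalesce(F.length(F.col(column)), F.lit(0)))
        .withColumn("Character_feature", n_grams_udf(F.col(column)))
    )
```

With this sketch the new columns are integers rather than the doubles shown in the expected output; a cast to double could be added if that matters.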
The expected output, as a Spark dataframe, is shown below:
+--------------------+------------+-----------------+--------+------+-------------+--------------+-----------------+
|data |A |B |Type |Length|Token_number |Encoding_type |Character_feature|
+--------------------+------------+-----------------+--------+------+-------------+--------------+-----------------+
|https://example1....|path/to/file|param=42#fragment|Sentence|17.0 |3.0 |False |15.0 |
|https://example2....|path/to/file|Null |- |0.0 |0.0 |False |0.0 |
+--------------------+------------+-----------------+--------+------+-------------+--------------+-----------------+
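As a sanity check on the table above, the five values in each row can be reproduced with plain Python, without pandas. This is only a sketch under some assumptions: None stands in for Null, the snake_case function names are mine, the curly-quote characters are left out of the separator class, and the Base64 check decodes the re-encoded bytes before comparing them with the input string.

```python
import base64
import re

N = 2
DOUBLE_SEP = re.compile(r"[\<\[\(\{]+[0-9a-zA-Z_\.\-]+[\}\)\]\>]+")
SINGLE_SEP = re.compile(
    r"([0-9a-zA-Z_\.\-]+)?[\+\,\;\:\=\|\\/\#\'\"\t\r\n\s]+([0-9a-zA-Z_\.\-]+)?"
)


def get_type(value):
    if value is None:
        return "-"
    if value.isdigit():
        return "Int"
    if re.match(r"^[a-zA-Z0-9_]+$", value):
        return "String"
    if re.match(r"^[\d+,\s]+$", value):
        return "Array"
    separators = re.findall(r"([\+\;\,\:\=\|\\/\#\'\"\t\r\n\s])+", value)
    if len(separators) > 1 or (len(separators) == 1 and separators[0] != ","):
        return "Sentence"
    return "-"


def token_number(value):
    if value is None:
        return 0
    singles = [tok for pair in SINGLE_SEP.findall(value) for tok in pair if tok]
    return len(DOUBLE_SEP.findall(value) + singles)


def is_base64(value):
    try:
        return base64.b64encode(base64.b64decode(value)).decode("ascii") == value
    except Exception:
        return False


def n_grams(value):
    if value is None:
        return 0
    return len({value[i:i + N] for i in range(len(value) - N + 1)})


def features(value):
    """Return (Type, Length, Token_number, Encoding_type, Character_feature)."""
    length = 0 if value is None else len(value)
    return (get_type(value), length, token_number(value), is_base64(value), n_grams(value))


print(features("param=42#fragment"))  # ('Sentence', 17, 3, False, 15)
print(features(None))                 # ('-', 0, 0, False, 0)
```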
[Discussion]:
Tags: apache-spark pyspark feature-extraction