【Question Title】:Recommender System using SciKit-Learn's cross_validate, missing 1 required positional argument: 'y_true'
【Posted】:2018-08-08 19:20:00
【Question】:

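I am building a recommender system for a local newspaper (as a school project), and I run into trouble when I try to use the cross_validate function from the model_selection library.

I am trying to use SVD and get an F1 score, but I am a bit confused. This is unsupervised learning and I have no test set, so I want to cross-validate with k-folding. I believe the number of folds is set by the "cv" parameter of the cross_validate function. Is that correct?

The problem appears when I run the code, because I get the following stack trace: https://hastebin.com/kidoqaquci.tex

I am not passing anything to the "y" parameter of cross_validate. Is that wrong? Isn't that where the test set is supposed to go? As I said, I have no test set, because this is unsupervised. I looked at the example in section 3.1.1.1 here: http://scikit-learn.org/stable/modules/cross_validation.html

It looks like they pass a "target" for the data set to cross_validate. But why do they pass both a target set and the cv parameter? Doesn't a cv value greater than 1 mean that k-folding should be used, with the left-out fold serving as the target (test) set?

Or am I completely misunderstanding something? Why do I get the "missing argument" error in the stack trace?

Here is the failing code: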

from sklearn.model_selection import cross_val_score as cv
from sklearn.decomposition import TruncatedSVD
import pandas as pd

# keywords_data_filename = 'keywords_data.txt'
active_data_filename = 'active_time_data.txt'

header = ['user_id', 'item_id', 'rating']
# keywords_data = pd.read_csv(keywords_data_filename, sep='*', names=header, engine='python')
active_time_data = pd.read_csv(active_data_filename, sep='*', names=header, engine='python')


# Number of users in current set
print('Number of unique users in current data-set', active_time_data.user_id.unique().shape[0])
print('Number of unique articles in current data-set', active_time_data.item_id.unique().shape[0])

# SVD allows us to look at our input matrix as a product of three smaller matrices; U, Z and V.
# In short this will help us discover concepts from the original input matrix,
# (subsets of users that like subsets of items)
# Note that use of SVD is not strictly restricted to user-item matrices
# https://www.youtube.com/watch?v=P5mlg91as1c

algorithm = TruncatedSVD()

# Finally we run our cross validation in n folds, where n is denoted by the cv parameter.
# Verbose can be adjusted by an integer to determine level of verbosity.
# We pass in our SVD algorithm as the estimator used to fit the data.
# X is our data set that we want to fit.
# Since our estimator (the SVD algorithm) has no score method, we must either define our own estimator, or we can
# simply define how to score the fitting.
# Since we currently evaluate the enjoyment of our users per article highly binary, (Please see the rate_article fn in
# the filter script), we can easily decide our precision and recall based on whether or not our prediction exactly
# matches the binary rating field in the test set.
# This, the F1 scoring metric seems an intuitive choice for measuring our success, as it provides a balanced score
# based on the two.

cv(estimator=algorithm, X=active_time_data, scoring='f1', cv=5, verbose=True)

【Comments】:

    Tags: python scikit-learn


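    【Answer 1】:

    There are several problems here:

    1) TruncatedSVD is a dimensionality reduction algorithm, so I don't see how you intend to compute an f1_score from it.

    2) f1_score is traditionally used for classification tasks and is defined as:

    f1 = 2 * (precision * recall) / (precision + recall)

    where recall and precision are defined in terms of true positives, false positives and false negatives, which in turn requires knowing both the true classes and the predicted classes.

    3) cv = 1 makes no sense. In cross_val_score, cv is the number of folds, so cv = 5 means that in each fold 80% of the data is used for training and 20% for testing. How do you then plan to test on that data without some kind of ground-truth labels?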
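To make the three points concrete, here is a minimal sketch of a setup in which scoring='f1' does work. It is not the asker's pipeline: the synthetic data from make_classification, the LogisticRegression step and n_components=5 are all assumptions chosen for illustration. The point is that y carries the true label of every row, while cv only decides which rows are held out in each fold, so both are needed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (an assumption): 200 rows, 20 features, binary labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# With 5 folds, each split trains on 80% of the rows and tests on the other 20%.
fold_sizes = [(len(train), len(test)) for train, test in KFold(n_splits=5).split(X)]

# TruncatedSVD only reduces the features; the classifier on top supplies predict(),
# which an F1 scorer needs. The classifier choice here is hypothetical.
model = make_pipeline(
    TruncatedSVD(n_components=5, random_state=0),
    LogisticRegression(max_iter=1000),
)

# y holds the ground-truth labels; cv=5 picks the held-out rows per fold.
# (For a classifier, an integer cv uses stratified folds.)
scores = cross_val_score(model, X, y, scoring="f1", cv=5)

# F1 computed from its definition matches sklearn's f1_score.
model.fit(X, y)
y_pred = model.predict(X)
precision = precision_score(y, y_pred)
recall = recall_score(y, y_pred)
f1_manual = 2 * precision * recall / (precision + recall)

print(fold_sizes)
print(scores)
print(f1_manual, f1_score(y, y_pred))
```

For a recommender, the binary rating column would presumably play the role of y, with the user and item information as X; how to build those features is outside what the question shows.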
