【Question title】: How to use glmnet in R for classification of multiple classes
【Posted】: 2017-11-29 10:06:51
【Question】:

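I am trying to figure out how to use glmnet to classify text.

I managed to get it working for two classes with

family = "binomial"

type.measure = "auc"

I want to do the same for multiple classes using the multinomial family.

I tried something like this: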

NFOLDS = 4;
glmnet_classifier = cv.glmnet(x = dtm_train, y = classes,
                              family = 'multinomial',
                              alpha = 1,
                              type.measure = "class",
                              nfolds = NFOLDS,
                              thresh = 1e-3,
                              maxit = 1e3)

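Each entry/row in classes is a vector with a 1 for the class(es) the document should be assigned to and 0 for the rest, like this: [1,0,1,0,0,0,0].

I get the following error, which is understandable.

Error: from glmnet Fortran code (error code 8034); Null probability for class 34

The question is: am I creating the class vectors correctly? If not, what should they look like?

Can I use glmnet to classify text with multiple classes?

Best regards

Edit: Sample from the dataset (Swedish newspaper articles). This is a sports article, so the corresponding class should be: Basketball, or [1,0,0,0,0 ... , 0]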

Södertälje Kings är bara en seger från klubbens tredje raka SM-guld i basket. Efter idel hemmasegrar i finalserien bröt Södertälje mönstret i den fjärde matchen - och vann med 76-67 borta mot Uppsala. Inför rekordpublik, 3 004 åskådare i Fyrishov, ledde Uppsala med 32-28 i paus. Men när Södertälje inledde andra halvlek med tio raka poäng svängde initiativet i matchen. John Roberson gjorde bara tre poäng i första halvlek, men slutade matchen som bäste poängplockare med 20. Södertälje leder med 3-1 i matcher och får sin första chans att säkra guldet på onsdag kväll hemma i Täljehallen.

Partial dput(dtm_train), based on 100 articles:

dput(head(dtm_train))
new("dgCMatrix"
    , i = c(3L, 4L, 5L, 0L, 4L, 1L, 4L, 2L, 0L, 2L, 3L, 4L, 3L, 3L, 3L, 
4L, 1L, 2L, 1L, 3L, 3L, 2L, 5L, 3L, 4L, 5L, 2L, 5L, 3L, 3L, 4L, 
3L, 4L, 3L, 3L, 1L, 2L, 2L, 3L, 4L, 4L, 4L, 2L, 1L, 2L, 2L, 3L, 
3L, 3L, 3L, 5L, 0L, 2L, 4L, 3L, 0L, 1L, 3L, 4L, 2L, 4L, 0L, 0L, 
2L, 3L, 0L, 3L, 1L, 0L, 3L, 3L, 5L, 3L, 0L, 3L, 0L, 3L, 1L, 3L, 
5L, 0L, 1L, 5L, 3L, 2L, 3L, 2L, 3L, 0L, 3L, 4L, 0L, 0L, 2L, 3L, 
0L, 3L, 0L, 2L, 3L, 3L, 1L, 0L, 3L, 1L, 2L, 3L, 4L, 5L, 2L, 4L, 
5L, 0L, 2L, 3L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 3L, 5L, 3L, 5L, 
0L, 3L, 3L, 3L, 3L, 2L, 5L, 3L, 4L, 3L, 4L, 0L, 1L, 2L, 2L, 5L, 
0L, 3L, 4L, 3L, 4L, 0L, 3L, 4L, 5L, 4L, 4L, 1L, 2L, 5L, 0L, 3L, 
5L, 0L, 1L, 5L, 3L, 0L, 2L, 3L, 4L, 3L, 4L, 3L, 3L, 3L, 4L, 3L, 
1L, 3L, 4L, 4L, 3L, 1L, 1L, 4L, 2L, 3L, 1L, 4L, 4L, 5L, 4L, 3L, 
3L, 5L, 1L)
    , p = c(0L, 0L, 0L, 0L, 0L, 0L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 
7L, 7L, 8L, 8L, 8L, 12L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 
14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 16L, 
16L, 16L, 16L, 16L, 18L, 20L, 20L, 20L, 21L, 23L, 25L, 26L, 26L, 
26L, 26L, 28L, 29L, 31L, 31L, 33L, 34L, 35L, 37L, 40L, 40L, 41L, 
41L, 41L, 41L, 42L, 43L, 45L, 47L, 47L, 48L, 49L, 51L, 54L, 54L, 
55L, 56L, 59L, 61L, 62L, 63L, 64L, 65L, 65L, 65L, 65L, 67L, 68L, 
69L, 69L, 69L, 70L, 70L, 71L, 71L, 72L, 73L, 75L, 76L, 76L, 77L, 
77L, 79L, 80L, 80L, 81L, 83L, 84L, 86L, 86L, 86L, 88L, 88L, 90L, 
91L, 92L, 95L, 97L, 97L, 98L, 100L, 101L, 102L, 104L, 104L, 109L, 
112L, 115L, 115L, 118L, 123L, 124L, 124L, 124L, 125L, 125L, 127L, 
127L, 129L, 129L, 130L, 131L, 132L, 134L, 136L, 138L, 140L, 141L, 
143L, 145L, 146L, 147L, 148L, 148L, 152L, 152L, 153L, 154L, 157L, 
159L, 159L, 160L, 163L, 164L, 168L, 170L, 171L, 171L, 171L, 171L, 
171L, 172L, 173L, 174L, 175L, 175L, 178L, 178L, 178L, 178L, 178L, 
179L, 179L, 179L, 179L, 179L, 179L, 179L, 180L, 180L, 181L, 182L, 
183L, 183L, 183L, 184L, 184L, 185L, 185L, 185L, 187L, 187L, 187L, 
187L, 189L, 190L, 191L, 193L, 193L, 193L, 193L, 193L, 193L, 193L, 
193L, 194L, 194L, 194L)
    , Dim = c(6L, 223L)
    , Dimnames = list(c("4146090", "3670901", "3592883", "3662097", "3910593", 
"3617345"), c("Hancock", "London", "slutet", "nnu", "F_r", "tycker", 
"kronor", "start", "ngt", "OS", "tar", "m_ste", "lv", "sk", "ser", 
"16", "ngen", "Jag_r", "slut", "s_ger", "spelar", "k_nns", "3", 
"rde", "match", "po_ng", "po", "fj_rde", "d_rmed", "a", "r_n", 
"rbundet", "14", "EM", "pressmeddelande", "lla", "d_r", "H", 
"rande", "svarade", "rst", "nns", "sj_lv", "raka", "seger", "I", 
"matchen", "omg", "9", "spela", "fr_n", "Men", "par", "trea", 
"96", "b_ttre", "fr", "r_p", "tt", "p_plats", "h_r", "F", "2", 
"1", "se", "plats", "r_r", "6", "r_l", "Marcus", "gon", "senaste", 
"Efter", "4", "N", "l_ngre", "g_r", "Jag", "k", "s_songen", "r_v", 
"beh", "inf", "ny", "De", "nya", "Hon", "ta", "finalen", "Anna", 
"hela", "VM", "n_got", "llen", "tta", "sta", "da", "rt", "vinna", 
"r_b", "ringen", "g_ra", "ryska", "guld", "vlingar", "st_r", 
"rmed", "m_l", "nd", "rldscupen", "kt", "En", "f_re", "p_l", 
"Att", "r_s", "st", "n_sta", "Det_r", "nde", "tog", "r_f", "Vi", 
"r_h", "rnan", "ller", "ju", "matcher", "Och", "ker", "rsta", 
"t", "5", "ge", "sl", "f_rsta", "Den", "n_r", "ka", "sv", "m_nga", 
"tillbaka", "ret", "gra", "Han", "ng", "ocks", "r_g", "vlingen", 
"P", "f_rst", "laget", "vling", "t_vlingen", "nda", "TT", "jd", 
"rv", "V", "tv", "v_ldigt", "p_v", "rige", "25", "n_gra", "r_m", 
"tr", "ver", "ren", "b", "Sverige", "slutade", "Ericsson", "sekunder", 
"sj", "minuter", "ngre", "p_h", "f_rs", "riktigt", "forts", "inf_r", 
"11", "p_s", "vet", "rra", "rs", "songen", "vunnit", "tappade", 
"avg", "e", "lag", "ven", "tror", "got", "SM", "klart", "final", 
"S", "svenska", "Holmlund", "f_tt", "D", "ttre", "tid", "inleds", 
"fj", "r_klart", "ll", "ste", "re", "ra", "ldigt", "0", "b_sta", 
"Artemis", "j", "18", "nga", "vann", "nt", "Rio"))
    , x = c(0.0459223389874228, 0.0647622729309809, 0.0757236977214318, 
0.0419740104600043, 0.052313354577604, 0.0948559992442941, 0.0543657316974382, 
0.0605789581771455, 0.0629426371656586, 0.0407628507358551, 0.00648499898070422, 
0.0182910227660888, 0.0241750912448434, 0.020066135574452, 0.0155890766190175, 
0.0439691904638956, 0.0524911062249339, 0.0299949178428194, 0.0916290731874155, 
0.049979494465863, 0.0771004922254579, 0.0234565872019952, 0.0586414680049879, 
0.0294220816737799, 0.0276617861890238, 0.301436862510004, 0.022814505606222, 
0.0570362640155551, 0.0357475064795061, 0.0212942360273263, 0.0300303328590499, 
0.0450136129455133, 0.0317403681026056, 0.0126026760101808, 0.017873753239753, 
0.0654666659991881, 0.0374095234281075, 0.126129995039413, 0.040132271148904, 
0.0565967926458903, 0.0439691904638956, 0.0565967926458903, 0.0542034281395966, 
0.0948559992442941, 0.0542034281395966, 0.0657881455141156, 0.0209325917544913, 
0.020066135574452, 0.0106471180136631, 0.0351848736822438, 0.0345565723664895, 
0.0262234740965819, 0.0254742319795367, 0.0457229804760915, 0.016659831488621, 
0.0521163777038787, 0.0830365603410825, 0.0150975564256514, 0.0425828514569654, 
0.0605789581771455, 0.0543657316974382, 0.0677230909704131, 0.064919850387933, 
0.116584047344375, 0.0438937612482027, 0.0770196070578684, 0.0119030301816706, 
0.106013176810005, 0.0677230909704131, 0.020066135574452, 0.0344930906342888, 
0.0822351818926445, 0.0425631204072182, 0.297144546389207, 0.0229611694937114, 
0.0708219296662315, 0.0459223389874228, 0.0830365603410825, 0.0452926692769541, 
0.0612428010032831, 0.203169272911239, 0.106013176810005, 0.0757236977214318, 
0.0146312537494009, 0.0630649975197063, 0.020066135574452, 0.022814505606222, 
0.014518321749414, 0.0396198131754885, 0.0122461240724237, 0.050413150163406, 
0.0385098035289342, 0.0488450354947544, 0.0474494630520472, 0.0301951128513027, 
0.0396198131754885, 0.0367383722172712, 0.064919850387933, 0.0582920236721873, 
0.0185474620775141, 0.0578253691690934, 0.0916290731874155, 0.057826848716848, 
0.017873753239753, 0.0510825623765991, 0.0291900356437709, 0.0185754772278542, 
0.0261961858341534, 0.0364875445547136, 0.0611442761037827, 0.0182910227660888, 
0.0509535634198189, 0.0521163777038787, 0.101254676681821, 0.0322173971260341, 
0.066639325954484, 0.0469892683012387, 0.0654493379910111, 0.0569717141594182, 
0.0325552652339533, 0.0207169869670612, 0.0292162636714965, 0.0406940815424416, 
0.0112534032363783, 0.0788312468996329, 0.0185474620775141, 0.0728650295902341, 
0.0538994548161268, 0.033319662977242, 0.010945207312054, 0.0150975564256514, 
0.0385502461127289, 0.104718940785618, 0.0654493379910111, 0.0155890766190175, 
0.0879383809277911, 0.016659831488621, 0.0469892683012387, 0.064919850387933, 
0.110363745659486, 0.0657881455141156, 0.0630649975197063, 0.0788312468996329, 
0.064919850387933, 0.020066135574452, 0.0543657316974382, 0.020066135574452, 
0.0590406434101037, 0.0198042515665813, 0.0122426282411594, 0.0172652449554812, 
0.0480960395188404, 0.0565967926458903, 0.0617421950936377, 0.120397280432594, 
0.137596891922964, 0.0859980574518526, 0.0557976466142906, 0.0344930906342888, 
0.0481097731416646, 0.0770196070578684, 0.0654666659991881, 0.0467619042851344, 
0.0146312537494009, 0.0198042515665813, 0.0192384158075362, 0.0122426282411594, 
0.0345304899109623, 0.016108698563017, 0.045434790818766, 0.016659831488621, 
0.020066135574452, 0.0209325917544913, 0.0469892683012387, 0.0385502461127289, 
0.0830365603410825, 0.0150975564256514, 0.0425828514569654, 0.0355460092594844, 
0.0155890766190175, 0.106013176810005, 0.120397280432594, 0.0439691904638956, 
0.0396084103177112, 0.020066135574452, 0.102011041426328, 0.052313354577604, 
0.100826300326812, 0.0702183162990297, 0.050413150163406, 0.0146312537494009, 
0.0328356219361619, 0.0429990287259263, 0.0554331312260806)
    , factors = list()
)

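I can also see that I have some encoding issues; I am working on fixing them.

【Comments】:

  • Can you share a sample dataset? Could you provide the dput output of the dtm_train data.frame? That would make it easier for others to help you.
  • I don't have access to the dataset on this computer. I'll give you a sample tomorrow morning. The dtm_train dataset contains text that has been converted into vectors (bag of words). I pretty much followed this tutorial, but with my own data. cran.r-project.org/web/packages/text2vec/vignettes/…
  • @MarcoSandri Updated the post with a sample and partial dput output of dtm_train. There is a lot of output, so I took only some parts of it. Hope that's enough.
  • The dataset in its current form is not useful. Please post the full output of dput(head(dtm_train)).
  • @MarcoSandri Done

Tags: r classification text-classification glmnet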


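【Solution 1】:

I think you should use the following to train the model

library(glmnet) glmnetModel

and predict from the trained model

glm.Predict

These two steps will give you the predicted classes as output, evaluated at lambda.min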
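
Since the code in this answer is cut off, here is a minimal sketch of what the two steps could look like. It is not the answerer's original code. The data are simulated stand-ins for dtm_train and classes, and the names x, y, fit and pred are illustrative assumptions. The response is passed as a factor with one label per document.

```r
library(glmnet)
library(Matrix)

set.seed(1)
n <- 150   # number of documents (simulated)
p <- 20    # number of terms (simulated)

# Simulated stand-in for dtm_train: a numeric document-term matrix
x_dense <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Simulated stand-in for the class labels: one label per document, as a factor.
# The label is the largest of the first three columns, so there is signal to learn.
labels <- c("basketball", "football", "hockey")
y <- factor(labels[max.col(x_dense[, 1:3])])

# glmnet accepts sparse matrices from the Matrix package
x <- Matrix(x_dense, sparse = TRUE)

# Step 1: train with cross-validation, multinomial family
fit <- cv.glmnet(x, y,
                 family = "multinomial",
                 alpha = 1,
                 type.measure = "class",
                 nfolds = 4)

# Step 2: predict class labels at the lambda with the lowest CV error
pred <- predict(fit, newx = x, s = "lambda.min", type = "class")

# Share of training documents classified correctly
mean(pred == y)
```

predict() with type = "class" returns a one-column character matrix of predicted labels. In practice newx would be a held-out document-term matrix built with the same vocabulary as the training matrix.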

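【Comments】:

    【Solution 2】:

    Your class vectors are currently binary vectors, which is why the binomial version works. If you want to use the multinomial version, build a multinomial response vector.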
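
    One way to build such a response, sketched here under assumptions: the indicator matrix onehot and the class names are made up for illustration, and every document is assumed to belong to exactly one class. A row with more than one 1, like the [1,0,1,0,0,0,0] example in the question, does not map to a single label and would need to be resolved first. The sketch also checks for classes that have no documents at all, which appears to be what the "Null probability for class 34" error is reporting.

```r
# Hypothetical indicator matrix: one row per document, one column per class
onehot <- rbind(c(1, 0, 0, 0),
                c(0, 1, 0, 0),
                c(0, 0, 1, 0),
                c(1, 0, 0, 0))
colnames(onehot) <- c("basketball", "football", "hockey", "tennis")

# Classes that never occur in the training data
empty <- colnames(onehot)[colSums(onehot) == 0]

# A single-label response needs exactly one 1 per row
stopifnot(all(rowSums(onehot) == 1))

# Collapse each row to the name of its class; factor() keeps only
# the classes that actually occur, so empty ones drop out
y <- factor(colnames(onehot)[max.col(onehot, ties.method = "first")])

print(empty)
print(y)
```

    The resulting factor y can then be passed as the y argument of cv.glmnet with family = "multinomial". Note that glmnet also expects more than one observation per class, so very rare classes may need to be merged or removed.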

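    【Comments】:

    • Tried that as well. I get the following error: "Error in FUN(left, right): non-numeric argument to binary operator". I tried something like this: classes = ['class name', ... , 'class name']
    • Can you show some data? If I can see the data, maybe I can understand your problem better.
    • Updated the post with some data.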