【Question title】: Classifying identical patterns in words using R
【Posted】: 2019-02-20 02:36:31
【Question】:

I want to do a text mining analysis, but I have run into some trouble. Below is a small sample of my text, produced with dput().

text<-structure(list(ID_C_REGCODES_CASH_VOUCHER = c(3941L, 3941L, 3941L, 
3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3953L, 3953L, 
3953L, 3953L, 3953L, 3953L, 3960L, 3960L, 3960L, 3960L, 3960L, 
3960L, 3967L, 3967L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), GOODS_NAME = structure(c(19L, 
17L, 15L, 18L, 16L, 23L, 21L, 14L, 22L, 20L, 6L, 2L, 10L, 8L, 
7L, 13L, 5L, 11L, 7L, 12L, 4L, 3L, 9L, 9L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("", "* 2108609 SLOB.Mayon.OLIVK.67% 400ml", "* 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg", 
"* 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35", "* 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g", 
"197 Onion 1 kg", "2013077 MAKFA Makar.RAKERS 450g", "2030918 MARIA TRADITIONAL Biscuit 180g", 
"2049750 MAKFA Makar.SHIGHTS 450g", "3420159 LEBED.Mol.past.3,4-4,5% 900g", 
"3491144 LIP.NAP.ICE TEA green yellow 0.5 liter", "6788 MAKFA Makar.perya 450g", 
"809 Bananas 1kg", "FetaXa Cheese product 60% 400g (", "Lemons 55+", 
"MAKFA Macaroni feathers like. in / with", "Napkins paper color 100pcs PL", 
"Package \"Magnet\" white (Plastiktre)", "Pasta Makfa snail flow-pack 450 g.", 
"SHEBEKINSKIE Macaroni Butterfly №40", "SOFT Cotton sticks 100 PE (BELL", 
"TENDER AGE Cottage cheese 10", "TOBUS steering-wheel 0.5kg flow"
), class = "factor")), .Names = c("ID_C_REGCODES_CASH_VOUCHER", 
"GOODS_NAME"), class = "data.frame", row.names = c(NA, -61L))

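(The NA rows are accidental.) The text consists of product names from receipts (checks).

I want to group similar names together.

For example, here I manually picked MAKFA Makar (a Ukrainian name). I found 7 rows with the root or key word "MAKFA Makar":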

Pasta Makfa snail flow-pack 450 g.
MAKFA Macaroni feathers like. in / with
2013077 MAKFA Makar.RAKERS 450g
2013077 MAKFA Makar.RAKERS 450g
6788 MAKFA Makar.perya 450g
2049750 MAKFA Makar.SHIGHTS 450g
2049750 MAKFA Makar.SHIGHTS 450g

All of these product entries share the same root, MAKFA Makar (it cannot be something like MFAMKR). As output I want to get:

                                                Initially                 class
1                       Pasta Makfa snail flow-pack 450 g.          MAKFA Makar.
2                  MAKFA Macaroni feathers like. in / with          MAKFA Makar.
3                          2013077 MAKFA Makar.RAKERS 450g          MAKFA Makar.
4                          2013077 MAKFA Makar.RAKERS 450g          MAKFA Makar.
5                              6788 MAKFA Makar.perya 450g          MAKFA Makar.
6                         2049750 MAKFA Makar.SHIGHTS 450g          MAKFA Makar.
7                         2049750 MAKFA Makar.SHIGHTS 450g          MAKFA Makar.
8          * 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35                  kolb
9               * 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg             Spikachki
10                                         809 Bananas 1kg              Bananas 
11                                              Lemons 55+                Lemons
12                           Napkins paper color 100pcs PL        Napkins paper 
13                         SOFT Cotton sticks 100 PE (BELL         Cotton sticks
14                     SHEBEKINSKIE Macaroni Butterfly №40 SHEBEKINSKIE Macaroni
15 * 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g              CAT seed
16                        FetaXa Cheese product 60% 400g (               Cheese 
17          3491144 LIP.NAP.ICE TEA green yellow 0.5 liter                  TEA 
18                  2030918 MARIA TRADITIONAL Biscuit 180g              Biscuit 
19                                          197 Onion 1 kg                 Onion
20                         TOBUS steering-wheel 0.5kg flow        steering-wheel
21                     Package "Magnet" white (Plastiktre) Package  (Plastiktre)
22                    * 2108609 SLOB.Mayon.OLIVK.67% 400ml                 Mayon
23                            TENDER AGE Cottage cheese 10        Cottage cheese

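How can I classify products by their root word (that is, by the pattern shared across names, such as Makar, Makfa, or cheese)?

【Comments】:

    Tags: r dplyr tm fuzzy-search


    【Solution 1】:

    I think you can get where you want by cleaning and then clustering the text. Here is a starter: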

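    # Keep only the rows that contain product names (the remaining rows are NA / empty)
    text <- text[1:24,]
    library(quanteda)
    # In quanteda >= 3, textstat_simil() lives in the separate quanteda.textstats package:
    # library(quanteda.textstats)
    library(tidyverse)
    hc <- text %>% 
      pull(GOODS_NAME) %>% 
      as.character %>% 
      # Tokenize and drop numbers, punctuation, symbols and separators
      quanteda::tokens(
        remove_numbers = TRUE,  
        remove_punct = TRUE,
        remove_symbols = TRUE, 
        remove_separators = TRUE
      ) %>% 
      quanteda::tokens_tolower() %>% 
      # Drop tokens that start with a digit (e.g. "450g", "1kg")
      quanteda::tokens_remove(valuetype="regex", pattern = c("^\\d.*")) %>% 
      quanteda::dfm() %>% 
      textstat_simil(method = "jaccard") %>% 
      # Negate the similarity so that it can serve as a dissimilarity for hclust()
      magrittr::multiply_by(-1) %>% 
      `attr<-`("Labels", text$GOODS_NAME) %>% 
      hclust(method = "average") 
    
    # Write the dendrogram to a temporary PDF and open it
    pdf(tf<-tempfile(fileext = ".pdf"), width = 20, height = 10)
    plot(hc)
    dev.off()
    shell.exec(tf)  # shell.exec() is Windows-only
    
    # Cut the tree at a (negated) Jaccard similarity of 0.1 and list each cluster
    clusters <- cutree(hc, h = -0.1)
    split(text, clusters)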
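
The same clean-then-cluster idea can be sketched in base R alone, which may help if the quanteda API differs between versions. This is only an illustration under stated assumptions, not the answer's code: the `tokenize()` helper, the one-letter token filter, and the cut height of 0.9 are choices made for this example, and `dist(method = "binary")` is used because it computes the Jaccard distance between binary rows.

```r
# A few product names taken from the question's sample data
goods <- c(
  "2013077 MAKFA Makar.RAKERS 450g",
  "6788 MAKFA Makar.perya 450g",
  "2049750 MAKFA Makar.SHIGHTS 450g",
  "809 Bananas 1kg",
  "Lemons 55+"
)

# Hypothetical helper: lowercase, split on anything that is not a letter,
# and drop empty and one-letter tokens (such as the "g" left over from "450g")
tokenize <- function(x) {
  tok <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tok[nchar(tok) > 1]
}
tokens <- lapply(goods, tokenize)

# Binary document-term matrix: one row per product name, one column per token
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tok) as.integer(vocab %in% tok)))
dimnames(dtm) <- list(goods, vocab)

# method = "binary" gives the Jaccard distance between rows
d <- dist(dtm, method = "binary")
hc <- hclust(d, method = "average")

# Names sharing no token are at distance 1, so cutting below 1 keeps them apart
clusters <- cutree(hc, h = 0.9)
split(goods, clusters)
```

Here the three MAKFA names share two of their three tokens and land in one cluster, while the bananas and lemons entries stay on their own. The cut height is the main tuning knob and would need adjusting for the full data.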
    

    【Discussion】:

      【Solution 2】:

      Here is an approach that uses a vector of words to search for:

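      # Root words to search for, taken from the desired output in the question
      patt <- c("MAKFA Makar.", "kolb","Spikachki", "Bananas", "Lemons",
      "Napkins paper", "Cotton sticks","SHEBEKINSKIE Macaroni","CAT seed","Cheese",
      "TEA", "Biscuit", "Onion", "steering-wheel", "Package  (Plastiktre)",
      "Mayon", "Cottage", "cheese")
      
      # grep() treats each pattern as a case-sensitive regular expression;
      # collect the matching rows per pattern, then stack them into one data frame
      lst <-lapply(patt, function(x) text[grep(x,text$GOODS_NAME), ])
      do.call(rbind.data.frame, lst)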
      
         ID_C_REGCODES_CASH_VOUCHER                                              GOODS_NAME
      15                       3953                         2013077 MAKFA Makar.RAKERS 450g
      19                       3960                         2013077 MAKFA Makar.RAKERS 450g
      20                       3960                             6788 MAKFA Makar.perya 450g
      23                       3967                        2049750 MAKFA Makar.SHIGHTS 450g
      24                       3967                        2049750 MAKFA Makar.SHIGHTS 450g
      22                       3960              * 3014084 D.Dym.Spikachki DEREVEN.MINI 1kg
      16                       3953                                         809 Bananas 1kg
      3                        3941                                              Lemons 55+
      2                        3941                           Napkins paper color 100pcs PL
      7                        3945                         SOFT Cotton sticks 100 PE (BELL
      10                       3945                     SHEBEKINSKIE Macaroni Butterfly №40
      17                       3960 * 3426789 WH.The corn rav guava / yagn.d / CAT seed 85g
      8                        3945                        FetaXa Cheese product 60% 400g (
      18                       3960          3491144 LIP.NAP.ICE TEA green yellow 0.5 liter
      14                       3953                  2030918 MARIA TRADITIONAL Biscuit 180g
      11                       3953                                          197 Onion 1 kg
      6                        3945                         TOBUS steering-wheel 0.5kg flow
      12                       3953                    * 2108609 SLOB.Mayon.OLIVK.67% 400ml
      9                        3945                            TENDER AGE Cottage cheese 10
      91                       3945                            TENDER AGE Cottage cheese 10
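
The output above has no row for "kolb" or for the "Package" pattern, and the cottage cheese row appears twice. That follows from `grep()` matching case-sensitive regular expressions, where parentheses and dots are special characters, and from "Cottage" and "cheese" being separate patterns. A minimal sketch of one way to attach a single class label per row is shown below. The `classify()` helper, the shortened pattern list, and the rule "first matching pattern wins" are assumptions made for this illustration, not part of the original answer.

```r
# A few product names taken from the question's sample data
goods <- c(
  "2013077 MAKFA Makar.RAKERS 450g",
  "* 3398012 DD Kolb.SERV.OKHOTN in / to v / y0.35",
  "Package \"Magnet\" white (Plastiktre)",
  "TENDER AGE Cottage cheese 10",
  "FetaXa Cheese product 60% 400g ("
)

# Illustrative patterns; more specific ones are listed before more general ones
patt <- c("MAKFA Makar", "kolb", "(Plastiktre)", "Cottage cheese", "Cheese")

# Hypothetical helper: return the first pattern found in x as a literal,
# case-insensitive substring, or NA if none matches
classify <- function(x, patterns) {
  hit <- vapply(
    patterns,
    function(p) grepl(tolower(p), tolower(x), fixed = TRUE),
    logical(1)
  )
  if (any(hit)) patterns[which(hit)[1]] else NA_character_
}

result <- data.frame(
  Initially = goods,
  class = vapply(goods, classify, character(1), patterns = patt, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
result
```

Lowercasing both sides and using `fixed = TRUE` sidesteps the regex metacharacters; the order of `patt` decides which label wins when several patterns match the same name.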
      

      【讨论】:

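      • Your approach is not bad, but I still need to extract these root words from the whole array so that I can later map them to separate classes. So if you had a mechanism for selecting the root words, your approach would work well. Once we have the root words, we paste them into patt and off we go )))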
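
One possible mechanism for selecting root words, sketched under simple assumptions: treat every token that occurs in at least two different product names as a candidate root. The `tokenize()` helper, the minimum token length of 3, and the threshold of 2 are illustrative choices, and the resulting candidates would still need a manual review before being used as patterns.

```r
# A few product names taken from the question's sample data
goods <- c(
  "Pasta Makfa snail flow-pack 450 g.",
  "MAKFA Macaroni feathers like. in / with",
  "2013077 MAKFA Makar.RAKERS 450g",
  "6788 MAKFA Makar.perya 450g",
  "2049750 MAKFA Makar.SHIGHTS 450g",
  "809 Bananas 1kg"
)

# Hypothetical helper: lowercase, split on non-letters, keep each token once
# per name, and drop very short tokens such as "g", "kg" or "in"
tokenize <- function(x) {
  tok <- strsplit(tolower(x), "[^a-z]+")[[1]]
  unique(tok[nchar(tok) > 2])
}

# Document frequency: in how many product names does each token occur?
doc_freq <- sort(table(unlist(lapply(goods, tokenize))), decreasing = TRUE)

# Candidate roots: tokens shared by at least two product names
roots <- names(doc_freq[doc_freq >= 2])
roots
```

On this sample the candidates are "makfa" and "makar". Tokens that occur only once, such as "bananas", are not picked up, so single-product classes would have to be handled separately (for example by falling back to the name's longest token).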