【Question title】: R: find words from tweets in a lexicon, count them, and save the count in the data frame with the tweets
【Posted】: 2021-07-28 14:14:27
【Question description】:

I have a dataset of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have also built a homemade lexicon (formal_lexicon) of roughly 1 million words, all in a formal register. Now I want to write a small piece of code that counts, for each tweet, how many words (if any) from that lexicon it contains.

tweets_data:

                   Content            
1                 "Blablabla"               
2                 "Hi my name is"               
3                 "Yes I need"                 
.  
.
. 
50176            "TEXT50176" 

formal_lexicon:

                       X            
1                 "admittedly"               
2                 "Consequently"               
3                 "Furthermore"                 
.  
.
. 
1000000            "meanwhile"   

So the output should look like this:

                  Content             Lexicon
1                 "TEXT1"                1
2                 "TEXT2"                3
3                 "TEXT3"                0 
.  
.
. 
50176            "TEXT50176"             2

It should be a simple for loop, something like:

for(sentence in tweets_data$Content){ 
  for(word in sentence){
    if(word %in% formal_lexicon){
         ...
}
}
}

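I don't think "word" works this way, and I'm not sure how to keep the count in a specific column when a word is in the lexicon. Can anyone help?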

structure(list(X = c("admittedly", "consequently", "conversely",  "considerably", "essentially", "furthermore")), row.names = c(NA,  6L), class = "data.frame")

c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ",  "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art",  "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381",  "Damn, it's hard to wrap presents when you're drunk. cc @santa",  "When my whole fam tryna have a peaceful holiday " )
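For reference, the loop in the question does not split the tweet: `for (word in sentence)` iterates over a length-one character vector, so `word` is the whole tweet, and `%in% formal_lexicon` compares against the data frame itself rather than its `X` column. The following is a minimal base R sketch of what the loop seems to be aiming at, not a definitive solution. The three tweets and the three-word lexicon are made up for illustration, and splitting on non-letters is a simplifying assumption that ignores hashtags, mentions and non-ASCII letters.

```r
# Hypothetical miniature versions of the two objects from the question
tweets_data <- data.frame(
  Content = c("Admittedly, it rained; consequently we stayed in.",
              "Hi my name is",
              "Furthermore, yes I need it"),
  stringsAsFactors = FALSE
)
formal_lexicon <- data.frame(
  X = c("admittedly", "consequently", "furthermore"),
  stringsAsFactors = FALSE
)

# Lowercase each tweet and split it on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(tweets_data$Content), "[^a-z']+")

# For each tweet, count the tokens that appear in the lexicon column
tweets_data$Lexicon <- vapply(
  tokens,
  function(words) sum(words %in% formal_lexicon$X),
  integer(1)
)

tweets_data
```

Repeated lexicon words are counted once per occurrence here; wrapping `words` in `unique()` would count distinct words instead.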

【Question discussion】:

  • Can you add a usable (fake is fine) example of your data and lexicon?
  • @s__ Like this?

Tags: r nlp lexicon


【Solution 1】:

You can try something like this:

library(tidytext)
library(dplyr)

# some fake phrases and lexicon
formal_lexicon <- structure(list(X = c("admittedly", "consequently", "conversely",  "considerably", "essentially", "furthermore")), row.names = c(NA,  6L), class = "data.frame")
tweets_data <- c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ",  "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art",  "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381",  "Damn, it's hard to wrap presents when you're drunk. cc @santa",  "When my whole fam tryna have a peaceful holiday " )

# put your tweets in a data.frame, with one id per tweet
tweets_data_df <- data.frame(Content = tweets_data, id = seq_along(tweets_data))

tweets_data_df %>%
  # split each tweet into one word per row
  unnest_tokens(txt, Content) %>%
  # flag whether each word is in the lexicon (1) or not (0), keeping the zeros
  mutate(pres = ifelse(txt %in% formal_lexicon$X, 1, 0)) %>%
  # group by tweet
  group_by(id) %>%
  # sum the flags per tweet
  summarise(cnt = sum(pres)) %>%
  # put back the texts
  left_join(tweets_data_df) %>%
  # reorder the columns
  select(id, Content, cnt)

Result:

Joining, by = "id"
# A tibble: 6 x 3
     id Content                                                              cnt
  <int> <chr>                                                              <dbl>
1     1 "@barackobama Thank you for your incredible grace in leadership a~     0
2     2 "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles Co~     0
3     3 "2017 resolution: to embody authenticity!"                             0
4     4 "Happy Holidays! Sending love and light to every corner of the ea~     0
5     5 "Damn, it's hard to wrap presents when you're drunk. cc @santa"        0
6     6 "When my whole fam tryna have a peaceful holiday "                     0
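One thing to watch with real data: `unnest_tokens()` lowercases tokens by default (`to_lower = TRUE`), while the lexicon excerpt in the question contains capitalised entries such as "Consequently" and "Furthermore". An exact `%in%` match would miss those. The base R sketch below illustrates the mismatch and one possible fix; the token vector is a made-up stand-in for what a lowercasing tokenizer would return, and `lexicon_lower` is a hypothetical helper name.

```r
# Lexicon excerpt with mixed capitalisation, as shown in the question
formal_lexicon <- data.frame(
  X = c("admittedly", "Consequently", "Furthermore"),
  stringsAsFactors = FALSE
)

# Tokens of a hypothetical tweet, already lowercased by the tokenizer
tokens <- c("consequently", "we", "left", "furthermore", "it", "rained")

# Exact matching misses the capitalised lexicon entries
sum(tokens %in% formal_lexicon$X)

# Normalise the lexicon once, dropping duplicates created by lowercasing
lexicon_lower <- unique(tolower(formal_lexicon$X))
sum(tokens %in% lexicon_lower)
```

Lowercasing the lexicon once up front, rather than inside the pipeline, avoids repeating the work for every tweet.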

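【Discussion】:

  • I get this error: Error in UseMethod("pull"): no applicable method for 'pull' applied to an object of class "character"
  • Is that with my code and my data, or with my code but other data?
  • With your code and other data. However, if I run your code and your data, I also get an error: Error: `by` must be supplied when `x` and `y` have no common variables. ℹ use by = character()` to perform a cross-join. Run `rlang::last_error()` to see where the error occurred.
  • For my code and data, I have put the correct code in the last line: select(...). For your data, you need to edit the question and share some data that reproduces the error. dput(head(formal_lexicon)) and dput(head(tweets_data)) would be fine (you have to post the output): post those results, because if the data are the problem, I can only help you by looking at them.
  • @Ja123 I see the problem: you need to convert tweets_data to a data.frame and then put it into the code. See the edit. Evidently none of the words in the data are in the lexicon, so you see 0 as the result.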
【Solution 2】:

Hope this works for you:

library(magrittr)
library(dplyr)
library(tidytext)

# Data frame with tweets, including an ID
tweets <- data.frame(
  id = 1:3,
  text = c(
    'Hello, this is the first tweet example to your answer',
    'I hope that my response help you to do your task',
    'If it is tha case, please upvote and mark as the correct answer'
  )
)

lexicon <- data.frame(
  word = c('hello', 'first', 'response', 'task', 'correct', 'upvote')
)


# Counting the words in each tweet that are present in your lexicon
in_lexicon <- tweets %>%
  # Split each tweet into one word per row
  tidytext::unnest_tokens(output = 'words', input = text) %>%
  # Flag whether each word is in your lexicon
  dplyr::mutate(
    in_lexicon = words %in% lexicon$word
  ) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(words_in_lexicon = sum(in_lexicon))

# Joining the counts back onto the original data
dplyr::left_join(tweets, in_lexicon, by = 'id')
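A left join keeps every tweet, but any id that is absent from the summary table ends up with NA rather than 0. That could happen if a tweet yields no tokens at all (for example an empty string), which is an assumption about the real data rather than something shown in this thread. A minimal base R sketch of filling those gaps, using small hypothetical stand-ins for `tweets` and `in_lexicon`:

```r
# Hypothetical stand-ins: tweet 3 is missing from the per-tweet summary,
# as if it had produced no tokens
tweets <- data.frame(
  id = 1:3,
  text = c("first tweet", "second task", ""),
  stringsAsFactors = FALSE
)
in_lexicon <- data.frame(id = 1:2, words_in_lexicon = c(1L, 1L))

# Left join on id keeps every tweet; ids without a summary row get NA
result <- merge(tweets, in_lexicon, by = "id", all.x = TRUE)

# Treat "no row in the summary" as a count of zero
result$words_in_lexicon[is.na(result$words_in_lexicon)] <- 0L

result
```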

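【Discussion】:

  • I get this error (I took 50 observations from the big dataset to check whether it works): Error: Must extract column with a single valid subscript. x Subscript `var` has size 50 but must be size 1.
  • Does that error appear when running my reprex, or with your actual data?
  • With my actual data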