Quanteda: How do I create a corpus and plot dispersion of words?

【Question title】: Quanteda: How do I create a corpus and plot dispersion of words?
【Posted】: 2021-12-16 01:41:43
【Question description】:

I have some data that looks like this:

  date      signs  horoscope                                                      newspaper   
  <chr>     <chr>  <chr>                                                          <chr>       
1 06-06-20~ ARIES  Your week falls neatly into distinct phases. The completion o~ Indian Expr~
2 06-06-20~ TAURUS You're coming to the end of an emotional period, when you've ~ Indian Expr~
3 06-06-20~ GEMINI Passions are still running high, and the degree of emotional ~ Times of In~
4 06-06-20~ CANCER First things first - don't rush it! There is still a great de~ Indian Expr~
5 06-06-20~ LEO    The greatest pressures are coming from all directions at once~ Indian Expr~

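I want to create a corpus from this data in which all the horoscope texts are grouped into documents by newspaper and signs.

For example, all the ARIES horoscopes from the Times of India should form one document, arranged in date order (their index should be sorted by date).

Since I don't know how to group the texts by newspaper and signs, I tried creating a separate corpus for each of the two newspapers instead. This is what I tried: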


# Create a dataframe of only Times of India text
h_toi <- horoscopes %>%
  filter(newspaper == "Times of India") %>%
  select(-c("newspaper"))
  
# Create a corpus out of this
horo_corp_toi <- corpus(h_toi, text_field = "horoscope")

# Create docids
docids <- paste(h_toi$signs)

# Use this as docnames
docnames(horo_corp_toi) <- docids

head(docnames(horo_corp_toi), 5)
# [1] "ARIES.1"  "TAURUS.1" "GEMINI.1" "CANCER.1" "LEO.1"

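But as you can see, the docnames of the corpus are "ARIES.1", "TAURUS.1", and so on. This is a problem because when I try to plot it with quanteda's textplot_xray(), thousands of documents are plotted instead of only 12, one for each sign: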

# Plot lexical dispersion of love in all signs 
kwic(tokens(horo_corp_toi), pattern = "love") %>%
    textplot_xray()

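Instead, I would like to be able to do something like this:

I can't get this visualization because I don't know how to manipulate the data and create the corpus in the first place. How do I do this, and what am I doing wrong?

A sample dput is available here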

【Comments on the question】:

Tags: r corpus quanteda

【Solution 1】:

Since the question is how to group by both sign and newspaper, let me answer that first.

library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textplots")

## horoscopes <- [per linked dput in OP]

corp <- corpus(horoscopes, text_field = "horoscope")
toks <- tokens(corp)

# grouped by sign and newspaper
tokens_group(toks, groups = interaction(signs, newspaper)) %>%
  kwic(pattern = "love") %>%
  textplot_xray()
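
To make the grouping variable less opaque, here is a minimal base R sketch of what interaction() returns. The data frame `dv` is a made-up stand-in for docvars(toks), not the OP's data, and `drop = TRUE` is only a choice for this toy example. The tapply() line merely imitates the idea of concatenating texts that share a group; it is not how tokens_group() is implemented.

```r
# Toy stand-in for docvars(toks); the values are invented for illustration
dv <- data.frame(
  signs     = c("ARIES", "TAURUS", "ARIES", "TAURUS"),
  newspaper = c("Times of India", "Times of India",
                "Indian Express", "Times of India"),
  text      = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# One factor level per observed sign/newspaper combination, joined by "."
grp <- with(dv, interaction(signs, newspaper, drop = TRUE))
print(as.character(grp))

# Conceptual imitation of grouping: paste texts sharing a level, in row order
merged <- tapply(dv$text, grp, paste, collapse = " ")
print(merged)
```

Each distinct level of `grp` corresponds to one grouped document, which is why the plot shows one row per sign and newspaper combination.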

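To get the output shown in the question (only the last image is shown here), you can loop over the newspapers and group by signs only. Note that the number of signs here is limited because the sample data provided does not cover the full range of zodiac signs.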

# separate kwic for each newspaper
for (i in unique(toks$newspaper)) {
  thiskwic <- toks %>%
    tokens_subset(newspaper == i) %>%
    tokens_group(signs) %>%
    kwic(pattern = "love")
  # ggplot objects are not auto-printed inside a for loop, so print explicitly
  print(textplot_xray(thiskwic) +
    ggplot2::ggtitle(paste("Lexical dispersion plot -", toupper(i))))
}

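【Discussion】:

Thanks, this works like a charm! Could you explain what is happening with toks and with the interaction argument in the tokens_group function? I'm having a hard time understanding how to turn a dataset like this into a corpus, so a short explanation would be very helpful. Also, is there a way to add a date component to the x-axis of the final chart?
See ?tokens_group for an explanation of the grouping variable. To see what the input is, inspect the output of with(docvars(toks), interaction(signs, newspaper)). See also ?interaction. There is no direct way to replace the x-axis with dates, unless you mean the label, in which case it's just + xlab("Your date component label").
By date, I mean replacing the relative token index values (0.4, 0.6) with an axis that has dates (as given in the dataset), so that I can track the dispersion over time. Otherwise, how do we interpret the Relative Token Index? It is the chronological order of occurrence, right?
The relative token index is the position of the pattern match within the "document". When you group documents with tokens_group(), they are concatenated in the order of the documents in your corpus. So if that order is chronological, then so are the tokens. To replace the axis labels with anything else, you can use standard ggplot2 methods.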
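
Since grouped documents follow corpus order, one way to get a chronological token index is to sort the data frame by date before building the corpus. The sketch below uses base R only. It assumes the date strings are day-month-year ("%d-%m-%Y"), which the truncated sample output does not confirm, and the rows in `horo` are made up for illustration.

```r
# Toy rows standing in for the horoscopes data; the dates are invented
horo <- data.frame(
  date  = c("08-06-2020", "06-06-2020", "07-06-2020"),
  signs = c("ARIES", "ARIES", "ARIES"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: dates are day-month-year; change the format string if they are not
parsed <- as.Date(horo$date, format = "%d-%m-%Y")

# Reorder rows chronologically before any corpus is built from them
horo_sorted <- horo[order(parsed), ]
print(horo_sorted$date)
```

Building the corpus from the sorted data frame and then grouping should, per the comment above, give grouped documents whose tokens run in date order.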