How to remove punctuation and numbers from text in a data.frame in R

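【Question title】: How can I remove punctuation and numbers in text from a data.frame in R
【Posted】: 2019-01-06 00:28:10
【Question description】:

I want to remove punctuation, numbers, and http links from the text in a data.frame. I tried the tm, stringr, quanteda, and tidytext packages, but none of them worked. I am looking for a useful base package or function that cleans the data.frame directly, without converting it to a corpus or anything similar.

How can I do this?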

mycorpus

mycorpus

Also, when I try to view some tweets that contain symbols, I get: Error in nchar(output) : invalid multibyte string, element 1

mycorpus

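【Question discussion】:

What exactly did you try? Please see here for how to write an R post that we can help with. That includes a representative sample of the data, the code that does not work, and the expected output.
Welcome to SO. It is always advisable to post sample input and expected output in your question, using code tags.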
> mycorpus mycorpus mycorpus
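Please provide a short example of data we can work with. Otherwise we have to keep guessing.
You could take another look at unnest_tokens in tidytext. It now has a token = "tweets" option that may suit you well. Its options include strip_punct = TRUE and strip_url = TRUE.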

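Tags: r tm stringr tidytext

【Solution 1】:

Since you have not posted any sample input or expected output, I could not test this. To remove punctuation, digits, and http links from a specific column of a data frame, you could try the following.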

gsub("[[:punct:]]|[[:digit:]]|^http:\\/\\/.*|^https:\\/\\/.*","",df$column)

Or, following Rui's suggestion in the comments, you can also use the following.

gsub("[[:punct:]]|[[:digit:]]|(http[[:alpha:]]*:\\/\\/)","",df$column)
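Because no sample data was posted, the following is only a minimal sketch of how the pieces above might be combined on a data frame column. The data frame `df`, its column `text`, and the helper `clean_text` are hypothetical names made up for illustration. The sketch removes whole URLs first. If punctuation is stripped first, the `://` that identifies a link is already gone by the time the URL pattern runs.

```r
# Hypothetical sample data; the column name `text` is an assumption
df <- data.frame(
  id   = 1:2,
  text = c("Check this out: https://example.com/page?id=42 !!!",
           "R 3.5.2 was released, see http://cran.r-project.org"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: URLs first, then punctuation and digits
clean_text <- function(x) {
  x <- gsub("https?://[^[:space:]]+", " ", x)    # drop each whole URL
  x <- gsub("[[:punct:]]|[[:digit:]]", " ", x)   # drop punctuation and digits
  x <- gsub("[[:space:]]+", " ", x)              # collapse runs of whitespace
  trimws(x)                                      # trim leading/trailing blanks
}

# Assign back to the column so the data frame itself is cleaned
df$text <- clean_text(df$text)
df$text
# [1] "Check this out"     "R was released see"
```

Replacing with a space rather than an empty string keeps neighbouring words from being glued together. The whitespace collapse afterwards tidies up the gaps this leaves.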

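【Discussion】:

Nice try, but it will not remove http: because there can be an s before the colon. I used "[[:punct:]]|[[:digit:]]|(http[[:alpha:]]*:\\/\\/)".
My test string was the URL of this question.
@RuiBarradas, thanks a lot for letting me know, I have changed it now.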

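【Solution 2】:

If your goal is to keep only letters, a concise way to do it is to replace everything that is not a letter. I also assume you want to replace those characters with a blank, since you mentioned something about a corpus. Otherwise your URL would collapse into one long string (but perhaps that is what you want; as noted, an example from you would help).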

x = c("https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r"
      , "http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r")

gsub("\\W|\\d|http\\w?", " ", x, perl = TRUE)
# [1] "    stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r"
# [2] "    stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r"

# the same task for a data frame of 100000 rows
# make sure that your strings are not factors
df = data.frame(id = 1:1e5, url = rep(x, 1e5/2), stringsAsFactors = FALSE)
# df before replacement
df[1:4, ]
# id    url
# 1  1 https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 2  2  http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 3  3 https://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# 4  4  http://stackoverflow.com/questions/51582369/how-can-i-remove-punctuations-and-numbers-in-text-from-data-frame-file-in-r
# apply replacement on a specific column and assign result back to this column
df$url = gsub("\\W|\\d|http\\w?", " ", df$url, perl = TRUE)
# check output
df[1:4, ]
# id        url
# 1  1     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
# 2  2     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
# 3  3     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
# 4  4     stackoverflow com questions          how can i remove punctuations and numbers in text from data frame file in r
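If more than one text column needs cleaning, the same replacement could be applied to every character column in one step. This is only a sketch under assumptions: the data frame, its column names (`tweet`, `bio`), and the helper `clean_text` are hypothetical, and the pattern is the one from this answer.

```r
# Hypothetical data frame; column names are assumptions for illustration
df <- data.frame(
  id    = 1:3,
  tweet = c("Win $100 now!!! http://spam.example/x1", "R is gr8 :)", "no change"),
  bio   = c("Data fan, est. 2015", "stats & coffee", "plain text"),
  stringsAsFactors = FALSE
)

# Hypothetical helper wrapping the pattern used in this answer
clean_text <- function(x) gsub("\\W|\\d|http\\w?", " ", x, perl = TRUE)

# Find the character columns and overwrite only those, in place
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], clean_text)

# Inspect a few rows instead of printing the whole column
head(df, 3)
```

Because the result is assigned back into `df`, nothing is printed to the console unless you ask for it, and `head()` or `df[1:4, ]` is enough to check the outcome on a large data frame.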

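【Discussion】:

I cannot do it this way because my data has 86909 rows. When I use gsub, R tries to print all of the converted data to the console, like # [1] ... ... ..., and the program crashes. So I need a solution that removes all punctuation within the data.frame itself.
I updated my answer to show how you would apply the replacement to a data.frame with 100000 rows. It only takes a few seconds.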