Remove HTML tags from a text string and keep the text

【Question title】: Remove HTML tags from a text string and keep the text
【Posted】: 2019-01-17 04:47:21
【Question】:

I have a text string that looks like this:

^style>           
  p,span,li{font-family:Arial;font-size:10.5pt;}        
^/style>  
^p>
  ^img src="https://app.keysurvey.com/" alt="image" width="462" />
^/p>  
^p>
  Dear Adam,
^/p>  
^p>
  Thank you for your query, the Reference ID for your query is 
  ^strong>^u> 28600 ^/u>^/strong>
  .&nbsp; We will respond to you within the next 1-2 business days.
^/p>  
^p>For further correspondence with us, kindly reply by maintaining the 
   Reference ID number of this case in the subject line of your e-mail.
^/p>  
^p>
  Regards
^/p>

My goal is to strip all HTML tags and other junk values and return the text like this:

Output:

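Dear Adam,

Thank you for your query, the Reference ID for your query is 28600. We will respond to you within the next 1-2 business days. For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail. Regards,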

I tried extractHTMLStrip from tm.plugin.webmining, but it does not remove the junk values:

library(tm.plugin.webmining)
df$text1 <- extractHTMLStrip(df$text)

【Comments】:

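This question has been asked many times. Multiple solutions are available through various libraries or regular expressions. Try, for example, here or here
None of these helped anyway, thanks. The second link points to my own question. I have already tried the XML, RCurl and rvest libraries to clear the junk values, but they did not help. Thanks, have a nice day.
You could try gsub("\\^p>", "", x), and then repeat this for anything else you want removed. This replaces every instance of ^p>.
Sorry, I must have messed up copying and pasting the links. I have posted an answer below using regular expressions, but if the string itself is corrupted, you can run gsub("\\^", "<", df$text), which should let your HTML tools work properly.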
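
The repair suggested in the last comment can be sketched roughly as follows. This is only an illustration under assumptions: it uses the xml2 package (not one of the packages named in this thread), the sample string is an abbreviated version of the one in the question, and the names `x`, `html`, `doc`, `txt` and `clean` are made up for the example. Note that replacing every "^" also rewrites any genuine caret that appears in the text.

```r
library(xml2)  # assumption: xml2 is installed; the thread itself names XML, RCurl and rvest

# Abbreviated version of the string from the question ("^" where "<" should be)
x <- '^style> p{font-size:10.5pt;} ^/style> ^p>Dear Adam,^/p> ^p>Reference ID ^strong>^u> 28600 ^/u>^/strong>.&nbsp; Regards^/p>'

# Step 1: restore the damaged opening brackets
html <- gsub("\\^", "<", x)

# Step 2: parse the repaired HTML and drop <style> nodes so their CSS
# does not leak into the extracted text
doc <- read_html(html)
xml_remove(xml_find_all(doc, "//style"))
txt <- xml_text(doc)

# Step 3: turn non-breaking spaces (from &nbsp;) into plain spaces,
# then collapse runs of whitespace and trim the ends
txt <- gsub("\u00a0", " ", txt, fixed = TRUE)
clean <- trimws(gsub("\\s+", " ", txt))
print(clean)
```

Once the brackets are restored, any HTML parser should be able to handle the string; xml2 is just one choice.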

Tags: r

【Solution 1】:

If the less-than signs in your string have been corrupted, you can do this with regular expressions.

yourstring <- '^style> p,span,li{ font-family:Arial; font-size:10.5pt; } ^/style> ^p>^img src="https://app.keysurvey.com/" alt="image" width="462" />^/p> ^p>Dear Adam,^/p> ^p>Thank you for your query, the Reference ID for your query is ^strong>^u> 28600 ^/u>^/strong>.  We will respond to you within the next 1-2 business days.^/p> ^p>For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail.^/p> ^p>Regards'
# reproducible example of your string

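# Remove every corrupted tag: a literal "^" up to the nearest ">"
yourstring <- gsub("\\^.*?>", "", yourstring)
# Remove the CSS rule left behind by the stripped style block
yourstring <- gsub("p,span.*?}", "", yourstring)
# Trim leading and trailing whitespace
yourstring <- trimws(yourstring)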

This gives you:

> yourstring
[1] "Dear Adam, Thank you for your query, the Reference ID for your query is  28600 .  We will respond to you within the next 1-2 business days. For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail. Regards"
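
To apply the same steps to a whole column such as df$text from the question, the calls can be wrapped in a small helper. This is a minimal base R sketch; the function name `strip_caret_tags`, the toy data frame and the extra whitespace-collapsing step are assumptions added for illustration, not part of the original answer.

```r
# Hypothetical helper: the regex steps from the answer, vectorised over a character vector
strip_caret_tags <- function(x) {
  x <- gsub("\\^.*?>", "", x)        # drop corrupted tags such as ^p> or ^/strong>
  x <- gsub("p,span.*?\\}", "", x)   # drop the leftover CSS rule
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace (added step)
  trimws(x)
}

# Toy data standing in for the asker's data frame
df <- data.frame(
  text = c('^p>Dear Adam,^/p> ^p>Regards^/p>',
           '^style> p,span,li{font-size:10.5pt;} ^/style>^p>Hello^/p>'),
  stringsAsFactors = FALSE
)

df$text1 <- strip_caret_tags(df$text)
print(df$text1)
```

Because gsub and trimws are vectorised, no explicit loop over rows is needed.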

For something more elegant, you can use the stringr and magrittr libraries.
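
One way the stringr and magrittr version might look is sketched below. The answer does not show this code, so treat it as an assumption: it uses `str_remove_all()` and `str_squish()` from stringr, the `%>%` pipe from magrittr, and an abbreviated sample string with made-up variable names.

```r
library(stringr)
library(magrittr)

# Abbreviated version of the string from the answer
x <- '^style> p,span,li{ font-family:Arial; } ^/style> ^p>Dear Adam,^/p> ^p>Reference ID ^strong>^u> 28600 ^/u>^/strong>. Regards^/p>'

clean <- x %>%
  str_remove_all("\\^.*?>") %>%        # corrupted tags: "^" up to the nearest ">"
  str_remove_all("p,span.*?\\}") %>%   # leftover CSS rule
  str_squish()                         # trim ends and collapse inner whitespace

print(clean)
```

Compared with trimws, str_squish also collapses the double spaces left behind where inline tags were removed.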

【Comments】: