How to remove HTML tags from HTML documents using MongoDB

【Question title】: How to remove html tags from html documents using mongodb
【Posted】: 2016-08-02 10:33:41
【Question description】:

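I am using MongoDB version 3.2.4. I have a collection with only one field, Html (apart from id, of course). It holds 8,821,851 documents. Each document represents an HTML file: about 5 pages of plain text, but about 20 pages once the HTML tags are included.

When I run a simple query for documents containing some text (for example, the ID number "5555888" or the name "John"), it either returns no results or I have to wait a very long time.

I want to remove the HTML tags from all documents to reduce their size. Querying for a specific document should then be faster.

Is there a way to do this in MongoDB?

A sample HTML document (a small part):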

"<a name=\"JR_PAGE_ANCHOR_0_1\"></a>\n<table style=\"width: 595px; border-collapse: collapse; empty-cells: show\" cellpadding=\"0\" cellspacing=\"0\" border=\"0\" bgcolor=\"white\">\n<tr>\n  <td style=\"width: 20px; height: 1px;\"></td>\n  <td style=\"width: 40px; height: 1px;\"></td>\n  <td style=\"width: 20px; height: 1px;\"></td>\n  <td style=\"width: 22px; height: 1px;\"></td>\n  <td style=\"width: 108px; height: 1px;\"></td>\n  <td style=\"width: 88px; height: 1px;\"></td>\n  <td style=\"width: 117px; height: 1px;\"></td>\n  <td style=\"width: 80px; height: 1px;\"></td>\n  <td style=\"width: 79px; height: 1px;\"></td>\n  <td style=\"width: 1px; height: 1px;\"></td>\n  <td style=\"width: 20px; height: 1px;\"></td>\n</tr>\n<tr valign=\"top\">\n  <td colspan=\"11\" style=\"width: 595px; height: 20px;\"></td>\n</tr>\n<tr valign=\"top\">\n  <td colspan=\"2\" style=\"width: 60px; height: 46px;\"></td>\n  <td colspan=\"2\"><img src=\"/servlets/PublicReportServlet?_image_=img_0_0_9\" style=\"height: 46px\" alt=\"\"/></td>\n  <td colspan=\"6\" style=\"text-align: right;\"><span style=\"font-family: Times New Roman; color: #000000; font-size: 18.0px; font-weight: bold;\">NESLUŽBENA KOPIJA</span></td>\n  <td style=\"width: 20px; height: 46px;\"></td>\n</tr>\n<tr valign=\"top\">\n  <td style=\"width: 20px; height: 12px;\"></td>\n  <td colspan=\"9\"><span style=\"font-family: Times New Roman; color: #000000; font-size: 10.0px; font-weight: bold;\">REPUBLIKA HRVATSKA</span></td>\n  <td style=\"width: 20px; height: 12px;\"></td>\n</tr>\n<tr valign=\"top\">\n  <td colspan=\"11\" style=\"width: 595px; height: 4px;\"></td>\n</tr>\n<tr valign=\"top\">\n  <td style=\"width: 20px; height: 12px;\"></td>\n  <td colspan=\"9\"><span style=\"font-family: Times New Roman; color: #000000; font-size: 10.0px; font-weight: bold;\">Općinski sud u Novom Zagrebu</span></td>\n  <td style=\"width: 20px; height: 12px;\"></td>\n</tr>\n<tr valign=\"top\">\n  <td style=\"width: 20px; height: 
12px;\"></td>\n  <td colspan=\"9\"><span style=\"font-family: Times New Roman; color: #000000; font-size: 10.0px; font-weight: bold;\">ZEMLJIŠNOKNJIŽNI ODJEL NOVI ZAGREB</span></td>\n  <td style=\"width: 20px; height: 12px;\"></td>\n</tr>\n<tr valign=\"top\">\n  <td style=\"width: 20px; height: 12px;\"></td>\n  <td colspan=\"4\" style=\"border-bottom-style: solid; border-bottom-width: 1px; border-bottom-color: #000000; \"><span style=\"font-family: Times New Roman; color: #000000; font-size: 10.0px; font-weight: bold;\">Stanje na dan: 19.06.2016 23:05</span></td>\n  <td colspan=\"5\" style=\"border-bottom-style: solid; border-bottom-width: 1px; border-bottom-color: #000000; text-align: right;\"><span style=\"font-family: Times New Roman; color: #000000; font-size: 10.0px;\">Verificirani poduložak</span></td>\n  <td style=\"width: 20px; height: 12px;\"></td>\n</tr>\n<tr valign=\"top\">\n  <td style=\"width: 20px; height: 12px;\"></td>\n  <td colspan=\"9\"></td>\n  <td style=\"width: 20px; height: 12px;\"></td>\n</tr>\n<tr valign=\"top\">\n  <td style=\"width: 20px; height: 12px;\"></td>\n  <td colspan=\"5\"><span style=\"font-family: Times New Roman; color: #000000; font-size: 10.0px; font-weight: bold;\">Knjiga PU: ZAPRUDSKI OTOK

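【Comments】:

Tasks like this are best done on the client. MongoDB is meant to be a big, dumb data store: it does almost no processing and simply stores and retrieves data. You should respect that goal and avoid forcing MongoDB to do any computation that can be done on the client.
You need to use an HTML parser to parse your documents and strip the tags, then use the result to update your documents. Which language driver are you using?
OK. Can I do this with Python or R? How?

Tags: html mongodb tags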

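【Solution 1】:

Apparently you have a huge collection in MongoDB and you want to edit every document in it. Editing these documents is more or less string editing and processing.

This is not a task for MongoDB. You should use a MongoDB client or driver and write a very simple program in C# or Java that iterates over every document and edits it.

If you need help writing the code, I can provide a small Java example. Please leave a comment if you need it.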
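
Since the asker mentions Python, the "string editing" step could be sketched there rather than in Java. The following is a minimal sketch using only the standard library's `html.parser`; the names `_TextExtractor` and `strip_tags` are made up for illustration and are not part of the original answer. It assumes the documents contain no `<script>` or `<style>` blocks (the sample above has none), because their contents would otherwise be kept as text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for text between tags; collapse whitespace, skip blank pieces.
        text = " ".join(data.split())
        if text:
            self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def strip_tags(html):
    """Return the text content of `html`, with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Applied to one cell of the sample, `strip_tags('<td colspan="9"><span>REPUBLIKA HRVATSKA</span></td>')` yields only the text `REPUBLIKA HRVATSKA`.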

【Discussion】:

I don't know how to do that. I use R, and sometimes Python. Can I do the same thing in those languages? How?
Do you have the MongoDB driver for Python?
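
With the Python driver (PyMongo) installed, the iterate-and-edit loop described in the answer might look roughly like the sketch below. The helper names `strip_field` and `crude_strip`, the connection URI, and the database and collection names (`mydb`, `pages`) are all assumptions for illustration; only the field name `Html` comes from the question. `crude_strip` is a regex placeholder, and a proper HTML parser would be a safer transform for real data.

```python
import re


def crude_strip(html):
    # Placeholder transform: drop anything that looks like a tag and
    # collapse whitespace. An HTML parser is more robust than this regex.
    return " ".join(re.sub(r"<[^>]*>", " ", html).split())


def strip_field(collection, transform, field="Html"):
    """Replace `field` in every document with transform(value).

    Returns the number of documents that were actually changed.
    """
    changed = 0
    # Project only the field we need (the _id is returned by default).
    for doc in collection.find({}, {field: 1}):
        original = doc.get(field)
        if not isinstance(original, str):
            continue  # field missing or not a string: leave the document alone
        cleaned = transform(original)
        if cleaned != original:
            collection.update_one({"_id": doc["_id"]}, {"$set": {field: cleaned}})
            changed += 1
    return changed


def main():
    # Needs `pip install pymongo` and a reachable server.
    # URI, database and collection names are placeholders.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    collection = client["mydb"]["pages"]
    print(strip_field(collection, crude_strip), "documents updated")
```

Calling `main()` runs the update against the placeholder server. One `update_one` per document is simple but slow for roughly 8.8 million documents; batching the updates with PyMongo's `bulk_write` would likely be faster, at the cost of a slightly longer script.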