Dynamically Generate Tags for Database Entries

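I decided to write this post in response to a question I recently received on my blog. First, some background on the idea. I recently created a website whose back end stores many of Donald Trump's speeches. On the front end, a user can search for any word to see whether it appears in any of the speeches. If there is a match, the site returns statistics on how, when, and where Trump most often talks about whatever the user searched for. If you are interested, here is a working example of the site: https://trumpspeechdata.herokuapp.com/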

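In a recent post, I talked about how (if you have a similar project) you can make your database searchable by users, and how a problem often comes up when the search results do not match what the user expected. One way to mitigate this is a mapping object that groups similar words and search terms under a single "umbrella" keyword. Basically, if the user's search matches any word in the mapping object, it is replaced with the umbrella word, so that the results returned match the topic more closely rather than only the specific word the user typed. For more on this, see the following post: https://hackernoon.com/creating-more-accurate-search-results-for-small-sites-436e64da79b6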
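To make the umbrella idea concrete, here is a minimal sketch of such a mapping object. The names `keywordMap` and `toUmbrellaKeyword` and the word lists are assumptions made up for illustration; they are not taken from the linked post.

```javascript
// Hypothetical mapping: each umbrella keyword lists the search terms it covers.
const keywordMap = {
  immigration: ["immigrant", "immigrants", "border", "deport", "visa"],
  healthcare: ["obamacare", "medicaid", "insurance", "health care"],
};

// Return the umbrella keyword for a search term,
// or the normalized term itself if nothing matches.
function toUmbrellaKeyword(searchTerm, map = keywordMap) {
  const term = searchTerm.trim().toLowerCase();
  for (const [umbrella, words] of Object.entries(map)) {
    if (umbrella === term || words.includes(term)) {
      return umbrella;
    }
  }
  return term;
}

console.log(toUmbrellaKeyword("Obamacare")); // "healthcare"
console.log(toUmbrellaKeyword("jobs"));      // "jobs" (no mapping, unchanged)
```

The search would then run against the returned umbrella word instead of the raw input.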

So here is today's problem. Suppose your database structure currently looks like this:

{
   "speechtitle": "Remarks by President Trump on Tax Reform",
   "speechdate": "September 2017",
   "speechlocation": "Missouri",
   "text": "blah, blah, blah, text here"
}

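There is no tags field yet. How do you add tags to the database dynamically? Well, to start with, you could do it by hand, reading every speech and manually adding the field and the tags to each one. But let's face it, in most cases that is not feasible. What we want is to add the most relevant tags to each database entry without doing it manually. I came up with a solution. To see where I am headed in this post, you can visit the page below and refer back to it to see how it matches the code I wrote: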

https://trumpspeechdata.herokuapp.com/taginput

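I will admit up front that I am sure there is a better solution using Python or PHP. My specialty is Node.js and JavaScript, so if you want to follow along, that is where my solution lives. I think the old saying is true: "If all you have is a hammer, everything looks like a nail." Anyway, let's get started. Here is my thought process: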

Step 1: Extract tags from the speech text

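I need to go through the text of each speech and find the most frequently used words. Whichever words appear most often, I will extract to use as tags for that database entry. But if I simply did that, I would end up with "and", "the", "it", and "if" as the most common words. So I created a list of possible topics to loop through and match against the speech text. That way I don't get a lot of off-topic words, and my results are more meaningful. These are the tags I used, stored in an array: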

let tags = [ "women's rights", "women", "deport", "border", "security", "immigration", "extremism", "terrorism", "tuition", "healthcare", "tax reform", "taxes", "citizenship", "abortion", "religion", "lgbt", "gay rights", "transgender", "military", "marriage", "gun control", "surveillance", "net neutrality", "drugs", "social security", "obamacare", "medicaid", "israel", "military spending", "north korea", "isis", "equal rights", "minimum wage", "welfare", "student loans", "education", "climate change", "transportation", "loans", "student", "school", "teacher", "jobs", "salary"];

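Next, I create a function that matches these words against all the words in each speech. I use a for loop for this. After using fetch and storing the data in a variable called "api", I am ready to start.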

response.json().then(function(data) {

  let api = data;

  let tagsOutput = [];
  for (var i = 0; i < api.length; i++) {
    let tags = [ "women's rights", "women", "deport", "border", "security", "immigration", "extremism", "terrorism", "tuition", "healthcare", "tax reform", "taxes", "citizenship", "abortion", "religion", "lgbt", "gay rights", "transgender", "military", "marriage", "gun control", "surveillance", "net neutrality", "drugs", "social security", "obamacare", "medicaid", "israel", "military spending", "north korea", "isis", "equal rights", "minimum wage", "welfare", "student loans", "education", "climate change", "transportation", "loans", "student", "school", "teacher", "jobs", "salary"];

    // Only process entries that actually contain speech text.
    if (api[i].text.length > 1) {
      let index = api.indexOf(api[i]);
      // Split the text into single words. Note: a multi-word tag such as
      // "tax reform" can never equal a single word, so only one-word tags match.
      let stringX = api[i].text.split(" ");
      for (var j = 0; j < tags.length; j++) {
        for (var k = 0; k < stringX.length; k++) {
          if (tags[j] === stringX[k]) {
            // One entry per occurrence of a tag in this speech.
            tagsOutput.push([
              index,
              tags[j],
              api[i].title,
              api[i].location,
              api[i].date,
              api[i].text,
            ]);
          }
        }
      }
    }
  }
  // The snippets in the following steps continue inside this callback.

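I keep the "tags" array inside the for loop. Then I create an if statement to make sure the text is there. If it is, I run a nested for loop: first I split the raw text into individual words and store them in the variable stringX. Then I loop through the tags, and nested inside that, a second for loop goes through stringX. If a word in the text matches a word in the tags, I push several things into a new array called "tagsOutput". Here is what I push into the array, and why: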

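index: I store the index of api[i] in this variable. Because a new array is being created, if I later need data from the original api, I can easily match items in the new array to items in api by index number.
tag: These are the tags that I extract and eventually post to my database.
title: I push the title so that later, when I find the most frequently used words, I can associate them with a specific speech.
location, date, and text: I don't necessarily need all of these. Since I have the index, I could fetch them from the api array later. But when possible, I often find it easier to keep everything I need together in one array. It's like going camping. I might not use my machete, 4 flashlights, and 8 changes of underwear, but it's nice to have it all in case I need it later.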
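Splitting on spaces compares one word at a time, so capitalized words, words followed by punctuation, and two-word tags are easy to miss. The following is a rough sketch of one alternative that counts each tag with a case-insensitive regular expression. The helper names `escapeRegExp` and `countTagMatches` and the sample sentence are assumptions for illustration, not part of the original code.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count how often each tag occurs in a text, ignoring case.
// Tags that never occur are left out of the result.
function countTagMatches(text, tags) {
  const counts = {};
  for (const tag of tags) {
    // \b keeps "student" from matching inside "students".
    const pattern = new RegExp("\\b" + escapeRegExp(tag) + "\\b", "gi");
    const matches = text.match(pattern);
    if (matches) counts[tag] = matches.length;
  }
  return counts;
}

const sample = "Tax reform matters. Our taxes are too high, and taxes hurt jobs.";
console.log(countTagMatches(sample, ["tax reform", "taxes", "jobs", "border"]));
// { 'tax reform': 1, taxes: 2, jobs: 1 }
```

Because this returns the count per tag directly, it could also stand in for the separate frequency-counting step, at the cost of scanning the text once per tag.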

Step 2: Sort the tags by frequency

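Next, I need to count how often certain words appear in certain speeches. I use a function that counts item frequency, gets rid of duplicate entries, and returns each item with its frequency. I hand the function both the title and the tag. That way, if the speech is titled "Rally in Arizona" and the tag "war" appears in it 6 times while "women" appears 8 times, I should get a result like this: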

{women, rally in Arizona, 8}
{war, rally in Arizona, 6}

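That way, I don't just get the frequency of the tag or the frequency of the speech. I get the two combined: the frequency of that tag in that speech. The function looks like this: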

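// Adds a byCount() method to every array: it counts identical items,
// removes duplicates, and returns [{item, frequency}] sorted by frequency.
Array.prototype.byCount = function() {
  var itm, a = [], L = this.length, o = {};
  for (var i = 0; i < L; i++) {
    itm = this[i];
    if (!itm) continue;
    // Object keys are strings, so an array item is stored as "a,b,c".
    if (o[itm] === undefined) o[itm] = 1;
    else ++o[itm];
  }
  for (var p in o) a[a.length] = {item: p, frequency: o[p]};
  // Most frequent items first.
  return a.sort(function(a, b) {
    return o[b.item] - o[a.item];
  });
};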

let tagFreq = tagsOutput.byCount();
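It helps to see what the counting step returns when each item is itself an array. The sketch below does the same kind of counting in a standalone function instead of extending Array.prototype. The name `countItems` and the sample rows are assumptions for illustration.

```javascript
// Count identical items and return them sorted by descending frequency.
// Items are compared by their string form, so an array becomes "a,b,c".
function countItems(items) {
  const counts = new Map();
  for (const item of items) {
    if (!item) continue; // skip empty entries
    const key = String(item);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return Array.from(counts, ([item, frequency]) => ({ item, frequency }))
    .sort((a, b) => b.frequency - a.frequency);
}

const rows = [
  [0, "women", "Rally in Arizona"],
  [0, "war", "Rally in Arizona"],
  [0, "women", "Rally in Arizona"],
];
console.log(countItems(rows));
// [ { item: '0,women,Rally in Arizona', frequency: 2 },
//   { item: '0,war,Rally in Arizona', frequency: 1 } ]
```

Each `item` comes back as a single comma-joined string rather than as the original array.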

Step 3: Refine the tags

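Now I will end up with a lot of tags that have a frequency of 1, 2, or 3. My feeling is that if "budget" shows up only twice in a 45-minute speech, it is not significant enough to the speech to be added as a tag. So I create a new array and loop through "tagFreq", adding only the objects with a frequency higher than 6. This should ensure that these tags appear often enough in the speech to be added as relevant tags.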

let arrayOfTags = [];
for (var i = 0; i < tagFreq.length; i++) {
  if (tagFreq[i].frequency > 6) {
    arrayOfTags.push([
      tagFreq[i].item
    ])
  }
}
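The same filtering can be written with `filter` and `map`, which also makes the cutoff easy to adjust. This is only a sketch; `MIN_FREQUENCY`, `keepFrequentTags`, and the sample data are hypothetical names and values.

```javascript
// Hypothetical threshold: a tag must occur more often than this to be kept.
const MIN_FREQUENCY = 6;

// Keep only the entries whose frequency exceeds the threshold.
function keepFrequentTags(tagFreq, minFrequency = MIN_FREQUENCY) {
  return tagFreq
    .filter((entry) => entry.frequency > minFrequency)
    .map((entry) => [entry.item]); // same one-element-array shape as arrayOfTags
}

const exampleFreq = [
  { item: "0,women,Rally in Arizona", frequency: 8 },
  { item: "0,war,Rally in Arizona", frequency: 6 },
];
console.log(keepFrequentTags(exampleFreq)); // [ [ '0,women,Rally in Arizona' ] ]
```

A fixed cutoff treats a 10-minute and a 45-minute speech alike, so a threshold relative to the speech length may be worth trying.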

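This next part is not strictly necessary, but I am including it. I realized after the fact that the function returns one giant string followed by the frequency. I need to break each part of that string out into an array, so that the title, date, location, and text each have their own array index. There are better ways to do this, but I used a series of string splits at the commas to pull out the specific elements.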

Step 4: Restructure the data in the array

let matchedTags = [];
for (var i = 0; i < arrayOfTags.length; i++) {

  let index = arrayOfTags[i][0].split(',', 1)[0];
  let remainingString1 = arrayOfTags[i][0].split(/,(.+)/)[1];
  let tag = remainingString1.split(',', 1)[0];
  let remainingString2 = remainingString1.split(/,(.+)/)[1];
  let title = remainingString2.split(',', 1)[0];
  let remainingString3 = remainingString2.split(/,(.+)/)[1];
  let location = remainingString3.split(',', 1)[0];
  let remainingString4 = remainingString3.split(/,(.+)/)[1];
  let date = remainingString4.split(',', 1)[0];
  let text = remainingString4.split(/,(.+)/)[1];

  matchedTags.push([
    index,
    tag,
    title,
    location,
    date,
    text
  ])
}

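I was worried that the speech text contains a lot of commas and that JavaScript would accidentally split it in the wrong places. So what I do here is split the string at the first comma, which gives me the "index". I save everything after the index separately, then do the same thing again: split it, take the tag at the next comma, then rinse and repeat.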
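The repeated splitting can be folded into one small helper that cuts at the first few commas and leaves the rest untouched. This is a sketch under the same assumption the code above makes, namely that the title, location, and date contain no commas. The name `splitFirst` and the sample row are made up for illustration.

```javascript
// Split `str` at the first `count` separators and keep the rest intact as the last part.
function splitFirst(str, separator, count) {
  const parts = [];
  let rest = str;
  for (let n = 0; n < count; n++) {
    const pos = rest.indexOf(separator);
    if (pos === -1) break; // fewer separators than expected
    parts.push(rest.slice(0, pos));
    rest = rest.slice(pos + separator.length);
  }
  parts.push(rest);
  return parts;
}

const row = "3,taxes,Rally in Arizona,Arizona,August 2017,Thank you, everybody, thank you.";
const [idx, tag, title, place, date, text] = splitFirst(row, ",", 5);
console.log(tag);  // "taxes"
console.log(text); // "Thank you, everybody, thank you."
```

Unlike a regular expression that uses `.`, which does not match line breaks by default, this version also keeps speech text that spans several lines. Storing each match as an object instead of a joined string would avoid the parsing altogether.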

Step 5: Set up the back end in Node.js

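The "matchedTags" array gives me more or less the data structure I was hoping for. Now comes the tricky part. I am going to shift gears and move to the back end. With Node.js, I am using MongoDB to create a document-based database. Rather than simply adding a "tags" field to my data, I am going to go back and add two fields. In this post I will not walk through the whole process of setting up a back end in Node, only enough to explain what I am doing.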

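Originally, I used Mongoose to create a model for my data. I had only one model, "Speech". Now I am going back to add another model called "speechID". Its structure looks like this: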

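const mongoose = require('mongoose');
// Assumption: `timestamps` is a Mongoose timestamp plugin that the original
// project imports elsewhere; the article does not show that import.
let Schema = mongoose.Schema;

const speechIDSchema = new Schema({
  speechID: {
    type: String,
  },
});
speechIDSchema.plugin(timestamps);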
const SpeechID = mongoose.model('speechID', speechIDSchema);

module.exports = SpeechID;

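In the second model, I reference the first one. This is similar to creating table associations in a relational database. Here, my Speech documents are related to my SpeechID documents. The reason is that later on I need to add tags to a speech. By including the speech ID in the "Speech" model, I can use that ID in the post method to make sure the tags are added to the correct speech. My Speech model looks like this: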

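const mongoose = require('mongoose');
let Schema = mongoose.Schema;

const speechSchema = new Schema({

  speechID: {
    type: String,
    // Must match the name the SpeechID model is registered under.
    ref: 'speechID',
  },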
  title: {
    type: String,
  },
  date: {
    type: String,
  },
  location: {
    type: String,
  },
  text: {
    type: String,
  },
  tags: [{
    type: String,
  }],

})
speechSchema.plugin(timestamps);
const Speech = mongoose.model('speech', speechSchema);

module.exports = Speech;

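As you can see here, every speech gets a speechID, but I don't have to type it in; I just take the ID from the SpeechID model. In my app.js, here is how I structured app.post to accommodate this change: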

//====APP POST SPEECH===//

app.post('/speechnew/:speechID', function(req, res) {
  Speech.create({
    speechID: req.params.speechID,
    title: req.body.title,
    date: req.body.date,
    location: req.body.location,
    text: req.body.text,
    tags: req.body.tags,
  }).then(speech => {
    res.json(speech);
  }).catch(err => {
    // Report failures instead of leaving the request hanging.
    res.status(500).json({ error: err.message });
  });
});

//==========================//

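The app.post route will be /speechnew/:speechID. Again, when I create the form, I give it the actual speech ID, which ensures the tags update the correct document. You may notice that this is the "create" method, whereas when I add the tags I will use the "findOneAndUpdate" method. In Mongoose the two calls look much alike; note that findOneAndUpdate expects a filter that identifies the document, followed by the fields to update. Here is what that post request looks like: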

//====APP UPDATE SPEECH===//

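app.post('/speechupdate/:speechID', function(req, res) {
  Speech.findOneAndUpdate(
    // Filter: find the speech by its ID.
    { speechID: req.params.speechID },
    // Update: the fields to overwrite.
    {
      title: req.body.title,
      date: req.body.date,
      location: req.body.location,
      text: req.body.text,
      tags: req.body.tags,
    },
    // Return the updated document rather than the original.
    { new: true }
  ).then(speech => {
    res.json(speech);
  }).catch(err => {
    res.status(500).json({ error: err.message });
  });
});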
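Setting "tags" on every update replaces whatever tags the speech already had. If the goal is to accumulate tags across several submissions, one option is MongoDB's `$addToSet` operator with `$each`. The sketch below only builds the filter and update objects so that it runs without a database. The helper name `buildTagUpdate` and the sample values are assumptions for illustration.

```javascript
// Build the arguments for an update that adds tags without replacing existing ones.
// `speechID` and `tags` would come from req.params and req.body in the route above.
function buildTagUpdate(speechID, tags) {
  // A form field may be missing, a single string, or an array of strings.
  const raw = tags == null ? [] : Array.isArray(tags) ? tags : [tags];
  const tagList = raw
    .map((t) => String(t).trim().toLowerCase())
    .filter((t) => t.length > 0);

  return {
    filter: { speechID: speechID },
    // $addToSet with $each appends only the values not already in the array.
    update: { $addToSet: { tags: { $each: tagList } } },
  };
}

const { filter, update } = buildTagUpdate("abc123", ["Taxes", " jobs "]);
console.log(JSON.stringify(filter)); // {"speechID":"abc123"}
console.log(JSON.stringify(update)); // {"$addToSet":{"tags":{"$each":["taxes","jobs"]}}}
```

Inside the route, the two objects would be passed as `Speech.findOneAndUpdate(filter, update, { new: true })`.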

//==========================//

Step 6: Dynamically create a form that posts to the back end

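Again, I won't go into a lot of Node.js here. All I need to do is make sure the form action matches the route here and the form input names match the req.body names here. OK, let's go back to the front-end JavaScript. I left off with the "matchedTags" array, which holds the data structure I want. I will break the next part down step by step. Basically, I loop through the array and create a form and form inputs for each array item. I dynamically set an ID on every element I create by building variables that serve as the element IDs. Then I use .value to set the values of the form inputs to the data in the array. My form action is the update route I created on the back end, so the only thing that should change on update is the "tags".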

for (var i = 0; i < api.length; i++) {
  let speechID = api[i].speechID;

I loop through api again and save each array element's ID in a variable.

for (var k = 0; k < matchedTags.length; k++) {
    let formID = "form"+matchedTags[k][0]+"";
    let inputID1 = "input1"+matchedTags[k][0]+"";
    let inputID2 = "input2"+matchedTags[k][0]+"";
    let inputID3 = "input3"+matchedTags[k][0]+"";
    let inputID4 = "input4"+matchedTags[k][0]+"";
    let inputID5 = "input5"+matchedTags[k][0]+"";

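Inside the initial for loop, I start a second for loop that goes through my matchedTags array. Every element I create needs a unique ID. Since I don't know how big the array will be, I have to let my for loop decide. The only thing I know is that there will be one form with 5 input fields, for "title", "date", "location", "text", and "tags". Each element gets an ID based on the speech's index in the api array, which is held in "matchedTags[k][0]".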

if(matchedTags[k][0] == api.indexOf(api[i])) {

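Next, an if statement makes sure that the index captured earlier and stored in matchedTags[k][0] matches the indexOf for the current item in the api array. If it matches, I create all the form elements I need, using the IDs created earlier.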

let newForm = document.createElement('form');
newForm.id = formID;
// Attach the form to the element with the id "data".
document.getElementById('data').appendChild(newForm);
newForm.action = "/speechupdate/" + speechID;
newForm.method = "post";

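Here is my new form, again using the formID variable. In newForm.action, I use the route I created on the back end, /speechupdate/:speechID, only here speechID is a variable holding the speech's ID. This is what connects the form to the correct database entry. Next, I create some inputs and append them to the form, just as I appended the form to the "data" element in the HTML body: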

// Append each input to newForm directly; looking the form up by its ID
// would pick the wrong form whenever two forms share an ID.
let newInput1 = document.createElement('input');
newInput1.id = inputID1;
newInput1.name = "title";
newInput1.value = matchedTags[k][2];
newForm.appendChild(newInput1);

let newInput2 = document.createElement('input');
newInput2.id = inputID2;
newInput2.name = "date";
newInput2.value = matchedTags[k][4];
newForm.appendChild(newInput2);

let newInput3 = document.createElement('input');
newInput3.id = inputID3;
newInput3.name = "location";
newInput3.value = matchedTags[k][3];
newForm.appendChild(newInput3);

let newInput4 = document.createElement('input');
newInput4.id = inputID4;
newInput4.name = "text";
newInput4.value = matchedTags[k][5];
newForm.appendChild(newInput4);

let newInput5 = document.createElement('input');
newInput5.id = inputID5;
newInput5.name = "tags";
newInput5.value = matchedTags[k][1];
newForm.appendChild(newInput5);

let submitButton = document.createElement('input');
submitButton.id = "submitbtn" + matchedTags[k][0];
submitButton.type = "submit";
newForm.appendChild(submitButton);

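Each form input is appended to the form and has its own ID. Each input.name also matches the req.body name we used on the back end. Here is what the whole function looks like: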

for (var i = 0; i < api.length; i++) {
  let speechID = api[i].speechID;
  for (var k = 0; k < matchedTags.length; k++) {
    let formID = "form" + matchedTags[k][0];
    let inputID1 = "input1" + matchedTags[k][0];
    let inputID2 = "input2" + matchedTags[k][0];
    let inputID3 = "input3" + matchedTags[k][0];
    let inputID4 = "input4" + matchedTags[k][0];
    let inputID5 = "input5" + matchedTags[k][0];

    // matchedTags[k][0] is a string, so loose equality is used against the numeric index.
    if (matchedTags[k][0] == api.indexOf(api[i])) {

      let newForm = document.createElement('form');
      newForm.id = formID;
      // Attach the form to the element with the id "data".
      document.getElementById('data').appendChild(newForm);
      newForm.action = "/speechupdate/" + speechID;
      newForm.method = "post";

      // Append each input to newForm directly; looking the form up by its ID
      // would pick the wrong form whenever two forms share an ID.
      let newInput1 = document.createElement('input');
      newInput1.id = inputID1;
      newInput1.name = "title";
      newInput1.value = matchedTags[k][2];
      newForm.appendChild(newInput1);

      let newInput2 = document.createElement('input');
      newInput2.id = inputID2;
      newInput2.name = "date";
      newInput2.value = matchedTags[k][4];
      newForm.appendChild(newInput2);

      let newInput3 = document.createElement('input');
      newInput3.id = inputID3;
      newInput3.name = "location";
      newInput3.value = matchedTags[k][3];
      newForm.appendChild(newInput3);

      let newInput4 = document.createElement('input');
      newInput4.id = inputID4;
      newInput4.name = "text";
      newInput4.value = matchedTags[k][5];
      newForm.appendChild(newInput4);

      let newInput5 = document.createElement('input');
      newInput5.id = inputID5;
      newInput5.name = "tags";
      newInput5.value = matchedTags[k][1];
      newForm.appendChild(newInput5);

      let submitButton = document.createElement('input');
      submitButton.id = "submitbtn" + matchedTags[k][0];
      submitButton.type = "submit";
      newForm.appendChild(submitButton);

      console.log(newInput3.value);
    }
  }
}
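Since the five inputs differ only in name and value, the form building can be reduced to a loop over the fields. The following sketch takes the DOM document as a parameter so that it can also be exercised outside a browser. The function name `buildTagForm`, the ID scheme, and the parameter names are assumptions for illustration, not the code running on the site.

```javascript
// Build one form for a speech. `doc` is the DOM document (pass `document` in the browser).
function buildTagForm(doc, speechID, fields) {
  const form = doc.createElement("form");
  form.id = "form-" + speechID;
  form.action = "/speechupdate/" + encodeURIComponent(speechID);
  form.method = "post";

  // One input per field; the input name must match the req.body name on the server.
  for (const [name, value] of Object.entries(fields)) {
    const input = doc.createElement("input");
    input.id = "input-" + name + "-" + speechID;
    input.name = name;
    input.value = value;
    form.appendChild(input);
  }

  const submit = doc.createElement("input");
  submit.type = "submit";
  form.appendChild(submit);
  return form;
}

// Browser usage (assumes an element with the id "data" exists on the page):
// document.getElementById("data").appendChild(
//   buildTagForm(document, speechID, { title, date, location, text, tags })
// );
```

Building one form per speech, with all of its tags joined into the "tags" field, would also avoid creating several forms that share an ID.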

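This solution works, but it could be cleaner, and there may be better solutions out there. Still, if you are like me and really enjoy Node.js and JavaScript, this approach should work when you need to dynamically update a database with tags generated from its own content. If you have any questions or feedback, please get in touch. Thanks!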

From: https://hackernoon.com/dynamically-generate-tags-for-database-entries-d4fba077dd33