Naive Bayes calculation in SQL

【Question title】: Naive bayes calculation in sql
【Posted】: 2010-10-19 03:46:57
【Question】:

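I want to use naive Bayes to classify documents into a relatively large number of classes. I'm looking to confirm whether a mention of an entity name in an article really is that entity, based on whether the article is similar to articles in which that entity has been correctly verified.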

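Say we find the text "General Motors" in an article. We have a data set that contains articles and the correct entities mentioned in them. So, if we find "General Motors" mentioned in a new article, should it fall into the class of articles in the prior data that contained a known genuine mention of "General Motors", or into the class of articles that did not mention that entity?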

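(I'm not creating a class for every entity and trying to classify every new article into every possible class. I already have a heuristic method for finding plausible mentions of entity names, and I just want to verify the plausibility of the limited number of entity name mentions per article that the method has already detected.)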

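Given that the number of potential classes and articles is quite large and naive Bayes is relatively simple, I wanted to do the whole thing in sql, but I'm having trouble with the scoring query...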

Here's what I have so far:

CREATE TABLE `each_entity_word` (
  `word` varchar(20) NOT NULL,
  `entity_id` int(10) unsigned NOT NULL,
  `word_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`word`, `entity_id`)
);

CREATE TABLE `each_entity_sum` (
  `entity_id` int(10) unsigned NOT NULL DEFAULT '0',
  `word_count_sum` int(10) unsigned DEFAULT NULL,
  `doc_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`entity_id`)
);

CREATE TABLE `total_entity_word` (
  `word` varchar(20) NOT NULL,
  `word_count` int(10) unsigned NOT NULL,
  PRIMARY KEY (`word`)
);

CREATE TABLE `total_entity_sum` (
  `word_count_sum` bigint(20) unsigned NOT NULL,
  `doc_count` int(10) unsigned NOT NULL,
  `pkey` enum('singleton') NOT NULL DEFAULT 'singleton',
  PRIMARY KEY (`pkey`)
);

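Each article in the labeled data is split into distinct words. For each entity known to be mentioned in the article, every word is added to each_entity_word (or its word_count is incremented if the row already exists), and doc_count is incremented in each_entity_sum, both keyed by entity_id.

For each article, regardless of which entities it mentions, total_entity_word and total_entity_sum are incremented in the same way for each word.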

P(word | any document) should equal the word_count in total_entity_word for that word over the doc_count in total_entity_sum.
P(word | document mentions entity x) should equal the word_count in each_entity_word for that word and entity_id x over the doc_count in each_entity_sum for entity_id x.
P(word | document does not mention entity x) should equal (the word_count in total_entity_word minus the word_count in each_entity_word for that word and that entity) over (the doc_count in total_entity_sum minus the doc_count in each_entity_sum for that entity).
P(document mentions entity x) should equal the doc_count in each_entity_sum for that entity id over the doc_count in total_entity_sum.
P(document does not mention entity x) should equal 1 minus (the doc_count in each_entity_sum for x's entity id over the doc_count in total_entity_sum).
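To make the five definitions above concrete, here is a minimal Python sketch that evaluates them for one word and one entity. The dictionaries only mimic the four tables, and every count in them is invented for illustration; nothing here comes from real data.

```python
# Hypothetical counts mirroring the four tables; all numbers are invented.
total_entity_sum = {"doc_count": 100}
each_entity_sum = {1: {"doc_count": 20}}   # entity_id -> summary row
total_entity_word = {"motors": 30}         # word -> word_count
each_entity_word = {("motors", 1): 15}     # (word, entity_id) -> word_count


def probabilities(word, entity_id):
    """Return the five quantities defined above for one word and one entity."""
    total_docs = total_entity_sum["doc_count"]
    entity_docs = each_entity_sum[entity_id]["doc_count"]
    total_wc = total_entity_word.get(word, 0)
    entity_wc = each_entity_word.get((word, entity_id), 0)
    return {
        "p_word": total_wc / total_docs,
        "p_word_given_entity": entity_wc / entity_docs,
        "p_word_given_not_entity": (total_wc - entity_wc) / (total_docs - entity_docs),
        "p_entity": entity_docs / total_docs,
        "p_not_entity": 1 - entity_docs / total_docs,
    }


p = probabilities("motors", 1)
```

Because each article contributes only its distinct words, a word's count can never exceed the matching document count, so each ratio stays between 0 and 1. A word that never occurs with the entity gets probability 0, which is presumably why the attempt further down adds 1 to the counts.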

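For a new article coming in, split it into words and simply select where word in ('I', 'want', 'to', 'use'...) against either each_entity_word or total_entity_word. On the database platform I'm working with (mysql), IN clauses are relatively well optimized.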

Also, there is no product() aggregate function in sql, so of course you can just use sum(log(x)), or exp(sum(log(x))), to get the equivalent of product(x).
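As a quick check of that identity outside the database, the following Python sketch compares a direct product with exp(sum(log(x))). It also shows why the log form is usually the one to keep: a plain product of many small probabilities underflows to zero. The input values are arbitrary.

```python
import math


def product_via_logs(values):
    """Equivalent of PRODUCT(x) for strictly positive x: exp(sum(log(x)))."""
    return math.exp(sum(math.log(v) for v in values))


probs = [0.5, 0.25, 0.8]
direct = math.prod(probs)          # plain product of the three factors
via_logs = product_via_logs(probs)  # same value, up to rounding

# With many small factors the plain product underflows to 0.0,
# while the sum of logs stays a usable finite number.
tiny = [1e-5] * 100
underflowed = math.prod(tiny)
log_sum = sum(math.log(v) for v in tiny)
```

The identity only holds for strictly positive inputs, since log(0) is undefined; in SQL, LOG of zero or a negative number typically yields NULL or an error depending on the engine and its settings.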

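So, if I get a new article, split it into distinct words, and put those words into a big IN() clause together with a candidate entity id to test, how can I get, in sql, the naive Bayes probability that the article falls into that entity id's class?

EDIT:

Attempt #1: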

set @entity_id = 1;

select doc_count into @entity_doc_count from each_entity_sum where entity_id=@entity_id;

select doc_count into @total_doc_count from total_entity_sum;

select 
            exp(

                log(@entity_doc_count / @total_doc_count) + 

                (
                    sum(log((ifnull(ew.word_count,0) + 1) / @entity_doc_count)) / 
                    sum(log(((aew.word_count + 1) - ifnull(ew.word_count, 0)) / (@total_doc_count - @entity_doc_count)))
                )

            ) as likelihood
        from total_entity_word aew 
        left outer join each_entity_word ew on ew.word=aew.word and ew.entity_id=@entity_id

        where aew.word in ('I', 'want', 'to', 'use'...);

【Question comments】:

Tags: sql mysql machine-learning nlp bayesian

【Solution 1】:

Here is a simple version for SQL Server. I run it on a free SQL Express installation and it is pretty fast.

http://sqldatamine.blogspot.com/2013/07/classification-using-naive-bayes.html

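【Comments】:

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.

【Solution 2】:

Here is a blog post detailing what you are looking for: http://nuncupatively.blogspot.com/2011/07/naive-bayes-in-sql.html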

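I have coded up many versions of NB classifiers in SQL. The answers advocating a switch to a separate analysis package did not scale to my large data and processing time requirements. I had a table with one row for each word/class combination (nrows = words * classes) and a coefficient column. I had another table with document_id and word columns. I simply joined these tables on word, grouped by document, summed the coefficients, and then adjusted the sums for the class probability. This left me with a table of document_id, class, score. I then picked the minimum score (since I was using a complement naive Bayes approach, which I found worked better in a multi-class situation).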
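The join-and-sum step described above can be sketched roughly as follows, using Python's built-in sqlite3 module so the query is runnable as-is. The table names, column names, and coefficient values are all hypothetical; the answer does not give its actual schema, and the numbers merely stand in for precomputed log weights.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per word/class combination with a precomputed coefficient.
    CREATE TABLE coefficients (word TEXT, class_label TEXT, coefficient REAL,
                               PRIMARY KEY (word, class_label));
    -- Per-class adjustment term (for example a log prior).
    CREATE TABLE class_adjustment (class_label TEXT PRIMARY KEY, adjustment REAL);
    -- One row per distinct word in each document to be scored.
    CREATE TABLE document_words (document_id INTEGER, word TEXT);

    INSERT INTO coefficients VALUES
        ('motors', 'auto', -1.0), ('motors', 'other', -3.0),
        ('engine', 'auto', -1.5), ('engine', 'other', -2.5);
    INSERT INTO class_adjustment VALUES ('auto', -0.5), ('other', -0.25);
    INSERT INTO document_words VALUES (1, 'motors'), (1, 'engine');
""")

# Join on word, group by document and class, sum the coefficients,
# then add the per-class adjustment.
rows = conn.execute("""
    SELECT d.document_id, c.class_label,
           SUM(c.coefficient) + a.adjustment AS score
    FROM document_words AS d
    JOIN coefficients AS c ON c.word = d.word
    JOIN class_adjustment AS a ON a.class_label = c.class_label
    GROUP BY d.document_id, c.class_label, a.adjustment
    ORDER BY d.document_id, c.class_label
""").fetchall()
```

Words with no coefficient row simply drop out of the inner join. Whether the best class is the one with the highest or the lowest score depends on how the coefficients were built: with ordinary log-probability weights it would be the maximum, while the complement approach mentioned above picks the minimum.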

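As a side note, I found that a number of transformations and algorithm modifications improved my holdout predictions a great deal. They are described in Jason Rennie's work "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" and summarized here: http://www.ist.temple.edu/~vucetic/cis526fall2007/liang.ppt

【Comments】:

【Solution 3】:

Use an R to Postgres (or MySQL, etc.) interface

Alternatively, I'd suggest using an established stats package with a connector to the database. This will make your application much more flexible if you later want to switch from naive Bayes to something more sophisticated: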

http://rpgsql.sourceforge.net/

bnd.pr> data(airquality)

bnd.pr> db.write.table(airquality, no.clobber = F)

bnd.pr> bind.proxy("airquality")

bnd.pr> summary(airquality)
Table name: airquality 
Database: test 
Host: localhost
Dimensions: 6 (columns) 153 (rows)


bnd.pr> print(airquality)
   Day Month Ozone Solar.R Temp
1    1     5    41     190   67
2    2     5    36     118   72
3    3     5    12     149   74
4    4     5    18     313   62
5    5     5    NA      NA   56
6    6     5    28      NA   66
7    7     5    23     299   65
8    8     5    19      99   59
9    9     5     8      19   61
10  10     5    NA     194   69
Continues for 143 more rows and 1 more cols...

bnd.pr> airquality[50:55, ]
   Ozone Solar.R Wind Temp Month Day
50    12     120 11.5   73     6  19
51    13     137 10.3   76     6  20
52    NA     150  6.3   77     6  21
53    NA      59  1.7   76     6  22
54    NA      91  4.6   76     6  23
55    NA     250  6.3   76     6  24

bnd.pr> airquality[["Ozone"]]
  [1]  41  36  12  18  NA  28  23  19   8  NA   7  16  11  14  18  14  34   6
 [19]  30  11   1  11   4  32  NA  NA  NA  23  45 115  37  NA  NA  NA  NA  NA
 [37]  NA  29  NA  71  39  NA  NA  23  NA  NA  21  37  20  12  13  NA  NA  NA
 [55]  NA  NA  NA  NA  NA  NA  NA 135  49  32  NA  64  40  77  97  97  85  NA
 [73]  10  27  NA   7  48  35  61  79  63  16  NA  NA  80 108  20  52  82  50
 [91]  64  59  39   9  16  78  35  66 122  89 110  NA  NA  44  28  65  NA  22
[109]  59  23  31  44  21   9  NA  45 168  73  NA  76 118  84  85  96  78  73
[127]  91  47  32  20  23  21  24  44  21  28   9  13  46  18  13  24  16  13
[145]  23  36   7  14  30  NA  14  18  20

You would then want to install the e1071 package to do naive Bayes. At the R prompt:

[ramanujan:~/base]$R

R version 2.7.2 (2008-08-25)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


 ~/.Rprofile loaded.
Welcome at  Sun Apr 19 00:45:30 2009
> install.packages("e1071")  
> install.packages("mlbench")
> library(e1071)
> ?naiveBayes
> example(naiveBayes)

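【Comments】:

【Solution 4】:

If using Oracle, it has data mining built in

I'm not sure what database you're running, but if you are using Oracle, data mining capabilities are baked into the database: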

http://www.oracle.com/technology/products/bi/odm/index.html

...including naive Bayes:

http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/algo_nb.htm

And a ton of others:

http://www.oracle.com/technology/products/bi/odm/odm_techniques_algorithms.html

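This was surprising to me. It is definitely one of the competitive advantages that Oracle has over the open source alternatives in this area.

【Comments】:

Unfortunately, the poster is unlikely to have a spare $1000 for a license. Out of curiosity I looked at the trial version, and the download is over 2 GB. I'm sure it's very good, but it's almost certainly overkill for what he's looking for.

【Solution 5】:

I don't have time to work out all the expressions for the NB formula, but here's the main idea: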

SET @entity = 123;

SELECT  EXP(SUM(LOG(probability))) / (EXP(SUM(LOG(probability))) + EXP(SUM(LOG(1 - probability))))
FROM    (
        SELECT  @entity AS _entity,
                /* Above is required for efficiency, subqueries using _entity will be DEPENDENT and use the indexes */
                (
                SELECT  SUM(word_count)
                FROM    total_entity_word
                WHERE   word = d.word
                )
                /
                (
                SELECT  doc_count
                FROM    each_entity_sum
                WHERE   entity_id = _entity
                ) AS pwordentity,
                /* I've just referenced a previously selected field */
                (
                SELECT  1 - pwordentity
                ) AS pwordnotentity,
                /* Again referenced a previously selected field */
                ... etc AS probability
        FROM    total_entity_word d
        ) q

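Note that you can easily refer to previously selected fields in the SELECT list by using them in correlated subqueries (as in the example).

【Comments】: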
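The outer expression in the query above, EXP(SUM(LOG(p))) / (EXP(SUM(LOG(p))) + EXP(SUM(LOG(1 - p)))), can be checked outside the database with a short Python sketch. The function name and the input values are made up for illustration. The sketch also rearranges the formula so that it depends only on the difference of the two log sums, which is one possible way to avoid a 0/0 result when both products underflow.

```python
import math


def combine(probabilities):
    """prod(p) / (prod(p) + prod(1 - p)), computed from sums of logs.

    Every p must lie strictly between 0 and 1.
    """
    log_yes = sum(math.log(p) for p in probabilities)
    log_no = sum(math.log(1.0 - p) for p in probabilities)
    # Dividing numerator and denominator by prod(p) gives
    # 1 / (1 + exp(log_no - log_yes)).
    diff = log_no - log_yes
    if diff > 700:  # math.exp would overflow; the result is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


score = combine([0.9, 0.8, 0.3])      # 0.216 / (0.216 + 0.014)
balanced = combine([0.5] * 10)        # both products are equal
many_words = combine([0.6] * 5000)    # both plain products underflow to 0.0
```

For the three-word case the result matches the direct formula, 0.216 / 0.230. With 5000 words the literal EXP(SUM(LOG(...))) form would divide zero by zero, whereas the rearranged form still returns a value.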