Counting the word frequency of a text variable with Hive (answers)

【Question title】: Count Frequency of words of a Text variable with Hive
【Posted】: 2021-04-02 21:32:08
【Question】:

I have a variable in which every row is a sentence. Example:

 -Row1 "Hey, how are you?"
 -Row2 "Hey, Who is there?"

I want the output to be a count grouped by word.

Example:

Hey 2
How 1
are 1
...

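I am using the split function, but I am a bit stuck. Any ideas?

Thanks!

【Comments】:

Tags: hadoop text hive counter hiveql

【Solution 1】:

This is possible in Hive. Split each sentence on non-alphabetic characters, use lateral view + explode to get one row per word, then count the words: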

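-- Sample data: two rows, one sentence each
with your_data as (
  select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
  ) as initial_string
)

select w.word, count(*) cnt
from
(
  -- lowercase first so that 'Hey' and 'hey' are counted together
  select split(lower(initial_string), '[^a-zA-Z]+') words from your_data
) s lateral view explode(words) w as word  -- one row per array element
where w.word != ''  -- skip any empty tokens left by the split
group by w.word;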

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1
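
To see what the split and explode steps produce before the grouping is applied, a minimal sketch on a single literal string may help. The aliases `s`, `words`, `w` and `word` are only illustrative, and the query assumes a Hive version that accepts a subquery without a FROM clause (0.13 or later):

```sql
-- Intermediate step only: one output row per word, no counting yet.
select w.word
from (
  -- split() returns an array<string>; the regex treats every run of
  -- non-letters (spaces, commas, question marks) as a delimiter
  select split(lower('Hey, how are you?'), '[^a-zA-Z]+') as words
) s
lateral view explode(s.words) w as word;
```

Each array element becomes its own row, which is what makes the later `group by w.word` possible.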

Another approach uses the sentences function, which returns an array of tokenized sentences (an array of arrays of words):

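-- Sample data: two rows, one sentence each
with your_data as (
  select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
  ) as initial_string
)

select w.word, count(*) cnt
from
(
  -- sentences() returns array<array<string>>: one inner array per sentence
  select sentences(lower(initial_string)) sentences from your_data
) d lateral view explode(sentences) s as sentence  -- one row per sentence
    lateral view explode(s.sentence) w as word     -- one row per word
group by w.word;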

Result:

word    cnt
are     1
hey     2
how     1
is      1
there   1
who     1
you     1

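The sentences(string str, string lang, string locale) function tokenizes a string of natural-language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. 'lang' and 'locale' are optional arguments. For example, sentences('Hello there! How are you?') returns [["Hello", "there"], ["How", "are", "you"]].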
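
As a rough sketch of how that nested result can be inspected, the queries below call sentences on the same literal and then flatten it while keeping track of which sentence each word came from. The aliases `tokenized`, `sentence_no` and `word` are illustrative names, and the second query assumes a Hive version that provides posexplode (0.13 or later):

```sql
-- 1) Look at the raw array<array<string>> value for a single string.
select sentences('Hello there! How are you?') as tokenized;

-- 2) Flatten it: posexplode yields the position of each inner array
--    (the sentence index) together with the array itself, and the
--    second lateral view turns each inner array into one row per word.
select s.pos as sentence_no, w.word
from (
  select sentences('Hello there! How are you?') as tokenized
) d
lateral view posexplode(d.tokenized) s as pos, sentence
lateral view explode(s.sentence) w as word;
```

Unlike the split-based query, punctuation is removed by the tokenizer itself, so no filter on empty strings is needed here.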

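【Comments】:

【Solution 2】:

Hive cannot do this on its own. You can read the data from Hive into a Pandas DataFrame and process it with Python. Your question then becomes how to count word frequencies in a DataFrame column.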

Counting the Frequency of words in a pandas data frame

【Comments】: