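【Posted】: 2021-04-02 21:32:08
【Question】:
I have a variable where each row is a sentence. Example:
-Row1 "Hey, how are you?"
-Row2 "Hey, Who is there?"
I want the output to be a count grouped by word.
Example:
Hey 2
How 1
are 1
...
I am using the split function, but I am a bit stuck. Any ideas on this?
Thanks!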
【Question comments】:
Tags: hadoop text hive counter hiveql
This is possible in Hive. Split on non-letter characters, use lateral view + explode, then count the words:
with your_data as (
  select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
  ) as initial_string
)
select w.word, count(*) cnt
from
(
  -- lowercase each row, then split on runs of non-letter characters
  select split(lower(initial_string), '[^a-zA-Z]+') words from your_data
) s lateral view explode(words) w as word
where w.word != ''  -- drop empty tokens that split can leave at string boundaries
group by w.word;
Result:
word cnt
are 1
hey 2
how 1
is 1
there 1
who 1
you 1
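To see why the query filters out empty strings, the per-row tokenization can be mimicked outside Hive. The sketch below is only an illustration in Python: `re.split` stands in for Hive's `split`, and the helper name `split_words` is hypothetical. It assumes both functions keep the empty token that follows a trailing separator.

```python
import re

def split_words(sentence):
    """Lowercase a sentence and split it on runs of non-letter characters."""
    # Mirrors split(lower(initial_string), '[^a-zA-Z]+') from the query above.
    return re.split(r"[^a-zA-Z]+", sentence.lower())

tokens = split_words("Hey, how are you?")
# The trailing "?" leaves an empty token at the end of the list.
print(tokens)

# Equivalent of the `where w.word != ''` filter.
words = [t for t in tokens if t != ""]
print(words)
```

Without the filter, the empty token would be grouped and counted like any other word.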
Another approach uses the sentences function, which returns an array of tokenized sentences (an array of arrays of words):
with your_data as (
  select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
  ) as initial_string
)
select w.word, count(*) cnt
from
(
  select sentences(lower(initial_string)) sentences from your_data
) d lateral view explode(sentences) s as sentence  -- one row per sentence
    lateral view explode(s.sentence) w as word     -- one row per word
group by w.word;
Result:
word cnt
are 1
hey 2
how 1
is 1
there 1
who 1
you 1
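The sentences(string str, string lang, string locale) function tokenizes a string of natural-language text into words and sentences; each sentence is broken at the appropriate sentence boundary and returned as an array of words. 'lang' and 'locale' are optional arguments. For example, sentences('Hello there! How are you?') returns [["Hello", "there"], ["How", "are", "you"]]
【Comments】: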
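Hive cannot do this on its own. You can read the data from Hive into a Pandas DataFrame and process it with Python. Your question then becomes how to count word frequencies in a DataFrame column.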
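Once the rows are in Python, the counting itself needs nothing beyond the standard library. This is a minimal sketch, assuming the sentences have already been loaded into a plain list (for example, taken from a DataFrame column). The names `rows` and `count_words` are illustrative, and treating only runs of letters as words is an assumption, not something this answer specifies.

```python
import re
from collections import Counter

def count_words(rows):
    """Count case-insensitive word frequencies across an iterable of sentences."""
    counts = Counter()
    for row in rows:
        # findall returns only the letter runs, so no empty tokens appear.
        counts.update(re.findall(r"[a-z]+", row.lower()))
    return counts

rows = ["Hey, how are you?", "Hey, Who is there?"]
for word, cnt in sorted(count_words(rows).items()):
    print(word, cnt)
```

Matching words with `findall` instead of splitting on separators avoids the need for a separate empty-string filter.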
【Comments】: