【Question title】: Count frequency of words in a text variable with Hive
【Posted】: 2021-04-02 21:32:08
【Question】:

I have a variable in which every row is a sentence. Example:

 -Row1 "Hey, how are you?"
 -Row2 "Hey, Who is there?"

I would like the output to be a count grouped by word.

Example:

Hey 2
How 1
are 1
...

I am using the split function, but I am a bit stuck. Any ideas?

Thanks!

【Comments】:

    Tags: hadoop text hive counter hiveql


    【Solution 1】:

    This is possible in Hive. Split on non-letter characters, use lateral view + explode, then count the words:

    with your_data as(
    select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
    ) as initial_string
    )
    
    select w.word, count(*) cnt
    from
    (
    select split(lower(initial_string),'[^a-zA-Z]+') words from your_data
    )s lateral view explode(words) w as word
    where w.word!=''
    group by w.word;
    

    Result:

    word    cnt
    are     1
    hey     2
    how     1
    is      1
    there   1
    who     1
    you     1
    

    Another approach uses the sentences function, which returns an array of tokenized sentences (an array of arrays of words):

    with your_data as(
    select stack(2,
    'Hey, how are you?',
    'Hey, Who is there?'
    ) as initial_string
    )
    
    select w.word, count(*) cnt
    from
    (
    select sentences(lower(initial_string)) sentences from your_data
    )d lateral view explode(sentences) s as sentence
       lateral view explode(s.sentence) w as word
    group by w.word;
    

    Result:

    word    cnt
    are     1
    hey     2
    how     1
    is      1
    there   1
    who     1
    you     1
    

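    The sentences(string str, string lang, string locale) function tokenizes a string of natural-language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. 'lang' and 'locale' are optional arguments. For example, sentences('Hello there! How are you?') returns [["Hello", "there"], ["How", "are", "you"]].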

    【Comments】:

      【Solution 2】:

      You can also do this outside Hive: read the data from Hive into a Pandas DataFrame and process it with Python. The problem then becomes how to count word frequencies in a DataFrame column.

      Counting the Frequency of words in a pandas data frame
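
As a rough sketch of the Python side of this approach, the counting step could look like the following. It assumes the sentences have already been loaded from Hive; the function name `word_counts` and the sample `rows` list are illustrative and not part of any library. It uses only the standard library and accepts any iterable of strings, so a DataFrame column (for example a hypothetical `df["initial_string"]`) could be passed in the same way.

```python
import re
from collections import Counter


def word_counts(rows):
    """Count lowercase words across an iterable of strings."""
    counts = Counter()
    for text in rows:
        # Split on runs of non-letters, the same idea as the
        # '[^a-zA-Z]+' pattern used with split() in the Hive query.
        words = re.split(r"[^a-zA-Z]+", text.lower())
        # Splitting can leave empty strings at the edges; skip them.
        counts.update(w for w in words if w)
    return counts


# Sample data taken from the question (stands in for the Hive column).
rows = ["Hey, how are you?", "Hey, Who is there?"]
counts = word_counts(rows)

for word, cnt in sorted(counts.items()):
    print(word, cnt)
```

For the two sample rows this prints the same seven words and counts as the Hive results in Solution 1.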

      【Comments】:
