Python: counting hashtags per platform (answers)

【Question title】: Python count hashtag per platform
【Posted】: 2022-01-09 15:38:01
【Question】:

My data is organized in a data frame with the following structure:

| ID       | Post                | Platform    |
| -------- | ------------------- | ----------- |
| 1        | Something #hashtag1 | Twitter     |
| 2        | Something #hashtag2 | Insta       |
| 3        | Something #hashtag1 | Twitter     |

I have been able to extract and count the hashtags with the following (based on this post):

    df.Post.str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')

I am now trying to count how often each hashtag occurs per platform. This is what I tried:

    df.groupby(['Post', 'Platform'])['Post'].str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')

However, I get the following error:

    AttributeError: 'SeriesGroupBy' object has no attribute 'str'

【Comments】:

Tags: python python-3.x pandas text pandas-groupby

【Solution 1】:

This can be solved easily in 2 steps, assuming each post contains only one hashtag:

    # Step 1: create a new column holding the (first) hashtag of each post;
    # str.extract keeps the original index, so rows stay aligned
    df['hashtag'] = df.Post.str.extract(r'(\#\w+)', expand=False)

    # Step 2: group by platform and count the hashtags
    df.groupby(['Platform']).hashtag.count()

A general solution that works for any number of hashtags per post:

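    import pandas as pd

    # Extract all hashtags; for an unnamed index, reset_index() exposes
    # the original row label as the column 'level_0'
    df1 = df.Post.str.extractall(r'(\#\w+)')[0].reset_index()
    # Set the index to the row of the original table each hashtag came from
    df1.set_index('level_0', inplace=True)
    df1.rename(columns={0: 'hashtag'}, inplace=True)

    # Attach the original columns (including Platform) to every hashtag
    df2 = pd.merge(df, df1, right_index=True, left_index=True)

    # Number of hashtags per platform
    df2.groupby(['Platform']).hashtag.count()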
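
The count above gives the total number of hashtags per platform. If the goal is a count for each hashtag within each platform, something along the following lines might work. This is only a sketch on the sample data from the question; the names `tags` and `counts` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Post": ["Something #hashtag1", "Something #hashtag2", "Something #hashtag1"],
    "Platform": ["Twitter", "Insta", "Twitter"],
})

# One entry per hashtag; after dropping the 'match' level,
# the index is the label of the row the hashtag came from.
tags = (
    df["Post"].str.extractall(r"(#\w+)")[0]
    .rename("hashtag")
    .reset_index(level="match", drop=True)
)

# Attach the platform of the originating row, then count each pair.
counts = (
    df[["Platform"]].join(tags, how="inner")
    .groupby(["Platform", "hashtag"])
    .size()
    .reset_index(name="count")
)
print(counts)
```

Joining on the index rather than assigning a new column means posts with several hashtags simply contribute several rows, and posts without any hashtag drop out of the count.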

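【Discussion】:

This is a good starting point. The number of rows in df2 grows, because it creates multiple rows for posts that contain more than one hashtag. It would be great to create a wide data frame instead of a long one.
Agreed, but don't you think a wide transformation would increase the complexity manifold?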
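
One possible reading of the "wide" request is to keep one row per post and only expand the hashtags when counting. The sketch below assumes that reading; the sample data is modified so that the first post carries two hashtags, and the names `exploded` and `wide` are illustrative.

```python
import pandas as pd

# Sample data based on the question, with two hashtags in the first post.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Post": ["Something #hashtag1 #hashtag2", "Something #hashtag2", "Something #hashtag1"],
    "Platform": ["Twitter", "Insta", "Twitter"],
})

# Keep one row per post: each cell holds the list of hashtags found in it.
df["hashtags"] = df["Post"].str.findall(r"#\w+")

# Expand the lists only for counting; posts without hashtags become NaN and are dropped.
exploded = df.explode("hashtags").dropna(subset=["hashtags"])

# Wide table: one row per platform, one column per hashtag.
wide = pd.crosstab(exploded["Platform"], exploded["hashtags"])
print(wide)
```

Here `df` keeps its original number of rows, while `wide` holds the per-platform counts with one column per hashtag. Whether this is simpler than the long format probably depends on how many distinct hashtags there are.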