[Posted]: 2021-11-28 19:23:02
[Question]:
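I am trying to learn pandas by analyzing the Stack Overflow Developer Survey 2021 data. The data is available at https://insights.stackoverflow.com/survey.
My goal is to retrieve the 5 most commonly used languages for each developer type.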
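For reference, a sample DevType value looks like this:
Developer, desktop or enterprise applications;Developer, full-stack;Developer, back-end
and a sample LanguageHaveWorkedWith value looks like this:
C++;HTML/CSS;JavaScript;Objective-C;PHP;Swift
So within each cell, the languages and the developer types are separated by semicolons (;).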
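To make the format concrete, here is a minimal sketch of splitting such cells. It assumes the two columns have been renamed to dev_type and languages (the names used in the code further down), and the one-row frame `sample` is purely illustrative. Note that developer types contain commas themselves, so only the semicolon can be used as the delimiter.

```python
import pandas as pd

# Hypothetical one-row frame built from the sample values quoted in the question.
sample = pd.DataFrame({
    "dev_type": [
        "Developer, desktop or enterprise applications;"
        "Developer, full-stack;Developer, back-end"
    ],
    "languages": ["C++;HTML/CSS;JavaScript;Objective-C;PHP;Swift"],
})

# str.split turns each semicolon-delimited cell into a Python list.
dev_types = sample["dev_type"].str.split(";")
languages = sample["languages"].str.split(";")

print(dev_types.iloc[0])
print(languages.iloc[0])
```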
I managed to reach that goal with a loop. Here is a sample of the output I want:
...
Developer, back-end
    JavaScript: 72.66%
    SQL: 58.62%
    HTML/CSS: 56.65%
    C#: 46.8%
    Python: 40.15%
Developer, front-end
    JavaScript: 92.19%
    HTML/CSS: 78.91%
    SQL: 54.3%
    TypeScript: 51.17%
    Node.js: 50.39%
...
Code:
from collections import Counter

# merged_df is my DataFrame with one row per respondent (not shown here).
dev_type_info = {}
for index, row in merged_df.iterrows():
    dev_types = row['dev_type'].split(';')
    for dev_type in dev_types:
        # Create the per-type record the first time this developer type is seen.
        dev_type_info.setdefault(dev_type, {
            'total': 0,
            'language_counter': Counter()
        })
        languages = row['languages'].split(';')
        dev_type_info[dev_type]['language_counter'].update(languages)
        dev_type_info[dev_type]['total'] += 1

# Print the 5 most common languages per type as a share of that type's respondents.
for dev_type, info in dev_type_info.items():
    print(dev_type)
    for language, value in info['language_counter'].most_common(5):
        language_pct = (value / info['total']) * 100
        language_pct = round(language_pct, 2)
        print(f'\t{language}: {language_pct}%')
Is there a way to do this with pandas, or with a shorter implementation, instead of a loop like the one above?
[Discussion]:
- Something like df['dev_type'].str.split(';').explode().groupby(level=0).value_counts(), or df['dev_type'].str.get_dummies(';').
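The explode and get_dummies hints in the comment can be combined into something like the sketch below. This is one possible approach, not a confirmed answer. The three-row merged_df is a made-up stand-in for the real survey frame, and the names language_flags, per_type, pct and top5 are illustrative only.

```python
import pandas as pd

# Toy stand-in for merged_df (hypothetical rows, not survey data).
merged_df = pd.DataFrame({
    "dev_type": [
        "Developer, back-end;Developer, full-stack",
        "Developer, back-end",
        "Developer, front-end",
    ],
    "languages": ["Python;SQL", "SQL;C#", "JavaScript;HTML/CSS"],
})

# One indicator column per language: 1 if the respondent listed it, else 0.
language_flags = merged_df["languages"].str.get_dummies(sep=";")

# One row per (respondent, developer type). explode keeps the original index,
# so the join repeats a respondent's flags once for each of their types.
dev_types = merged_df["dev_type"].str.split(";").explode()
per_type = language_flags.join(dev_types)

# The mean of a 0/1 column is the share of that type's respondents using the language.
pct = per_type.groupby("dev_type").mean().mul(100).round(2)

# Keep up to five non-zero shares per developer type, largest first.
top5 = {
    dev_type: shares[shares > 0].nlargest(5)
    for dev_type, shares in pct.iterrows()
}

for dev_type, shares in top5.items():
    print(dev_type)
    for language, share in shares.items():
        print(f"\t{language}: {share}%")
```

One difference from the loop version: get_dummies flags a language at most once per respondent, whereas Counter.update would count a language twice if it were repeated within a single cell.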
Tags: python pandas data-analysis