【Posted】: 2020-03-06 16:19:09
【Question】:
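I need to classify/categorize the individual sentences in the job_experience section of n=630 job descriptions. I am particularly interested in extracting sentences about work experience and abilities, but each sentence must stay attached to the job_title it belongs to.
Current state of these job descriptions: the same requirement is phrased in many different ways (e.g., "Microsoft Office skills required." "Experience with Microsoft Word, PowerPoint." "At least 3 years of experience in a related field." "Minimum of three years of experience in a similar position.").
In the future we will need to condense these statements so that, for example, the same statement can be applied to multiple jobs and managers can pick work experience statements from a drop-down list.
So I would like to categorize the individual sentences, so that we can start condensing them and decide which statements to use going forward.
I have been researching what to do and would appreciate any advice on which approach works best. I am familiar with R, but I mostly use it for data wrangling and visualization. LDA, k-means text clustering, feature identification... these are the things I have come across in my research (scikit-learn.org), mostly applied in Python.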
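- Is Python best suited for this kind of task? Could I use R instead?
- Which algorithmic approach is most approachable for a beginner?
- I know this isn't magic; I'm just looking for the best way to go about this task.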
My data looks like this:
df <- data.frame(
  job_title = c(
    "Recruiter", "Recruiter", "Recruiter", "Recruiter",
    "File Clerk", "File Clerk",
    "Learning & Org. Development Specialist", "Learning & Org. Development Specialist",
    "Learning & Org. Development Specialist", "Learning & Org. Development Specialist",
    "CNA", "CNA", "CNA"
  ),
  job_experience = c(
    "Minimum 1 year experience in recruitment or related human resources function.",
    "Proficient in Microsoft Office Applications.",
    "High school diploma required.",
    "Bachelors Degree in Human Resources or related field preferred.",
    "High School diploma preferred.",
    "Ability to use relevant computer systems.",
    "Bachelors Degree in related field (e.g., Human Resources, Education, Organizational Development).",
    "Minimum 2 years experience applying L&OD principles and practices in an organizational setting.",
    "Previous work experience in Human Resources preferred.",
    "Experience with a learning management system (LMS).",
    "High school diploma or GED equivalent.",
    "Certified Nursing Assistant, certified by the Virginia Board of Health Professions.",
    "CPR certification required at date of hire."
  )
)
My goal is to end up with a dataset like this (new column = job_exp_category):
job_title    job_experience                                 job_exp_category
"Recruiter"  "Minimum 1 year experience in recruitment..."  "Work experience"
"Recruiter"  "Proficient in Microsoft Office Applicati..."  "Skill/Ability"
"Recruiter"  "High school diploma required."                "Degree"
...          ...                                            ...
"CNA"        "Certified Nursing Assistant, certified b..."  "Certification/License"
"CNA"        "CPR certification required at date of hire."  "Certification/License"
Thanks in advance to the SO community for any insights.
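To make the target column concrete, here is a rough rule-based sketch in Python (standard library only) of the kind of labeling I have in mind. The keyword patterns, the rule order, the "Other" fallback, and the `categorize` helper are all my own assumptions, guessed from the sample rows; this is a naive baseline for illustration, not a tested method, and presumably a clustering or supervised approach would replace the hand-written rules.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
# The patterns are guesses based only on the sample rows in this question.
RULES = [
    ("Certification/License", re.compile(r"\b(certifi\w*|licen[sc]\w*)\b", re.I)),
    ("Degree", re.compile(r"\b(diploma|degree|ged)\b", re.I)),
    ("Work experience", re.compile(r"\b(years?\s+(of\s+)?experience|work experience)\b", re.I)),
    ("Skill/Ability", re.compile(r"\b(proficien\w*|ability|skills?|experience (with|using))\b", re.I)),
]


def categorize(sentence):
    """Return the label of the first rule whose pattern matches, else 'Other'."""
    for label, pattern in RULES:
        if pattern.search(sentence):
            return label
    return "Other"


# A few (job_title, job_experience) pairs copied from the data above.
rows = [
    ("Recruiter", "Minimum 1 year experience in recruitment or related human resources function."),
    ("Recruiter", "Proficient in Microsoft Office Applications."),
    ("Recruiter", "High school diploma required."),
    ("CNA", "Certified Nursing Assistant, certified by the Virginia Board of Health Professions."),
    ("CNA", "CPR certification required at date of hire."),
]

# Keep each sentence attached to its job title while adding the new column.
labeled = [(title, text, categorize(text)) for title, text in rows]

for title, text, category in labeled:
    print(f"{title} | {text} | {category}")
```

Because the rules are checked in order, a sentence that mentions both a certification and experience would be labeled "Certification/License"; whether that ordering is right for the full 630 descriptions is exactly the kind of thing I am unsure about.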
【Comments】:
Tags: python r nlp text-classification topic-modeling