【Question title】: Split list created by re.findall to single words, then count occurrence of each word sorted descending by number of occurrences
【Posted】: 2019-02-02 11:45:19
【Question】:

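I need to count how many times each word occurs across all the elements of a list created by re.findall.

For example: jobs = ["Java Developer","Data Scientist","Business Architect Process Mining","JavaScript Developer"]

jobs_split = ["Java","Developer","Data","Scientist","Business","Architect", "Process","Mining","JavaScript","Developer"]

Then I want to count the occurrences of each word and display them, e.g. in a file as Word: number of occurrences.

I know I can use Python's built-in Counter, but I don't know how to split all the elements of the list.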

import re
from collections import Counter
from urllib.request import urlopen, Request

jobs = []
jobs_split = []

for i in range(10):
    # Request result page i with a browser-like User-Agent
    html = Request("https://mysite?pn={}".format(i), headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(html).read().decode('utf-8')

    # Collect every job title found in the page's JobPosting data
    jobs += re.findall(r'"@type":"JobPosting","title":"([A-Za-z0-9 -/]+)","description"', page)

my_set = set(jobs)
# print(Counter(my_set))
print(my_set)

【Comments】:

  • Can you add the expected output?
  • Developer: 2, Java: 1, Data: 1, Scientist: 1, Business: 1, Architect: 1, Process: 1, Mining: 1, JavaScript: 1

Tags: python regex parsing


【Solution 1】:

You can use itertools.chain to join all the words into a single iterable and pass it to Counter:

from collections import Counter
from itertools import chain

jobs = ["Java Developer","Data Scientist","Business Architect Process Mining","JavaScript Developer"]

tokens = chain.from_iterable(job.split() for job in jobs)
counts = Counter(tokens)

print(counts)

Output

Counter({'Developer': 2, 'JavaScript': 1, 'Architect': 1, 'Process': 1, 'Mining': 1, 'Business': 1, 'Scientist': 1, 'Java': 1, 'Data': 1})
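The question also asks for the counts in descending order, written out as Word: number of occurrences. A minimal sketch of that step, assuming the same sample jobs list; the file name word_counts.txt is a made-up placeholder, not something from the question:

```python
from collections import Counter
from itertools import chain

jobs = ["Java Developer", "Data Scientist",
        "Business Architect Process Mining", "JavaScript Developer"]

counts = Counter(chain.from_iterable(job.split() for job in jobs))

# most_common() yields (word, count) pairs ordered from highest count to lowest
lines = ["{}: {}".format(word, n) for word, n in counts.most_common()]
report = "\n".join(lines)
print(report)

# "word_counts.txt" is a hypothetical output path chosen for this example
with open("word_counts.txt", "w", encoding="utf-8") as fh:
    fh.write(report + "\n")
```

The first line of the report is Developer: 2; the remaining words all have a count of 1.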

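【Comments】:

  • This is exactly what I needed! Thanks!

【Solution 2】:

It is as simple as using .split(), which splits each string on whitespace such as the space " ".

You do have to loop over your list, though: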

jobs = ["Java Developer","Data Scientist","Business Architect Process Mining","JavaScript Developer"]

split = [ job.split() for job in jobs ]
jobs_split = [item for sublist in split for item in sublist]
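To get from the flattened list to the requested counts, jobs_split can be passed straight to Counter. The sketch below is one possible way to finish this approach, not part of the original answer; the alphabetical tie-break among words with equal counts is an assumption added here to make the order deterministic:

```python
from collections import Counter

jobs = ["Java Developer", "Data Scientist",
        "Business Architect Process Mining", "JavaScript Developer"]

# Flatten: split every job title and collect the words in one list
jobs_split = [word for job in jobs for word in job.split()]

counts = Counter(jobs_split)

# Sort by count descending; ties are broken alphabetically (an added assumption)
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

for word, n in ranked:
    print("{}: {}".format(word, n))
```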

【Comments】:
