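[Posted]: 2017-08-05 14:58:52
[Question]:
The goal is to count how many times given bigrams occur in a string.
In other words, how can I count occurrences of several substrings within a larger string?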
# Sample data with text
import pandas as pd

hi = {1: "My name is Lance John",
2: "Am working at Savings Limited in Germany",
3: "Have invested in mutual funds",
4: "Savings Limited accepts mutual funds as investment option",
5: "Savings Limited also accepts other investment option"}
# wrap in list(): on Python 3, dict.items() returns a view, which older pandas versions reject
hi = pd.DataFrame(list(hi.items()), columns=['id', 'notes'])
# have two categories with pre-defined words
name = ['Lance John', 'Germany']
finance = ['Savings Limited', 'investment option', 'mutual funds']
# want count of bigrams in each category for each record
# the output should look like this
ID name finance
1 1 0
2 1 2
3 0 1
4 0 3
5 0 2
[Comments]:
-
I know about string.count(substring), but what is the best way to search for multiple words in each row?
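One way to extend string.count(substring) to several phrases is to sum the counts per category. The following is a minimal sketch using only the standard library, with the notes and word lists copied from the question; the names `notes`, `categories`, `count_phrases` and `counts` are made up for illustration. Under plain substring counting, record 2 gives finance = 1, because "Savings Limited" is the only finance phrase in that note, whereas the expected table in the question lists 2.

```python
# Notes and category word lists copied from the question.
notes = {
    1: "My name is Lance John",
    2: "Am working at Savings Limited in Germany",
    3: "Have invested in mutual funds",
    4: "Savings Limited accepts mutual funds as investment option",
    5: "Savings Limited also accepts other investment option",
}
categories = {
    "name": ["Lance John", "Germany"],
    "finance": ["Savings Limited", "investment option", "mutual funds"],
}


def count_phrases(text, phrases):
    """Sum the non-overlapping, case-sensitive occurrences of each phrase."""
    return sum(text.count(phrase) for phrase in phrases)


# One row of counts per note id, one entry per category.
counts = {
    note_id: {
        category: count_phrases(text, phrases)
        for category, phrases in categories.items()
    }
    for note_id, text in notes.items()
}

for note_id, row in counts.items():
    print(note_id, row["name"], row["finance"])
```

With the DataFrame from the question, the same helper could probably be applied per row, for example hi['notes'].apply(lambda t: count_phrases(t, finance)). Note that str.count matches raw substrings, so "Germany" would also be counted inside a longer word such as "Germanys".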
-
Regex is the best fit in this case.
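The comment does not say which pattern it has in mind. As a sketch of one possible reading, each category's phrases can be escaped, joined into a single alternation, and matched with word boundaries so that a phrase embedded in a longer word is not counted. The helper names `build_pattern` and `count_matches` are hypothetical, and the word-boundary behaviour is an assumption about what the asker wants.

```python
import re


def build_pattern(phrases):
    """Compile one pattern that matches any of the given phrases as whole words."""
    # Longest phrases first, so a longer phrase wins over a shorter prefix of it.
    ordered = sorted(phrases, key=len, reverse=True)
    alternatives = "|".join(re.escape(phrase) for phrase in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")


def count_matches(pattern, text):
    """Count the non-overlapping matches of the pattern in text."""
    return len(pattern.findall(text))


# Category word lists copied from the question.
name_pattern = build_pattern(["Lance John", "Germany"])
finance_pattern = build_pattern(
    ["Savings Limited", "investment option", "mutual funds"]
)

text = "Savings Limited accepts mutual funds as investment option"
print(count_matches(name_pattern, text), count_matches(finance_pattern, text))
```

A single compiled pattern scans each note once per category instead of once per phrase. If case should be ignored, re.IGNORECASE can be passed as a second argument to re.compile.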
Tags: python string text-mining n-gram