Counting bigrams from user input in Python 3? Answers

【Question title】: Counting bigrams from user input in Python 3?
【Posted】: 2017-08-11 14:52:51
【Question】:

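I'm stuck and need some guidance. I'm teaching myself Python with Grok Learning. Below are the problem, the sample output, and where I've got to with my code. I'd appreciate any hints that help me solve this.

In linguistics, a bigram is a pair of adjacent words in a sentence. The sentence "The big red ball." has three bigrams: The big, big red, and red ball.

Write a program that reads multiple lines of input from the user, where each line is a space-separated sentence of words. Your program should then count how many times each bigram occurs across all input sentences. The bigrams should be treated in a case-insensitive manner by converting the input lines to lowercase. Once the user stops entering input, your program should print out each of the bigrams that appear more than once, along with their corresponding frequencies. For example: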
Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line: 
near the: 2
red ball: 2
the big: 3
big red: 3

My code hasn't got very far and I'm really stuck. But here is where I am:

words = set()
line = input("Line: ")
while line != '':
  words.add(line)
  line = input("Line: ")

Am I going about this the right way? I'm trying not to import any modules and to use only built-in features.

Thanks, Jeff

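【Comments】:

Hi @Jeff, when working on problems like this, think about them without regard to the actual code. Try describing them in plain English: step 1, read the input lines; step 2, split the lines into bigrams; step 3, count the bigrams. Until you have an outline of what you need to do, it is hard to code. Your first piece of code almost completes step 1, reading the input. See inspectorG4dget's answer for step 1.
Well, I'm further along than I have been for the past few days. Progress is progress :) Thanks!!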

Tags: python python-3.x

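【Solution 1】:

Let's start with a function that takes a sentence (with punctuation) and returns a list of all the lowercase bigrams found in it.

So, we first need to strip all non-alphanumerics from the sentence, convert all letters to their lowercase counterparts, and then split the sentence on whitespace into a list of words: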

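import re

def bigrams(sentence):
    # Replace every non-word character with a space; the raw string avoids an invalid-escape warning.
    text = re.sub(r'\W', ' ', sentence.lower())
    words = text.split()
    # zip is lazy in Python 3, so wrap it in list() to return an actual list of pairs.
    return list(zip(words, words[1:]))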

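We use the standard (built-in) re package for the regex-based replacement of non-alphanumerics with spaces, and the built-in zip function to pair up consecutive words. (We pair the list of words with the same list shifted by one element.)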
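To see the shifted-list pairing on its own, without the regex step, here is a minimal sketch. The helper name `pair_adjacent` is hypothetical and is not part of the answer's code.

```python
def pair_adjacent(words):
    """Pair each word with the word that follows it."""
    # zip stops at the shorter input, so the last word never starts a pair.
    return list(zip(words, words[1:]))

words = "the big red ball".split()
print(pair_adjacent(words))
```

A list with fewer than two words simply produces no pairs, so no special case is needed for short or empty lines.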

Now we can test it:

>>> bigrams("The big red ball")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> bigrams("THE big, red, ball.")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> bigrams(" THE  big,red,ball!!?")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]

Next, to count the bigrams found in each sentence, you can use collections.Counter.

For example, like this:

from collections import Counter

counts = Counter()
for line in ["The big red ball", "The big red ball is near the big red box", "I am near the box"]:
    counts.update(bigrams(line))

We get:

>>> counts
Counter({('the', 'big'): 3, ('big', 'red'): 3, ('red', 'ball'): 2, ('near', 'the'): 2, ('red', 'box'): 1, ('i', 'am'): 1, ('the', 'box'): 1, ('ball', 'is'): 1, ('am', 'near'): 1, ('is', 'near'): 1})

Now we only need to print the ones that occur more than once:

for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))

Putting it all together, with a loop that reads the user's input instead of a fixed list:

import re
from collections import Counter

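def bigrams(sentence):
    # Replace every non-word character with a space; the raw string avoids an invalid-escape warning.
    text = re.sub(r'\W', ' ', sentence.lower())
    words = text.split()
    # Pair each word with its successor.
    return list(zip(words, words[1:]))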

counts = Counter()
while True:
    line = input("Line: ")
    if not line:
        break
    counts.update(bigrams(line))

for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))

Output:

Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line: 
near the: 2
red ball: 2
big red: 3
the big: 3

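【Comments】:

Would +1 again for making maximal use of the included batteries.
The only problem is that Grok Learning doesn't like importing modules such as re. They want me to learn by using the built-in features. Thank you though, I'll do what I can with this.
@JeffSingleton, re is only used to strip the punctuation, so you can skip that part if it isn't needed (although I wanted to show in the answer that preprocessing matters too). Counter can also be (re)implemented easily, but the beauty of Python is really in having all these nice parts ready for you to use. That's why Python's motto is "batteries included".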
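As an illustration of the point above, here is a rough sketch of the same approach with no imports at all. The helper names `clean_words` and `count_bigrams` are made up for this example. It assumes that treating every character that is not a letter or digit as a space is acceptable, which is close to, but not identical to, the `\W` pattern (that pattern keeps underscores).

```python
def clean_words(line):
    """Lowercase a line and split it into words, ignoring punctuation."""
    # Replace anything that is not a letter or digit with a space, then split.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in line.lower())
    return cleaned.split()

def count_bigrams(lines):
    """Tally adjacent word pairs in a plain dict, one line at a time."""
    counts = {}
    for line in lines:
        words = clean_words(line)
        for pair in zip(words, words[1:]):
            # dict.get supplies 0 the first time a pair is seen.
            counts[pair] = counts.get(pair, 0) + 1
    return counts

lines = ["The big red ball", "The big red ball is near the big red box", "I am near the box"]
counts = count_bigrams(lines)
for (first, second), n in counts.items():
    if n > 1:
        print(first, second + ":", n)
```

Because each line is processed separately, no bigram is formed from the last word of one sentence and the first word of the next.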

【Solution 2】:

counts = {}
while True:
    line = input("Line: ").strip().lower()
    if not line:
        break
    words = line.split()
    # Pair each word with its successor so the bigrams overlap
    # and never span two separate input lines.
    for t in zip(words, words[1:]):
        counts[t] = counts.get(t, 0) + 1
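The snippet above stops once the tallies are built. A possible reporting step is sketched below, assuming `counts` maps two-word tuples to integers. The helper `format_repeated` and the sample tallies are hypothetical and are not part of the answer.

```python
def format_repeated(counts):
    """Return 'first second: n' strings for bigrams seen more than once."""
    return ["{} {}: {}".format(a, b, n) for (a, b), n in counts.items() if n > 1]

# Sample tallies for illustration only.
counts = {("the", "big"): 3, ("red", "box"): 1, ("near", "the"): 2}
for row in format_repeated(counts):
    print(row)
```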

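【Comments】:

Thanks @inspectorG4dget. I asked for guidance and that is what you gave. I'm still working towards a solution, but this helped me get past where I was stuck.

【Solution 3】: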

usr_input = "Here is a sentence without multiple bigrams. Without multiple bigrams, we cannot test a sentence."

def get_bigrams(word_string):
    words = [word.lower().strip(',.') for word in word_string.split(" ")]
    pairs = ["{} {}".format(w, words[i+1]) for i, w in enumerate(words) if i < len(words) - 1]
    bigrams = {}

    for bg in pairs:
        if bg not in bigrams:
            bigrams[bg] = 0
        bigrams[bg] += 1
    return bigrams

print(get_bigrams(usr_input))

【Comments】:

【Solution 4】:

Using only what has been taught in the previous modules of the Grok Learning Python course the OP mentioned, this code does what is required:

counts = {} # this creates a dictionary for the bigrams and the tally for each one
n = 2
a = input('Line: ').lower().split() # the input is converted into lowercase, then split into a list
while a:
  for x in range(n, len(a)+1):
    b = tuple(a[x-2:x]) # the input gets sliced into pairs of two words (bigrams)
    counts[b] = counts.get(b,0) + 1 # adding the bigrams as keys to the dictionary, with their count value set to 1 initially, then increased by 1 thereafter
  a = input('Line: ').lower().split()  
for c in counts:
  if counts[c] > 1: # tests if the bigram occurs more than once
    print(' '.join(c) + ':', counts[c]) # prints the bigram (making sure to convert the key from a tuple into a string), with the count next to it

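Note: you may need to scroll right to see the comments on the code in full.

It is quite simple and doesn't need any imports. I realise I'm late to the conversation, but hopefully anyone else working through the same course or hitting a similar problem will find this answer helpful.

【Comments】: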