【Question Title】: How to calculate the number of characters in each sentence of a text file?
【Posted】: 2019-07-09 23:25:58
【Question Description】:

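I want to split a text into sentences and then print the number of characters in each sentence, but my program does not count the characters in each sentence.

I tried tokenizing the file supplied by the user into sentences, looping over the sentences, and printing the number of characters in each one. The code I tried is: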

from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize,wordpunct_tokenize
import re
import os
import sys
from pathlib import Path

while True:
    try:
        file_to_open = Path(input("\nYOU SELECTED OPTION 8: CALCULATE SENTENCE LENGTH. Please, insert your file path: "))
        with open(file_to_open,'r', encoding="utf-8") as f:
            words = sent_tokenize(f.read())
            break
    except FileNotFoundError:
        print("\nFile not found. Better try again")
    except IsADirectoryError:
        print("\nIncorrect Directory path.Try again")


print('\n\n This file contains',len(words),'sentences in total')



wordcounts = []
caracter_count=0
sent_number=1
with open(file_to_open) as f:
    text = f.read()
    sentences = sent_tokenize(text)
    for sentence in sentences:
        if sentence.isspace() !=True:
            caracter_count = caracter_count + 1
            print("Sentence", sent_number, 'contains', caracter_count, 'characters')
            sent_number +=1
            caracter_count = caracter_count + 1

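I would like to print something like:

"Sentence 1 has 35 characters" "Sentence 2 has 45 characters"

and so on...

The output I get from this program is:

This file contains 4 sentences in total
"Sentence 1 contains 0 characters"
"Sentence 2 contains 1 characters"
"Sentence 3 contains 2 characters"
"Sentence 4 contains 3 characters"

Can someone help me with this?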

【Question Discussion】:

    Tags: python character nltk


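    【Solution 1】:

    You are incrementing caracter_count once per sentence instead of counting the characters in the sentence. Use len(sentence) for that. I think changing your for loop to: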

    sentence_number = 1
    for sentence in sentences:
        if not sentence.isspace():
            print("Sentence {} contains {} characters".format(sentence_number, len(sentence)))
            sentence_number += 1
    

    will work correctly.
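To make the fix above concrete, here is a minimal sketch that works on an already tokenized list of sentences, so it runs without NLTK. The helper name `sentence_char_counts` and the sample sentences are hypothetical and not part of the original post; in the asker's program the list would come from `sent_tokenize(text)`.

```python
def sentence_char_counts(sentences):
    """Return (sentence_number, character_count) pairs, skipping whitespace-only items."""
    counts = []
    sentence_number = 1
    for sentence in sentences:
        if not sentence.isspace():
            # len() on a str counts every character, including spaces and punctuation.
            counts.append((sentence_number, len(sentence)))
            sentence_number += 1
    return counts


# Hypothetical sample input; in the real program use sent_tokenize(text) instead.
sample = ["Hello world.", "   ", "How are you?"]
for number, n_chars in sentence_char_counts(sample):
    print("Sentence {} contains {} characters".format(number, n_chars))
```

Returning the pairs instead of printing inside the helper keeps the counting logic easy to check separately from the output format.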

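    【Discussion】:

      【Solution 2】:

      Your question looks interesting, and it has a simple solution. Remember to run the command "nltk.download('punkt')" the first time; after that first run you can simply comment it out.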

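      import nltk
      # nltk.download('punkt')  # needed once; comment it out after the first run
      from nltk.tokenize import sent_tokenize

      def count_lines(file):
          count = 0
          # Read the whole file; the with-block closes it automatically.
          with open(file, "r", encoding="utf-8") as myfile:
              string = myfile.read()
          print(string)

          # Split the text into a list of sentences.
          sentences = sent_tokenize(string)

          for w in sentences:
              count += 1
              # len(w) is the number of characters in the sentence,
              # even though the label printed here says "words".
              print("Sentence ", count, "has ", len(w), "words")

      # Raw string so the backslashes in the Windows path are not read as escapes.
      count_lines(r"D:\Atharva\demo.txt")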
      

      Output:

      What is Python language?Python is a widely used high-level, general-purpose, interpreted, dynamic programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than possible in languages such as C++ or Java. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles. It features a dynamic type system and automatic memory management and has a large and comprehensive standard library. The best way we learn anything is by practice and exercise questions. We  have started this section for those (beginner to intermediate) who are familiar with Python.  
      Sentence  1 has  119 words
      Sentence  2 has  175 words
      Sentence  3 has  134 words
      Sentence  4 has  117 words
      Sentence  5 has  69 words
      Sentence  6 has  95 words
      

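      【Discussion】:

      • Thanks for your solution, but I need the number of characters in each sentence, not the number of words.
      • Actually, it is counting characters. len() returns the length of whatever value it is given. Once the program has split the paragraph into sentences, len() is applied to each sentence string, so the number printed next to "words" is the character count.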
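The difference discussed in the comments above can be sketched as follows. `len()` on a string counts characters, while a word count needs the sentence split into tokens first. This sketch uses `str.split()` as a simple whitespace tokenizer, which is an assumption made to keep it dependency-free; NLTK's `word_tokenize` treats punctuation as separate tokens and so can give a different number. The helper name `char_and_word_counts` is hypothetical.

```python
def char_and_word_counts(sentence):
    """Return (characters, whitespace-separated words) for one sentence."""
    n_chars = len(sentence)          # every character, spaces and punctuation included
    n_words = len(sentence.split())  # tokens separated by runs of whitespace
    return n_chars, n_words


# First sentence of the sample text from the answer above.
sentence = "What is Python language?"
n_chars, n_words = char_and_word_counts(sentence)
print("characters:", n_chars, "words:", n_words)
```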