【Question title】: Make a text file by repeating words based on frequency
【Posted】: 2016-03-11 09:22:48
【Question】:

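I know this question may not fit Stack Overflow's question standards, but I have been practicing coding for a few months to parse and analyze text, having never programmed before, and I have had help from this forum.

I ran a frequency analysis on multiple XML files and stored the results in a MySQL database. [word count]

I want to make a text file by repeating each word according to its frequency (e.g. breakfast, 6 => breakfast breakfast breakfast breakfast breakfast breakfast), with one space between repeated words, and with the words ordered from lowest frequency (the beginning of the text) to highest ('a' or 'the' would be the most frequent segment and would come in the last part of the text).

I would appreciate any ideas, libraries, or coding examples. Thank you.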

import re
from collections import Counter
from xml.etree import ElementTree

import MySQLdb as mdb
import requests




### MYSQL ###

db = mdb.connect(host="****", user="****", passwd="****", db="****")

cursor = db.cursor()
sql = "DROP TABLE IF EXISTS Table1"
cursor.execute(sql)
db.commit()
sql = "CREATE TABLE Table1(Id INT PRIMARY KEY AUTO_INCREMENT, keyword TEXT, frequency INT)"
cursor.execute(sql)
db.commit()



## XML PARSING
def main(n=1000):

    # A list of feeds to process and their xpath


    feeds = [
        {'url': 'http://www.nyartbeat.com/list/event_type_print_painting.en.xml', 'xpath': './/Description'},
        {'url': 'http://feeds.feedburner.com/FriezeMagazineUniversal?format=xml', 'xpath': './/description'},
        {'url': 'http://www.artandeducation.net/category/announcement/feed/', 'xpath': './/description'},
        {'url': 'http://www.blouinartinfo.com/rss/visual-arts.xml', 'xpath': './/description'},
        {'url': 'http://feeds.feedburner.com/ContemporaryArtDaily?format=xml', 'xpath': './/description'}
    ]



    # A place to hold all feed results
    results = []

    # Loop all the feeds
    for feed in feeds:
        # Append feed results together
        results = results + process(feed['url'], feed['xpath'])

    # Join all results into one big string; a space separator keeps words from
    # adjacent results apart once punctuation is stripped below
    contents = " ".join(map(str, results))

    # Collapse runs of whitespace into a single space
    contents = re.sub(r'\s+', ' ', contents)

    # Remove everything that is not a letter or a space
    contents = re.sub('[^A-Za-z ]+', '', contents)

    # Create a list of lower-case words (the length check keeps every non-empty word)
    words = [w.lower() for w in contents.split() if len(w) >= 1]


    # Count the words
    word_count = Counter(words)

    # Clean the content a little
    filter_words = ['art', 'artist', 'artist']
    for word in filter_words:
        if word in word_count:
            del word_count[word]



    # Add to DB
    for word, count in word_count.most_common(n):
        sql = """INSERT INTO Table1 (keyword, frequency) VALUES(%s, %s)"""
        cursor.execute(sql, (word, count))
        db.commit()

def process(url, xpath):
    """
    Downloads a feed url and extracts the results with a variable path
    :param url: string
    :param xpath: string
    :return: list
    """
    contents = requests.get(url)
    root = ElementTree.fromstring(contents.content)
    return [element.text.encode('utf8') if element.text is not None else '' for element in root.findall(xpath)]





if __name__ == "__main__":
    main()

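【Comments】:

  • I'm voting to close this question because this is neither a code-writing nor a tutorial service.
  • I will add the code I have already used to parse and analyze the XML files.

Tags: python parsing text frequency word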


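【Solution 1】:

Assuming word_count is the Counter from your code, word_count.most_common(n) in your for loop returns a list of (word, count) tuples, ordered from the most frequent word to the least frequent.

Let's store it in a variable:

words = word_count.most_common(n)
# Ex: [('a', 5), ('the', 4), ('apples', 2)]

Use itemgetter to sort by count, lowest first: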

from operator import itemgetter
words = sorted(words, key = itemgetter(1))
# words = [('apples', 2), ('the', 4), ('a', 5)]
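
One detail the sort above leaves open is what happens when two words share the same count. Python's sorted is stable, so tied words simply keep the relative order they had in the input. If a fully deterministic order is wanted, a minimal sketch is to sort on the pair (count, word) instead. The sample Counter below is made up for illustration and is not data from the question.

```python
from collections import Counter

# Hypothetical sample counts; "pear" and "apples" are tied on purpose
word_count = Counter({"a": 5, "the": 4, "pear": 2, "apples": 2})

words = word_count.most_common()

# Sort by count first (ascending), then alphabetically to break ties
words_sorted = sorted(words, key=lambda wc: (wc[1], wc[0]))

print(words_sorted)
```

Whether alphabetical tie-breaking is the right choice depends on the output you want; it is only one reasonable option.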

Now iterate over each entry and append the word to a list, repeated count times:

out = []
for word, count in words:
    out += [word]*count
# out = ['apples', 'apples', 'the', 'the', 'the', 'the', 'a', 'a', 'a', 'a', 'a']

The following line joins the list into one long string:

final = " ".join(out)
# final = "apples apples the the the the a a a a a"
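
As a possible shortcut, the list-building and joining steps can be folded into a single generator expression, which avoids keeping the intermediate out list around. This is only a sketch; the words list below is the sample data from the comments above, hard-coded so the snippet runs on its own.

```python
# Sample (word, count) pairs, already sorted by count in ascending order
words = [("apples", 2), ("the", 4), ("a", 5)]

# Emit each word `count` times and join everything with single spaces
final = " ".join(word for word, count in words for _ in range(count))

print(final)
```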

Now just write it to a file:

with open("filename.txt","w+") as f:
    f.write(final)

The complete code looks like this:

from operator import itemgetter

words = word_count.most_common(n)
words = sorted(words, key = itemgetter(1))

out = []
for word, count in words:
    out += [word]*count

final = " ".join(out)

with open("filename.txt","w+") as f:
    f.write(final)
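
Since the question stores the counts in a database table, the same output could also be produced by reading the rows back in ascending order of frequency and writing each word's repetitions straight to the file. The sketch below is an assumption-laden illustration, not the asker's setup: it uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs anywhere, the helper write_repeated_words and the output file name are hypothetical, and the three sample rows are invented. With MySQLdb the SELECT statement would be the same, issued through cursor.execute.

```python
import os
import sqlite3
import tempfile


def write_repeated_words(rows, path):
    """Write each keyword `frequency` times, separated by single spaces.

    `rows` is any iterable of (keyword, frequency) pairs, already ordered
    the way the words should appear in the file. (Hypothetical helper.)
    """
    with open(path, "w") as f:
        first = True
        for keyword, frequency in rows:
            if frequency <= 0:
                continue  # nothing to write for this word
            chunk = " ".join([keyword] * frequency)
            # Put a space between chunks, but not before the first one
            f.write(chunk if first else " " + chunk)
            first = False


# In-memory SQLite database standing in for the MySQL table (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table1 ("
    "Id INTEGER PRIMARY KEY AUTOINCREMENT, keyword TEXT, frequency INT)"
)
conn.executemany(
    "INSERT INTO Table1 (keyword, frequency) VALUES (?, ?)",
    [("a", 5), ("apples", 2), ("the", 4)],  # invented sample rows
)

# Let the database do the ordering: lowest frequency first
rows = conn.execute(
    "SELECT keyword, frequency FROM Table1 "
    "ORDER BY frequency ASC, keyword ASC"
)

out_path = os.path.join(tempfile.gettempdir(), "repeated_words.txt")
write_repeated_words(rows, out_path)
conn.close()
```

Writing one word's chunk at a time keeps memory use proportional to the largest single count rather than to the whole output, which may matter if the frequencies are large.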

【Comments】:

  • Thank you very much, Electron!