【Posted】: 2016-03-11 09:22:48
【Question】:
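I know this question may not meet Stack Overflow's question standards, but I had never programmed before, and for the past few months I have been working through coding exercises to parse and analyze text, with a lot of help from this forum.
I ran a frequency analysis over several XML files and stored the results in MySQL. [word count]
I would like to produce a text file in which each word is repeated according to its frequency (for example, breakfast, 6 => breakfast breakfast breakfast breakfast breakfast breakfast), with one space between the repeated words. The words should be ordered from the lowest frequency (the beginning of the text) to the highest, so 'a' or 'the', being the most frequent, would end up in the last part of the text.
Please share any ideas, libraries, or code examples. Thank you.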
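One way to read the requirement above is as a pure transformation from word counts to a single string. The following is only a minimal sketch under that reading: the function name `repeat_by_frequency` and the alphabetical tie-break for words with equal counts are assumptions made for illustration, not something stated in the question.

```python
from collections import Counter


def repeat_by_frequency(counts):
    """Return one string in which each word appears `count` times,
    ordered from the least frequent word to the most frequent one."""
    # Sort ascending by count; ties are broken alphabetically (an assumption)
    # so that the output is stable from run to run.
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    # " ".join([word] * count) puts exactly one space between the repeats;
    # words with a non-positive count are skipped.
    chunks = (" ".join([word] * count) for word, count in ordered if count > 0)
    return " ".join(chunks)


if __name__ == "__main__":
    sample = Counter({"the": 4, "breakfast": 2, "gallery": 1})
    print(repeat_by_frequency(sample))
    # prints: gallery breakfast breakfast the the the the
```

Because the function accepts any mapping of word to count, it should work on a `Counter` directly, without a round trip through the database.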
# Only the modules the script actually uses; duplicate and unused imports removed
import re
import requests
import MySQLdb as mdb
from collections import Counter
from xml.etree import ElementTree
### MYSQL ###
db = mdb.connect(host="****", user="****", passwd="****", db="****")
cursor = db.cursor()
sql = "DROP TABLE IF EXISTS Table1"
cursor.execute(sql)
db.commit()
sql = "CREATE TABLE Table1(Id INT PRIMARY KEY AUTO_INCREMENT, keyword TEXT, frequency INT)"
cursor.execute(sql)
db.commit()
## XML PARSING
def main(n=1000):
    # A list of feeds to process and their xpath
    feeds = [
        {'url': 'http://www.nyartbeat.com/list/event_type_print_painting.en.xml', 'xpath': './/Description'},
        {'url': 'http://feeds.feedburner.com/FriezeMagazineUniversal?format=xml', 'xpath': './/description'},
        {'url': 'http://www.artandeducation.net/category/announcement/feed/', 'xpath': './/description'},
        {'url': 'http://www.blouinartinfo.com/rss/visual-arts.xml', 'xpath': './/description'},
        {'url': 'http://feeds.feedburner.com/ContemporaryArtDaily?format=xml', 'xpath': './/description'}
    ]
    # A place to hold all feed results
    results = []
    # Loop over all the feeds and collect their results
    for feed in feeds:
        results = results + process(feed['url'], feed['xpath'])
    # Join all results into one big string; a space separator keeps the last
    # word of one entry from fusing with the first word of the next
    contents = " ".join(map(str, results))
    # Collapse runs of whitespace into a single space
    contents = re.sub(r'\s+', ' ', contents)
    # Remove everything that is not a letter or a space
    contents = re.sub('[^A-Za-z ]+', '', contents)
    # Lower-case every word; len(w) >= 1 keeps all of them
    # (raise the threshold to drop short words)
    words = [w.lower() for w in contents.split() if len(w) >= 1]
    # Count the words
    word_count = Counter(words)
    # Clean the content a little
    filter_words = ['art', 'artist']
    for word in filter_words:
        if word in word_count:
            del word_count[word]
    # Add the n most common words to the DB
    sql = """INSERT INTO Table1 (keyword, frequency) VALUES(%s, %s)"""
    for word, count in word_count.most_common(n):
        cursor.execute(sql, (word, count))
    # Commit once after all rows have been inserted
    db.commit()

def process(url, xpath):
    """
    Downloads a feed url and extracts the results with a variable path
    :param url: string
    :param xpath: string
    :return: list
    """
    contents = requests.get(url)
    root = ElementTree.fromstring(contents.content)
    return [element.text.encode('utf8') if element.text is not None else ''
            for element in root.findall(xpath)]


if __name__ == "__main__":
    main()
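For the output step, the rows already stored in `Table1` could be read back in ascending order of frequency and written out as repeated words. The sketch below is hedged: `write_repeated_words` and the file name `repeated_words.txt` are hypothetical, and the demo uses an in-memory `sqlite3` table as a stand-in so that it runs without a MySQL server. The function itself only relies on the DB-API `execute` and `fetchall` calls, which the `MySQLdb` cursor from the script also provides.

```python
import os
import sqlite3
import tempfile


def write_repeated_words(cursor, path):
    """Write each keyword repeated `frequency` times to `path`, rarest first."""
    # Let the database do the ordering: lowest frequency first.
    cursor.execute(
        "SELECT keyword, frequency FROM Table1 "
        "ORDER BY frequency ASC, keyword ASC"
    )
    rows = cursor.fetchall()
    # One space between repeats and one space between different words.
    chunks = (" ".join([kw] * freq) for kw, freq in rows if freq > 0)
    with open(path, "w") as handle:
        handle.write(" ".join(chunks))


if __name__ == "__main__":
    # Stand-in for the MySQL table (assumption: same column names as Table1).
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE Table1 (keyword TEXT, frequency INTEGER)")
    cur.executemany(
        "INSERT INTO Table1 (keyword, frequency) VALUES (?, ?)",
        [("the", 3), ("breakfast", 2), ("gallery", 1)],
    )
    out_path = os.path.join(tempfile.mkdtemp(), "repeated_words.txt")
    write_repeated_words(cur, out_path)
    with open(out_path) as handle:
        print(handle.read())
```

With the script's own connection, the call would presumably be `write_repeated_words(cursor, "repeated_words.txt")` after `main()` has filled the table.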
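【Comments】:
-
I'm voting to close this question because this is neither a code-writing service nor a tutorial service.
-
I will add the code I have already written to parse and analyze the XML files.
Tags: python parsing text frequency word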