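Python: parse XML, count occurrences of a string, then output to Excel (answers)

【Question title】: Python XML parse and count occurrence of a string then output to Excel
【Posted】: 2015-06-09 12:57:36
【Question】:

So here is my conundrum!

I have 100+ XML files that I need to parse to find a string by tag name (or by regular expression).

Once I find that string/tag value, I need to count how many times it occurs (or find the highest value of that string).

Example: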

<content styleCode="Bold">Value 1</content>
<content styleCode="Bold">Value 2</content>
<content styleCode="Bold">Value 3</content>

<content styleCode="Bold">Another Value 1</content>
<content styleCode="Bold">Another Value 2</content>
<content styleCode="Bold">Another Value 3</content>
<content styleCode="Bold">Another Value 4</content>

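So basically I want to parse the XML, find the tags listed above, and output the highest value found to an Excel spreadsheet. The spreadsheet already has headers, so only the numeric values are written to the Excel file.

So the output in Excel would be:

Value    Another Value
3               4

Each file is written to its own row.

【Question discussion】:

Tags: python regex xml excel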

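【Solution 1】:

I am not sure how your XML files are named. For the simple case, assume they follow this pattern:

file1.xml, file2.xml, ... and that they are stored in the same folder as your Python script.

Then you can use the code below to do the job: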

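import re
import xml.etree.ElementTree as ElementTree
from xlrd import open_workbook
from xlutils.copy import copy

def process():
    rb = open_workbook("names.xls")  # existing workbook that already holds the header row
    wb = copy(rb)                    # xlrd workbooks are read-only, so copy to a writable one
    s = wb.get_sheet(0)
    for i in range(1, 100):  # loop from file1.xml to file99.xml
        resultDict = {}
        root = ElementTree.parse('file%d.xml' % i).getroot()
        for child in root:
            # split e.g. "Another Value 4" into the label and its trailing number
            match = re.match(r'(.*\S)\s+(\d+)\s*$', child.text or '')
            if not match:
                continue  # element without text or without a trailing number
            key, value = match.group(1), int(match.group(2))
            # compare as integers; as strings, '10' would sort below '9'
            resultDict[key] = max(value, resultDict.get(key, value))

        # dicts keep insertion order (Python 3.7+), so the columns follow the
        # order in which the labels first appear in the file
        for index, value in enumerate(resultDict.values()):
            s.write(i, index, value)  # row i: row 0 is the header
    wb.save('names.xls')

if __name__ == '__main__':
    process()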
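
One detail worth checking in code like this is that the extracted numbers are compared as integers rather than as strings, since the string '10' sorts below the string '9'. The following is a minimal, self-contained sketch of the label/number split, not the answerer's exact method. The `<section>` wrapper element, the sample values 9 and 10, and the helper name `max_per_label` are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: tags like the question's, wrapped in an assumed <section> root.
SAMPLE = """<section>
  <content styleCode="Bold">Value 9</content>
  <content styleCode="Bold">Value 10</content>
  <content styleCode="Bold">Another Value 4</content>
</section>"""


def max_per_label(xml_text):
    """Return {label: highest trailing number} over all <content> elements."""
    result = {}
    for elem in ET.fromstring(xml_text).iter("content"):
        # "Another Value 4" -> label "Another Value", number "4"
        match = re.match(r"(.*\S)\s+(\d+)\s*$", elem.text or "")
        if match is None:
            continue  # no trailing number, skip this element
        label, number = match.group(1), int(match.group(2))
        # integer comparison, so 10 beats 9
        result[label] = max(number, result.get(label, number))
    return result


print(max_per_label(SAMPLE))  # {'Value': 10, 'Another Value': 4}
```

If the numbers were kept as strings, the maximum for "Value" in this sample would come out as '9'.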

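【Discussion】:

You may need to install the required packages (xlrd, xlwt, xlutils) first.

【Solution 2】:

So the problem has two main parts: (1) find the maximum pair of values in each file, and (2) write them to an Excel workbook. One thing I always advocate is writing reusable code. Here, you put all the XML files in one folder, then simply run the main method and collect the results.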

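There are several options for writing to Excel. The simplest is to create a tab- or comma-separated (CSV) file and import it into Excel manually. xlwt is a commonly used third-party library. openpyxl is another library that makes the code for creating Excel files simpler and shorter.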
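
For the CSV option mentioned above, a minimal sketch using only the standard library could look like this. The helper name `write_rows_as_csv` and the sample rows are hypothetical. The header matches the one in the question.

```python
import csv
import io


def write_rows_as_csv(rows, fileobj):
    """Write a header plus one (value, another_value) row per XML file."""
    writer = csv.writer(fileobj)
    writer.writerow(["Value", "Another Value"])
    writer.writerows(rows)


# Demo with an in-memory buffer and made-up rows; for a real file, use
# open("result.csv", "w", newline="") as recommended by the csv module docs.
buffer = io.StringIO()
write_rows_as_csv([(3, 4), (7, 2)], buffer)
print(buffer.getvalue())
```

Excel can open the resulting file directly, which avoids any third-party dependency.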

Make sure to import the required libraries and modules at the top of the file.

import re
import os
import openpyxl

While reading the XML files, we use regular expressions to extract the values you want.

# raw strings keep \s and \d from being treated as string escapes
regexPatternValue = r">Value\s+(\d+)</content>"
regexPatternAnotherValue = r">Another Value\s+(\d+)</content>"
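
As a quick check of what these two patterns match, here is a small sketch run against lines taken from the question's example. The variable names `value_line` and `another_line` are introduced for illustration only.

```python
import re

regexPatternValue = r">Value\s+(\d+)</content>"
regexPatternAnotherValue = r">Another Value\s+(\d+)</content>"

value_line = '<content styleCode="Bold">Value 3</content>'
another_line = '<content styleCode="Bold">Another Value 4</content>'

# The leading ">" anchors the label to the start of the tag text, so the
# "Value" pattern does not match inside "Another Value".
print(re.search(regexPatternValue, value_line).group(1))           # 3
print(re.search(regexPatternValue, another_line))                  # None
print(re.search(regexPatternAnotherValue, another_line).group(1))  # 4
```

Note that `group(1)` returns a string, so it still needs `int()` before any numeric comparison.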

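To modularize further, create a method that goes through each line of a given XML file, looks for the regex patterns, extracts all the values, and returns the maximum of them. In the method below, I return a tuple of two elements (Value, Another), which are the highest numbers of each type seen in that file.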

def get_values(filepath):
    values = []
    another = []
    # the with-block closes the file even if an error occurs while reading
    with open(filepath) as xml_file:
        for line in xml_file:
            matchValue = re.search(regexPatternValue, line)
            matchAnother = re.search(regexPatternAnotherValue, line)
            if matchValue:
                values.append(int(matchValue.group(1)))
            if matchAnother:
                another.append(int(matchAnother.group(1)))
    # Highest number in each list; '' covers a file with NO values at all.
    maxVal = max(values, default='')
    maxAnother = max(another, default='')
    return maxVal, maxAnother

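Now keep your XML files in one folder, iterate over them, and extract the regex matches from each one. In the following code, I append the extracted values to a list named writable_lines. Finally, after all files have been parsed, create a workbook and add the extracted values in the required layout.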

def process_folder(folder, output_xls_path):
    # sorted() gives a stable row order; os.listdir returns names in arbitrary order
    files = [os.path.join(folder, f) for f in sorted(os.listdir(folder)) if f.endswith(".xml")]
    writable_lines = []
    writable_lines.append(("Value", "Another Value"))  # Header row in the sheet

    for file in files:
        # keep the numbers as ints so Excel stores them as numbers, not text
        writable_lines.append(get_values(file))

    wb = openpyxl.Workbook()
    sheet = wb.active

    # Excel rows are 1-based: the header goes to row 1, the first file to row 2
    for i in range(len(writable_lines)):
        sheet['A' + str(i+1)].value = writable_lines[i][0]
        sheet['B' + str(i+1)].value = writable_lines[i][1]

    wb.save(output_xls_path)

In the lower for loop, we tell openpyxl to write each value into a specific cell, using typical Excel addresses such as sheet["A3"], sheet["B3"], and so on.

Ready to go...
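
Because each file becomes one row, the order in which files are listed decides the row order. `os.listdir` returns names in arbitrary order, and a plain lexicographic sort places file10.xml before file2.xml. The following sketch shows one possible natural-order sort. The helper names `natural_key` and `list_xml_files` and the demo file names are assumptions, not part of the answer.

```python
import os
import re
import tempfile


def natural_key(name):
    """Turn 'file10.xml' into ['file', 10, '.xml'] so digit runs sort numerically."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def list_xml_files(folder):
    """Return full paths of the .xml files in folder, in natural order."""
    names = [n for n in os.listdir(folder) if n.lower().endswith(".xml")]
    return [os.path.join(folder, n) for n in sorted(names, key=natural_key)]


# Demo with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for name in ("file10.xml", "file2.xml", "file1.xml", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    ordered = [os.path.basename(p) for p in list_xml_files(folder)]

print(ordered)  # ['file1.xml', 'file2.xml', 'file10.xml']
```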

if __name__ == '__main__':
    # openpyxl writes the .xlsx format, so an .xlsx output name may open more cleanly in Excel
    process_folder("xmls", "try.xls")

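【Discussion】:

I get this error:
Traceback (most recent call last):
  File "C:/Python34/Scripts/solution.py", line 41, in <module>
    process_folder("xmls", "try.xls")
  File "C:/Python34/Scripts/solution.py", line 23, in process_folder
    files = [folder+'/'+f for f in os.listdir(folder) if ".txt" in f]
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'xmls'
Ah, I was using .txt files instead of .xml. Update that first. Second, your folder is not accessible. Recheck the first argument of process_folder(), or try giving the full (absolute) folder path.
In my code, process_folder("xmls", "try.xls") searches for XML files in a folder named "xmls" inside your current folder. So there should be a folder such as C:/Python34/Scripts/xmls that holds all your files, or you should change this folder name in the code.
OK, that helped. Now I get:
  File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7118: character maps to <undefined>
I believe the decode error is related to the files being in UTF-8 format, and that I need to add that when opening the file. I am not sure how to apply it.
I read the files in the get_values() method, so that is where you have to fix it. I hate this problem with Python; it has already cost me a lot of time. For a previous question of mine, though, there is some great insight into this issue here: stackoverflow.com/questions/27522015/…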
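
A minimal sketch of the fix discussed here, assuming the files really are UTF-8 as the commenter suspects: pass an explicit encoding to `open()` instead of relying on the Windows default (cp1252). Byte 0x9d has no mapping in cp1252, but it occurs in the UTF-8 encoding of characters such as the right double quotation mark (U+201D). The helper name `read_lines` and the sample file are hypothetical.

```python
import os
import tempfile


def read_lines(filepath, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the platform default."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
    with open(filepath, encoding=encoding, errors="replace") as handle:
        return handle.readlines()


# Demo: a UTF-8 file containing U+201D, whose encoding (E2 80 9D) includes byte 0x9d.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.xml")
    with open(path, "wb") as handle:
        handle.write('<content styleCode="Bold">Value 3</content> \u201d\n'.encode("utf-8"))
    lines = read_lines(path)

print(lines)
```

In get_values(), the equivalent change would be to open the file with `encoding="utf-8"`. If the files turn out to use a different encoding, that argument is the place to adjust.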