tshark: extract fields with their string representation (answers)

【Question title】: tshark extract fields with their string representation
【Posted】: 2017-09-06 17:08:27
【Question】:

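I have a pcap file captured with tshark that contains data I want to analyze and export to a CSV or xls file. In the tshark documentation I can see that I can either use the -z option with appropriate arguments, or use -T together with -E and -e. I am using Python 3.6 on a Debian machine. Currently my command looks like this: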

command="tshark -q -o tcp.relative_sequence_numbers:false -o tcp.analyze_sequence_numbers:false " \
              "-o tcp.track_bytes_in_flight:false -Q -l -z diameter,avp,272,Session-Id,Origin-Host," \
              "Origin-Realm,Destination-Realm,Auth-Application-Id,Service-Context-Id,CC-Request-Type,CC-Request-Number," \
              "Subscription-Id,CC-Session-Failover,Destination-Host,User-Name,Origin-State-Id," \
              "Multiple-Services-Credit-Control,Requested-Service-Unit,Used-Service-Unit,SN-Total-Used-Service-Unit," \
              "SN-Remaining-Service-Unit,Service-Identifier,Rating-Group,User-Equipment-Info,Service-Information," \
              "Route-Record,Credit-Control-Failure-Handling -r {}".format(args.input_file)

Later I process the output with a pandas DataFrame, like this:

# loops adding TCP and/or UDP ports to scan traffic from
    if args.tcp:
        for port in args.tcp:
            command += " -d tcp.port=={},diameter".format(port)

    if args.udp:
        for port in args.udp:
            command += " -d udp.port=={},diameter".format(port)

    # calling subprocess with output redirection to task variable
    task = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)

    # a loop adding new data dictionaries to data_list
    for line in task.stdout:
        line = re.sub(r"'", "", line.decode("utf-8")) # firstly, decode byte string and get rid of '
        # secondly, split string every whitespace or = and obtain dictionary-like list of keys, values
        line = re.split(r"\s|=", line)

        # convert obtained list to ordered dictionary to preserve column order
        # pair up the items so that each item i is a key and item i+1 is its value
        row = OrderedDict(line[i:i+2] for i in range(0, len(line)-2, 2))  # named row to avoid shadowing the built-in dict
        data_list.append(row)

    # remove last 4 dictionaries (last 4 lines of task.stdout)
    data_list = data_list[:-4]
    df = pd.DataFrame(data_list).fillna("-") # create data frame from list of dicts and fill each NaN with "-"
    df.to_excel("{}.xls".format(args.output_file), index=False)
    print("Please remember that 'frame' column may not correspond to row index!")
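
The splitting step above breaks whenever a value itself contains a space or an equals sign. A minimal sketch of a more tolerant parser, assuming each output line is a sequence of key='value' pairs (which is what the quote-stripping in the loop suggests); the sample line and the name `parse_avp_line` are made up for illustration:

```python
import re
from collections import OrderedDict

# One key='value' pair; keys may contain letters, digits, dots and hyphens.
PAIR = re.compile(r"([\w.-]+)='([^']*)'")

def parse_avp_line(line):
    """Return the key/value pairs of one output line, in order of appearance."""
    return OrderedDict(PAIR.findall(line))

# Hand-written sample line, not real tshark output.
sample = "frame='12' time='1.5' Session-Id='host.example;1;2' User-Name='john doe'"
row = parse_avp_line(sample)
print(row["User-Name"])
```

Lines without any key='value' pair produce an empty dictionary, so they could be skipped with `if row:` instead of slicing off a fixed number of trailing rows, provided the trailing summary lines really contain no such pairs.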

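When I open the output file I can see that it works fine, except that in, for example, CC-Request-Number I get numeric values instead of their string representation. For instance, where Wireshark shows the textual name for a packet, the CC-Request-Number column of the output Excel file shows 3 in the row for that packet instead of TERMINATION-REQUEST.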

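My question is: how can I convert this number to its string representation when using the -z option, or (as I can guess from what I have seen online) how can I obtain these values using -T and -e? I listed all available fields with tshark -G, but there are so many of them that I cannot think of any reasonable way to find the ones I want.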
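
One way to narrow down the field list is to filter it by protocol and substring. The sketch below assumes the tab-separated layout that the tshark manual describes for `-G fields` (field lines start with `F`, followed by the descriptive name, the abbreviation, the type and the parent protocol); the function name `find_fields` and the sample lines are made up for illustration:

```python
def find_fields(dump_text, protocol, needle=""):
    """Return (abbreviation, descriptive name) pairs for one protocol."""
    matches = []
    for line in dump_text.splitlines():
        parts = line.split("\t")
        # Field declarations start with "F"; protocol lines ("P") are skipped.
        if len(parts) < 5 or parts[0] != "F":
            continue
        name, abbrev, parent = parts[1], parts[2], parts[4]
        if parent == protocol and needle.lower() in abbrev.lower():
            matches.append((abbrev, name))
    return matches

# Hand-written sample lines, not real tshark output.
SAMPLE = "\n".join([
    "P\tDiameter Protocol\tdiameter",
    "F\tCC-Request-Type\tdiameter.CC-Request-Type\tFT_UINT32\tdiameter\tBASE_DEC\t0x0\t",
    "F\tSession-Id\tdiameter.Session-Id\tFT_STRING\tdiameter\t\t0x0\t",
    "F\tSource Port\ttcp.srcport\tFT_UINT16\ttcp\tBASE_PT_TCP\t0x0\t",
])
print(find_fields(SAMPLE, "diameter", "request"))
```

In real use, `dump_text` would be the captured standard output of `tshark -G fields`, for example obtained through `subprocess.check_output(["tshark", "-G", "fields"])` and decoded to text.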

【Comments】:

Tags: python wireshark tshark

【Solution 1】:

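Oddly, with -T fields and -e, tshark always prints the numeric representation, but with the "custom column" output format it prints the text representation. The good news is that custom-column mode is actually about 3 times faster than -T fields mode. The bad news is that I know of no way to control the separator character between custom columns, so if your field contents may contain spaces this seems rather unusable.

Instead of using -z, try this: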

-o column.format:'"time", "%t", "type", "%Cus:diameter.CC-Request-Number"'
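
From Python, the nested quoting in that option is easier to get right when the command is passed as an argument list rather than a shell string. A minimal sketch, with hypothetical helper names (`column_format_option`, `build_argv`) and a placeholder file name:

```python
def column_format_option(columns):
    """Build the value for `-o` from (title, format) pairs."""
    pairs = ", ".join('"{}", "{}"'.format(title, fmt) for title, fmt in columns)
    return "column.format:" + pairs

def build_argv(input_file, columns):
    """Return an argument list suitable for subprocess without shell=True."""
    return ["tshark", "-r", input_file, "-o", column_format_option(columns)]

argv = build_argv(
    "capture.pcap",  # placeholder input file
    [("time", "%t"), ("type", "%Cus:diameter.CC-Request-Number")],
)
print(argv)
```

The list can be handed to `subprocess.Popen(argv, stdout=subprocess.PIPE)` directly, so no shell escaping of the single and double quotes is needed. Newer Wireshark releases document this preference as `gui.column.format`, so the prefix may need adjusting for the installed version.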

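【Comments】:

Should I apply -o to every column for which I want the string representation? Apart from -z being meant for statistics, I can see that -o just overrides default preference values, so why should I actually use -o instead of -z?
@Colonder: Try it and see what works for you. Your case is somewhat special because -z diameter may be doing something that is not easy to replicate. You could try combining my answer with -z diameter; I don't know what that will do. Experiment.
I will. Do you perhaps know of some kind of dictionary with key-value pairs in which I could look up the right numeric keys?
@Colonder: Sure, it's right here: github.com/wireshark/wireshark/blob/… - maybe the solution is simply to load this file in Pandas and map the values there.
Hmm, that may be a solution and I am going to try it; I had actually thought of it earlier. Thanks for confirming my intuition.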

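【Solution 2】:

Thanks to John Zwinck's suggestion, this answer, and the Python documentation on The ElementTree XML API, I implemented the following code (I downloaded dictionary.xml and chargecontrol.xml from the official Wireshark GitHub repository):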

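import re
import subprocess
import xml.etree.ElementTree as ET
from collections import OrderedDict

import pandas as pd

# args is assumed to come from argparse, as in the question
chargecontrol_tree = ET.parse("chargecontrol.xml")
dictionary_tree = ET.parse("dictionary.xml")
chargecontrol_root = chargecontrol_tree.getroot()
dictionary_root = dictionary_tree.getroot()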

# list that will contain data dictionaries
data_list = []

# base command
command = "tshark -q -o tcp.relative_sequence_numbers:false -o tcp.analyze_sequence_numbers:false " \
          "-o tcp.track_bytes_in_flight:false -Q -l -z diameter,avp,272,Session-Id,Origin-Host," \
          "Origin-Realm,Destination-Realm,Auth-Application-Id,Service-Context-Id,CC-Request-Type,CC-Request-Number," \
          "Subscription-Id-Data,Subscription-Id-Type,CC-Session-Failover,Destination-Host,User-Name,Origin-State-Id," \
          "Requested-Service-Unit,Used-Service-Unit,SN-Total-Used-Service-Unit," \
          "SN-Remaining-Service-Unit,Service-Identifier,Rating-Group,User-Equipment-Info,Service-Information," \
          "Route-Record,Credit-Control-Failure-Handling -r {}".format(args.input_file)

# loops adding tcp and/or udp ports to scan traffic from
if args.tcp:
    for port in args.tcp:
        command += " -d tcp.port=={},diameter".format(port)

if args.udp:
    for port in args.udp:
        command += " -d udp.port=={},diameter".format(port)

# calling subprocess with output redirection to task variable
task = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)

# a loop adding new data dictionaries to data_list
for line in task.stdout:
    line = re.sub(r"'", "", line.decode("utf-8")) # firstly, decode byte string and get rid of '
    # secondly, split string every whitespace or = and obtain dictionary-like list of keys, values
    line = re.split(r"\s|=", line)

    # convert obtained list to ordered dictionary to preserve column order
    # pair up the items so that each item i is a key and item i+1 is its value
    row = OrderedDict(line[i:i+2] for i in range(0, len(line)-2, 2))  # named row to avoid shadowing the built-in dict
    data_list.append(row)

# remove last 4 dictionaries (last 4 lines of task.stdout)
data_list = data_list[:-4]
df = pd.DataFrame(data_list).fillna("-") # create data frame from list of dicts and fill each NaN with "-"

# values taken from official wireshark repository
# https://github.com/boundary/wireshark/blob/master/diameter/dictionary.xml
# https://github.com/wireshark/wireshark/blob/2832f4e97d77324b4e46aac40dae0ce898ae559d/diameter/chargecontrol.xml
df["Auth-Application-Id"] = df["Auth-Application-Id"].map({node.attrib["code"]:node.attrib["name"] for node in
      dictionary_root.findall(".//*[@name='Auth-Application-Id']/enum")})

# list of columns that values of have to be substituted
for col in ["CC-Request-Type", "CC-Session-Failover", "Credit-Control-Failure-Handling", "Subscription-Id-Type"]:
    df[col] = df[col].map({node.attrib["code"]: node.attrib["name"] for node in
          chargecontrol_root.findall((".//*[@name='{}']/enum").format(col))})


df.to_excel("{}.xls".format(args.output_file), index=False)
print("Please remember that 'frame' column may not correspond to row index!")
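
The enum lookup in this answer can be tried in isolation. The sketch below uses a small inline XML fragment written in the style of the Wireshark Diameter dictionaries (an assumption made for illustration, not a copy of the real file) and plain lists instead of a DataFrame; `enum_names` and `translate` are hypothetical helper names:

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of chargecontrol.xml.
SAMPLE_XML = """
<dictionary>
  <avp name="CC-Request-Type" code="416">
    <type type-name="Enumerated"/>
    <enum name="INITIAL_REQUEST" code="1"/>
    <enum name="UPDATE_REQUEST" code="2"/>
    <enum name="TERMINATION_REQUEST" code="3"/>
    <enum name="EVENT_REQUEST" code="4"/>
  </avp>
</dictionary>
"""

def enum_names(root, avp_name):
    """Map enum codes (as strings) to enum names for one AVP."""
    path = ".//*[@name='{}']/enum".format(avp_name)
    return {node.attrib["code"]: node.attrib["name"] for node in root.findall(path)}

def translate(values, mapping):
    """Replace known codes with names and keep everything else unchanged."""
    return [mapping.get(value, value) for value in values]

root = ET.fromstring(SAMPLE_XML)
mapping = enum_names(root, "CC-Request-Type")
print(translate(["1", "3", "-"], mapping))
```

The fallback matters because pandas `Series.map` with a dictionary yields NaN for any value missing from the dictionary, including the "-" placeholders written by `fillna("-")`. A pandas equivalent of the fallback would be `df[col].map(mapping).fillna(df[col])`.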

【Comments】: