Why is Haskell/unpack messing with my bytes? Answers

【Question title】: Why is Haskell/unpack messing with my bytes?
【Posted】: 2012-06-18 12:12:30
【Question description】:

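I've built a tiny UDP/protobuf transmitter and receiver. I spent the morning trying to track down why the protobuf decoding was producing errors, only to find that the transmitter (Spoke.hs) was sending incorrect data.

The code uses unpack to turn the Lazy.ByteStrings into the Strings that the Network package will send. I found unpack in Hoogle. It may not be the function I'm looking for, but its description looks suitable: "O(n) Converts a ByteString to a String."

Spoke.hs produces the following output: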

chris@gigabyte:~/Dropbox/haskell-workspace/hub/dist/build/spoke$ ./spoke
45
45
["a","8","4a","6f","68","6e","20","44","6f","65","10","d2","9","1a","10","6a","64","6f","65","40","65","78","61","6d","70","6c","65","2e","63","6f","6d","22","c","a","8","35","35","35","2d","34","33","32","31","10","1"]

Whereas Wireshark shows me that the data in the packet is:

0a:08:4a:6f:68:6e:20:44:6f:65:10:c3:92:09:1a:10:6a:64:6f:65:40:65:78:61:6d:70:6c:65:2e:63:6f:6d:22:0c:0a:08:35:35:35:2d:34:33:32:31:10

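The length (45) is the same in Spoke.hs and in Wireshark.

Wireshark is missing the last byte (value 0x01), and a run of values in the middle is different (and one byte longer in Wireshark).

"65","10","d2","9" in Spoke.hs versus 65:10:c3:92:09 in Wireshark.

As 0x10 is DLE, it struck me that some escaping is probably going on, but I don't know why.

I have many years of trust in Wireshark and only a few tens of hours of Haskell experience, so I've assumed that the code is at fault.

Any suggestions appreciated.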

-- Spoke.hs:

module Main where

import Data.Bits
import Network.Socket -- hiding (send, sendTo, recv, recvFrom)
-- import Network.Socket.ByteString
import Network.BSD
import Data.List
import qualified Data.ByteString.Lazy.Char8 as B
import Text.ProtocolBuffers.Header (defaultValue, uFromString)
import Text.ProtocolBuffers.WireMessage (messageGet, messagePut)
import Data.Char (ord, intToDigit)
import Numeric

import Data.Sequence ((><), fromList)

import AddressBookProtos.AddressBook
import AddressBookProtos.Person
import AddressBookProtos.Person.PhoneNumber
import AddressBookProtos.Person.PhoneType

data UDPHandle = 
     UDPHandle {udpSocket  :: Socket,
                udpAddress :: SockAddr}
opensocket :: HostName             -- ^ Remote hostname, or localhost
           -> String               -- ^ Port number or name
           -> IO UDPHandle         -- ^ Handle to use for logging
opensocket hostname port =
    do -- Look up the hostname and port.  Either raises an exception
       -- or returns a nonempty list.  First element in that list
       -- is supposed to be the best option.
       addrinfos <- getAddrInfo Nothing (Just hostname) (Just port)
       let serveraddr = head addrinfos

       -- Establish a socket for communication
       sock <- socket (addrFamily serveraddr) Datagram defaultProtocol

       -- Save off the socket, and server address in a handle
       return $ UDPHandle sock (addrAddress serveraddr)

john = Person {
  AddressBookProtos.Person.id = 1234,
  name = uFromString "John Doe",
  email = Just $ uFromString "jdoe@example.com",
  phone = fromList [
    PhoneNumber {
      number = uFromString "555-4321",
      type' = Just HOME
    }
  ]
}

johnStr = B.unpack (messagePut john)

charToHex x = showIntAtBase 16 intToDigit (ord x) ""

main::IO()
main = 
    do udpHandle <- opensocket "localhost" "4567"
       sent <- sendTo (udpSocket udpHandle) johnStr (udpAddress udpHandle)
       putStrLn $ show $ length johnStr
       putStrLn $ show sent
       putStrLn $ show $ map charToHex johnStr
       return ()

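【Comments】:

The documentation I see for the bytestring package lists unpack as converting a ByteString to [Word8], which is not the same as a String. I'd expect some byte differences between ByteString and String, because String is Unicode data while ByteString is just an efficient array of bytes, but unpack shouldn't be able to produce a String in the first place.
Can you use network-bytestring and avoid the redundant data conversion?
@MatthewWalton: unpack from Data.ByteString.Char8, or the lazy variant, outputs Strings. But they aren't Unicode-aware.
The Network package seems to be taking the Chars you give it and UTF-8 encoding them into a byte stream (and then truncating).
Thanks everyone. I'll look at using Network.ByteString.Lazy?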
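
Following the suggestions in the comments above, one way to avoid the String round trip is to hand the bytes to the socket directly. The following is only a minimal sketch: it assumes the network package's Network.Socket.ByteString module (whose sendTo takes a strict ByteString) and a bytestring version that provides Data.ByteString.Lazy.toStrict. The name sendPayload is a hypothetical helper, not part of either library.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Network.Socket as NS
import qualified Network.Socket.ByteString as NSB

-- | Send a lazy ByteString as a single UDP datagram.
-- The payload never passes through String, so no character
-- encoding is applied to it. Returns the number of bytes sent.
-- (Hypothetical helper name, for illustration only.)
sendPayload :: NS.Socket -> NS.SockAddr -> BL.ByteString -> IO Int
sendPayload sock addr payload =
    NSB.sendTo sock (BL.toStrict payload) addr
```

In Spoke.hs this would stand in for the String-based call, for example `sendPayload (udpSocket udpHandle) (udpAddress udpHandle) (messagePut john)`, with johnStr no longer needed for sending.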

Tags: haskell protocol-buffers unpack

【Solution 1】:

I think you'll want toString and fromString from utf8-string, rather than unpack and pack. This blog post was very helpful to me.

【Discussion】:

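【Solution 2】:

The documentation I see for the bytestring package lists unpack as converting a ByteString to [Word8], which is not the same as a String. I'd expect some byte differences between ByteString and String, because String is Unicode data while ByteString is just an efficient array of bytes, but unpack shouldn't be able to produce a String in the first place.

So you may be hitting a Unicode conversion issue here, or at least something is interpreting the data as Unicode when the underlying data really isn't, and that rarely ends well.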
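
The two flavours of unpack mentioned here can be compared side by side. This is a small sketch that assumes only the bytestring package; hexByte, hexDump and charCodes are hypothetical helper names. It also pads each byte to two hex digits, which makes the output easier to line up against a Wireshark dump than the unpadded charToHex in the question.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Char (ord)
import Data.Word (Word8)
import Numeric (showHex)

-- | Render one byte as exactly two lowercase hex digits.
hexByte :: Word8 -> String
hexByte w =
    let s = showHex w ""
    in if length s < 2 then '0' : s else s

-- | Hex dump built from the raw bytes (Data.ByteString.Lazy.unpack
-- yields [Word8], so no characters are involved at all).
hexDump :: BL.ByteString -> [String]
hexDump = map hexByte . BL.unpack

-- | Code points of the Chars produced by the Char8 unpack: each byte
-- becomes the Char with the same numeric value (0..255).
charCodes :: BL.ByteString -> [Int]
charCodes = map ord . BLC.unpack
```

Both modules operate on the same lazy ByteString type, so the difference lies only in how the bytes are presented: as Word8 values or as Chars in the range 0 to 255.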

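【Discussion】:

No, it also says unpack :: ByteString -> [Char] (and I think String is an alias for [Char]). hackage.haskell.org/packages/archive/bytestring/latest/doc/html/…
That's Data.ByteString.Char8 - I was looking at Data.ByteString.Lazy. Still, as John L points out in the comments, the problem remains that it isn't Unicode-aware.
Definitely a Unicode conversion: e.g. code point D8 is C3 98 in UTF-8. That's why any value up to 0x7F passes through unscathed.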
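
The UTF-8 explanation can be checked against the bytes in the question. The sketch below uses only base and hand-encodes the one- and two-byte UTF-8 forms. utf8Encode and encodeAndTruncate are hypothetical names, and the "encode, then cut off at the original length" behaviour is an assumption taken from the comments on the question, not from the Network package's documentation.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | UTF-8 encode one Char. Only the one- and two-byte forms are
-- handled, which covers every Char that Char8's unpack can produce
-- (code points 0x00..0xFF).
utf8Encode :: Char -> [Word8]
utf8Encode c
    | n < 0x80  = [fromIntegral n]
    | n < 0x800 = [ 0xC0 .|. fromIntegral (n `shiftR` 6)
                  , 0x80 .|. fromIntegral (n .&. 0x3F) ]
    | otherwise = error "utf8Encode: code points above 0x7FF are not handled in this sketch"
  where
    n = ord c

-- | Bytes that would reach the wire if a String were UTF-8 encoded
-- and only the first `limit` bytes were sent (assumed behaviour).
encodeAndTruncate :: Int -> String -> [Word8]
encodeAndTruncate limit = take limit . concatMap utf8Encode
```

In GHCi, `utf8Encode '\xd2'` evaluates to `[195,146]`, that is c3 92, which matches the 65:10:c3:92:09 sequence Wireshark shows. Because that one Char grows to two bytes, the 45 Chars become 46 bytes, and cutting the result back to 45 would drop the trailing 0x01, consistent with the capture.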