Decompressing a specific amount of zlib data "eats" following data #20

@davean

Some data formats (git pack files, for example) store the uncompressed size of the data rather than the size of the compressed stream. One would expect the streaming-commons toolkit to handle this case easily, but it does not seem to.

While it does decompress the correct amount, it does not return the unused input so that it can be read again, as the following example program demonstrates:

{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Bits
import Control.Monad.Trans
import qualified Codec.Compression.Zlib as Z
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Builder as BSB
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Conduit (($$), (=$))
import qualified Data.Conduit as C
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.Zlib as CZ

main :: IO ()
main = do
  let c = TE.encodeUtf8 "This data is stored compressed."
  let u = TE.encodeUtf8 "This data isn't."
  let encoded = writeExample c u
  (c', u') <- CB.sourceLbs encoded $$ readExample
  print (c, u)
  print (c', u')
  putStrLn $ "Input and output matched: " ++ show (c==c' && u==u')

readExample :: C.Sink BS.ByteString IO (BS.ByteString, BS.ByteString)
readExample = do
  sbs <- CB.take 4
  let size = case map fromIntegral . BSL.unpack $ sbs of
        [s0, s1, s2, s3] -> (s3 `shiftL` 24) .|. (s2 `shiftL` 16) .|. (s1 `shiftL` 8) .|. s0
        _ -> error "We really had to read 4 octets there."
  -- We know how large it should decompress to, but not how large it is compressed.
  -- We proceed to decompress until we have decompressed enough data.
  c <- (CZ.decompress CZ.defaultWindowBits) =$ (CB.take size)
  -- Immediately following the compressed stream is more data we need.
  u <- CB.sinkLbs
  return (BSL.toStrict c, BSL.toStrict u)

writeExample :: BS.ByteString -> BS.ByteString -> BSL.ByteString
writeExample cdata udata =
  let c = Z.compress . BSL.fromStrict $ cdata
  in BSB.toLazyByteString . mconcat $
   [ BSB.int32LE . fromIntegral . BS.length $ cdata -- We record the size of the uncompressed data
   , BSB.lazyByteString c -- but store it compressed.
   , BSB.byteString udata -- Then we store other important data with no delimiter.
   ]
   ]

Example output:

("This data is stored compressed.","This data isn't.")
("This data is stored compressed.","")
Input and output matched: False
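
For comparison, here is a minimal sketch of the behaviour being asked for, written against the zlib package's incremental interface (`Codec.Compression.Zlib.Internal`) rather than streaming-commons or conduit. It assumes a zlib version that exports `foldDecompressStreamWithInput` and `decompressST` (0.6 or later, as far as I know). The helper name `splitZlibPrefix` is hypothetical and is not part of any library. It is only meant to show that the trailing bytes can be recovered once the decompressor reports the end of the stream; it is not a proposed fix for streaming-commons.

```haskell
module Main where

import qualified Codec.Compression.Zlib as Z
import qualified Codec.Compression.Zlib.Internal as ZI
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSLC

-- | Hypothetical helper: decompress one zlib stream from the front of the
-- input and return the decompressed bytes together with whatever input
-- followed the end of that stream.
splitZlibPrefix :: BSL.ByteString -> Either String (BSL.ByteString, BSL.ByteString)
splitZlibPrefix =
  ZI.foldDecompressStreamWithInput
    -- An output chunk is available: prepend it to the rest of the output.
    (\chunk rest ->
       fmap (\(out, left) -> (BSL.append (BSL.fromStrict chunk) out, left)) rest)
    -- End of the zlib stream: the argument is the input that was not consumed.
    (\leftover -> Right (BSL.empty, leftover))
    -- Decompression failed: report the error as a string.
    (Left . show)
    (ZI.decompressST ZI.zlibFormat ZI.defaultDecompressParams)

main :: IO ()
main = do
  let c = BSLC.pack "This data is stored compressed."
      u = BSLC.pack "This data isn't."
  -- Compressed data immediately followed by uncompressed data, no delimiter.
  print (splitZlibPrefix (BSL.append (Z.compress c) u))
```

Unlike the conduit pipeline above, this sketch relies on the end-of-stream marker of the zlib data itself rather than on the recorded uncompressed size, so the size prefix is left out of the input here.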