Parquet Viewer #735

@norberttech

Working with Parquet files is not as easy as working with CSV files, mostly because Parquet files are hard to browse.
The idea is to build a simple CLI app that could be released as a Docker image.

We can start simple, with two commands:

  • metadata
  • data

bin/parquet read:metadata
bin/parquet read:data

For metadata, I already created a POC, here is the output:

$ ./parquet read:metadata --help

Usage:
  ./parquet_viewer.php [options] [--] <file>

Arguments:
  file                  path to parquet file

Options:
      --columns         Display column details
      --row-groups      Display row group details
      --column-chunks   Display column chunks details
      --page-headers    Display page headers details
  -h, --help            Display help for the given command. When no command is given display help for the ./parquet_viewer.php command
  -q, --quiet           Do not output any message
  -V, --version         Display this application version
      --ansi|--no-ansi  Force (or disable --no-ansi) ANSI output
  -n, --no-interaction  Do not ask any interactive question
  -v|vv|vvv, --verbose  Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

Usage example:

$ ./parquet.php read:metadata path/to/parquet.file --columns --row-groups --column-chunks --page-headers

 ----------------- -------------------- Metadata --------------------------------------- 
  file_path         path/to/parquet.file 
  parquet_version   1                                                                    
  created_by        flow-php parquet version 1.x                                                          
  rows              1                                                                    
 ----------------- --------------------------------------------------------------------- 

 -------------------------------------------- ------- Flat Columns ------ ------------ ---------------- ---------------- 
  path                                         type         logical_type   repetition   max_repetition   max_definition  
 -------------------------------------------- ------------ -------------- ------------ ---------------- ---------------- 
  id                                           INT32        -              OPTIONAL     0                1               
  aMapField.key_value.someKey                  BYTE_ARRAY   -              REQUIRED     1                1               
  aMapField.key_value.aStructField.anInteger   INT32        -              OPTIONAL     1                2               
  aMapField.key_value.aStructField.aString     BYTE_ARRAY   -              OPTIONAL     1                2               
  rootLevelStructField.anotherInteger          INT32        -              OPTIONAL     0                1               
  rootLevelStructField.anotherString           BYTE_ARRAY   -              OPTIONAL     0                1               
  aListField.list.someInteger                  INT32        -              OPTIONAL     1                2               
 -------------------------------------------- ------------ -------------- ------------ ---------------- ---------------- 

 ---------- ----- Row Groups  --------------- 
  num_rows   total_byte_size   columns_count  
 ---------- ----------------- --------------- 
  1          285               7              
 ---------- ----------------- --------------- 

 -------------------------------------------- ------------------------ Column Chunks ------------- ------------ ------------------------ ------------------ 
  path                                         encodings                compression   file_offset   num_values   dictionary_page_offset   data_page_offset  
 -------------------------------------------- ------------------------ ------------- ------------- ------------ ------------------------ ------------------ 
  id                                           [RLE,BIT_PACKED,PLAIN]   GZIP          4             1            -                        4                 
  aMapField.key_value.someKey                  [RLE,BIT_PACKED,PLAIN]   GZIP          56            2            -                        56                
  aMapField.key_value.aStructField.anInteger   [RLE,BIT_PACKED,PLAIN]   GZIP          119           2            -                        119               
  aMapField.key_value.aStructField.aString     [RLE,BIT_PACKED,PLAIN]   GZIP          179           2            -                        179               
  rootLevelStructField.anotherInteger          [RLE,BIT_PACKED,PLAIN]   GZIP          243           1            -                        243               
  rootLevelStructField.anotherString           [RLE,BIT_PACKED,PLAIN]   GZIP          295           1            -                        295               
  aListField.list.someInteger                  [RLE,BIT_PACKED,PLAIN]   GZIP          358           3            -                        358               
 -------------------------------------------- ------------------------ ------------- ------------- ------------ ------------------------ ------------------ 

 -------------------------------------------- -------------- -------- Page Headers ------ ------------------- ----------------------- ----------------- 
  path                                         type           encoding   compressed_size   uncompressed_size   dictionary_num_values   data_num_values  
 -------------------------------------------- -------------- ---------- ----------------- ------------------- ----------------------- ----------------- 
  id                                           DATA_PAGE_V2   PLAIN      26                6                   -                       1                
  aMapField.key_value.someKey                  DATA_PAGE_V2   PLAIN      37                22                  -                       2                
  aMapField.key_value.aStructField.anInteger   DATA_PAGE_V2   PLAIN      34                14                  -                       2                
  aMapField.key_value.aStructField.aString     DATA_PAGE_V2   PLAIN      38                20                  -                       2                
  rootLevelStructField.anotherInteger          DATA_PAGE_V2   PLAIN      26                6                   -                       1                
  rootLevelStructField.anotherString           DATA_PAGE_V2   PLAIN      37                17                  -                       1                
  aListField.list.someInteger                  DATA_PAGE_V2   PLAIN      35                18                  -                       3                
 -------------------------------------------- -------------- ---------- ----------------- ------------------- ----------------------- -----------------
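
For context on where this metadata lives: a Parquet file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes just before the trailing magic hold the little-endian length of the footer (the Thrift-encoded file metadata). Here is a minimal sketch of locating that footer, written in Python purely for illustration (the POC itself is PHP). `footer_location` is a hypothetical helper name; it does not decode the Thrift payload and does not handle files with an encrypted footer.

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"  # magic bytes at both the start and the end of a Parquet file


def footer_location(path):
    """Return (offset, length) of the footer that holds the file metadata."""
    size = Path(path).stat().st_size
    if size < 12:  # leading magic + footer length + trailing magic
        raise ValueError("file too small to be Parquet")
    with open(path, "rb") as fh:
        if fh.read(4) != MAGIC:
            raise ValueError("missing leading PAR1 magic")
        fh.seek(-8, 2)  # last 8 bytes: footer length, then magic
        tail = fh.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (length,) = struct.unpack("<I", tail[:4])  # unsigned 32-bit little-endian
    offset = size - 8 - length
    if offset < 4:
        raise ValueError("footer length exceeds file size")
    return offset, length
```

Everything shown in the tables above (columns, row groups, column chunks) is decoded from that footer; only the page headers require seeking into the column chunk data itself.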

read:data is a bit trickier, but I think we can leverage ETL here: just read the file and display X results.
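
As a rough sketch of the "display X results" idea, assuming rows arrive as an iterator of dicts (how the reader actually yields rows is not specified here), the command only needs to take the first X items and print them. This is Python for illustration only; `head` and `render` are hypothetical names, not part of any existing API.

```python
from itertools import islice


def head(rows, limit=10):
    """Return at most `limit` rows without consuming the rest of the iterator."""
    return list(islice(rows, limit))


def render(rows):
    """Format a list of dict rows as a plain-text table (columns from the first row)."""
    if not rows:
        return "(no rows)"
    columns = list(rows[0])
    # Each column is as wide as its longest value or its header, whichever is longer.
    widths = {c: max(len(c), *(len(str(r.get(c, ""))) for r in rows)) for c in columns}
    header = "  ".join(c.ljust(widths[c]) for c in columns)
    body = ["  ".join(str(r.get(c, "")).ljust(widths[c]) for c in columns) for r in rows]
    return "\n".join([header, *body])
```

Because `head` stops pulling from the iterator once the limit is reached, a large file would not have to be read in full just to preview a few rows.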

Ideally, users could also scroll through the data using the PgDn/PgUp/Up/Down keys, but that can come later; we can start simple.
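
When scrolling does get added, the core of it is just moving a window offset and clamping it to the data. Here is a small sketch of that logic in Python, for illustration only; the key names (`"up"`, `"down"`, `"pgup"`, `"pgdn"`) and the `scroll` function are assumptions, and actually reading key presses from the terminal is left out.

```python
def scroll(offset, key, page_size, total_rows):
    """Return the new first-visible-row offset after a key press."""
    # Unknown keys leave the offset unchanged.
    step = {"up": -1, "down": 1, "pgup": -page_size, "pgdn": page_size}.get(key, 0)
    # The last valid offset still shows a full page (or 0 if the data fits on one page).
    last_offset = max(0, total_rows - page_size)
    return min(max(offset + step, 0), last_offset)
```

For example, with 25 rows and a page size of 10, pressing PgDn twice from the top moves the offset to 10 and then to 15, where it stays.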
