Format, CSV
Zip archive
Both project and study downloads (under 🧬) are .zip archives containing three files named after the project or study: a CSV file, a Parquet file, and a YAML file. The CSV file (download csv for SRP265240) contains TPM (transcripts per million) gene expression values, one gene per row. The columns are:
- Gene ID (transcriptome-specific, such as CA01g00010 in the example)
- Transcriptome (such as CM334_v1; more than one can appear if several were selected)
- One column per run ID (SRR31361249, etc.)
CSV is the easiest format to start from, and is readily human-readable.
gene,transcriptome,SRR11873594,SRR11873596,SRR11873597,SRR11873598,SRR11873599,SRR11873600,SRR11873601,SRR11873602,SRR11873603,SRR11873604,SRR11873605,SRR11873606,SRR11873607,SRR11873608,SRR11873609,SRR11873610,SRR11873611
CA01g00010,CM334_v1,5.73843,5.04719,3.13854,8.7418,9.43804,6.23472,3.72557,6.69566,8.4037,7.27707,8.87913,6.1115,8.29229,8.51418,7.82719,6.49714,3.37158
CA01g00060,CM334_v1,14.3275,7.97699,12.3409,1.66795,0.467861,0.28916,18.3788,5.97843,5.31497,2.04632,4.1111,1.42572,3.17748,8.70099,10.403,16.5913,16.126
...
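As a minimal sketch of how this layout can be consumed, the snippet below parses it with Python's standard `csv` module. The function name `read_tpm_csv` and the nested-dictionary result are illustrative choices, not part of any Verdanta tooling, and the inline sample is a shortened copy of the excerpt above (two runs only). It assumes every run column holds a plain number.

```python
import csv
import io


def read_tpm_csv(handle):
    """Parse a TPM table into {(gene, transcriptome): {run_id: tpm}}.

    Hypothetical helper: assumes the first two columns are named
    'gene' and 'transcriptome' and all remaining columns are run IDs.
    """
    reader = csv.DictReader(handle)
    table = {}
    for row in reader:
        gene = row.pop("gene")
        transcriptome = row.pop("transcriptome")
        # Everything left in the row is a run ID mapped to a TPM string.
        table[(gene, transcriptome)] = {run: float(v) for run, v in row.items()}
    return table


# Shortened copy of the excerpt above (first two run columns only).
sample = io.StringIO(
    "gene,transcriptome,SRR11873594,SRR11873596\n"
    "CA01g00010,CM334_v1,5.73843,5.04719\n"
    "CA01g00060,CM334_v1,14.3275,7.97699\n"
)
tpm = read_tpm_csv(sample)
print(tpm[("CA01g00010", "CM334_v1")]["SRR11873594"])  # prints 5.73843
```

For a real download, the same function could be called on a file opened with `open(path, newline="")` after extracting the CSV from the archive. Keying rows by both gene and transcriptome keeps entries distinct when several transcriptomes are present in one file.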
The Parquet file (download parquet) has a similar organization, but each numeric cell in the table is a small vector containing TPM, length, effective length, and number of reads (in that order). It is not human-readable, but libraries for reading Parquet exist in many languages.
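The four-value cell described above could be unpacked along these lines. This is only a sketch: the names `Quant` and `unpack_cell` are hypothetical, the example values are made up for illustration, and it assumes a cell arrives as a plain sequence of four numbers once the file has been loaded with a Parquet library (for example `pandas.read_parquet` or `pyarrow.parquet.read_table`). How the vector is actually encoded in the file is not specified here.

```python
from typing import NamedTuple, Sequence


class Quant(NamedTuple):
    """One cell of the Parquet table, in the documented order."""
    tpm: float
    length: float
    effective_length: float
    num_reads: float


def unpack_cell(cell: Sequence[float]) -> Quant:
    """Turn a 4-element cell vector into a named record.

    Hypothetical helper: assumes the cell is a sequence ordered as
    TPM, length, effective length, number of reads.
    """
    if len(cell) != 4:
        raise ValueError(f"expected 4 values per cell, got {len(cell)}")
    return Quant(*(float(x) for x in cell))


# Made-up values for illustration only, not taken from a real file.
cell = [5.5, 1200.0, 1000.0, 42.0]
q = unpack_cell(cell)
print(q.tpm, q.num_reads)  # prints 5.5 42.0
```

Naming the fields once avoids relying on positional indexes such as `cell[2]` throughout downstream code.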
We are currently working on an open-source repository to support reading Verdanta data (Parquet and CSV) in R and Python.