
APE6 - Enhanced Character Separated Values table format#7

Merged
eteq merged 19 commits into astropy:master from taldcroft:ape6
Jan 26, 2015

Conversation

@taldcroft

Member

APE6 is primarily a specification of a new standard for the interchange of tabular data in a text-only format. The proposed format handles the key issue of serializing column specifications and table metadata by using a JSON-encoded data structure which is included in the file as a series of #-prefixed comment lines. The actual tabular data are stored in a standard comma-separated-values (CSV) format, giving compatibility with a wide variety of non-specialized CSV table readers. Using JSON makes it extremely easy for applications to read both the standardized data format elements (e.g. column name, type, description) as well as complex metadata structures. Support for schemas to describe and validate the metadata is part of the standard along with support for non-ASCII unicode character encoding.

The implementation in astropy.io.ascii is relatively straightforward and will provide a significant benefit of allowing text serialization of most astropy Table objects, persistent storage, and subsequent interchange with other users.

Comment thread APE6.rst Outdated

Member


Can you change the PR references to links for the convenience of the reader?
E.g. astropy/astropy#2319 in this case.

@taldcroft

Member Author

@cdeil - done.

@cdeil

cdeil commented Apr 13, 2014

Member

I Googled JSON Table and found http://dataprotocols.org/json-table-schema/.
This seems very similar to the format you propose: http://dataprotocols.org/tabular-data-package/

From a very quick look, the main difference seems to be that they put the JSON metadata in a separate file (I'm sure there are other differences; as I said, I didn't look in detail at either spec).
My first thought when I saw your format was how the # commenting will interact with the JSON.
Is it possible to have non-metadata comments like

# Changing units from `cm` to `inches` because my boss said so

mixed with the JSON?

Do you think that other existing format is an option?
If no, can you explain in the APE what features it is lacking wrt. your new proposed format and add a link?

@taldcroft

Member Author

@cdeil - Very nice that you found this existing standard for specifying fields and tabular data. Nicer that it largely coincides with what I made up and amplifies the arguments I made for both JSON and CSV.

My only strong disagreement with the standards you found is splitting the data out into a separate file. I don't see any really compelling reason to suffer the problems associated with using two files instead of one. I believe most CSV readers support the concept of a comment line.

About the non-metadata comments, the original version (astropy/astropy#683) did actually allow for non-metadata comments by encapsulating the JSON part in a section delimited by special START and STOP lines. This is reasonable, but I took that out to simplify things overall. It would be easy enough to put back in if that was the consensus.

@cdeil

cdeil commented Apr 13, 2014

Member

Agreed ... one file is nicer than two!

But putting the JSON in a separate file instead of in the # part of a CSV file also has some small advantages:

  • In your current proposal non-metadata comments are not allowed, but they are very commonly used in CSV files ... I would be 👍 for START / STOP lines.
  • Another disadvantage is that for every programming language one has to implement an (admittedly extremely simple ... see your Python implementation) parser for your format, whereas parsers for CSV and JSON already exist. (I'm not sure this is a good argument, because some custom code is needed anyway to combine metadata and data into an in-memory table structure ... trying to implement this in a few languages (say JavaScript/jQuery or C) and feeding it bad inputs should reveal if this is a real issue.)

It could be worth adding a link to http://dataprotocols.org/ to the APE for now, and maybe starting a mailing list thread or GitHub issue to ask the guys from http://dataprotocols.org/ what they think about extending their format specification to optionally allow the one-file format where the JSON is in a comment in the CSV file?

@taldcroft

Member Author

Having done a little bit of googling (e.g. "csv comment lines"), I'm backing off on my statement that #-prefixed comments are generally supported outside of astronomy. I tried importing a comma-separated DTIF file into Google Sheets and there wasn't any way to automatically exclude the header lines. On the other hand, the rest of the data parsed perfectly well, so the resultant spreadsheet had some cruft up top which might be informative for users, or could just be deleted.
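For what it's worth, the same behaviour is easy to reproduce with Python's standard `csv` module, which has no concept of comment lines. The snippet below is only a sketch: the one-line JSON header and the column names are made up for illustration and are not the DTIF header layout.

```python
import csv
import io

# Illustrative file: one '#'-prefixed header line followed by plain CSV.
text = """\
# {"fields": [{"name": "a"}, {"name": "b"}]}
a,b
1,2
3,4
"""

# The stdlib reader treats the header like any other row and splits it
# on commas, which is the "cruft up top" seen in a spreadsheet import.
raw_rows = list(csv.reader(io.StringIO(text)))

# Dropping '#'-prefixed lines before parsing recovers just the table.
data_lines = [line for line in text.splitlines() if not line.startswith("#")]
rows = list(csv.reader(data_lines))
```

So a generic reader still gets the data right, but the caller has to filter the header lines explicitly.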

I've read the http://dataprotocols.org/ standard now and it looks generally quite good. I'm planning to modify APE6 to reflect using this to define the JSON content. There is one issue that the type system http://dataprotocols.org/json-table-schema/#field-types doesn't include support for specific bit lengths which is needed for round-tripping scientific data tables. We might be able to work to extend the standard accordingly.

So the big question is one file or two. Since a primary goal is interoperability, I guess we do need to consider whether the two-file solution is more effective and worth it.

@taldcroft

Member Author

Links to and discussion of the Tabular Data Package have been added.

@taldcroft

Member Author

Contacted the Data Protocols org:
https://lists.okfn.org/pipermail/data-protocols/2014-April/000091.html

@taldcroft

Member Author

For the single-file protocol (if this gets accepted) we might consider something like:

### {
###   "fields":
###      ...
### }
# Anything with ### is part of the JSON header (and must be strictly parseable), 
# otherwise just a plain comment put in by the user.
a b c
1 2 3
4 5 6

It's not 100% bulletproof, but probably 99.99% OK. This seems a lot cleaner than START / STOP lines and other special markers.
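A rough sketch of how a reader might separate the three kinds of lines under this idea. The function name `split_header` and the contents of `"fields"` are hypothetical, chosen only to make the example runnable; they are not part of the proposal.

```python
import json


def split_header(text):
    """Return (JSON header dict, plain comments, data lines)."""
    header_lines, comments, data = [], [], []
    for line in text.splitlines():
        if line.startswith("###"):
            # '###' lines carry the strictly parseable JSON header
            header_lines.append(line[3:])
        elif line.startswith("#"):
            # any other '#' line is a free-form user comment
            comments.append(line[1:].strip())
        else:
            data.append(line)
    header = json.loads("\n".join(header_lines)) if header_lines else {}
    return header, comments, data


example = """\
### {
###   "fields": [{"name": "a"}, {"name": "b"}, {"name": "c"}]
### }
# a plain user comment
a b c
1 2 3
4 5 6
"""

header, comments, data = split_header(example)
rows = [line.split() for line in data if line.strip()]
```

The one case this cannot distinguish is a user comment that itself begins with `##`, which is presumably the remaining 0.01%.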

Comment thread APE6.rst Outdated

Member


Since JSON is also human readable, we should leverage that and add a "schema_description" (or similar) field that points to the astropy documentation or the dataprotocols website.

@hamogu

hamogu commented Apr 16, 2014

Member

I personally prefer the two-file approach or at least the option of a two-file approach (similar to CDS where the header can be on top of the datafile or in a separate file).
I see two advantages:

  • It makes it easy to use the same header format with a different table data format. Specifically, I would like (as a trivial PR if this APE is accepted) to save the data in a binary format with np.save(). Astropy tables can be quite large and a CSV file is a very inefficient way to save any type of numbers. That is not as interoperable as this APE, but far more efficient if I just need persistent storage or interoperability between different Python users.
  • If the header metadata fills many lines of text (because the table has many columns), the file will look very ugly when read in e.g. Excel. Also, some spreadsheet editors might not recognize a numerical type for the first column if there are many lines of strings above it.

@taldcroft

Member Author

@hamogu - interesting point about allowing for an alternate data back end, though in this case I wonder if you aren't better off just pickling the Table. In any case I'm coming around to the view that we will support the two-file format no matter what, so the question is about the one-file version.

Just one comment about using ASCII tables. As far as disk space goes they are actually not all that bad, since int64 or float64 values can frequently be written in ASCII in around 8 or fewer bytes. The killer of course is speed, since there is no getting around the high overhead of going from text repr <=> binary.
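As a back-of-the-envelope check of the disk-space point, here is a small sketch comparing the text length of a few float64 values with their fixed 8-byte binary size. The sample values are arbitrary and chosen for illustration; real tables will vary.

```python
import struct

values = [1.5, 42.0, 0.001, 123456.0, -7.25]

# Binary storage: every float64 takes exactly 8 bytes.
binary_bytes = len(values) * struct.calcsize("<d")

# Text storage: repr() gives the shortest string that round-trips each
# float; add 1 byte per value for a delimiter or newline.
ascii_bytes = sum(len(repr(v)) + 1 for v in values)

# Full-precision values are the counter-example: this one needs 19 characters.
worst_case = len(repr(0.1 + 0.2))
```

For "typical" short values the text form is comparable to or smaller than binary, while values carrying full double precision are more than twice as long.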

@taldcroft

Member Author

I've updated the APE-6 standard proposal for transition to YAML and a general simplification (making the aims a bit less lofty).

@astrofrog astrofrog mentioned this pull request Oct 30, 2014
mwcraig pushed a commit to mwcraig/astropy-APEs that referenced this pull request Dec 14, 2014
Comment thread APE6.rst Outdated

Member


Update this?

@PaulKuin


Yes, true enough, but what about other metadata? The idea was to have a
"ReadMe" file to put the metadata. I think that a compromise would be to
limit the header metadata in a data file to the names of the column and a
pointer to another file, like the 'ReadMe', containing the rest of the
metadata.

In other words, what you are proposing is too restrictive (trying to put all
the metadata in the file header) and not extensible (something that the
ReadMe file of metadata could do). In my honest opinion you are
reinventing the wheel.

Sorry to have to say that.

Paul

On Thu, Jan 15, 2015 at 11:12 PM, Tom Aldcroft notifications@github.com
wrote:

@PaulKuin https://github.com/PaulKuin - thanks for the feedback. CDS is
discussed in the APE-6 text (
https://github.com/taldcroft/astropy-APEs/blob/ape6/APE6.rst#cds--apj-machine-readable-table).
This format does have a long heritage and a wide user base, but in my
opinion it also has limitations in this context. I think that making a new
format which addresses some of the limitations while being interoperable
with CDS would be difficult and end up counter to the goal of being
CSV-compliant and simple. Using XML is counter to the idea of
human-readable/writeable files.

One of the features which I think is quite valuable in ECSV is that in the
simple cases the header reduces down to something very short and easily
readable. E.g.

# %ECSV 1.0
# ---
# columns:
# - {name: a, unit: m / s, type: int64, format: '%03d'}
# - {name: b, unit: km, type: int64, description: This is column b}
a b
001 2
004 3

In all cases the format is pretty straightforward for machines to read as
well, which would hopefully work toward wider adoption.
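To make the role of the `format` attribute in that example concrete, here is a minimal sketch of how a writer might apply it when emitting the data lines. The helper `format_value` and the in-memory `columns` list are hypothetical and only mirror the two columns shown above; this is not the astropy implementation.

```python
# Column descriptions mirroring the header above (illustration only).
columns = [
    {"name": "a", "format": "%03d"},
    {"name": "b"},
]
rows = [(1, 2), (4, 3)]


def format_value(value, column):
    """Apply the column's printf-style format if present, else str()."""
    fmt = column.get("format")
    return fmt % value if fmt else str(value)


# First the column-name line, then one space-delimited line per row.
lines = [" ".join(col["name"] for col in columns)]
for row in rows:
    lines.append(" ".join(format_value(v, c) for v, c in zip(row, columns)))
```

With these inputs the result is the three data lines of the example: `a b`, `001 2`, `004 3`.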



@hamogu

hamogu commented Jan 16, 2015

Member

I don't see that the CDS format and APE6 are in competition. CDS is a general format for astronomical tables that is sanctioned by major astronomical institutions (CDS, ApJ, A&A) and it is set up to contain metadata that follows a certain standard and is of long-term interest.

The format suggested in APE6 can store a table I am currently working on between two Python sessions, or I can quickly email it to a collaborator who uses the exact same software as I do. It handles arbitrary metadata in an automatic and machine-readable way, e.g. I can have a column whose metadata is a nested list of Python dictionaries. Such a structure would be hard to encode in a CDS table, and even if I managed to put it in the README file, this specific file would only be readable for me and my specific code.

In my opinion, APE6 and CDS seem to address opposite requirements: CDS has a standardized format with fixed conventions (e.g. all CDS tables use "e_x" as the column name for the uncertainty on column "x"), while APE6 allows for totally free-form metadata, which is great for data exchange between collaborators or persistent storage between my Python sessions. APE6 tries to go a little bit beyond that by defining a handful of "standard keywords" (e.g. "name" and "unit") and keeping the data in a CSV format that many programs can read, but it does not aim for the complete metadata header that is found in most CDS tables.

@taldcroft

Member Author

@PaulKuin - As @hamogu has mentioned, the allowed metadata is essentially unrestricted. There is a meta tag for the entire table and a meta tag for each column (in addition to a few predefined meta attributes like name, format, unit, description). This meta can be any data structure that is serializable by YAML, so the metadata set in a CDS file could be easily encoded. Then it can be read back by anyone with a YAML parser (i.e. most common languages).

To my knowledge none of the available astronomical ASCII data formats support (essentially) arbitrary meta data and place the burden of parsing header meta on a standard and widely available library (YAML). In that sense I would argue that this is not entirely re-inventing the wheel.

Of course there is the Tabular Data Package (TDP) that uses JSON. This preceded ECSV and shares much of the design philosophy. But after some discussion here we decided that YAML would be a better fit since it is far easier to read/write for humans. On the point of re-inventing TDP as ECSV, definitely guilty.

@PaulKuin


@notifications@github.com , @hamogu https://github.com/hamogu -

In 1993 we were faced with Tables entered by hand that needed to go through
a process of verification. The CDS format provided restrictions so that the
process could be pipelined. Restricting the name space was important for
CDS in order to automate loading the databases used for SIMBAD etc. A
tight standard also allows for applications to be written to use the data.

But quite rightly you mention good reasons to use YAML parsing, some of
which are very similar to those we had when developing the CDS format. The
main drawback of using the CDS format now is that it is providing barriers
rather than taking them away for the kind of use that you envision. Frankly,
if the table generation works as it does now, with options to add metadata
(+metadata field) a bit like HDS allows, that sounds to me to be the way to
go. If data structures can be supported that sounds even better.
The challenge there will be the readability.

A separate application could be made to enforce the kind of restrictions
that the CDS format needs so that the data can be converted with relatively
the same effort as it takes now to do it by hand. The astropy tables would
thus only partly overlap with CDS tables, could make CDS table creation not
more difficult, perhaps even easier, and add lots of needed functionality.

@PBarmby

PBarmby commented Jan 16, 2015


I am no kind of tabular-data-interchange-format expert, but I have spent quite a bit of time wrestling data out of IPAC/CDS/LaTeX etc formats into something usable. The standard proposed here seems like it would make my life easier, so I'm all for it. I would argue strongly in favour of having all metadata in the same file as the actual tabular data; I think there is too much potential for trouble with a separate ReadMe/header file.

As for "is this astronomy-specific enough?", well, it addresses a need that astronomers have, so that seems like enough to me.

@taldcroft

Member Author

One possible change here relates to having a format that is easily embeddable in ASDF. This was part of the driver for using YAML in ECSV (formerly known as DTIF which used JSON). In the current asdf-standard doc there is an example of an ndarray table with a header like:

!core/ndarray
  source: 0
  shape: [64]
  dtype:
    - name: coordinate
      dtype:
        - name: ra
          dtype: float64
        - name: dec
          dtype: float64
    - name: kernel
      dtype: float32
      shape: [3, 3]
  byteorder: little

The YAML for ECSV looks like:

  columns:
   - {name: a, unit: m / s, type: int64, format: '%5.2f', description: Column A}
   - name: b
     type: int64
     meta:
       column_meta: {a: 1, b: 2}

So one possibility is to change the top-level columns to dtype and also the column type to dtype. I chose the original names for human readability and to be a bit more implementation agnostic (where dtype obviously comes from numpy). However, the schema proposed in ASDF naturally supports nested tables via multiple layers of dtype. I think that in theory the same could be achieved in ASDF using columns to introduce a tabular element while reserving type to specify only a simple data type, but it isn't clear that is the best way.

@mdboom @embray @perrygreenfield ?

@mdboom

mdboom commented Jan 20, 2015

Contributor

I can see some advantages to both approaches to describing the layout of the data. I think the ASDF one is more flexible in what structures it can describe, however ECSV allows for the extra metadata. Maybe the thing to do is to follow ASDF's pattern, but add a place for arbitrary metadata and a unit?

Of course, both of these are limited to what Numpy can do -- for an example of where it might go in the future, Continuum IO's datashape project captures a lot more flexibility and real-world needs when describing data that might be closer to where all this ends up if we start thinking outside of the Numpy box. But it's probably premature to make that leap yet. My long-term plan is that a future version of ASDF may support a more general data description "language", but what we have now is 80% there and easy to support in a number of languages.

And lastly, on the term dtype. I guess I was unintentionally following a Numpy bias there. Since ASDF is by no means set in stone, maybe we change it to datatype, which seems sufficiently generic, but also more specific than just type which I find maybe a bit too ambiguous/overloaded.

@taldcroft

Member Author

Ah, I like datatype as being descriptive and neutral. If nobody raises objections I'll change ECSV to use datatype in place of the current columns and type.

@taldcroft

Member Author

Made the change in 9480c56.

@mdboom

mdboom commented Jan 20, 2015

Contributor

I've made the dtype -> datatype rename for ASDF in asdf-format/asdf#52 and asdf-format/asdf-standard#46

Comment thread APE6.rst

Member


Perhaps it's useful to add some words about why order preservation for metadata is useful. I've had to explain that a surprising number of times in the context of ASDF. Some people think it's optional, but I think there's a strong case to be made that it is an essential feature.

Member Author


What do you usually say as your first example of why it's needed? For me it is crucial for human readability and file round-tripping, but I don't know of examples where there is a real contextual dependence on keyword order.
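The round-tripping part can be illustrated with a small sketch. It uses the standard-library `json` module purely as a stand-in for the YAML header (an assumption made so the snippet has no dependencies), and the metadata keys are borrowed from the column attributes discussed above.

```python
import json

# Header text as a human wrote it, in a deliberate (non-alphabetical) order.
original = '{"name": "a", "unit": "m / s", "description": "Column A"}'

# Python dicts keep insertion order (3.7+), so a plain load/dump
# reproduces the text exactly.
meta = json.loads(original)
same_order = json.dumps(meta)

# A writer that does not preserve order (simulated here by sorting keys)
# yields equivalent data but different text.
sorted_order = json.dumps(meta, sort_keys=True)
```

Both outputs load back to the same mapping, but only the order-preserving one gives a file that diffs cleanly against the original and keeps `name` first where a reader expects it.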

@taldcroft taldcroft changed the title from "APE6 - Data-table Text Interchange Format" to "APE6 - Enhanced Character Separated Values table format" Jan 22, 2015
@taldcroft

Member Author

OK, I think that I've addressed all comments except one.

There is an idea from @PaulKuin to allow a configurable comment character (e.g. !), but I think for the initial 0.9 release I would like to stick with the minimal idea of a fixed '#' comment character. If it turns out in practice that this is a real limitation then we can expand the standard. This will be easier than restricting it later.

So ... is this ready? The last commit would be updating the final Disposition section. This PR should be merged before astropy/astropy#2319 so that I can put links to APE-6 in the astropy docs.

@taldcroft

Member Author

I've pushed commit 1963a2f which updates the Decision rationale and status to reflect the final document. This is obviously predicated on actual acceptance.

Coordinating committee - is this ready for a decision today?

@astrofrog @eteq @perrygreenfield

@eteq eteq merged commit 2b6895f into astropy:master Jan 26, 2015
eteq added a commit that referenced this pull request Jan 26, 2015
@eteq

eteq commented Jan 26, 2015

Member

As of right now, this APE has been accepted.

@taldcroft - FYI for future APEs: normally the coordination committee member who does the merging fills out the decision rationale and related bookkeeping. I'm not sure it says that anywhere, so it's fine that you did that, but I made some additional changes in the final merged version to flesh out the discussion a bit more.

@taldcroft

Member Author

@eteq @astrofrog @perrygreenfield - thanks!

And yes, I would have been quite happy to leave writing the decision rationale to someone else. => PR #9.


9 participants