APE6 - Enhanced Character Separated Values table format #7
Can you change the PR references to links for the convenience of the reader?
E.g. astropy/astropy#2319 in this case.
|
@cdeil - done. |
|
I Googled From a very quick it looks to me like the main difference is that they put the JSON metadata in a separate file (I'm sure there are other differences, as I said I didn't look in detail at either spec). mixed with the JSON? Do you think that other existing format is an option? |
|
@cdeil - Very nice that you found this existing standard for specifying fields and tabular data. Nicer that it largely coincides with what I made up and amplifies the arguments I made for both JSON and CSV. My only strong disagreement with the standards you found is splitting the data out into a separate file. I don't see any really compelling reason to suffer the problems associated with using two files instead of one. I believe most CSV readers support the concept of a comment line. About the non-metadata comments, the original version (astropy/astropy#683) did actually allow for non-metadata comments by encapsulating the JSON part in a section delimited by special START and STOP lines. This is reasonable, but I took that out to simplify things overall. It would be easy enough to put back in if that was the consensus. |
|
Agreed ... one file is nicer than two! But putting the JSON in a separate file instead of in the
It could be worth adding a link to http://dataprotocols.org/ to the APE for now, and maybe starting a mailing list thread or GitHub issue with the guys from http://dataprotocols.org/ to ask what they think about extending their format specification to optionally allow the one-file format where the JSON is in a comment in the CSV file? |
|
Having done a little bit of googling e.g. I've read the http://dataprotocols.org/ standard now and it looks generally quite good. I'm planning to modify APE6 to reflect using this to define the JSON content. One issue is that the type system (http://dataprotocols.org/json-table-schema/#field-types) doesn't include support for specific bit lengths, which is needed for round-tripping scientific data tables. We might be able to work to extend the standard accordingly. So the big question is one file or two. Since a primary goal is interoperability, I guess we do need to consider whether the two-file solution is more effective and worth it. |
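To illustrate the bit-length concern raised above, here is a minimal sketch using only the Python standard library. The helper name `round_trip` is hypothetical, and `struct` format codes stand in for column types; nothing here is part of APE6 or the Data Protocols spec. It shows that the same value stored at 32-bit and 64-bit precision reads back differently, so a header that records only a generic numeric type does not tell a reader which binary type to restore.

```python
import struct


def round_trip(value, fmt):
    """Pack a value with the given struct format and unpack it again.

    Hypothetical helper for illustration: '<f' is a 32-bit float,
    '<d' a 64-bit float, '<b' an 8-bit int, '<q' a 64-bit int.
    """
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


x = 0.1  # Python floats are 64-bit

as_float64 = round_trip(x, "<d")  # survives unchanged
as_float32 = round_trip(x, "<f")  # rounded to the nearest 32-bit float

print(as_float64 == x)  # True
print(as_float32 == x)  # False

# Integers have the same issue: 200 fits in 64 bits but not in a signed 8-bit int.
try:
    struct.pack("<b", 200)
    fits_in_int8 = True
except struct.error:
    fits_in_int8 = False
print(fits_in_int8)  # False
```

Recording the bit length alongside the type name (for example a 32-bit versus a 64-bit float) is what lets a reader rebuild the original column exactly.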
|
Links and discussion of the Tabular Data Package have been added. |
|
Contacted the Data Protocols org: |
|
For the single-file protocol (if this gets accepted) we might consider something like: It's not 100% bulletproof, but probably 99.99% OK. This seems a lot cleaner than START STOP lines and other special markers. |
Since JSON is also human readable, we should leverage that and add a "schema_description" (or similar) field that points to the astropy documentation or the dataprotocols website.
|
I personally prefer the two-file approach or at least the option of a two-file approach (similar to CDS where the header can be on top of the datafile or in a separate file).
|
@hamogu - interesting point about allowing for an alternate data back end, though in this case I wonder if you aren't better off just pickling the Table. In any case I'm coming around to the view that we will support the two-file format no matter what, so the question is about the one-file version. Just one comment about using ASCII tables. As far as disk space goes they are actually not all that bad, since int64 or float64 values can frequently be written in ASCII in around 8 or fewer bytes. The killer of course is speed, since there is no getting around the high overhead of going from text repr <=> binary. |
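The disk-space remark above can be checked with a small sketch. The helper name `text_and_binary_sizes` and the sample values are assumptions chosen for illustration, and delimiters and newlines are not counted. It compares the length of Python's shortest round-trip text form with the fixed 8 bytes of an int64 or float64.

```python
import struct


def text_and_binary_sizes(value):
    """Return (bytes as text, bytes as 64-bit binary) for an int or float."""
    text = repr(value)  # shortest text that converts back to the same value
    fmt = "<q" if isinstance(value, int) else "<d"
    return len(text), len(struct.pack(fmt, value))


# Short values, as often found in catalogs, take fewer than 8 characters.
for value in [42, -17, 123456, 3.14, 0.0015]:
    n_text, n_binary = text_and_binary_sizes(value)
    print(f"{value!r:>10}  text={n_text}  binary={n_binary}")

# A float that needs full precision is longer as text than as binary.
x = 0.1 + 0.2
print(repr(x), text_and_binary_sizes(x))

# The text form still round-trips exactly.
assert float(repr(x)) == x
```

Typical short values do fit in about 8 characters, but a float64 can need up to 17 significant digits to round-trip, so the text size depends on the data.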
|
I've updated the APE-6 standard proposal for transition to YAML and a general simplification (making the aims a bit less lofty). |
Final APE cleanup
|
Yes, true enough, but what about other metadata? The idea was to have a In other words, what you are proposing is too restrictive (trying to put all Sorry to have to say that. On Thu, Jan 15, 2015 at 11:12 PM, Tom Aldcroft notifications@github.com
|
I don't see that the CDS format and APE6 are in competition. CDS is a general format for astronomical tables that is sanctioned by major astronomical institutions (CDS, ApJ, A&A), and it is set up to contain metadata that follows a certain standard and is of long-term interest. With the format suggested in APE6 I can store a table I am currently working on between two Python sessions, or I can quickly email it to a collaborator who uses the exact same software as I do. It handles arbitrary metadata in an automatic and machine-readable way, e.g. I can have a column whose metadata is a nested list of Python dictionaries. Such a structure would be hard to encode in a CDS table, and even if I managed to put it in the README file, this specific file would only be readable for me and my specific code. In my opinion, APE6 and CDS seem to address opposite requirements: CDS has a standardized format with fixed conventions (e.g. all CDS tables use "e_x" as the column name for the uncertainty on column "x") while APE6 allows for totally free-form metadata - that is great for data exchange between collaborators or persistent storage between my Python sessions. APE6 tries to go a little bit beyond that by defining a handful of "standard keywords" (e.g. "name" and "unit") and keeping the data in a CSV format that many programs can read, but it does not aim for the complete metadata header that is found in most CDS tables. |
|
@PaulKuin - As @hamogu has mentioned, the allowed metadata is essentially unrestricted. There is a To my knowledge none of the available astronomical ASCII data formats support (essentially) arbitrary metadata and place the burden of parsing header metadata on a standard and widely available library (YAML). In that sense I would argue that this is not entirely re-inventing the wheel. Of course there is the Tabular Data Package (TDP) that uses JSON. This preceded ECSV and shares much of the design philosophy. But after some discussion here we decided that YAML would be a better fit since it is far easier for humans to read and write. On the point of re-inventing TDP as ECSV, definitely guilty. |
|
@notifications@github.com , @hamogu - In 1993 we were faced with Tables entered by hand that needed to go through But quite rightly you mention good reasons to use YAML parsing, some of A separate application could be made to enforce the kind of restrictions |
|
I am no kind of tabular-data-interchange-format expert, but I have spent quite a bit of time wrestling data out of IPAC/CDS/LaTeX etc formats into something usable. The standard proposed here seems like it would make my life easier, so I'm all for it. I would argue strongly in favour of having all metadata in the same file as the actual tabular data; I think there is too much potential for trouble with a separate ReadMe/header file. As for "is this astronomy-specific enough?", well, it addresses a need that astronomers have, so that seems like enough to me. |
|
One possible change here relates to having a format that is easily embeddable in ASDF. This was part of the driver for using YAML in ECSV (formerly known as DTIF which used JSON). In the current asdf-standard doc there is an example of an ndarray table with a header like: The YAML for ECSV looks like: So one possibility is to change the top-level |
|
I can see some advantages to both approaches to describing the layout of the data. I think the ASDF one is more flexible in what structures it can describe, however ECSV allows for the extra metadata. Maybe the thing to do is to follow ASDF's pattern, but add a place for arbitrary metadata and a unit? Of course, both of these are limited to what Numpy can do -- for an example of where it might go in the future, Continuum IO's datashape project captures a lot more flexibility and real-world needs when describing data that might be closer to where all this ends up if we start thinking outside of the Numpy box. But it's probably premature to make that leap yet. My long-term plan is that a future version of ASDF may support a more general data description "language", but what we have now is 80% there and easy to support in a number of languages. And lastly, on the term |
|
Ah, I like |
|
Made the change in 9480c56. |
|
I've made the dtype -> datatype rename for ASDF in asdf-format/asdf#52 and asdf-format/asdf-standard#46 |
Perhaps it's useful to add some words about why order preservation for metadata is useful. I've had to explain that a surprising number of times in the context of ASDF. Some people think it's optional, but I think there's a strong case to be made that it is an essential feature.
What do you usually say as your first example of why it's needed? For me it is crucial for human readability and file round-tripping, but I don't know of examples where there is a real contextual dependence on keyword order.
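The round-tripping argument can be shown with a short sketch. It uses JSON from the Python standard library rather than the YAML that ECSV adopted, purely so that it runs without extra packages, and the keys are hypothetical. When key order is preserved the file text comes back byte-for-byte; when it is not, the content is equal but the text changes, which hurts readability and produces noisy diffs.

```python
import json
from collections import OrderedDict

# Hypothetical column description; the keys are chosen only for illustration.
text = '{"name": "a", "unit": "m", "description": "length"}'

# Reading with an order-preserving mapping keeps the author's key order.
ordered = json.loads(text, object_pairs_hook=OrderedDict)
same_text = json.dumps(ordered)
print(same_text == text)  # True: exact round trip of the text

# Writing with a different key order keeps the content but changes the text.
resorted = json.dumps(ordered, sort_keys=True)
print(json.loads(resorted) == json.loads(text))  # True: same content
print(resorted == text)                          # False: different file
```

This covers readability and round-tripping only; it is not an example of metadata whose meaning depends on key order.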
|
OK, I think that I've addressed all comments except one. There is an idea from @PaulKuin to allow a configurable comment character (e.g. So ... is this ready? The last commit would be updating the final |
|
I've pushed commit 1963a2f which updates the Decision rationale and status to reflect the final document. This is obviously predicated on actual acceptance. Coordinating committee - is this ready for a decision today? |
|
As of right now, this APE has been accepted. @taldcroft - FYI for future APEs: normally the coordination committee member who does the merging fills out the decision rationale and related bookkeeping. I'm not sure it says that anywhere, so it's fine that you did that, but I made some additional changes in the final merged version to flesh out the discussion a bit more. |
|
@eteq @astrofrog @perrygreenfield - thanks! And yes, I would have been quite happy to leave writing the decision rationale to someone else. => PR #9. |
APE6 is primarily a specification of a new standard for the interchange of tabular data in a text-only format. The proposed format handles the key issue of serializing column specifications and table metadata by using a JSON-encoded data structure which is included in the file as a series of #-prefixed comment lines. The actual tabular data are stored in a standard comma-separated-values (CSV) format, giving compatibility with a wide variety of non-specialized CSV table readers. Using JSON makes it extremely easy for applications to read both the standardized data format elements (e.g. column name, type, description) and complex metadata structures. Support for schemas to describe and validate the metadata is part of the standard, along with support for non-ASCII Unicode character encoding.
The implementation in astropy.io.ascii is relatively straightforward and will provide a significant benefit of allowing text serialization of most astropy Table objects, persistent storage, and subsequent interchange with other users.
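As a rough illustration of the single-file layout described above (a JSON header in #-prefixed comment lines followed by CSV data), here is a minimal standard-library sketch. The header keys (`columns`, `name`, `type`, `meta`), the sample content, and the function name are assumptions made for this example. They are not the keys defined by APE6, and this is not the astropy.io.ascii implementation.

```python
import csv
import io
import json

# Hypothetical file content; the header keys are illustrative only.
TEXT = """\
# {"columns": [{"name": "a", "type": "int64"},
#              {"name": "b", "type": "float64"}],
#  "meta": {"author": "example"}}
a,b
1,2.5
3,4.0
"""


def read_commented_header_csv(text):
    """Split '#'-prefixed header lines from CSV data lines and parse both."""
    header_lines = []
    data_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Drop the comment character and keep the JSON text after it.
            header_lines.append(line[1:])
        elif line.strip():
            data_lines.append(line)
    header = json.loads("\n".join(header_lines))
    # csv.DictReader takes the first data line as the column names.
    rows = list(csv.DictReader(io.StringIO("\n".join(data_lines))))
    return header, rows


header, rows = read_commented_header_csv(TEXT)
print([col["name"] for col in header["columns"]])
print(rows[0])
```

A reader that does not know about the header only needs to skip lines starting with `#` to get at the plain CSV data, which is the compatibility argument made in the description. Note that the values come back as strings here; converting them using the declared column types is left out of this sketch.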