Implement support for the ECSV format proposed in APE6 #2319

Merged
eteq merged 37 commits into astropy:master from taldcroft:ascii-dtif on Jan 26, 2015

@taldcroft

Member

APE6 proposes a new standard Data-table Text Interchange Format for storing data tables in a text-only format. This PR provides a demonstration implementation of that in astropy.io.ascii. This is by no means complete and should not be merged.

@taldcroft

Member Author

DTIF is Not dead yet... see http://nbviewer.ipython.org/gist/taldcroft/a13b670ab15db5684f49

This iteration of the DTIF reader/writer now uses YAML and is simplified from the original APE6 idea. Some points:

  • This is all still proof of concept.
  • For simple tables without much meta, the header definition is quite simple and is reasonably close to what @eteq requested in the venerable "Include units in Table column descriptions" issue (#756).
  • Not to state the obvious, but PyYaml is a dependency for this.
  • It uses custom Loader and Dumper classes that handle OrderedDict nicely by using the !!omap tag. I got this from a gist via http://pyyaml.org/ticket/29. This should also (I think) allow using a safe loader.
  • One decision would be whether to keep this as a new format or make this extra header an optional argument to the io.ascii read/write routines.
  • The hope is that by going to YAML it should be easier to integrate an ASCII table as a data block in ASDF (@mdboom, @embray).

@mhvk

mhvk commented Sep 5, 2014

Contributor

👍 to the format! Definite improvement for sending some small table to collaborators.

One small item: would it be possible to ensure that in the output file, the column name always is the first entry? Since it is an unordered dict, I would guess it does not matter for reading, but for human viewing it is good. Indeed, a fixed order for output is probably best, say name, unit, type, format, anything else.

EDIT: well, probably format before type, since it often implies it anyway.

@taldcroft

Member Author

Maintaining the ordering would probably require using !!omap for the column attributes. By default this would serialize to something like:

columns:
- !!omap
  - {name: a}
  - {unit: m / s}
  - {type: float32}
- !!omap
  - {name: b}
  - {unit: km}
  - {type: float64}

Possibly there is a clever way to compactify this, but I'm not sure. Note that in the YAML output the keys are in alphabetical (not random) order, so for most use cases (when there is no format or description) name will be first.

@mhvk

mhvk commented Sep 5, 2014

Contributor

That output looks substantially less nice... But is there a requirement to have the items be in alphabetical order? I.e., could one just postprocess the column lines and put name first?

@taldcroft

Member Author

> Indeed, a fixed order for output is probably best, say name, unit, type, format, anything else.

OK, I figured out a clean way to do this. As for the question of order, I think I prefer your original of having type before format. First, type will always be there, while format is somewhat rare, so the ordering will be more consistent. Also, I think of type being a more fundamental property, so it should be higher priority (more to the left).
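
The ordering discussed here (name, unit, type, format, then anything else) could be sketched roughly as follows. This is only an illustration using the standard library, not the code in this PR; `PREFERRED_ORDER`, `order_column_attrs` and `format_column_line` are hypothetical names, and the line formatter does no YAML quoting or escaping.

```python
from collections import OrderedDict

# Assumed preferred key order, taken from the discussion in this thread.
PREFERRED_ORDER = ("name", "unit", "type", "format")


def order_column_attrs(attrs):
    """Return an OrderedDict with the preferred keys first, the rest sorted."""
    ordered = OrderedDict()
    for key in PREFERRED_ORDER:
        if key in attrs:
            ordered[key] = attrs[key]
    for key in sorted(attrs):
        if key not in ordered:
            ordered[key] = attrs[key]
    return ordered


def format_column_line(attrs):
    """Render one column as a flow-style mapping line (no quoting/escaping)."""
    items = ", ".join(
        "{}: {}".format(key, value)
        for key, value in order_column_attrs(attrs).items()
    )
    return "- {" + items + "}"


line = format_column_line({"type": "float32", "unit": "m / s", "name": "a"})
print(line)  # - {name: a, unit: m / s, type: float32}
```

Because type is always present and format is rare, most column lines come out with the same leading keys, which is the consistency argument made above.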

@astrofrog

Member

Just a quick comment - what is a way to unambiguously identify a file as DTIF? I'm thinking maybe we could consider using the first line as a file format signature, optionally with a format version? The nice thing about e.g. HDF5 is that if you read the first 8 bytes, you know it's an HDF5 file. So having a format signature would be nice.
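
For reference, the HDF5 check mentioned here amounts to comparing the first 8 bytes against a fixed signature. A minimal sketch, assuming the signature sits at offset 0 (files with a user block are ignored); `looks_like_hdf5` is a hypothetical helper name:

```python
# The 8-byte HDF5 format signature: \x89 H D F \r \n \x1a \n
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(first_bytes):
    """Return True if the buffer starts with the HDF5 signature."""
    return first_bytes[:8] == HDF5_SIGNATURE


print(looks_like_hdf5(b"\x89HDF\r\n\x1a\n\x00\x00"))  # True
print(looks_like_hdf5(b"a b\n1.0 2\n"))               # False
```

A text format can do the same thing with a fixed first line, which is cheap to test without parsing the rest of the file.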

@astrofrog

Member

I'm thinking something like:

# format: DTIF1
# columns:
# - {name: a, type: float32, unit: m / s}
# - {name: b, type: uint8}
a b
1.0 2

@astrofrog

Member

Just another comment - if we go ahead with this, I think we should straight away provide dtiflint, a command-line tool to validate DTIF tables, to make sure that anyone else writing custom writers can test it straight away. Note that a linter can be stricter than the reader.

@astrofrog

Member

Another request - I think DTIF should be very clear on how to mask values and the output in the file should preferably be e.g. - rather than the usual paradigm of 'fill' values and null values in the header.

@mdboom

mdboom commented Sep 5, 2014

Contributor

@astrofrog: YAML has a standard for specifying the file type, which is a line starting with %. So in this case:

# %DTIF-1.0

@astrofrog

Member

@mdboom - perfect! I'd highly recommend doing this.

@taldcroft

Member Author

@mdboom - when I put in the %DTIF-1.0 at the front it gave:

ScannerError: while scanning for the next token
found character '%' that cannot start any token
  in "<string>", line 1, column 2:
     %DTIF-1.0
     ^

Do I need to register this or something? I couldn't find anything in a quick scan of the pyyaml docs, but maybe I didn't look hard enough.

@taldcroft

Member Author

OK, got the ordering fixed in the last commit.

@mdboom

mdboom commented Sep 5, 2014

Contributor

@taldcroft: It seems these metadata lines only work if you have a "document start marker" (---) following them. I don't know if you want to require that for such a simple format. The reader may just have to strip off that first line before passing the rest to pyyaml instead. Ideally, it should be something that can be tested without doing a full YAML parse anyway, so that's not necessarily a bad thing.

@taldcroft

Member Author

As suggested, I have added a DTIF header line and check for its presence manually, then strip it before YAML parsing.
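
The check-then-strip step described above could look something like the sketch below. It is not the code in this PR: `SIGNATURE`, `COMMENT_PREFIX` and `split_header` are hypothetical names, the `# ` comment prefix is assumed from the examples earlier in this thread, and the returned header lines would still need to be handed to a YAML parser.

```python
SIGNATURE = "%DTIF-1.0"  # signature text from this thread; the final form may differ
COMMENT_PREFIX = "# "    # assumed prefix for header lines


def split_header(lines):
    """Return (yaml_header_lines, data_lines) after checking the signature."""
    header = []
    n_header = 0
    for line in lines:
        if not line.startswith(COMMENT_PREFIX):
            break
        header.append(line[len(COMMENT_PREFIX):])
        n_header += 1
    if not header or header[0].strip() != SIGNATURE:
        raise ValueError("missing {} signature on first line".format(SIGNATURE))
    # Drop the signature so the remaining header can go to a YAML parser as-is.
    return header[1:], lines[n_header:]


text = """\
# %DTIF-1.0
# columns:
# - {name: a, type: float32, unit: m / s}
# - {name: b, type: uint8}
a b
1.0 2
"""
yaml_lines, data_lines = split_header(text.splitlines())
print(yaml_lines[0])  # columns:
print(data_lines)     # ['a b', '1.0 2']
```

Checking the first line by hand also means a file can be identified without a full YAML parse.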

@mdboom - now that this is YAML, what do you think should be done to make DTIF most closely integrate with ASDF? One idea was to make it very easy to drop a DTIF file in as a supported data block format. In the current ASDF-standard docs I don't see anything defining how data column meta (type, unit, format, etc) are going to be encoded. DTIF does kind of the simplest possible thing, so do you think that will be a legal subset of what ASDF defines?

Plan B is to purposely keep DTIF as a simple and somewhat specialized "standard" that doesn't necessarily follow the ASDF conventions? It still should be straightforward to write a DTIF encoder/decoder outside of the Python reference implementation (io.ascii.dtif).

@taldcroft

Member Author

The notebook has been updated accordingly: http://nbviewer.ipython.org/gist/taldcroft/a13b670ab15db5684f49

@taldcroft

Member Author

BTW, what about rebranding DTIF as ASCII Table with Meta (ATM)? Maybe "Data Table Interchange Format" overstates the scope of what this really is.

@eteq

eteq commented Sep 11, 2014

Member

👍 from me on this, with a rebranding like you suggested, @taldcroft.

On the rebranding: it's actually not necessarily limited to ASCII, right? That is, unicode is also possible for column names? So maybe instead "Text Table with Meta" (TTM)? That also has the advantage of being a less overloaded acronym, while still being 3 characters so it looks good as a file extension.

@taldcroft

Member Author

Unicode is not possible for column names because numpy doesn't accept them.

In [35]: np.array([(1,)], dtype=[('a', int)])
Out[35]: 
array([(1,)], 
      dtype=[('a', '<i8')])

In [36]: np.array([(1,)], dtype=[(u'a', int)])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-3200fb1aa3d7> in <module>()
----> 1 np.array([(1,)], dtype=[(u'a', int)])

TypeError: data type not understood

The fact that astropy Table accepts unicode column names is trickery on our part. They are encoded to ascii.

@taldcroft

Member Author

@eteq - on further reflection you are completely right that the format definition shouldn't be limited by our current implementation. There is no reason to limit the format to ASCII. TTM could work, but I am starting to think that the "meta" reference might be lost on many users and not quite sink in. Just to toss out an idea I had (on an overnight flight...), what about ECSV for extended (or enhanced) CSV? It kinda rolls off the tongue and brings some known context to make the concept more immediately understandable. I also imagine proposing a PR to pandas and/or numpy to implement read_ecsv as a way of promoting adoption.

@astrofrog

Member

+1 to ECSV :)

@eteq

eteq commented Sep 24, 2014

Member

👍 to ECSV from me too. (And it looks like either ".esv" or ".ecv" are currently unused extensions.)

@eteq

eteq commented Jan 26, 2015

Member

APE6 has been accepted, so I'm merging this. Thanks @taldcroft !

eteq added a commit that referenced this pull request on Jan 26, 2015: Implement support for the ECSV format proposed in APE6

eteq merged commit 9585a8b into astropy:master on Jan 26, 2015

@astrofrog

Member

Thanks @taldcroft! 🎉

@taldcroft

Member Author

Long live the meta!

8 participants