@jcrist jcrist commented Aug 27, 2022

This adds support for constrained types, implementing the following constraints from json-schema:

  • Numeric constraints ge, gt, le, lt, and multiple_of correspond to minimum, exclusiveMinimum, maximum, exclusiveMaximum, and multipleOf respectively
  • pattern corresponds to pattern
  • min_length and max_length correspond to minLength/maxLength, minItems/maxItems, or minProperties/maxProperties, depending on whether the annotated type is a string, sequence, or mapping.
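The mapping in the list above can be written down as a small lookup. This is only an illustrative sketch using the standard library: `SIMPLE_KEYWORDS`, `LENGTH_KEYWORDS`, `kind_of`, and `json_schema_keyword` are hypothetical names invented here, not part of msgspec.

```python
from collections.abc import Mapping, Sequence

# Numeric and pattern constraints map one-to-one onto JSON Schema keywords.
SIMPLE_KEYWORDS = {
    "ge": "minimum",
    "gt": "exclusiveMinimum",
    "le": "maximum",
    "lt": "exclusiveMaximum",
    "multiple_of": "multipleOf",
    "pattern": "pattern",
}

# Length constraints depend on the kind of type being annotated.
LENGTH_KEYWORDS = {
    "string": ("minLength", "maxLength"),
    "sequence": ("minItems", "maxItems"),
    "mapping": ("minProperties", "maxProperties"),
}


def kind_of(tp: type) -> str:
    """Classify a concrete type as string, mapping, or sequence."""
    # str is itself a Sequence, so it has to be checked first.
    if issubclass(tp, str):
        return "string"
    if issubclass(tp, Mapping):
        return "mapping"
    if issubclass(tp, Sequence):
        return "sequence"
    raise TypeError(f"length constraints do not apply to {tp!r}")


def json_schema_keyword(constraint: str, tp: type) -> str:
    """Return the JSON Schema keyword for a constraint name on a given type."""
    if constraint in SIMPLE_KEYWORDS:
        return SIMPLE_KEYWORDS[constraint]
    if constraint in ("min_length", "max_length"):
        lower, upper = LENGTH_KEYWORDS[kind_of(tp)]
        return lower if constraint == "min_length" else upper
    raise KeyError(constraint)
```

For instance, `json_schema_keyword("min_length", list)` gives `"minItems"`, while the same constraint on `dict` gives `"minProperties"`.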

Constraints are added by wrapping a type with Annotated and the special msgspec.Meta annotation. This is nice, since it works natively with mypy or pyright, and can be applied anywhere in the type hierarchy, not just on a struct field. These can be used to build up some fairly complicated validations.

For example, here we define a type with:

  • a field x that takes a list of at least 3 integers greater than or equal to 0
  • a field y that takes a string matching a regex for a valid Unix username

from typing import Annotated, List

from msgspec import Struct, Meta, json


class Example(Struct):
    x: Annotated[
        List[Annotated[int, Meta(ge=0)]],
        Meta(min_length=3),
    ]
    y: Annotated[
        str,
        Meta(pattern=r"^[a-z_]([a-z0-9_-]{0,31}|[a-z0-9_-]{0,30}\$)$")
    ]


dec = json.Decoder(Example)

Validation happens during decoding and, except for pattern validation, has negligible overhead. Users should feel comfortable adding constraints to types as needed without worrying about performance impacts.

Errors are raised for values not matching the specified constraints, just as they are for values not matching the specified types:

>>> dec.decode(b'{"x": [1, 2, 3], "y": "alice"}')  # ok
Example(x=[1, 2, 3], y='alice')

>>> dec.decode(b'{"x": [1, 2], "y": "ben"}')  # not enough items
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
msgspec.ValidationError: Expected `array` of length >= 3 - at `$.x`

>>> dec.decode(b'{"x": [1, 2, -1], "y": "carol"}')  # items must be >= 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
msgspec.ValidationError: Expected `int` >= 0 - at `$.x[2]`

>>> dec.decode(b'{"x": [1, 2, 3], "y": "dave\'s username"}')  # doesn't match regex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
msgspec.ValidationError: Expected `str` matching regex '^[a-z_]([a-z0-9_-]{0,31}|[a-z0-9_-]{0,30}\\$)$' - at `$.y`

The Meta objects also contain support for json-schema metadata annotations (title, description, ...), but currently nothing is done with them. Generation of a json-schema spec from a type (#125) will happen in a follow-up PR.

Fixes #154 (kind of). I'm reading the bulk of that issue as "support value constraints", which this adds. The remaining bit of supporting the annotated-types library isn't done, and I'm not 100% sure I want to support it yet. I'll open a follow-up issue to address that, rather than continuing discussion in #154.

Still needs docs, but otherwise this should be done.

jcrist added 12 commits August 12, 2022 10:02
- Switches to using a uint64_t for a bitmask. On 64bit platforms this
shouldn't result in increased memory usage, since these types already
had to be 64 bit aligned anyway.
- Moves to using a union to contain type details, instead of a `void *`.

With these changes the TypeNode structure is now able to contain all the
constraint information we'll eventually want to encode in it.

Adds a `Meta` object for holding extra metadata for a specific TypeNode.
Also sets up framework for other constraints.

Instead of having a single callback after decoding, this moves the
type-specific checks to be called directly from the type-specific
decoding routines. This mainly speeds up numeric constraint checking,
and also allows msgpack length constraints to happen _before_ the object
is decoded.

Also fixes a bug in constraints on dict keys.
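The point about msgpack length constraints can be sketched in pure Python. In msgpack, an array's element count is encoded in its header, so a length check can run before any element is decoded. This is not msgspec's C implementation: `decode_fixarray_of_fixints` is a hypothetical toy that only handles a fixarray of positive fixints.

```python
def decode_fixarray_of_fixints(buf: bytes, min_length: int = 0) -> list:
    """Decode a msgpack fixarray of positive fixints, checking length up front."""
    if not buf:
        raise ValueError("empty input")
    header = buf[0]
    # fixarray headers are 0x90..0x9f; the low 4 bits hold the element count.
    if header & 0xF0 != 0x90:
        raise ValueError("expected a msgpack fixarray header")
    count = header & 0x0F

    # The length constraint is checked from the header alone,
    # before any element has been decoded.
    if count < min_length:
        raise ValueError(f"Expected `array` of length >= {min_length}")
    if len(buf) < 1 + count:
        raise ValueError("input data was truncated")

    items = []
    for byte in buf[1 : 1 + count]:
        # Positive fixints are single bytes in the range 0x00..0x7f.
        if byte > 0x7F:
            raise ValueError("this sketch only supports positive fixints")
        items.append(byte)
    return items
```

With JSON, by contrast, the length of an array is only known once the closing bracket is reached, which is presumably why the commit message singles out msgpack here.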

jcrist commented Aug 27, 2022

cc @adriangb. This is part of what you asked for.

We still don't integrate with the annotated-types library itself. Right now I'm +0.1 on making the Meta objects produce valid annotated-types (using __iter__ or something), but am not sold on msgspec consuming annotated-types annotations yet. There's a maintenance and understandability cost on supporting multiple ways of spelling something, and I'm not fully convinced it's worth it here. I'd want to hear a request from other users before spending time on this.


adriangb commented Aug 27, 2022

First thing I'll say is that this is amazing. The C part is way over my head. But it's really cool that you got so much of this working at once.

w.r.t. annotated-types, I'm curious to hear the cons to integrating with it. I can imagine several, but I want to make sure we're on the same page as to which before I try to think of ways to mitigate them.

> There's a maintenance and understandability cost on supporting multiple ways of spelling something, and I'm not fully convinced it's worth it here.

From the developer's perspective or the user's? I definitely understand from a developer's perspective, but I think it'd be transparent from a user's perspective.

Also were you thinking of having msgspec understand annotated-types annotations, having it create annotated-types annotations (via Meta.__iter__) or both?


jcrist commented Aug 27, 2022

> First thing I'll say is that this is amazing. The C part is way over my head. But it's really cool that you got so much of this working at once.

Thanks! It's been fun to work on. There's some nuance to making these validations fast, but I'm really happy with how this has turned out.

> Also were you thinking of having msgspec understand annotated-types annotations, having it create annotated-types annotations (via Meta.__iter__) or both?

I'd be open to making msgspec create annotated-types annotations (via Meta.__iter__), since this is self contained and easier to do. I'm less open to making it understand annotated-types annotations, since this would complicate the implementation and provide multiple ways to spell the same thing, which may confuse users (more on this below). I'd want to see other users requesting annotated-types integration before spending any time on this myself, since I'm not convinced of the use case.
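To make the `__iter__` idea concrete, here is a dependency-free sketch. `SketchMeta`, `Ge`, and `Le` are stand-ins invented for illustration: the real msgspec.Meta does not define `__iter__` in this PR, and the stand-in constraint classes only loosely mirror the shape of annotated-types objects.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Union


@dataclass(frozen=True)
class Ge:
    """Stand-in for a 'greater than or equal' constraint object."""
    ge: float


@dataclass(frozen=True)
class Le:
    """Stand-in for a 'less than or equal' constraint object."""
    le: float


@dataclass(frozen=True)
class SketchMeta:
    """Hypothetical Meta-like object that can expand into individual constraints."""
    ge: Optional[float] = None
    le: Optional[float] = None

    def __iter__(self) -> Iterator[Union[Ge, Le]]:
        # Yield one constraint object per configured bound.
        if self.ge is not None:
            yield Ge(self.ge)
        if self.le is not None:
            yield Le(self.le)
```

A consumer that understands the individual constraint objects could then call `list(SketchMeta(ge=0, le=10))` without knowing anything about the grouping object itself.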

> I'm curious to hear the cons to integrating with it. I can imagine several, but I want to make sure we're on the same page as to which before I try to think of ways to mitigate them.

There are two main concerns:

Multiple ways of spelling the same thing.

I want msgspec to be easy to use and understand, which means limiting the number of concepts and configurations. We intentionally don't support all the flexibility of pydantic, since the number of config options there can be overwhelming and confusing to users. If we support annotated-types, then we need to demonstrate their usage in the docs, and explain why a user might choose one or the other.

annotated-types also supports some constraints we don't (Predicate), and doesn't support some annotations we do (the json-schema metadata things like title, description, ...). This would force a user to import several different annotations from different places, which doesn't seem as ergonomic to me as a single Meta import.

There are also some incompatibilities in meaning that might trip a user up (max_length here is inclusive, while it's exclusive in annotated-types).

These are surmountable issues given proper documentation, but it'd require some work to present in a clear and concise way.
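The inclusive versus exclusive difference is easy to overlook, so a tiny sketch may help. The two helpers below are hypothetical and only model the two conventions as described in this comment; they are not taken from either library.

```python
from typing import Sized


def within_max_inclusive(value: Sized, max_length: int) -> bool:
    """Inclusive upper bound: a value of exactly max_length items is accepted."""
    return len(value) <= max_length


def within_max_exclusive(value: Sized, max_exclusive: int) -> bool:
    """Exclusive upper bound: a value of exactly max_exclusive items is rejected."""
    return len(value) < max_exclusive
```

With a bound of 3, the string "abc" is accepted under the inclusive convention and rejected under the exclusive one, so the same number cannot be copied between the two spellings without adjusting it by one.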

Complicated implementation

I want msgspec classes to be fast to define (see the import time benchmark). This means limiting the amount of work done at import time, and limiting the number of sub-packages imported. We also do (almost) all conversion from Python type annotations into TypeNode objects in C. If we support annotated-types, then this implementation would get a lot more complicated. The easiest way to do it would be to define a utility in Python code that converts annotated-types annotations into our own internal Meta objects, so the C code only needs to understand Meta annotations. This would be possible, but would require some care to ensure it doesn't slow down type construction time.
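A Python-level conversion utility of the kind described might look roughly like the following. This is a sketch under stated assumptions, not a proposed implementation: `Ge`, `Lt`, and `MultipleOf` are local stand-ins rather than the real annotated-types classes, and `meta_kwargs` is a hypothetical helper that returns the keyword arguments one would pass to Meta.

```python
from dataclasses import dataclass
from typing import Annotated, Any, Dict, get_args, get_origin


# Stand-ins for third-party constraint objects (hypothetical, for illustration).
@dataclass(frozen=True)
class Ge:
    ge: Any


@dataclass(frozen=True)
class Lt:
    lt: Any


@dataclass(frozen=True)
class MultipleOf:
    multiple_of: Any


def meta_kwargs(annotation: Any) -> Dict[str, Any]:
    """Collect Meta-style keyword arguments from an Annotated type's extras."""
    if get_origin(annotation) is not Annotated:
        return {}
    kwargs: Dict[str, Any] = {}
    # get_args returns (base_type, *extras) for Annotated types.
    for extra in get_args(annotation)[1:]:
        if isinstance(extra, Ge):
            kwargs["ge"] = extra.ge
        elif isinstance(extra, Lt):
            kwargs["lt"] = extra.lt
        elif isinstance(extra, MultipleOf):
            kwargs["multiple_of"] = extra.multiple_of
        # Unknown extras are ignored in this sketch.
    return kwargs
```

For example, `meta_kwargs(Annotated[int, Ge(0), Lt(10)])` yields `{"ge": 0, "lt": 10}`. A real version would also need to recurse into nested types and keep the per-type cost low, which is the type construction time concern raised above.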

Neither of these issues is a for-sure "no", but I'm hesitant to spend time on this for now until more users ask for it.


jcrist commented Aug 28, 2022

Not sure why coverage is failing. If you look at the diff, all of this patch is covered 🤷. Merging!

@jcrist jcrist merged commit 7a9c79d into main Aug 28, 2022
@jcrist jcrist deleted the constraints branch August 28, 2022 19:18