Create your codec

This section is a tutorial on creating new codecs.

As explained in this section, codext provides the possibility to add new codecs in two ways:

add: using this function, the encode and decode functions must be given as arguments.
add_map: using this function, an encoding map must be given but can be formatted in different ways to handle various use cases.

In both cases, a pattern is given as an argument and defines the set of all strings that select this codec.

Codec precedence

codext uses a local registry that is queried first before attempting native codecs lookups. This means that a native codec can be overridden with a pattern that matches the same strings.

The remainder of this section explains how to successfully create a new codec and/or how to get it added to the library.

Contributions welcome !

Remember that you can always submit a request for a new codec or submit your own with a PR for improving codext !

Generic arguments

Whatever solution is chosen, the following arguments shall be considered:

ename (first positional argument): Choose the shortest possible encoding name. If it clashes with another codec, always remember that codext resolves codecs in order of registry, that is from the first added. Also, it resolves codecs based on the given pattern. So, a codec with a clashing name could still be selected if the pattern does not match for the codec with the precedence but matches for this codec.
pattern (keyword-argument): If not defined, it defaults to the encoding name. It can be a regular expression ; in this case, it should not be too broad. A codec decode or encode function can be parametrized through the pattern using the first capture group. It is important to note that the first capture group is used and not any other. This means that any other group definition shall use the do-not-capture specifier, that is "(?:...)".
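The capture group rule can be checked with the standard `re` module alone. The sketch below assumes a hypothetical "shift" codec whose name carries a numeric parameter ; the pattern and the helper `parameter_from_name` are illustrative and not part of codext.

```python
import re

# Hypothetical pattern: the shift amount is the first (and only) capture group.
# The optional suffix uses (?:...) so that it does not create a second group.
PATTERN = re.compile(r"^shift[-_]?(\d+)(?:[-_]cipher)?$")

def parameter_from_name(encoding_name):
    """Return the first capture group of a matching name, or None."""
    match = PATTERN.match(encoding_name)
    return match.group(1) if match else None

print(PATTERN.groups)                          # 1
print(parameter_from_name("shift-3"))          # 3
print(parameter_from_name("shift_13-cipher"))  # 13
print(parameter_from_name("shifty"))           # None
```

Anchoring the pattern with `^` and `$` and restricting the parameter to digits also keeps it from being too broad.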

Too broad pattern

Let us consider the following ; we add a codec whose pattern matches any character in any number of occurrences. It then matches any given encoding name and always resolves to this codec, preventing any codec added afterwards from being resolved.

>>> import codext
>>> identity = lambda text, errors="strict": (text, len(text))
>>> codext.add("everything", identity, identity, pattern=r".*")
>>> codext.encode("test string", "test-encoding-name")  # r".*" matches anything, thus including "test-encoding-name"
'test string'
>>> codext.decode("test string", "test-encoding-name")
'test string'
>>> codext.encode("test string", "morse")               # "morse" has the precedence on codec "everything" we just added
'- . ... - / ... - .-. .. -. --.'
>>> test = lambda text, errors="strict": ("TEST", len(text))
>>> codext.add("test", test)                            # no pattern given ; should then be matched by encoding name "test"
>>> codext.encode("test string", "test")                # should give "TEST" if codec "test" was selected
'test string'                                           # gives the output of codec "everything",
                                                        #  which has precedence over "test" and a too broad pattern
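The shadowing effect can be reproduced with a simplified model of a registry that resolves names in insertion order. `ToyRegistry` below is a hypothetical stand-in written for this tutorial ; it is not codext's implementation, and it assumes that a name must match the pattern entirely.

```python
import re

class ToyRegistry:
    """Simplified model: the first registered pattern that matches wins."""

    def __init__(self):
        self._entries = []  # (compiled pattern, codec name), in insertion order

    def add(self, name, pattern=None):
        # As described above, the pattern defaults to the encoding name.
        self._entries.append((re.compile(pattern or re.escape(name)), name))

    def resolve(self, encoding_name):
        for regex, name in self._entries:
            if regex.fullmatch(encoding_name):
                return name
        return None

broad = ToyRegistry()
broad.add("everything", pattern=r".*")   # too broad: matches any name
broad.add("test")
print(broad.resolve("test"))             # everything

narrow = ToyRegistry()
narrow.add("everything", pattern=r"every(?:thing)?")
narrow.add("test")
print(narrow.resolve("test"))            # test
```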

Which `add` function ?

At this point, it is necessary to determine what kind of codec you want. If it is a simple map of characters, you should definitely use add_map. If it is more complex and cannot be handled using add_map's options, then you should use add and define the encode/decode functions yourself.

A few examples:

morse is a simple map that does not handle case ; it then uses add_map with ignore_case set to "encode" (not "both" for encoding and decoding as it does not matter anyway for decoding)
whitespace has 2 codecs defined ; the simple one is a simple bit encoding map, therefore using add_map with intype set to "bin" (for pre-converting characters to bits before applying the encoding map), and the complex one uses add with its specific encode/decode functions
atbash defines a dynamic map with a "factory" function, that creates the encoding map according to the parameters supplied in the codec name

So, before going further, determine the following:

What does the new codec map from and to ? E.g. if binary input and ordinal output, you can use add_map with intype="bin" and outype="ord".
Is this codec ignoring case ? If so, you can use add_map and specify which operation(s) should ignore case, e.g. ignore_case="both" or ignore_case="decode".
Should this codec raise no error ? If so, you can use add_map, but do not forget to specify no_error=True.
Does the codec yield variable-length encoded tokens ? If so, you can still use add_map but you should define sep (separator), as codext will otherwise not be able to handle ambiguities.

If you find aspects that are not covered in these questions, you should use add and refer to Case 1. Otherwise, you can use add_map and refer to Case 2.
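The question about variable-length tokens is the least obvious one. The sketch below uses a hypothetical three-entry map in plain Python (no codext call) to show why decoding becomes ambiguous without a separator.

```python
ENCMAP = {"e": ".", "t": "-", "a": ".-"}    # variable-length encoded tokens
DECMAP = {token: char for char, token in ENCMAP.items()}

def encode(text, sep=""):
    return sep.join(ENCMAP[char] for char in text)

def decode(text, sep):
    # A non-empty separator is required to split the tokens back.
    return "".join(DECMAP[token] for token in text.split(sep))

print(encode("et") == encode("a"))             # True: both give ".-"
print(encode("et", sep=" "))                   # . -
print(decode(encode("et", sep=" "), sep=" "))  # et
```

Without a separator, ".-" could be read back as either "a" or "et" ; with one, each token is delimited and decoding is unambiguous.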

Case 1: Generic encoding definition

This uses: codext.add

This applies when the codec is more complex than a mapping, as defined in Case 2: Encoding map.

Examples: crypto/barbie, crypto/railfence, stegano/resistor, stegano/whitespace

The following shall be considered:

encode (keyword-argument ; defaults to None): when left None, it means that the codec cannot encode.
decode (keyword-argument ; defaults to None): when left None, it means that the codec cannot decode.

Both functions must take 2 arguments and return 2 values (in order to stick to codec's encode/decode function format):

Inputs: text, errors="strict" ; respectively the text to encode/decode and the error handling mode.
Outputs: encoded text and length of consumed input text.
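As a minimal sketch of this function format, the pair below reverses the text. To stay runnable without codext, it is registered through the standard `codecs` module under the hypothetical name "reverse_demo" ; with codext, the same two functions would instead be passed to `codext.add` in the call form shown in the earlier example.

```python
import codecs

def reverse_encode(text, errors="strict"):
    # Return the encoded text and the length of the consumed input.
    return text[::-1], len(text)

def reverse_decode(text, errors="strict"):
    return text[::-1], len(text)

def _search(name):
    # Standard-library search function, used only to exercise the two functions.
    if name == "reverse_demo":
        return codecs.CodecInfo(reverse_encode, reverse_decode, name="reverse_demo")
    return None

codecs.register(_search)

print(codecs.encode("test string", "reverse_demo"))  # gnirts tset
print(codecs.decode("gnirts tset", "reverse_demo"))  # test string
```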

Error handling mode

strict: this is the default ; it means that any error shall raise an exception.
ignore: any error is ignored, adding nothing to the output.
replace: any error yields the given replacement character(s).
leave: any error yields the erroneous input token in the output.

This last mode is an addition to the native ones. It can be useful for some encodings that must cause no error while encoding and can therefore have their original characters in the output.
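The four modes can be sketched with a toy encoder written by hand. The map and the function `toy_encode` are hypothetical and only illustrate the expected behavior of each mode.

```python
ENCMAP = {"a": "1", "b": "2", "c": "3"}  # hypothetical toy mapping

def toy_encode(text, errors="strict", repl_char="?"):
    out = []
    for position, token in enumerate(text):
        if token in ENCMAP:
            out.append(ENCMAP[token])
        elif errors == "strict":
            raise ValueError(f"cannot encode {token!r} at position {position}")
        elif errors == "ignore":
            continue                 # add nothing to the output
        elif errors == "replace":
            out.append(repl_char)    # substitute the replacement character
        elif errors == "leave":
            out.append(token)        # keep the erroneous input token as is
        else:
            raise ValueError(f"unsupported error handling mode {errors!r}")
    return "".join(out), len(text)

print(toy_encode("abxc", "ignore")[0])   # 123
print(toy_encode("abxc", "replace")[0])  # 12?3
print(toy_encode("abxc", "leave")[0])    # 12x3
```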

Also, while defining the encode and/or decode functions, codext.handle_error can be used as a shortcut to handle the different modes. It returns a wrapped function that takes token and position as arguments (see excess3 for an example).

>>> help(codext.handle_error)
Help on function handle_error in module codext.__common__:

handle_error(ename, errors, sep='', repl_char='?', repl_minlen=1, decode=False, item='position')
    This shortcut function allows to handle error modes given some tuning parameters.

    :param ename:       encoding name
    :param errors:      error handling mode
    :param sep:         token separator
    :param repl_char:   replacement character (for use when errors="replace")
    :param repl_minlen: repeat number for the replacement character
    :param decode:      whether we are encoding or decoding
    :param item:        position item description (for describing the error ; e.g. "group" or "token")

>>> err = codext.handle_error("test", "strict")
>>> help(err)
Help on function _handle_error in module codext.__common__:

_handle_error(token, position)
    This handles an encoding/decoding error according to the selected handling mode.

    :param token:    input token to be encoded/decoded
    :param position: token position index

Case 2: Encoding map

This uses: codext.add_map

This applies when the codec can be defined as a simple mapping between source and destination tokens.

Examples: languages/braille, languages/morse, languages/southpark, stegano/klopf, stegano/rick

The following options shall be considered:

encmap (second positional argument): This defines the encoding map and is the core of the codec ; 4 subcases are handled and explained hereafter.
repl_char (keyword-argument ; default: "?"): The replacement character can be tuned, especially if the default one clashes with a character from the encoding.
sep (keyword-argument ; default: ""): The separator between encoded tokens can be useful to tune, especially when the encoded tokens have a variable length.
ignore_case (keyword-argument ; default: None): This defines where the case shall be ignored ; it can be one of the following: "encode", "decode" or "both".
no_error (keyword-argument ; default: False): This sets if errors should be handled as normal or if no error should be considered, simply leaving the input token as is in the output.
intype (keyword-argument ; default: None): This specifies the type the input text should be converted to before applying the encoding map (pre-conversion before really encoding) ; this can be one of the following: str, bin or ord.
outype (keyword-argument ; default: None): This specifies the type the output text of the encoding map should be converted from (post-conversion after really encoding) ; this can be one of the following: str, bin or ord.

Input/Output types

By default, when intype is defined, outype takes the same value if left None. So, if the new encoding uses a pre-conversion to bits (intype="bin") but maps bits to characters (therefore binary conversion to text is not needed), outype shall then be explicitly set to "str" (or if it maps bits to ordinals, use outype="ord").
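The bits-to-characters case can be pictured in plain Python. The sketch below is a hand-written approximation of what a pre-conversion to bits followed by a bit-to-character map amounts to ; the map `BITMAP` and the helper functions are hypothetical and do not show codext internals.

```python
BITMAP = {"0": " ", "1": "\t"}   # hypothetical bit-to-character map
REVERSE = {char: bit for bit, char in BITMAP.items()}

def to_bits(text):
    # Pre-conversion: each character becomes 8 bits.
    return "".join(format(byte, "08b") for byte in text.encode("latin-1"))

def from_bits(bits):
    chunks = (bits[i:i + 8] for i in range(0, len(bits), 8))
    return bytes(int(chunk, 2) for chunk in chunks).decode("latin-1")

def encode(text):
    # The map output is already text, so no conversion back from bits is applied.
    return "".join(BITMAP[bit] for bit in to_bits(text))

def decode(text):
    return from_bits("".join(REVERSE[char] for char in text))

print(to_bits("A"))         # 01000001
print(decode(encode("A")))  # A
```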

encmap can be defined as follows:

Simple map: In this case, the encoding map is a dictionary mapping each input character to an output one (see radio for an example).
List of maps: In this case, encoding maps are put in a list and referenced by their order number starting from 1, meaning that the pattern shall define a capture group with values from 1 to the length of this list (see dna for an example).
Parametrized map: This variant defines a dictionary of regex-selected encoding maps, that is, a dictionary of dictionaries with keys matching the captured groups from codec's pattern.
Map factory function: This one is implemented by a function that returns the composed encoding map. This function takes a single argument according to the capture group from the pattern (see affine for an example).
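The four shapes can be written down as plain Python objects. The maps below are made up for illustration, and `select_map` is a hypothetical resolver showing how a captured group could pick the effective map ; it is an assumption, not the logic used by add_map.

```python
import string

# 1. Simple map: a single dictionary
simple_map = {"a": "1", "b": "2"}

# 2. List of maps: selected by a 1-based number captured from the codec name
list_of_maps = [{"a": "1", "b": "2"}, {"a": "x", "b": "y"}]

# 3. Parametrized map: dictionary of dictionaries keyed by the captured group
parametrized_map = {"upper": {"a": "A", "b": "B"}, "digit": {"a": "1", "b": "2"}}

# 4. Map factory: a function building the map from the captured group
def shift_factory(group):
    shift = int(group)
    letters = string.ascii_lowercase
    return {char: letters[(i + shift) % 26] for i, char in enumerate(letters)}

def select_map(encmap, group=None):
    """Hypothetical resolver, for illustration only."""
    if callable(encmap):
        return encmap(group)
    if isinstance(encmap, list):
        return encmap[int(group) - 1]
    if encmap and all(isinstance(value, dict) for value in encmap.values()):
        return encmap[group]
    return encmap

print(select_map(simple_map)["a"])                 # 1
print(select_map(list_of_maps, "2")["a"])          # x
print(select_map(parametrized_map, "upper")["b"])  # B
print(select_map(shift_factory, "3")["a"])         # d
```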

Mapping one input character to multiple output characters

In some particular cases (e.g. the navajo codec), a single input character can be mapped to multiple output ones. It is possible to define them in a map by simply putting them into a list (e.g. a map with {'A': ["B", "C", "D"]}). In this case, while encoding, the output character is randomly chosen (e.g. "A" will map to "D", another time to "B", ...).
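A one-to-many map can be pictured as follows. This is a plain Python sketch with a hypothetical map ; it shows why the encoded output varies between calls while decoding remains deterministic.

```python
import random

ENCMAP = {"A": ["B", "C", "D"], "E": ["F"]}   # hypothetical one-to-many map
DECMAP = {out: src for src, outs in ENCMAP.items() for out in outs}

def encode(text):
    # One of the candidate outputs is picked at random for each character.
    return "".join(random.choice(ENCMAP[char]) for char in text)

def decode(text):
    return "".join(DECMAP[char] for char in text)

encoded = encode("AEA")
print(decode(encoded))  # AEA
```

Since the encoded text is not predictable, such a codec can only be tested by checking that encoding then decoding gives back the input.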

Self-generated tests

In order to facilitate testing, a test suite can be automatically generated from a set of examples. This is defined in the __examples__ dunder inside codec's source file (see sms for an example). By default, the add/add_map function will get __examples__ from the global scope but this behavior can be overridden by specifying the keyword-argument examples (e.g. add(..., examples=__examples1__) ; see ordinal for an example).

A set of examples is a dictionary specifying the test cases to be considered. The keys are the descriptions of the test cases and the values can be either dictionaries of input texts and their output encoded texts or lists of input texts. Each key has the format "operation(encodings)". Operations can be:

enc: This is for testing the encoding of the nested values (that is, a dictionary of input/outputs).
dec: This is for testing the decoding of the nested values (that is, a dictionary of input/outputs). If this is not specified, the test suite automatically tries to decode from what is defined in enc.
enc-dec: This is for testing the encoding AND decoding of the nested values (that is, a list of inputs) ; this one does not enforce what should be the output of the encoding but checks that encoding AND decoding leads to the same input text. This is particularly useful when encoding can yield randomly chosen tokens in the encoded output.

The encodings are a |-separated list of encoding names, compliant or not with the tested codec's pattern. Faulty names can also be tested, as shown in the examples hereafter.
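The key format "operation(encodings)" can be taken apart with a short regular expression. The helper `parse_key` below is a hypothetical illustration of the format and not the parser used by the test suite ; the encoding names are placeholders.

```python
import re

# "enc-dec" is listed first so that it is not cut short by "enc".
KEY_FORMAT = re.compile(r"^(enc-dec|enc|dec)\((.*)\)$")

def parse_key(key):
    match = KEY_FORMAT.match(key)
    if match is None:
        raise ValueError(f"bad test case key: {key!r}")
    operation, encodings = match.groups()
    return operation, encodings.split("|")

print(parse_key("enc(codec)"))              # ('enc', ['codec'])
print(parse_key("enc-dec(codec|codec-2)"))  # ('enc-dec', ['codec', 'codec-2'])
```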

Examples of __examples__ test suites:

__my_examples__ = {
    'enc(BAD)': None
}

Observations

__my_examples__ is not the standard dunder, therefore requiring to be specified as the examples keyword-argument of add/add_map.
BAD is assumed to be a bad encoding name, therefore having a dictionary value of None, meaning that the test should raise a LookupError.

__examples__ = {
    'enc(codec)': {'string': None}
}

Observations

__examples__ is the standard dunder, therefore NOT requiring to be specified as the examples keyword-argument of add/add_map.
codec is assumed to be a valid encoding name, therefore having a dictionary as its value ; in this special case, "string" is assumed not to be encodable, so its corresponding value is None, meaning that the test should raise a ValueError.

__examples__ = {
    'enc-dec(codec)': ["test string", "TEST STRING", "@random", "@random{1024}"]
}

Observations

__examples__ is the standard dunder, thus not specified in add/add_map.
enc-dec is used, meaning that a list of inputs is defined.
So, whatever its encoded output, each input string shall be recovered unchanged after applying encoding then decoding.
The special values @random and @random{1024} mean that test strings are generated from any possible byte-character with a specified length (512 when not specified, otherwise specified with {...}).
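The @random notation can be read as follows. The helper `expand_example` is a hypothetical sketch of how such a special value could be expanded into a test string, assuming "byte-character" means a character with an ordinal from 0 to 255 ; it is not the test suite's own code.

```python
import random
import re

RANDOM_SPEC = re.compile(r"^@random(?:\{(\d+)\})?$")

def expand_example(value, default_length=512):
    """Expand "@random" or "@random{N}" into a random string ; leave other values as is."""
    match = RANDOM_SPEC.match(value)
    if match is None:
        return value
    length = int(match.group(1) or default_length)
    return "".join(chr(random.randrange(256)) for _ in range(length))

print(len(expand_example("@random")))        # 512
print(len(expand_example("@random{1024}")))  # 1024
print(expand_example("test string"))         # test string
```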

__examples__ = {
    'enc(codec)': {"test string": "..."}
}

Observations

__examples__ is the standard dunder, thus not specified in add/add_map.
Only enc is used, meaning that a dictionary of inputs/outputs is given ; dec is handled automatically, so the test requires the exact encoded text when encoding and the exact original input when decoding.

__examples__ = {
    'enc(codec)': {"Test String": "..."},
    'dec(codec)': {"...": "test string"},
}

Observations

__examples__ is the standard dunder, thus not specified in add/add_map.
enc and dec are used, meaning that dictionaries of inputs/outputs are given and the input texts are not necessarily the same (e.g. if text case is not handled by the codec).

Codec names for the guessing mode

The __guess__ list of codec names is used to limit the possibilities in the tree search from the guessing mode. Especially when the codec is dynamic and may have a large (or even infinite) number of dynamic names, it is necessary to set a limited number in order to avoid exponentially increasing computation time. This list, when relevant, shall be used with due care.

Best practice

As a best practice, static names for the guessing mode should be limited to 16, in order to avoid exponential computation time in the search tree algorithm.
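The growth in computation time can be estimated with simple arithmetic. The sketch below assumes, as an upper bound, that the search tries every chain of codec names up to a given depth ; the real search may prune branches, so the figures only show the order of magnitude.

```python
def candidate_chains(names, max_depth):
    """Number of codec chains of length 1..max_depth built from `names` names."""
    return sum(names ** depth for depth in range(1, max_depth + 1))

for names in (16, 64, 256):
    print(names, candidate_chains(names, 3))
# 16 4368
# 64 266304
# 256 16843008
```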

Adding a new codec to `codext`

As a checklist when making a codec for addition in codext, please follow these steps:

Create your codec file (e.g. starting with a copy of an existing similar one)
Place it into the right category folder (when a codec cannot be put in one of the category folders under the root of codext, it shall be put by default in others)
Add it to the list in README.md
Add its documentation in the right Markdown file
If self-generated tests are not enough, add manual tests in the related file
