Skip to content

Array data types#616

Closed
lpw25 wants to merge 14 commits intoocaml:trunkfrom
lpw25:array-data-types
Closed

Array data types#616
lpw25 wants to merge 14 commits intoocaml:trunkfrom
lpw25:array-data-types

Conversation

@lpw25
Copy link
Copy Markdown
Contributor

@lpw25 lpw25 commented Jun 14, 2016

This PR adds support for defining fresh array types:

type int_array = [| mutable int |]

Motivation

The benefit of fresh array types over instances of the standard 'a array type is that they can use optimised representations without resorting to a non-uniform representation. OCaml already has a couple of such types builtin: string and bytes, in addition to using a non-uniform representation to optimise float array.

The intention is that in the future one will be able to write definitions such as:

type point = { x: int; y : int }

type point_array = [| mutable point [@unboxed] |]

to create an array type using an unboxed representation of record and tuple types (e.g. point).

This patch only adds support for unboxing float contents (which is done without an annotation just like in records). The support for an [@unboxed] attribute would be better done as part of a large patch to support it for all data types (i.e. records fields, variant constructors, and array contents).

FloatArray

The addition of fresh array types allows an implementation of the FloatArray.t type from #163. This provides an unboxed float array type, without the dynamic boxing of float array. This means it is more efficient than float array (due to fewer dynamic checks) and would serve as a good first step to eliminating the dynamic boxing of float array.

Array primitives.

Since these fresh array types are not equal to 'a array they require their own primitives for construction, mutating and destructing them. As with the other data types (variants and records) these primitives are scoped and support type-based disambiguation. The primitives are as follows:

  1. Field access: x.(5).
  2. Field mutation: x.(5) <- "hello"
  3. Array literals (both expression and pattern): [|1; 2; 3|]
  4. Array length: x.length
  5. Array comprehension: [| x * 2 for x = 0 to 10 |]

The first three are syntactically identical to the OCaml's existing array primitives. The last two fill the roles traditionally taken by Array.length and Array.init. They are needed because we cannot have a function which is polymorphic across all array data types, and they are sufficient for implementing the remaining functions in the Array module.

I imagine some people will be against the array comprehensions on various aesthetic grounds. I would argue that they provide exactly the required primitive, and that they are already well known (and liked) by programmers familiar with other languages. Both for and arrays are from the imperative half of OCaml, so it seems natural to me to use a for-based syntax in connection with arrays.

Notes

  • Whilst part of the intention of this PR is to make the builtin optimised array types less magic, we do not actually use it for bytes and string due to the difficulties around handling -safe-string.
  • The default representation for array data types does not perform the float array optimisation. A [@dynamic_boxing] attribute is used to get the traditional array behaviour. This is not really intended for public use: it is mostly just there so that the Array module can reexport the definition of array.
  • This patch allows the creation of immutable array types. However, they are currently translated identically to mutable array types. I'm going to add another commit to change them to use the new support for immutable arrays added as part of the flambda work.

@Drup
Copy link
Copy Markdown
Contributor

Drup commented Jun 14, 2016

Random question: are immutable polymorphic arrays (type 'a array = [| 'a |]) covariants ?

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Jun 14, 2016

Good question. I can't remember if I remembered to fix that.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Jun 14, 2016

Looks like I was on the ball this time: they are indeed covariant.

@alainfrisch
Copy link
Copy Markdown
Contributor

Even if the proposed set of primitives is complete in theory, it does not allow for an efficient implementation of e.g. the blit operation, except if we rely on the compiler to remove bound checks on such a function implemented naively (with explicit assertions on the lengths) -- but this is another story.

What about polymorphic algorithm such as sorting, or even map? Would the recommended approach be to pass a dictionary of functions (explicitly or with modular implicits)?

An alternative design could be to have a new construction at the module expression level, which would produce not only a fresh array type, but also a (larger) set of operations to work on it as regular functions (which would hopefully be inlined when used directly). E.g.

  module IntArray = [|mutable int|]

This would produce a module matching module type MUTABLE_ARRAY with type elt = int (containing functions such as init, but also possibly blit, iter, etc). This seems less intrusive in the core language than your proposal, which introduces a new kind of core type that can only be used through type-based selection.

In your proposal, you reuse the syntax of field projection for the length. It seems you could similarly reuse the syntax of constructors for the init operation, with the same type-based lookup:

   (Array (11, fun i -> i * 2) : int_array)

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Jun 15, 2016

Even if the proposed set of primitives is complete in theory, it does not allow for an efficient implementation of e.g. the blit operation

Indeed. To be complete whilst keeping efficiency in terms of bound checks and use of memcpy we would require additional primitives for handling subarrays: x.(2 .. 5) and x.(2 .. 5) <- y including optimisation of the blit case: x.(2 .. 5) <- y.(2..5). These would be reasonable primitives and are inline with what other languages provide for arrays, but I didn't include them in this PR as they are not strictly necessary.

What about polymorphic algorithm such as sorting, or even map? Would the recommended approach be to pass a dictionary of functions (explicitly or with modular implicits)?

Yes. Operations that are polymorphic over arrays with different representations are essentially ad-hoc polymorphism and so they would be best handled with modular implicits.

introduces a new kind of core type that can only be used through type-based selection.

Actually, the PR allows you to specify the appropriate constructors and destructors using paths, just like with records and variants: x.FloatArray.(5), FloatArray.[|1.2|], etc.

In your proposal, you reuse the syntax of field projection for the length. It seems you could similarly reuse the syntax of constructors for the init operation, with the same type-based lookup:

Labels are a reasonable syntax for length because it is a projection, the field just happens to be stored in the block header. Whereas Array (11, fun i -> i * 2) implies that we are constructing a value which contains the function, whilst we are actually going to run the function multiple times and then discard it.

@alainfrisch
Copy link
Copy Markdown
Contributor

Operations that are polymorphic over arrays with different representations are essentially ad-hoc polymorphism and so they would be best handled with modular implicits.

Would you need to create the instances manually for each fresh array type? What about my proposal of extending the module layer to return directly the instance?

PR allows you to specify the appropriate constructors and destructors using paths

Ok, but this is already at the level of the module system. Two array types in the same structure cannot work without type-based selection. This is already the case with two record types sharing the same labels, but the user is free to choose the labels; for array type, the overloading would mandated by the design.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Jun 15, 2016

Would you need to create the instances manually for each fresh array type?

This is identical to generic operations over records and variants and the same solutions should be used. At the moment, that probably means ppx_deriving and similar.

What about my proposal of extending the module layer to return directly the instance?

Personally, that seems like a pretty ugly solution. It is completely inconsistent with how other data types are handled. I also don't see how it would work for patterns or with type-based disambiguation. What module type are you expecting to give these modules? It wouldn't really help you avoid ppx_deriving: it is not like the set of functions in the standard library's Array module is the last word on arrays, if anything it would make dealing with them more awkward.

Two array types in the same structure cannot work without type-based selection.

True, but I would be surprised if that was a common use-case.

@alainfrisch
Copy link
Copy Markdown
Contributor

I also don't see how it would work for patterns or with type-based disambiguation.

I don't think this needs to be supported; arrays are so rare (strings excluded, but they would keep some ad hoc treatment anyway). Same for array literals.

type-based disambiguation

We don't need that, we just use IntArray.get, etc.

What module type are you expecting to give these modules?

MUTABLE_ARRAY with type elt = ... or IMMUTABLE_ARRAY with type elt = ... with built in definitions for these two named signatures, like:

module IMMUTABLE_ARRAY = sig
  type t
  type elt
  val init: int -> (int -> elt) -> t
  val get: t -> int -> elt
  (* and perhaps some more fields, unsafe_get, sub, blit, etc *)
end

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Jun 15, 2016

We don't need that, we just use IntArray.get, etc.

This would mean anyone hoping to switch from float array to FloatArray.t would need to rewrite their code into a completely different style.

To be honest, I see no real benefit in the approach you are proposing, whilst there is an obvious cost in the usability of the array types.

@yallop
Copy link
Copy Markdown
Member

yallop commented Jun 15, 2016

It looks like you're not storing anything in the object tag. So could this approach support "inline" array types, like this?

type 'a tree =
   Leaf of 'a
 | Branch of [| 'a |]

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Jun 15, 2016

It looks like you're not storing anything in the object tag. So could this approach support "inline" array types, like this?

I don't immediately see a reason why not, and that certainly seems like a useful feature.

@alainfrisch
Copy link
Copy Markdown
Contributor

would need to rewrite their code into a completely different style

It's local syntactic changes, not a different style. Getting rid of the syntax for literals and array access seems very minor to me (and a new indexing syntax could be introduced, and combined with modular implicits, if really desired). Switching a few a.(i) to FloatArray.get a i is not really a big deal; we already need to rewrite Array.length a to FloatArray.length a or a.length anyway.

With your proposal, one would already need to add type annotations and replace Array by FloatArray. And this works because FloatArray comes with a lot of hand-written code. For another array type, one would need to use a new syntax for init and we loose unsafe array access and efficient blit (except by piling up some more syntax).

Relying on the module system seems much more natural to me. This is the standard way to create ad hoc monomorphic data structures (Set.Make, Hashtbl.Make). The only difference here is that the Make functor for arrays behaves differently according to the input type and cannot be implemented in the language, so it needs to be built-in.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Jun 15, 2016

For another array type, one would need to use a new syntax for init

Or just use ppx_deriving to build the functions from the Array module. If you want generic functions you should use a mechanism for implementing generic functions.

The only difference here is that the Make functor for arrays behaves differently according to the input type and cannot be implemented in the language, so it needs to be built-in.

That's not some minor difference: it is fundamental. Set and Hashtbl are parametric in their parameter, whilst these array types aren't -- that is the whole point. Using the module system for something like this is completely alien to the rest of the language.

Adding expression syntax is cheap, the only cost is that people need to know what the new syntax means. That is why I have been emphasising that the new syntax is common in existing languages: people already know what it means. This makes the cost of the new syntax very low, whilst there are obvious benefits in terms of usability to having array syntax (otherwise why have array syntax in the first place).

@alainfrisch
Copy link
Copy Markdown
Contributor

That's not some minor difference: it is fundamental.

I don't think it's fundamental. For instance one can write a functor which chooses a custom representation of arrays based on a type witness:

type _ ty =
  | Char: char ty
  | Pair: 'a ty * 'b ty -> ('a * 'b) ty
  | Other: 'a ty

module type TY = sig
  type t
  val ty: t ty
end

module type ARRAY = sig
  type t
  type elt
  val init: int -> (int -> elt) -> t
  val get: t -> int -> elt
end

let rec make_array: type t. t ty -> (module ARRAY with type elt = t) =
  function
  | Char ->
    (module struct
      type t = string
      type elt = char
      let init = String.init
      let get = String.get
    end)

  | Other ->
    (module struct
      type elt = t
      type t = elt array
      let init = Array.init
      let get = Array.get
    end)

  | Pair (ty1, ty2) ->
    let module A1 = (val make_array ty1) in
    let module A2 = (val make_array ty2) in
    (module struct
      type elt = A1.elt * A2.elt
      type t = A1.t * A2.t
      let init n f =
        let res = Array.init n f in
        A1.init n (fun i -> fst res.(i)),
        A2.init n (fun i -> snd res.(i))
      let get (a1, a2) i =
        A1.get a1 i,
        A2.get a2 i
    end)


module MyArray(X : TY)() : ARRAY with type elt = X.t = (val make_array X.ty)

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Jun 15, 2016

Fair enough. Generative functors do allow the creation of fresh non-parametric types.

This discussion seems to come down to whether the syntaxes [| x for x= 0 to 10 |] and a.(1..10) are easier for people to understand than module FooArray = [| mutable foo |].

Personally, I think the expression syntaxes are much easier for people to learn since they already exist in other languages and are very similar to existing syntaxes. Whereas, I would expect most people to respond to the module syntax with some variation on "WTF is that".

Combined with the additional benefit of supporting standard array syntax on fresh array types, I do not see the advantage of the module-based approach.

@alainfrisch
Copy link
Copy Markdown
Contributor

Can you give a concrete example on how we would create "instances" (for modular implicit, or explicit passing of dictionnaries) for generic array-based algorithms, in your proposal? My current understanding is that you would need to write something like (in the explicit case):

  type int_array = [| mutable int |]
  module IntArray = struct
    type t = int_array
    type elt = int
    let init n f = [| f x for x = 0 to n - 1 |]
    let length (a : t) = a.length
    let get (a : t) i = a.(i)
    let set (a : t) i x = a.(i) <- x
  end
  ....
  let sort_int_array = generic_sort (module IntArray)

and you would need to duplicate the code above for each instance.

This discussion seems to come down to whether the syntaxes [| x for x= 0 to 10 |] and a.(1..10) are easier for people to understand than module FooArray = [| mutable foo |].

Rather between

  type int_array = [| mutable foo |]
  let a0 = ([| x for x = 0 to 10 |] : int_array)
  module IntArray = struct 
    (* see above *)
  end
  let a1 = generic_sort (module IntArray) a0

and

  module IntArray = [| mutable foo |]
  let a0 = IntArray.init 11 (fun x -> x)
  let a1 = generic_sort (module IntArray) a0

@rleonid
Copy link
Copy Markdown
Contributor

rleonid commented May 30, 2017

As a pretty heavy array user, I'd encourage further progress on this issue. I find @lpw25 arguments convincing.

@mrvn
Copy link
Copy Markdown
Contributor

mrvn commented Jun 27, 2017

Maybe I missed something but what's the status of passing (mutable) array elements to other functions?

Last I checked it is not be possible to pass unboxed records or tuples to other functions the way values are passed. Instead a pointer + index would be needed. Or copying. And copying doesn't work with mutable and destroys physical equality.

Overall I see this only working when no functions are called on elements or those functions can be inlined. Otherwise there is a performance penalty and breakage.

@alainfrisch
Copy link
Copy Markdown
Contributor

Maybe I missed something but what's the status of passing (mutable) array elements to other functions?

We don't pass a "mutable array element " to functions anymore than we do today. We can get an element out of an array and pass it to a function. If we want to pass a reference to an element so that the function can mutate it, we pass the array + the index, as we do today.

@mrvn
Copy link
Copy Markdown
Contributor

mrvn commented Jul 18, 2017

Maybe I missed something but what's the status of passing (mutable) array elements to other functions?

We don't pass a "mutable array element " to functions anymore than we do today. We can get an element out of an array and pass it to a function. If we want to pass a reference to an element so that the function can mutate it, we pass the array + the index, as we do today.

Except that is not true. Till now an array contains immutable primitive values or pointers and that pointer is simply passed to other functions. If the pointer is to a record or object with mutable fields it can simply be mutated in place. As you say, for array data types, you have to pass array + index to functions to get that behaviour and that needs something new. A function 'a -> unit doesn't handle 'a = 'b array * 'b index on its own.

@alainfrisch
Copy link
Copy Markdown
Contributor

Till now an array contains immutable primitive values or pointers and that pointer is simply passed to other functions.

I'm not sure what you're talking about, but no, one cannot pass a pointer to some middle of an array to a function. One extracts the value stored in some array cell and pass this value to a function. And the current proposal does not change anything in this respect.

@mrvn
Copy link
Copy Markdown
Contributor

mrvn commented Jul 18, 2017

It's not a pointer to the middle of an array that is passed, it is the pointer stored in the array cell that is passed. Problem is that with the unboxing suggested there is no more pointer. And "extracting" the array cell then means copying the record or object stored there. And that breaks mutability. Or is the array data type now restricted to immutables?

@alainfrisch
Copy link
Copy Markdown
Contributor

Ok, the problem is not the mutability of the arrays, but the mutability within its elements. This PR (as far as I remember) does not implement the unboxing representation yet. When/if this is added, it is likely that this unboxed representation cannot be used for a record type with mutable fields (or alternatively that we will just document the copying semantics in that case).

@alainfrisch
Copy link
Copy Markdown
Contributor

Note that you started the discussion with this example:

type point = { x: int; y : int } type point_array = [| mutable point [@unboxed] |]

here there is no problem. The array is mutable but the elements are not. So there is no observable change in semantics compared to today's regular arrays.

@bluddy
Copy link
Copy Markdown
Contributor

bluddy commented May 9, 2018

In the interest of getting the actual feature merged, I think the notion of adding all this syntax to the core language needs to be abandoned and debated independently. I'd much rather see the

type (...) t = [| ... |]
[@@deriving array]

solution implemented, whereby magic functions are created for t. This is the only solution that will be agreed upon simply because it doesn't intrude into the rest of the language. Once we have this solution in place, we can discuss whether introducing syntax elements make sense on a case-by-case basis.

@strega-nil
Copy link
Copy Markdown

What's the status of this? I would really like immutable arrays in my compiler.

@lpw25 lpw25 mentioned this pull request Sep 9, 2019
@mjambon
Copy link
Copy Markdown
Contributor

mjambon commented Sep 11, 2020

Last year, I accidentally raised a similar issue (#8927) which now redirects to this PR. I'm not competent to discuss the implementation requirements, or whether it would require unreasonable maintenance costs from the ocaml maintainers. That said, as a user, I like the following:

  • knowing that a private array-like type is indeed an extension of the base array type and benefits from much of the same interface (syntax, primitives) and properties (algorithmic complexity of primitives, exceptions).
  • knowing that private arrays work like other private types.
  • not having to learn new things.

I don't mind new syntax if it's natural, i.e. it's easy to guess what it means. I tend to dislike solutions that require memory efforts from the author or from the reader. Simply said, if we have to apply a functor to benefit from read-only arrays, I probably would not use it. Overall, I would like the benefit of not writing a custom module on my end to have a private array in my application. Consistency of private arrays with other private types was the main thing I had in mind originally and still today.

@mrvn
Copy link
Copy Markdown
Contributor

mrvn commented Sep 26, 2020

I just looked over this PR again and I still like it. The syntax with comprehension seems clear as well.

But one problem remains and has not been addressed: @unboxed. @unboxed can not work with elements that have mutable fields.

Will you make sure the compiler only allows @unboxed when the elements have no mutable field? Or what is the plan there? Without a solution for how @unboxed is going to work in the future or abandoning unboxing for generic types in arrays I think this PR is stuck.

PS: I really would love to have arrays of unboxed records, mutable fields or not, even if that means code has to be rewritten to work with 'a array * 'a index as accessor. I'm just not sure how you would implement that so the 'a never escapes its scope.

@lpw25
Copy link
Copy Markdown
Contributor Author

lpw25 commented Sep 26, 2020

re. unboxed types: the plan is to integrate this with the results of the unboxed types proposal.

@bluddy
Copy link
Copy Markdown
Contributor

bluddy commented Sep 27, 2020

Nice! Good to see this being borrowed from haskell.

@mrvn
Copy link
Copy Markdown
Contributor

mrvn commented Sep 29, 2020

That seems to still be problematic although it will just fail to compile I think or be inefficient.

Say you have

type s = { foo : int; mutable bar : #t }
val mutate : #t -> #t

s.bar <- mutate s.bar

The RFC then says mutate is called with each field of #t passed in a register. If #t is large that will put a lot of data on the stack. Then mutate changes something and returns all the fields in registers (and the stack) again. And then it has to be copied over s.bar.

There is no way I can see to make this efficient if mutate is not inlined. Only the trivial cases of

s.bar <- {s.bar with some = simple_value; }

will produce efficient code.

I think that effectively means you can only unbox types with up to 4 fields. Beyond that boxed is better.

PS: That's still a big step forward to more efficient code and probably acceptable.

stedolan pushed a commit to stedolan/ocaml that referenced this pull request May 24, 2022
* Refactor elf notes (amd64)

* Emit labels for trap-handling code locations (push, pop, enter)

Record the labels in a new dedicated trap notes sections

* Emit at most one .stapsdt.base symbol in each compilation unit

Assembler error even though the symbols are both .weak
@lpw25 lpw25 closed this Jan 13, 2023
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Co-authored-by: Cuihtlauac ALVARADO <cuihtmlauac@tarides.com>
@yallop yallop mentioned this pull request Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.