[RFC/Discussion] Text to shape / Text metrics API

Adding some form of text metrics API to libass has been under discussion for a long time (see #87 and #348), but (as far as I can see) never with a very concrete proposal. I've been planning to draft a more detailed proposal for a long time, but never really had the time for it. I have a bit more time now and was reminded of this issue by a discussion today, so I thought I could start by listing my thoughts so far and see if anyone else has relevant ideas or concerns. Apologies in advance for the long issue.

### Use Cases
Maybe it's best to start with the API user's perspective, to see what kinds of metrics would be needed in such an API.
There are two big classes of requests:
- **Text metrics**: These are, of course, needed to lay out text, especially when splitting it by character or by syllable. Most importantly, they are used in Aegisub's karaoke templater to split a line into syllables or characters and position them correctly. In my fork of Aegisub, they are also used in the [perspective tool](https://github.com/arch1t3cht/Aegisub/releases/tag/feature_09) to determine how much a line will be shifted by `\fax`. Similarly, they're used by any Lua scripts that deal with perspective.

  In Lua scripts (including the karaoke templater), these metrics are obtained from the [`aegisub.text_extents`](https://aegisub.org/docs/latest/automation/lua/miscellaneous_apis/#aegisubtext_extents) function:
  ```lua
  width, height, descent, ext_lead = aegisub.text_extents(style, text)
  ```
  On Windows, this function uses GDI (implemented [here](https://github.com/Aegisub/Aegisub/blob/6f546951b4f004da16ce19ba638bf3eedefb9f31/src/auto4_base.cpp#L72-L120)): `width` and `height` are obtained using `GetTextExtentPoint32`, while `descent` and `ext_lead` are obtained from `GetTextMetrics` (and are independent of the given text). In other words, it returns the same metrics that VSFilter uses for layout.

  However, on Linux, where GDI is not available, it has to use wxWidgets, which (usually) uses GTK internally, which in turn uses pangocairo to compute the metrics. The resulting metrics are accurate some of the time (though, if I'm not mistaken, even then there are rounding differences that cause shifts on the order of one PlayResY unit), but at other times they can be very wrong. This can make advanced typesetting on Linux very hard (and some typesetters resort to strange workarounds like running Aegisub through Wine, which has a better, albeit not perfect, emulation of GDI).
- **Text to shape**: This is not used anywhere in Aegisub or its Lua API, but it is provided via FFI by two third-party Lua libraries: [YUtils](https://github.com/TypesettingTools/Yutils) (the older one) and [ILL](https://github.com/TypesettingTools/ILL-Aegisub-Scripts/) (the newer one). As with text metrics, they work very well on Windows (where they use GDI's `GetPath`), but can be very inaccurate on Linux (where YUtils uses pangocairo, and ILL uses FreeType and tries to emulate libass). Text to shape is used in some advanced typesetting, e.g. when clipping grain or some other effect to text, or when warping or distorting text in some fashion.

With this in mind, it would of course be great to have a unified, OS-independent method of obtaining both of these from libass. Naturally, there can be no guarantee that the metrics/shape data returned by libass would match VSFilter, or that it would stay the same in future versions of libass (so when e.g. clipping grain to text it could still be a good idea to also convert the text to a shape). Nevertheless, such an API would be better than the current situation, where different authoring tools reimplement libass logic with different degrees of accuracy.
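To make the karaoke use case above a bit more concrete, here is a minimal sketch of what a tool would do with per-glyph advances if libass exposed them: accumulate the advances to get the left offset of each syllable. The function name `syllable_offsets` and its parameters are made up for illustration and are not part of libass or Aegisub, and the sketch assumes the advances already include whatever shaping and spacing the renderer applied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libass or Aegisub.
 *
 * advances: horizontal advance of each glyph on the line (any consistent unit)
 * syl_len:  number of glyphs in each syllable
 * out_left: receives the left offset of each syllable from the line start
 *
 * Returns the total line width, or -1.0 if the syllable lengths do not add
 * up to exactly n_glyphs. */
static double syllable_offsets(const double *advances, size_t n_glyphs,
                               const size_t *syl_len, size_t n_syl,
                               double *out_left)
{
    assert(advances != NULL && syl_len != NULL && out_left != NULL);

    double pen = 0.0;   /* running pen position along the line */
    size_t g = 0;       /* index of the next unconsumed glyph */

    for (size_t s = 0; s < n_syl; s++) {
        if (syl_len[s] > n_glyphs - g)
            return -1.0;            /* syllable runs past the end of the line */
        out_left[s] = pen;
        for (size_t k = 0; k < syl_len[s]; k++)
            pen += advances[g++];
    }
    return g == n_glyphs ? pen : -1.0;
}
```

This is essentially what scripts currently approximate by calling `aegisub.text_extents` on each syllable separately, except that the advances would come from the renderer's own layout.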

### Soft Proposal
In my opinion, instead of having several API functions for different requests (e.g. text metrics, text to shape, possibly additional ones), it is easier (from an API user's perspective, not necessarily from libass's perspective) to have a single function that, analogously to `ass_render_frame`, takes an `ASS_Track` and a timestamp and outputs metrics and shape data for every glyph involved. This data could just be a sensible subset of the respective `GlyphInfo`, i.e. contain metrics like `bbox`, `advance`, `asc`, and `desc`, as well as the `outline` and metadata like `symbol` and the event the glyph belongs to. This would offer a great amount of flexibility (e.g. also allowing users to get the layout of events with changing fonts, font sizes, font spacing, etc.) without too much added complexity (since the user is free to just pass text without any tags to the function). One downside is that it turns certain parts of libass internals into public API, but I imagine that fields like these aren't really at risk of changing internally (and, of course, the API does not need to give any guarantees of the output *values* staying stable, only of their format).
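As a rough illustration of what such a per-glyph record could look like from the user's side, here is a standalone sketch. All names (`SketchRect`, `SketchGlyph`, `sketch_event_advance`, `sketch_from_26_6`) are hypothetical and do not exist in libass. The field selection just follows the list above, the outline is left out for brevity, and the 26.6 fixed-point unit is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounding box; units assumed to be 26.6 fixed point. */
typedef struct {
    int32_t x_min, y_min, x_max, y_max;
} SketchRect;

/* Hypothetical public subset of the internal GlyphInfo.
 * The outline is omitted here; a real API would need some way to expose it. */
typedef struct {
    uint32_t symbol;        /* Unicode code point of the glyph */
    int event_index;        /* which event of the track the glyph belongs to */
    SketchRect bbox;        /* bounding box of the glyph */
    int32_t advance_x;      /* horizontal advance */
    int32_t advance_y;      /* vertical advance */
    int32_t asc, desc;      /* ascender and descender */
} SketchGlyph;

/* Sum the horizontal advances of all glyphs that belong to one event. */
static int32_t sketch_event_advance(const SketchGlyph *glyphs, size_t n,
                                    int event_index)
{
    assert(glyphs != NULL || n == 0);

    int32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (glyphs[i].event_index == event_index)
            total += glyphs[i].advance_x;
    }
    return total;
}

/* Convert a 26.6 fixed-point value to floating-point units. */
static double sketch_from_26_6(int32_t v)
{
    return v / 64.0;
}
```

With a flat array like this, a caller that only wants plain text metrics can pass an untagged event and sum the advances, while a caller that wants per-syllable layout or shapes can walk the same array glyph by glyph.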

As far as I can see, the biggest question is at what point in the rendering chain these `GlyphInfo` values should be returned. My feeling is that, as long as it is feasible to implement in libass, the best option would be to go through the entire rendering process and only "skip" the rasterization step and clip blending. (With this line of thinking, the rasterization could even be performed too, and its output returned if desired, but this would lock even more libass internals into public API.) This way, collision detection would also be part of the resulting metrics. This might be the hardest method to implement, though, since every step after rasterization would need to be modified to also track all the `GlyphInfo` metadata that needs to be returned.

Another option would be to return the data directly before the rasterization step, i.e. around [here](https://github.com/libass/libass/blob/694143b001414f17834812be7ea6dd65397e447c/libass/ass_render.c#L1532). This way, the API would also apply perspective transformations and text stroking (there exist well-working Lua implementations for both of these, but for a simple "bake any ASS event into a shape" function this would be much more convenient). In this case, all events would be treated completely separately, so the function could also just take one single event, ~~the corresponding style~~ a set of styles, and a script header. (When rereading this I realized that only passing the used style does not cover all cases, since the event could contain `{\rStyle2}`.)

There are various other cutting points imaginable, like right before `render_and_combine_glyphs`, but the two above are the ones which I think make the most sense.

---

These are my thoughts so far. Unless someone else offers to, I can probably PR an addition like this one day. For now, I just wanted to see whether you think an API like this would be generally feasible, and whether you have any opinions about the specifics.
