I've been trying to make paths deterministic for a few days, and it is quickly becoming much more difficult than anticipated.
What's a path in vector graphics? My understanding is that it is a formalization of a generic shape, directly influenced by the instructions you would send to a digital milling machine or drawing machine. It is encoded in the SVG format and is the underlying model in many vector graphics programs, because it is also memory efficient (compact and easily vectorizable).
In my line of work, a path is equivalent to a sequence of segments: line(p1, p2), line(p2, p3),
cubic_bezier(p3, p4, p5, p6), where p1 through p6 are points. If you don't know about cubic Béziers,
I won't explain them here, but you can start here. What's notable in
my example is the memory waste: p2, for example, is written twice. A slightly denser representation would be:
begin(p1), line(p2), line(p3), cubic_bezier(p4, p5, p6). Note that now path elements (we don't call them segments
anymore) don't store their starting point, as it would be redundant with the end point of the preceding element. This led us
to create a new type of element: begin.
The final piece of the puzzle is that paths can actually have several sub-paths. This matters if you want a dotted line to still
be represented in code by a single object, or an area to have "holes" (one sub-path for the perimeter, one for each hole).
This leads us to an end path element as well. Here's an example: begin(p1), line(p2), line(p3),
cubic_bezier(p4, p5, p6), end, begin(p7), line(p8), end. Note that some software omits the end keyword, as it
is redundant with the following begin.
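The element model above can be sketched as a small Rust enum. These names and types are illustrative, not the actual ones from my codebase:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum PathElement {
    Begin(Point),                     // starts a new sub-path at this point
    Line(Point),                      // line from the previous point to this one
    CubicBezier(Point, Point, Point), // two control points, then the end point
    End,                              // closes the current sub-path
}

// Since every sub-path starts with a `Begin`, counting them gives the
// number of sub-paths in a path.
fn count_sub_paths(path: &[PathElement]) -> usize {
    path.iter()
        .filter(|e| matches!(e, PathElement::Begin(_)))
        .count()
}
```

Note how only `Begin` and the segment elements carry points, and none of them repeats the previous end point.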
One thing you can notice from all this is that several paths can represent the same drawing: begin(p1), line(p2),
end and begin(p2), line(p1), end are equivalent. If a
computer graphics algorithm computes a path, a user might wonder which of the representations will be returned, or whether
the choice is arbitrary. Hence the word deterministic: my problem is making path generation deterministic. This makes these algorithms
more predictable and testable, and it has performance implications: we might want to update some data only if the path really
changed, something difficult to detect with non-deterministic paths.
What are some algorithms producing paths non-deterministically? As usual, anything depending on hash maps or hash sets. In my line of work, the worst offender is boolean path operations. Those operations usually build a graph of all the intersection points of many paths, then produce results by iterating over the nodes of that graph. A lot of temporary structures and performance hacks rely on hashing in various places. I might do a post about it some day! The point is, they don't reliably produce the same output (though it always describes the same geometry).
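Here's a hypothetical miniature of that problem: iteration order over a `HashSet` is unspecified (and randomized between runs in Rust), so any algorithm that emits results in iteration order is non-deterministic even though the *set* of results is always the same. Sorting the collected items restores a reliable order:

```rust
use std::collections::HashSet;

// Collecting a HashSet into a Vec yields an arbitrary order that can
// change from run to run; sorting afterwards makes the output no longer
// depend on hashing.
fn deterministic_ids(ids: &HashSet<u32>) -> Vec<u32> {
    let mut out: Vec<u32> = ids.iter().copied().collect(); // arbitrary order
    out.sort_unstable(); // reliable order, independent of the hasher
    out
}
```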
Since paths may contain several sub-paths, the problem has two levels: making a single sub-path deterministic, and making the whole path deterministic. A path in its entirety is really a list of sub-paths, so the obvious solution is some kind of "sorting" that produces the list of sub-paths in a reliable order. The order itself is meaningless; it only exists so that the same set of sub-paths always comes out in the same sequence. In the rest of this post I'll talk about "sorting" and "ordering" in this abstract sense.
Applying this sorting/ordering concept to sub-paths can help make them deterministic as well: it allows me to output the "minimal" path among all the representations, according to whatever ordering I put together. This is not that trivial a task: a correct ordering implies it can be used for sorting, and my programming language, Rust, is very picky about what counts as a correct ordering; get it wrong, and sorting might crash the program. Even floating point numbers aren't so easy to sort.
So. Sorting floating point numbers. Correct orderings. What is this?
The Rust standard library documentation explains, in the definition of the `Ord` trait, what a correct mathematical ordering should be, in particular:
From the above and the requirements of PartialOrd, it follows that for all a, b and c:
- exactly one of a < b, a == b or a > b is true; and
- < is transitive: a < b and b < c implies a < c. The same must hold for both == and >.

Mathematically speaking, the < operator defines a strict weak order. In cases where == conforms to mathematical equality, it also defines a strict total order.
This is not just mathematical snobbery: those assumptions underlie most of the sorting algorithms we use every day! A type breaking those assumptions can end up sorted randomly, trigger infinite loops, produce non-signaling false results, and cause all sorts of nonsense. As an aside, Rust used not to check this; then, in a new release, it made the following change: if, while sorting an array, a type was detected not to be strictly ordered, the program would panic (meaning crash). This suddenly broke production code! Kind of a d*ck move tbh, but hey, it did surface some bugs.
More concerning for us, floating point numbers, as defined by the IEEE-754
standard, do not respect this, because of "NaN", a special family of values for which every ordered comparison returns false,
including NaN == NaN. If you'd like to get to the bottom of why, this and this are good starting points.
What this means is that writing [1., 2., 3., 4.].sort() in Rust will not compile! As a workaround, the standard library
provides the total_cmp method on floating point types, which implements the totalOrder predicate from the IEEE-754 standard
itself: a strict total ordering in which NaN values, instead of comparing false with everything, are simply ordered with
respect to every other value.
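In practice it looks like this. A minimal sketch: `sort()` is unavailable on slices of f64 (they only implement PartialOrd), but `sort_by` with `total_cmp` works, and places positive NaN values after everything else (negative NaN before everything else):

```rust
// f64 does not implement Ord, so plain `sort()` will not compile;
// `total_cmp` supplies the missing total order.
fn sort_floats(values: &mut [f64]) {
    values.sort_by(|a, b| a.total_cmp(b));
}
```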
Now that we know how to sort floats correctly, it is more or less trivial to sort points in any dimension: the usual way is a lexicographic sort. It won't have much geometrical meaning, but we don't care about that. This easily extends to an ordering on sub-paths as well: a lexicographic sort over all the points constituting the sub-path!
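Both comparisons can be sketched in a few lines (the `Point` type here is illustrative):

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, Copy)]
struct Point {
    x: f64,
    y: f64,
}

// Lexicographic point comparison: x first, y as tie-breaker, built on
// total_cmp so the result is a total order even in the presence of NaN.
fn cmp_points(a: &Point, b: &Point) -> Ordering {
    a.x.total_cmp(&b.x).then_with(|| a.y.total_cmp(&b.y))
}

// Comparing two sub-paths (seen as point lists) is then lexicographic
// too: the first differing pair of points decides, and a shorter list
// that is a prefix of a longer one compares lower.
fn cmp_point_lists(a: &[Point], b: &[Point]) -> Ordering {
    a.iter()
        .zip(b)
        .map(|(p, q)| cmp_points(p, q))
        .find(|o| *o != Ordering::Equal)
        .unwrap_or_else(|| a.len().cmp(&b.len()))
}
```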
In actuality, one must also use the tags that define the elements, like begin, line and such; they can be used as a tie-breaker after the points (meaning that when comparing 2 sub-paths, we only look at the tags if the 2 point lists are equal). This simplifies algorithms that try to minimize a single sub-path.
Now that we have a strict ordering on sub-paths (lexicographic ordering of their point arrays), making them deterministic requires a way to "minimize" them: to turn a sub-path into the lowest possible sub-path describing the same geometry. It is of course not the only way to make them deterministic, but it unifies nicely with the "sorting" that will happen at the whole-path level.
Now is a good time to ask: what are the different sub-paths describing the same geometry? Let's call them versions of the same sub-path. Here we have to distinguish between "open" and "closed" sub-paths: closed sub-paths form a "loop" with no clear start or end; open sub-paths are the opposite. Naturally, an open sub-path has only 2 versions: itself and its reverse. A closed sub-path, with no clear extremity, still has a starting point in its underlying data representation; that starting point is arbitrary, so many versions of a closed sub-path exist, one for each point that could serve as the start. All these versions' reverses obviously describe the same geometry too.
Naturally, open sub-paths are a lot simpler to deal with, as they have only 2 versions. A naive, wasteful way to minimize an open sub-path is to produce both versions, compare them, and copy the data of the lowest one. Since we don't want to allocate anything, a smarter way is to compare the first and last points: if the last is lower, reverse the sub-path. If they are equal, compare the second and second-to-last points, and so on.
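The allocation-free version can be sketched like this, over a plain slice of points (the `Point` type and comparison are the same illustrative ones as before):

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

fn cmp_points(a: &Point, b: &Point) -> Ordering {
    a.x.total_cmp(&b.x).then_with(|| a.y.total_cmp(&b.y))
}

// Walk inward from both ends; at the first pair that differs, keep or
// reverse the whole sub-path depending on which direction compares lower.
fn minimize_open(points: &mut [Point]) {
    let n = points.len();
    for i in 0..n {
        match cmp_points(&points[i], &points[n - 1 - i]) {
            Ordering::Less => return, // forward version is already minimal
            Ordering::Greater => {
                points.reverse(); // reversed version is lower
                return;
            }
            Ordering::Equal => continue, // tie, look further inward
        }
    }
    // All pairs were equal (palindromic sub-path): both versions coincide.
}
```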
Closed sub-paths can start at any point, so we first iterate through all points to find the minimal one and rotate it into the starting position. Reversing the sub-path may also be necessary, just as for open sub-paths.
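A sketch of that, again over a plain slice of points. Note that this simplified version ignores ties between equal minimal points; a full implementation would break them by comparing subsequent points:

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

fn cmp_points(a: &Point, b: &Point) -> Ordering {
    a.x.total_cmp(&b.x).then_with(|| a.y.total_cmp(&b.y))
}

// In-place canonicalization of a closed sub-path: rotate the minimal
// point into first position, then reverse the rest if traversing the
// loop backwards gives a lower version.
fn minimize_closed(points: &mut [Point]) {
    if points.is_empty() {
        return;
    }
    // Index of the lexicographically smallest point.
    let start = (0..points.len())
        .min_by(|&i, &j| cmp_points(&points[i], &points[j]))
        .unwrap();
    points.rotate_left(start); // minimal point becomes the start, no allocation
    // Both directions share the same first point; compare their second
    // points (points[1] going forward, the last point going backwards).
    if points.len() > 2
        && cmp_points(&points[points.len() - 1], &points[1]) == Ordering::Less
    {
        points[1..].reverse(); // traverse the loop in the other direction
    }
}
```

A nice property to check: the same loop fed in with a different starting point and orientation canonicalizes to the same version.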
We can now zoom out and revisit our initial goal: reliably turning a whole path into a "canonical" version of itself. Paths are just lists of sub-paths, so once each sub-path has been turned into its predictable version, we can simply "sort" them, which will always produce the same result.
The only subtleties remaining are in the way sub-paths are stored and accessed: in my codebase, for example, all points of all sub-paths live next to each other in a single allocation for performance reasons, so a sub-path can't easily be accessed like an element of a list. I had to write special methods to swap sub-paths, and can only use sorting algorithms that "swap" elements. Also, since sub-paths can be big, one should consider a sorting algorithm that minimizes the number of swaps, but that's a conversation for another day.
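As a hypothetical illustration of a swap-only sort, selection sort moves elements exclusively through `swap` and performs at most n - 1 swaps, which fits storage where elements are expensive to move. The `is_less` closure stands in for the sub-path comparison:

```rust
// Generic selection sort: elements only ever move via `swap`, and at
// most n - 1 swaps are performed in total.
fn selection_sort_by_swap<T>(items: &mut [T], is_less: impl Fn(&T, &T) -> bool) {
    for i in 0..items.len() {
        // Find the smallest remaining element...
        let mut min = i;
        for j in i + 1..items.len() {
            if is_less(&items[j], &items[min]) {
                min = j;
            }
        }
        // ...and swap it into place (the only way elements ever move).
        if min != i {
            items.swap(i, min);
        }
    }
}
```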
So that was a fun rabbit hole about producing deterministic paths! This approach is very generic: any algorithm producing any path can be made deterministic via these methods. That said, the most efficient option is probably not to use them at all, but to make the path-producing algorithm itself deterministic whenever possible.
One cool thing is that every operation happens in place: no allocation required! And canonicalizing each sub-path is essentially linear in its number of points (the final sort of the sub-paths is, of course, a regular comparison sort).