Figuring out how to develop a Unicode collator from scratch for a research group that I was working with in Berlin was one of my formative experiences as a programmer. Ever since then, I've wanted to write something to collect my thoughts on the Unicode Collation Algorithm and the process of building a conformant implementation. Last summer I had a good excuse to do this, when I decided to adapt my collator to Zig as a way of learning that language.
The Unicode standards, and the (relatively) low-level software libraries based on them, do a lot of things for us to make computing possible. We have the luxury of not needing to worry about most of those things most of the time. I find it humbling whenever I do peek under the hood.
Probably a naive question, but: couldn't you precompute some vector representation of the string once, and reduce collation to a vector comparison? Basically move the cost upfront and get back to the "fast" byte-comparison case?
Well, that's sort of like what the "sort key" of a given string is: a vector of its nonzero primary-level collation weights, then secondaries, then tertiaries... And once you know what collation options you're using (tailoring etc.), you could just compute the sort key of each string as you encounter it and cache that. Then you would be able to reach a collation decision rapidly for any two strings whose sort keys you already have. It does boil down to vector comparison at that point.
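To make that concrete, here's a minimal sketch in Rust of the sort-key idea and the final comparison. It is not the UCA itself: the (primary, secondary, tertiary) weight triples are assumed to have already come from some table lookup, and real collation elements, contractions, variable weighting, etc. are all left out.

    // Sketch only: assume each character has already been mapped to a
    // (primary, secondary, tertiary) weight triple by some table lookup.
    fn sort_key(weights: &[(u16, u16, u16)]) -> Vec<u16> {
        let mut key = Vec::new();
        // Level 1: every nonzero primary weight, in order.
        key.extend(weights.iter().map(|w| w.0).filter(|&p| p != 0));
        key.push(0); // level separator, so a prefix sorts before its extensions
        // Level 2: secondaries.
        key.extend(weights.iter().map(|w| w.1).filter(|&s| s != 0));
        key.push(0);
        // Level 3: tertiaries.
        key.extend(weights.iter().map(|w| w.2).filter(|&t| t != 0));
        key
    }

    // With both sort keys in hand, collation really is just a vector comparison.
    fn collate(a: &[u16], b: &[u16]) -> std::cmp::Ordering {
        a.cmp(b)
    }

The important part is that all of the primaries come before any of the secondaries: a primary difference anywhere in the string has to outweigh every accent or case difference, which is why the levels can't simply be interleaved per character.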
I experimented with adding an LRU cache to my Rust UCA implementation, but I saw essentially no performance benefit (on the workloads I had), and I decided the feature wasn't worth the complexity and removed it.
Something I found about Unicode collation is that, once the fast paths are added, they get hit a surprisingly large percentage of the time. I'm thinking in particular of the way that performant UCA implementations build sort keys lazily, stopping once a collation decision is reached. The average "point of difference" is at the primary level and within a few characters of the start of each string. Only a small portion of the sort keys ever get built.
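Roughly, the early-exit comparison has this shape (again with the same assumed weight triples standing in for real collation elements; an actual collator streams them rather than taking pre-split slices):

    use std::cmp::Ordering;

    fn collate_lazy(a: &[(u16, u16, u16)], b: &[(u16, u16, u16)]) -> Ordering {
        // Level 1: compare primaries in lockstep. Iterator::cmp stops at the
        // first difference, which in practice is within the first few elements.
        let level1 = a.iter().map(|w| w.0).filter(|&p| p != 0)
            .cmp(b.iter().map(|w| w.0).filter(|&p| p != 0));
        if level1 != Ordering::Equal {
            return level1; // decided without touching secondaries or tertiaries
        }
        // Level 2: only reached if the primaries tie.
        let level2 = a.iter().map(|w| w.1).filter(|&s| s != 0)
            .cmp(b.iter().map(|w| w.1).filter(|&s| s != 0));
        if level2 != Ordering::Equal {
            return level2;
        }
        // Level 3: the rare case where both earlier levels tie.
        a.iter().map(|w| w.2).filter(|&t| t != 0)
            .cmp(b.iter().map(|w| w.2).filter(|&t| t != 0))
    }

In the common case the first comparison returns after a handful of elements, and the secondary and tertiary weights are never looked at, which is why caching full sort keys buys so little.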
I am definitely interested in finding more ways to avoid work in the collation routine. Many times, I've had what I thought was a clever idea and found that it didn't pan out in benchmarks. Thank you for your comment!
Ah ok, the lazy construction part is what I was missing: if you basically never build the full key, there's nothing to cache. Makes sense now why the LRU didn't help. I'll think about it over the next few days.