The Unbelievable Universe of Unicode

Holy hell, Unicode is fascinating and incredible! I recently found a list I’ve been keeping of my favorite typographical symbols, then got curious and spent hours today scrolling through the complete Unicode character set and reading up on the project. It was an eye-opening experience.

I used the awesome program Ultra Character Map (Mac, $10, well worth it for type and language nerds), which lets you save characters as “favorites”, and I now have WAY MORE favorite characters than I started with.

Some highlights from what I learned:

First, TL;DR on Unicode if you haven’t heard of it before, from the main Unicode website:

The Unicode Standard provides a unique number for every character, no matter what platform, device, application or language. It has been adopted by all modern software providers and now allows data to be transported through many different platforms, devices and applications without corruption. Support of Unicode forms the foundation for the representation of languages and symbols in all major operating systems, search engines, browsers, laptops, and smart phones—plus the Internet and World Wide Web…
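To make the “unique number for every character” idea concrete, here’s a tiny Python sketch. The helper name code_point is just something I made up for illustration; ord() and chr() are Python built-ins that convert between a character and its number.

```python
def code_point(ch: str) -> str:
    """Format a single character's number in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


# A Latin letter, an accented letter, a Chinese character, and an emoji
# (written as escapes so the example does not depend on file encoding).
samples = ["A", "\u00e9", "\u4e2d", "\U0001F951"]

for ch in samples:
    print(ch, code_point(ch), ord(ch))

# chr() goes the other way: from the number back to the character.
print(chr(0x203D))  # the interrobang
```

The number is the same on every platform; how a font draws it is a separate question.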

There are over 136,000 characters in the Unicode Standard (as of version 10.0), which, by the way, is an enormous document: over 1,000 pages long, exhaustively documenting not only the scripts and characters themselves, but also the concepts, guidelines, and standards that make the implementation usable for everyone.

Of these, a little over *half* are the ideographs used to write Chinese, Japanese, and Korean (collectively called CJK), all of which make use of Chinese characters.

Unicode includes not only an astounding array of scripts for writing the world’s languages (currently ~139 scripts) but also a ton of interesting symbols and miscellany. Sifting through the whole lot of it was at times tedious, but also full of small delights and discoveries.

Google has an open source project, Noto, which aims to be a unified font project supporting as many languages as possible. It currently supports 90+ scripts, 300+ languages, and 100k+ characters…not quite all of Unicode, but an impressive portion of it! It’s not exactly a single font, since common font formats like TrueType and OpenType are limited to 65,535 glyphs per font, but a family of fonts that aims to both capture the unique character of each language and maintain compatibility across the entire family.

There are lots of weird aspects of how Unicode works, partly for complex technical reasons and partly because of how it’s evolved over time. This article by Ben Frederickson lays out some interesting tidbits, for example:

  • Unicode includes code points for the ancient Minoan ‘Linear A’ script…which hasn’t actually been deciphered yet…meaning “there are characters in Unicode that no-one knows what they actually represent!”
  • Unicode has a fair amount of redundancy — for reasons of both semantic distinction and legacy compatibility, there are “lots of different characters that are visually identical to one another. As an example, the letter ‘V’ and the Roman Numeral Five character (U+2164) look identical in most fonts.”
  • I won’t include details here, but there are lots of unexpected behaviors that can happen when manipulating Unicode characters in programming applications…
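Here’s a small Python sketch of the look-alike problem and one of those unexpected behaviors, using the standard unicodedata module. The helper same_after is my own name, not anything official, and this only scratches the surface.

```python
import unicodedata


def same_after(form: str, a: str, b: str) -> bool:
    """Compare two strings after applying the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


latin_v = "V"          # U+0056 LATIN CAPITAL LETTER V
roman_five = "\u2164"  # U+2164 ROMAN NUMERAL FIVE

# They look the same in most fonts but are different characters.
print(latin_v == roman_five)                    # False
# Compatibility normalization (NFKC) folds the numeral into the letter.
print(same_after("NFKC", latin_v, roman_five))  # True

# The same visible letter can also be stored in two different ways.
composed = "\u00e9"     # single code point: e with acute
decomposed = "e\u0301"  # plain e followed by a combining acute accent

print(len(composed), len(decomposed))           # 1 2
print(composed == decomposed)                   # False
print(same_after("NFC", composed, decomposed))  # True
```

So a naive string comparison or length check can surprise you; normalizing first is the usual way to avoid that.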

One important thing the experience of browsing through Unicode made me realize:

What constitutes either a given language, or “language” in general, isn’t static and well-defined. Unicode is evolving, and probably always will be, ever playing catch-up to the realities of how human language is used. I learned there’s a distinction between a “language” and a “script”. For example, Latin and Chinese characters make up distinct scripts, but those scripts are each used by a wide variety of languages. Also, there are all sorts of things in Unicode that I wouldn’t consider part of any language, but also kind of are in a way — things like sets of mathematical symbols, or visual representations of Braille characters, or the universal symbol language we all know and love…emoji!

The CJK characters go on seemingly forever: cascades of characters, grouped first by radical (which, from my very limited understanding, is the component a character is indexed under in dictionaries, and often hints at its meaning), then by increasing number of strokes, which made the screen ebb and flow with waves of complexity as I scrolled.

So many languages I never knew existed. So many beautiful, elegant characters in these languages. So many unexpected things where I’d love to know the story of how they made it into Unicode. So many questions on how this whole thing will continue to evolve…

A few remarkable things you probably had no idea Unicode included:

  • Hexagrams (and trigrams and other *grams)
  • “Punctuation lotuses”, “poetry marks” and tons of other symbols I’d never even heard of
  • Mahjong and domino tiles; playing cards
  • Alchemical symbols (a whole section — very cool!)
  • “Private use areas”: ranges that Unicode promises never to assign, so that fonts and applications can define their own characters there
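If you want to poke at a few of these yourself, here’s a rough Python sketch using the standard unicodedata module. The helper is_private_use is a name I invented; the idea is that private use code points carry the general category "Co" and have no official name.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True if the character's general category is Co (private use)."""
    return unicodedata.category(ch) == "Co"


# A few of the oddities from the list above, by code point.
oddities = ["\u4dc0", "\U0001F000", "\U0001F0A1"]
for ch in oddities:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# U+E000 is the first code point of the main private use area.
pua = "\ue000"
print(is_private_use(pua))                      # True
print(unicodedata.name(pua, "(no name)"))       # falls back to the default
```

What a private use character looks like depends entirely on the font or application that defines it.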

Cuneiform is particularly awesome: super visually dynamic. I hadn’t seen these characters before; some of them unfold like a maze of fractals! Other scripts like Arabic and Chinese have some super complex characters, but these cuneiform characters are on a whole other level.

Unicode also includes a whole set of dope hieroglyphs. These, I realize, are basically the original emoji, but with a really shocking quantity of bird representations.

The whole of Unicode is organized into a complex structure of blocks (named, contiguous ranges of code points), and given all the legacy structure it’s evolved with, it’s no surprise this organization doesn’t always make sense. For example, there isn’t one single “emoji” section, ordered like how you’d find them on an iPhone keyboard. Rather, some emoji are found in the “Miscellaneous Symbols and Pictographs” block. Faces and related ones belong in the “Emoticons” block; others in “Transport and Map Symbols”. And the block called “Supplemental Symbols and Pictographs” is home to most of the newer emoji, for example the beautifully rendered avocado.
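To see how scattered the emoji are, here’s a minimal Python sketch. Python’s standard library doesn’t expose block names as far as I know, so the BLOCKS table below is hand-typed and covers only the four blocks mentioned above; block_of is a hypothetical helper, not a real API.

```python
# Hand-written (start, end, name) ranges for just four blocks.
BLOCKS = [
    (0x1F300, 0x1F5FF, "Miscellaneous Symbols and Pictographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
    (0x1F680, 0x1F6FF, "Transport and Map Symbols"),
    (0x1F900, 0x1F9FF, "Supplemental Symbols and Pictographs"),
]


def block_of(ch: str) -> str:
    """Look up which of the listed blocks a character falls in."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "(not in this small table)"


# Cactus, grinning face, rocket, avocado: neighbors on an emoji
# keyboard, but spread across four different blocks.
for ch in ["\U0001F335", "\U0001F600", "\U0001F680", "\U0001F951"]:
    print(ch, f"U+{ord(ch):04X}", block_of(ch))
```

A real tool would load the full block list from the Unicode data files instead of a four-row table.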

As I initially mentioned, this whole excursion came about as I began to explore the actual names and typographical representations of favorite characters and symbols I’ve come across — things like the pilcrow (¶), schwa (ə), asterism (⁂), komejirushi or reference mark (※), interrobang (‽), fermata (𝄐) and more.
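For fellow nerds, here’s a quick Python sketch that looks up the official names of those six favorites with the standard unicodedata module. The describe helper and the FAVORITES string are just my own illustration.

```python
import unicodedata

# Pilcrow, schwa, asterism, reference mark, interrobang, fermata
# (written as escapes so the example does not depend on file encoding).
FAVORITES = "\u00b6\u0259\u2042\u203b\u203d\U0001d110"


def describe(ch: str) -> str:
    """Return 'U+XXXX OFFICIAL NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in FAVORITES:
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    print(ch, describe(ch), f"utf-8 bytes: {utf8_bytes}", f"utf-16 units: {utf16_units}")
```

The fermata is the odd one out: it lives outside the first 65,536 code points, so it takes four bytes in UTF-8 and two code units in UTF-16.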

I want to figure out some fun side project sort of thing to continue researching and celebrating all these fascinating and beautiful characters and symbols. I have a couple ideas in mind to play with! I may continue adding more stuff here as I learn…