Dumb Unicode question:
I know that 'code points' are not characters / glyphs. They include control characters etc.
Does there exist any standardised abstraction within Unicode, or systems handling Unicode, for anything approximating 'actual glyph'?
Eg: a UTF-8 string, or a series of 32-bit codepoints, that together unambiguously define a visual glyph?
Are there standard ways of isolating and dealing with such a thing, which is roughly the equivalent of 'character'?
@natecull Elixir does this. It refers to the final represented glyphs as `graphemes`.
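(The standardised abstraction here is the "extended grapheme cluster" from Unicode's UAX #29 segmentation rules, which is what Elixir's grapheme functions return. Python's stdlib doesn't implement full UAX #29, but a rough sketch of the idea, using combining-mark detection only:)

```python
import unicodedata

def naive_graphemes(s):
    """Very rough approximation of grapheme clusters:
    attach combining marks (categories Mn/Mc/Me) to the
    preceding base character. NOT full UAX #29 -- it
    ignores ZWJ emoji sequences, Hangul jamo, etc."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = "e\u0301"                # 'e' + COMBINING ACUTE ACCENT
print(len(s))                # 2 codepoints...
print(len(naive_graphemes(s)))  # ...but 1 'user-perceived character'
```

For the real thing you'd reach for a library that implements the UAX #29 rules rather than this approximation.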
@cooler_ranch It seems like we need a WHOLE lot of new 'string normalisation' standards for Unicode. To detect canonical forms of visually-equivalent characters.
sorta like 'lowercasing' a string, but for non-Latin chars.
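(Unicode's existing normalisation forms, NFC/NFD/NFKC/NFKD, already cover part of this, and `casefold()` is roughly the "lowercasing for all scripts" operation. A quick sketch with Python's stdlib:)

```python
import unicodedata

precomposed = "\u00e9"       # é as a single codepoint
decomposed  = "e\u0301"      # e + combining acute accent

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# NFKC additionally folds 'compatibility' variants, e.g. the
# fullwidth Latin letters used in CJK text:
print(unicodedata.normalize("NFKC", "\uff21"))  # 'A' (from fullwidth 'Ａ')

# casefold() goes further than lower() for non-Latin-friendly matching:
print("Straße".casefold())  # 'strasse'
```

That said, these forms only handle decompositions Unicode itself records; visually-equivalent characters across scripts (confusables) are a separate, messier problem.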
I like how the Zhongwen website works, showing hierarchies of Chinese characters. That makes the relationship between primary symbols and their compositions evident.
I've thought that perhaps Unicode shouldn't really be a table, but a trie.
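(A toy sketch of the trie/tree idea: store each character's decomposition as a node and walk it down to primitive components. The composition operator ⿰ below is a real Ideographic Description Character, but the table itself is illustrative, not an authoritative decomposition database:)

```python
# Each entry: character -> (composition operator, subcomponents).
decomp = {
    "好": ("⿰", ["女", "子"]),   # left-right composition
    "明": ("⿰", ["日", "月"]),
}

def flatten(char, table):
    """Walk the decomposition tree down to primitive components."""
    if char not in table:
        return [char]          # a leaf: no further decomposition
    _, parts = table[char]
    out = []
    for p in parts:
        out.extend(flatten(p, table))
    return out

print(flatten("好", decomp))   # ['女', '子']
```

A full version would nest much deeper (components of components), which is exactly where a table-shaped standard gets awkward and a tree feels natural.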
@h @cooler_ranch Yeah, Zhongwen is interesting; though it's a different decomposition to CJK-DECOMP, and I think different again to Unicode Ideographic Description Sequence decomposition. And it's based on Traditional characters, so not very useful for the Mainland.
For simplicity, I'm focusing on Simplified chars, but I'm aware that Traditionals have their own unique decompositions.
Once again I *wish* I knew the provenance of the CJK-DECOMP data. It's just sorta.. there. 80,000 chars worth.
Just suggesting that a similar method (though not taking the Zhongwen literally, only as inspiration) would be a good approach.
For Chinese, but also for other languages where composition is prevalent. It's less of an issue for languages built on the Latin alphabet, but it's still true for many European languages that use a lot of accents.
And we keep building tables with only one nesting level (provided by escape sequences), instead of trees with proper hierarchies.
@h @cooler_ranch What would be super cool would be if we had sort of standardised 'rich character' objects, where each character (glyph/grapheme/rune) was an object containing a whole bunch of metadata:
* unicode codepoint (or series of codepoints)
* type/language of character
* alternate visual forms
* the canonical/simplest form
* decomposition
* case
* etc
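(Some of that metadata is already queryable from Python's stdlib; a sketch of what such a 'rich character' record might look like, with fields like alternate visual forms omitted because they'd need external databases such as Unihan or CJK-DECOMP:)

```python
import unicodedata
from dataclasses import dataclass

@dataclass
class RichChar:
    """Sketch of a 'rich character' record, filled from the
    metadata Python's unicodedata module exposes."""
    char: str
    codepoint: str
    name: str
    category: str       # e.g. 'Ll' = letter, lowercase
    decomposition: str  # canonical/compatibility decomposition, if any

    @classmethod
    def of(cls, ch):
        return cls(
            char=ch,
            codepoint=f"U+{ord(ch):04X}",
            name=unicodedata.name(ch, "<unnamed>"),
            category=unicodedata.category(ch),
            decomposition=unicodedata.decomposition(ch),
        )

print(RichChar.of("é"))
# e.g. name='LATIN SMALL LETTER E WITH ACUTE', category='Ll',
#      decomposition='0065 0301'
```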
At the moment, most of that important data is scattered across separate databases and not always available to text-parsing tools.
@cooler_ranch @h and then this links to parsing, too.
I probably need to beef up my regex-fu - but is there a way in regexes to *name* patterns and reference them by name?
Eg, just looking visually at, say, https://github.com/cjkvi/cjkvi-ids/blob/master/ids-analysis.txt
I can see I could parse this as a sequence of TEXT_ROW, where each TEXT_ROW contains
UNICODE_ID WHITESPACE CHINESE_CHARACTER WHITESPACE IDS_SEQUENCE WHITESPACE CHINESE_CHARACTER_SEQUENCE WHITESPACE NUMERIC_ID_INCLUDING_AT_SYMBOL etc
Can I express this in regex?
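(Yes, mostly. Python-flavoured regexes have named groups, `(?P<name>...)`, and you can compose larger patterns out of named string fragments. The sample line below is illustrative, so check the field patterns against the real ids-analysis.txt format:)

```python
import re

# Named sub-patterns composed as plain strings -- one answer to
# 'can I name patterns and reference them by name'.
UNICODE_ID = r"(?P<unicode_id>U\+[0-9A-F]{4,6})"
CHAR       = r"(?P<char>\S)"
IDS_SEQ    = r"(?P<ids>\S+)"

ROW = re.compile(rf"{UNICODE_ID}\s+{CHAR}\s+{IDS_SEQ}")

m = ROW.match("U+597D\t好\t⿰女子")   # hypothetical row
print(m.group("unicode_id"), m.group("char"), m.group("ids"))
# U+597D 好 ⿰女子
```

Within one pattern you can also backreference a named group with `(?P=name)`. True grammar-style reuse of named rules is what parser combinators or PEG grammars give you; in plain `re` you compose by interpolating pattern strings, as above.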
@natecull @h The interactive regex tools available online are really helpful when forming these things. I used Rubular (http://rubular.com/) and Elixre (http://www.elixre.uk/) and like both of them quite a bit, but there are others available in most languages.
I think some regex implementations support composable regexes, but that might be my own wishful thinking.