32-bit glyph IDs: what and why #10

Open
PeterConstable opened this issue Sep 16, 2020 · 16 comments

@PeterConstable

PeterConstable commented Sep 16, 2020

This is to start some high-level discussion regarding the possibility of supporting 32-bit glyph IDs, but in (high-level) terms of what it would entail and potential reasons / benefits for it.

Benefits (why)

CJK

A primary motivation for 32-bit GIDs that's been around for some time is for CJK fonts: there are more than 64K Han ideographs encoded in Unicode, so currently it's not possible to create a single font to support even default glyphs for all Han ideographs. (Then there are all the other characters needed in typical CJK fonts, "horizontal" (market-specific) variants, glyphs for Ideographic Variation Sequences, etc.)

Up to now, it's been necessary to split the characters/glyphs into separate fonts, e.g., Simsun and Simsun-ExtB. Such font pairs can potentially be packaged together into a single TTC file, but that addresses only one limitation. Bigger problems pertain to the relevant characters being divided between two font names (in fact, two font families). When authoring a document, the font applied to CJK text might have to change back and forth. An app could perhaps handle that automatically... if there were a way for it to know how. Nothing in a font currently can indicate that it's a unicode-range complement to some other font. (In Web content, @font-face rules do provide a way to work around this in CSS; but there's nothing like that in DTP/word-processing apps generally.) So, in the general case, it's up to the user to handle.
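
(As a rough illustration of the gap: a minimal sketch, assuming fontTools is available and using placeholder file names, of the kind of cmap-coverage heuristic an application would have to invent on its own, since nothing in either font declares the pairing.)

```python
from fontTools.ttLib import TTFont

def cmap_coverage(path):
    """Return the set of Unicode code points mapped by a font's preferred cmap."""
    return set(TTFont(path).getBestCmap().keys())

# Placeholder file names; nothing in either font declares that they are
# meant to be used as a complementary pair.
base = cmap_coverage("SimSun.ttf")
extb = cmap_coverage("SimSun-ExtB.ttf")

print("only in base font:", len(base - extb))
print("only in ExtB font:", len(extb - base))
print("covered by both:  ", len(base & extb))
```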

A further issue is that there cannot be any OpenType Layout (OTL) interaction between glyphs in different fonts. So, for example, if a spacing adjustment is needed in certain contexts, a lookup for that context can't be created if the glyphs come from two different fonts.

Implementation scenarios requiring many glyph IDs

There are some scenarios in which implementation requires extra glyph IDs.

For example, in some OTL font implementations, it might be necessary to have a GSUB derivation in successive lookups, in which earlier lookups map glyph sequences into transient sequences that are then mapped to final sequences in later lookups. In those middle sequences, virtual glyphs (GIDs without real glyph data) might be used to distinguish the transitional states.
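
(A minimal conceptual sketch of that derivation pattern, in plain Python rather than actual GSUB binary structures; the glyph names and ID values are invented. The point is that every distinct transitional state consumes a glyph ID.)

```python
# Conceptual sketch only: real GSUB lookups are binary structures, but the
# ID-consumption pattern is the same. Glyph names and ID values are made up.
KA, VIRAMA, SSA = 10, 11, 12      # "real" glyphs with outlines
KSSA_STATE = 60001                # virtual glyph: no outline, just a state marker
KSSA_FINAL = 40321                # hypothetical final conjunct glyph

def early_lookup(gids):
    """Earlier pass: collapse a matched sequence into a transient state glyph."""
    out, i = [], 0
    while i < len(gids):
        if gids[i:i + 3] == [KA, VIRAMA, SSA]:
            out.append(KSSA_STATE)
            i += 3
        else:
            out.append(gids[i])
            i += 1
    return out

def late_lookup(gids):
    """Later pass: resolve every transient state glyph to a final glyph."""
    return [KSSA_FINAL if g == KSSA_STATE else g for g in gids]

print(late_lookup(early_lookup([KA, VIRAMA, SSA])))   # -> [40321]
```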

Colour font implementations are another scenario in which many GIDs may get consumed. For example, a colour depiction for an emoji character might be composed of many elements. In a COLR implementation, graphic elements can be combined into a single glyph if they have exactly the same fill and can be placed in the same layer of a z-order stack. Otherwise, different glyph IDs are required. If the colour depiction requires 10 different fills, then at least 10 GIDs will be needed.
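
(A hedged sketch of that GID cost using fontTools' colorLib builder, assuming the layer glyphs already exist in the font; the glyph names are invented.)

```python
from fontTools.colorLib.builder import buildCOLR, buildCPAL

# Hypothetical glyph names: a base glyph plus ten layer glyphs, one per fill.
layers = [("emoji_smile.layer%d" % i, i) for i in range(10)]  # (glyph name, palette entry)
colr = buildCOLR({"emoji_smile": layers})   # COLR v0: one layer record per fill
cpal = buildCPAL([[(i / 10, 0.4, 0.6, 1.0) for i in range(10)]])  # one 10-colour palette

print(len(layers) + 1, "glyph IDs consumed by a single colour emoji")
```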

Pan-Unicode fonts / fallback

In the 1990s, early in the Unicode era, some pan-Unicode fonts began to appear. Two well-known fonts were Arial Unicode MS and James Kass' Code2000. In the 2000s, interest in pan-Unicode fonts waned for various reasons, such as greater availability of other fonts covering the many scripts in Unicode. The 64K glyph limit may also have been a contributing factor.

Pan-Unicode fonts can still be useful, however, particularly for font fallback. A pan-Unicode font can provide fallback for all Unicode characters with a coherent design, and without needing to handle many separate fonts.

What

The simple idea is that glyph IDs would all become 32-bit. But this should be unpacked a bit.

By far, the biggest hurdle for breaking the 64K glyph limit is in OTL tables: to support 32-bit GIDs in GPOS, GSUB and GDEF would entail a very major change affecting many dozens of structures. That would require a large engineering investment for implementers.

GIDs are also used in cmap, COLR and MATH. These wouldn't be nearly as difficult to rev: the 'cmap' table is extensible and new subtable formats can easily be defined. For COLR or MATH, there are only a small number of table-internal structures that would be affected.
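
(To make "new subtable formats can easily be defined" concrete, here is one hypothetical shape such a subtable could take; the format number, layout and field names below are invented for illustration, not part of any specification.)

```python
import struct

def pack_wide_cmap_subtable(groups):
    """Pack a hypothetical cmap subtable: the same sequential-map groups as
    format 12, except that startGlyphID values above 0xFFFF are expected and
    legal. The format number (16) and layout are invented for illustration."""
    length = 16 + 12 * len(groups)
    header = struct.pack(">HHLLL", 16, 0, length, 0, len(groups))
    body = b"".join(struct.pack(">LLL", start, end, gid)
                    for start, end, gid in groups)
    return header + body

# One group mapping U+20000..U+2A6DF (Extension B) to GIDs starting at 70000.
blob = pack_wide_cmap_subtable([(0x20000, 0x2A6DF, 70000)])
print(len(blob), "bytes")
```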

The 'glyf' table is not limited in the number of glyphs for which data is included, except in that the table length (in the font's table directory) is a 32-bit value. Only the 64K glyph limit has been identified as a problem, not the size of glyph data. The only issue for 'glyf' would be in relation to component GIDs within composite glyph descriptions. There are a few composite flag bits left, so there may be a way to leverage flags to extend to 32-bit GIDs.
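
(A sketch of how a reserved composite flag bit might be leveraged; the WIDE_GLYPH_INDEX flag and its meaning are invented for illustration, not a real OpenType flag.)

```python
import struct

ARG_1_AND_2_ARE_WORDS = 0x0001          # real flag: component args are 16-bit
ARGS_ARE_XY_VALUES = 0x0002             # real flag: args are x/y offsets
WIDE_GLYPH_INDEX = 0x2000               # invented flag: glyphIndex is 32-bit

def pack_component(flags, glyph_index, dx, dy):
    """Pack one composite-glyph component record, widening glyphIndex to
    32 bits when the (invented) WIDE_GLYPH_INDEX flag is set."""
    fmt = ">HL" if flags & WIDE_GLYPH_INDEX else ">HH"
    return struct.pack(fmt, flags, glyph_index) + struct.pack(">hh", dx, dy)

flags = ARG_1_AND_2_ARE_WORDS | ARGS_ARE_XY_VALUES | WIDE_GLYPH_INDEX
print(pack_component(flags, 70000, 120, 0).hex())
```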

The 'loca' table formats have an inherent glyph limit of 2^28 glyph IDs (the maximum number of elements in the offsets[] array, given the maximum table length in the TableDirectory). The 'hmtx' table is similarly limited to 2^28 long metrics. Presumably that is more than enough. For the 'loca' and 'hmtx' tables, the only real limitation (for practical purposes) is maxp.numGlyphs being uint16.
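
(A quick way to see that practical ceiling in an existing font, assuming fontTools is installed; the font path is a placeholder.)

```python
from fontTools.ttLib import TTFont

font = TTFont("SomeLargeCJKFont.ttf")      # placeholder path
num_glyphs = font["maxp"].numGlyphs        # stored as a uint16
print(num_glyphs, "glyphs; headroom below the 64K ceiling:", 0xFFFF - num_glyphs)
```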

I'll assume that glyph names in production fonts with >64K glyphs are not a goal.

Compatibility: transitional, or none?

Technical design for supporting 32-bit GIDs could take one of two approaches toward compatibility with existing software implementations:

  • It is not a goal for new fonts to have any functionality in existing applications.
  • It is a goal that new fonts would have some level of functionality in existing applications, for a subset of the characters supported by the font, but not all.

(These are comparable to what was decided for variable fonts wrt TrueType vs. CFF outlines: a transitional compatibility for TT, but no compatibility for CFF.)

What goals?

Key questions to ask are what goals should be set, and what the nature of the goal is: a technical goal? a user-experience goal? a business outcome goal?

Obviously, a technical goal would be that fonts can be made that could support all Unicode 13 (or 14...) characters; or that complex COLR/whatever implementations never have to worry about running out of GIDs.

But then, why? What opportunities will be created? What demonstrably-significant UX or business problems will be solved? (E.g., in Windows, Simsun and Simsun-ExtB have existed side-by-side for 15 years, but I don't recall ever hearing that this was causing particular problems for anyone or anything.)

@behdad

behdad commented Sep 16, 2020

By far, the biggest hurdle for breaking the 64K glyph limit is in OTL tables: to support 32-bit GIDs in GPOS, GSUB and GDEF would entail a very major change affecting many dozens of structures. That would require a large engineering investment for implementers.

Not really. Everything is relative of course...

By far the biggest hurdle for breaking the 64K glyph limit is the broken process for contributing to OFF as monopolized by you, Peter while at Microsoft and continuing to do so. Until you apologize for your 2016 intimidating email (https://gist.github.com/behdad/d08f958f8a5e2cb6badf6e32598427df) and commit to addressing the problems you have created in this process, I don't see how we can make anything happen.

@HinTak

HinTak commented Sep 17, 2020

I wouldn't want to comment on other aspects/usages of 64k+ glyphs, but as far as CJK is concerned, while it is true that there are more than 64k Han ideographs, the typical educated person only uses about 3k of them; e.g., newspapers can operate quite happily with just a specific set of ~4K glyphs. Many of them are rarely used historical forms, etc. Few people complain about the Simsun / Simsun-ExtB split, because few actually need or use the latter.

Perhaps some ideas involving split tables / sub-tables of up to 64k glyphs, perhaps by Unicode ranges, would be useful and less disruptive?

@rsheeter

I currently have four reasons to want to break the 64k limit, enumerated @ https://rsheeter.github.io/more_gids.

@HinTak

HinTak commented Sep 17, 2020

Also, on subsetting, I have always held the view that the Japanese people did it right: they sorted the JIS encoding based on frequency of usage, and essentially have a contiguous 3k block of "frequently used characters" and another contiguous 3k block of rarely used characters.

The older Chinese encodings were essentially in dictionary order (or ordered by character "complexity"). It is like a kitchen sink: there are extremely rarely used characters in the middle, and that got inherited by the Unicode encoding.

@alerque
Contributor

alerque commented Sep 17, 2020

@HinTak Glyph IDs are a different issue than content encodings (Unicode or otherwise). Fonts provide a map between these two separate things, and one of the jobs of a shaper is to look up what glyph(s) to use for given codepoint(s) in the input. How the input is encoded is completely out of scope for this issue, and how glyph IDs are organized is completely up to a given font and hence also irrelevant to this issue. The only point here is the theoretical cap on how many glyphs can be stuffed into a single font file.
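
(A small sketch of that separation, assuming uharfbuzz is installed and using a placeholder font path: the code points that go into the shaper have no relationship to the glyph IDs that come out.)

```python
import uharfbuzz as hb

# Placeholder font path; any CJK font will do for the demonstration.
with open("SomeCJKFont.ttf", "rb") as f:
    face = hb.Face(hb.Blob(f.read()))
font = hb.Font(face)

buf = hb.Buffer()
buf.add_str("\u4e00\U00020000")      # one BMP ideograph, one Extension B ideograph
buf.guess_segment_properties()
hb.shape(font, buf)

# The glyph IDs that come out follow the font's own ordering; they have no
# relationship to the Unicode code point values that went in.
print([info.codepoint for info in buf.glyph_infos])
```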

@HinTak

HinTak commented Sep 17, 2020

@alerque encoding (and unfortunately kitchen-sink-style encoding...) matters, because a large part of the argument for large glyph IDs is CJK/Han usage. As I mentioned at the beginning, that's the part I am commenting on.

I seem to remember that the Unicode CJK submission was based on, or often cited, the Kangxi dictionary, an 18th-century encyclopedia of Chinese characters compiled by imperial China at its height. While it is a piece of indisputable scholarship, it is also a kitchen sink, in a way, like any encyclopedia: full of rare and outdated items.

@PeterConstable
Author

encoding (and unfortunately kitchen-sink-style encoding...) matters, because a large part of the argument for large glyph IDs is CJK/Han usage.

@HinTak Encoding certainly is important... though it's not clear to me what bearing it has on the number of available glyph IDs. The ordering of code point positions for characters and the ordering of glyphs in a font are independent. So, for this topic, it doesn't matter how CJK characters were assigned to code positions in Unicode or any other encoding. I agree with Caleb in this regard. What does matter, in terms of CJK characters, is how many of them someone wants to support in a font.

@PeterConstable
Author

@rsheeter

I currently have four reasons...

I see three scenarios. Your fourth point gives a mitigation (both relevant and valuable) to concerns about really large fonts and the Web or, potentially, devices with limited storage but connectivity.

Pan-Unicode fonts were of interest in the 1990s, but then fell out of interest. In part, that may have been because of file size (Arial Unicode MS is 22MB), but I think bigger issues were:

  • Coercing all scripts in Unicode into a single design doesn't make good design sense.
  • There's actually not much need in practice for pan-Unicode fonts.

Now, while those points are valid, "not much need" isn't the same as "no need". Font fallback that covers all Unicode characters is a valid and important need. A pan-Unicode font can certainly make sense for that purpose. And for UI/fallback purposes, uniformity in design makes sense.*

(* In a universe with unlimited resources, different fallback fonts with designs matched to the primary UI language might be desirable in some situations.)

But fallback doesn't require a single, pan-Unicode font. You mentioned 150+ Noto fallback fonts. The current 64K glyph limit isn't causing there to be 150+ fonts. Rather, it's a design choice to have one script per font. The count could be reduced to a handful even with a 64K limit.

So, I think pan-Unicode fonts can be considered another potential use for 32-bit GIDs. But I'm not convinced they strongly add to any business case for 32-bit GIDs.

@HinTak

HinTak commented Sep 17, 2020

Encoding matters when a font or a font vendor claims to be pan-Unicode, etc. Imagine a world where the CJK code blocks are called CJK level 1, 2, sup 1, sup 2, sup 3, sup 4, sup 5 (totalling 21k), more like the Japanese JIS model of 3k each, instead of a single 21k block of CJK unified ideographs plus another Extension B taking up 43k.

If you only claim to support CJK level 1/2, the normal daily usage, there is a lot more room for other glyphs. The extra 15k in CJK unified ideographs over JIS levels 1 and 2 is, in a way, dead weight, as those characters don't have shaping requirements for GPOS/GSUB and purely take up code space.

@PeterConstable
Author

I see what you're saying: if CJK characters were organized in "levels" based on frequencies or contexts of usage, that would allow a font with a smaller character (hence glyph) repertoire to meet the needs of particular scenarios.

Ordering of characters in the encoding could make it easier to define sets of characters or to assign CJK characters into levels, but it doesn't enable or prevent it.

@HinTak

HinTak commented Sep 17, 2020

@PeterConstable yes, that's my idea. Unfortunately the CJK inclusion in Unicode was less usage-driven than politics/marketing-driven, and instead of having 4-6k glyphs sufficient for daily usage, plus extensions, we have a system where one has to carry 15k glyphs which nobody uses, to be able to claim to support basic CJK.

What are the technical requirements for having 32-bit glyph IDs? I suppose many of the tables need to have new versions with expanded fields, and there need to be software prototypes that support the expanded fields.

Perhaps asking a different question: how about finding a way of letting font vendors claim some kind of "CJK daily usage" subset compliance? If one loses the dead weight of trying to cover 21k and instead covers, say, 6k, one gets a lot of ID space back for other kinds of maneuvers like shaping and color fonts. (I know this perhaps requires government co-operation, so probably not going to happen...)
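
(A hedged sketch of how such a "daily usage" cut could be produced with fontTools' subsetter; the code-point list and file names are placeholders.)

```python
from fontTools import subset
from fontTools.ttLib import TTFont

# Placeholder inputs: a large CJK font and a file with one hex code point per
# line, representing some agreed "daily usage" collection (e.g. IICORE).
with open("daily_usage_codepoints.txt") as f:
    unicodes = [int(line, 16) for line in f if line.strip()]

font = TTFont("BigCJKFont.ttf")
subsetter = subset.Subsetter()
subsetter.populate(unicodes=unicodes)
subsetter.subset(font)                     # drops glyphs outside the collection
font.save("BigCJKFont-DailyUse.ttf")
print("glyphs remaining:", font["maxp"].numGlyphs)
```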

@PeterConstable
Author

ISO/IEC 10646 defines character collections including (among others)

  • collection 370, IICORE: 9,810 characters
  • collection 375, Japanese Core Kanji: 2,136 characters

@rsheeter

WRT CJK, we looked at frequency of character use over millions of pages when looking at shipping CJK on Google Fonts. The reality is you get pages with mostly high-usage characters plus a few rarer ones. If you ship just the high-frequency ones, a lot of pages will have a few characters that break out of the font. This is a broken solution.
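
(A toy sketch of that kind of coverage analysis, in Python; the corpus and threshold here are stand-ins, not Google Fonts' actual methodology.)

```python
from collections import Counter

def share_of_pages_breaking_out(pages, core_size):
    """pages: iterable of page texts. Return the fraction of pages containing
    at least one character outside the core_size most frequent characters."""
    freq = Counter(ch for page in pages for ch in page)
    core = {ch for ch, _ in freq.most_common(core_size)}
    broken = sum(1 for page in pages if any(ch not in core for ch in page))
    return broken / len(pages)

# Toy corpus standing in for "millions of pages": with real data, even a
# generous core set leaves many pages with a few out-of-core characters.
corpus = ["this page uses only common characters",
          "this page also uses one rare character \U00020B9F"]
print(share_of_pages_breaking_out(corpus, core_size=10))
```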

Pan-Unicode fonts were of interest in the 1990s, but then fell out of interest

I would dispute this; Noto is a pan-unicode font. It ships on Android, ChromeOS, and even Apple platforms. My team owns this for Google and actively wants to break the 16-bit GID limit for it.

So, I think pan-Unicode fonts can be considered another potential use for 32-bit GIDs. But I'm not convinced they strongly add to any business case for 32-bit GIDs.

To be clear, my team has a need to break this limit for the reasons I outlined. This isn't hypothetical, it's a real issue for real operating systems and use cases. Can we collapse to fewer files today? - Yes. Can we collapse as far as we'd like? - no. Are there workarounds? - Yes. Are they suboptimal? - also yes.

I find it confusing this would be dismissed as not adding to a business case.

@HinTak

HinTak commented Sep 17, 2020

ISO/IEC 10646 is fine and good if everybody subscribes to it... I am talking about the existence of GB18030, and a country which has a tendency of "doing its own thing" for years :-P .

@PeterConstable
Author

PeterConstable commented Sep 17, 2020

To be clear, my team has a need to break this limit for the reasons I outlined. This isn't hypothetical, it's a real issue for real operating systems and use cases. Can we collapse to fewer files today? - Yes. Can we collapse as far as we'd like? - no. Are there workarounds? - Yes. Are they suboptimal? - also yes.

I find it confusing this would be dismissed as not adding to a business case.

I didn't dismiss it or say pan-unicode fonts don't add to a business case. I just said I'm not sure they add strongly to the business case, because pan-unicode fonts aren't of much interest beyond fallback, and there are workarounds for fallback scenarios.

But it certainly adds something to the business case and deserves consideration.

@PeterConstable
Author

I've added pan-unicode/fallback into my opening comment.

Btw, my opening comment for the issue was intended only to seed discussion, not as a draft business case statement.
