[css-text-3] Segment Break Transformation Rules for East Asian Width property of A #337

Closed
hax opened this issue Jul 21, 2016 · 55 comments
Labels
css-text-3 Current Work i18n-clreq Chinese language enablement i18n-jlreq Japanese language enablement i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. Testing Unnecessary Memory aid - issue doesn't require tests Tracked in DoC

Comments

@hax
Member

hax commented Jul 21, 2016

https://drafts.csswg.org/css-text-3/#line-break-transform

Otherwise, if the East Asian Width property [UAX11] of both the character before and after the line feed is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.

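Under this rule, a common use of quotation marks in Chinese

简体中文的
“引号”
两边不应该有空格。

(roughly: "There should be no spaces on either side of the quotation marks in Simplified Chinese") will have unexpected spaces, because the quotation marks are A.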

Ideally, we should consider the language information of the context. If the context is an East Asian language, A should be treated as W. Even in an unknown language context, if one side of the line feed is A and the other side is F, W, or H, the segment break should also be removed.
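
To make the complaint concrete, here is a minimal Python sketch of the draft rule as quoted above, applied to the three-line example. The helper names (`is_hangul`, `removes_break_draft`, `collapse`) are made up for illustration, the Hangul test is a rough approximation based on character names, and the East Asian Width values come from whatever Unicode version the local Python ships.

```python
import unicodedata

def is_hangul(ch: str) -> bool:
    # Rough approximation: treat any character whose Unicode name mentions HANGUL as Hangul.
    return "HANGUL" in unicodedata.name(ch, "")

def removes_break_draft(before: str, after: str) -> bool:
    """Draft rule: both neighbours are F, W, or H (not A), and neither is Hangul."""
    wide = {"F", "W", "H"}
    return (
        unicodedata.east_asian_width(before) in wide
        and unicodedata.east_asian_width(after) in wide
        and not is_hangul(before)
        and not is_hangul(after)
    )

def collapse(lines, rule):
    """Join non-empty source lines; drop the break where `rule` allows it, else use a space."""
    out = lines[0]
    for line in lines[1:]:
        out += ("" if rule(out[-1], line[0]) else " ") + line
    return out

sample = ["简体中文的", "“引号”", "两边不应该有空格。"]
print(collapse(sample, removes_break_draft))
```

Because U+201C and U+201D are A, both breaks in the sample turn into spaces under this rule, which is the behaviour reported here.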

@gregwhitworth gregwhitworth added the css-text-3 Current Work label Jul 27, 2016
@r12a r12a added the i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. label Jul 28, 2016
@fantasai
Collaborator

fantasai commented Aug 16, 2016

My concerns here are:

  • Removing spaces where they currently aren't removed can break existing pages.
  • The proposed behavior is more complex to understand and more complex to implement for what is a fairly low-level operation.

I'm happy to make the change if i18n recommends it and implementors agree, but I am hesitant to do so for these reasons.

@hax
Member Author

hax commented Aug 22, 2016

@fantasai The current segment break rule in the draft already changes the traditional behavior, and so far no browser implements it.

Do you mean you just want to drop the rule entirely?

And I don't think my proposal is much more complex than the current rule.

Current rule:
if the East Asian Width property [UAX11] of both the character before and after the line feed is F, W, or H (not A)

My proposal:
if the East Asian Width property [UAX11] of both the character before and after the line feed is F, W, H, or A, except when both are A.
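
A small Python sketch of the proposed wording, for comparison with the current rule. The function name `removes_break_proposed` is hypothetical, and the Hangul exclusion from the draft is left out here because the quoted clause does not mention it.

```python
import unicodedata

def removes_break_proposed(before: str, after: str) -> bool:
    """Proposed rule: both neighbours are F, W, H, or A, except when both are A."""
    allowed = {"F", "W", "H", "A"}
    width_before = unicodedata.east_asian_width(before)
    width_after = unicodedata.east_asian_width(after)
    if width_before not in allowed or width_after not in allowed:
        return False
    # A next to A tells us nothing about an East Asian context, so keep the break.
    return not (width_before == "A" and width_after == "A")

print(removes_break_proposed("的", "“"))  # W next to A
print(removes_break_proposed("“", "”"))  # A next to A
print(removes_break_proposed("a", "“"))  # Na next to A
```

Note that A is not limited to punctuation: U+00E9 (é) is also A, so under this wording a break between 的 and é would be removed as well.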

@astearns astearns removed the Agenda+ label Aug 24, 2016
@astearns
Member

@r12a waiting on i18n feedback before we get this on the CSSWG agenda again

@fantasai
Collaborator

fantasai commented Sep 24, 2016 via email

@hax
Member Author

hax commented Sep 28, 2016

@fantasai In fact there are two proposals:

  1. If the context is an East Asian language, A should be treated as W.
  2. If context is not available (or if proposal 1 is not accepted), modify the current Segment Break Transformation Rules for A:
    If one side of the line break is A and the other side is F/W/H (which means it is very likely an East Asian context), then treat A as W.

@r12a r12a added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. and removed i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. labels Nov 17, 2016
@upsuper
Member

upsuper commented Mar 5, 2018

I think the motivation is reasonable, and some A characters should be treated as W in a CJ context, especially quotation marks in Chinese. But I am concerned about the A+A case, especially given that there are lots of letters in A.

The safest thing to do is probably this: if the context is Chinese or Japanese, and one side of the line break is a punctuation character in A, and the other side is F/W/H, then the segment break is removed.
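
As a rough Python sketch of that narrower clause: it only covers the extra A-punctuation case, not the existing F/W/H+F/W/H rule, and it assumes the language context arrives as a BCP 47 style tag and that "punctuation" means a General Category starting with `P`. The function names are hypothetical.

```python
import unicodedata

WIDE = {"F", "W", "H"}
CJ_LANGUAGES = {"zh", "ja"}  # assumption: match on the primary language subtag only

def is_wide(ch: str) -> bool:
    return unicodedata.east_asian_width(ch) in WIDE

def is_ambiguous_punctuation(ch: str) -> bool:
    # EAW of A, and a punctuation General Category (Pi, Pf, Po, ...).
    return (
        unicodedata.east_asian_width(ch) == "A"
        and unicodedata.category(ch).startswith("P")
    )

def removes_break_cj_punct(before: str, after: str, lang: str) -> bool:
    """Remove the break only for A punctuation next to F/W/H in a Chinese/Japanese context."""
    if lang.split("-")[0].lower() not in CJ_LANGUAGES:
        return False
    return (is_ambiguous_punctuation(before) and is_wide(after)) or (
        is_wide(before) and is_ambiguous_punctuation(after)
    )

print(removes_break_cj_punct("的", "“", "zh-Hans"))
print(removes_break_cj_punct("的", "“", "en"))
print(removes_break_cj_punct("的", "é", "zh"))
```

With the punctuation restriction, letters in A such as é no longer trigger removal, and A+A pairs always keep the space.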

@fantasai
Collaborator

@upsuper @kojiishi @hax Checked in a fix, based on Xidorn's suggestion. A+A still keeps the space, but A+F or F+A will delete the space if the A's language context is Chinese/Japanese/Yi. This is more conservative than the original request, because we don't want to break existing pages and A+A is reasonably common on non-CJK pages. An interesting question is, should we be checking the language on the segment break instead of on the A?

@hax
Member Author

hax commented Mar 19, 2018

@fantasai
In my original suggestion I also didn't think A+A should delete the space if we don't know whether we are in an East Asian context.

An interesting question is, should we be checking the language on the segment break instead of on the A?

It's basically the same as my "If the context is an East Asian language, A should be treated as W", but I believe checking the language on the segment break is much more precise and clear.

@upsuper
Member

upsuper commented Mar 19, 2018

@fantasai I think our discussion concluded that we do that only for punctuation in A in that language context? It doesn't seem to me that other A characters should have that behavior.

@fantasai
Collaborator

fantasai commented Apr 9, 2018

OK, switched to checking the language of the segment break (rather than the A character), and restricted that rule to punctuation only.

More fun: Unicode decided to categorize emoji as Wide for some reason. >:[

fantasai added a commit that referenced this issue Apr 9, 2018
…, and restrict Ambiguous characters we care about to punctuation/symbols. #337
@fantasai
Collaborator

fantasai commented Apr 9, 2018

Fixed to treat Emoji the same as an Ambiguous character: a6aa4d8
Why it's not Ambiguous to begin with, I don't know.

@kojiishi
Contributor

kojiishi commented Apr 9, 2018

Fixed to treat Emoji the same as an Ambiguous character

I think Emoji is too much; it's sometimes surprising and unexpected. The data here:
http://unicode.org/Public/emoji/latest/emoji-data.txt

U+0023, U+002A, U+0030-0039 are probably not desired.

@r12a r12a added the i18n-clreq Chinese language enablement label Apr 30, 2018
fergald pushed a commit to fergald/csswg-drafts that referenced this issue May 7, 2018
…, and restrict Ambiguous characters we care about to punctuation/symbols. w3c#337
@fantasai
Collaborator

@kojiishi Can you explain what you think the spec should say about this? Definitely we can't rely on EAW for emoji, they are totally inconsistent. E.g. U+1F600 Grinning Face is EAW=Wide while U+263A Smiling Face is EAW=Neutral. Our rules need to treat them the same somehow, and definitely we can't treat emoji as Wide here.
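
The inconsistency is easy to inspect with Python's `unicodedata` module. This is only a quick check, not a statement about any particular Unicode version: the values printed depend on the Unicode data bundled with the local Python build, and `eaw_report` is a name made up for this sketch.

```python
import unicodedata

def eaw_report(chars):
    """Map each character to its East Asian Width value in this Python build."""
    return {ch: unicodedata.east_asian_width(ch) for ch in chars}

report = eaw_report(["\U0001F600", "\u263A"])
for ch, width in report.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: EAW={width}")
print("Unicode data version:", unicodedata.unidata_version)
```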

@kojiishi
Contributor

I'd prefer not to mention it. As far as I understand, the inconsistency has historical reasons. Emoji are hard because a character is sometimes rendered as emoji and sometimes not, depending on the font. In this case, it only has an effect when the author inserted a segment break before or after. Also, even though there might be cases where it looks strange, it's interoperable, right?

@frivoal
Collaborator

frivoal commented Dec 5, 2018

To avoid the problem mentioned by @kojiishi (U+0023, U+002A, U+0030-0039), we could for this purpose treat W emoji and N emoji as A.

N emoji
W emoji

The only ones among those that don't seem to me to really be "emoji" as commonly understood by people are:

  • U+00A9 COPYRIGHT SIGN
  • U+203C DOUBLE EXCLAMATION MARK
  • U+2049 EXCLAMATION QUESTION MARK

But even then, U+00AE REGISTERED SIGN and U+2122 TRADE MARK SIGN are A as well, so lumping COPYRIGHT SIGN with them doesn't bother me.

As for U+203C, U+2049, they both are sentence ending punctuation. In Chinese and Japanese typesetting, spaces are generally not inserted around sentence ending punctuation, so treating them as A and discarding the spaces seems OK too.

@faceless2

We need to add CJK Unified Ideographs Extension G to that list, which is in the upcoming Unicode 13.0.

I'm also told that 99.9% of Korean text that you will find on the web uses Western (aka ASCII) punctuation, so perhaps this is not as much of an issue as first thought, at least in practical terms.

@faceless2

The "ghost of christmas past", who whispered in my ear about CJK Unified Ideographs Extension G, also suggests that anything Lisu and Khitan Small Script, the latter of which is new in Unicode Version 13.0, should be on the list. 👻

@kojiishi
Contributor

kojiishi commented Feb 26, 2020

Thank you Mike for the investigation, this is really helpful. This code snippet in WebKit may help in developing the list too.

Maybe we should agree on the expected accuracy first. My basic idea is:

  1. I do not expect perfect accuracy in all cases. It is a heuristic algorithm, which can't be perfect anyway. This is OK: as long as browsers are interoperable, it should not be too hard for authors to find where the heuristic fails.
  2. I would like this algorithm to be simple and fast. This logic is a bit "hot": it runs on every source line of every page, including English pages, and consumes battery and CPU cache space for all users.
  3. I think we should avoid excessive removal for non-CJK scripts. This means that when ambiguous, we choose not to remove. This is not ideal for CJK authors, and may not look consistent in some cases, but it is still a great improvement, and authors can learn as long as implementations are interoperable.
  4. The basic philosophy we applied for text-orientation is to use one Unicode property, or ask Unicode to add a new property. If you find properties other than the Unicode Block that produce better results, I don't insist on the Unicode Block at all, but I prefer to stick to using one property.

Do these look reasonable? Any opinions, additions, or change suggestions?

@kojiishi
Contributor

kojiishi commented Feb 26, 2020

The VerticalOrientation property could also be good data for developing the list.

@kojiishi
Contributor

A wild idea came up, maybe the VerticalOrientation property is better than the Unicode Block for this purpose, rather than just using it as a reference to develop a list of blocks?

@faceless2

I presume the reason we're discussing heuristics at all, and not simply adding another value to text-space-collapse in css-text-4 which says "collapse segment breaks to nothing" is because (like #4576) this behaviour is supposed to be context-dependent - i.e. it depends on the characters on either side?

And the intention is, roughly, if the segment break is between two CJK ideographs, or between a CJK ideograph and punctuation, collapse it to nothing?

I ask because - if this behaviour must remain context-dependent - I'm wondering if it would be easier to add a value to text-space-collapse to turn this behaviour on, and instead list the contexts where it wouldn't apply. So make the rule "if this flag is on, collapse all segment breaks except those either side of characters of class (AL|HL|SA)".

Alternatively: it looks like Gecko is currently collapsing segment breaks between two ideographs, but not between ideographs and punctuation, and Blink/Webkit are doing neither. Perhaps segment breaks could always collapse between two ideographs (an easy and unambiguous test for UAX#14 class ID), but only collapse according to the more complex heuristics if the appropriate property was set? In other words, "always collapse segment breaks between ID characters. Only collapse other (some? all?) segment breaks if text-space-collapse: collapse-break is set"

(both of these are attempts to reduce both the processing cost of evaluating the heuristic, and the cost of getting the heuristics wrong).
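
A very rough Python sketch of the second alternative, to show how small the always-on part would be. Everything here is an assumption for illustration: `is_ideograph` stands in for a real UAX#14 class ID lookup (Python's standard library does not expose line break classes, so it only recognises CJK Unified Ideographs by name), `collapse_break` models the hypothetical `text-space-collapse: collapse-break` value, and `heuristic` is whatever more complex rule would apply when that value is set.

```python
import unicodedata

def is_ideograph(ch: str) -> bool:
    # Stand-in for UAX#14 class ID; a real implementation would use line break data.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")

def transform_break(before: str, after: str, collapse_break: bool = False, heuristic=None) -> str:
    """Return what a segment break between `before` and `after` becomes: "" or " "."""
    # Cheap, always-on case: a break between two ideographs is removed.
    if is_ideograph(before) and is_ideograph(after):
        return ""
    # The more complex heuristic only runs when the author opted in.
    if collapse_break and heuristic is not None and heuristic(before, after):
        return ""
    return " "

print(repr(transform_break("中", "文")))
print(repr(transform_break("中", "“")))
print(repr(transform_break("中", "“", collapse_break=True, heuristic=lambda b, a: True)))
```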

@kojiishi
Contributor

I'm OK with adding a new value, but what are the benefits of the new value? Is it to prevent regressing non-CJK content? How is it different from choosing conservative heuristics?

A wild idea came up, maybe the VerticalOrientation property is better...

I take this back. I remember VerticalOrientation is still too aggressive.

@faceless2

faceless2 commented Feb 26, 2020

Is it to prevent regressing non-CJK content?

Yes, exactly.

How is it different from choosing conservative heuristics?

To my non-expert eyes, this particular heuristic appears to be quite hard to get right. That's purely based on the discussion in La Coruña, and re-reading all the comments on this issue (from the last four years!). So I figured it's worth exploring if there's a way to remove the heuristic, or at least drastically reduce its scope.

@fantasai
Collaborator

For what it's worth we've just implemented this as currently specified, and it makes a real mess of some tests, e.g. CSS2/generated-content/content-counter-004-ref.xht - spaces between U+25FE (black square) are removed, due to the EAW property being "W".

This is because Unicode changed the EAW of a lot of characters in an effectively random and backwards-incompatible way when it introduced Emoji. The results based on e.g. Unicode 6, when these rules were written, would have been quite sensible. :/ Trying to compensate for this change is one of the reasons the rules became too complicated...
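
The effect on that test can be reproduced with a few lines of Python. This is a sketch of the EAW-based rule only (the Hangul exclusion is omitted since it is irrelevant here), `removes_break_eaw` is a made-up name, and the result depends on the Unicode data in the local Python build. U+25A0 is included purely as a contrasting square that is A rather than W.

```python
import unicodedata

def removes_break_eaw(before: str, after: str) -> bool:
    """EAW-based rule: remove the break when both neighbours are F, W, or H."""
    wide = {"F", "W", "H"}
    return (
        unicodedata.east_asian_width(before) in wide
        and unicodedata.east_asian_width(after) in wide
    )

for square in ("\u25FE", "\u25A0"):
    print(
        f"U+{ord(square):04X} {unicodedata.name(square)}:",
        f"EAW={unicodedata.east_asian_width(square)},",
        f"break removed between two of them: {removes_break_eaw(square, square)}",
    )
```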

I've committed an initial draft of the Unicode block-based approach. I think the interesting questions remaining are:

  • Bopomofo
  • Yijing Hexagram Symbols / Tai Xuan Jing Symbols / Counting Rod Numerals
  • Enclosed ideographics

I'm leaning towards yes on enclosed ideographics, no on the symbols, and I don't know enough about Bopomofo when it is used as a stand-alone script to say.

Lisu and Khitan both use spaces; they should not therefore discard them during collapsing. Small forms etc. are primarily used with Chinese and Japanese, not Korean, so I think it's reasonable to include them here. (Keep in mind also that both sides of the break need to belong to the set in order to discard, and Hangul is excluded.)
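
For readers who want to see the shape of a block-based check, here is a minimal Python sketch. The block list is an illustrative subset chosen for this example, not the list in the draft, and the Hangul exclusion is approximated through character names. The ranges themselves are the standard Unicode block boundaries.

```python
import unicodedata
from bisect import bisect_right

# Illustrative subset only, sorted by start code point; NOT the list from the spec.
BLOCKS = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms (contains halfwidth Hangul)
]
BLOCK_STARTS = [start for start, _ in BLOCKS]

def in_discard_set(ch: str) -> bool:
    """True if `ch` falls in one of the listed blocks and is not Hangul."""
    cp = ord(ch)
    i = bisect_right(BLOCK_STARTS, cp) - 1  # last block starting at or before cp
    if i < 0 or cp > BLOCKS[i][1]:
        return False
    return "HANGUL" not in unicodedata.name(ch, "")

def removes_break_blocks(before: str, after: str) -> bool:
    # Both sides of the break must belong to the set for the break to be discarded.
    return in_discard_set(before) and in_discard_set(after)

print(removes_break_blocks("中", "文"))
print(removes_break_blocks("中", "。"))
print(removes_break_blocks("中", "\uFFA1"))  # halfwidth Hangul letter
```

A sorted range table with a binary search keeps the per-break cost small, which matters because this check runs on every source line. In this particular subset U+201C is not covered, so a break next to a Western-style quotation mark is still kept.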

@frivoal
Collaborator

frivoal commented Apr 13, 2020

Lisu and Khitan both use spaces

Are we talking about one or both of these Khitan scripts (presumably the former, as I don't think the latter is in Unicode):

https://en.wikipedia.org/wiki/Khitan_small_script
https://en.wikipedia.org/wiki/Khitan_large_script

If yes, do they really use spaces? Where can I learn more about that?

If not, what are we talking about?

@fantasai
Collaborator

@frivoal https://www.unicode.org/versions/Unicode13.0.0/ch18.pdf

@fantasai
Collaborator

The WG resolution to switch to Unicode Blocks has been edited in. I opened up #4993 and #4993 as follow-up issues. Closing out discussion here, since we've veered pretty far off the original topic.

@fantasai
Collaborator

Refiled the OP as #5017
