[css-text-3] Segment Break Transformation Rules for East Asian Width property of A #337

hax · 2016-07-21T17:42:22Z

https://drafts.csswg.org/css-text-3/#line-break-transform

Otherwise, if the East Asian Width property [UAX11] of both the character before and after the line feed is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.

Under this rule, a common use of quotation marks in Chinese such as

简体中文的
“引号”
两边不应该有空格。

will get unexpected spaces, because the quotation marks are A.
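A minimal sketch of the rule quoted above, applied to this example. It uses Python's standard `unicodedata` module; the helper names and the name-based Hangul test are assumptions of this sketch, not spec text.

```python
import unicodedata

WIDE = {"F", "W", "H"}  # East Asian Width values named in the quoted rule


def is_hangul(ch: str) -> bool:
    # Assumption: approximate "is Hangul" via the Unicode character name.
    return "HANGUL" in unicodedata.name(ch, "")


def removes_break_current_rule(before: str, after: str) -> bool:
    """True if the quoted draft rule would remove the segment break."""
    return (
        unicodedata.east_asian_width(before) in WIDE
        and unicodedata.east_asian_width(after) in WIDE
        and not is_hangul(before)
        and not is_hangul(after)
    )


# The three source lines from the example above.
lines = ["简体中文的", "“引号”", "两边不应该有空格。"]
for prev, nxt in zip(lines, lines[1:]):
    # Report the characters on each side of the break and the rule's verdict.
    print(prev[-1], nxt[0], removes_break_current_rule(prev[-1], nxt[0]))
```

Both breaks in the example touch a quotation mark whose East Asian Width is A, so the rule keeps them and they collapse to spaces.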

Ideally, we should take the language of the context into account. If the context is an East Asian language, A should be treated as W. Even when the language is unknown, if one side of the line feed is A and the other side is F, W, or H, the segment break should also be removed.


fantasai · 2016-08-16T20:49:15Z

My concerns here are:

Removing spaces where they currently aren't removed can break existing pages.
The proposed behavior is more complex to understand and more complex to implement for what is a fairly low-level operation.

I'm happy to make the change if i18n recommends it and implementors agree, but I am hesitant to do so for these reasons.

hax · 2016-08-22T04:36:37Z

@fantasai The current segment break rule in the draft already changes the traditional behavior, and so far no browser has implemented it.

Do you mean you just want to drop the rule totally?

And I don't think my proposal is much more complex than the current rule.

Current rule:
if the East Asian Width property [UAX11] of both the character before and after the line feed is F, W, or H (not A)

My proposal:
if the East Asian Width property [UAX11] of both the character before and after the line feed is F, W, H, or A, except when both are A.
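To make the difference concrete, here is a rough sketch of the proposed condition only. The function name is hypothetical, and the Hangul exclusion from the full rule is left out for brevity.

```python
import unicodedata


def removes_break_proposed(before: str, after: str) -> bool:
    """Proposed condition: both sides are F, W, H, or A, but not A on both sides."""
    eaw_before = unicodedata.east_asian_width(before)
    eaw_after = unicodedata.east_asian_width(after)
    allowed = {"F", "W", "H", "A"}
    if eaw_before not in allowed or eaw_after not in allowed:
        return False
    # A next to A keeps the break, since that pairing is common outside CJK text.
    return not (eaw_before == "A" and eaw_after == "A")
```

With this condition, the break between 的 and “ is removed, while a break between two Ambiguous characters, or next to a Latin letter, is kept.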

astearns · 2016-08-24T17:06:39Z

@r12a waiting on i18n feedback before we get this on the CSSWG agenda again

fantasai · 2016-09-24T22:33:55Z

Just to clarify, the proposal is that if lang=zh|ja|yi then A->W otherwise A->N for the purpose of line-break transformations? I think that should probably be okay. I would be against making A->W the general case.

hax · 2016-09-28T07:27:12Z

@fantasai In fact there are two proposals:

1. If the context is an East Asian language, A should be treated as W.
2. If the context is not available (or if proposal 1 is not accepted), modify the current Segment Break Transformation Rules for A: if one side of the line break is A and the other side is F/W/H (which means it's very likely an East Asian context), then treat A as W.

upsuper · 2018-03-05T03:56:03Z

I think the motivation is reasonable, and some A characters should be treated as W in a CJ context, especially quotation marks in Chinese. But I am concerned about the A+A case, especially given that there are lots of letters in A.

The safest thing to do is probably this: if the context is Chinese or Japanese, and one side of the line break is a punctuation character in A, and the other side is F/W/H, then the segment break is removed.

fantasai · 2018-03-16T22:47:42Z

@upsuper @kojiishi @hax Checked in a fix, based on Xidorn's suggestion. A+A still keeps the space, but A+F or F+A will delete the space if the A's language context is Chinese/Japanese/Yi. This is more conservative than the original request, because we don't want to break existing pages and A+A is reasonably common on non-CJK pages. An interesting question is, should we be checking the language on the segment break instead of on the A?

hax · 2018-03-19T08:40:11Z

@fantasai
In my original suggestion I also didn't think A+A should delete the space if we don't know whether we are in an East Asian context.

> An interesting question is, should we be checking the language on the segment break instead of on the A?

It's basically the same as my "If the context is East Asian language, A should be treated as W", but I believe checking the language on the segment break is much more precise and clear.

upsuper · 2018-03-19T08:52:49Z

@fantasai I thought our discussion concluded that we would do that only for punctuation in A in that language context? It doesn't seem to me that other A characters should have that behavior.

fantasai · 2018-04-09T14:43:54Z

OK, switched to checking the language of the segment break (rather than the A character), and restricted that rule to punctuation only.
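As a rough illustration of the behavior described here (not the spec text): the sketch below removes a break between two wide non-Hangul characters, and also between a wide character and an Ambiguous punctuation character when the language of the segment break is Chinese, Japanese, or Yi. The function names, the `zh`/`ja`/`ii` tag set, and the use of General_Category "P*" for "punctuation" are assumptions of this sketch; the emoji adjustment discussed next is not modeled.

```python
import unicodedata

# Assumption: primary language subtags taken to mean Chinese, Japanese, and Yi.
CJY_LANGS = {"zh", "ja", "ii"}
WIDE = {"F", "W", "H"}


def is_wide_non_hangul(ch: str) -> bool:
    return (
        unicodedata.east_asian_width(ch) in WIDE
        and "HANGUL" not in unicodedata.name(ch, "")
    )


def is_ambiguous_punctuation(ch: str) -> bool:
    # General_Category values starting with "P" are punctuation.
    return (
        unicodedata.east_asian_width(ch) == "A"
        and unicodedata.category(ch).startswith("P")
    )


def removes_break(before: str, after: str, break_lang: str) -> bool:
    """Sketch: decide whether the segment break between two characters is removed."""
    if is_wide_non_hangul(before) and is_wide_non_hangul(after):
        return True
    # The language is checked on the segment break, not on the A character.
    if break_lang.split("-")[0].lower() not in CJY_LANGS:
        return False
    return (
        is_ambiguous_punctuation(before) and is_wide_non_hangul(after)
    ) or (
        is_wide_non_hangul(before) and is_ambiguous_punctuation(after)
    )
```

For example, `removes_break("的", "“", "zh-Hans")` removes the break, while the same pair under `"en"` keeps it, and A+A keeps it in any language.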

More fun: Unicode decided to categorize emoji as Wide for some reason. >:[

fantasai · 2018-04-09T14:49:46Z

Fixed to treat Emoji the same as an Ambiguous character: a6aa4d8
Why it's not Ambiguous to begin with, I don't know.

kojiishi · 2018-04-09T16:06:50Z

> Fixed to treat Emoji the same as an Ambiguous character

I think Emoji is too much; it's sometimes surprising and unexpected. The data here:
http://unicode.org/Public/emoji/latest/emoji-data.txt

U+0023, U+002A, U+0030-0039 are probably not desired.

fantasai · 2018-09-16T09:31:01Z

@kojiishi Can you explain what you think the spec should say about this? Definitely we can't rely on EAW for emoji, they are totally inconsistent. E.g. U+1F600 Grinning Face is EAW=Wide while U+263A Smiling Face is EAW=Neutral. Our rules need to treat them the same somehow, and definitely we can't treat emoji as Wide here.
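The inconsistency can be checked with Python's standard `unicodedata` module. This is only a quick lookup; the result depends on the Unicode version bundled with the interpreter, and the variable name is an assumption of this sketch.

```python
import unicodedata

# The two code points mentioned above.
emoji = ["\U0001F600", "\u263A"]

# Map each character to its East Asian Width value.
widths = {ch: unicodedata.east_asian_width(ch) for ch in emoji}

for ch, eaw in widths.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: EAW={eaw}")
```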

kojiishi · 2018-09-18T00:40:30Z

I'd prefer not to mention it. As far as I understand, the inconsistency has historical reasons. Emoji are hard because the same character is sometimes rendered as emoji and sometimes not, depending on fonts. In this case, it only matters when the author inserted a segment break before or after the emoji. Also, even though there might be cases where the result looks strange, it's interoperable, right?

frivoal · 2018-12-05T00:52:13Z

To avoid the problem mentioned by @kojiishi (U+0023, U+002A, U+0030-0039), we could for this purpose treat W emoji and N emoji as A.

N emoji
W emoji

The only ones among all of those that don't really seem to me to be "emoji" as commonly understood by people are:

U+00A9 COPYRIGHT SIGN
U+203C DOUBLE EXCLAMATION MARK
U+2049 EXCLAMATION QUESTION MARK

But even then, U+00AE REGISTERED SIGN and U+2122 TRADE MARK SIGN are A as well, so lumping COPYRIGHT SIGN with them doesn't bother me.

As for U+203C, U+2049, they both are sentence ending punctuation. In Chinese and Japanese typesetting, spaces are generally not inserted around sentence ending punctuation, so treating them as A and discarding the spaces seems OK too.

faceless2 · 2020-02-25T14:23:03Z

We need to add CJK Unified Ideographs Extension G to that list, which is in the upcoming Unicode 13.0.

I'm also told that 99.9% of Korean text that you will find on the web uses Western (aka ASCII) punctuation, so perhaps this is not as much of an issue as first thought, at least in practical terms.

faceless2 · 2020-02-25T15:02:42Z

The "ghost of christmas past", who whispered in my ear about CJK Unified Ideographs Extension G, also suggests that anything Lisu and Khitan Small Script, the latter of which is new in Unicode Version 13.0, should be on the list. 👻

kojiishi · 2020-02-26T03:39:28Z

Thank you Mike for the investigation, this is really helpful. This code snippet in WebKit may help developing the list too.

Maybe we should agree on the expected accuracy first. My basic idea is:

1. I do not expect perfect accuracy in all cases. It is a heuristic algorithm, which can't be perfect anyway. This is OK: as long as browsers are interoperable, it should not be too hard for authors to find where the heuristic fails.
2. I would like this algorithm to be simple and fast. This logic is a bit "hot": it runs on every source line of every page, including English pages, and consumes battery and CPU cache space for all users.
3. I think we should avoid excessive removal for non-CJK scripts. This means that when a case is ambiguous, we choose not to remove. That is not ideal for CJK authors, and may not look consistent in some cases, but it is still a great improvement, and authors can learn the behavior as long as implementations are interoperable.
4. The basic philosophy we applied for text-orientation is to use one Unicode property, or to ask Unicode to add a new property. If you find properties other than the Unicode Block that produce better results, I don't insist on the Unicode Block at all, but I prefer to stick to using one property.

Do these look reasonable? Any opinions, additions, or change suggestions?

kojiishi · 2020-02-26T03:52:17Z

The VerticalOrientation property could also be good data for developing the list.

kojiishi · 2020-02-26T05:30:14Z

A wild idea came up, maybe the VerticalOrientation property is better than the Unicode Block for this purpose, rather than just using it as a reference to develop a list of blocks?

faceless2 · 2020-02-26T12:34:44Z

I presume the reason we're discussing heuristics at all, and not simply adding another value to text-space-collapse in css-text-4 which says "collapse segment breaks to nothing" is because (like #4576) this behaviour is supposed to be context-dependent - i.e. it depends on the characters on either side?

And the intention is, roughly, if the segment break is between two CJK ideographs, or between a CJK ideograph and punctuation, collapse it to nothing?

I ask because - if this behaviour must remain context-dependent - I'm wondering if it would be easier to add a value to text-space-collapse to turn this behaviour on, and instead list the contexts where it wouldn't apply. So make the rule "if this flag is on, collapse all segment breaks except those either side of characters of class (AL|HL|SA)".

Alternatively: it looks like Gecko is currently collapsing segment breaks between two ideographs, but not between ideographs and punctuation, and Blink/Webkit are doing neither. Perhaps segment breaks could always collapse between two ideographs (an easy and unambiguous test for UAX#14 class ID), but only collapse according to the more complex heuristics if the appropriate property was set? In other words, "always collapse segment breaks between ID characters. Only collapse other (some? all?) segment breaks if text-space-collapse: collapse-break is set"

(both of these are attempts to reduce both the processing cost of evaluating the heuristic, and the cost of getting the heuristics wrong).

kojiishi · 2020-02-26T16:15:43Z

I'm ok to add a new value, but what are the benefits of the new value? Is it to prevent regressing non-CJK content? How is it different from choosing conservative heuristics?

> A wild idea came up, maybe the VerticalOrientation property is better...

I take this back. I remember VerticalOrientation is still too aggressive.

faceless2 · 2020-02-26T17:31:41Z

> Is it to prevent regressing non-CJK content?

Yes, exactly.

> How is it different from choosing conservative heuristics?

To my non-expert eyes, this particular heuristic appears to be quite hard to get right. That's purely based on the discussion in La Coruña, and re-reading all the comments on this issue (from the last four years!). So I figured it's worth exploring if there's a way to remove the heuristic, or at least drastically reduce its scope.

fantasai · 2020-04-12T22:31:24Z

For what it's worth we've just implemented this as currently specified, and it makes a real mess of some tests, e.g. CSS2/generated-content/content-counter-004-ref.xht - spaces between U+25FE (black square) are removed, due to the EAW property being "W".

This is because Unicode changed the EAW of a lot of characters in an effectively random and backwards-incompatible way when it introduced Emoji. The results based on e.g. Unicode 6, when these rules were written, would have been quite sensible. :/ Trying to compensate for this change is one of the reasons the rules became too complicated...

I've committed an initial draft of the Unicode block-based approach. I think the interesting questions remaining are:

Bopomofo
Yijing Hexagram Symbols / Tai Xuan Jing Symbols / Counting Rod Numerals
Enclosed ideographics

I'm leaning towards yes on enclosed ideographics, no on the symbols, and I don't know enough about Bopomofo when it is used as a stand-alone script to say.

Lisu and Khitan both use spaces; they should not therefore discard them during collapsing. Small forms etc. are primarily used with Chinese and Japanese, not Korean, so I think it's reasonable to include them here. (Keep in mind also that both sides of the break need to belong to the set in order to discard, and Hangul is excluded.)
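For illustration, a block-based check might look like the sketch below. The block list is a small hypothetical subset chosen for the example and is not the list in the draft; the names `SPACE_DISCARDING_BLOCKS`, `in_space_discarding_set`, and `discards_break` are likewise assumptions. Hangul, Bopomofo, and Lisu are simply absent from the subset.

```python
# Illustrative subset of Unicode blocks, as (first, last) code point ranges.
SPACE_DISCARDING_BLOCKS = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def in_space_discarding_set(ch: str) -> bool:
    cp = ord(ch)
    return any(first <= cp <= last for first, last in SPACE_DISCARDING_BLOCKS)


def discards_break(before: str, after: str) -> bool:
    # Both sides of the break must belong to the set for it to be discarded.
    return in_space_discarding_set(before) and in_space_discarding_set(after)
```

A range lookup like this needs only the code point, which fits the earlier request for a simple and fast check. Under this subset, a break next to U+201C (General Punctuation) is still kept.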

frivoal · 2020-04-13T01:35:59Z

> Lisu and Khitan both use spaces

Are we talking about one or both of these Khitan scripts (presumably the former, as I don't think the latter is in Unicode):

https://en.wikipedia.org/wiki/Khitan_small_script
https://en.wikipedia.org/wiki/Khitan_large_script

If yes, do they really use spaces? Where can I learn more about that?

If not, what are we talking about?

fantasai · 2020-04-14T00:23:46Z

@frivoal https://www.unicode.org/versions/Unicode13.0.0/ch18.pdf

fantasai · 2020-04-28T02:29:34Z

The WG resolution to switch to Unicode Blocks has been edited in. I opened up #4992 and #4993 as follow-up issues. Closing out discussion here, since we've veered pretty far off the original topic.

fantasai · 2020-04-28T07:30:41Z

Refiled the OP as #5017

gregwhitworth added the css-text-3 Current Work label Jul 27, 2016

r12a added the i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. label Jul 28, 2016

fantasai added the Agenda+ label Aug 16, 2016

astearns removed the Agenda+ label Aug 24, 2016

r12a mentioned this issue Sep 23, 2016

Segment Break Transformation Rules for East Asian Width property of A w3c/i18n-activity#224

Closed

r12a added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. and removed i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. labels Nov 17, 2016

frivoal added the Tracked in DoC label Mar 5, 2018

fantasai added Agenda+ F2F and removed Agenda+ F2F labels Mar 16, 2018

fantasai added a commit that referenced this issue Mar 16, 2018

[css-text-3] Allow one-side-ambiguous segment transformation to drop …

41043b0

…segment break if language context is CJY. #337 <https://lists.w3.org/Archives/Public/www-style/2016Oct/0068.html>

fantasai added the Commenter Response Pending label Mar 16, 2018

fantasai added a commit that referenced this issue Apr 9, 2018

[css-text-3] Check language on segment break, not Ambiguous character…

71cce33

…, and restrict Ambiguous characters we care about to punctuation/symbols. #337

r12a added the i18n-clreq Chinese language enablement label Apr 30, 2018

fergald pushed a commit to fergald/csswg-drafts that referenced this issue May 7, 2018

[css-text-3] Check language on segment break, not Ambiguous character…

5670007

…, and restrict Ambiguous characters we care about to punctuation/symbols. w3c#337

fantasai added Needs Design / Proposal Needs Edits labels Apr 3, 2020

fantasai added a commit that referenced this issue Apr 12, 2020

[css-text-3] Initial draft of Unicode Block -based segment break tran…

b223ecb

…sformation. #337

css-meeting-bot mentioned this issue Apr 15, 2020

[css-text-3] Ogham Space Mark needs to disappear at the end of a line #4893

Closed

This was referenced Apr 23, 2020

[css-text-3] Should enclosed ideographic blocks be space-discarding? #4992

Closed

[css-text-3] Should enclosed counting rods / tai xuan jing / yi jing hexagrams be space-discarding? #4993

Closed

fantasai added a commit that referenced this issue Apr 23, 2020

[css-text-3] Bobby confirms that Bopomofo should not discard spaces, …

e56978e

…they're used in bopomofo-only texts. #337

xfq mentioned this issue Apr 24, 2020

Space between characters after joining two lines w3c/clreq#293

Open

5 tasks

fantasai closed this as completed Apr 28, 2020

fantasai removed Needs Design / Proposal Needs Edits labels Apr 28, 2020

fantasai mentioned this issue Apr 28, 2020

[css-text-3] Discarding Line Breaks Adjacent to Ambiguous Characters #5017

Open

kidayasuo mentioned this issue May 8, 2020

Review Segment Break Transformation Rules (CSS Text Level 3) w3c/jlreq#211

Open

JTensai pushed a commit to JTensai/csswg-drafts that referenced this issue May 13, 2020

[css-text-3] Initial draft of Unicode Block -based segment break tran…

a7b7029

…sformation. w3c#337

jfkthame mentioned this issue Jun 3, 2020

[css-text-3] Segment Break Transformation Rules around CJK Punctuation #5086

Open

fantasai added a commit that referenced this issue Sep 28, 2020

[css-text-3] Add note explaining the purpose of space-discarding appe…

9e8f122

…ndix. #337

frivoal added the Testing Unnecessary Memory aid - issue doesn't require tests label Dec 3, 2020
