[css-text-3] Should enclosed ideographic blocks be space-discarding? #4992

fantasai · 2020-04-23T05:39:58Z

In #337 we decided to key line-break transformation behavior by Unicode Block. Most of the blocks are pretty straightforward: Han, Kana, Yi, and CJK punctuation blocks discard, and everything else converts to a space. But there are a few interesting cases...

One interesting case is the enclosed ideographic blocks:
https://en.wikipedia.org/wiki/Enclosed_CJK_Letters_and_Months
https://en.wikipedia.org/wiki/Enclosed_Ideographic_Supplement

The numerics in the Letters and Months block seem likely to be used outside of CJK contexts. There are also quite a few Hangul characters, and I wouldn't be surprised if at least some of the other characters are also used in Korean sometimes.

Note, however, that we only discard if both sides (before and after) the line break are part of the space-discarding character set.
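
As a rough illustration of the rule described above, here is a minimal Python sketch. The names `SPACE_DISCARDING_RANGES`, `is_space_discarding`, and `transform_segment_breaks` are hypothetical, and the range list is only an illustrative subset of the Han, Kana, Yi, and CJK punctuation blocks, not the spec's normative set. Other white space processing is not modeled.

```python
# Illustrative subset of Unicode blocks treated as space-discarding.
# Assumption: this is NOT the normative list from css-text-3.
SPACE_DISCARDING_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (Han)
    (0xA000, 0xA48F),  # Yi Syllables
    (0xA490, 0xA4CF),  # Yi Radicals
]


def is_space_discarding(ch):
    """Return True if the character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in SPACE_DISCARDING_RANGES)


def transform_segment_breaks(text):
    """Join hard-wrapped lines: drop the break only when BOTH neighbors
    are space-discarding, otherwise turn it into a single space."""
    lines = text.split("\n")
    out = lines[0]
    for nxt in lines[1:]:
        before = out[-1:]   # last character before the break ("" if none)
        after = nxt[:1]     # first character after the break ("" if none)
        both = (
            bool(before)
            and bool(after)
            and is_space_discarding(before)
            and is_space_discarding(after)
        )
        out += nxt if both else " " + nxt
    return out
```

With this sketch, a break between two Han characters disappears, while a break with a Latin letter on either side becomes a space.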

xfq · 2020-04-28T07:35:38Z

Note, however, that we only discard if both sides (before and after) the line break are part of the space-discarding character set.

What's the rationale for this? If we only discard when both sides are part of the space-discarding character set, sometimes unintended behavior will appear. For example:

万维网联盟（World Wide Web Consortium, W3C）是Web
领域的国际标准化组织，开发开放Web标准，确保
Web的长期发展。欢迎您加入W3C的朋友计划，支持W3C
实现“尽展Web无限潜能”的使命，并为Web开发者提供更多工具。

will become:

万维网联盟（World Wide Web Consortium, W3C）是Web{SPACE}领域的国际标准化组织，开发开放Web标准，确保{SPACE}Web的长期发展。欢迎您加入W3C的朋友计划，支持W3C{SPACE}实现“尽展Web无限潜能”的使命，并为Web开发者提供更多工具。

({SPACE} stands for the inserted space. Perhaps there have been related discussions before, but I did not find them.)

kojiishi · 2020-04-28T20:21:07Z

Interesting indeed. I'm fine not to include if usages outside CJ are expected. Will you want a space when Latin context follows enclosed numerals? i.e., for "㉑Text", should there be a space?

@xfq

What's the rationale for this?

Please consider the other case:

是 Web
领域的国际标准化组织

then you would want:

是 Web 领域的国际标准化组织

xfq · 2020-04-29T04:09:13Z

Interesting indeed. I'm fine not to include if usages outside CJ are expected. Will you want a space when Latin context follows enclosed numerals? i.e., for "㉑Text", should there be a space?

I'm not sure. I will discuss with the clreq editors in w3c/clreq#293

xfq · 2020-04-29T04:10:15Z

Please consider the other case:

This can happen indeed. We need to answer at least two questions (maybe not in this issue):

Should we add spaces (U+0020) between Han and Latin characters?
If spaces should be added, how do we deal with pages that have no spaces added? If spaces should not be added, how do we deal with pages that already contain space characters?

kojiishi · 2020-04-29T07:06:48Z

Should we add spaces (U+0020) between Han and Latin characters?

@kidayasuo might want to jump in; he is brainstorming the same question in jlreq, namely which is the more suitable way to define the typographic rules in a future jlreq.

I'm fine either way for how the spec describes the typographic behavior, but that won't affect what authors actually do; both types of authors are most likely to keep their preferred behaviors.

For the segment transformation rules, whichever option we take, the other group will need to learn the rules, so I think the simpler option is easier to remember and adopt.

r12a · 2020-04-29T14:56:29Z

I'm inclined to think that we need to expect authors to make adjustments sometimes to resolve ambiguities. So given:

万维网联盟（World Wide Web Consortium, W3C）是Web
领域的国际标准化组织，开发开放Web标准，确保
Web的长期发展。欢迎您加入W3C的朋友计划，支持W3C
实现“尽展Web无限潜能”的使命，并为Web开发者提供更多工具。

a small adjustment to the source text could fix the problem, such as:

万维网联盟（World Wide Web Consortium, W3C）是Web领
域的国际标准化组织，开发开放Web标准，确保Web的
长期发展。欢迎您加入W3C的朋友计划，支持W3C实
现“尽展Web无限潜能”的使命，并为Web开发者提供更多工具。

If the line breaking is done manually, this shouldn't be a common problem, once the author is aware of how things work. If line-breaking is done automatically, the author ought to expect some problems that will need to be rectified, although the application doing the line-breaking could also look out for situations where applying a line-break between certain characters would introduce ambiguity (eg. not splitting immediately after a space in CJ).
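
One way an application could "look out for" ambiguous break points is sketched below in Python. This is only an assumption about how such a tool might work, not something any editor is known to implement. The names `is_space_discarding` and `safe_wrap` are hypothetical, the block ranges are an illustrative subset, and width is counted in code points rather than display columns.

```python
# Illustrative subset of space-discarding blocks (assumption, not normative).
RANGES = [(0x3000, 0x303F), (0x3040, 0x30FF), (0x4E00, 0x9FFF),
          (0xA000, 0xA4CF)]


def is_space_discarding(ch):
    return any(lo <= ord(ch) <= hi for lo, hi in RANGES)


def safe_wrap(text, width):
    """Hard-wrap text, breaking only where re-joining is unambiguous:
    either between two space-discarding characters (the break is dropped
    on re-join) or at an existing space that would survive as a space."""
    lines = []
    start = 0
    while len(text) - start > width:
        cut = None  # (end of this line, start of next line)
        for i in range(start + width, start, -1):
            prev, cur = text[i - 1], text[i]
            if is_space_discarding(prev) and is_space_discarding(cur):
                cut = (i, i)
                break
            if cur == " " and i + 1 < len(text):
                nxt = text[i + 1]
                ambiguous = is_space_discarding(prev) and is_space_discarding(nxt)
                if not ambiguous and prev != " " and nxt != " ":
                    cut = (i, i + 1)  # replace the space with the break
                    break
        if cut is None:
            break  # no safe break point: leave the rest on one long line
        lines.append(text[start:cut[0]])
        start = cut[1]
    lines.append(text[start:])
    return "\n".join(lines)
```

For example, wrapping `万维网联盟是Web领域的国际标准化组织` at width 6 never places a break next to `Web`, so joining the lines back cannot introduce a space there.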

xfq · 2020-04-29T15:28:08Z

Just FYI - I'm not sure about other WYSIWYG editors, but BlueGriffon enables line wrapping (of the HTML source code) by default, and I often see unwanted spaces appear in Chinese (or between Chinese and English) when the web page is created with BlueGriffon.

r12a · 2020-05-01T12:31:25Z

Sounds like you should raise a bug report against that app.

xfq · 2020-05-02T01:37:30Z

Sounds like you should raise a bug report against that app.

I didn't find the bug tracker for it, so I'll friendly ping @therealglazou here.

kidayasuo · 2020-05-12T00:49:28Z

I agree with @r12a to handle ambiguous cases by not inserting a line break around such places. What this means is that for these "ambiguous" cases it does not really matter which we decide. Probably it is better to opt for whichever is easier to remember or more intuitive to end users.

One possible caveat is that what is truly ambiguous can sometimes be unclear because the ambiguity in this case is based on human expectations.

MurakamiShinyu · 2020-05-20T02:29:30Z

I agree, too, with @r12a to handle ambiguous cases by not inserting a line break around such places, and also agree that authors should not insert a line break between CJK and non-CJK letters when they do not want a space between them (e.g., 是Web领).

However, I can't agree that authors should not insert a line break between a CJK punctuation mark and a non-CJK letter when they do not want extra space between them (e.g., 是。Web). I opened #5086 for this specific issue "Segment Break Transformation Rules around CJK Punctuation".

fantasai · 2020-05-25T21:40:41Z

So. While this was some very interesting discussion on @xfq's side track, nobody seems to have commented about the actual issue, which is about whether enclosed ideographic should be space-discarding? :)

xfq · 2020-05-26T00:55:02Z

whether enclosed ideographic should be space-discarding

Well, we discussed this issue in a clreq meeting, and the general feeling was that this issue was not important enough for us to discuss, because: 1) the enclosed ideographic blocks are rarely used and even less often appear at the beginning/end of a (hard-wrapped) line; 2) we have lots of high priority issues like (soft-wrapped) line breaking and text-spacing rules :)

They might be more often used in Japanese (since the ARIB STD B24 character set contains many such characters). Perhaps @himorin knows more?

kidayasuo · 2020-05-26T07:13:05Z

Also for Japanese the general feeling is that enclosed ideographic is not that important. To provide a bit more analysis, characters in these blocks are:

① enclosed number or kana ① ㈠㊂㋑
② enclosed days of week or other kanji ㈪㊊㈱㋿
③ ARIB (Association of Radio Industries and Businesses) - 🉇 🉈 🈙

I see category ① more often than others. They are used as list headings. As list headings they typically appear after explicit line breaks and therefore the transformation rules are not that relevant. Of course they can be used as headers of inline lists, or other places in a line. In either case enclosed ideographic numbers should be treated the same way as enclosed Arabic numbers. Otherwise many ordinary people would be puzzled why a space is inserted around ① but not around ㊀.

The category ② is legacy combining characters. They typically come before or after a noun “12日㈪” (12th Monday) and ㈱アップル (Apple incorporated). As they form one noun block probably people would not insert a line break in between. I saw them in the past more often and I feel the use is decreasing in favour of fully spelling them like 月曜/月曜日 (Monday) or 株式会社 (corporation). Please refer to the usage counts below.

The category ③ is special purpose characters used by TV. They are not for general use. I googled these characters but most of the pages found were about the Unicode characters themselves.

Here is a non-scientific use-count obtained by google searching each character within quotation marks. The second circled number denotes the category I used above.
① ① 446M
㈱ ② 31M vs “株式会社” 1.3G
⑴ ① 8.3M
㈠ ① 6.0M
⒈ ① 3.4M
㋐ ① 2.3M
㍿ ② 1.3M
㈪ ② 0.85M vs “日月曜日” 22M, “月曜” 74M, or “月曜日” 225M
㊊ ② 0.34M
🉇 ③ 0.19M
🈙 ③ 0.19M
㊀ ① 0.15M
㍼ ② 0.15M
㋿ ② 0.054M

In sum, they are not frequent characters, and when they are used it is often in a context where the transformation rule is not that important.

I believe it is more important that all enclosed numbers are treated the same way regardless of whether the number is in Arabic style or in ideographic style. Actually, probably all enclosed letters and numbers should be treated in a consistent manner.

MurakamiShinyu · 2020-05-29T08:57:51Z

I agree with @kidayasuo that "all enclosed letters and numbers should be treated in a consistent manner", and I want to use ①②③… in non-CJK text such as "There are three options: ①foo ②bar and ③baz". In such inline list cases, space between items should not be discarded. So I think they should not be space-discarding.

xfq · 2020-05-29T09:07:13Z

In non-CJK text, the spaces will not be discarded because (currently) we only discard if both sides of the line break are part of the space-discarding character set.
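
To make this concrete, the Python sketch below compares two candidate character sets under the both-sides rule. The names `BASE`, `ENCLOSED`, `in_ranges`, and `break_becomes_space` are hypothetical, and `BASE` is only an illustrative subset of the space-discarding blocks. `ENCLOSED` covers the Enclosed Alphanumerics, Enclosed CJK Letters and Months, and Enclosed Ideographic Supplement blocks.

```python
# Illustrative subset of space-discarding blocks (assumption, not normative).
BASE = [(0x3000, 0x303F), (0x3040, 0x30FF), (0x4E00, 0x9FFF)]

# The enclosed blocks under discussion.
ENCLOSED = [
    (0x2460, 0x24FF),    # Enclosed Alphanumerics
    (0x3200, 0x32FF),    # Enclosed CJK Letters and Months
    (0x1F200, 0x1F2FF),  # Enclosed Ideographic Supplement
]


def in_ranges(ch, ranges):
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


def break_becomes_space(before, after, ranges):
    """Both-sides rule: the break is discarded only if the characters on
    BOTH sides are in the set; otherwise it becomes a space."""
    return not (in_ranges(before, ranges) and in_ranges(after, ranges))


# "...foo" / "②bar...": the Latin "o" keeps the space under either set.
latin_case = [break_becomes_space("o", "②", r) for r in (BASE, BASE + ENCLOSED)]

# "...项目" / "①...": here the choice of set changes the outcome.
cjk_case = [break_becomes_space("目", "①", r) for r in (BASE, BASE + ENCLOSED)]
```

In this sketch the choice only matters when an enclosed character sits next to a CJK character (or another enclosed character) across the line break.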

MurakamiShinyu · 2020-05-30T07:58:35Z

@xfq ok, my mistake

fantasai · 2020-07-01T06:13:18Z

Based on @kidayasuo and @MurakamiShinyu and @xfq 's comments, I'm closing this as "no change", i.e. treat these blocks the same as Enclosed Alphanumerics, i.e. not including as space-discarding characters.

fantasai added the css-text-3, i18n-tracker, i18n-jlreq, i18n-clreq, and i18n-klreq labels Apr 23, 2020

w3cbot mentioned this issue Apr 23, 2020

[css-text-3] Should enclosed ideographic blocks be space-discarding? w3c/i18n-activity#890

Closed

xfq mentioned this issue Apr 24, 2020

Space between characters after joining two lines w3c/clreq#293

Open

5 tasks

w3cbot mentioned this issue Apr 24, 2020

[css-text-3] Should enclosed ideographic blocks be space-discarding? w3c/i18n-activity#893

Closed

r12a mentioned this issue Apr 29, 2020

[css-text-3] Discarding Line Breaks Adjacent to Ambiguous Characters #5017

Open

kidayasuo mentioned this issue May 8, 2020

Review Segment Break Transformation Rules (CSS Text Level 3) w3c/jlreq#211

Open

MurakamiShinyu mentioned this issue May 19, 2020

[css-text-3] Segment Break Transformation Rules around CJK Punctuation #5086

Open

fantasai closed this as completed Jul 1, 2020

fantasai added the Closed Accepted by Editor Discretion label Jul 1, 2020

frivoal added the Testing Unnecessary Memory aid - issue doesn't require tests label Dec 3, 2020

fantasai added the Tracked in DoC label Dec 15, 2020

xfq mentioned this issue Apr 20, 2022

[css-text-4] Make ideograph-alpha and ideograph-numeric part of text-spacing: normal #6950

Closed
