[css-text-3] Should enclosed ideographic blocks be space-discarding? #4992

Closed
fantasai opened this issue Apr 23, 2020 · 18 comments
Labels
Closed Accepted by Editor Discretion css-text-3 Current Work i18n-clreq Chinese language enablement i18n-jlreq Japanese language enablement i18n-klreq Korean language enablement i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. Testing Unnecessary Memory aid - issue doesn't require tests Tracked in DoC

Comments

@fantasai
Collaborator

In #337 we decided to key line-break transformation behavior by Unicode Block. Most of the blocks are pretty straightforward: Han, Kana, Yi, and CJK punctuation blocks discard, and everything else converts to a space. But there are a few interesting cases...

One interesting case is the enclosed ideographic blocks:
https://en.wikipedia.org/wiki/Enclosed_CJK_Letters_and_Months
https://en.wikipedia.org/wiki/Enclosed_Ideographic_Supplement

The numerics in the Letters and Months block seem likely to be used outside of CJK contexts. There are also quite a few Hangul, and I wouldn't be surprised if at least some of the other characters are also used in Korean sometimes.

Note, however, that we only discard if both sides (before and after) the line break are part of the space-discarding character set.
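
For readers who want the rule spelled out, here is a minimal Python sketch of "discard only if both sides are space-discarding, otherwise convert to a space". The range table is a hand-picked, simplified subset used as an assumption for illustration; it is not the spec's actual list, and the names `SPACE_DISCARDING_RANGES`, `is_space_discarding`, and `transform_segment_breaks` are hypothetical. The sketch also ignores the white space collapsing that happens around a segment break.

```python
# Illustrative subset of space-discarding blocks (an assumption, not the
# normative css-text-3 list).
SPACE_DISCARDING_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xA000, 0xA4CF),  # Yi Syllables and Yi Radicals
]


def is_space_discarding(ch: str) -> bool:
    """True if the character falls in one of the illustrative ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in SPACE_DISCARDING_RANGES)


def transform_segment_breaks(text: str) -> str:
    """Join hard-wrapped lines: drop the break only when BOTH neighbours
    are space-discarding; otherwise turn the break into a space."""
    lines = text.split("\n")
    out = lines[0]
    for line in lines[1:]:
        before, after = out[-1:], line[:1]
        if (before and after
                and is_space_discarding(before)
                and is_space_discarding(after)):
            out += line          # break discarded
        else:
            out += " " + line    # break becomes a space
    return out


print(transform_segment_breaks("国际\n标准"))   # Han on both sides
print(transform_segment_breaks("是Web\n领域"))  # Latin before, Han after
```

Under these assumptions the first call joins the lines directly, while the second keeps a space because only one side of the break is space-discarding.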

@fantasai fantasai added css-text-3 Current Work i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. i18n-jlreq Japanese language enablement i18n-clreq Chinese language enablement i18n-klreq Korean language enablement labels Apr 23, 2020
@xfq
Member

xfq commented Apr 28, 2020

Note, however, that we only discard if both sides (before and after) the line break are part of the space-discarding character set.

What's the rationale for this? If we only discard when both sides are part of the space-discarding character set, sometimes unintended behavior will appear. For example:

万维网联盟(World Wide Web Consortium, W3C)是Web
领域的国际标准化组织,开发开放Web标准,确保
Web的长期发展。欢迎您加入W3C的朋友计划,支持W3C
实现“尽展Web无限潜能”的使命,并为Web开发者提供更多工具。

will become:

万维网联盟(World Wide Web Consortium, W3C)是Web{SPACE}领域的国际标准化组织,开发开放Web标准,确保{SPACE}Web的长期发展。欢迎您加入W3C的朋友计划,支持W3C{SPACE}实现“尽展Web无限潜能”的使命,并为Web开发者提供更多工具。

({SPACE} stands for the inserted space. Perhaps there have been related discussions before, but I did not find them.)

@kojiishi
Contributor

Interesting indeed. I'm fine with not including them if usage outside CJ is expected. Would you want a space when Latin text follows enclosed numerals? I.e., for "㉑Text", should there be a space?

@xfq

What's the rationale for this?

Please consider the other case:

是 Web
领域的国际标准化组织

then you would want:

是 Web 领域的国际标准化组织

@xfq
Member

xfq commented Apr 29, 2020

Interesting indeed. I'm fine with not including them if usage outside CJ is expected. Would you want a space when Latin text follows enclosed numerals? I.e., for "㉑Text", should there be a space?

I'm not sure. I will discuss with the clreq editors in w3c/clreq#293

@xfq
Member

xfq commented Apr 29, 2020

Please consider the other case:

This can happen indeed. We need to answer at least two questions (maybe not in this issue):

  1. Should we add spaces (U+0020) between Han and Latin characters?
  2. If spaces should be added, how do we deal with pages that have no spaces added? If spaces should not be added, how do we deal with pages that already contain added space characters?

@kojiishi
Contributor

Should we add spaces (U+0020) between Han and Latin characters?

@kidayasuo might want to jump in. He is brainstorming the same question in jlreq, namely which is the more suitable way to define the typographic rules in a future jlreq.

I'm fine either way with how the spec describes the typographic behavior, but that won't affect what authors actually do. Both types of authors are most likely to keep their preferred behaviors.

For the segment transformation rules, whichever options we take, the other group will need to learn the rules, so I think going simpler is easier to remember and adopt.

@r12a
Contributor

r12a commented Apr 29, 2020

I'm inclined to think that we need to expect authors to make adjustments sometimes to resolve ambiguities. So given:

万维网联盟(World Wide Web Consortium, W3C)是Web
领域的国际标准化组织,开发开放Web标准,确保
Web的长期发展。欢迎您加入W3C的朋友计划,支持W3C
实现“尽展Web无限潜能”的使命,并为Web开发者提供更多工具。

a small adjustment to the source text could fix the problem, such as:

万维网联盟(World Wide Web Consortium, W3C)是Web领
域的国际标准化组织,开发开放Web标准,确保Web的
长期发展。欢迎您加入W3C的朋友计划,支持W3C实
现“尽展Web无限潜能”的使命,并为Web开发者提供更多工具。

If the line breaking is done manually, this shouldn't be a common problem, once the author is aware of how things work. If line breaking is done automatically, the author ought to expect some problems that will need to be rectified, although the application doing the line breaking could also look out for situations where applying a line break between certain characters would introduce ambiguity (e.g. not splitting immediately after a space in CJ).
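
One way an application could "look out for" such situations is a small lint pass over the hard-wrapped source. The following is only a sketch under assumptions: the range table is an illustrative subset rather than the spec's list, `find_ambiguous_breaks` is a hypothetical helper name, and "ambiguous" is taken to mean that exactly one side of the break is space-discarding.

```python
from typing import List

# Illustrative subset of space-discarding ranges (an assumption).
RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def is_space_discarding(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in RANGES)


def find_ambiguous_breaks(source: str) -> List[int]:
    """Return 1-based numbers of lines whose trailing break has a
    space-discarding character on exactly one side."""
    lines = source.split("\n")
    flagged = []
    for i in range(len(lines) - 1):
        before, after = lines[i][-1:], lines[i + 1][:1]
        if not before or not after:
            continue  # blank line: nothing to compare
        if is_space_discarding(before) != is_space_discarding(after):
            flagged.append(i + 1)
    return flagged


ORIGINAL = (
    "万维网联盟(World Wide Web Consortium, W3C)是Web\n"
    "领域的国际标准化组织,开发开放Web标准,确保\n"
    "Web的长期发展。欢迎您加入W3C的朋友计划,支持W3C\n"
    "实现“尽展Web无限潜能”的使命,并为Web开发者提供更多工具。"
)
ADJUSTED = (
    "万维网联盟(World Wide Web Consortium, W3C)是Web领\n"
    "域的国际标准化组织,开发开放Web标准,确保Web的\n"
    "长期发展。欢迎您加入W3C的朋友计划,支持W3C实\n"
    "现“尽展Web无限潜能”的使命,并为Web开发者提供更多工具。"
)

print(find_ambiguous_breaks(ORIGINAL))  # breaks that mix Han and Latin
print(find_ambiguous_breaks(ADJUSTED))  # breaks with Han on both sides
```

With these inputs, every break in the original wrapping is flagged, and none is flagged after the adjustment.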

@xfq
Member

xfq commented Apr 29, 2020

Just FYI - I'm not sure about other WYSIWYG editors, but BlueGriffon enables line wrapping (of the HTML source code) by default, and I often see unwanted spaces appear in Chinese (or between Chinese and English) when the web page is created with BlueGriffon.

@r12a
Contributor

r12a commented May 1, 2020

Sounds like you should raise a bug report against that app.

@xfq
Member

xfq commented May 2, 2020

Sounds like you should raise a bug report against that app.

I didn't find the bug tracker for it, so I'll friendly ping @therealglazou here.

@kidayasuo

I agree with @r12a that ambiguous cases should be handled by not inserting a line break around such places. This means that for these "ambiguous" cases it does not really matter which way we decide. It is probably better to opt for whichever is easier to remember or more intuitive to end users.

One possible caveat is that what is truly ambiguous can sometimes be unclear because the ambiguity in this case is based on human expectations.

@MurakamiShinyu
Collaborator

I agree, too, with @r12a that ambiguous cases should be handled by not inserting a line break around such places, and I also agree that authors should not insert a line break between CJK and non-CJK letters when they do not want a space between them (e.g., 是Web领).

However, I can't agree that authors should not insert a line break between a CJK punctuation mark and a non-CJK letter when they do not want extra space between them (e.g., 是。Web). I opened #5086 for this specific issue, "Segment Break Transformation Rules around CJK Punctuation".

@fantasai
Collaborator Author

fantasai commented May 25, 2020

So. While this was some very interesting discussion on @xfq's side track, nobody seems to have commented about the actual issue, which is about whether enclosed ideographic should be space-discarding? :)

@xfq
Member

xfq commented May 26, 2020

whether enclosed ideographic should be space-discarding

Well, we discussed this issue in a clreq meeting, and the general feeling was that this issue was not important enough for us to discuss, because: 1) the enclosed ideographic blocks are rarely used and even less often appear at the beginning/end of a (hard-wrapped) line; 2) we have lots of high priority issues like (soft-wrapped) line breaking and text-spacing rules :)

They might be more often used in Japanese (since the ARIB STD B24 character set contains many such characters). Perhaps @himorin knows more?

@kidayasuo

kidayasuo commented May 26, 2020

Also for Japanese the general feeling is that enclosed ideographic is not that important. To provide a bit more analysis, characters in these blocks are:

① enclosed number or kana ① ㈠ ㊂ ㋑
② enclosed days of week or other kanji ㈪ ㊊ ㈱ ㋿
③ ARIB (Association of Radio Industries and Businesses) - 🉇 🉈 🈙

I see category ① more often than the others. They are used as list headings. As list headings they typically appear after explicit line breaks, and therefore the transformation rules are not that relevant. Of course they can be used as headers of inline lists, or in other places in a line. In either case enclosed ideographic numbers should be treated the same way as enclosed Arabic numbers. Otherwise many ordinary people would be puzzled why a space is inserted around ① but not around ㊀.

Category ② is legacy combining characters. They typically come before or after a noun, as in “12日㈪” (Monday the 12th) and ㈱アップル (Apple Incorporated). As they form one noun block, people would probably not insert a line break in between. I used to see them more often, and I feel their use is decreasing in favour of spelling them out in full, like 月曜/月曜日 (Monday) or 株式会社 (corporation). Please refer to the usage counts below.

Category ③ is special-purpose characters used by TV. They are not for general use. I googled these characters, but most of the pages found were about the Unicode characters themselves.

Here is a non-scientific use-count obtained by google searching each character within quotation marks. The second circled number denotes the category I used above.
① ① 446M
㈱ ② 31M vs “株式会社” 1.3G
⑴ ① 8.3M
㈠ ① 6.0M
⒈ ① 3.4M
㋐ ① 2.3M
㍿ ② 1.3M
㈪ ② 0.85M vs “日月曜日” 22M, “月曜” 74M, or “月曜日” 225M
㊊ ② 0.34M
🉇 ③ 0.19M
🈙 ③ 0.19M
㊀ ① 0.15M
㍼ ② 0.15M
㋿ ② 0.054M

In sum, they are not frequent characters, and if they are used, it is often in a context where the transformation rule is not that important.

I believe it is more important that all enclosed numbers are treated the same way, regardless of whether the number is in Arabic style or in ideographic style. Actually, probably all enclosed letters and numbers should be treated in a consistent manner.
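
The consistency concern is easier to see once the characters counted above are mapped to their Unicode blocks, since a rule keyed by block can only treat them alike if the blocks themselves are treated alike. The lookup below is a small illustrative sketch: `BLOCKS` lists only the four blocks relevant here (ranges as published in the Unicode block list), and `block_of` is a hypothetical helper, since Python's `unicodedata` module does not expose block names.

```python
# Only the blocks relevant to this discussion; not a full block table.
BLOCKS = [
    (0x2460, 0x24FF, "Enclosed Alphanumerics"),
    (0x3200, 0x32FF, "Enclosed CJK Letters and Months"),
    (0x3300, 0x33FF, "CJK Compatibility"),
    (0x1F200, 0x1F2FF, "Enclosed Ideographic Supplement"),
]


def block_of(ch: str) -> str:
    """Return the block name for ch, or 'other' if not in the table."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"


for ch in "①⑴⒈㈠㊀㋐㈱㈪㊊㋿㍿㍼🉇🈙":
    print(f"U+{ord(ch):04X} {ch} {block_of(ch)}")
```

Note that ①, ⑴ and ⒈ land in Enclosed Alphanumerics while ㈠, ㊀ and ㋐ land in Enclosed CJK Letters and Months, so the two groups of list markers only behave the same if both blocks get the same treatment.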

@MurakamiShinyu
Collaborator

I agree with @kidayasuo that "all enclosed letters and numbers should be treated in a consistent manner", and I want to use ①②③… in non-CJK text such as "There are three options: ①foo ②bar and ③baz". In such inline list cases, space between items should not be discarded. So I think they should not be space-discarding.

@xfq
Member

xfq commented May 29, 2020

In non-CJK text, the spaces will not be discarded because (currently) we only discard if both sides of the line break are part of the space-discarding character set.
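
To make this concrete, here is a hedged sketch that compares the two options on a single line break. Everything in it is an assumption for illustration: `BASE_RANGES` is a simplified stand-in for the space-discarding set, `ENCLOSED_RANGES` models the hypothetical choice of adding the enclosed blocks to that set, and `join_lines` is a made-up helper name.

```python
BASE_RANGES = [
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]
# Hypothetical extension: what if enclosed blocks were space-discarding?
ENCLOSED_RANGES = [
    (0x2460, 0x24FF),  # Enclosed Alphanumerics
    (0x3200, 0x32FF),  # Enclosed CJK Letters and Months
]


def is_discarding(ch: str, include_enclosed: bool) -> bool:
    ranges = BASE_RANGES + (ENCLOSED_RANGES if include_enclosed else [])
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def join_lines(first: str, second: str, include_enclosed: bool) -> str:
    """Join two hard-wrapped lines; the break is dropped only when the
    characters on both sides are space-discarding."""
    both = bool(first and second
                and is_discarding(first[-1], include_enclosed)
                and is_discarding(second[0], include_enclosed))
    return first + second if both else first + " " + second


# Inline list in non-CJK text: the Latin letter before the break
# keeps the space under either option.
print(join_lines("①foo", "②bar", include_enclosed=True))
print(join_lines("①foo", "②bar", include_enclosed=False))

# Enclosed characters on both sides: here the options differ.
print(join_lines("項目㊀", "㊁項目", include_enclosed=True))
print(join_lines("項目㊀", "㊁項目", include_enclosed=False))
```

In this sketch the choice only changes the result when both sides of the break are enclosed or CJK characters.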

@MurakamiShinyu
Collaborator

@xfq ok, my mistake

@fantasai
Collaborator Author

fantasai commented Jul 1, 2020

Based on the comments from @kidayasuo, @MurakamiShinyu, and @xfq, I'm closing this as "no change": treat these blocks the same as Enclosed Alphanumerics, i.e. do not include them as space-discarding characters.

@fantasai fantasai closed this as completed Jul 1, 2020
7 participants