[Architecture] [Accessforall] Last call for comments: Codes for languages in ISO 24751 and the registry

Christophe Strobbe strobbe at hdm-stuttgart.de
Tue Oct 23 07:56:42 EDT 2012


Hi,

I was ill last week, so I'm catching up now. I have not seen any comments
or objections to this proposal, so I have marked it as accepted by common
consent on the wiki (with links to the calls for comments etc.).
See
<http://wiki.gpii.net/index.php/Discussion_on_Profile_Structure#Language_Codes>.

When we have a separate Decisions page, we can reference this decision
from there.

Best regards,

Christophe


On Thu, 11.10.2012, 20:57, Christophe Strobbe wrote:
>
> Hi,
>
> I have not seen any objections to the proposal to use IETF BCP 47 as the
> standard for the term "language" in the Registry. I have collected some of
> the content of the discussions (and some additional information) in the
> wiki page at
> <http://wiki.gpii.net/index.php/Discussion_on_Profile_Structure#Language_Codes>,
> and below in this message (for the links, please read the version in the
> wiki). I would like to give you some time to review this, so we can reach
> consensus on this. If there are no objections by Monday evening (15
> October) I will assume that we have reached consensus. If any
> clarifications are needed, please let me know as soon as possible.
>
> Best regards,
>
> Christophe Strobbe
>
>
> The text from the wiki page:
>
>
> One of the terms in the current version of the Registry is language
> (description: "a preference for the language of the user interface"). The
> value space is tentatively defined as the values defined by ISO 639-2/T.
> ISO 639-2/T identifies languages by means of three-letter codes (instead
> of the ISO 639-1 two-letter codes that are commonly used in HTML pages)
> without a means of identifying variants (see also the list of ISO 639-2
> codes on Wikipedia).
>
> Proposal:
>
> Use IETF BCP 47 instead of ISO 639-2/T as the format for identifying
> languages.
> * BCP 47 defines a language tag as consisting of a primary language
> subtag, followed by several optional subtags (especially for script,
> region and/or variant).
>  - Scripts can be identified by means of codes defined by ISO 15924:2004.
> For example, zh-Hans and zh-Hant have sometimes been used to distinguish
> between Chinese with Simplified Characters and with Traditional
> Characters, respectively. The registration authority for ISO 15924 tags
> is the Unicode Consortium; see Codes for the representation of names of
> scripts.
>  - Regions, including countries, can be identified by means of codes
> defined by ISO 3166-1. An ISO 3166-1 decoding table is available on the
> ISO website. The list of alpha-2 country codes (in TXT, HTML or XML) is
> available free of charge for internal use and non-commercial purposes.
> The full ISO 3166-1:2006, which also contains the alpha-3 codes and the
> numeric codes, is not available free of charge.
> * BCP 47 allows the use of three-letter codes for primary language tags
> defined by ISO 639-3. The registration authority for ISO 639-3 tags is SIL
> International; see ISO 639-3 Registration Authority. Using ISO 639-3 has
> several advantages:
>  - This list is more complete than ISO 639-1 and ISO 639-2.
>  - ISO 639-3 provides more precision for the identification of languages:
> some of the ISO 639-1 codes actually referred to macrolanguages, for
> example zh (Chinese) and ar (Arabic). The ISO 639-3 list distinguishes
> between macrolanguages and sublanguages, for example zho (Chinese) has
> sublanguages such as cmn (Mandarin), hak (Hakka) and yue (Yue or
> Cantonese). These distinctions can trigger different Braille conversion
> tables or text-to-speech engines (e.g. Ekho supports Cantonese, Mandarin
> and Zhaoan Hakka), so these distinctions are relevant to accessibility.
> See the ISO 639-3 Macrolanguage Mappings.
>  - Three-letter codes also allow us to identify sign languages. ISO 639-2
> contains the tag "sgn" for sign language (which would need to be refined
> with subtags), and ISO 639-3 contains tags for individual sign languages,
> such as ase (American Sign Language), asf (Australian Sign Language) and
> sgg (Swiss-German Sign Language). ISO 639-1, by contrast, contained no
> tags to identify sign languages.
> * BCP 47 is also the standard for values of lang and xml:lang in HTML5.
> * ISO standards can use IETF RFCs and BCPs as normative references.
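
The tag structure described in the proposal (primary language subtag, then optional script and region subtags) can be sketched as follows. This is only a minimal illustration, not a full BCP 47 parser: the names `parse_tag` and `LanguageTag` are made up for this example, and extlang, variant, extension and private-use subtags are deliberately ignored. It also checks syntax only; it does not look anything up in the subtag registry.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern: language[-script][-region].
# Assumption: only these three subtag kinds matter for the illustration.
_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"               # ISO 639 code, 2 or 3 letters
    r"(?:-(?P<script>[A-Za-z]{4}))?"             # ISO 15924 script, 4 letters
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"    # ISO 3166-1 alpha-2 or UN M.49
)


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_tag(tag: str) -> Optional[LanguageTag]:
    """Split a tag into subtags; return None if it does not fit the pattern."""
    m = _TAG.fullmatch(tag)
    if m is None:
        return None
    script = m.group("script")
    region = m.group("region")
    # Conventional casing: language lower case, script title case, region upper case.
    return LanguageTag(
        language=m.group("language").lower(),
        script=script.title() if script else None,
        region=region.upper() if region else None,
    )


print(parse_tag("zh-Hant"))
print(parse_tag("en-CA"))
print(parse_tag("yue"))
```

Tags are case-insensitive, so the sketch normalises them to the conventional casing before returning them.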
>
> Note:
> * While the set of languages supported by assistive technologies is only a
> very small subset of the (over 5000) living languages, it is also
> important to support the matching of resources in specific languages
> (including subtitles, captions, etc) with languages that a user
> understands, and this is probably a much wider range than what is
> supported by AT.
> * Implementations would need to synchronise their list of languages with
> the list maintained by SIL International (the registration authority for
> ISO 639-3), since language tags may be retired (see the Retired ISO 639-3
> Codes).
> * Implementations would need to synchronise their list of country codes
> with the list maintained by the ISO 3166 Maintenance Authority, since
> country codes may be added or withdrawn (e.g. the country code for
> Yugoslavia was withdrawn).
> * There are a few special language codes:
>  - Content in an undetermined language can be tagged with 'und' (ISO 639-2
> and ISO 639-3). BCP 47 points out that this tag should only be used if a
> language tag is required.
>  - Content in an uncoded language can be tagged with 'mis' (ISO 639-2 and
> ISO 639-3), i.e. the language is known but has no language code.
>  - Non-linguistic content can be tagged with 'zxx' (ISO 639-2 and ISO
> 639-3), e.g. sound recordings with only nonverbal sounds, instrumental
> music, programming source code.
>  - Content in multiple languages can be tagged with 'mul' (ISO 639-2 and
> ISO 639-3). BCP 47 points out that this tag "SHOULD NOT be used when a
> list of languages or individual tags for each content element can be used
> instead".
>  - There is no "default country code" for languages, so if content is
> tagged with only "eng" (English), there is insufficient information to
> decide, for example, whether an American, Canadian, British or Australian
> Braille translation table should be used.
>  - The language tags described in IETF BCP 47 "are sequences of characters
> from the US-ASCII [ISO646] repertoire". (This does not prohibit the use
> of language tags in UTF-8 content. As Wikipedia points out: "The first
> 128 characters of Unicode, which correspond one-to-one with ASCII, are
> encoded using a single octet with the same binary value as ASCII, making
> valid ASCII text valid UTF-8-encoded Unicode as well.")
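
An implementation would probably want to recognise the special codes listed in the note before trying to match a tag against Braille tables or TTS voices. The following is a rough sketch under that assumption. The function names are hypothetical, and the descriptions in the dictionary are paraphrases of the note, not official wording.

```python
from typing import Optional

# The four special codes mentioned in the note (ISO 639-2 and ISO 639-3).
SPECIAL_CODES = {
    "und": "undetermined language",
    "mis": "uncoded language",
    "zxx": "no linguistic content",
    "mul": "multiple languages",
}


def describe_special(tag: str) -> Optional[str]:
    """Return a description if the primary subtag is a special code, else None."""
    primary = tag.split("-", 1)[0].lower()
    return SPECIAL_CODES.get(primary)


def has_only_primary_subtag(tag: str) -> bool:
    """True for tags such as 'eng' that carry no script, region or other subtag.

    Since there is no default country for a language, such a tag does not say
    which regional Braille table or voice to pick.
    """
    return "-" not in tag


print(describe_special("zxx"))
print(has_only_primary_subtag("eng"))
print(has_only_primary_subtag("en-CA"))
```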
>
>
>
>
> On Fri, 5.10.2012, 17:09, Christophe Strobbe wrote:
>>
>> On Thu, 4.10.2012, 21:23, Gregg Vanderheiden wrote:
>>> Great discussion
>>>
>>> We need to have someone who will own this issue and manage it through to
>>> resolution.
>>>
>>> Christophe, can you take ownership of this -- and work with everyone to
>>> find a resolution?
>>
>>
>> OK.
>> I currently consider IETF BCP 47 <http://tools.ietf.org/html/bcp47> the
>> most appropriate standard to use for the "language" term in the registry.
>> In addition to what I wrote in the last two days, BCP 47 is also the
>> format for the lang and xml:lang attributes in the current HTML5 draft:
>> <http://www.w3.org/TR/html5/global-attributes.html#the-lang-and-xml:lang-attributes>.
>> If anybody wants to speak against using IETF BCP 47 to define the value
>> space for "language" in the registry, please do so by Tuesday evening
>> next week (10 October).
>>
>> Best regards,
>>
>> Christophe Strobbe
>>
>>
>>>
>>>
>>> Gregg
>>> --------------------------------------------------------
>>> Gregg Vanderheiden Ph.D.
>>> Director Trace R&D Center
>>> Professor Industrial & Systems Engineering
>>> and Biomedical Engineering
>>> University of Wisconsin-Madison
>>>
>>> Technical Director - Cloud4all Project - http://Cloud4all.info
>>> Co-Director, Raising the Floor - International
>>> and the Global Public Inclusive Infrastructure Project
>>> http://Raisingthefloor.org   ---   http://GPII.net
>>>
>>> On Oct 4, 2012, at 6:48 AM, Christophe Strobbe
>>> <strobbe at hdm-stuttgart.de> wrote:
>>>
>>>>
>>>> A few things to bear in mind before making this decision:
>>>> 1. ISO 639-2 (or any other part of ISO 639) just covers the codes for the
>>>> identification of languages, not subcodes for countries, scripts, etc.
>>>> 2. IETF RFC 4646 describes how to combine ISO 639 language codes with ISO
>>>> 3166 country codes (and other optional subtags), but prefers two-letter
>>>> language codes over three-letter codes if the former type of code is
>>>> available. So that would give us en-CA instead of eng-CA. So if we want
>>>> to use codes like en-CA, we should refer to IETF RFC 4646; in order to use
>>>> tags like eng-CA, we would need to invent our own "standard" for language
>>>> codes. If we prefer IETF RFC 4646 tags, we will need to check if ISO
>>>> standards can use IETF RFCs as normative references.
>>>> 3. The two-letter language code is what you find in HTML pages, the
>>>> OpenDocument format, and many other formats. That might be the reason why
>>>> this type of code was in the sample preference sets. If we use
>>>> three-letter codes, some parts of the GPII/Cloud4all architecture will
>>>> need to refer to a table that maps two-letter codes to three-letter codes,
>>>> because the two-letter codes seem to be the dominant convention (but that
>>>> might change; e.g. Dublin Core seems to accept both types of codes).
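
The mapping table mentioned in point 3 could look roughly like the sketch below. It is only an illustration: the table holds a handful of well-known ISO 639-1 to ISO 639-2/T pairs instead of the full list, and `to_three_letter` is a hypothetical helper name. Note that a result such as eng-CA is the project-internal form discussed in point 2, not a tag that RFC 4646 would prefer.

```python
# A few ISO 639-1 (two-letter) to ISO 639-2/T (three-letter) pairs.
# Assumption: a real implementation would load the complete table.
ISO_639_1_TO_639_2T = {
    "en": "eng",
    "nl": "nld",
    "fr": "fra",
    "de": "deu",
    "zh": "zho",
    "ar": "ara",
}


def to_three_letter(tag: str) -> str:
    """Replace a two-letter primary subtag by its three-letter equivalent.

    Any subtags after the first hyphen (e.g. the country code) are kept as-is.
    Raises KeyError for two-letter codes missing from the sample table.
    """
    primary, sep, rest = tag.partition("-")
    key = primary.lower()
    if len(key) == 2:
        if key not in ISO_639_1_TO_639_2T:
            raise KeyError(f"no three-letter mapping known for {primary!r}")
        key = ISO_639_1_TO_639_2T[key]
    return key + sep + rest


print(to_three_letter("en-CA"))
print(to_three_letter("nl-BE"))
print(to_three_letter("fra"))
```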
>>>>
>>>>
>>>> I am not speaking against using codes like eng-CA, but we should know what
>>>> the impact of this decision would be.
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Christophe
>>>>
>>>> On Thu, 4.10.2012, 07:18, Gregg Vanderheiden wrote:
>>>>> OK
>>>>>
>>>>> Does anyone want to SPEAK AGAINST doing as Colin outlined, which seems
>>>>> to be in line with everyone else's comments?
>>>>>
>>>>> If so, please post any counter thoughts in the next few days. We have
>>>>> everyone, I think, on the two lists attached, so we can make a decision if
>>>>> there are no counter proposals to consider.
>>>>>
>>>>> thanks
>>>>>
>>>>>
>>>>> Gregg
>>>>>
>>>>>
>>>>> On Oct 3, 2012, at 10:44 PM, Colin Clark <colinbdclark at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We should be using ISO 639-2 language codes throughout the system. If
>>>>>> not, it's a bug.
>>>>>>
>>>>>> If I remember correctly, this was probably introduced by the UI Options
>>>>>> team who were integrating at very short notice with the GPII framework.
>>>>>> I believe UI Options can support both two- and three-character language
>>>>>> codes (as is often the case).
>>>>>>
>>>>>> As a speaker of "eng-CA", I don't see any reason not to simply use ISO
>>>>>> 639-2 from the start and to also support country codes, as Christophe
>>>>>> suggests. I also think it's probably worth supporting the two-character
>>>>>> subset for interoperability if possible.
>>>>>>
>>>>>> Colin
>>>>>>
>>>>>> On 2012-10-03, at 1:18 PM, Gregg Vanderheiden wrote:
>>>>>>
>>>>>>> I think that having language and country codes is a great idea.
>>>>>>>
>>>>>>> We DO need to decide which codes to use. I think the square brackets
>>>>>>> were because an official decision was not made yet.
>>>>>>>
>>>>>>> But I think using the ISO codes for both would be the right thing to
>>>>>>> do. I added the arch list to see if someone knows why two-letter
>>>>>>> codes are currently used. (W3C?)
>>>>>>>
>>>>>>> We also should say something like "if no country is specified then ...."
>>>>>>> (Is there a default country for all languages specified somewhere?)
>>>>>>> We might say the country of origin -- but I'm not sure all languages
>>>>>>> have an (existing) country of origin anymore.
>>>>>>>
>>>>>>> Good catch, Christophe.
>>>>>>> Let's get a decision and then record it in the Glossary.
>>>>>>>
>>>>>>> I wonder if we should have a decision registry somewhere since we have
>>>>>>> so many people involved.
>>>>>>>
>>>>>>>
>>>>>>> Gregg
>>>>>>>
>>>>>>>
>>>>>>> On Oct 3, 2012, at 11:43 AM, Christophe Strobbe
>>>>>>> <christophestrobbe at yahoo.co.uk> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> While creating a preference set for one of the personas in the
>>>>>>>> Cloud4all smarthouse simulation
>>>>>>>> <http://wiki.gpii.net/index.php/SmartHouses_Preference_Sets>, I looked
>>>>>>>> into language codes and found the following:
>>>>>>>> (1) ISO/IEC 24751:2008 (all subparts) refer to ISO 639-2:1998 for
>>>>>>>> language codes. In the registry, the value space for "language" is
>>>>>>>> [ISO 639-2/T] (I don't know the reason for the square brackets).
>>>>>>>> According to
>>>>>>>> <https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes>
>>>>>>>> and <http://www.loc.gov/standards/iso639-2/php/code_list.php>, the ISO
>>>>>>>> 639-2 codes are three-letter codes (e.g. "eng" for English, "dut" or
>>>>>>>> "nld" for Dutch, "fre" or "fra" for French, etc). However, the JSON
>>>>>>>> preference sets I've seen so far (I mean those by the GPII/Cloud4all
>>>>>>>> Architecture team) use two-letter codes (see Carla's, Nisha's and
>>>>>>>> Timothy's preference sets). Am I misreading the information I found
>>>>>>>> about ISO 639-2?
>>>>>>>> (2) Related to this is the absence of country information, i.e.
>>>>>>>> combining a language code with a country code from ISO 3166 (see
>>>>>>>> <http://www.loc.gov/standards/iso639-2/faq.html#22>). This is relevant
>>>>>>>> to text-to-speech engines and Braille. For example for Dutch, not many
>>>>>>>> people in Flanders are keen on TTS that uses pronunciation rules from
>>>>>>>> the Netherlands. Braille conventions also vary between countries that
>>>>>>>> use the same official language (well, they even vary between Braille
>>>>>>>> centres, but let's not go into that).
>>>>>>>> (3) Note that IETF RFC 4646 <http://tools.ietf.org/html/rfc4646> gives
>>>>>>>> preference to the shortest ISO 639 code (two or three letters) that is
>>>>>>>> available for a language (check the ABNF syntax under
>>>>>>>> <http://tools.ietf.org/html/rfc4646#section-2.1>). This base code can
>>>>>>>> then be combined with an ISO 3166 country code, to create tags like
>>>>>>>> en-US (American English) and en-GB (British English). However, IETF
>>>>>>>> RFC 4646 is referenced neither by ISO 24751 nor by the registry.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Christophe Strobbe
>>>>>>>>
>>>>>>
>>>>>> ---
>>>>>> Colin Clark
>>>>>> Technical Lead, Fluid Project
>>>>>> http://fluidproject.org
>>>>
>>>> --
>>>> Christophe Strobbe
>>
>>
>> --
>> Christophe Strobbe
>>
>
>
> --
> Christophe Strobbe


-- 
Christophe Strobbe
Akademischer Mitarbeiter
Adaptive User Interfaces Research Group
Hochschule der Medien
Nobelstraße 10
70569 Stuttgart
Tel. +49 711 8923 2749


