L2/01-301
Analysis of Character Deprecation in the Unicode Standard
Ken Whistler
August 1, 2001
Mark Davis has suggested that a character property of "deprecated" be
added to the Unicode Character Database, to track those characters
that have been deprecated in the standard.
The problem I see is that, to date, the standard applies many different
kinds of deprecation and "discouragement" to various characters, so it
is not clear exactly what we mean by deprecation or which characters
such a property should cover.
The *definition* of deprecation currently given in the standard
is:
D7a: Deprecated character: a coded character whose use is strongly
discouraged. Such characters are retained in the standard, but
should not be used.
(Chapter 3, page 41)
This needs to be compared with the definition and notes for
"compatibility character" as well:
D21 Compatibility character: a character that has a compatibility
decomposition.
* ... They support transmission and processing of legacy data.
Their use is discouraged other than for legacy data.
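As an illustrative sketch only (it is not part of the standard's text),
the test that D21 relies on can be run against the Unicode Character
Database. The example assumes Python's unicodedata module, whose
decomposition() result begins with a tag such as <compat> or <font> for
compatibility mappings and has no tag for canonical ones. The helper
name decomposition_kind is hypothetical, and results depend on the
Unicode version bundled with the interpreter.

```python
import unicodedata


def decomposition_kind(ch: str) -> str:
    """Return 'compatibility', 'canonical', or 'none' for one character."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a leading <tag>; canonical ones do not.
    return "compatibility" if mapping.startswith("<") else "canonical"


print(decomposition_kind("\u2122"))  # TRADE MARK SIGN
print(decomposition_kind("\u0344"))  # COMBINING GREEK DIALYTIKA TONOS
print(decomposition_kind("A"))       # LATIN CAPITAL LETTER A
```

Under D21 only the first kind makes a character a compatibility
character. A canonical decomposition, as for 0344, does not.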
===================================================================
Here is the complete list of characters that have, so far, been
labelled, indicated, or implicated as "deprecated" or "discouraged"
in the standard.
Rick McGowan originally compiled this list, and I have rearranged
and annotated it.
A. Labelled as "deprecated"
1. Vietnamese combining tone marks
0340 COMBINING GRAVE TONE MARK (Vietnamese)
0341 COMBINING ACUTE TONE MARK (Vietnamese)
These were belatedly recognized as mistaken, duplicate encodings,
and were formally deprecated by the UTC.
2. Alternate format controls inherited from 10646
206A INHIBIT SYMMETRIC SWAPPING
206B ACTIVATE SYMMETRIC SWAPPING
206C INHIBIT ARABIC FORM SHAPING
206D ACTIVATE ARABIC FORM SHAPING
206E NATIONAL DIGIT SHAPES
206F NOMINAL DIGIT SHAPES
These were recognized as "really bad" and were formally deprecated
by the UTC when they first went into the Unicode Standard.
B. Labelled as "strongly discouraged"
1. 3-part Tibetan vowel signs with a-chung
0F77 TIBETAN VOWEL SIGN VOCALIC RR
0F79 TIBETAN VOWEL SIGN VOCALIC LL
These multi-part vowels are not needed, and have canonical decompositions
involving another multi-part vowel 0F81 which itself is "discouraged".
C. Labelled as "discouraged"
1. 2-part Tibetan vowel signs with a-chung
0F73 TIBETAN VOWEL SIGN II
0F75 TIBETAN VOWEL SIGN UU
0F81 TIBETAN VOWEL SIGN REVERSED II
These 2-part Tibetan vowels are not needed. Their canonical decompositions
are to sequences of combining marks.
2. 2-part Greek accent
0344 COMBINING GREEK DIALYTIKA TONOS
Its canonical decomposition is to a sequence of combining marks.
D. Indicated as "strongly discouraged", but reserved for use with special
protocols.
1. Tag Characters
E0001 LANGUAGE TAG
...
E007F CANCEL TAG
These were born "strongly discouraged" by the UTC, but were not marked
as deprecated, since they were put in explicitly for particular protocol
usage.
E. Indicated as "strongly discouraged" for plain text interchange
1. Interlinear Annotation Characters
FFF9 INTERLINEAR ANNOTATION ANCHOR
FFFA INTERLINEAR ANNOTATION SEPARATOR
FFFB INTERLINEAR ANNOTATION TERMINATOR
See p. 326 of TUS 3.0. "Usage of the annotation character in plain text
interchange is strongly discouraged without prior agreement between
the sender and the receiver..." This is another way of saying that
they are reserved for use with a higher-level protocol.
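To make the overlap between these labels concrete, groups A through E
above can be written down as a small lookup table. This is a minimal
sketch, not a proposed property definition: the names LABELLED_GROUPS
and label_for are hypothetical, and the label strings simply repeat the
wording of this list. One assumption is flagged loudly: the tag
characters are given above only as E0001 ... E007F, so the sketch treats
them as a single inclusive span, which may take in code points that the
elided list does not.

```python
from typing import Optional

# (group, label, inclusive code point ranges), following groups A-E above.
LABELLED_GROUPS = [
    ("A", "deprecated", [(0x0340, 0x0341), (0x206A, 0x206F)]),
    ("B", "strongly discouraged", [(0x0F77, 0x0F77), (0x0F79, 0x0F79)]),
    ("C", "discouraged",
     [(0x0F73, 0x0F73), (0x0F75, 0x0F75), (0x0F81, 0x0F81), (0x0344, 0x0344)]),
    # ASSUMPTION: the source elides the members between E0001 and E007F;
    # the span is treated as contiguous here purely for illustration.
    ("D", "strongly discouraged, reserved for special protocols",
     [(0xE0001, 0xE007F)]),
    ("E", "strongly discouraged for plain text interchange",
     [(0xFFF9, 0xFFFB)]),
]


def label_for(code_point: int) -> Optional[str]:
    """Return 'group: label' for a listed code point, else None."""
    for group, label, ranges in LABELLED_GROUPS:
        for first, last in ranges:
            if first <= code_point <= last:
                return f"{group}: {label}"
    return None


for cp in (0x0340, 0x206C, 0x0F81, 0xFFFA, 0x0041):
    print(f"{cp:04X} -> {label_for(cp)}")
```

A single boolean "deprecated" property would have to collapse these
five labels into yes or no, which is the difficulty described at the
start of this document.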
Then we have groups of characters that are not overtly labelled
as deprecated or discouraged, but for which there are implied
discouragements by reason of their belonging to disparaged
classes of characters.
F. Indicated as "strongly discouraged" "in general"
1. Letterlike symbols "that are merely font variants or alternative
representations of other character sequences." (see TUS 3.0, p. 298)
This presumably was intended to apply to all the letterlike symbols
in the range 2100..213A that have a "" or "