Evolution of the 16 Bit Encoding Scheme for Tamil
1. Introduction to Font Encoding
2. ISCII (Indian Script Code for Information Interchange)
3. Unicode
4. Simple and Complex Scripts
5. The Unicode (16-bit) Encoding Scheme for Tamil
6. Disadvantages of current Unicode
7. Proposed All Character Encoding for Tamil - All Tamil Block
8. Advantages of All Character Encoding (All Tamil Block)
9. Whether efficiency of the coding is something to be considered or not?
1. Introduction to Font Encoding
A computer stores text in the form of numbers
and not in its visual written form. Each letter is stored
internally as a number, but displayed on the screen as a glyph
(written form). A font converts the internal number
to a visual form on the screen. By changing the font, one can
display the same text in any typeface.
"Hello" would be represented as 72
101 108 108 111 internally
72 101 108 108 111 would appear as Hello when
viewed with a Times Font
72 101 108 108 111 would appear as Hello when
viewed with an Arial Font
By changing the shapes defined in the font, the
computer can represent any language.
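The relation between letters and stored numbers described above can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative.

```python
# Each character of "Hello" is stored as a number; the font only decides
# how that number is drawn on the screen.
text = "Hello"
codes = [ord(ch) for ch in text]
print(codes)  # [72, 101, 108, 108, 111]

# Converting the numbers back gives the same text, whichever font shows it.
restored = "".join(chr(n) for n in codes)
print(restored)  # Hello
```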
For a long time a 'BYTE' (consisting of 8 bits)
was the basic unit of a character in the computer. A BYTE
could accommodate 256 different characters. Out of these 256
locations a minimum of 32 locations are reserved for use by
the operating system. Hence only 224 locations are freely
available. To represent a language in the computer one had
to allot each one of these locations to a character or glyph
of the language. This process of allotting a location to a
character or glyph is called encoding.
The current English language encoding in the
Windows operating system is the ANSI scheme. Since the English
alphabet contains only 52 letters (26 capital letters and
26 small letters), the assignment of letters to locations
is straightforward.
The Tamil language has 313 letters including
the Grantha letters. Although it is desirable to allot
one location for each letter, we cannot do so due to limitations
of space (224 locations). Hence we assign one location for
each glyph (e.g. kaal, kombu, kokki, pulli etc.). For example,
the TAM encoding scheme allots one location to the kombu and
one location to each base consonant.
Now கெ
can be represented with 170 232, செ
with 170 234 etc. In these examples instead of representing
கெ and செ
as a single byte, we represent them with two bytes. We need
more space but we do not have any other option, since the
number of Tamil letters exceeds the available locations.
This system is called "glyph encoding".
It reduces the number of locations required to
implement the Tamil script. The Government of Tamilnadu has
announced two such encoding schemes for Tamil, the
TAM and TAB standards. TAM is a monolingual encoding scheme
and TAB is a bilingual encoding scheme.
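The glyph-level storage described above can be sketched as a small lookup table. This is only an illustration: the three byte values (170, 232, 234) are the ones quoted in the text, while the table name, the glyph labels and the helper function are hypothetical and do not reproduce the full TAM table.

```python
# Hypothetical excerpt of a glyph table. Only the byte values quoted in the
# text are used; the labels are illustrative names for the glyphs.
GLYPHS = {
    170: "kombu",  # the sign drawn to the left of the consonant
    232: "ka",
    234: "sa",
}

def glyph_names(stored):
    """Return one glyph label per stored byte (one byte, one glyph)."""
    return [GLYPHS[b] for b in stored]

ke = bytes([170, 232])  # the letter ke: kombu followed by ka
se = bytes([170, 234])  # the letter se: kombu followed by sa

print(glyph_names(ke))  # ['kombu', 'ka']
print(glyph_names(se))  # ['kombu', 'sa']
print(len(ke))          # 2 bytes are stored for one letter
```

Because every stored byte maps to exactly one glyph, software that simply draws one shape per byte needs no extra processing to display such text.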
In these encoding schemes, since there is a one
to one relation between storage and display (i.e. every byte
stored is displayed), the Tamil script could be implemented
in any 'readymade' software package without any additional
support. While it was relatively easy for the Tamil script
to be encoded, because of the smaller number of characters,
uniformity of script and absence of 'conjunct consonants',
it was more difficult to encode the other Indian languages.
2. ISCII (Indian Script Code for Information Interchange)
The Indian government, with a view to having a
common encoding for all Indian languages, developed the ISCII
standard.
Features
This encoding is bilingual in nature.
The first 128 locations are exactly as per the ANSI
encoding standard and contain the English alphabet, commonly
used punctuation marks and symbols.
The next 128 locations contain the encoding for each
Indian language.
Since these 128 locations are not enough to accommodate
the letters of most of the Indian languages, only vowels,
consonants and vowel modifiers were encoded.
Similar vowels, consonants and vowel modifiers of various
languages occupied the same slot, i.e. the vowel 'a' would
be allocated the same location in all the languages.
This system of encoding enabled viewing of the text
in various languages just by changing the font. A text
typed in Tamil could be easily read in a transliterated form
in Hindi by using a Hindi font. The reverse may not hold,
since not all the Hindi consonants have a Tamil equivalent.
Disadvantages
It requires more space since almost all the letters
would be stored in their broken down form.
It does not have a one-to-one relation between the
stored bytes and the displayed glyphs. For example the Tamil letter
'ku' would be displayed as a single glyph, but it would be
stored internally as a consonant 'k' + vowel modifier 'u'.
There is a two-to-one relation between the stored text and
the display.
This added complexity in display makes it unsuitable
for usage in 'ready-made' software.
Thus it was not used in Tamil.
3. Unicode
It must be noted that the same 256 locations are used by different
languages. For example, the ASCII scheme uses location 65 to
represent 'A', while the TAM encoding scheme uses the same
location for the Tamil letter 'a'.
English text such as 'and' would be displayed as Tamil glyphs if
viewed with a TAM encoded Tamil font. Hence, unless we know which
language the text is in, we cannot choose the appropriate font
to view it. This led to a chaotic situation.
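The ambiguity can be reproduced with any two 8 bit encodings. The sketch below uses latin-1 and cp1251 from Python's standard codecs purely as stand-ins, since TAM is not available as a Python codec; the variable names are illustrative.

```python
# The same stored byte, read under two different 8 bit encodings.
# latin-1 and cp1251 are stand-ins chosen for illustration, not TAM.
stored = bytes([0xC0])

as_western = stored.decode("latin-1")   # Latin capital A with grave accent
as_cyrillic = stored.decode("cp1251")   # Cyrillic capital A

print(hex(ord(as_western)))        # 0xc0
print(hex(ord(as_cyrillic)))       # 0x410
print(as_western == as_cyrillic)   # False
```

Nothing in the stored byte says which reading is intended, which is the situation the text describes.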
In order to avoid this confusion, letters of
different languages had to be given different numbers. This
required more locations. Thus was born the 'double byte' encoding
scheme, which uses 16 bits to represent a character instead
of 8 bits. In the 16 bit space 65,536 locations are available,
as compared to 256 locations in the 8 bit space.
Unicode is the most common 16 bit encoding scheme
in use today. It contains
characters of all the major world languages. It is being developed
by the Unicode Consortium, which has the major software developers
and computer manufacturers as members. The Indian Government
is also a voting member of the consortium.
It is a stated policy of Unicode that only characters
will be encoded and not glyphs. It also states that it is
not concerned with efficiency of the encoding system. It must
be noted however that in the beginning all existing standard
encoding schemes of different languages were implemented 'as
is' in Unicode without consideration of the above principles.
Because of this policy the ISCII standard, which
was primarily designed for an 8-bit environment, was used as
the base for implementing the Indian languages in Unicode.
Thus the Tamil block of Unicode was based on the Tamil encoding
in ISCII, which had not been used at all till then. A simple script
was converted into a complex script.
4. Simple and Complex Scripts
In some Indian languages, when two consonants
come together, like ik and ir, the glyphs are not rendered
in the normal way but in a different way. In these
languages more than one glyph rendering is possible
for one character, depending on the context. These languages
are said to have complex scripts. For these languages,
storing characters rather than glyphs in memory may be beneficial,
but they have to pay the price of a rendering module.
In Tamil for each character there is only one
way of rendering. Hence Tamil does NOT have a complex script.
Because a processing module is needed
to show the letters on the screen and keep them consistent with the
memory content, software designed for English will not
work unmodified for complex scripts, but it will work smoothly
for simple scripts. This is the reason glyph encodings
are popular in India, whereas ISCII has not become as
popular in the commercial world. In Tamil the use of
ISCII is negligible.
5. The Unicode (16-bit) Encoding Scheme for Tamil
65,536 combinations are possible with 16 bits.
If 16 bits are used for every character, then many languages
of the world can be accommodated in this scheme. Unicode is
designed to accommodate many languages of the world. It is
supposed to encode characters and not glyphs.
Formation of Indic blocks from ISCII
As stated above, in ISCII each Indian language
was given 128 slots. Basically the same scheme
was adopted in Unicode: each Indian language is given
128 slots.
Each block is different from the other blocks.
Hence the perceived advantage of ISCII, 'store in
one language and see in any language', is not valid in
Unicode. But the disadvantages of ISCII have been carried
over, with a further setback.
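The block layout can be inspected directly. The following is a small sketch using the published Unicode ranges for the Devanagari and Tamil blocks; the constant and variable names are illustrative.

```python
# Block ranges as published in the Unicode code charts: 128 slots each.
DEVANAGARI = range(0x0900, 0x0980)
TAMIL = range(0x0B80, 0x0C00)

print(len(DEVANAGARI), len(TAMIL))  # 128 128

hindi_ka = "\u0915"  # Devanagari letter KA
tamil_ka = "\u0B95"  # Tamil letter KA

# The two letters now have different numbers, so changing the font alone
# no longer shows the stored text in another language.
print(ord(hindi_ka) == ord(tamil_ka))  # False

# The position inside each block is the same, a trace of the ISCII layout.
print(ord(hindi_ka) - DEVANAGARI.start, ord(tamil_ka) - TAMIL.start)  # 21 21
```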
For Indian languages having complex scripts,
there may not be much difference in using the Unicode instead
of ISCII. But Tamil does NOT have a complex script.
In ISCII, Tamil, like the other languages, is given 128
slots. Tamil, having a simple script, should have been treated
differently in Unicode. All its 313 characters and the
special symbols could have been accommodated easily.
But unfortunately, in Unicode, Tamil
is given 128 slots only. This has resulted in many disadvantages.
6. Disadvantages of current Unicode
6.1. Not a real character encoding
As already mentioned, this coding taken from
ISCII is not a character encoding. Unicode is supposed to encode
characters, but this is not the case here. Pure consonants, which
are fundamental in nature, do not have single slots; they
have been treated in an unnatural way. This will
be a constant irritant to programmers doing natural
language processing on a large scale in the future.
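The point about pure consonants can be seen by listing the code points that make up a letter. This is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

pure_k = "\u0B95\u0BCD"  # pure consonant k: letter KA followed by the pulli
ku = "\u0B95\u0BC1"      # the letter ku: letter KA followed by vowel sign U

for label, letter in (("k", pure_k), ("ku", ku)):
    names = [unicodedata.name(ch) for ch in letter]
    print(label, len(letter), names)
```

Both letters occupy two code points each. The pure consonant is stored as KA plus a sign (named TAMIL SIGN VIRAMA in Unicode), not in a slot of its own.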
6.2. Multiple ways in representation
For some letters like ko, there seem to be two
different ways of coding. One is ka plus the vowel modifier
for o. The other is ka plus the vowel modifier for ae plus
the vowel modifier for aa. This ambiguity in representation
means that search and similar operations
can give incorrect results.
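The two codings of ko can be compared in a short sketch. It uses Python's standard `unicodedata` module; the helper `same_text` is a hypothetical name introduced here. Unicode defines normalization forms (NFC is used below) that map both sequences to one form, so a comparison is reliable only if the software applies such a step first.

```python
import unicodedata

ko_single = "\u0B95\u0BCA"       # ka + vowel sign O
ko_split = "\u0B95\u0BC6\u0BBE"  # ka + vowel sign E + vowel sign AA

# A plain comparison treats the two spellings of the same letter as different.
print(ko_single == ko_split)  # False

def same_text(a, b):
    """Compare two strings after bringing both to normalization form NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(ko_single, ko_split))  # True
```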
6.3. Not a complex script
Above all, treating Tamil as having a complex
script has enormous negative consequences, which will hinder
the growth of the use of the language in the future.
7. Proposed All Character Encoding for Tamil - All Tamil Block
Representing all the Tamil letters, each with
a separate slot is the natural way to treat Tamil. The table
given in the primary school books should form the basis for
such a scheme. It should include the special symbols also.
Such a scheme is shown in Annexure A.
8. Advantages of All Character Encoding (All Tamil Block)
8.1. Real Character Encoding
It is the real character encoding, representing
the true nature of Tamil. In the All Tamil Block the space
required for any Tamil text will be just about two thirds
of what is required in the current Unicode scheme. Let us
see an example. Consider the word Tamil. In the current Unicode
it will be represented as five symbols: ta, ma, ikara modifier,
zha and pulli. In the All Tamil scheme, it will be represented with
only three letters: ta, mi and izh.
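The count for the word Tamil can be verified with a short sketch. It assumes, for illustration only, that each base letter (Unicode category "Lo") begins one written Tamil letter, with vowel signs and the pulli attaching to it; the variable names are illustrative.

```python
import unicodedata

# The word Tamil: ta, ma, vowel sign i, zha (LLLA in Unicode), pulli.
word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"

code_points = len(word)
# Assumption of this sketch: one written letter per base letter ("Lo").
letters = sum(1 for ch in word if unicodedata.category(ch) == "Lo")

print(code_points, letters)          # 5 3
print(code_points * 2, letters * 2)  # bytes at 16 bits per unit: 10 6
```

For this one word the proposed scheme would need three fifths of the space; the overall figure for running text depends on the mix of letters.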
Traditionally, most of the popular Tamil software of the past
20 years has encoded almost all the Tamil characters.
The same is done in TAM, the TN Govt. standard encoding
for the printing and publishing industry. The All Tamil
Block not only follows the traditional way of encoding
Tamil characters but also removes the earlier restrictions of the
8 Bit encoding.
8.2. Efficient Design
The 16 bits are allotted in a systematic
way. Of the sixteen bits, the first 7 bits indicate the language.
The next 5 bits give the serial number of the consonant part
of a Tamil letter. The next 4 bits give the serial number
of the vowel part of a Tamil letter. A zero in either field means the
absence of the consonant or vowel part, that is, the letter is a pure
vowel or a pure consonant. Hence, it is extremely easy to see
what a letter contains. This simplicity comes from the natural
way in which the coding is designed.
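The bit layout described above can be sketched with shifts and masks. Only the field widths (7, 5 and 4 bits) come from the text; the function names, the language identifier and the serial numbers used below are hypothetical and are not taken from Annexure A.

```python
LANG_BITS, CONS_BITS, VOWEL_BITS = 7, 5, 4  # field widths given in the text

def pack(lang, consonant, vowel):
    """Combine the three fields into one 16 bit code (0 means 'part absent')."""
    if not (0 <= lang < 2 ** LANG_BITS
            and 0 <= consonant < 2 ** CONS_BITS
            and 0 <= vowel < 2 ** VOWEL_BITS):
        raise ValueError("field out of range")
    return (lang << (CONS_BITS + VOWEL_BITS)) | (consonant << VOWEL_BITS) | vowel

def unpack(code):
    """Split a 16 bit code back into (language, consonant, vowel)."""
    return (code >> 9) & 0x7F, (code >> 4) & 0x1F, code & 0x0F

TAMIL = 1  # hypothetical language identifier

code = pack(TAMIL, consonant=1, vowel=3)  # hypothetical serial numbers
print(f"{code:016b}")                     # 0000001000010011
print(unpack(code))                       # (1, 1, 3)

# A pure consonant has vowel field 0; a pure vowel has consonant field 0.
print(unpack(pack(TAMIL, 1, 0)))          # (1, 1, 0)
```

With this layout, the consonant and vowel parts of any letter can be read with one shift and one mask each, without a lookup table.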
8.3. Savings in Cost of Computer Storage Space
Its simplicity leads to enormous savings. The
space requirement for a Tamil text in the current Unicode
is about 50% more than what is required in the all character
encoding.
8.4. Saving in Cost of Internet Communication Bandwidth
The time required to communicate Tamil
text is also increased by about 50% in the current Unicode.
It is common sense that any language processing will
take more time when the text is longer.
8.5. Saving in Computer Display Time
Under the current Unicode, Tamil data will be displayed on a
computer monitor much more slowly than under the proposed scheme.
8.6. Other Savings
The rendering time, searching time and many language
processing times are also significantly greater in the current
Unicode. When the unproductive waiting time of the users is
included, the cost will be far higher. There is also the
storage cost, which is about one and a half times what is really
required. This is an avoidable, never ending, recurring expenditure.
The current Unicode will result in an enormous drain on the resources
of the Tamil community. Many crores of rupees will be wasted
each and every month for many years to come.
9. Whether efficiency of the coding is something to be considered or not?
The whole progress of the western scientific
world was possible only when they came to know of the decimal
number system. When they were using the Roman numerals, they
took an enormous amount of time even for simple calculations,
and hence could not progress much. Try adding two Roman numerals
and you will find the power of notation. History shows that
a coding has enormous influence on the people who use it.
One should not forget what history has taught us. Many essays
and books on history testify to this. A sample
from the net is given below.
The following is from the website: http://essayfabric.com/free_essay/essay53.htm
"Prior to the use of "Arab" numerals, as we
know them today, the West relied upon the somewhat clumsy
system of Roman numerals. Whereas in the decimal system, the
number 1948 can be written in four figures, eleven figures
were needed using the Roman system: MDCCCXLVIII. It is obvious
that even for the solution of the simplest arithmetical problem,
Roman numerals called for an enormous expenditure of time
and labor. The Arab numerals, on the other hand, rendered
even complicated mathematical tasks relatively simple.
The scientific advances of the West would have
been impossible had scientists continued to depend upon the
Roman numerals and been deprived of the simplicity and flexibility
of the decimal system and its main glory, the zero. Though
the Arab numerals were originally a Hindu invention, it was the
Arabs who turned them into a workable system; the earliest
Arab zero on record dates from the year 873, whereas the earliest
Hindu zero is dated 876. For the subsequent four hundred years,
Europe laughed at a method that depended upon the use of zero,
"a meaningless nothing." "
It may be noted that if Tamil had been treated
as having a simple script, it would have been implemented
in Unicode more than 10 years ago. Just because the ISCII
based code was imposed on Tamil, it has taken this long
to provide even basic support for Tamil. This can be seen
as the first negative effect of the encoding, and as proof of
what history teaches us.
Just because a mistake has been there in Unicode
for a few years does not mean that we have to live with it
forever. If the Unicode Consortium, citing stability and its
policy of not considering efficiency in any manner, does not
agree to change the existing scheme, the best thing to do
will be the following:
1. Govt. of Tamilnadu to convince
the Govt. of India to recommend the All Tamil Block scheme
as the proposed scheme for Tamil and forward the proposal
to the Unicode Consortium.
2. Govt. of Tamilnadu to represent the case for the
All Tamil Block scheme in the next UC meeting by direct presence
and convince the UC of the need for the revised scheme.
Dr. Krishnamurthy, Mr. Elangovan and Mr.
P. Chellappan
Kanithamizh Sangam