i18n for Linux libc

From: Swedish GNU/LI List (sv_at_li.org)
Date: 1995-06-13 19:40:07

     ------
     List:     Swedish GNU/LI List
     Sender:   Ulrich Drepper <drepper@ipd.info.uni-karlsruhe.de>
     Subject:  i18n for Linux libc
     Date:     Tue, 13 Jun 1995 19:40:07 +0200
     ------

Hi,

I promised this some weeks ago, but there were still some problems to
solve.  This is the announcement of the availability of the first part
of i18n support for Linux libc.  I have appended the readme below.  To
use i18n you need
  libc-5.1.1 or above
and
  i44ftp.info.uni-karlsruhe.de:pub/linux/ctype/WG15-collection.linux.tar.gz

For the latter:
Somebody at the major sites should pick this up.  I would prefer that
you put it in the same directories as the libc or in a completely new
hierarchy named i18n.  But please let me know where you distribute it.

-- Uli
________---------------------------------------------------------------
\      / Ulrich Drepper / Univ. at Karlsruhe, Germany / CS Dept. / IPD
L\inux/  email: drepper@gnu.ai.mit.edu          smail: Rubensstr. 5
  \  /          drepper@ipd.info.uni-karlsruhe.de      76149 Karlsruhe
   \/1.2.10 ------------------------------------------ Germany --------

----------------------------------------------------------------------------
	       Internationalization for Linux C Library
               ----------------------------------------

The Linux C Library Version 5 has completely new code for the support
of internationalization.  (* This is not quite right at the moment
but will be soon, once I manage to release the new message handling
code.)  This code was written by Roland McGrath and Ulrich Drepper for
the GNU and Linux C Libraries, following the POSIX standards where
applicable.

The code is designed to be portable to various architectures, which are
allowed to share the files defining a locale.  Special attention was
also given to performance: where mmap is available, all files are
mmap'ed.  Together these two requirements call for an elaborate file
format.

To construct these locale files the POSIX.2 standard defines a tool
named localedef.  Its input consists of locale definition files in the
format defined by POSIX.2, and it writes out the files the C library
can work with.  You cannot use the localedef program of another system
because the locale files it produces have that system's own special
format.

There is a collection of these POSIX locale definition files
available.  They were created by the POSIX working group 15 on i18n
(ISO/IEC JTC1/SC22/WG15).  Please read README.locales for more
information about this, and do so before you change anything.

The complete set is not distributed with the C library for several
reasons:

- Not all users are interested in it
- It will not change that often
- It is quite big

So for now it is available on

	i44ftp.info.uni-karlsruhe.de:pub/linux/ctype

and hopefully soon on tsx-11 and sunsite (perhaps in the C library
directories;  server maintainers, please let me know where you make it
available).


			      How to use
                              ----------

As said above you need Linux libc-5, more specifically libc-5.1.1 or
above.  5.1.1 is the first version which ships the localedef program.

If you upgraded your library sources with patches, the libc/locale/
directory will probably still contain other directories (collate,
ctype, monetary, numeric, response, and time).  These can safely be
removed!  They contain the code for the old programs, which are not
usable with libc-5.  If you still run libc-4 and don't have these
programs in your bin directories, consider building them first though.

If you have the sources of this library and a binary of it, go into
the directory libc/locale/ and run

	make SHARED= programs

SHARED= is necessary to prevent the programs from being compiled with
-fPIC etc.

(Please don't pay attention to the warning.  This is *work-in-progress*.)

The compilation will hopefully end with two programs built:
localedef and locale.

There is not yet any documentation other than the POSIX.2 description
and this text.  (But I'm working on this.)

After installing the programs in /usr/bin

[1]  cd /usr/src/libc/locale
[2]  cp localedef locale /usr/bin

you should unpack the WG15 collection.

[3]  tar zxvf WG15-collection.linux.tar.gz

In the created directory you find one directory named `charmaps'.
This contains a lot of character map definition files (also described
in POSIX.2).  Some files describe rather exotic character maps (at
least for Linux, which does not run on EBCDIC machines).  I suggest
installing at least the files beginning with `ISO_'.  The place to
install them is determined by the value the preprocessor variable
CHARMAP_PATH had when localedef was compiled.  Normally this is
/usr/share/nls/charmap.

[4]  cd WG-collection/charmaps
[5]  mkdirhier /usr/share/nls/charmap
[6]  cp ISO_* /usr/share/nls/charmap

One strange point in the WG15-collection is that there is no ISO_10646
charmap in charmaps/.  But you can find one in locales/.  So you
should copy it, too.

[7]  cd WG-collection/locales
[8]  cp ISO_10646 /usr/share/nls/charmap/ISO_10646-1:1993

Now you should also make the locale definition files available in a
common place.  I would suggest /usr/share/nls/locale:

[9]  cd WG-collection/locales
[10] mkdirhier /usr/share/nls/locale
[11] cp POSIX ??_* /usr/share/nls/locale

The rest of the WG15-collection is perhaps not interesting at this
time.


Create locale files
-------------------

So far only preparations have been made.  To create the needed binary
locale files you first have to determine the environment you want.

For me the situation is: I want the definitions for Germany and the
German language.  Further, I use Linux with ISO_8859-1 (although I
could also use 8859-2 and 8859-5).

The first two points specify the locale definition file I have to use.
If you look through the collection and also note that the ISO
abbreviation for Germany is DE you will easily find the candidate:

	de_DE.

The third point determines the character map file to use.
Obviously this has to be ISO_8859-1:1987.

To get the locale file I run localedef with these commands

[12] cd ~
[13] mkdir new-dir
[14] cd new-dir
[15] localedef -i /usr/share/nls/locale/de_DE -f ISO_8859-1:1987 ./de

If you run this with the given locale definition file you will get the
following output:

localedef: /usr/share/nls/locale/de_DE:23: invalid locale `en_DK' in copy statement
localedef: /usr/share/nls/locale/de_DE:27: invalid locale `en_DK' in copy statement
localedef: category `LC_COLLATE' not defined
localedef: category `LC_CTYPE' not defined
localedef: item `era' of category `LC_TIME' undefined
localedef: item `era_year' of category `LC_TIME' undefined
localedef: item `era_d_fmt' of category `LC_TIME' undefined
localedef: item `alt_digits' of category `LC_TIME' undefined
localedef: item `yesstr' of category `LC_MESSAGES' undefined
localedef: item `nostr' of category `LC_MESSAGES' undefined
localedef: no output file produced because warning were issued

Especially interesting are the first two lines.  They tell you that
the locale en_DK is missing.  Why is this?  I don't want to have
English language support for Denmark.

The answer is the "OO concept" of the POSIX locale definition files.
The en_DK locale definition is (one of) the main locale definition
files.  Many locales share a lot of information.  Instead of copying
it they can inherit it with the copy statement.  But the design is not
optimal: the locale from which we want to inherit something must
already have been created and installed in the standard place
(normally /usr/share/locale).  I.e. before making the de locale we
first have to generate the en_DK locale.
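
To see in advance which locales a definition file inherits from, one
can scan it for copy statements.  The following is only a sketch:
list_copy_deps is a made-up helper name, and it assumes that each copy
statement stands on a line of its own in the form `copy "name"' (quotes
optional), as in the POSIX.2 locale definition syntax.

```sh
# Hypothetical helper, not part of libc: print every distinct locale
# name that appears in a "copy" statement of the definition file $1.
# Assumption: one copy statement per line, name optionally quoted.
list_copy_deps() {
    sed -n 's/^[[:space:]]*copy[[:space:]]\{1,\}"\{0,1\}\([^"[:space:]]*\)"\{0,1\}.*$/\1/p' "$1" |
        sort -u
}

# Possible use (path as suggested in this text):
#   list_copy_deps /usr/share/nls/locale/de_DE
```

Judging from the error messages above, running this on de_DE would
list at least en_DK, which then has to be built first.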

[16] su root
[17] localedef -i /usr/share/nls/locale/en_DK -f ISO_8859-1:1987 en_DK
localedef: item `era' of category `LC_TIME' undefined
localedef: item `era_year' of category `LC_TIME' undefined
localedef: item `era_d_fmt' of category `LC_TIME' undefined
localedef: item `alt_digits' of category `LC_TIME' undefined
localedef: item `yesstr' of category `LC_MESSAGES' undefined
localedef: item `nostr' of category `LC_MESSAGES' undefined
localedef: no output file produced because warning were issued

There are some warnings which, according to the POSIX standard, prevent
the generation of the locale files.  But I can tell you that these are
harmless, so we can try again with the -c option (do --help for
info):

[18] localedef -c -i /usr/share/nls/locale/en_DK -f ISO_8859-1:1987 en_DK
localedef: item `NL_dummy' of category `LC_COLLATE' undefined
localedef: item `era' of category `LC_TIME' undefined
localedef: item `era_year' of category `LC_TIME' undefined
localedef: item `era_d_fmt' of category `LC_TIME' undefined
localedef: item `alt_digits' of category `LC_TIME' undefined
localedef: item `yesstr' of category `LC_MESSAGES' undefined
localedef: item `nostr' of category `LC_MESSAGES' undefined
LC_COLLATE
LC_CTYPE
LC_MONETARY
LC_NUMERIC
LC_TIME
LC_MESSAGES

(For an explanation see the de locale below.)

Now run the command for the German locale again and you'll get

[19] localedef -c -i /usr/share/nls/locale/de_DE -f ISO_8859-1:1987 ./de
localedef: item `era' of category `LC_TIME' undefined
localedef: item `era_year' of category `LC_TIME' undefined
localedef: item `era_d_fmt' of category `LC_TIME' undefined
localedef: item `alt_digits' of category `LC_TIME' undefined
localedef: item `yesstr' of category `LC_MESSAGES' undefined
localedef: item `nostr' of category `LC_MESSAGES' undefined
LC_COLLATE
LC_CTYPE
LC_MONETARY
LC_NUMERIC
LC_TIME
LC_MESSAGES

The six LC_* lines signal that for these locale categories output is
produced.  If you look through new-dir you will notice a directory de
which contains the following:

total 12
-rw-r--r--   1 drepper  users          13 Jun 12 04:18 LC_COLLATE
-rw-r--r--   1 drepper  users        6940 Jun 12 04:18 LC_CTYPE
-rw-r--r--   1 drepper  users          42 Jun 12 04:18 LC_MESSAGES
-rw-r--r--   1 drepper  users          94 Jun 12 04:18 LC_MONETARY
-rw-r--r--   1 drepper  users          24 Jun 12 04:18 LC_NUMERIC
-rw-r--r--   1 drepper  users         951 Jun 12 04:18 LC_TIME

These are the desired files!  The last parameter of localedef, ./de,
told it to place them in a directory de/ in the current dir.  In fact,
whenever this name contains at least one slash ('/') the files are
placed in the directory it specifies.  Because this work is often done
by root to install global locale files, a special case is implemented:
if the name does not contain any slash, the files are placed in the
system's locale directory (i.e. the one the C library searches for
locale files).  A non-root user who omits the slash should not panic
at an error message like:

localedef: cannot write output file `/usr/share/locale/de': Permission denied
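
The rule for the last parameter can be written down as a small shell
function.  This is only an illustration of the behaviour described
above, not code taken from localedef; the function name is invented,
and /usr/share/locale is assumed as the system locale directory
because that is the path shown in the error message.

```sh
# Hypothetical illustration of how localedef picks the output directory.
# $1: the name given as last parameter
# $2: system locale directory (assumed default: /usr/share/locale)
localedef_output_dir() {
    name=$1
    sysdir=${2:-/usr/share/locale}
    case $name in
        */*) printf '%s\n' "$name" ;;          # contains a slash: use as given
        *)   printf '%s\n' "$sysdir/$name" ;;  # no slash: system directory
    esac
}

# localedef_output_dir ./de     prints ./de
# localedef_output_dir en_DK    prints /usr/share/locale/en_DK
```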


One more point, which is not important at the moment, concerns the
LC_MESSAGES file.  The name might suggest that this file contains
messages for some programs.  But this is not right.  Only some very
general (and rarely used) definitions are found here.  The real message
files will be produced in another way.  I'm nearly finished with this
stuff, so it will be incorporated in the Linux C Library soon, but
not now.

The only important thing is that LC_MESSAGES should not be a plain file
but a directory.  localedef does not create it automatically but it
can handle this situation.  Take the case where I want to create an
account on my machine for a good colleague who is a French-speaking
Canadian.  The locale name I choose is fr_CA.  So I do the following
steps:

[20] mkdirhier fr_CA/LC_MESSAGES
[21] localedef -c -i /usr/share/nls/locale/fr_CA -f ISO_8859-1:1987 ./fr_CA

I get the following result:

total 12
-rw-r--r--   1 drepper  users          13 Jun 12 04:32 LC_COLLATE
-rw-r--r--   1 drepper  users        6940 Jun 12 04:32 LC_CTYPE
drwxr-xr-x   2 drepper  users        1024 Jun 12 04:32 LC_MESSAGES/
-rw-r--r--   1 drepper  users          93 Jun 12 04:32 LC_MONETARY
-rw-r--r--   1 drepper  users          25 Jun 12 04:32 LC_NUMERIC
-rw-r--r--   1 drepper  users         945 Jun 12 04:32 LC_TIME

fr_CA/LC_MESSAGES:
total 1
-rw-r--r--   1 drepper  users          42 Jun 12 04:32 SYS_LC_MESSAGES

localedef recognized that LC_MESSAGES/ is a directory and created the
file in this directory with `SYS_' prepended to its name.  (`SYS_' is a
prefix reserved by POSIX for system usage.)  This is of course also
understood by the libc, and it is possible for the other locale
categories, too.
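
The lookup this implies can be sketched as follows.  The function name
category_file is invented for this example; it only mirrors the naming
rule described above and is not how the libc is actually written.

```sh
# Hypothetical sketch: print the path of the file holding the data for
# category $2 (e.g. LC_MESSAGES) inside the locale directory $1.
category_file() {
    locale_dir=$1
    category=$2
    if [ -d "$locale_dir/$category" ]; then
        # The category is a directory: the data lives in SYS_<category>.
        printf '%s\n' "$locale_dir/$category/SYS_$category"
    else
        # Otherwise the category is (or will be) a plain file.
        printf '%s\n' "$locale_dir/$category"
    fi
}

# With the fr_CA directory from above:
#   category_file ./fr_CA LC_MESSAGES   prints ./fr_CA/LC_MESSAGES/SYS_LC_MESSAGES
#   category_file ./fr_CA LC_TIME       prints ./fr_CA/LC_TIME
```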



Naming problems
---------------

One problem that will naturally arise is naming.  In the above
examples I had the names `de' and `fr_CA'.  These are the names which
are mostly used today.  The internationalized GNU packages which will
soon be released also follow this.

The problem was of course recognized very early, and the complete name
can look like this [X/Open Portability Guide, Vol. 3]:

	language[_territory[.codeset]]

For the above examples the full names should be:

	de_DE.ISO_8859-1:1987
and
	fr_CA.ISO_8859-1:1987

This will also be necessary once we have full support for ISO_10646
(i.e. the 32-bit character set).  At that point at least two different
locales will be available for each language.

But on the other hand the "strange" behaviour of the localedef program
needs the simple names.
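
To illustrate the language[_territory[.codeset]] syntax, here is a
small sketch that splits such a name into its parts using only shell
parameter expansion.  split_locale_name is an invented helper, not
something shipped with the libc.  The codeset is cut off first because
a codeset name such as ISO_8859-1:1987 itself contains an underscore.

```sh
# Hypothetical helper: split language[_territory[.codeset]] into parts.
split_locale_name() {
    full=$1
    territory=
    codeset=
    case $full in
        *.*) codeset=${full#*.}      # everything after the first dot
             full=${full%%.*} ;;     # everything before the first dot
    esac
    case $full in
        *_*) territory=${full#*_}    # everything after the first underscore
             full=${full%%_*} ;;     # the language part
    esac
    printf 'language=%s territory=%s codeset=%s\n' "$full" "$territory" "$codeset"
}

# split_locale_name de_DE.ISO_8859-1:1987
#   prints: language=de territory=DE codeset=ISO_8859-1:1987
```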


A last problem to mention here is that the /usr/share hierarchy is
intended to be used on various platforms.  Nobody can say which
character set is available on which machine.  So it is not generally
possible to have e.g. a de_DE without specifying a character set.  We
discussed this topic while writing the libc code but haven't found a
solution yet.


locale program
--------------

The second program which is created is locale.  I haven't mentioned it
yet.  It is intended to get information about the current locale.  You
can request the state of selection and also single values from a
category.

[[More information will be written soon.  I hope at least...]]


Bugs, limitations & prospects
-----------------------------

I know the programs still have several problems.  If you find some
please document them (description and perhaps a way to reproduce) and
send this to me,

	drepper@gnu.ai.mit.edu

not to HJ (at least you should also include me).  Patches are welcome
to fix bugs but I don't think extensions are useful now because the
code is not complete yet.  If you look through the code you will see a
lot of #ifdefs.  This is mostly because I have started writing the
missing things.

This leads me to explain what is missing:
- wide character support (e.g. ISO_10646)
- LC_COLLATE handling (and strcoll() and strxfrm() functions of the libc)

I will work on this whenever I find time but there are other important
things to do (I have to work on my diploma thesis).


Projects
--------

I have several points which I want to see realized and which I surely
cannot write alone.  So if you are interested in internationalization
and have some time, read on and tell me if you are interested.  I would
especially love to see some people from Asia.


Once the libc has wide-character support (which is not far away
anymore because I already wrote most of the code) I would like to have
terminals capable of handling this.  This would mean having an xterm
and a text console.  For xterm I heard that X11R7 will have much more
complete i18n support, but I also want to have a text console (this is
because my box is far too small to run X; does anybody have a spare
i586? :-).

In previous discussions about this it became clear that this must *not*
be done in the kernel.  There might be some features added to the
kernel but remember that putting bitmaps for N*1000 characters in
kernel space is not acceptable.

Japanese and Chinese users reported that there are already text
terminal emulations which can handle this situation.  We should
examine this.  This whole project should also be tightly coupled with
the kbd package (I think Andries Brouwer is the current developer of
this!?).

Anybody interested in this could perhaps contact me.  If there are
some more I will try to get a mailing list organized (Patrick, would
you be willing to establish another li.org mailing list?).
