+@c PSPP - a program for statistical analysis.
+@c Copyright (C) 2019 Free Software Foundation, Inc.
+@c Permission is granted to copy, distribute and/or modify this document
+@c under the terms of the GNU Free Documentation License, Version 1.3
+@c or any later version published by the Free Software Foundation;
+@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+@c A copy of the license is included in the section entitled "GNU
+@c Free Documentation License".
+@c
+
@node Internationalisation
@chapter Internationalisation
Internationalisation in pspp is complicated.
The most annoying aspect is that of character-encoding.
-This chapter attempts to describe the problems and current ways
+This chapter attempts to describe the problems and current ways
in which they are addressed.
@item The locale of the data. Only the character encoding is relevant.
@end itemize
-Each of these locales may, at different times take
+Each of these locales may, at different times, take
separate (or identical) values.
-So for example, a French statistician can use pspp to prepare a report
-in the English language, using
-a datafile which has been created by a Japanese researcher hence
+So for example, a French statistician can use pspp to prepare a report
+in the English language, using
+a datafile which has been created by a Japanese researcher and hence
uses a Japanese character set.
It's rarely, if ever, necessary to interrogate the system to find out
the values of the 3 locales.
However, it's important to be aware of the source (destination) locale
when reading (writing) string data.
-When transfering data between a source and a destination, the appropriate
+When transferring data between a source and a destination, the appropriate
recoding must be performed.
from the system locale if not set.
@subsection The output locale
-This locale is the one that should be visible to the person reading a
+This locale is the one that should be visible to the person reading a
report generated by pspp. Non-data related strings (e.g. ``Page number'',
``Standard Deviation'' etc.) will appear in this locale.
@subsection The data locale
This locale is the one associated with the data being analysed with pspp.
The only important aspect of this locale is the character encoding.
-@footnote {It might also be desirable for the LC_COLLATE category to be used for the purposes of sorting data.}
+@footnote{It might also be desirable for the LC_COLLATE category to be used for the purposes of sorting data.}
The dictionary pertaining to the data contains a field denoting the encoding.
Any string data stored in a @union{value} will be encoded in the
dictionary's character set.
-
@section System files
@file{*.sav} files contain a field which is supposed to identify the encoding
-of the data they contain (@pxref{Machine Integer Info Record}).
+of the data they contain (@pxref{Machine Integer Info Record}).
However, many
files produced by early versions of spss set this to ``2'' (ASCII) regardless
of the encoding of the data.
Later versions contain an additional
record (@pxref{Character Encoding Record}) describing the encoding.
-When a system file is read, the dictionary's encoding is set using information
+When a system file is read, the dictionary's encoding is set using information
gleaned from the system file.
-If the encoding cannot be determined or would be unreliable, then it
+If the encoding cannot be determined or would be unreliable, then it
remains unset.
@section GUI
-The psppire graphic user interface is written using the Gtk+ api, for which
+The psppire graphical user interface is written using the GTK+ API, for which
all strings must be encoded in UTF8.
All strings passed to the GTK+/GLib library functions (except for filenames)
must be UTF-8 encoded otherwise errors will occur.
-Thus, for the purposes of the programming psppire, the user interface locale
+Thus, for the purposes of programming psppire, the user interface locale
should be assumed to be UTF8, even if setlocale and/or nl_langinfo
indicates otherwise.
@subsection Filenames
The GLib API has some special functions for dealing with filenames.
-Strings returned from functions like gtk_file_chooser_dialog_get_name are not,
+Strings returned from functions like gtk_file_chooser_dialog_get_name are not,
in general, encoded in UTF8, but in ``filename'' encoding.
-If that filename is passed to another GLib function which expects a filename,
+If that filename is passed to another GLib function which expects a filename,
no conversion is necessary.
-If it's passed to a function for the purposes of displaying it (eg. in a
-window's title-bar) it must be converted to UTF8 --- there is a special
+If it's passed to a function for the purposes of displaying it (e.g. in a
+window's title-bar) it must be converted to UTF8. There are special
functions for this: g_filename_display_name or g_filename_display_basename.
If, however, a filename needs to be passed outside of GTK+/GLib (for example to fopen), it must be converted to the local system encoding.
Converts the string @var{text}, which is encoded in @var{from} to a new string encoded in @var{to} encoding.
If @var{len} is not -1, then it must be the number of bytes in @var{text}.
-It is the caller's responsibility to free the returned string when no
+It is the caller's responsibility to free the returned string when no
longer required.
@end deftypefun
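As a rough illustration of what a recoding function of this shape does, the following standalone sketch uses the POSIX iconv interface.
The name recode_sketch is hypothetical and this is not PSPP's implementation; it only mirrors the argument order described above.
It assumes the target is a byte-oriented encoding such as UTF-8, and that four output bytes per input byte is enough room.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not PSPP's recode_string): converts TEXT from
   encoding FROM to encoding TO.  If LEN is -1, TEXT is taken to be
   NUL-terminated; otherwise LEN is its length in bytes.
   Returns a malloc'd, NUL-terminated string, or NULL on failure.
   The caller must free the result. */
char *
recode_sketch (const char *to, const char *from, const char *text, int len)
{
  size_t in_left, out_size, out_left;
  char *result, *in, *out;
  iconv_t cd;

  assert (text != NULL);

  in_left = len < 0 ? strlen (text) : (size_t) len;
  /* Assumed bound: at most 4 output bytes per input byte, plus a NUL. */
  out_size = in_left * 4 + 1;
  out_left = out_size - 1;

  result = malloc (out_size);
  if (result == NULL)
    return NULL;

  cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    {
      free (result);
      return NULL;
    }

  in = (char *) text;   /* iconv takes char **, but does not modify input. */
  out = result;

  /* Convert the text, then flush any pending shift sequence. */
  if (iconv (cd, &in, &in_left, &out, &out_left) == (size_t) -1
      || iconv (cd, NULL, NULL, &out, &out_left) == (size_t) -1)
    {
      iconv_close (cd);
      free (result);
      return NULL;
    }

  iconv_close (cd);
  *out = '\0';
  return result;
}
```

For instance, calling recode_sketch ("UTF-8", "ISO-8859-1", "caf\xe9", -1) should return the two-byte UTF-8 form of the final character; the caller frees the result, just as with the function documented above.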
+In order to minimise the number of conversions required, and to simplify
+design, PSPP attempts to store all internal strings in UTF8 encoding.
+Thus, when reading system and portable files (or any other data source),
+the following items are immediately converted to UTF8 encoding:
+@itemize
+@item Variable names
+@item Variable labels
+@item Value labels
+@end itemize
+Conversely, when writing system files, these are converted back to the
+encoding of that system file.
-For example, in order to display a string variable's value in a label widget in the psppire gui one would use code similar to
-@example
-
-struct variable *var = /* assigned from somewhere */
-struct case c = /* from somewhere else */
-
-const union value *val = case_data (&c, var);
-
-char *utf8string = recode_string (UTF8, dict_get_encoding (dict), val->s,
- var_get_width (var));
-
-GtkWidget *entry = gtk_entry_new();
-gtk_entry_set_text (entry, utf8string);
-gtk_widget_show (entry);
-
-free (utf8string);
-
-@end example
+String data stored in union values is left in its original encoding.
+It will be converted by the data_in/data_out functions.
This makes it difficult (impossible?) to elegantly handle the issues.
For example, it would make sense for the gui's datasheet to display
numbers formatted according to the LC_NUMERIC category of the data locale.
-Instead however there is the @func{data_out} function
+Instead, however, there is the @func{data_out} function
(@pxref{Obtaining Properties of Format Types}) which uses the
-@func{settings_get_decimal_char} function instead of the decimal separator
-of the locale. Similarly, formatting of monetary values is displayed
+@func{settings_get_decimal_char} function instead of the decimal separator
+of the locale. Similarly, formatting of monetary values is displayed
in a pspp/spss specific fashion instead of using the LC_MONETARY category.
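The idea of formatting a number with an application-chosen decimal character, rather than the one from the LC_NUMERIC category, can be sketched in standalone C as follows.
The function format_with_decimal_char is hypothetical and is not part of PSPP; in PSPP the character would come from the settings rather than from a parameter.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes VALUE with DECIMALS decimal places into
   BUF (of SIZE bytes), using DECIMAL_CHAR as the separator regardless
   of the current LC_NUMERIC setting.
   Returns 0 on success, -1 if BUF is too small or formatting fails. */
int
format_with_decimal_char (char *buf, size_t size, double value,
                          int decimals, char decimal_char)
{
  const char *dp;
  char *p;
  int n;

  assert (buf != NULL);

  n = snprintf (buf, size, "%.*f", decimals, value);
  if (n < 0 || (size_t) n >= size)
    return -1;

  /* snprintf used the current locale's decimal point, which may be
     longer than one byte.  Replace it with the requested character. */
  dp = localeconv ()->decimal_point;
  p = strstr (buf, dp);
  if (p != NULL)
    {
      size_t dp_len = strlen (dp);
      *p = decimal_char;
      memmove (p + 1, p + dp_len, strlen (p + dp_len) + 1);
    }
  return 0;
}
```

The separator is passed in explicitly so that the result does not depend on whatever locale the process happens to be running in, which is the behaviour the paragraph above describes.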