spv-light-decoder: Add back character set encoding support.
Originally, SPV light member decoding would obtain the member's declared
character set encoding and recode all strings from that encoding into
UTF-8.
The SPV file submitted along with bug #59837, however, showed that SPV
files actually contain UTF-8 strings despite their declared character
encoding. This SPV file declared windows-1253 encoding. It contained 221
unique strings, 52 of which contained non-ASCII characters, and all of
which were valid UTF-8. Therefore, commit db9a44802bb9
("spv-light-decoder: Text strings are all UTF-8 encoded.") changed the
SPV light member decoder to treat all strings as UTF-8 encoded.
Unfortunately, further examination of the SPV corpus showed that files are
not consistent about this. Some files do contain only UTF-8 despite
declaring another character set. For example:
00764b1c.spv: windows-1251: 481 unique strings, 123 non-ASCII, 123 UTF-8.
009bf8ba.spv: windows-1252: 76 unique strings, 9 non-ASCII, 9 UTF-8.
00b033b6.spv: windows-1252: 389 unique strings, 16 non-ASCII, 16 UTF-8.
014fc3df.spv: ISO_8859-1:1987: 81 unique strings, 4 non-ASCII, 4 UTF-8.
01a20a32.spv: windows-1254: 71 unique strings, 10 non-ASCII, 10 UTF-8.
01d41135.spv: windows-1251: 142 unique strings, 51 non-ASCII, 51 UTF-8.
026777ed.spv:
02cc8c22.spv: windows-1254: 79 unique strings, 1 non-ASCII, 1 UTF-8.
...
Others are entirely non-UTF-8:
00029fbf.spv: windows-1251: 115 unique strings, 88 non-ASCII, 0 UTF-8.
00f101f5.spv: windows-1252: 94 unique strings, 1 non-ASCII, 0 UTF-8.
00ff0628.spv: windows-1252: 112 unique strings, 14 non-ASCII, 0 UTF-8.
01d7942a.spv: windows-1251: 64 unique strings, 43 non-ASCII, 0 UTF-8.
0203e88a.spv: windows-1254: 82 unique strings, 4 non-ASCII, 0 UTF-8.
0247cf5a.spv: windows-1256: 236 unique strings, 5 non-ASCII, 0 UTF-8.
027b66dd.spv: windows-1252: 224 unique strings, 2 non-ASCII, 0 UTF-8.
03235aa7.spv: windows-1254: 198 unique strings, 18 non-ASCII, 0 UTF-8.
07a43c5c.spv: ISO-8859-15: 124 unique strings, 13 non-ASCII, 0 UTF-8.
07a85498.spv: windows-1254: 86 unique strings, 1 non-ASCII, 0 UTF-8.
07a91f3e.spv: windows-1252: 111 unique strings, 13 non-ASCII, 0 UTF-8.
0ad295d8.spv: windows-1252: 81 unique strings, 1 non-ASCII, 0 UTF-8.
0ceb843b.spv: windows-1252: 3108 unique strings, 392 non-ASCII, 0 UTF-8.
and still others are a mix:
02f274e6.spv: windows-1252: 746 unique strings, 77 non-ASCII, 27 UTF-8.
0a7be05b.spv: windows-1255: 334 unique strings, 122 non-ASCII, 113 UTF-8.
4c7a575d.spv: windows-1250: 400 unique strings, 73 non-ASCII, 72 UTF-8.
785b0737.spv: windows-1250: 322 unique strings, 164 non-ASCII, 158 UTF-8.
94365e08.spv: windows-1250: 353 unique strings, 77 non-ASCII, 2 UTF-8.
This commit just gives up on finding a rule: it interprets any string that
is valid UTF-8 as UTF-8 and recodes every other string from the member's
declared character set encoding. So far, it works in practice.
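
The per-string heuristic can be sketched roughly as follows. This is an
illustration only, not the code in this commit: is_valid_utf8() and
guess_string_encoding() are hypothetical names, and falling back to the
declared encoding for strings that are not valid UTF-8 is an assumption
based on the commit title.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the N bytes at S are well-formed UTF-8: no overlong
   forms, no surrogates, nothing above U+10FFFF.  Pure ASCII passes. */
static bool
is_valid_utf8 (const unsigned char *s, size_t n)
{
  size_t i = 0;
  while (i < n)
    {
      unsigned char c = s[i];
      unsigned int cp, min;
      size_t len;

      if (c < 0x80)
        {
          i++;
          continue;
        }
      else if ((c & 0xe0) == 0xc0)
        len = 2, min = 0x80, cp = c & 0x1f;
      else if ((c & 0xf0) == 0xe0)
        len = 3, min = 0x800, cp = c & 0x0f;
      else if ((c & 0xf8) == 0xf0)
        len = 4, min = 0x10000, cp = c & 0x07;
      else
        return false;           /* Stray continuation or invalid lead byte. */

      if (n - i < len)
        return false;           /* Sequence truncated at end of string. */
      for (size_t j = 1; j < len; j++)
        {
          if ((s[i + j] & 0xc0) != 0x80)
            return false;       /* Not a continuation byte. */
          cp = (cp << 6) | (s[i + j] & 0x3f);
        }
      if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        return false;           /* Overlong, out of range, or surrogate. */
      i += len;
    }
  return true;
}

/* Hypothetical helper: returns the encoding to assume for the N bytes at S,
   given the member's DECLARED encoding.  Assumption: strings that are not
   valid UTF-8 are in the declared encoding. */
static const char *
guess_string_encoding (const char *s, size_t n, const char *declared)
{
  return is_valid_utf8 ((const unsigned char *) s, n) ? "UTF-8" : declared;
}
```

A caller would pass each string together with the member's declared
encoding and recode only when the result is not "UTF-8". The check can
misfire in principle, since some byte sequences are meaningful in a
single-byte encoding and also happen to be valid UTF-8, which is why this
is a heuristic rather than a guarantee.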