#============================================================================
# Enca v1.12 (2009-10-29)  guess and convert encoding of text files
# Copyright (C) 2000-2003 David Necas (Yeti) <yeti@physics.muni.cz>
# Copyright (C) 2009 Michal Cihar <michal@cihar.com>
#============================================================================

Frequently Asked Questions about Enca

* Q: What obscure encoding do the FAQ, THANKS, and other files use?
* A: Run Enca on them and you'll see.

* Q: How do I specify the input encoding?
* A: You can't.  If you know the encoding, you don't need Enca.  Use a
     fully-fledged converter instead.
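
When the encoding is already known, conversion is a plain decode and
re-encode with nothing to guess.  The following is only a minimal Python
sketch of that idea; the function name convert_known and the choice of
ISO-8859-2 and UTF-8 are assumptions made for the example, not anything
Enca provides.

```python
def convert_known(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes whose encoding is already known; no guessing involved."""
    # decode() raises UnicodeDecodeError if the bytes are invalid in `source`.
    text = data.decode(source)
    # encode() raises UnicodeEncodeError if `target` cannot represent a character.
    return text.encode(target)


# The first two letters of the Czech sample, stored as ISO-8859-2: bytes EC E8.
latin2_bytes = b"\xec\xe8"
utf8_bytes = convert_known(latin2_bytes, "iso-8859-2", "utf-8")
```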

* Q: Why can't Enca detect both language and encoding?
* A: Because this is impossible.  Well, it's possible for natural texts
     that are long enough.  But Enca can also detect the encoding of
     nonsense, and people are used to that.  No program can tell you that
     “ěčřýáíéú” is both Czech and e.g. ISO-8859-2.  Incidentally,
     interpreting the same bytes as Win-1251 one gets “мишэбнйъ”, which is
     equally good Russian nonsense.
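
The ambiguity described above can be reproduced with a few lines of
Python.  This is only an illustration of the byte values involved, not of
how Enca works internally; it assumes that Python's "cp1251" codec is the
encoding the answer calls Win-1251.

```python
# The eight bytes that spell the Czech sample string in ISO-8859-2.
raw = bytes([0xEC, 0xE8, 0xF8, 0xFD, 0xE1, 0xED, 0xE9, 0xFA])

# Both decodings succeed: every byte is a valid letter in either charset,
# so the bytes alone do not say which reading was intended.
as_latin2 = raw.decode("iso-8859-2")  # Czech letters
as_cp1251 = raw.decode("cp1251")      # Cyrillic letters

print(as_latin2)
print(as_cp1251)
```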

* Q: Why can't Enca recognise the encoding of all-uppercase texts?
* A: Mostly, again, because there's a trade-off between nonsense detection
     and low-probability natural text detection.  But it's possible to
     detect both; I'm working on that.

* Q: Why does Enca need LC_CTYPE to be set?
* A: It doesn't.  But without it you need to use “-L language-code”.
     You can put that option into the ENCAOPT environment variable if you
     want to make your life easier.  (Newer versions of Enca also try to
     guess your language from other locale settings.)
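
For scripts that call Enca, the two ways of naming the language could be
prepared roughly as follows.  This is a sketch under assumptions: the
helper names enca_argv and enca_env, the file name letter.txt and the
language code cs are all placeholders, and nothing is executed unless you
uncomment the last lines on a machine with Enca installed.

```python
import os


def enca_argv(path, language):
    """Build an argument list that names the language explicitly with -L."""
    return ["enca", "-L", language, path]


def enca_env(language):
    """Copy the current environment and put the -L option into ENCAOPT."""
    env = dict(os.environ)
    env["ENCAOPT"] = "-L " + language
    return env


argv = enca_argv("letter.txt", "cs")  # hypothetical file and language code
env = enca_env("cs")

# To actually run Enca with the option taken from ENCAOPT:
#   import subprocess
#   subprocess.run(["enca", "letter.txt"], env=env)
```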

* Q: Why doesn't “enca -x ascii” work?
* A: Unfortunately there are several different things people call “conversion
     to ASCII”.  Consider the following characters: ě (Latin small letter e
     with caron) and ≫ (Much greater-than).  By conversion to ASCII you may
     mean:
     1a. Omitting these characters in output, because they are not
         representable in ASCII.
     1b. Keeping these characters intact in output, because they are not
         representable in ASCII.
     1c. Failing (possibly damaging the file), because they are not
         representable in ASCII.
     2. Approximating them with single, close ASCII characters; in this case
        probably plain “e” and “>”.
     3. Expanding them to sequences of ASCII characters which may be less
        readable than the approximations above, but generally allow
        reconstruction of the original characters.  Such sequences could
        be e.g. RFC-1345 mnemonics “e<” and “>>”.
     What happens when you run “enca -x ascii” on something depends on
     the converter used.

     The usual scenario is: Enca uses librecode or libiconv, which do (1),
     and you get upset.  The only tool known to me that does (2) is cstocs,
     and it does so only for Latin2 characters (install cstocs and specify
     it as the converter to be used, if you want (2)).  AFAIK, there's no
     tool that does (3) reasonably, though recode can expand to mnemonics
     (use rfc1345 as the target charset instead of ascii).
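
To make the difference between meanings (1a), (1c), (2) and (3) concrete,
here is a small Python sketch applied to the two sample characters.  It
does not reproduce what librecode, libiconv or cstocs do; the function
names and the two tiny translation tables are hypothetical and cover only
these two characters, with the replacement strings taken from the list
above.

```python
# The two sample characters: e with caron (U+011B), much greater-than (U+226B).
SAMPLE = "\u011b \u226b"

# Hypothetical hand-written tables covering only the two sample characters.
APPROXIMATE = str.maketrans({"\u011b": "e", "\u226b": ">"})
MNEMONIC = str.maketrans({"\u011b": "e<", "\u226b": ">>"})


def omit(text):
    """Meaning 1a: silently drop everything outside ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def fail(text):
    """Meaning 1c: refuse to convert; raises UnicodeEncodeError."""
    return text.encode("ascii", errors="strict").decode("ascii")


def approximate(text):
    """Meaning 2: replace each character with one close ASCII character."""
    return text.translate(APPROXIMATE)


def expand(text):
    """Meaning 3: replace each character with a longer ASCII sequence."""
    return text.translate(MNEMONIC)


print(repr(omit(SAMPLE)))
print(approximate(SAMPLE))
print(expand(SAMPLE))
```

Only expand() keeps enough information to recover the original text, and
only as long as the chosen sequences cannot be confused with ordinary ASCII
input.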

* Q: Why doesn't “enca -E cstocs” work on my RedHat/Fedora?
* A: This is a cstocs problem.  Cstocs is broken for Perl ≥ 5.8 and UTF-8
     locales.  Perl's Unicode handling changes with every version, so an
     advanced charset converter working in multiple Perl versions is
     something between impossible and a big mess.  Either set your locale
     to a non-UTF-8 one (e.g. ISO-8859-2) or don't use cstocs until this is
     resolved somehow.