manakai

Whatpm::Charset::UniversalCharDet

A Perl Interface to universalchardet Character Encoding Detection

SYNOPSIS

  require Whatpm::Charset::UniversalCharDet;
  $charset_name = Whatpm::Charset::UniversalCharDet
      ->detect_byte_string ($byte_string);
  # $charset_name: charset name (in lowercase) or undef

DESCRIPTION

The Whatpm::Charset::UniversalCharDet module is a Perl interface to the universalchardet character encoding detector.

universalchardet was originally developed by the Mozilla project and later ported to other platforms. The Whatpm::Charset::UniversalCharDet module provides a Perl interface to Universal Encoding Detector, a Python port of Mozilla's universalchardet code. Future versions of this module might provide an interface to another port of universalchardet.

METHOD

$charset = Whatpm::Charset::UniversalCharDet->detect_byte_string ($s)

Detect the character encoding of the specified byte string.

$s

The byte string.

$charset

The name of the character encoding detected by universalchardet, in lowercase. If no character encoding can be detected (for example, because no universalchardet implementation is found), undef is returned.

For the list of supported encodings, see the documentation for Universal Encoding Detector <http://chardet.feedparser.org/docs/supported-encodings.html>.
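Because detect_byte_string can return undef, callers usually need a fallback. The following is a minimal sketch, not part of this module: the helper name decode_with_detection and the default fallback of utf-8 are assumptions made for illustration. It uses only the documented detect_byte_string method plus the core Encode module, and it treats a missing Whatpm::Charset::UniversalCharDet module the same as a failed detection.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not provided by this module): decode a byte
# string using the detected charset, or a fallback if detection fails.
sub decode_with_detection {
    my ($bytes, $fallback) = @_;
    $fallback = 'utf-8' unless defined $fallback;  # assumed default

    # detect_byte_string returns a lowercase charset name or undef.
    # The eval also covers the case where the module is not installed.
    my $charset = eval {
        require Whatpm::Charset::UniversalCharDet;
        Whatpm::Charset::UniversalCharDet->detect_byte_string ($bytes);
    };
    $charset = $fallback unless defined $charset;

    # The detector may report a name that Encode does not know;
    # in that case the fallback encoding is used instead.
    my $enc = Encode::find_encoding ($charset)
        || Encode::find_encoding ($fallback);
    return ($enc->decode ($bytes), $enc->name);
}

my ($text, $used) = decode_with_detection ("plain ASCII bytes");
print "Decoded using $used\n";
```

The helper dies if the fallback name itself is unknown to Encode, so the fallback should be a name that Encode::find_encoding accepts.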

DEPENDENCY

Inline::Python

A Perl module which enables Python support for Inline code embedding, available from <http://search.cpan.org/dist/Inline-Python/>.

To install the module using CPAN:

  root# perl -MCPAN -eshell
  cpan> install Inline::Python

Python

Available at <http://www.python.org/download/>.

Universal Encoding Detector

Available at <http://chardet.feedparser.org/download/>.

Expand the archive and then execute python setup.py install in the expanded directory.

TROUBLESHOOTING

The Whatpm::Charset::UniversalCharDet module does not raise an error even when it fails to load the universalchardet library; it only emits the error message as a warning.

This behavior can be changed by setting the flag $Whatpm::Charset::UniversalCharDet::DEBUG to a true value, which makes any error invoke die instead of warn.
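As a rough illustration of the DEBUG flag, the sketch below enables it for a single call and captures the resulting exception with eval. The helper name detect_strict is hypothetical, and the sketch assumes the flag is consulted when detect_byte_string is called; only the flag and the method themselves come from this document.

```perl
use strict;
use warnings;

# Hypothetical helper: run one detection with DEBUG enabled and
# return the charset together with any error message that was raised.
sub detect_strict {
    my ($bytes) = @_;
    no warnings 'once';  # the flag is named only once in this file

    my $charset = eval {
        require Whatpm::Charset::UniversalCharDet;
        # With a true DEBUG flag, errors are raised with die, not warn.
        # local restores the previous value when the eval block exits.
        local $Whatpm::Charset::UniversalCharDet::DEBUG = 1;
        Whatpm::Charset::UniversalCharDet->detect_byte_string ($bytes);
    };
    my $error = $@;
    return ($charset, $error);
}

my ($charset, $error) = detect_strict ("abc");
if (defined $charset) {
    print "Detected: $charset\n";
} elsif (length $error) {
    print "Detection failed: $error";
} else {
    print "No encoding detected\n";
}
```

Checking the captured message against the common error messages for this module should indicate which dependency is missing.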

Common error messages are as follows:

Can't locate Inline.pm in @INC

The Inline module is not installed.

Error. You have specified 'Python' as an Inline programming language.

The Inline::Python module is not installed. If you did install the module, find "the Inline DIRECTORY" (e.g. ./_Inline) and remove it.

Couldn't find an appropriate DIRECTORY for Inline to use.

The temporary directory for the Inline module is not available. See "The Inline DIRECTORY" in Inline <http://search.cpan.org/dist/Inline/Inline.pod#The_Inline_DIRECTORY>.

Error -- py_eval raised an exception

Universal Encoding Detector is not installed.
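The first two messages above can be ruled out before calling the detector by checking whether the Perl-side dependencies load. This is only a sketch: the function name missing_modules is hypothetical, and the check does not cover Python itself or Universal Encoding Detector.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the given modules that
# cannot be loaded with require.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        # Convert a module name such as Inline::Python to Inline/Python.pm.
        (my $file = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules (qw(Inline Inline::Python));
print @missing
    ? "Not installed: @missing\n"
    : "Inline and Inline::Python are available\n";
```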

SEE ALSO

UNIVCHARDET

SuikaWiki <https://suika.suikawiki.org/gate/2005/sw/UNIVCHARDET>.

Universal Encoding Detector: character encoding auto-detection in Python <http://chardet.feedparser.org/>.

A composite approach to language/encoding detection <http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html>.

Web Applications 1.0 - Determining the character encoding <http://www.whatwg.org/specs/web-apps/current-work/complete.html#determining-the-character-encoding>.

AUTHOR

Wakaba <wakaba@suikawiki.org>.

LICENSE

Copyright 2007-2010 Wakaba <wakaba@suikawiki.org>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.