=head1 NAME

Whatpm::Charset::UniversalCharDet - A Perl Interface to universalchardet
Character Encoding Detection

=head1 SYNOPSIS

  require Whatpm::Charset::UniversalCharDet;
  $charset_name = Whatpm::Charset::UniversalCharDet
      ->detect_byte_string ($byte_string);
  # $charset_name: charset name (in lowercase) or undef

=head1 DESCRIPTION

The C<Whatpm::Charset::UniversalCharDet> module is a Perl interface to
universalchardet, a character encoding detector.

universalchardet was originally developed by the Mozilla project
and later ported to other platforms.  The C<Whatpm::Charset::UniversalCharDet>
module provides a Perl interface to Universal Encoding Detector,
a Python port of Mozilla's universalchardet code.  A future
version of this module might provide an interface to another
port of universalchardet.

=head1 METHOD

=over 4

=item I<$charset> = Whatpm::Charset::UniversalCharDet->detect_byte_string (I<$s>)

Detect the character encoding of the specified byte string.

=over 4

=item I<$s>

The byte string.

=item I<$charset>

The name of the character encoding detected by universalchardet,
in lowercase.
If no character encoding can be detected (for example, because no
implementation of universalchardet is found), C<undef> is returned.

For the list of supported encodings, see the documentation for
Universal Encoding Detector
<http://chardet.feedparser.org/docs/supported-encodings.html>.

=back

=back

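As a minimal sketch of how the return value might be used, the
following hypothetical helper C<decode_detected> (not part of this
module) decodes a byte string with the detected encoding and falls
back to a caller-supplied encoding when C<undef> is returned or when
L<Encode> does not know the detected name.  The helper name and the
fallback policy are assumptions made for illustration only.

  use strict;
  use warnings;
  use Encode ();

  # Hypothetical helper; not part of Whatpm::Charset::UniversalCharDet.
  sub decode_detected {
    my ($bytes, $fallback) = @_;

    # Load the detector lazily; a missing module is treated the same
    # way as "no encoding detected".
    my $name = eval {
      require Whatpm::Charset::UniversalCharDet;
      Whatpm::Charset::UniversalCharDet->detect_byte_string ($bytes);
    };

    # Use the fallback if nothing was detected or if Encode does not
    # support the detected encoding name.
    my $enc = defined $name ? Encode::find_encoding ($name) : undef;
    $enc ||= Encode::find_encoding ($fallback)
        or die "Unknown fallback encoding: $fallback";

    return $enc->decode ($bytes);
  }

  my $text = decode_detected ("caf\xC3\xA9", 'utf-8');

Whether a short input such as the one above is detected at all depends
on the underlying detector, so callers should always be prepared for
the fallback path.
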
=head1 DEPENDENCY

=over 4

=item L<Inline::Python>

A Perl module available from CPAN
<http://search.cpan.org/~neilw/Inline-Python-0.22/>.

To install the module using L<CPAN.pm>:

  root# perl -MCPAN -eshell
  cpan> install Inline::Python

=item Python

Available at <http://www.python.org/download/>.

=item Universal Encoding Detector

Available at <http://chardet.feedparser.org/download/>.

Expand the archive and then execute C<python setup.py install>
in the expanded directory.

=back

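The dependencies listed above can be checked before the module is
used.  The following is a rough sketch, not part of this module: the
function names are hypothetical, the interpreter name C<python> is an
assumption about the local installation, and C<chardet> is the Python
package name under which Universal Encoding Detector is installed.

  use strict;
  use warnings;

  # Hypothetical helper: returns the names of the Perl modules in the
  # argument list that cannot be loaded.
  sub missing_perl_modules {
    return grep { not eval "require $_; 1" } @_;
  }

  # Hypothetical helper: true if the given Python interpreter can
  # import the "chardet" package.
  sub python_has_chardet {
    my ($python) = @_;
    return system ($python, '-c', 'import chardet') == 0;
  }

  my @missing = missing_perl_modules (qw(Inline Inline::Python));
  print "Missing Perl modules: @missing\n" if @missing;
  print "Python package 'chardet' is not importable\n"
      unless python_has_chardet ('python');
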
=head1 TROUBLESHOOTING

The C<Whatpm::Charset::UniversalCharDet> module does not raise
an error even when it fails to load the universalchardet library;
it simply C<warn>s with the error message.

This behavior can be changed by setting the
flag C<$Whatpm::Charset::UniversalCharDet> to a true value, which makes
any error C<die> rather than C<warn>.

Common error messages are as follows:

=over 4

=item Can't locate Inline.pm in @INC

L<Inline> is not installed.

=item Error. You have specified 'Python' as an Inline programming language.

L<Inline::Python> is not installed.

=item Couldn't find an appropriate DIRECTORY for Inline to use.

The temporary directory for the L<Inline> module is not available.
See L<Inline/"The Inline DIRECTORY"> or
<http://search.cpan.org/~ingy/Inline-0.44/Inline.pod#The_Inline_DIRECTORY>.

=item Error -- py_eval raised an exception

Universal Encoding Detector is not installed.

=back

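Because load failures are reported with C<warn> by default, a caller
that wants to log or inspect the messages can install a local
C<$SIG{__WARN__}> handler around the call.  The following is a sketch
only: the wrapper name C<detect_with_warnings> is hypothetical, and it
assumes that the messages are emitted through Perl's C<warn> as
described above.

  use strict;
  use warnings;

  # Hypothetical wrapper: returns the detected name (or undef)
  # followed by any messages warned during detection.
  sub detect_with_warnings {
    my ($bytes) = @_;
    my @messages;
    my $name = do {
      # Collect warnings instead of printing them to STDERR.
      local $SIG{__WARN__} = sub { push @messages, $_[0] };
      eval {
        require Whatpm::Charset::UniversalCharDet;
        Whatpm::Charset::UniversalCharDet->detect_byte_string ($bytes);
      };
    };
    # A failed "require" dies rather than warns; record that as well.
    push @messages, $@ if not defined $name and $@;
    return ($name, @messages);
  }

  my ($name, @messages) = detect_with_warnings ("example bytes");
  print STDERR "universalchardet: $_" for @messages;

The collected messages can then be compared against the list of common
error messages above.
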
=head1 SEE ALSO

UNIVCHARDET - SuikaWiki
<http://suika.fam.cx/gate/2005/sw/UNIVCHARDET>

Universal Encoding Detector: character encoding auto-detection in Python
<http://chardet.feedparser.org/>

A composite approach to language/encoding detection
<http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html>

=head1 AUTHOR

Wakaba <w@suika.fam.cx>.

=head1 LICENSE

Copyright 2007 Wakaba <w@suika.fam.cx>

This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

=cut

## $Date:$