=head1 NAME

Whatpm::HTML - An HTML Parser and Serializer

=head1 SYNOPSIS

  use Whatpm::HTML;

  my $s = q<<!DOCTYPE html><html>...</html>>;
  # $doc = an empty DOM |Document| object
  my $onerror = sub {
    my %error = @_;
    warn $error{type}, "\n";
  };

  Whatpm::HTML->parse_string ($s => $doc, $onerror);

  ## Now, |$doc| is the DOM representation of |$s|.

=head1 DESCRIPTION

The C<Whatpm::HTML> module contains an HTML parser and serializer.

The HTML parser can be used to construct the DOM tree representation
of an HTML document.  The parsing and tree construction are done
as described in the Web Applications 1.0 specification.

The HTML serializer can be used to obtain the HTML document representation
of a DOM tree (or a tree fragment thereof).  The serialization
is performed as described in the Web Applications 1.0 specification
for the C<innerHTML> DOM attribute.

This module is part of Whatpm - Perl Modules for
Web Hypertext Application Technologies.

=head1 METHODS

=over 4

=item [I<$doc> =] Whatpm::HTML->parse_string (I<$s>, I<$doc>[, I<$onerror>]);

Parse a string I<$s> as an HTML document.

The first argument, I<$s>, MUST be a string.  It is parsed
as a sequence of characters representing an HTML document.

The second argument, I<$doc>, MUST be an empty read-write
DOM C<Document> object.  The HTML DOM tree is constructed
onto this C<Document> object.

The third argument, I<$onerror>, MUST be a reference to
the error handler code.  Whenever a parse error is detected,
this code is invoked with an argument that contains a
string that might describe what is wrong; the string is not
guaranteed to be helpful.
The code MAY throw an exception, so that the whole parsing
process aborts.  Otherwise, the parser will continue to
process the input.  The code MUST NOT modify I<$s> or I<$doc>.
If it does, then the result is undefined.
This argument is optional; if it is missing, that string
is C<warn>ed for each parse error.

B<NOTE>: To be a conforming user agent, the code MUST either
abort the processing by throwing an exception at the first
invocation or MUST continue the processing until the parser
stops.

The method returns the DOM C<Document> object (i.e. the second argument).

Note that the C<Whatpm::NanoDOM> module provides a non-conforming
implementation of DOM that implements only the subset
necessary for C<Whatpm::HTML>'s parsing and serializing.
With that module, creating a new HTML C<Document> object
from a string containing an HTML document might be coded as:

  use Whatpm::HTML;
  use Whatpm::NanoDOM;
  my $doc = Whatpm::HTML->parse_string
      ($s => Whatpm::NanoDOM::Document->new, $onerror);

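As an illustration of the two behaviours permitted by the B<NOTE>
above, the following minimal sketch defines one handler that records
every report and lets the parser continue, and another that aborts at
the first report by throwing an exception.  The names C<$collect>,
C<$abort>, and C<@errors> and the sample arguments are made up for
this example, and the handlers keep their arguments as-is rather than
assuming a particular shape.  The final block runs only if
C<Whatpm::HTML> and C<Whatpm::NanoDOM> can be loaded.

  use strict;
  use warnings;

  # Handler 1: record every report and let the parser continue.
  # The arguments are stored as-is; their exact shape is not assumed.
  my @errors;
  my $collect = sub { push @errors, [@_]; };

  # Handler 2: abort at the first report by throwing an exception.
  my $abort = sub { die "parse error: @_\n"; };

  # Exercise the handlers directly with made-up arguments.
  $collect->('example-error-1');
  $collect->('example-error-2');
  my $aborted = eval { $abort->('example-error-1'); 1 } ? 0 : 1;

  print scalar (@errors), " error(s) collected; aborted: $aborted\n";

  # With Whatpm installed, either handler can be the third argument.
  if (eval { require Whatpm::HTML; require Whatpm::NanoDOM; 1 }) {
    my $doc = eval {
      Whatpm::HTML->parse_string
          ('<!DOCTYPE html><p>x', Whatpm::NanoDOM::Document->new, $abort);
    };
    warn "Parsing aborted: $@" unless defined $doc;
  }

A handler that sometimes throws and sometimes returns would fall
outside both cases described in the B<NOTE>.
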
=item I<$s> = Whatpm::HTML->get_inner_html (I<$node>[, I<$onerror>]);

Return the HTML serialization of a DOM node I<$node>.

The first argument, I<$node>, MUST be a DOM C<Document>,
C<Element>, or C<DocumentFragment> node.

The second argument, I<$onerror>, MUST be a reference to the
error handling code.  This code will be invoked if a descendant
of I<$node> is not an C<Element>, C<Text>, C<CDATASection>,
C<Comment>, C<DocumentType>, or C<EntityReference> node, that is,
in the case where an C<INVALID_STATE_ERR> exception MUST be thrown.
The code will be invoked with an argument, which is the node
whose type is invalid.
The argument I<$onerror> is optional; if it is missing, any erroneous
node is simply ignored.

The method returns a reference to the C<inner_html> attribute
value, i.e. the HTML serialization of the I<$node>.

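Since the return value is a reference, it has to be dereferenced
to obtain the string.  The following is a minimal sketch, assuming
that C<Whatpm::NanoDOM> is used as the DOM implementation.  The names
C<$html_ref> and C<@skipped> are made up for this example, and the
code only prints a notice when the modules cannot be loaded.

  use strict;
  use warnings;

  # Called once for each node that cannot be serialized.
  my @skipped;
  my $onerror = sub { my $node = shift; push @skipped, $node; };

  if (eval { require Whatpm::HTML; require Whatpm::NanoDOM; 1 }
      and Whatpm::HTML->can ('get_inner_html')) {
    my $doc = Whatpm::HTML->parse_string
        ('<!DOCTYPE html><html><p>Hello</p></html>',
         Whatpm::NanoDOM::Document->new, sub { });
    # The return value is a reference; dereference it for the string.
    my $html_ref = Whatpm::HTML->get_inner_html ($doc, $onerror);
    print $$html_ref, "\n";
  } else {
    print "Whatpm is not available; nothing to serialize\n";
  }
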
=back

=head1 LOW-LEVEL INTERFACE

@@ TBW

=head2 Application Cache Selection Algorithm Hook

Once a parser I<$p> is instantiated by the method C<new>,
a C<CODE> reference can be set to C<< I<$p>->{application_cache_selection} >>.
That C<CODE> will be called back when the application cache selection
algorithm MUST be run per HTML5.  By default,
C<< I<$p>->{application_cache_selection} >> is set to an empty subroutine.

The subroutine will be invoked with an argument I<manifest_uri>,
which is set to the manifest URI when the algorithm MUST be invoked
with a manifest URI, or is set to C<undef> when the algorithm MUST
be invoked without a manifest URI.

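A callback might be installed as in the following sketch.  The name
C<$select_cache>, the C<@requests> list, and the example manifest URI
are hypothetical, and calling C<new> without arguments is an
assumption.  How the parser instance is then driven is not shown,
because the low-level interface is not documented yet.

  use strict;
  use warnings;

  # Hypothetical callback: remember what the parser asked for.
  my @requests;
  my $select_cache = sub {
    my $manifest_uri = shift;
    push @requests, defined $manifest_uri
        ? "with manifest <$manifest_uri>" : 'without manifest';
  };

  # Exercise the callback directly with both kinds of argument.
  $select_cache->('http://www.example.com/app.manifest');
  $select_cache->(undef);
  print "$_\n" for @requests;

  # With Whatpm installed, set the callback on a parser instance.
  if (eval { require Whatpm::HTML; 1 } and Whatpm::HTML->can ('new')) {
    my $p = Whatpm::HTML->new;
    $p->{application_cache_selection} = $select_cache;
  }
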
=head1 ERROR REPORTS

@@ TBW

The list of the error types is available in
Whatpm Error Types <http://suika.fam.cx/gate/2005/sw/Whatpm%20Error%20Types>.

=head1 TO DO

The tokenizer should emit a sequence of character tokens as one token
to improve performance.

A method that accepts a byte stream as an input.

Charset detection algorithm.

Documentation for the setter of inner_html.

There are also many "TODO"s and "ISSUE"s in the source code.

=head1 SEE ALSO

Whatpm <http://suika.fam.cx/www/markup/html/whatpm/readme>.

Whatpm Error Types
<http://suika.fam.cx/gate/2005/sw/Whatpm%20Error%20Types>.

HTML5 <http://whatwg.org/html5>.

L<Whatpm::NanoDOM>.

L<Whatpm::ContentChecker::HTML>.

=head1 AUTHOR

Wakaba <w@suika.fam.cx>.

=head1 LICENSE

Copyright 2007 Wakaba <w@suika.fam.cx>

This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

=cut

# $Date: 2007/11/04 04:15:07 $