| 1 |
wakaba |
1.1 |
=head1 NAME |
| 2 |
|
|
|
| 3 |
wakaba |
1.6 |
Whatpm::HTML - An HTML Parser and Serializer |
| 4 |
wakaba |
1.1 |
|
| 5 |
|
|
=head1 SYNOPSIS |
| 6 |
|
|
|
| 7 |
wakaba |
1.2 |
use Whatpm::HTML; |
| 8 |
wakaba |
1.1 |
|
| 9 |
|
|
my $s = q<<!DOCTYPE html><html>...</html>>; |
| 10 |
|
|
# $doc = an empty DOM |Document| object |
| 11 |
|
|
my $on_error = sub { |
| 12 |
wakaba |
1.6 |
my %error = @_; |
| 13 |
|
|
warn $error{type}, "\n"; |
| 14 |
wakaba |
1.1 |
}; |
| 15 |
|
|
|
| 16 |
wakaba |
1.2 |
Whatpm::HTML->parse_string ($s => $doc, $on_error); |
| 17 |
wakaba |
1.1 |
|
| 18 |
wakaba |
1.6 |
## Now, |$doc| is the DOM representation of |$s|. |
| 19 |
wakaba |
1.1 |
|
| 20 |
|
|
=head1 DESCRIPTION |
| 21 |
|
|
|
| 22 |
wakaba |
1.2 |
The C<Whatpm::HTML> module contains an HTML parser and serializer. |
| 23 |
wakaba |
1.1 |
|
| 24 |
|
|
The HTML parser can be used to construct the DOM tree representation |
| 25 |
|
|
from an HTML document. The parsing and tree construction are done |
| 26 |
|
|
as described in the Web Applications 1.0 specification. |
| 27 |
|
|
|
| 28 |
|
|
The HTML serializer can be used to obtain the HTML document representation |
| 29 |
|
|
of a DOM tree (or a tree fragment thereof). The serialization |
| 30 |
|
|
is performed as described in the Web Applications 1.0 specification |
| 31 |
|
|
for the C<innerHTML> DOM attribute. |
| 32 |
|
|
|
| 33 |
wakaba |
1.2 |
This module is part of Whatpm - Perl Modules for |
| 34 |
wakaba |
1.1 |
Web Hypertext Application Technologies. |
| 35 |
|
|
|
| 36 |
|
|
=head1 METHODS |
| 37 |
|
|
|
| 38 |
|
|
=over 4 |
| 39 |
|
|
|
| 40 |
wakaba |
1.2 |
=item [I<$doc> =] Whatpm::HTML->parse_string (I<$s>, I<$doc>[, I<$onerror>]); |
| 41 |
wakaba |
1.1 |
|
| 42 |
|
|
Parse a string I<$s> as an HTML document. |
| 43 |
|
|
|
| 44 |
|
|
The first argument, I<$s>, MUST be a string. It is parsed |
| 45 |
|
|
as a sequence of characters representing an HTML document. |
| 46 |
|
|
|
| 47 |
|
|
The second argument, I<$doc>, MUST be an empty read-write |
| 48 |
|
|
DOM C<Document> object. The HTML DOM tree is constructed |
| 49 |
|
|
onto this C<Document> object. |
| 50 |
|
|
|
| 51 |
|
|
The third argument, I<$onerror>, MUST be a reference to |
| 52 |
|
|
the error handler code. Whenever a parse error is detected, |
| 53 |
|
|
this code is invoked with name/value pairs describing the error, |
| 54 |


including a C<type> string that might describe what is wrong. |
| 55 |
|
|
The code MAY throw an exception, so that the whole parsing |
| 56 |
|
|
process aborts. Otherwise, the parser will continue to |
| 57 |
|
|
process the input. The code MUST NOT modify I<$s> or I<$doc>. |
| 58 |
|
|
If it does, then the result is undefined. |
| 59 |
|
|
This argument is optional; if missing, any |
| 60 |
|
|
parse error is reported with C<warn>. |
| 61 |
|
|
|
| 62 |
wakaba |
1.3 |
B<NOTE>: To be a conforming user agent, the code MUST either |
| 63 |
|
|
abort the processing by throwing an exception at the first |
| 64 |
|
|
invocation or MUST continue the processing until the parser |
| 65 |
|
|
stops. |
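
The two conforming behaviours can be sketched as a pair of handlers. This is a minimal sketch, not part of the module: it assumes the handler receives name/value pairs with a C<type> key, as the SYNOPSIS suggests. The direct calls stand in for the parser, and the error type names are made up for illustration.

```perl
use strict;
use warnings;

# Collecting handler: records each error type and lets parsing continue
# until the parser stops.
my @errors;
my $collect = sub {
  my %error = @_;            # name/value pairs, as in the SYNOPSIS
  push @errors, $error{type};
};

# Aborting handler: throws an exception at the first invocation.
my $abort = sub {
  my %error = @_;
  die "parse error: $error{type}\n";
};

# Stand-in for the parser's calls; these type names are hypothetical.
$collect->(type => 'example-error');
$collect->(type => 'another-example');

# An aborting handler is normally wrapped in eval by the caller.
my $ok = eval { $abort->(type => 'example-error'); 1 };
my $message = $@;
```

Either code reference could then be passed as the third argument of C<parse_string>. A handler that aborts only on some later invocation would not match either behaviour described in the note.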
| 66 |
|
|
|
| 67 |
wakaba |
1.1 |
The method returns the DOM C<Document> object (i.e. the second argument). |
| 68 |
|
|
|
| 69 |
wakaba |
1.2 |
Note that the C<Whatpm::NanoDOM> module provides a non-conforming |
| 70 |
wakaba |
1.4 |
implementation of DOM that only implements a subset that |
| 71 |
wakaba |
1.2 |
is necessary for the purpose of C<Whatpm::HTML>'s parsing and |
| 72 |
wakaba |
1.1 |
serializing. |
| 73 |
|
|
With this module, creating a new HTML C<Document> object |
| 74 |
wakaba |
1.3 |
from a string containing an HTML document might be coded as: |
| 75 |
wakaba |
1.1 |
|
| 76 |
wakaba |
1.2 |
use Whatpm::HTML; |
| 77 |
|
|
use Whatpm::NanoDOM; |
| 78 |
|
|
my $doc = Whatpm::HTML->parse_string |
| 79 |
|
|
($s => Whatpm::NanoDOM::Document->new, $onerror); |
| 80 |
wakaba |
1.1 |
|
| 81 |
|
|
=back |
| 82 |
|
|
|
| 83 |
wakaba |
1.5 |
=head1 LOW-LEVEL INTERFACE |
| 84 |
|
|
|
| 85 |
|
|
@@ TBW |
| 86 |
|
|
|
| 87 |
|
|
=head2 Application Cache Selection Algorithm Hook |
| 88 |
|
|
|
| 89 |
|
|
Once a parser I<$p> is instantiated by method C<new>, |
| 90 |
wakaba |
1.6 |
a C<CODE> reference can be set to C<< I<$p>->{application_cache_selection} >>. |
| 91 |
wakaba |
1.5 |
That C<CODE> will be called back when the application cache selection |
| 92 |
|
|
algorithm MUST be run per HTML5. By default, |
| 93 |
wakaba |
1.6 |
C<< I<$p>->{application_cache_selection} >> is set to an empty subroutine. |
| 94 |
|
|
|
| 95 |
|
|
The subroutine will be invoked with an argument I<manifest_uri>, |
| 96 |
|
|
which is set to the manifest URI when the algorithm MUST be invoked |
| 97 |
|
|
with a manifest URI, or is set to C<undef> when the algorithm MUST |
| 98 |
|
|
be invoked without a manifest URI. |
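
A hook following this description might look like the sketch below. The plain hash reference is only a hypothetical stand-in for a parser object, so the example runs on its own. A real I<$p> would come from C<new>, and the manifest URI shown is an illustrative value.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parser object; the default hook is an
# empty subroutine, as described above.
my $p = { application_cache_selection => sub { } };

# Replace the default hook with one that records how it was called.
my @calls;
$p->{application_cache_selection} = sub {
  my $manifest_uri = shift;
  push @calls, defined $manifest_uri
      ? "with manifest: $manifest_uri"
      : 'without manifest';
};

# Simulate the two ways the parser would call back.
$p->{application_cache_selection}->('http://example.com/app.manifest');
$p->{application_cache_selection}->(undef);
```

Testing C<defined $manifest_uri> is enough to tell the two cases apart, since the argument is C<undef> whenever the algorithm runs without a manifest URI.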
| 99 |
|
|
|
| 100 |
|
|
=head1 ERROR REPORTS |
| 101 |
|
|
|
| 102 |
|
|
@@ TBW |
| 103 |
|
|
|
| 104 |
|
|
The list of the error types is available in |
| 105 |
|
|
Whatpm Error Types <http://suika.fam.cx/gate/2005/sw/Whatpm%20Error%20Types>. |
| 106 |
wakaba |
1.5 |
|
| 107 |
wakaba |
1.1 |
=head1 TO DO |
| 108 |
|
|
|
| 109 |
|
|
Tokenizer should emit a sequence of character tokens as one token |
| 110 |
|
|
to improve performance. |
| 111 |
|
|
|
| 112 |
|
|
A method that accepts a byte stream as an input. |
| 113 |
|
|
|
| 114 |
|
|
Charset detection algorithm. |
| 115 |
|
|
|
| 116 |
wakaba |
1.4 |
Documentation for the setter of inner_html. |
| 117 |
wakaba |
1.1 |
|
| 118 |
|
|
And there are many "TODO"s and "ISSUE"s in the source code. |
| 119 |
|
|
|
| 120 |
|
|
=head1 SEE ALSO |
| 121 |
|
|
|
| 122 |
wakaba |
1.6 |
Whatpm <http://suika.fam.cx/www/markup/html/whatpm/readme>. |
| 123 |
|
|
|
| 124 |
|
|
Whatpm Error Types |
| 125 |
|
|
<http://suika.fam.cx/gate/2005/sw/Whatpm%20Error%20Types>. |
| 126 |
|
|
|
| 127 |
|
|
HTML5 <http://whatwg.org/html5>. |
| 128 |
wakaba |
1.3 |
|
| 129 |
wakaba |
1.7 |
L<Whatpm::HTML::Serializer>. |
| 130 |
|
|
|
| 131 |
wakaba |
1.6 |
L<Whatpm::NanoDOM>. |
| 132 |
wakaba |
1.1 |
|
| 133 |
wakaba |
1.6 |
L<Whatpm::ContentChecker::HTML>. |
| 134 |
wakaba |
1.1 |
|
| 135 |
|
|
=head1 AUTHOR |
| 136 |
|
|
|
| 137 |
|
|
Wakaba <w@suika.fam.cx>. |
| 138 |
|
|
|
| 139 |
|
|
=head1 LICENSE |
| 140 |
|
|
|
| 141 |
|
|
Copyright 2007 Wakaba <w@suika.fam.cx> |
| 142 |
|
|
|
| 143 |
|
|
This library is free software; you can redistribute it |
| 144 |
|
|
and/or modify it under the same terms as Perl itself. |
| 145 |
|
|
|
| 146 |
|
|
=cut |
| 147 |
|
|
|
| 148 |
wakaba |
1.7 |
# $Date: 2007/11/04 04:34:30 $ |