| 1 |
wakaba |
1.1 |
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> |
| 2 |
|
|
<html xmlns="http://www.w3.org/1999/xhtml"> |
| 3 |
|
|
<head> |
| 4 |
wakaba |
1.6 |
<title>Whatpm::HTML - An HTML Parser and Serializer</title> |
| 5 |
wakaba |
1.1 |
<link rel="stylesheet" href="http://suika.fam.cx/www/style/html/pod.css" type="text/css" /> |
| 6 |
|
|
<link rev="made" href="mailto:admin@suika.fam.cx" /> |
| 7 |
|
|
</head> |
| 8 |
|
|
|
| 9 |
|
|
<body> |
| 10 |
|
|
|
| 11 |
|
|
<p><a name="__index__"></a></p> |
| 12 |
|
|
<!-- INDEX BEGIN --> |
| 13 |
|
|
|
| 14 |
|
|
<ul> |
| 15 |
|
|
|
| 16 |
|
|
<li><a href="#name">NAME</a></li> |
| 17 |
|
|
<li><a href="#synopsis">SYNOPSIS</a></li> |
| 18 |
|
|
<li><a href="#description">DESCRIPTION</a></li> |
| 19 |
|
|
<li><a href="#methods">METHODS</a></li> |
| 20 |
wakaba |
1.5 |
<li><a href="#lowlevel_interface">LOW-LEVEL INTERFACE</a></li> |
| 21 |
|
|
<ul> |
| 22 |
|
|
|
| 23 |
|
|
<li><a href="#application_cache_selection_algorithm_hook">Application Cache Selection Algorithm Hook</a></li> |
| 24 |
|
|
</ul> |
| 25 |
|
|
|
| 26 |
wakaba |
1.6 |
<li><a href="#error_reports">ERROR REPORTS</a></li> |
| 27 |
wakaba |
1.1 |
<li><a href="#to_do">TO DO</a></li> |
| 28 |
wakaba |
1.8 |
<li><a href="#dependency">DEPENDENCY</a></li> |
| 29 |
wakaba |
1.1 |
<li><a href="#see_also">SEE ALSO</a></li> |
| 30 |
|
|
<li><a href="#author">AUTHOR</a></li> |
| 31 |
|
|
<li><a href="#license">LICENSE</a></li> |
| 32 |
|
|
</ul> |
| 33 |
|
|
<!-- INDEX END --> |
| 34 |
|
|
|
| 35 |
|
|
<hr /> |
| 36 |
|
|
<p> |
| 37 |
|
|
</p> |
| 38 |
|
|
<h1><a name="name">NAME</a></h1> |
| 39 |
wakaba |
1.6 |
<p>Whatpm::HTML - An HTML Parser and Serializer</p> |
| 40 |
wakaba |
1.1 |
<p> |
| 41 |
|
|
</p> |
| 42 |
|
|
<hr /> |
| 43 |
|
|
<h1><a name="synopsis">SYNOPSIS</a></h1> |
| 44 |
|
|
<pre> |
| 45 |
wakaba |
1.2 |
use Whatpm::HTML; |
| 46 |
wakaba |
1.1 |
|
| 47 |
|
|
my $s = q&lt;&lt;!DOCTYPE html&gt;&lt;html&gt;...&lt;/html&gt;&gt;; |
| 48 |
|
|
# $doc = an empty DOM |Document| object |
| 49 |
|
|
my $on_error = sub { |
| 50 |
wakaba |
1.6 |
my %error = @_; |
| 51 |
|
|
warn $error{type}, "\n"; |
| 52 |
wakaba |
1.1 |
}; |
| 53 |
|
|
|
| 54 |
wakaba |
1.8 |
Whatpm::HTML->parse_char_string ($s => $doc, $on_error); |
| 55 |
wakaba |
1.1 |
|
| 56 |
wakaba |
1.6 |
## Now, |$doc| is the DOM representation of |$s|.</pre> |
| 57 |
wakaba |
1.1 |
<p> |
| 58 |
|
|
</p> |
| 59 |
|
|
<hr /> |
| 60 |
|
|
<h1><a name="description">DESCRIPTION</a></h1> |
| 61 |
wakaba |
1.2 |
<p>The <code>Whatpm::HTML</code> module contains an HTML parser and serializer.</p> |
| 62 |
wakaba |
1.1 |
<p>The HTML parser can be used to construct the DOM tree representation |
| 63 |
|
|
from an HTML document. The parsing and tree construction are done |
| 64 |
|
|
as described in the Web Applications 1.0 specification.</p> |
| 65 |
|
|
<p>The HTML serializer can be used to obtain the HTML document representation |
| 66 |
|
|
of a DOM tree (or a tree fragment thereof). The serialization |
| 67 |
|
|
is performed as described in the Web Applications 1.0 specification |
| 68 |
|
|
for the <code>innerHTML</code> DOM attribute.</p> |
| 69 |
wakaba |
1.2 |
<p>This module is part of Whatpm - Perl Modules for |
| 70 |
wakaba |
1.1 |
Web Hypertext Application Technologies.</p> |
| 71 |
|
|
<p> |
| 72 |
|
|
</p> |
| 73 |
|
|
<hr /> |
| 74 |
|
|
<h1><a name="methods">METHODS</a></h1> |
| 75 |
|
|
<dl> |
| 76 |
wakaba |
1.8 |
<dt><strong><a name="item_parse_char_string">[<em>$doc</em> =] Whatpm::HTML->parse_char_string (<em>$s</em>, <em>$doc</em>[, <em>$onerror</em>]);</a></strong><br /> |
| 77 |
wakaba |
1.1 |
</dt> |
| 78 |
|
|
<dd> |
| 79 |
|
|
Parse a string <em>$s</em> as an HTML document. |
| 80 |
|
|
</dd> |
| 81 |
|
|
<dd> |
| 82 |
|
|
<p>The first argument, <em>$s</em>, MUST be a string. It is parsed |
| 83 |
|
|
as a sequence of characters representing an HTML document.</p> |
| 84 |
|
|
</dd> |
| 85 |
|
|
<dd> |
| 86 |
|
|
<p>The second argument, <em>$doc</em>, MUST be an empty read-write |
| 87 |
|
|
DOM <code>Document</code> object. The HTML DOM tree is constructed |
| 88 |
|
|
onto this <code>Document</code> object.</p> |
| 89 |
|
|
</dd> |
| 90 |
|
|
<dd> |
| 91 |
|
|
<p>The third argument, <em>$onerror</em>, MUST be a reference to |
| 92 |
|
|
the error handler code. Whenever a parse error is detected, |
| 93 |
|
|
this code is invoked with a list of key/value pairs describing |
| 94 |
|
|
the error, such as the <code>type</code> entry read in the SYNOPSIS. |
| 95 |
|
|
The code MAY throw an exception, so that the whole parsing |
| 96 |
|
|
process aborts. Otherwise, the parser will continue to |
| 97 |
|
|
process the input. The code MUST NOT modify <em>$s</em> or <em>$doc</em>. |
| 98 |
|
|
If it does, then the result is undefined. |
| 99 |
|
|
This argument is optional; if it is omitted, each |
| 100 |
|
|
parse error is reported with <code>warn</code> instead.</p> |
| 101 |
|
|
</dd> |
| 102 |
|
|
<dd> |
| 103 |
wakaba |
1.3 |
<p><strong>NOTE</strong>: To be a conforming user agent, the code MUST either |
| 104 |
|
|
abort the processing by throwing an exception at the first |
| 105 |
|
|
invocation or MUST continue the processing until the parser |
| 106 |
|
|
stops.</p> |
| 107 |
|
|
</dd> |
| 108 |
|
|
<dd> |
| 109 |
wakaba |
1.1 |
<p>The method returns the DOM <code>Document</code> object (i.e. the second argument).</p> |
| 110 |
|
|
</dd> |
| 111 |
|
|
<dd> |
| 112 |
wakaba |
1.2 |
<p>Note that the <code>Whatpm::NanoDOM</code> module provides a non-conforming |
| 113 |
wakaba |
1.4 |
implementation of DOM that only implements a subset that |
| 114 |
wakaba |
1.2 |
is necessary for the purpose of <code>Whatpm::HTML</code>'s parsing and |
| 115 |
wakaba |
1.1 |
serializing. |
| 116 |
|
|
With this module, creating a new HTML <code>Document</code> object |
| 117 |
wakaba |
1.3 |
from a string containing an HTML document might be coded as:</p> |
| 118 |
wakaba |
1.1 |
</dd> |
| 119 |
|
|
<dd> |
| 120 |
|
|
<pre> |
| 121 |
wakaba |
1.2 |
use Whatpm::HTML; |
| 122 |
|
|
use Whatpm::NanoDOM; |
| 123 |
wakaba |
1.8 |
my $doc = Whatpm::HTML->parse_char_string |
| 124 |
wakaba |
1.2 |
($s => Whatpm::NanoDOM::Document->new, $onerror);</pre> |
| 125 |
wakaba |
1.1 |
</dd> |
| 126 |
|
|
<p></p></dl> |
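<p>The two behaviours that the NOTE above allows for an error handler,
aborting at the first invocation or letting the parser run to the end,
might be sketched as follows. This is only an illustration: <code>report_errors</code>
is a hypothetical stand-in for a parser run, not part of <code>Whatpm::HTML</code>,
and the error types are made-up placeholders. The handlers read their
arguments as a hash with a <code>type</code> entry, as the SYNOPSIS does.</p>

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parser run (NOT part of Whatpm::HTML): it
# reports each given error type to the handler as a key/value list, the
# form the SYNOPSIS handler reads, and returns how many were reported.
sub report_errors {
  my ($onerror, @types) = @_;
  my $reported = 0;
  for my $type (@types) {
    $reported++;
    $onerror->(type => $type);
  }
  return $reported;
}

# Handler that aborts at its first invocation by throwing an exception.
my $abort_on_first = sub {
  my %error = @_;
  die "parse error: $error{type}\n";
};

# Handler that only records the error, so that processing continues.
my @seen;
my $record_and_continue = sub {
  my %error = @_;
  push @seen, $error{type};
};

# 'placeholder-a' and 'placeholder-b' are made-up error types.
my $finished = eval {
  report_errors ($abort_on_first, 'placeholder-a', 'placeholder-b');
};
my $abort_message = $@;

my $reported = report_errors
    ($record_and_continue, 'placeholder-a', 'placeholder-b');
```

<p>With the real module, either handler would presumably be passed as the
third argument of <code>parse_char_string</code>, with the call wrapped in
<code>eval</code> when the handler may throw.</p>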
| 127 |
wakaba |
1.5 |
<p> |
| 128 |
|
|
</p> |
| 129 |
|
|
<hr /> |
| 130 |
|
|
<h1><a name="lowlevel_interface">LOW-LEVEL INTERFACE</a></h1> |
| 131 |
|
|
<p>@@ TBW</p> |
| 132 |
|
|
<p> |
| 133 |
|
|
</p> |
| 134 |
|
|
<h2><a name="application_cache_selection_algorithm_hook">Application Cache Selection Algorithm Hook</a></h2> |
| 135 |
|
|
<p>Once a parser <em>$p</em> is instantiated by the <code>new</code> method, |
| 136 |
wakaba |
1.6 |
a <code>CODE</code> reference can be set to <code>$p->{application_cache_selection}</code>. |
| 137 |
wakaba |
1.5 |
That <code>CODE</code> will be called back when the application cache selection |
| 138 |
|
|
algorithm MUST be run per HTML5. By default, |
| 139 |
wakaba |
1.6 |
<code>$p->{application_cache_selection}</code> is set to an empty subroutine.</p> |
| 140 |
|
|
<p>The subroutine will be invoked with an argument <em>manifest_uri</em>, |
| 141 |
|
|
which is set to the manifest URI when the algorithm MUST be invoked |
| 142 |
|
|
with a manifest URI, or is set to <code>undef</code> when the algorithm MUST |
| 143 |
|
|
be invoked without a manifest URI.</p> |
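<p>A minimal sketch of such a callback follows. To keep it self-contained,
<em>$p</em> here is a plain hash reference standing in for a real parser
object, the callback is invoked by hand rather than by the parser, and the
manifest URI is an invented example value.</p>

```perl
use strict;
use warnings;

# Stand-in for a parser object (an assumption for this sketch); with the
# real module, $p would be the object returned by the |new| method.
my $p = {application_cache_selection => sub { }};

# Record how the hook was invoked: with a manifest URI or with undef.
my @calls;
$p->{application_cache_selection} = sub {
  my $manifest_uri = shift;
  push @calls, defined $manifest_uri
      ? "with manifest: $manifest_uri"
      : 'without manifest';
};

# Simulate the two kinds of invocation described above.
$p->{application_cache_selection}->('http://example.com/app.manifest');
$p->{application_cache_selection}->(undef);

print "$_\n" for @calls;
```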
| 144 |
|
|
<p> |
| 145 |
|
|
</p> |
| 146 |
|
|
<hr /> |
| 147 |
|
|
<h1><a name="error_reports">ERROR REPORTS</a></h1> |
| 148 |
|
|
<p>@@ TBW</p> |
| 149 |
|
|
<p>The list of the error types is available in |
| 150 |
|
|
Whatpm Error Types &lt;http://suika.fam.cx/gate/2005/sw/Whatpm%20Error%20Types&gt;.</p> |
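<p>Until this section is written, one way to inspect error reports is to
tally them by type. The sketch below assumes only what the SYNOPSIS shows,
namely that the handler receives a key/value list containing a
<code>type</code> entry. The handler is called by hand here, and the type
names are placeholders rather than entries from the list above.</p>

```perl
use strict;
use warnings;

# Count the reported errors per type.
my %count;
my $onerror = sub {
  my %error = @_;
  $count{$error{type}}++;
};

# Simulated reports; a real parser run would make these calls itself.
$onerror->(type => 'placeholder-type-1');
$onerror->(type => 'placeholder-type-2');
$onerror->(type => 'placeholder-type-1');

# Print a summary, one line per error type.
for my $type (sort keys %count) {
  print "$type: $count{$type}\n";
}
```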
| 151 |
wakaba |
1.1 |
<p> |
| 152 |
|
|
</p> |
| 153 |
|
|
<hr /> |
| 154 |
|
|
<h1><a name="to_do">TO DO</a></h1> |
| 155 |
wakaba |
1.8 |
<p>Documentation for parse_byte_string.</p> |
| 156 |
wakaba |
1.1 |
<p>Tokenizer should emit a sequence of character tokens as one token |
| 157 |
|
|
to improve performance.</p> |
| 158 |
|
|
<p>A method that accepts a byte stream as an input.</p> |
| 159 |
|
|
<p>Charset detection algorithm.</p> |
| 160 |
wakaba |
1.4 |
<p>Documentation for the setter of inner_html.</p> |
| 161 |
wakaba |
1.1 |
<p>And there are many ``TODO''s and ``ISSUE''s in the source code.</p> |
| 162 |
wakaba |
1.8 |
<p> |
| 163 |
|
|
</p> |
| 164 |
|
|
<hr /> |
| 165 |
|
|
<h1><a name="dependency">DEPENDENCY</a></h1> |
| 166 |
|
|
<p>This module requires <em>Error</em>. That module is available at CPAN |
| 167 |
|
|
&lt;http://search.cpan.org/author/SHLOMIF/Error-0.17009/lib/Error.pm&gt;. |
| 168 |
|
|
It is also part of the manakai-core distribution |
| 169 |
|
|
&lt;http://suika.fam.cx/www/2006/manakai/&gt;.</p> |
| 170 |
wakaba |
1.1 |
<p> |
| 171 |
|
|
</p> |
| 172 |
|
|
<hr /> |
| 173 |
|
|
<h1><a name="see_also">SEE ALSO</a></h1> |
| 174 |
wakaba |
1.6 |
<p>Whatpm &lt;http://suika.fam.cx/www/markup/html/whatpm/readme&gt;.</p> |
| 175 |
|
|
<p>Whatpm Error Types |
| 176 |
|
|
&lt;http://suika.fam.cx/gate/2005/sw/Whatpm%20Error%20Types&gt;.</p> |
| 177 |
|
|
<p>HTML5 &lt;http://whatwg.org/html5&gt;.</p> |
| 178 |
wakaba |
1.7 |
<p><a href="../Whatpm/HTML/Serializer.html">the Whatpm::HTML::Serializer manpage</a>.</p> |
| 179 |
wakaba |
1.6 |
<p><a href="../Whatpm/NanoDOM.html">the Whatpm::NanoDOM manpage</a>.</p> |
| 180 |
|
|
<p><a href="../Whatpm/ContentChecker/HTML.html">the Whatpm::ContentChecker::HTML manpage</a>.</p> |
| 181 |
wakaba |
1.1 |
<p> |
| 182 |
|
|
</p> |
| 183 |
|
|
<hr /> |
| 184 |
|
|
<h1><a name="author">AUTHOR</a></h1> |
| 185 |
|
|
<p>Wakaba &lt;<a href="mailto:w@suika.fam.cx">w@suika.fam.cx</a>&gt;.</p> |
| 186 |
|
|
<p> |
| 187 |
|
|
</p> |
| 188 |
|
|
<hr /> |
| 189 |
|
|
<h1><a name="license">LICENSE</a></h1> |
| 190 |
|
|
<p>Copyright 2007 Wakaba &lt;<a href="mailto:w@suika.fam.cx">w@suika.fam.cx</a>&gt;</p> |
| 191 |
|
|
<p>This library is free software; you can redistribute it |
| 192 |
|
|
and/or modify it under the same terms as Perl itself.</p> |
| 193 |
|
|
|
| 194 |
|
|
</body> |
| 195 |
|
|
|
| 196 |
|
|
</html> |