<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Whatpm::HTML - An HTML Parser</title>
<link rel="stylesheet" href="http://suika.fam.cx/www/style/html/pod.css" type="text/css" />
<link rev="made" href="mailto:admin@suika.fam.cx" />
</head>

<body>

<p><a name="__index__"></a></p>
<!-- INDEX BEGIN -->

<ul>

<li><a href="#name">NAME</a></li>
<li><a href="#synopsis">SYNOPSIS</a></li>
<li><a href="#description">DESCRIPTION</a></li>
<li><a href="#methods">METHODS</a></li>
<li><a href="#to_do">TO DO</a></li>
<li><a href="#see_also">SEE ALSO</a></li>
<li><a href="#author">AUTHOR</a></li>
<li><a href="#license">LICENSE</a></li>
</ul>
<!-- INDEX END -->

<hr />
<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>Whatpm::HTML - An HTML Parser</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
  use Whatpm::HTML;

  my $s = q&lt;&lt;!DOCTYPE html&gt;&lt;html&gt;...&lt;/html&gt;&gt;;
  # $doc = an empty DOM |Document| object
  my $onerror = sub {
    my $error_code = shift;
    warn $error_code, "\n";
  };

  Whatpm::HTML->parse_string ($s => $doc, $onerror);

  ## Then, |$doc| is the DOM representation of |$s|.</pre>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>The <code>Whatpm::HTML</code> module contains an HTML parser and an HTML serializer.</p>
<p>The HTML parser can be used to construct the DOM tree representation
of an HTML document. Parsing and tree construction are done
as described in the Web Applications 1.0 specification.</p>
<p>The HTML serializer can be used to obtain the HTML document representation
of a DOM tree (or a fragment thereof). Serialization
is performed as described in the Web Applications 1.0 specification
for the <code>innerHTML</code> DOM attribute.</p>
<p>This module is part of Whatpm - Perl Modules for
Web Hypertext Application Technologies.</p>
<p>
</p>
<hr />
<h1><a name="methods">METHODS</a></h1>
<dl>
<dt><strong><a name="item_parse_string">[<em>$doc</em> =] Whatpm::HTML->parse_string (<em>$s</em>, <em>$doc</em>[, <em>$onerror</em>]);</a></strong><br />
</dt>
<dd>
Parse a string <em>$s</em> as an HTML document.
</dd>
<dd>
<p>The first argument, <em>$s</em>, MUST be a string. It is parsed
as a sequence of characters representing an HTML document.</p>
</dd>
<dd>
<p>The second argument, <em>$doc</em>, MUST be an empty read-write
DOM <code>Document</code> object. The HTML DOM tree is constructed
on this <code>Document</code> object.</p>
</dd>
<dd>
<p>The third argument, <em>$onerror</em>, MUST be a reference to
the error handler code. Whenever a parse error is detected,
this code is invoked with one argument: a terse string that
might hint at what is wrong but is not meant as a
human-friendly message.
The code MAY throw an exception, in which case the whole parsing
process aborts. Otherwise, the parser continues to
process the input. The code MUST NOT modify <em>$s</em> or <em>$doc</em>;
if it does, the result is undefined.
This argument is optional; if it is missing, the string for each
parse error is passed to <code>warn</code>.</p>
</dd>
<dd>
<p><strong>NOTE</strong>: To be a conforming user agent, the code MUST either
abort the processing by throwing an exception at its first
invocation or let the processing continue until the parser
stops.</p>
</dd>
<dd>
<p>The method returns the DOM <code>Document</code> object (i.e. the second argument).</p>
</dd>
<dd>
<p>Note that the <code>Whatpm::NanoDOM</code> module provides a non-conforming
implementation of DOM that implements only the subset that
is necessary for <code>Whatpm::HTML</code>'s parsing and
serializing.
With this module, creating a new HTML <code>Document</code> object
from a string containing an HTML document might be coded as:</p>
</dd>
<dd>
<pre>
  use Whatpm::HTML;
  use Whatpm::NanoDOM;
  my $doc = Whatpm::HTML->parse_string
      ($s => Whatpm::NanoDOM::Document->new, $onerror);</pre>
</dd>
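<dd>
<p>As a rough sketch of the error handler contract for <code>parse_string</code>,
a handler might collect the error strings instead of warning them, leaving
<em>$s</em> and <em>$doc</em> untouched. The helper name
<code>make_error_collector</code> and the variable <code>$errors</code> are
illustrative assumptions and not part of Whatpm; the parsing step is only
attempted if the modules can be loaded:</p>
</dd>
<dd>
<pre>
  use strict;
  use warnings;

  # Hypothetical helper (not part of Whatpm): returns an error handler
  # and a reference to the array in which the handler records its argument.
  sub make_error_collector {
    my @errors;
    my $onerror = sub {
      my $error_code = shift;
      push @errors, $error_code;  # record only; never touch $s or $doc
    };
    return ($onerror, \@errors);
  }

  my ($onerror, $errors) = make_error_collector ();

  # Parse only if the modules are installed, so the sketch runs anywhere.
  if (eval { require Whatpm::HTML; require Whatpm::NanoDOM; 1 }) {
    my $s = q{&lt;!DOCTYPE html&gt;&lt;html&gt;...&lt;/html&gt;};
    my $doc = Whatpm::HTML->parse_string
        ($s => Whatpm::NanoDOM::Document->new, $onerror);
    print scalar @$errors, " parse error(s) reported\n";
  }</pre>
</dd>
<dd>
<p>To abort at the first parse error instead, the handler could simply
<code>die</code> with the string it receives.</p>
</dd>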
<p></p>
<dt><strong><a name="item_get_inner_html"><em>$s</em> = Whatpm::HTML->get_inner_html (<em>$node</em>[, <em>$onerror</em>]);</a></strong><br />
</dt>
<dd>
Return the HTML serialization of a DOM node <em>$node</em>.
</dd>
<dd>
<p>The first argument, <em>$node</em>, MUST be a DOM <code>Document</code>,
<code>Node</code>, or <code>DocumentFragment</code> object.</p>
</dd>
<dd>
<p>The second argument, <em>$onerror</em>, MUST be a reference to the
error handling code. This code is invoked if a descendant
of <em>$node</em> is not an <code>Element</code>, <code>Text</code>, <code>CDATASection</code>,
<code>Comment</code>, <code>DocumentType</code>, or <code>EntityReference</code> node,
which is the case where <code>INVALID_STATE_ERR</code> MUST be thrown.
The code is invoked with one argument, the node
whose type is invalid.
This argument is optional; if it is missing, any such
node is simply ignored.</p>
</dd>
<dd>
<p>The method returns a reference to the <code>inner_html</code> attribute
value, i.e. a reference to the HTML serialization of <em>$node</em>.</p>
</dd>
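<dd>
<p>Because <code>get_inner_html</code> returns a reference rather than the
string itself, callers need to dereference the result. The following
sketch wraps that step and throws from the error handler when a node of an
invalid type is reported. The name <code>serialize_strict</code> and the
wording of the exception message are assumptions made for illustration,
not part of Whatpm:</p>
</dd>
<dd>
<pre>
  use strict;
  use warnings;

  # Hypothetical wrapper (not part of Whatpm): serialize $node with the
  # given serializer class and return a plain string, raising an
  # exception for any node whose type is invalid.
  sub serialize_strict {
    my ($serializer, $node) = @_;
    my $onerror = sub {
      my $bad_node = shift;
      die "INVALID_STATE_ERR: cannot serialize " . ref ($bad_node) . "\n";
    };
    my $ref = $serializer->get_inner_html ($node, $onerror);
    return $$ref;  # dereference: the method returns a reference
  }

  # Use the real modules only if they are installed.
  if (eval { require Whatpm::HTML; require Whatpm::NanoDOM; 1 }) {
    my $doc = Whatpm::HTML->parse_string
        (q{&lt;!DOCTYPE html&gt;&lt;p&gt;Hello}, Whatpm::NanoDOM::Document->new, sub { });
    print serialize_strict ('Whatpm::HTML', $doc), "\n";
  }</pre>
</dd>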
<p></p></dl>
<p>
</p>
<hr />
<h1><a name="to_do">TO DO</a></h1>
<p>The tokenizer should emit a run of consecutive character tokens as a single token
to improve performance.</p>
<p>A method that accepts a byte stream as input.</p>
<p>A charset detection algorithm.</p>
<p>Setting <code>inner_html</code>.</p>
<p>There are also many "TODO" and "ISSUE" notes in the source code.</p>
<p>
</p>
<hr />
<h1><a name="see_also">SEE ALSO</a></h1>
<p>Whatpm
&lt;http://suika.fam.cx/www/markup/html/whatpm/readme&gt;</p>
<p>Web Applications 1.0 Working Draft (aka HTML5)
&lt;http://whatwg.org/html5&gt;. (Revision 792, 1 May 2007)</p>
<p><a href="../Whatpm/NanoDOM.html">the Whatpm::NanoDOM manpage</a></p>
<p>
</p>
<hr />
<h1><a name="author">AUTHOR</a></h1>
<p>Wakaba &lt;<a href="mailto:w@suika.fam.cx">w@suika.fam.cx</a>&gt;.</p>
<p>
</p>
<hr />
<h1><a name="license">LICENSE</a></h1>
<p>Copyright 2007 Wakaba &lt;<a href="mailto:w@suika.fam.cx">w@suika.fam.cx</a>&gt;</p>
<p>This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.</p>

</body>

</html>