whatpm/Whatpm/HTML.pod

=head1 NAME

Whatpm::HTML - An HTML Parser

=head1 SYNOPSIS

  use Whatpm::HTML;
  
  my $s = q<<!DOCTYPE html><html>...</html>>;
  # $doc = an empty DOM |Document| object
  my $onerror = sub {
    my $error_code = shift;
    warn $error_code, "\n";
  };
  
  Whatpm::HTML->parse_string ($s => $doc, $onerror);
  
  ## Then, |$doc| is the DOM representation of |$s|.

=head1 DESCRIPTION

The C<Whatpm::HTML> module contains an HTML parser and serializer.

The HTML parser can be used to construct the DOM tree representation
from an HTML document.  The parsing and tree construction are done 
as described in the Web Applications 1.0 specification.

The HTML serializer can be used to obtain the HTML document representation
of a DOM tree (or a tree fragment thereof).  The serialization
is performed as described in the Web Applications 1.0 specification
for the C<innerHTML> DOM attribute.

This module is part of Whatpm - Perl Modules for 
Web Hypertext Application Technologies.

=head1 METHODS

=over 4

=item [I<$doc> =] Whatpm::HTML->parse_string (I<$s>, I<$doc>[, I<$onerror>]);

Parse a string I<$s> as an HTML document.

The first argument, I<$s>, MUST be a string.  It is parsed
as a sequence of characters representing an HTML document.

The second argument, I<$doc>, MUST be an empty read-write 
DOM C<Document> object.  The HTML DOM tree is constructed
onto this C<Document> object.

The third argument, I<$onerror>, if specified, MUST be a reference to
the error handler code.  Whenever a parse error is detected,
this code is invoked with one argument, a string that might
describe what is wrong (callers should not rely on its content).
The code MAY throw an exception, in which case the whole parsing
process aborts.  Otherwise, the parser continues to
process the input.  The code MUST NOT modify I<$s> or I<$doc>;
if it does, the result is undefined.
This argument is optional; if it is missing, the string for each
parse error is C<warn>ed.

B<NOTE>: To be a conforming user agent, the code MUST either
abort the processing by throwing an exception at the first
invocation or MUST continue the processing until the parser
stops.
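
As a minimal sketch of the two conforming behaviours, the following
handlers either collect every error string and let the parser continue,
or abort at the first invocation by throwing an exception.  The names
C<make_collecting_handler> and C<make_aborting_handler> are hypothetical
helpers introduced for illustration; they are not part of C<Whatpm::HTML>.

  use strict;
  use warnings;

  ## Hypothetical helper: returns an error handler and a reference
  ## to the array into which the handler collects error strings.
  sub make_collecting_handler {
    my @errors;
    my $onerror = sub {
      my $error = shift;
      push @errors, $error;  # record it; the parser continues
    };
    return ($onerror, \@errors);
  }

  ## Hypothetical helper: returns an error handler that aborts
  ## the processing at its first invocation.
  sub make_aborting_handler {
    return sub {
      my $error = shift;
      die "Parse error: $error\n";
    };
  }

A handler obtained this way might then be passed as the third argument:

  my ($onerror, $errors) = make_collecting_handler ();
  Whatpm::HTML->parse_string ($s => $doc, $onerror);
  warn scalar @$errors, " parse error(s)\n" if @$errors;

With the aborting handler, the call to C<parse_string> would normally be
wrapped in C<eval { ... }> so that the exception can be caught.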

The method returns the DOM C<Document> object (i.e. the second argument).

Note that the C<Whatpm::NanoDOM> module provides a non-conforming
implementation of DOM that implements only the subset
necessary for C<Whatpm::HTML>'s parsing and
serializing.
With this module, creating a new HTML C<Document> object
from a string containing an HTML document might be coded as:

  use Whatpm::HTML;
  use Whatpm::NanoDOM;
  my $doc = Whatpm::HTML->parse_string
      ($s => Whatpm::NanoDOM::Document->new, $onerror);

=item I<$s> = Whatpm::HTML->get_inner_html (I<$node>[, I<$onerror>]);

Return the HTML serialization of a DOM node I<$node>.

The first argument, I<$node>, MUST be a DOM C<Document>,
C<Element>, or C<DocumentFragment> node.

The second argument, I<$onerror>, if specified, MUST be a reference to the
error handling code.  This code will be invoked if a descendant
of I<$node> is not an C<Element>, C<Text>, C<CDATASection>,
C<Comment>, C<DocumentType>, or C<EntityReference> node, that is,
in the case where an C<INVALID_STATE_ERR> exception MUST be thrown.
The code will be invoked with one argument, the node
whose type is invalid.
The argument I<$onerror> is optional; if it is missing, any erroneous
node is simply ignored.

The method returns a reference to the C<inner_html> attribute
value, i.e. the HTML serialization of the I<$node>.
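
Because the method returns a reference rather than a plain string, the
result has to be dereferenced before use.  The following is a minimal
round-trip sketch, assuming that C<Whatpm::NanoDOM> is installed and is
used as the DOM implementation, as in the C<parse_string> example.  The
input string and the variable names are chosen for illustration only.

  use strict;
  use warnings;
  use Whatpm::HTML;
  use Whatpm::NanoDOM;

  my $s = q<<!DOCTYPE html><html><p>Hello</p></html>>;

  ## Parse with a handler that silently ignores parse errors.
  my $doc = Whatpm::HTML->parse_string
      ($s => Whatpm::NanoDOM::Document->new, sub { });

  ## Collect any nodes whose type cannot be serialized.
  my @invalid_nodes;
  my $html_ref = Whatpm::HTML->get_inner_html
      ($doc, sub { push @invalid_nodes, shift });

  ## |$html_ref| is a reference; dereference it to get the string.
  print $$html_ref, "\n";
  warn scalar @invalid_nodes, " node(s) could not be serialized\n"
      if @invalid_nodes;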

=back

=head1 TO DO

Tokenizer should emit a sequence of character tokens as one token
to improve performance.

A method that accepts a byte stream as an input.

Charset detection algorithm.

Documentation for the setter of inner_html.

And there are many "TODO"s and "ISSUE"s in the source code.

=head1 SEE ALSO

Whatpm
<http://suika.fam.cx/www/markup/html/whatpm/readme>

Web Applications 1.0 Working Draft (aka HTML5)
<http://whatwg.org/html5>.  (Revision 792, 1 May 2007)

L<Whatpm::NanoDOM>

=head1 AUTHOR

Wakaba <w@suika.fam.cx>.

=head1 LICENSE

Copyright 2007 Wakaba <w@suika.fam.cx>

This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

=cut

# $Date: 2007/05/02 13:44:34 $