=head1 NAME

What::HTML - An HTML Parser

=head1 SYNOPSIS

  use What::HTML;
  
  my $s = q<<!DOCTYPE html><html>...</html>>;
  # $doc = an empty DOM |Document| object
  my $on_error = sub {
    my $error_code = shift;
    warn $error_code, "\n";
  };
  
  What::HTML->parse_string ($s => $doc, $on_error);
  
  ## Then, |$doc| is the DOM representation of |$s|.

=head1 DESCRIPTION

The C<What::HTML> module contains an HTML parser and serializer.

The HTML parser can be used to construct the DOM tree representation
of an HTML document.  Parsing and tree construction are done
as described in the Web Applications 1.0 specification.

The HTML serializer can be used to obtain the HTML document representation
of a DOM tree (or a tree fragment thereof).  The serialization
is performed as described in the Web Applications 1.0 specification
for the C<innerHTML> DOM attribute.

This module is part of WHAT.pm - Perl Modules for
Web Hypertext Application Technologies.

=head1 METHODS

=over 4

=item [I<$doc> =] What::HTML->parse_string (I<$s>, I<$doc>[, I<$onerror>]);

Parse a string I<$s> as an HTML document.

The first argument, I<$s>, MUST be a string.  It is parsed
as a sequence of characters representing an HTML document.

The second argument, I<$doc>, MUST be an empty read-write
DOM C<Document> object.  The HTML DOM tree is constructed
onto this C<Document> object.

The third argument, I<$onerror>, MUST be a reference to
the error handler code.  Whenever a parse error is detected,
this code is invoked with one argument, a string that
attempts to describe what is wrong (it might not be very helpful).
The code MAY throw an exception, so that the whole parsing
process aborts.  Otherwise, the parser continues to
process the input.  The code MUST NOT modify I<$s> or I<$doc>;
if it does, the result is undefined.
This argument is optional; if it is omitted, the string
for each parse error is C<warn>ed.

The method returns the DOM C<Document> object (i.e. the second argument).

Note that the C<What::NanoDOM> module provides a non-conforming
implementation of DOM that implements only the subset
necessary for C<What::HTML>'s parsing and serializing.
With this module, creating a new HTML C<Document> object
from a string containing an HTML document can be coded as:

  use What::HTML;
  use What::NanoDOM;
  my $doc = What::HTML->parse_string ($s => What::NanoDOM->new, $onerror);
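
The error handler contract described above can be sketched as follows.
This is only an illustration: the names C<@errors>, C<$collect>, and
C<$strict> are hypothetical, C<< What::NanoDOM->new >> is assumed to
return an empty C<Document> as in the example above, and the exact text
of the error strings is not specified here.

  use strict;
  use warnings;
  use What::HTML;
  use What::NanoDOM;

  my $s = q<<!DOCTYPE html><html>...</html>>;

  ## Collect the error strings instead of warn()ing them.
  my @errors;
  my $collect = sub {
    my $error_code = shift;
    push @errors, $error_code;
  };
  ## Assumption: What::NanoDOM->new yields an empty Document.
  my $doc = What::HTML->parse_string ($s => What::NanoDOM->new, $collect);
  print scalar (@errors), " parse error(s)\n";

  ## Abort at the first parse error by throwing an exception.
  my $strict = sub {
    my $error_code = shift;
    die "Parse error: $error_code\n";
  };
  my $strict_doc = eval {
    What::HTML->parse_string ($s => What::NanoDOM->new, $strict);
  };
  warn $@ unless defined $strict_doc;

The first handler lets the parser run to the end and leaves the decision
to the caller; the second one relies on the rule that an exception thrown
by the handler aborts the whole parsing process.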

=item I<$s> = What::HTML->get_inner_html (I<$node>[, I<$onerror>]);

Return the HTML serialization of a DOM node I<$node>.

The first argument, I<$node>, MUST be a DOM C<Document>,
C<Node>, or C<DocumentFragment> object.

The second argument, I<$onerror>, MUST be a reference to the
error handling code.  This code is invoked if a descendant
of I<$node> is not an C<Element>, C<Text>, C<CDATASection>,
C<Comment>, C<DocumentType>, or C<EntityReference> node, i.e.
in the case where the specification requires that
C<INVALID_STATE_ERR> be thrown.
The code is invoked with one argument, the node
whose type is invalid.
This argument is optional; if it is omitted, any such
node is simply ignored.

The method returns the C<inner_html> attribute
value, i.e. the HTML serialization of I<$node>.
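
A round trip through the parser and the serializer might look like the
following sketch.  It assumes, as documented above, that the return value
is the serialized string itself and that C<< What::NanoDOM->new >> returns
an empty C<Document>; the names C<@invalid> and C<$html> are hypothetical.

  use strict;
  use warnings;
  use What::HTML;
  use What::NanoDOM;

  my $s = q<<!DOCTYPE html><html>...</html>>;
  ## Assumption: What::NanoDOM->new yields an empty Document.
  my $doc = What::HTML->parse_string ($s => What::NanoDOM->new);

  ## Remember the nodes that cannot be serialized instead of
  ## silently ignoring them.
  my @invalid;
  my $html = What::HTML->get_inner_html ($doc, sub {
    my $node = shift;
    push @invalid, $node;
  });

  print $html, "\n";
  warn scalar (@invalid), " node(s) could not be serialized\n" if @invalid;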

=back

=head1 TO DO

Tokenizer should emit a sequence of character tokens as one token
to improve performance.

A method that accepts a byte stream as an input.

Charset detection algorithm.

Setting inner_html.

And there are many "TODO"s and "ISSUE"s in the source code.

=head1 SEE ALSO

Web Applications 1.0 Working Draft (aka HTML5)
<http://whatwg.org/html5>.  (Revision 792, 1 May 2007)

L<What::NanoDOM>

=head1 AUTHOR

Wakaba <w@suika.fam.cx>.

=head1 LICENSE

Copyright 2007 Wakaba <w@suika.fam.cx>

This library is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

=cut

# $Date: 2007/05/01 08:17:44 $