Whatpm — Perl modules for Web hypertext application technologies (beta)

Introduction

Whatpm is a work-in-progress set of Perl modules for Web hypertext application technologies. It is part of the manakai project.

Whatpm supports various Web standard technologies, including HTML, XHTML, XML, CSS, HTTP, and URL.

Modules

Note that all of these modules are work in progress and have a number of unresolved problems.

Note also that some modules have no documentation for now.

Modules for HTML and XML

Modules related to HTML and XHTML are as follows:

Whatpm::HTML
An implementation of the HTML5 document and fragment parsing algorithms. It can be used to convert an arbitrary string into a DOM. (See also demo.)
Whatpm::HTML::Serializer
An implementation of the HTML5 fragment serialization algorithm. (See also demo.)
Whatpm::HTMLTable
An implementation of the HTML5 table algorithm. It can be used to extract a table structure from a DOM table element node. (See also demo.)

The module for tentative XML support is as follows:

Whatpm::XMLSerializer
A simple XML serializer.

A full XML parser and serializer are not yet available.

The module for conformance checking of a DOM tree (i.e. an in-memory representation of an HTML or XML document) is as follows:

Whatpm::ContentChecker
A DOM5 HTML (in-memory representation of a document) conformance checker with partial support for Atom 1.0. (See also demo and application.)

Currently, conformance checking of HTML/XHTML and Atom documents is supported.

These modules need a DOM implementation that supports manakai's Perl binding of DOM to represent a document in memory. The manakai-core package contains such an implementation, Message::DOM::Implementation, but it should also be possible to use any other implementation that supports the binding.

Modules for CSS

Modules for CSS and related technologies are as follows:

Whatpm::CSS::Cascade
A media-independent implementation of CSS cascading and value computations. (See also demo.)
Whatpm::CSS::MediaQueryParser
A media query parser. Note that only CSS 2.1 media types are supported at the moment.
Whatpm::CSS::MediaQuerySerializer
A media query serializer. Note that only CSS 2.1 media types are supported at the moment.
Whatpm::CSS::Parser
A CSS parser that constructs CSSOM trees from style sheets. (See also demo.)
Whatpm::CSS::SelectorsParser
A parser for groups of selectors. (See also demo.)
Whatpm::CSS::SelectorsSerializer
A serializer for groups of selectors. (See also specification and demo.)
Whatpm::CSS::Tokenizer
A CSS tokenizer. (See also demo.)
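
To illustrate what "only CSS 2.1 media types are supported" can mean in practice, here is a minimal Perl sketch. It is not the Whatpm::CSS::MediaQueryParser API: the helper name parse_media_list and the returned data structure are hypothetical, and the sketch handles only a plain comma-separated list of media types, not full media queries.

```perl
use strict;
use warnings;

# Media types defined by CSS 2.1. Assumption of this sketch: any other
# name is reported as unknown rather than rejected.
my %CSS21_MEDIA_TYPE = map { $_ => 1 } qw(
  all braille embossed handheld print projection screen speech tty tv
);

# parse_media_list is a hypothetical helper, not part of Whatpm.
# It splits a comma-separated media list and classifies each entry.
sub parse_media_list {
  my ($text) = @_;
  my @result;
  for my $item (split /,/, $text, -1) {
    $item =~ s/\A\s+//;    # trim leading whitespace
    $item =~ s/\s+\z//;    # trim trailing whitespace
    my $type = lc $item;   # media type names are case-insensitive
    push @result, {
      type  => $type,
      known => $CSS21_MEDIA_TYPE{$type} ? 1 : 0,
    };
  }
  return \@result;
}

my $list = parse_media_list('screen, PRINT, 3d-glasses');
printf "%s: %s\n", $_->{type}, $_->{known} ? 'known' : 'unknown'
    for @$list;
```

Keeping unknown names in the result, flagged rather than dropped, lets a caller decide whether to ignore them or report them.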

To represent the CSSOM tree constructed by the Whatpm::CSS::Parser module, modules in the manakai-core package are used. Those modules also provide the serializer for the CSSOM tree, in the form of the standard css_text CSSOM attribute.

Modules for HTTP

Modules for HTTP and related technologies are as follows:

Whatpm::ContentType
An implementation of the HTML5 Content Type sniffing algorithm.
Whatpm::IMTChecker
An Internet Media Type (aka MIME type) label conformance checker.

Support for parsing HTTP headers and the like is not yet available.

Module for URL

The module for URL support is as follows:

Whatpm::URIChecker
An IRI reference conformance checker.

Support for HTML5's realistic definition of URL is not available yet.

Modules for other technologies

The following modules provide support for other Web-related technologies:

Whatpm::CacheManifest
An HTML5 cache manifest parser.
Whatpm::Charset::DecodeHandle
A filehandle-like wrapper interface for decoding a byte stream encoded in some character encoding.
Whatpm::Charset::UnicodeChecker
A Unicode character string checker.
Whatpm::Charset::UniversalCharDet
A Perl interface to the universalchardet character encoding detection library.
Whatpm::LangTag
A language tag parser and conformance checker, supporting both the older RFC 3066 definition and the newer RFC 4646 definition. (See also demo.)
Whatpm::RDFXML
An implementation of RDF/XML by which RDF triples can be extracted from RDF/XML documents.
Whatpm::WebIDL
A WebIDL fragment parser. It parses an IDL fragment, whether conforming or not, and constructs a DOM-like object model for further processing. A non-conforming (or broken) IDL fragment-like string is parsed using CSS-like error-tolerant parsing rules, e.g. ignoring anything until the next ; character.
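
The error-tolerant rule mentioned for Whatpm::WebIDL (ignoring anything until the next ; character) can be sketched in isolation. The following is a toy illustration only, not the Whatpm::WebIDL implementation or API: the helper split_statements and the "name value;" statement grammar are assumptions made for the example.

```perl
use strict;
use warnings;

# split_statements is a hypothetical helper, not the Whatpm::WebIDL API.
# It accepts statements of the toy form "name value;". On anything else
# it records the offending text and discards input up to the next ';'.
sub split_statements {
  my ($input) = @_;
  my (@ok, @errors);
  pos($input) = 0;
  while (1) {
    $input =~ /\G\s+/gc;                       # skip whitespace, if any
    last if pos($input) >= length $input;
    if ($input =~ /\G(\w+)\s+(\w+)\s*;/gc) {   # well-formed statement
      push @ok, [$1, $2];
    } else {
      my $start = pos($input);
      $input =~ /\G[^;]*;?/gc;                 # recover: skip to next ';'
      push @errors, substr($input, $start, pos($input) - $start);
    }
  }
  return (\@ok, \@errors);
}

my ($ok, $errors) = split_statements('attr a; ??? junk; attr b;');
print "parsed: @$_\n" for @$ok;
print "skipped: $_\n" for @$errors;
```

Because recovery resumes right after the skipped ; character, one broken statement does not prevent later well-formed statements from being parsed.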

Documents

For a description of the functionality provided by each module, see the pod documentation of the module. HTML versions of the pod documentation are linked from the list of modules above.

In addition, there are documents on some specific topics:

Standards supported by WebHACC
List and description of Web standards supported by the WebHACC conformance checker. Although it is documentation for WebHACC, it is also applicable to Whatpm in general (note that WebHACC is an interactive user interface for the conformance checking feature provided by Whatpm).
List of error types
Description of errors reported to callback functions by Whatpm modules.
Selectors object
Description of the data structure for Selectors, as produced by Whatpm::CSS::SelectorsParser (as output) and consumed by Whatpm::CSS::SelectorsSerializer (as input).
List of predefined user data names
List of user data names defined by Whatpm modules.
Handle objects
Description of character or byte stream input handle interfaces.

The following specifications define Whatpm-specific formats and extensions:

SSFT Specification
The specification for the serialization format used for testing Selectors-related modules.
manakai's CSS extensions
The specification for -manakai-* properties and property values implemented by CSS-related modules.
manakai's Selectors extensions
The specification for :-manakai-* pseudo-classes implemented by Selectors-related modules.

Demo

Application

Dependency

Perl 5.8 or later
It is recommended to use a newer stable release of Perl 5.8 (or later).
Some modules require the Encode modules, which are part of the standard Perl distribution.
Modules from manakai-core
Error
Module Whatpm::HTML requires Error, which is bundled in manakai-core.
Message::IMT::InternetMediaType
Module Whatpm::IMTChecker depends on Message::IMT::InternetMediaType, which is part of manakai-core.
Message::URI::URIReference
Modules Whatpm::URIChecker and Whatpm::CacheManifest depend on Message::URI::URIReference, which is part of manakai-core.
Message::Charset::Info
Module Whatpm::ContentChecker depends on Message::Charset::Info, which is part of manakai-core.
Message::DOM::DOMImplementation
Module Whatpm::URIChecker depends on Message::DOM::DOMImplementation, which is part of manakai-core.
Message::DOM::DOMImplementation and related modules
Testing for module Whatpm::ContentChecker depends on Message::DOM::DOMImplementation and related modules in manakai-core. They are not required for any practical use of those modules.
manakai charlib
Module Whatpm::Charset::DecodeHandle depends on modules in manakai charlib for decoding of Japanese character encodings. See the documentation for manakai charlib for more information.
Python, Perl Inline::Python module, and Universal Encoding Detector
For the module Whatpm::Charset::UniversalCharDet to be useful, this software must be installed on the system. See the documentation for more information.
JSON
Testing for modules Whatpm::HTML and Whatpm::CSS::Tokenizer depends on JSON and related modules. They are not required for any practical use of those modules.

Distribution

The development version of Whatpm may be found in the CVS repository.

The latest development version of Whatpm is also available as a tarball.

TO DO

Acknowledgments

Thanks to the html5lib team for their HTML5 parser test data.

Author

Wakaba.

License

Copyright 2007‐2008 Wakaba.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.