Character entity reference-like strings in 2ch threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This directory contains the results of a quick survey of the
usage of character entity reference-like strings
(/&[0-9A-Za-z]+;?/) in a *biased* collection of 28,085 dat files,
as of August 2007, containing 11,332,194 res,
from 2ch and similar BBS Web sites.
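
To make the pattern above concrete, here is a minimal sketch in Python
(not the script used for the survey, which is not included in this
description). The sample text is hypothetical. Note that the pattern
accepts a missing trailing ";" and does not match numeric references
such as "&#169;", because "#" is not in the character class.

```python
import re

# Pattern quoted in the survey description: "&", one or more ASCII
# alphanumerics, then an optional ";".
ENTITY_LIKE = re.compile(r"&[0-9A-Za-z]+;?")

# Hypothetical sample text, not taken from the surveyed dat files.
sample = "a &amp; b &gt;&gt;1 &nbsp x &#169; &;"

# "&#169;" and "&;" are skipped: neither has an alphanumeric right after "&".
matches = ENTITY_LIKE.findall(sample)
print(matches)
```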
* Files
result-all.txt
List of entities sorted by number of occurrences.
result-res.txt
List of entities sorted by number of occurrences, counting
multiple occurrences of an entity in a single res as one.
result-dat.txt
List of entities sorted by number of occurrences, counting
multiple occurrences of an entity in a single thread as one.
all-result.txt
Source data for the result files above, in Perl Data::Dumper output format.
It is a Perl array reference of the form:
[number_of_res, {entity => occurrence_in_res_number},
number_of_threads, {entity => occurrence_in_thread_number},
{entity => occurrence}].
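
The three ways of counting and the five-element layout described above
can be sketched as follows. This is an illustration in Python, not the
original Perl code; the function name `survey`, the input shape (a list
of threads, each a list of res strings) and the sample data are all
hypothetical. It also assumes that the whole matched string, including
"&" and any ";", is used as the key, so "&amp" and "&amp;" would be
counted separately; the actual result files may normalize differently.

```python
import re
from collections import Counter

ENTITY_LIKE = re.compile(r"&[0-9A-Za-z]+;?")

def survey(threads):
    """Count entity-like strings per res, per thread, and in total.

    `threads` is a list of threads, each a list of res strings
    (a hypothetical input shape chosen for this sketch).
    """
    per_res = Counter()     # each entity counted at most once per res
    per_thread = Counter()  # each entity counted at most once per thread
    total = Counter()       # every occurrence counted
    n_res = 0
    for thread in threads:
        seen_in_thread = set()
        for res in thread:
            n_res += 1
            found = ENTITY_LIKE.findall(res)
            total.update(found)
            per_res.update(set(found))
            seen_in_thread.update(found)
        per_thread.update(seen_in_thread)
    # Same element order as the array reference described above.
    return [n_res, dict(per_res), len(threads), dict(per_thread), dict(total)]

# Hypothetical data: two threads, three res in total.
threads = [
    ["&gt;&gt;1 &gt;&gt;2", "a &amp; b"],
    ["&gt;&gt;1"],
]
result = survey(threads)
print(result)
```

In this sample, "&gt;" occurs six times in total but is counted only
twice per res and twice per thread, which is the kind of difference the
three result files are meant to show.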
* Glossary
Dat file
A file representing a thread, which consists of a number of "res".
Formatted HTML documents provided for Web browsers are generated
from dat files. Dat files might contain some HTML markup including
character entity references.
Thread
A sequential collection of messages on 2ch or a similar BBS,
discussing a single topic. A thread is part of a board.
Res
A message posted by a user to 2ch or a similar BBS. A res belongs
to a thread.