Code for the slcon3 "XML damage control" presentation
=====================================================

This repo contains a benchmark of several XML libraries that I have
written for the presentation. Most of the libraries are C ones because
the goal is to compare some of the simplest but most efficient tools
that ease the pain of having to work with XML.

The libraries compared in this benchmark are:

* ezxml
* simple xml (sxmlc)
* mini xml (mxml)
* yxml
* Go's encoding/xml
* Python's elementtree

Compile
-------

You will have to install the Mini-XML (mxml) library somewhere and then
make sure that the compiler can find it by editing the Makefile
(provided the library is not installed in one of the usual places). All
other libraries have been copied into the benchmark programs (each in
its own C file ending in 'lib.c'). Once the mxml library is installed
you can just run the usual make to compile everything.

Run the benchmark
-----------------

To run the benchmark you need the test input XML files, which are a
subset of the Open Access PubMed Central full-text XML files[0]. The
exact subset used can be found in the 'xmldata/subset.txt' file. The
input consists of 10'000 small XML files that have to be copied into
their subdirectories in the 'xmldata' directory (just untar the tar.gz
file found at the link location there).

Once all the input files are in place under 'xmldata/' you can execute
the "runbenchmarks.sh" script to run the benchmark. Each benchmark
program is run and timed 10 times (taking around 45 minutes in total).
The 10 time measurements are appended to log files, so running the
benchmark several times will accumulate more data points. The
"runbenchmarks.sh" script then converts the time measurements to
seconds and runs an R one-liner I found on the internet[1] to print the
mean and the standard deviation of the measurements to
$programname.statistics files.

Bugs
----

Currently only the output of ezxml and goencxml is identical. The other
programs insert some space characters and newlines in places. Ideally
the output of all the programs should be identical. At the moment I
don't have the time to look into where these differences come from, but
I doubt that they influence the benchmarking results in a significant
way.

[0] I used a subset of
    ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/non_comm_use.A-B.xml.tar.gz
    (warning: the file is about 1.2GB in size)
[1] http://stackoverflow.com/questions/9789806/command-line-utility-to-print-statistics-of-numbers-in-linux
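
Sketch: the parse-and-serialize round trip
------------------------------------------

Each benchmark program reads the input XML files with one of the
libraries above and writes the parsed result back out again (which is
why the Bugs section can compare their output). As an illustration
only, a single-file round trip could look roughly like the sketch
below. It assumes the classic mxml 2.x/3.x API and is not one of the
actual benchmark programs in this repo:

    /* roundtrip.c - parse one XML file with mxml and write it back out.
     * Hypothetical example; compile with something like:
     *   cc roundtrip.c -o roundtrip -lmxml
     */
    #include <stdio.h>
    #include <mxml.h>

    int main(int argc, char *argv[])
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s input.xml output.xml\n", argv[0]);
            return 1;
        }

        FILE *in = fopen(argv[1], "r");
        if (in == NULL) {
            perror(argv[1]);
            return 1;
        }

        /* Load the whole document into a tree; MXML_OPAQUE_CALLBACK keeps
         * all text content as opaque strings. */
        mxml_node_t *tree = mxmlLoadFile(NULL, in, MXML_OPAQUE_CALLBACK);
        fclose(in);
        if (tree == NULL) {
            fprintf(stderr, "failed to parse %s\n", argv[1]);
            return 1;
        }

        FILE *out = fopen(argv[2], "w");
        if (out == NULL) {
            perror(argv[2]);
            mxmlDelete(tree);
            return 1;
        }

        /* Serialize the tree again; MXML_NO_CALLBACK means no whitespace
         * callback is used. */
        mxmlSaveFile(tree, out, MXML_NO_CALLBACK);
        fclose(out);
        mxmlDelete(tree);
        return 0;
    }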
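
Sketch: computing the statistics without R
------------------------------------------

The statistics step only needs the mean and the sample standard
deviation of the time measurements in each log file. If R is not
installed, the one-liner[1] could be replaced by a small helper along
the lines of the following sketch (hypothetical, not part of this
repo; compile with -lm):

    /* stats.c - print mean and sample standard deviation of the numbers
     * read from stdin, one per line. Compile with: cc stats.c -o stats -lm
     */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x[4096];
        size_t n = 0;

        /* Read up to 4096 measurements from stdin. */
        while (n < sizeof(x) / sizeof(x[0]) && scanf("%lf", &x[n]) == 1)
            n++;

        if (n < 2) {
            fprintf(stderr, "need at least two measurements\n");
            return 1;
        }

        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += x[i];
        double mean = sum / (double)n;

        /* Sample standard deviation (n - 1 divisor), like R's sd(). */
        double ss = 0.0;
        for (size_t i = 0; i < n; i++)
            ss += (x[i] - mean) * (x[i] - mean);
        double sd = sqrt(ss / (double)(n - 1));

        printf("mean %.3f  sd %.3f\n", mean, sd);
        return 0;
    }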