Monday, November 24, 2003

Theory and XML

XML (eXtensible Markup Language) has become quite a popular data format in recent years. XML roughly corresponds to a tree. For example,

<person><name>Jane</name><major>Computer Science</major></person>

represents a tree. The root having two children, each labeled "person". The first of these children has two children named "name" and "age". The first of those children has a leaf node labeled with the phrase "Harry". For a larger example, see the RSS feed for my weblog.

XML was designed as a flexible way to present documents for later displaying. Since the XML format can be easily produced and parsed, XML also serves as a standardized method for transferring data between databases, far better than the old CSV (Comma-Separated Values) format.

Recently there have been some work on directly manipulating and querying the XML data. As a theorist, this seems like a bad idea, particularly for larger databases. While XML completely represents the underlying tree, it is not a good implementation of that tree. Basic tree operations like parent and sibling are very expensive just looking at XML. About the only thing one can do quickly with XML is depth-first search. Far better to "shred" the data into a better tree implementation like a DOM (Document Object Model) or a full-fledged database and do the work there, rewriting a new XML if needed.

One issue though is when the XML file is on the order of 5-10 GB, a bit larger than what can be stored in the memory of a typical desktop machine. One can stream through the file rather quickly but cannot recreate the tree. This opens up some interesting theoretical questions:

  1. Given a stream of data in XML format, how can one do efficient analysis and manipulations of the underlying tree? I suspect one would want to sometimes shred subtrees, but you cannot determine the size of the subtree until after its been streamed. Perhaps some randomness or streaming the file multiple times might be helpful.
  2. XML might not be the right model of a tree for this purpose. What is the best way to stream a tree or other data structure to allow an efficient implementation of the basic operations of the data structure? Perhaps some redundancy might be useful.

No comments:

Post a Comment