Wednesday, January 04, 2006

The Technology of Academic Papers

The Internet has led to a complete shift in how we deal with storing and sharing information, but when it comes to academic papers the changes we see are ad hoc and added on a piecemeal basis.

Suppose we could start from scratch and create a proper system for research papers. Here is how I would envision such a system.

XML has become the standard for storing information on the internet; it gives a simple machine-readable method for creating tree structures. Academic papers have such a tree structure (sections, subsections, theorems, proofs, etc.) that would lend itself well to XML. Mathematical equations should also be written using XML; we already have the MathML specification for doing this.
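As a rough illustration of that tree structure, here is a minimal sketch in Python. The element and attribute names (paper, section, theorem and so on) are made up for the example and do not come from any existing standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every element and attribute name here is illustrative only.
PAPER_XML = """
<paper>
  <title>An Example Paper</title>
  <section title="Introduction">
    <para>Some introductory text.</para>
  </section>
  <section title="Main Result">
    <theorem id="thm1">
      <statement>Every example has a structure.</statement>
      <proof>By inspection of the tree.</proof>
    </theorem>
    <subsection title="Remarks">
      <para>A closing remark.</para>
    </subsection>
  </section>
</paper>
"""


def outline(element, depth=0):
    """Return one indented line per element, showing the nesting of the paper."""
    label = element.get("title") or element.get("id") or ""
    line = "  " * depth + element.tag + (f" [{label}]" if label else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(PAPER_XML)
paper_outline = outline(root)
print("\n".join(paper_outline))
```

Because the structure is explicit, a program can find every theorem or section without guessing from the layout.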

An academic paper XML file would only have content information, not any formatting information. For formatting we would use XSL files, themselves XML files that describe how to format the document. You would use different XSL files depending on whether the paper is viewed on the screen or printed, and different publishers can develop their own XSL files to have consistent-looking papers. LaTeX, the system used by most theoretical computer scientists, has similar capabilities, but because LaTeX does not enforce any standards, changing style files often requires considerable editing.
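The idea of one content file with several interchangeable styles can be sketched as follows. This is not XSL: the two Python dictionaries below merely stand in for a screen stylesheet and a print stylesheet, and the markup is again hypothetical.

```python
import xml.etree.ElementTree as ET

# Content only: no fonts, margins, or layout appear in the document itself.
PAPER_XML = (
    "<paper>"
    "<title>An Example Paper</title>"
    '<section title="Introduction"><para>Some text.</para></section>'
    "</paper>"
)

# Stand-ins for two stylesheets: each maps an element name to a formatting rule.
SCREEN_STYLE = {
    "title": lambda text: f"<h1>{text}</h1>",
    "section": lambda text: f"<h2>{text}</h2>",
    "para": lambda text: f"<p>{text}</p>",
}
PRINT_STYLE = {
    "title": lambda text: text.upper(),
    "section": lambda text: f"{text}\n" + "-" * len(text),
    "para": lambda text: text,
}


def render(root, style):
    """Walk the content tree in document order and apply the chosen style."""
    lines = []
    for element in root.iter():
        if element.tag not in style:
            continue  # e.g. the enclosing <paper> element has no rule of its own
        if element.tag == "section":
            text = element.get("title", "")
        else:
            text = (element.text or "").strip()
        lines.append(style[element.tag](text))
    return "\n".join(lines)


root = ET.fromstring(PAPER_XML)
screen_output = render(root, SCREEN_STYLE)
print_output = render(root, PRINT_STYLE)
print(screen_output)
print(print_output)
```

Switching publishers or media then means swapping the style, with no edits to the paper itself.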

Researchers will not have to create these XML files directly (unless they want to) but can use word processors that will save the documents according to those standards.

For citations we should just point to a unique identifier for a paper; we should no longer need to cut and paste bibliographic information. The formatting program can go online, based on the identifier, to get the information needed to create a human-readable bibliography, with web links if appropriate. Most publishers already use Digital Object Identifiers (DOIs). We just need DOIs to point to an XML file giving bibliographic information, DOIs for unpublished papers, and a method for a DOI to point to a later version of a paper.
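A minimal sketch of how a formatting program might build the bibliography from identifiers alone. The REGISTRY dictionary stands in for the online lookup, and the identifiers, records, and the superseded_by field are all invented for the example.

```python
# Stand-in for an online service: identifiers and records are made up.
REGISTRY = {
    "example:0001": {
        "authors": ["A. Author", "B. Writer"],
        "title": "A First Result",
        "year": 2004,
        "superseded_by": None,
    },
    "example:0002": {
        "authors": ["C. Scholar"],
        "title": "A Preliminary Version",
        "year": 2005,
        "superseded_by": "example:0003",
    },
    "example:0003": {
        "authors": ["C. Scholar"],
        "title": "The Final Version",
        "year": 2006,
        "superseded_by": None,
    },
}


def resolve(identifier, registry):
    """Follow superseded_by links until the latest version of a paper is reached."""
    seen = set()
    while registry[identifier]["superseded_by"] is not None:
        if identifier in seen:
            raise ValueError(f"cycle in version links at {identifier}")
        seen.add(identifier)
        identifier = registry[identifier]["superseded_by"]
    return identifier


def format_bibliography(cited_ids, registry):
    """Turn a list of cited identifiers into human-readable entries."""
    entries = []
    for number, identifier in enumerate(cited_ids, start=1):
        latest = resolve(identifier, registry)
        record = registry[latest]
        authors = ", ".join(record["authors"])
        entries.append(
            f"[{number}] {authors}. {record['title']} ({record['year']}). {latest}"
        )
    return entries


bibliography = format_bibliography(["example:0001", "example:0002"], REGISTRY)
print("\n".join(bibliography))
```

The paper itself stores only the two identifiers; everything else is filled in at formatting time, including the jump from a preliminary version to its successor.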

The author information on academic papers is often useless (like my postal address) or out of date as academics change locations. Each academic researcher should get their own DOI-like number that points to an XML file giving personal and contact information, and then we need only add these identifiers to the academic papers.

Most importantly, we need to have enforced standards for each of these XML documents (via XML schemas). If we can truly separate the content from the formatting of documents, and make that content available in an easy machine-readable form, not only can researchers focus more on the writing and less on the style, but we will also open the door to applications that we cannot even imagine today.
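To give a feel for what enforcement might catch, here is a small sketch. Python's standard library does not validate XML schemas, so the hand-written rules below only stand in for a real schema; the element names and the rules themselves are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: which child elements each element must contain.
REQUIRED_CHILDREN = {
    "paper": ["title", "section"],
    "theorem": ["statement"],
}


def check(root):
    """Return a list of problems; an empty list means the document passed."""
    problems = []
    # Structural rules: every listed parent must have its required children.
    for element in root.iter():
        for required in REQUIRED_CHILDREN.get(element.tag, []):
            if element.find(required) is None:
                problems.append(f"<{element.tag}> is missing <{required}>")
    # Reference rule: every <cite ref="..."> must match a declared <citation id="...">.
    declared = {citation.get("id") for citation in root.iter("citation")}
    for cite in root.iter("cite"):
        if cite.get("ref") not in declared:
            problems.append(f"<cite> refers to unknown citation '{cite.get('ref')}'")
    return problems


GOOD = ET.fromstring(
    "<paper><title>T</title>"
    '<section><theorem><statement>S</statement></theorem><cite ref="c1"/></section>'
    '<citation id="c1" identifier="example:0001"/>'
    "</paper>"
)
BAD = ET.fromstring('<paper><section><theorem/><cite ref="c9"/></section></paper>')

good_problems = check(GOOD)
bad_problems = check(BAD)
print(good_problems)
print(bad_problems)
```

A word processor or submission system could run checks of this kind on save and refuse documents that fail them.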


  1. An academic paper XML file would only have content information, not any formatting information.

    This may be reasonable for the text of a document, but I don't think it is for the illustrations. Illustrations can be embedded into an XML document as SVG, of course, but in creating the illustration one needs to know something about the font of the document (to match it in the text within the illustration), the width of the document area, the availability of color (that is, do I need to fall back on black/white shading conventions, unless we're positing a nonexistent ideal world where all papers are read and printed in full color), etc.

  2. Each academic researcher should get their own DOI-like number... I hope this will never happen. We are human beings, not machines. On top of this, regular names have an element of redundancy in them that helps with error correction.

  3. Another alternative would be to use the MediaWiki syntax for documents. All you'd need is to create a server that has MediaWiki installed on it with some additional programs to generate PDFs. For images there could be a way to supply both print and screen versions.

    The MediaWiki syntax (which is what Wikipedia uses) is more natural than LaTeX, and far more natural than XML. (This server could output XML as well as PDFs, so it could be extensible that way.) A bonus is that all formulas are written using LaTeX conventions... I think most scientists who publish papers find the TeX/LaTeX syntax to be easy to use (though hard to learn).

    If all files are stored (or referenced) centrally then people could also use formatting conventions consistently. No more cases of some complexity classes appearing in blackboard bold in some papers, and in MathCal in others.

  4. Recently Lance and I were dealing with the exact specs of some journal, which was annoying. His reaction was to write this weblog post. Ironically, I disagree with him. I'm for a free-for-all. Authors should be able to use whatever LaTeX or other style files they want to. This will work for e-journals, which are the wave of the future anyway. I grant it will NOT work for non-e-journals. If my margins and biblio differ from yours, who cares?


  5. Having people write LaTeX and converting automatically to XML for archival purposes makes no sense, as the conversion can only lose information. What is stored should be as close as possible to the human input, and the language used should encourage people to include helpful semantic information. LaTeX does a pretty good job of this, actually, and I think most of its problems could be best handled by adopting a further standard building on LaTeX. You could start from scratch, but you'd have trouble getting people to adopt the new standard.

    A unique identifier for each researcher would be helpful, but I agree with "anonymous" that it should not be a number. It is just as easy to use words: I suggest the researcher's full name, plus an additional word or phrase to be used in case of redundant names. Just think, every researcher would have their own personal motto!

    An alternative, less exciting, disambiguation would be the identifier of your thesis advisor. It might be a problem for graduate students who switch advisors, though, and I guess there's always the possibility of one person having two students with the same name.

  6. One day in the not-so-distant future computers will write all the proofs. Until then, let's keep our papers like stories, emphasizing substance _and_ style. At the moment, the only thing we have going for us over the machines is this unquantifiable notion of creativity. Don't be embarrassed.

  7. A few points:

    I think the more standard way to format web pages today is XML+CSS, or just XHTML+CSS. This is very powerful (it encompasses a lightweight analog of PowerPoint, for example), but not as powerful as LaTeX macros. This guy has done some amazing things in math with XML+CSS, but I don't quite see the improvement w.r.t. MathML.

    MathML has both "presentational" elements and "semantic" elements; the latter do encode more information about the intended meaning of formulae than LaTeX does; there's a discussion here, for example.

    MathML is excruciating to enter by hand, and many people (like me) dislike WYSIWYG editing of math. It's difficult to see how something TeX-like, such as ASCIIMathML, could be avoided. This defeats some of the purposes of MathML, unfortunately, such as the semantic markup.

    MediaWiki handles many simple formatting tasks elegantly, but I don't think it does anything in particular for math.

  8. BTW, another very interesting recent result in algorithm design has been obtained by Hajiaghayi, Kortsarz, and Salavatipour, who obtained a polylogarithmic approximation algorithm for non-uniform multicommodity rent-or-buy, a problem which was open for several years.
    This was another important result in 2005.

  9. I believe that the spirit of this is well placed, but I don't entirely agree with the implementation. Your post on fonts some time ago provides further motivation on this topic.

    I agree with the principle of separation of data and presentation layer as this is a common foundation of applications development in IT. I further agree that XML/XSLT/XSD is useful, but I think that much of this proposal is a document-centric approach. I think that the academic community is better served through the use of the information and the methods by which both discovery and browsing can occur. If the focus is placed squarely on the usage of the information from the professional's perspective, I think that the requirements for this work would become much more clear.

    That said, it is important to take first steps in these matters, and the operational issues, e.g. user acceptance of standards, etc., are far more difficult than any technical issues. That LaTeX and other standards may play a reduced role, for example, could be either a catalyst for change, or a retardant for a move in other directions.

    I think that the issues you raise are critical, but, if I may, I'd like to touch on a few key points that I think are important on the technical side:

    1. You wrote:

    ". . . XML has become the standard for storing information on the internet;. . ."

    I disagree. XML has become pervasive because it is the foundation for many web technologies. But storage management is not XML's strength. It has been used for messaging, contract-first development, WSDL, etc. If the view is document-centric, i.e. the document must be the artifact for the transaction (as transcribed by an author or authors), then this is acceptable. This may be reasonable for the entry of the information into a body of knowledge (be it arXiv or ECCC or any other), but this is less than optimal for most other characteristics of the knowledge.

    XML, when used in a manner for storage management, is nothing more than a set of nested relations. The storage management of those relations, the ability to query those relations and the ability to maintain those relations is brittle in XML. There are many examples of this, but I'll provide a common problem: cross-reference indices in many-to-many relationships. XML is not able to easily portray information that has a many-to-many multiplicity relationship. Furthermore, any queries done against any relations involved in the many-to-many are not optimal, because the XML is unable to provide for a query plan that is optimized for usage, i.e. one that involves indices on XML elements (in a fashion similar to B-Trees or other methods in large relational databases) that may be needed. XPath and XQuery have begun to mature and have most of the relational algebra operators available in various forms, but true to much of the literature in computational complexity starting with Garey and Johnson and leading through to today, many of the necessary operations for the querying of the data lead to problems whose cost is in NP.

    2. XML is an unordered storage mechanism. XML is based on set theory and as such is an unordered set of objects, or XML elements. If order is to be maintained, i.e. sections or other collateral of a document, the XSD (or other schema definition languages such as DTD or XDR; of these XSD is superior in all respects, but these cannot be ignored) would need to reflect the appropriate metadata to maintain these characteristics. While authoring tools could simplify these types of tasks, they are unlikely to be trivial in the target XML representation and its corresponding XSD.

    3. XML is less than sufficient for large non-text data. As mentioned earlier by 0xDE, there are difficulties in the management of illustrations or other information. DIME, MTOM and other standards have come and gone on this topic. XML remains naked on this issue, especially when the topic of storage management again is considered. To separate the "large objects" from the other data in XML is not difficult in physical implementation, but the referential integrity of the data, even of a 3NF or BCNF XML schema is put at risk for transactional management of data, particularly in storage management.

    4. XSD is a weak grammar. At best, XSD provides formalism for basic schema definitions, but falls short in key areas. Strongly typed data can be defined in XSD, but the onus is on the XSD validation tools to enforce the rules. Multiplicity is not automata based, i.e. it is fixed for a given XML element and cannot portray information that has transitive dependencies (that may be necessary in a particular schema). For example, consider the following XML snippet of a document (and I apologize in advance if this XML does not render correctly; replace parentheses with XML brackets):

    . . . We find the result, due to Blum, Cohen and Von Neumann in Van Leeuwen [(CitationReference RefID="1" /)] . . .
    (Citation RefID="1" DOI="X23477" /)
    (Citation RefID="2" DOI="E1490" /)

    Note that DOI X23477 has transitive dependence on author identifiers (that are not included in the XML snippet). If we include the author identifiers, we overspecify the set of data and leave the data in either 1NF or 2NF instead of (the minimal competence) 3NF or BCNF. If we omit the author data, as I have shown in this example, then the document does not contain complete information and a further query is needed against a database that has author information that is dependent solely on the DOI or other attribute. The trade-off is "encapsulation of the document with potentially stale data" vs. "on-demand queries (potentially to web services or other WS-*) that require internet connectivity when reading a document". Ouch.

    Additionally, the citations section of a document may contain "0 to unbounded" citations (in its XSD). But if within the document, a reference to a citation is entered in the body, it must be checked for domain integrity, i.e. if it contains a valid RefID that is contained in the Citations element. Again, an authoring tool may provide assistance, but the XSD is unable to provide for rules in domain and referential integrity. An authoring tool is hardly the place for the implementation of referential or domain integrity of a schema (XSD) on a data store (XML document), but that is the current state of affairs in the world of XML. Transaction managers provide these services in large relational databases.

    5. The Semantic Web (TSW), while less than robust, provides both lessons and direction.

    The use of OWL (and its predecessor DAML/OIL) has shown some reasonable progress and success. OWL, based on First Order Logic, is a useful query language for ontologies. While I feel that it represents better perspectives on the relationships of data, it is still weak in its foundations. It provides for no usage of temporal or modal logics. The example mentioned earlier, i.e. the current contact information of an author, is one where temporal logic would provide some value. Your Moons and Planets post is an example where the 12 citations of Dinur's PCP would change to 33 citations one year later. As the body of knowledge improves, so do the attributes of the body of knowledge, fully under the referential and domain integrity of the collection of knowledge.

    The RDF standards for TSW also provide for a more robust method of information exposure. But the subscription based approach, much like RSS, is an example of an Enterprise Integration Pattern (see Hohpe and Woolf) that provides for a publish/subscribe pattern with constraints on date/time and nothing more. These are typically insufficient conditions for requirements for the professional that must consume the data in these documents.

    6. Requirements for usage of the knowledge as well as the cost to construct should be considered.

    Among many constraints of a body of knowledge, the most difficult is the ability to find useful information, browse the useful information, interact in some form with the useful information and finally consume the useful information.

    It is my firm belief that the academic community, in the defense of any discipline be it a branch of mathematics or an area of computing, has the difficulty of knowing whether the body of knowledge is correct and complete (research of dubious quality has many ways in which to invade this body of knowledge). While human interaction is currently important in this work, the amount of information that can be constructed that relates materials will become more and more important over time. Relational Theory provides the core storage and access management foundations, but there remains much to be done. As mentioned in 5 - TSW, this is still in its infancy.

    For example, it is important to know the author, the author's history of publications and other information. But keyword searches are less adequate when attempting to find an idea and the publications that relate to an idea.

    7. XML is information-theoretically expensive, i.e. the encoding of information is not efficient/optimal.

    Researchers should not be burdened with much of this information. They should be free to explore and transcribe the research that they perform. To provide tools that encumber that process of transcription, or worse, distract, distort or retard the progress of the research process is to ask researchers to waste brain cycles on peripheral activities. Quality of research and quality of life for the researcher are improved with better tools and infrastructure.

    Sorry for hogging space, but your topic is important and I hope these opinions add rather than detract from the topic at hand.

    Thanks for your time.

  10. These ideas certainly seem to be along the right lines. As an interim measure, LaTeX-to-XML converters could be used.

    On a related note, has anyone tried to use LyX to write a serious technical paper? Although it produces LaTeX rather than XML, it certainly seems to be along the lines Lance was talking about with its WYSIWYM (M stands for "Mean") rather than WYSIWYG philosophy. I have experimented with it, but I always end up converting the file to LaTeX and editing it by hand. Certain features, like the ability to enter raw LaTeX commands for constructing formulas, need to be improved, but it already beats monstrosities like Scientific WorkPlace.

  11. Each academic researcher should get their own DOI-like number that points to an XML file giving personal and contact information

    That's great! I always wanted to be a digital object.