Structured text

Dictionaries and other products of lexicography are probably the most complex structured texts around. True, they don’t tend to be very large (a few hundred megabytes at most), but they contain a great many elements. Tagged corpora are similar, though their structure works very differently and tends to be less complex.

Today, XML is the most common way to mark up structured text documents. It’s kind of a lousy format though.
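To make the contrast between a dictionary entry and a tagged corpus concrete, here is a small sketch in Python. Both XML fragments and all element names (`entry`, `sense`, `subsense`, `w`, and so on) are invented for illustration and do not follow any particular schema. The point is only that the entry nests deeper and uses more kinds of elements than the flat, token-per-element sentence.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary entry; the element names are made up for this sketch.
ENTRY = """
<entry>
  <headword>bank</headword>
  <pos>noun</pos>
  <sense n="1">
    <def>the land alongside a river</def>
    <example>We sat on the bank.</example>
  </sense>
  <sense n="2">
    <def>a financial institution</def>
    <subsense>
      <def>a building housing one</def>
    </subsense>
  </sense>
</entry>
""".strip()

# Hypothetical tagged-corpus sentence: flat, one element per token.
SENTENCE = """
<s>
  <w pos="PRON">We</w>
  <w pos="VERB">sat</w>
  <w pos="ADP">on</w>
  <w pos="DET">the</w>
  <w pos="NOUN">bank</w>
</s>
""".strip()


def depth(elem):
    """Return the nesting depth of an element tree (a leaf has depth 1)."""
    return 1 + max((depth(child) for child in elem), default=0)


def distinct_tags(elem):
    """Return the set of element names used anywhere in the tree."""
    return {e.tag for e in elem.iter()}


entry = ET.fromstring(ENTRY)
sentence = ET.fromstring(SENTENCE)

print("entry:   ", depth(entry), sorted(distinct_tags(entry)))
print("sentence:", depth(sentence), sorted(distinct_tags(sentence)))
```

Even in this toy case the entry is four levels deep with seven kinds of element, while the sentence is two levels deep with two. A real dictionary would presumably widen that gap considerably.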

An interesting research project, which to my knowledge nobody has really tackled yet, is inferring structure (and descriptive markup) from presentationally marked-up documents. As a human I can parse the structure of any document just from its presentation; computers should be able to do the same. The people who should be working on this are probably still on their high horses trying to get everyone to use descriptive markup in the first place, even though that might ultimately be something of a dead end in information processing.
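As a very rough illustration of what "inferring descriptive markup from presentation" could mean, here is a toy rule-based sketch in Python. It assumes a hypothetical house style in which bold marks the headword, an italic abbreviation marks the part of speech, and everything else is the definition. The function name `infer_entry`, the abbreviation list, and the sample line are all made up. A serious attempt would have to learn such conventions rather than have them hard-coded, so this is not a solution to the research problem, only a picture of its input and output.

```python
import re

# Assumed (hypothetical) house style:
#   <b>...</b>  -> headword
#   <i>...</i>  -> part of speech, if it is a known abbreviation
#   plain text  -> definition
POS_ABBREVIATIONS = {"n.", "v.", "adj.", "adv."}

# Match either a bold/italic span or a run of plain text.
TOKEN = re.compile(r"<(b|i)>(.*?)</\1>|([^<]+)")


def infer_entry(line):
    """Map one presentationally marked-up line to a descriptive record."""
    entry = {"headword": None, "pos": None, "definition": ""}
    plain = []
    for match in TOKEN.finditer(line):
        tag, marked, text = match.groups()
        if tag == "b" and entry["headword"] is None:
            entry["headword"] = marked.strip()
        elif (tag == "i" and entry["pos"] is None
              and marked.strip() in POS_ABBREVIATIONS):
            entry["pos"] = marked.strip()
        else:
            # Italic used for emphasis, repeated bold, or plain text:
            # treat all of it as part of the definition.
            plain.append(marked if marked is not None else text)
    # Collapse runs of whitespace left behind by the removed markup.
    entry["definition"] = " ".join("".join(plain).split())
    return entry


sample = "<b>bank</b> <i>n.</i> the land alongside a <i>river</i>"
print(infer_entry(sample))
```

Note that the second italic span is not a known abbreviation, so the sketch folds it into the definition. That is exactly the kind of ambiguity (the same presentation carrying different meanings) that makes the general problem hard.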