Markup Languages for Writers

Adam Carter


May 31, 2024. 03:18 PM

There are two kinds of text files: plain text and rich text. Plain text is the simplest and oldest of all formats, and one of the earliest types of computer files. It’s a simple sequence of bytes (zeroes and ones), defined by a file name, a beginning, and an end. Its characters are encoded with some kind of “binary alphabet,” such as ASCII, one of the Unicode encodings (UTF-8, for example), and many others.
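To make the idea of a “binary alphabet” concrete, here is a small Python sketch that shows the bytes behind a short word under two encodings. The sample word is only an illustration.

```python
# Illustrative only: the same text becomes different bytes under different encodings.
text = "Café"

utf8_bytes = text.encode("utf-8")      # é takes two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # é takes one byte in Latin-1

print(list(utf8_bytes))    # [67, 97, 102, 195, 169]
print(list(latin1_bytes))  # [67, 97, 102, 233]

# ASCII has no code for é, so the encoding fails loudly.
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print("not representable in ASCII:", err.reason)
```

This is why a plain text file is only readable when the program opening it assumes the same encoding that was used to save it.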

Plain text is good for creating very simple messages. But when a text needs any depth and complexity, such as titles, quotes, example notes, or meta-data (author name, creation date, version, etc.), plain text becomes too limited.

To solve this problem, early computer scientists came up with structured text. Structured text files were still plain text, but they contained marks and symbols that indicated the structure, and the function of each line in relation to the text.

One of the earliest standards to define structured text was the markup language SGML. A simplified, but highly powerful, markup language called XML was later derived from it. You have probably seen it before:

<?xml version="1.0" encoding="utf-8" ?>
<note>
  <title>Remember these things!</title>
  <to>John Doe</to>
  <from>Jane Dane</from>
  <content>
    <paragraph>
      Don't forget to water the plants and feed the cat. The
      parrot will try to bite you.  We will be  back  before
      Friday.
    </paragraph>
    <paragraph>
      Thank you again for your help!
    </paragraph>
  </content>
</note>

“Markup” refers to the tags that are inserted between the text, such as <note>. A pair of tags, together with the text it encloses, forms an element. By inserting markup, one can define that the text within these tags is supposed to be rendered bold, for example (<b>bold text</b>), or that it’s supposed to be a paragraph (<p>...</p>). When markup defines what text should look like, the markup is said to be visual. But markup can also say what the text is, without defining what it looks like. In this case, the markup is said to be semantic.
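The practical benefit of semantic markup is that a program can ask for parts of the text by meaning. As a minimal sketch, the following Python snippet reads a shortened copy of the note and pulls out its title, recipient, and paragraphs using the standard library’s XML parser. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the note, kept inline so the example is self-contained.
NOTE = """<note>
  <title>Remember these things!</title>
  <to>John Doe</to>
  <from>Jane Dane</from>
  <content>
    <paragraph>Don't forget to water the plants and feed the cat.</paragraph>
    <paragraph>Thank you again for your help!</paragraph>
  </content>
</note>"""

root = ET.fromstring(NOTE)

# Each piece of text is retrieved by what it is, not by how it looks.
title = root.findtext("title")
recipient = root.findtext("to")
paragraphs = [p.text for p in root.iter("paragraph")]

print(title)            # Remember these things!
print(recipient)        # John Doe
print(len(paragraphs))  # 2
```

A converter works the same way: it looks up the title element to fill in the book’s title, no matter how that title happens to be displayed.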

Rich text comes from a combination of plain text and markup signs. Older rich text formats such as RTF and Microsoft’s doc format followed this idea: visual markup indicated that a certain piece of text was supposed to look a certain way, and the software (Windows WordPad or Microsoft Word) would show the text accordingly on the screen.

Many old rich text formats, including Word’s and those of its competitors, used non-standard binary markup instead of plain text tags. Word, for example, stored a bunch of binary information alongside the text to describe its formatting, and for a long time only Word could reliably decode this information. That way, Microsoft ensured the incompatibility of its documents with its competitors, and forced users to stick to Word.

But, many communities of users and programmers pushed to promote a universal markup language that could be decoded by anyone, and be human readable, if needed. The XML system became an important standard, and even important word processors nowadays, like Microsoft Word, use a format called docx, which is based on XML (an open standard), as the main rich text format.
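You can check the XML inside a docx file yourself, because a docx file is a ZIP package whose main text lives in a part named word/document.xml. The sketch below, in Python, reads the paragraphs out of such a package. To stay self-contained it builds a tiny stand-in package in memory, so the sample content is hypothetical and far simpler than what Word actually writes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(docx_file):
    """Return the text of each paragraph found in word/document.xml."""
    with zipfile.ZipFile(docx_file) as package:
        xml_bytes = package.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    return [
        "".join(t.text or "" for t in p.iter(W + "t"))
        for p in root.iter(W + "p")
    ]

# Hypothetical stand-in for a real file: a tiny package built in memory.
# A real .docx contains more parts (content types, styles, relationships).
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Chapter One</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>It was a dark night.</w:t></w:r></w:p></w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", SAMPLE)
buffer.seek(0)

print(docx_paragraphs(buffer))  # ['Chapter One', 'It was a dark night.']
```

With a real manuscript, you would pass the path of the docx file to docx_paragraphs instead of the in-memory buffer.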

I’m always shocked to see that many authors nowadays don’t know the difference between rich text and plain text, and don’t know a thing about markup languages. The raw material of their entire career is words, and almost always, these words are written on a computer. How can they use something without understanding its basic functionality? It’s extremely important for any author to understand how text works on a computer, how it’s represented, and how the text structure and visuals are denoted in a text file.

Most rich text editors nowadays are “what you see is what you get” software: the screen tries to emulate the final result, showing bold, italic, line heights, paragraph distances, and other visual elements of the text. So, the user clicks a little button, and suddenly, the text is bold. Under the hood, the program is adding markup signs to indicate this. But, users don’t see the markup directly.

Yet, they need to know how these types of text work. Otherwise, if they ever run into a problem with the file, or a difficulty in its automatic conversion, they will have no idea how to solve it.

A common problem, for example, is defective conversion. Imagine that you have a complex book with four levels of headers, bulleted and numbered lists, a bibliography, images, graphics, footnotes, cross references, and a bunch of other structural elements. You write this document without a clue about how the word processor handles the markup underneath. Then, you pass the document through an automatic converter to transform the text into EPUB, so it can be read on your e-reader. After the conversion, you find out that the document is completely scrambled. There are crazy extra lines, images are all over the place, and the conversion system didn’t recognize your table of contents, so you have no way to jump to a given page or chapter.

Or, even worse, you have a novel written in docx format, and then you have to convert it to a bunch of formats to sell in different stores. The automatic converter generates a new document for each format, but when you check the document, you realize that its meta-data are all wrong (title, author name, etc. are messed up), and the converter failed to generate a table of contents. What do you do?

Errors in conversion happen because the computer didn’t understand the original markup of the text. Markup must be machine readable, so it needs to be precise. Computers don’t handle errors very well. If a computer sees a tag that is out of place, it cannot tell whether the tag is right or wrong. It will read literally what it sees and generate an output accordingly.

This is why XML is so strict, and any validation tool will scream at you if you put a single tag without closing it. This strictness is a good thing, because it will prevent errors when parsing and processing with any software.
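As a small illustration of this strictness, the Python sketch below feeds two versions of a note to the standard library’s XML parser. The helper name is_well_formed is made up for this example, and the check only covers well-formedness (tags properly opened and closed), not validation against a schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<note><to>John Doe</to></note>"))  # True
print(is_well_formed("<note><to>John Doe</note>"))       # False: <to> is never closed
```

The parser refuses the second document outright instead of guessing what the author meant, which is exactly the behavior that keeps later conversions predictable.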

One notable exception is HTML, a markup language derived from SGML (like XML), but specifically created to serve as the markup language behind web sites. When creating a site, you will write HTML, and the browser will interpret the content. But, if the author messed up and forgot to close tags, the browser is programmed to ignore the mistakes and render the rest of the page, and insert extra tags, “guessing” where the closing tags should be.
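The contrast with XML can be sketched with Python’s built-in HTML parser, which is tolerant in a similar spirit. Note the assumption here: this parser only reports the tags it sees and does not repair the page the way a browser does, so the example shows the lenient attitude, not the browser’s actual recovery rules. The class name TagCounter is made up for this example.

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Record opening and closing tags without judging whether they match."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)

sloppy = "<p>First paragraph<p>Second <b>paragraph"

counter = TagCounter()
counter.feed(sloppy)   # no exception, despite three unclosed tags
counter.close()

print(counter.opened)  # ['p', 'p', 'b']
print(counter.closed)  # []
```

Three tags were opened and none were closed, yet nothing complained. That silence is what lets markup mistakes survive until conversion time.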

This controversial choice during the early days of the internet ensured that web authors would still be able to publish their sites, despite making small mistakes. As such, web authors wouldn’t need to be masters in writing markup text, and wouldn’t have to be super precise. The web became more accessible because of this.

But, a fiction author who is creating a document that will be converted automatically into half a dozen formats cannot afford to make these kinds of mistakes. Nor can such an author fully rely on the flawed systems under the hood of word processors like Word.

What are the best options, then? How do you write a book or novel that is structurally sound, precise, and can be easily interpreted by any computer, and converted by any tool into whatever format you want, without any mistakes?

The advantage in having such a system is undeniable. Imagine if all you needed to do to generate your novels, ready and perfectly structured for publication, was to type a single command in the terminal and magically watch as the computer performs the job that half a dozen people used to perform in pre-computer times. From a single document, you create dozens of documents in formats like EPUB, PDF, MOBI, and whatever else you need.

So, let’s examine the options and their pros and cons:

1. HTML

HTML is the format that powers the web. You can write HTML code with any text editor, and you can easily learn the tags, and their proper order, with online tutorials. While the format is mainly used to create websites, you can also write novels and structured books with it, and automatically convert the content to other formats.

Being so universal, HTML can be converted and interpreted by almost every program out there. You can also view your book with a simple web browser, and read it without the markup tags getting in your way.

But, the biggest disadvantage is that HTML is not strict. You can write wrong code and still render your text on the browser. As such, you may make markup mistakes and miss them until it’s time to convert. That’s when you find out that the converter is improperly parsing your text.

To solve this problem, you can use XHTML, which is essentially HTML 4 (the older version) rewritten to follow the rules of XML. As such, if you write XHTML code improperly and the file is served or opened as XML (for example, with the .xhtml extension), the browser will scream at you and will refuse to render your page. It will prevent you from making markup mistakes.

XHTML and HTML also have an important problem: they lack proper tags to indicate elements that are part of a book publication, such as footnotes, bibliography, and cross references. All the language offers are links, but automatic converters always interpret them as “simple links.” The parsers do not differentiate between an external link, a cross reference, and a bibliographic reference.

2. DocBook

DocBook is a markup language based on XML, like XHTML. It’s as strict as XHTML, and the validator will scream at you if you forget to close a tag. But, unlike HTML, it has a bunch of elements specifically designed for book publication: footnotes, bibliographic references, cross references, external links, admonitions, and many others. It also lets you nest sections to any depth, so you are not limited to the six heading levels of HTML.
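To show what these dedicated elements buy you, here is a hedged Python sketch that parses a very small DocBook 5 article and tells footnotes, cross references, and external links apart. The article content, ids, and URL are invented for illustration. The element names (footnote, xref, link) and the two namespaces are the standard DocBook 5 ones.

```python
import xml.etree.ElementTree as ET

DB = "{http://docbook.org/ns/docbook}"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample document.
ARTICLE = """<article xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
  <title>Sample chapter</title>
  <section xml:id="intro">
    <title>Introduction</title>
    <para>Plain text is old.<footnote><para>Older than the web.</para></footnote>
    See <xref linkend="tools"/> or the
    <link xlink:href="https://example.org/">project site</link>.</para>
  </section>
  <section xml:id="tools">
    <title>Tools</title>
    <para>Converters go here.</para>
  </section>
</article>"""

root = ET.fromstring(ARTICLE)

# Each kind of reference has its own element, so nothing has to be guessed.
footnotes = len(list(root.iter(DB + "footnote")))
cross_refs = [x.get("linkend") for x in root.iter(DB + "xref")]
external = [l.get(XLINK_HREF) for l in root.iter(DB + "link") if l.get(XLINK_HREF)]

print(footnotes)   # 1
print(cross_refs)  # ['tools']
print(external)    # ['https://example.org/']
```

In HTML, all three would most likely have been plain links, and a converter would have no reliable way to treat them differently.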

While web browsers cannot display DocBook directly, you can still read it using a Linux program called yelp, which parses the document and renders it visually. DocBook can also be converted to a great number of formats using programs like pandoc, or using universal XML tools like xsltproc combined with conversion stylesheets (in the XSL format).
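A conversion run with pandoc might be scripted roughly as follows. This is a sketch under assumptions: the file names book.xml, book.epub, and book.html are hypothetical, pandoc is assumed to be installed, and the script only prints the commands when pandoc or the input file is missing. The pandoc options used (--from, --standalone, --output) are standard ones, and pandoc picks the output format from the target file’s extension.

```python
import os
import shutil
import subprocess

def pandoc_command(source, target, source_format="docbook"):
    """Build a pandoc command line; the output format follows target's extension."""
    return ["pandoc", "--from", source_format, "--standalone", "--output", target, source]

def convert(source, targets):
    """Run pandoc once per target and return the commands that were built."""
    commands = [pandoc_command(source, target) for target in targets]
    if shutil.which("pandoc") is None or not os.path.exists(source):
        return commands  # nothing to run; just report what would be done
    for command in commands:
        subprocess.run(command, check=True)
    return commands

# Hypothetical file names.
for command in convert("book.xml", ["book.epub", "book.html"]):
    print(" ".join(command))
```

Keeping the command construction in its own function makes it easy to add more target formats later without touching the rest of the script.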

The biggest disadvantage of DocBook is its complexity and strictness, which demand a bit of a learning curve. You need to get a good reference of the elements, and learn the rules of what goes inside of what, and what the point of each element is. You will also need to learn some basic concepts of XML, like namespaces, attributes, and validation using xmllint. But, once you learn these things, any basic Linux system with a terminal and a handful of apps will be the only tool you’ll ever need to generate any kind of highly structured, perfectly built book in any format you want.

3. Lightweight Markup Languages

The main alternatives to HTML and DocBook are the so-called lightweight markup languages. The two main languages of this type are Markdown and AsciiDoc.

In practice, these languages are simplified “shortcut” ways to write HTML and DocBook respectively. With these languages, you don’t need to open and close tags. You only follow a few simple rules, like writing a hash symbol before a title to represent the <h1> tag.
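The “shortcut” idea can be illustrated with a toy converter in Python. This is not a real Markdown parser, and the function name toy_markdown is made up: it only understands hash-prefixed headings and plain paragraphs separated by blank lines, which is enough to show how a simple rule stands in for a pair of tags.

```python
import html
import re

# One to six hash symbols, some whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def toy_markdown(text):
    """Convert '#' headings and plain paragraphs to HTML. Not a real Markdown parser."""
    blocks = []
    for block in text.strip().split("\n\n"):
        # Join the lines of a block into a single line of text.
        block = " ".join(line.strip() for line in block.splitlines())
        match = HEADING.match(block)
        if match:
            level = len(match.group(1))
            blocks.append(f"<h{level}>{html.escape(match.group(2))}</h{level}>")
        else:
            blocks.append(f"<p>{html.escape(block)}</p>")
    return "\n".join(blocks)

print(toy_markdown("# Chapter One\n\nIt was a dark night.\n\n## The Storm"))
# <h1>Chapter One</h1>
# <p>It was a dark night.</p>
# <h2>The Storm</h2>
```

Because the converter writes both the opening and the closing tag, the author never gets a chance to forget one.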

Markdown’s markup directly translates to HTML. AsciiDoc, which is a more complex system, directly translates to DocBook. The biggest advantage of both of them is that you don’t need to worry about validating the markup yourself. As long as you follow the rules defined by each of these languages, the converter will generate fully valid markup for you.

A Markdown file will convert cleanly to HTML, and an AsciiDoc file will convert cleanly to DocBook.

Nowadays, there are programs that convert Markdown and AsciiDoc to many other formats, not necessarily to their specific counterparts. As such, you can convert Markdown to DocBook, and AsciiDoc to HTML, among many other things.

Conclusion

In practice, AsciiDoc and Markdown are the two best choices for writing. They create a very human-readable text file, and are very easy and simple to learn and convert.

Out of these two, AsciiDoc is best if you’re writing something extremely large and complex. For example, if your book has loads of footnotes, cross references, images, examples, and admonitions, Markdown is not going to cut it.

Once you write your document with AsciiDoc, you can convert it to many formats using Asciidoctor, a powerful open source application. It can convert to DocBook, and then you can check the markup, add meta-data inside the <info> element, and use xsltproc and good XSL stylesheets to generate other formats.

But, AsciiDoc and Asciidoctor have important limitations. Being more powerful, AsciiDoc is more complex to learn. It’s less tolerant of mistakes, and the converter will scream at you if you try to parse an invalid file. Lastly, it’s less known and used, and there is less support for it online. Many automatic tools don’t recognize AsciiDoc and don’t know how to handle it.

That’s why, if you’re writing a simpler text, like a novel, or a simple book with images, simple lists, links, and up to six levels of nested sections, Markdown is your friend. This format is fully supported by most tools out there, especially pandoc, which is a powerful universal markup converter.

From a Markdown file, you can generate pretty much anything, from a website to a PDF. Markdown, as handled by pandoc and similar tools, also offers support for meta-data, where you write title, author name, version, and much more. The only thing that Markdown doesn’t offer, and AsciiDoc offers quite well, is the “include macro”: the possibility of breaking down a big file into smaller files, and inserting them into an index file using the macro include::src/to/file.txt[] (the trailing square brackets are required).
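To make the include mechanism less abstract, here is a toy Python sketch of what a converter does when it meets an include line. It is only an illustration: Asciidoctor handles includes itself, the function name expand_includes is made up, a dictionary stands in for the file system, and the chapter file names are hypothetical. The line format it recognizes, include::path[], follows AsciiDoc’s syntax.

```python
import re

# Matches a whole line such as: include::chapters/one.adoc[]
INCLUDE = re.compile(r"^include::(?P<path>[^\[]+)\[(?P<attrs>[^\]]*)\]\s*$")

def expand_includes(name, files, depth=0):
    """Return the text of files[name] with include lines replaced by file contents."""
    if depth > 10:
        raise RecursionError("includes nested too deeply")
    output = []
    for line in files[name].splitlines():
        match = INCLUDE.match(line)
        if match:
            # Included files may themselves include other files.
            output.append(expand_includes(match.group("path"), files, depth + 1))
        else:
            output.append(line)
    return "\n".join(output)

# Hypothetical book split into chapters; a dict stands in for the file system.
FILES = {
    "book.adoc": "= My Novel\n\ninclude::chapters/one.adoc[]\n\ninclude::chapters/two.adoc[]",
    "chapters/one.adoc": "== Chapter One\n\nIt was a dark night.",
    "chapters/two.adoc": "== Chapter Two\n\nThe storm passed.",
}

print(expand_includes("book.adoc", FILES))
```

The index file stays a few lines long while each chapter lives in its own file, which is what makes very large manuscripts manageable.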

So, you’ll probably be perfectly fine with Markdown. This is my ultimate recommendation. Nevertheless, it’s important to know HTML, XHTML, and DocBook, since these will be the intermediary formats that you will be using to create PDF, EPUB, and web formats.