IndieWebCamp is a 2-day creator camp focused on growing the independent web

databases-antipattern

Contents

databases for storage

Using databases for the primary storage of your content is a known antipattern.

Typical CMS's (e.g. WordPress, Drupal, MediaWiki etc.) all use databases for primary storage of content. Most often MySQL, sometimes others like MongoDB.

Databases are all a pain to maintain (i.e. highly human-time inefficient - see DBA tax below), and more fragile[1] than the file system.

For any content you care about it, don't put the primary copy in a database.

DBA tax

(this likely deserves its own canonical definitional article)

The term DBA tax has been introduced to quickly reference the extra overhead you as a person incur by depending on any system that uses a database for primary storage. If you maintain any such system, you have to spend some amount of your time being a database administrator (DBA), hence that time spent is referred to as the DBA tax that you are incurring.

In particular, a database is yet another space of things.

Everyone already deals with a file system. Why deal with yet another space?

The following are aspects of the DBA tax - hassles of:

  • setting up a database
  • setting up doing tables
  • setting up username/login that is database specific
  • altering tables
  • dealing with username/login that is database specific (often gets in the way of things, e.g. password reset on MediaWiki)[2]
  • exporting
  • importing
  • backups (separate from the backups you're already doing of the file system, because database files can themselves become corrupted internally - as what happened with Magnolia).

All extra crap (thus tax) compared to "just" using the file system.

All databases have a DBA tax.

Experience and evidence shows DBA tax exists as well. People lose things (and have them corrupted) in databases all the time, far more often than in file systems. This is because databases always require extra "magic" for backups etc.

E.g. search the web for people complaining about having to backup their WordPress databases before upgrading their WordPress install.

Uninspectability

Even in databases that use only a single file (e.g. SQLite?), that "one file" is still uninspectable, as opposed to simple HTML you can always look at in a browser.

There are many reasons why an uninspectable/random/binary magic file format is fragile, whereas HTML is an inspectable file format that many tools can read/write, including a simple text editor. In particular:

  • one implementation. magic database formats typically only work with one implementation. In contrast, HTML is a well documented format that supported by numerous tools (1:1 like text editors, and higher level tools) which makes it much more robust/reliable than any database file format. Similar problems occur with proprietary binary formats in general.

Platform trap

Most databases tend to be biased/tied to a particular programming language (or operating system) as well - more unnecessary constraints - trapping you into a particular platform (language or OS).

In contrast, every language / OS has flat file APIs. Nearly all have DOM Document APIs as well.

Fragile

Databases are known to be fragile in exceptional situations, e.g. high traffic load, server hard restart. Real world example: MySQL and Rackspace:

... Bad news: MySQL barfed all over the table holding my posts. ... [3]
And I'm pretty sure that happened when Rackspace hard reset my machine due to swapping.[4]

Note: apparently postgres doesn't have that problem.

Caching ok

Databases are essentially dead for anything but caching/performance needs for high volume sites. Examples:

  • aggregations (e.g. all of your posts with a specific tag) can be cached.
  • full-text search index
  • geo-search index

In both search use-cases the DBs should be a cache/query store, the real data should still be in flat files (e.g. .md, HTML, etc.)

Premature Optimization

Even though databases are fine for rebuildable caches, that's still solving an optimization problem.

Any design that depends on databases is optimizing prematurely - before you have any idea what the performance characteristics of the system are.

Alternatives

Rather than using a database for storage, use open well established formats like HTML + microformats, and if you really like how databases let you access them, instead build a database-like API on top (of said HTML + microformats files).

With HTML+microformats2 + a microformats2 parser, we already have a database-like API that gives you data back from HTML.

It's likely that eventually we'll come up with something resembling a database / storage API on top of an HTML+microformats2 file.

  • What do mean by a storage /database API on top of an HTML file?

FAQ

SQLite tax avoidance

Q: Is SQLite a way to avoid the DBA tax?

  • A: No. It's still yet another space of stuff and one space of stuff is easier to maintain / keep track of than two spaces of stuff. It also still has the "one implementation" fragility of its magic file format.

How are database file formats random

Q: How are database file formats random?

  • A: One way is that they're random over time - the implementation changes, the format changes typically in lockstep. If you're lucky you may be able to import/export across versions or different versions of the implementation may corrupt different versions of the file format. Hence database file formats are both semi-random (unpredictable future), and frankly, potentially randomizing (corrupting).

MySQL portability

Q: Isn't MySQL pretty portable?

  • A: No. That's actually a huge misconception. It might be portable if all you have is very small amounts of short data all in ASCII7. Change any of those and you're likely to experience invisible corruption when moving MySQL data from one instance to another.

Examples:

  • MySQL corrupts content all the time when "porting" - e.g. UTF8/encoding problems. Just search for MySQL UTF8. Eric Meyer has written a few blog posts about it (and he was just doing WordPress to WordPress migrations, and worse yet - he got LOTS of *wrong* advice from those in the "MySQL community"). And the corruption is often invisible, especially if most of your content is just ASCII7. You don't notice it until long after the original where you ported from is gone and you can't do any attempt at reporting.

Follow-up: Q: Isn't MySQL trivial to port via: mysqldump ... | scp | mysql ?

  • A: No. Another misconception. It doesn't work across MySQL versions/configurations. People get UTF8 corruptions because there's magical database configuration crap you have to be aware of. And don't forget the --skip-extended-insert flag otherwise if you have really long tables they will break when the target DB has a smaller max query size.

Still not convinced

Q: Still not convinced that databases for storage are an anti-pattern or about the DBA tax.

  • A: For those still not convinced, it's left as an exercise to lose enough data in random/binary file formats over the years until it's painful enough that you decide to stop using such things for anything you care about. If you don't think it's a pain, the only likely way to learn that it is is by doing it for years. Go ahead and try to make databases for primary storage work if you like, and report back on how well your data is doing in a few years (like 10).

See Also