lesscode.org


The Myth Of Data Integrity  

By Alex Bunardzic under Rails on 01. November 2005

Another longish post that I wrote, this time it’s about the The Myth Of Data Integrity (posted on my blog). I was inspired by David’s interview on c|net news.com (Simplicity in Programming).

Controversial? You bet. Challenge the status quo!

8 Responses to “The Myth Of Data Integrity”

  1. Anonymous:

    So… wrong.

    comment at 01. November 2005

  2. Daniel:

    Alex Bunardzic wrote an article on database integrity in which he completely dismissed commonly accepted methods of using an RDBMS to enforce data integrity. I strongly disagree with his conclusions. In fact, I think his arguments are terrible.

    Continue reading…

    comment at 02. November 2005

  3. Anonymous:

    I’m realling getting fed up of these nonsensical and unsubstantiated articles from the Agile Methodologies front. Everyday someone jumps on the bandwagon, spurts a random blog entry and maybe enjoys being “controversial” after having been debunked by 99% of the readers.

    What would you do if someone claims that the earth is flat, and starts becoming famous thanks to his “controversial” theories? You may demonstrate his claims to be false, or just ignore him. By choosing the first option, the risk is to feed the debate and his fame as well.

    I think that I’ll ignore these people, starting from now. They will happily drown in their “agile” mud when their methodologies will collide with real-world non-trivial development problems.

    comment at 02. November 2005

  4. Kragen Sitaker:

    Alex’s article isn’t controversial. It’s just stupid. He’s grasping at an understanding of some things other people have said, and missing.

    Here’s what I think of constraints in the database, based on a scant few years of experience building a small number of database-backed web apps. I promise not to use any stupid similes.

    If there’s some validity constraint in the database that’s not reflected in your application code, you still lose. If you try to delete some row that a foreign key constraint requires to exist, and the database stops you, your application code has three choices. It can display the database’s error message to the user; it can pretend that the transaction succeeded; it can retry (not that this is a good idea, but I’ve seen it done); or it can display a content-free “general failure” message.

    Assuming your application’s users don’t understand the underlying table structure and the RDBMSese in which the error message is phrased, none of these four options are acceptable. So putting validity constraints in the database doesn’t save you from also putting them in your code. (If there’s only one possible way a particular statement can fail, then you can catch the failure and turn it into an appropriate error message or change in behavior; but that still requires putting the knowledge of the data integrity constraint in your application code.)

    Now we have a maintenance problem: two copies of the constraint, in very different forms, that must be modified in parallel. This is a heavy cost, but that’s not the end of it; putting your constraints in the database has three other major drawbacks.

    Checking the correctness of an update can be done (and must be done, as explained above) in application code, which can be distributed across a large number of front-end machines if your application has to scale. If you check the correctness of the update again in the database, it will cost you a great deal more money to scale to larger numbers of users, and you will hit a much lower ceiling.

    Database integrity constraints are difficult to check in and out of a source control system, unless your “build process” involves reinitializing with an empty database. This leads to spending a lot of time looking for phantom bugs when your development database has a stale constraint, or lacks one that will exist on the production machine.

    Finally, database integrity constraints tend to cause problems with things like backup and restore. It’s true that if they are correctly implemented (both the constraints and the backup/restore process) then no such problems will arise. In the real world, though, it is often a real problem.

    Fundamentally the issue is that relational databases are being deployed as application-internal object stores mediated by object-relational mappers. Both relational databases and object stores are fine things, but they have very different design requirements. A relational database is intended to be interacted with directly some of the time; SQL is a carefully designed user interface for this purpose. It’s designed to serve as a shared data store among several different applications, some of which may be written in COBOL and thus have very definite ideas about how long certain strings should be. An object store is intended to hide inside a program and be easy to refactor. (I’m leaving aside the issue of the relational data model and the object-reference-graph model, which is important, but not as important.)

    Many useful features of relational databases, such as multi-column primary keys, string length limits, triggers, foreign key constraints, users and their associated permissions, and so forth, are simply not useful when the database is being deployed as an internal object store for some application.

    That’s the story behind the real “agile people don’t like database integrity constraints” issue.

    I’ve spent a few years developing software with XP, and I don’t appreciate being lumped in with clueless seekers of controversy like Alex Bunardzic.

    comment at 03. November 2005

  5. Kragen Sitaker:

    I guess I should point out that RDBMSes make the best object stores I’ve had the chance to use, even though that isn’t really what they’re designed for. I haven’t had the opportunity to use persistence software that’s really designed for object-reference graphs that does even a fraction of what MySQL 2.0 did.

    comment at 03. November 2005

  6. Alex Bunardzic:

    Not to clog the lesscode’s bandwidth, I have posted some of the responses to the criticisms here:

    More on data integrity

    comment at 03. November 2005

  7. Ben Finney:

    Kragen wrote:

    If you try to delete some row that a foreign key constraint requires to exist, and the
    database stops you, your application code has three choices. It can display the
    database’s error message to the user; it can pretend that the transaction succeeded; it
    can retry (not that this is a good idea, but I’ve seen it done); or it can display a
    content-free “general failure” message.

    Two problems:

    • that’s four options, not three :-)

    • you’ve missed (at least) one viable option.

    The code should be aware that the transaction can fail, but let the RDBMS do the job of enforcing the data integrity. If the transation succeeds, fine! The application hasn’t had to do any checking before that happens, and it can continue on with the rest of the job. If the transaction fails, the code knows immediately, and can then proceed with whatever recovery action the programmer deems necessary.

    You make the point that the application code need to know about the constraint; that’s only true in that the application needs to deal with the case when the constraint is violated. But why not pass the actual enforcement off to the RDBMS, and have the application just deal with how to handle that result?

    comment at 03. November 2005

  8. Kragen Sitaker:

    Ben Finney writes:

    that’s four options, not three :-)

    Oops, you’re right — thanks :)

    You make the point that the application code need to know about the constraint; that’s only true in that the application needs to deal with the case when the constraint is violated. But why not pass the actual enforcement off to the RDBMS, and have the application just deal with how to handle that result?

    The three, um, four alternatives I outlined are different ways to have the application deal with how to handle that result; in other words, they are the choices available as “recovery actions” to the programmer. I argued that each of them is unacceptable for software intended for end-users (that is, someone not conversant with the underlying database schema). Is there a better alternative I haven’t thought of, or do you think that one of those four choices is acceptable in that context?

    comment at 15. November 2005

Leave a Reply

Note: None of this information is required but leaving a Name and URL is much appreciated. You can also register to have this stuff remembered.

Your comment can be previewed here.


Markdown: use the force, Luke.