What is referential integrity?

Referential integrity (RI) is a set of rules to ensure that data makes sense when it’s spread across multiple tables. For example, in a referentially correct database, you’d never be able to record a sale in your sales table for a customer that doesn’t exist in your customer table.

How relational databases help with referential integrity

Relational databases have great tools to ensure your programmers don’t violate RI. This is called enforcing referential integrity at the database level. When RI is enforced at the database level, the database won’t allow your programs to violate the relationships you’ve indicated must exist. In other words, each sales record points to a customer record. If you try to create a sale record that doesn’t point to a customer, the transaction will fail.

However, RI enforcement is optional—you don’t have to implement it. There can be downsides to enforcing it. For example, RI enforcement can slow transactions a little as the database must do more work to check things before accepting data.

And, a number of modern databases don’t even give you the option to enforce RI. For example, analytics darling Snowflake allows you to define your RI requirements, but it won’t enforce them.

So, there are downsides to enforcing RI in your database. Still, enforcement is vital to protect the quality of your data and prevent bugs.

A story from years ago

Years ago, data warehousing and analytic data storage were young. We didn’t have analytics-optimized databases. When people designed analytic databases they would not enforce RI. Why? Because they were dealing with large quantities of data and didn’t want to incur the RI performance penalty. Their logic was that their ETL, their data loading programs, would check RI before inserting new data.

However, no self-respecting data warehouse developer ever wrote a program to intentionally violate referential integrity. RI enforcement is built into databases, in part, to protect us from ourselves. It’s not that we’re writing programs that intentionally violate RI but, database-based RI enforcement is the last line of defense against logic bugs in our code.

So, RI enforcement is an important tool. Why are we turning it off before we even know if we have the performance problem we’re trying to solve? I advised clients; make sure you have a performance problem before you try to solve it. And, if you do encounter problems, make sure to consider all alternatives, like indexing and partitioning your data, before you turn off a tool vital to ensuring your data is correct.

A story from last week

I’ve been working with a long-term client on a data warehouse that captures labor cost details. Our schema includes employees and lots of things that relate to employees—work schedules, timecards, paychecks, etc. The database design calls for enforcing RI.

Last week, the jobs loading work schedules, and everything after that, failed. Our developers found we were receiving schedule records for two new employees who didn’t appear in our employee table. The developers were nearly apoplectic—how could their code be wrong? Let’s turn off RI enforcement!

What they didn’t see was that, while it was a headache, this failure was a good thing. The system was operating as designed! It was receiving bad data. Subsequent analysis revealed that our data source had changed. We needed to accommodate those changes, both in our extract jobs and in our target data warehouse.

Now, think through the implications for our users if we didn’t enforce RI and hadn’t caught this issue.

The biggest implication is that queries joining schedules and employees (for example, to put an employee name to each schedule) would return incorrect results—they wouldn’t return data for the missing employees. The department manager might then assume he had open schedule slots and schedule a second set of people for the shift.

If this problem extended to payroll data, salary totals would be off. Now finance, and the rest of the organization, are working with bad data.
Another implication is that we clearly have problems in our source system or our ETL. If we hadn’t enforced RI, we would never have known about it. If we did find out about it, it would have been because a report user realized that their data was wrong. What would that do for our credibility?

The downsides of enforcing referential integrity

To be sure, enforcing RI does come with some downsides:

It takes time to add RI enforcement to database designs.
Enforcing RI could require changes to your ETL architectures. For example, without RI enforcement, you can add your sales data to your data warehouse before your customer data. With it, you must add the customers before the sales (although there are some ways around this).
RI enforcement takes processing cycles. However, there are ways to minimize this impact, too.
If you don’t build error-trapping routines into your jobs, they will fail if enforced RI is violated. However, if you understand your data, this should occur rarely, if at all.

So, yes, there are some costs to enforcing RI in your database. But, ask yourself, what are the potential costs of delivering bad data?

Recommendations

Enforcing RI has one benefit I haven’t mentioned yet—it ensures that you really understand your data. Modern systems and their databases are complex. Wouldn’t you rather figure out that you misinterpreted your source systems when you first try to load data than after you’ve released that data to your users?

So, here are some recommendations for you to take forward:

Enforce RI in your database when you can.
When you can’t enforce RI in your database, you should:
- Write RI validation jobs that run after each load, or at least periodically, to ensure that RI hasn’t been violated.
- Caution your users to be extremely careful with the data and immediately report any anomalies.

Thanks for reading! If you’re facing database design issues, reach out. I’d love to talk about them!

How Referential Integrity Protects Your Credibility

What is referential integrity?

How relational databases help with referential integrity

A story from years ago

A story from last week

The downsides of enforcing referential integrity

Recommendations

DATASPACE INC.

Data Science Resume Checklist Download

Download the Dataspace Info Sheet

Subscribe to Dataspace Emails

How Referential Integrity Protects Your Credibility

Share On

What is referential integrity?

How relational databases help with referential integrity

A story from years ago

A story from last week

The downsides of enforcing referential integrity

Recommendations