Let’s say we’ve decided to implement foreign keys, and we need to find data that’s going to violate our desired keys.
We’re going to use the Stack Overflow database, and we’ll focus on these 3 tables:
- users table: with id column as its primary key
- posts table: with owneruserid column noting which users.id wrote the post
- comments table: with userid column noting which users.id wrote the comment, and postid column noting which post.id is being commented on
Before we attempt to implement foreign keys, we need to find data which might violate the foreign key relationships. Are there any:
- posts rows whose owneruserid does not match up with a valid users.id
- comments rows whose userid doesn’t match up with a valid users.id
- comments rows whose postid doesn’t match up with a valid posts.id
- And to make your task easier, let’s focus on just the first 100K rows in each table (rows with an id <= 100000) to see whether or not foreign keys make sense for this database
Your query exercise has a few parts:
- Write one or more queries to find these answers as quickly as possible with low load on the database.
- Given what you find, hypothesize about what might have caused the foreign key problems.
- Given what you learned, are there any changes you want to make to the app, processes, or database?
You can post your answers in this blog post’s comments, and discuss each others’ ideas. You can also use <pre></pre> tags in your comments to frame code, like this:
<pre>
select stuff from tables
</pre>
We’ll revisit your answers in this post. No peeking! Have fun solving it first.

3 Comments. Leave new
Here’s my go:
select count(*) as number_of_orphaned_posts from ( select owneruserid from public.posts where owneruserid < 100000 except select id from public.users where id < 100000 ) as a; select count(*) as number_of_orphaned_comments from ( select userid from public.comments where userid < 100000 except select id from public.users where id < 100000 ) as a; select count(*) as number_of_comments_under_orphaned_posts from ( select p.owneruserid from public.comments c inner join public.posts p on c.postid = p.id where p.owneruserid < 100000 except select id from public.users where id < 100000 ) as a;The first two queries didn’t find any rows that would violate a candidate foreign key. I either have a mistake, or there just aren’t any. The third query didn’t complete any time soon. I haven’t tuned this query or looked into how much load it’s putting on the server. I did an analyze on the query, but I don’t know where to go from there.
Had there been any FK violations, I assume it’s because a user is deleted without cleaning up their posts.
If the intent is to keep posts available to read when users leave (with a [deleted] author name for example), then I’d suggest implementing a soft-delete pattern, where the user-record sticks around. Add a “date-deleted” column to the users table and hide the user name from such comments.
Ah! Sorry, I should have been more specific – when reviewing the posts and comments table, only look at ids 100000 and below.
[…] Query exercises with a challenge one week, followed by answers the next week […]