Query Exercise: Find Foreign Key Problems

3 Comments

Let’s say we’ve decided to implement foreign keys, and we need to find data that’s going to violate our desired keys.

We’re going to use the Stack Overflow database, and we’ll focus on these 3 tables:

  • users table: with id column as its primary key
  • posts table: with owneruserid column noting which users.id wrote the post
  • comments table: with userid column noting which users.id wrote the comment, and postid column noting which post.id is being commented on

Before we attempt to implement foreign keys, we need to find data which might violate the foreign key relationships. Are there any:

  • posts rows whose owneruserid does not match up with a valid users.id
  • comments rows whose userid doesn’t match up with a valid users.id
  • comments rows whose postid doesn’t match up with a valid posts.id
  • And to make your task easier, let’s focus on just the first 100K rows in each table (rows with an id <= 100000) to see whether or not foreign keys make sense for this database

Your query exercise has a few parts:

  1. Write one or more queries to find these answers as quickly as possible with low load on the database.
  2. Given what you find, hypothesize about what might have caused the foreign key problems.
  3. Given what you learned, are there any changes you want to make to the app, processes, or database?

You can post your answers in this blog post’s comments, and discuss each others’ ideas. You can also use <pre></pre> tags in your comments to frame code, like this:

<pre>
select stuff from tables
</pre>

We’ll revisit your answers in this post. No peeking! Have fun solving it first.

Previous Post
Smart Postgres News and Techniques 2024 Vol 1
Next Post
Smart Postgres News and Techniques 2024 Vol 2

3 Comments. Leave new

  • Michael J Swart
    January 8, 2024 7:58 am

    Here’s my go:

    select count(*) as number_of_orphaned_posts
    from (
        select owneruserid from public.posts where owneruserid < 100000
        except
        select id from public.users where id < 100000
    ) as a;
    
    select count(*) as number_of_orphaned_comments
    from (
        select userid from public.comments where userid < 100000
        except
        select id from public.users where id < 100000
    ) as a;
    
    select count(*) as number_of_comments_under_orphaned_posts
    from (
        select p.owneruserid
        from public.comments c
        inner join public.posts p
            on c.postid = p.id 
        where p.owneruserid < 100000
    
        except
    
        select id from public.users where id < 100000
    ) as a;
    

    The first two queries didn’t find any rows that would violate a candidate foreign key. I either have a mistake, or there just aren’t any. The third query didn’t complete any time soon. I haven’t tuned this query or looked into how much load it’s putting on the server. I did an analyze on the query, but I don’t know where to go from there.

    Had there been any FK violations, I assume it’s because a user is deleted without cleaning up their posts.
    If the intent is to keep posts available to read when users leave (with a [deleted] author name for example), then I’d suggest implementing a soft-delete pattern, where the user-record sticks around. Add a “date-deleted” column to the users table and hide the user name from such comments.

    Reply
  • […] Query exercises with a challenge one week, followed by answers the next week […]

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.
You need to agree with the terms to proceed