There’s a MySQL pattern that I use fairly frequently that I want to share, both to help anyone else who might find it useful and also to find out if there’s a beter way.
Here’s an example: you have one table for users, another for posts, another for pages:
| DROP TABLE IF EXISTS users; |
| CREATE TABLE users (user_id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(20)); |
| INSERT INTO users (name) VALUES (‘Matt‘); |
| INSERT INTO users (name) VALUES (‘Simon‘); |
| INSERT INTO users (name) VALUES (‘Jen‘); |
| DROP TABLE IF EXISTS posts; |
| CREATE TABLE posts (post_id INT PRIMARY KEY AUTO_INCREMENT, user_id INT); |
| INSERT INTO posts (user_id) VALUES (1); |
| INSERT INTO posts (user_id) VALUES (1); |
| INSERT INTO posts (user_id) VALUES (1); |
| INSERT INTO posts (user_id) VALUES (2); |
| INSERT INTO posts (user_id) VALUES (2); |
| DROP TABLE IF EXISTS pages; |
| CREATE TABLE pages (page_id INT PRIMARY KEY AUTO_INCREMENT, user_id INT); |
| INSERT INTO pages (user_id) VALUES (2); |
| INSERT INTO pages (user_id) VALUES (2); |
| INSERT INTO pages (user_id) VALUES (3); |
| INSERT INTO pages (user_id) VALUES (3); |
| INSERT INTO pages (user_id) VALUES (3); |
| INSERT INTO pages (user_id) VALUES (3); |
| INSERT INTO pages (user_id) VALUES (3); |
| +———-+——-+ |
| | user_id | name | |
| +———-+——-+ |
| | 1 | Matt | |
| | 2 | Simon | |
| | 3 | Jen | |
| +———-+——-+ |
| +———-+———+ |
| | post_id | user_id | |
| +———-+———+ |
| | 1 | 1 | |
| | 2 | 1 | |
| | 3 | 1 | |
| | 4 | 2 | |
| | 5 | 2 | |
| +———-+———+ |
| +———-+———+ |
| | page_id | user_id | |
| +———-+———+ |
| | 1 | 2 | |
| | 2 | 2 | |
| | 3 | 3 | |
| | 4 | 3 | |
| | 5 | 3 | |
| | 6 | 3 | |
| | 7 | 3 | |
| +———-+———+ |
And we want to know how many posts and pages each user has:
| +———-+————+————+ |
| | user_id | post_count | page_count | |
| +———-+————+————+ |
| | 1 | 3 | 0 | |
| | 2 | 2 | 2 | |
| | 3 | 0 | 5 | |
| +———-+————+————+ |
Attempt 1: COUNT with JOIN
For the moment, lets focus on just getting the post count by user.
We might be tempted to try JOINing the two tables and using COUNT:
| SELECT users.user_id, COUNT(*) AS post_count |
| FROM users |
| JOIN posts ON posts.user_id = users.user_id |
| GROUP BY 1 |
| +———-+————+ |
| | user_id | post_count | |
| +———-+————+ |
| | 1 | 3 | |
| | 2 | 2 | |
| +———-+————+ |
The problem is that when we JOIN the two tables, the results will only include users who have posts. In this case, there’s no result for Jen (user id 3) because she doesn’t have any records in the posts table.
Attempt 2: COUNT with LEFT JOIN
Ok, so if JOIN doesn’t work, lets try LEFT JOIN:
| SELECT users.user_id, COUNT(*) AS post_count |
| FROM users |
| LEFT JOIN posts ON posts.user_id = users.user_id |
| GROUP BY 1 |
| +———-+————+ |
| | user_id | post_count | |
| +———-+————+ |
| | 1 | 3 | |
| | 2 | 2 | |
| | 3 | 1 | |
| +———-+————+ |
Now Jen has a count, but it’s reflecting 1 post, not 0 like we expect.
We can see why it’s broken by looking at the LEFT JOIN results without aggregating them:
| SELECT * |
| FROM users |
| LEFT JOIN posts ON posts.user_id = users.user_id |
| +———-+——-+———+———+ |
| | user_id | name | post_id | user_id | |
| +———-+——-+———+———+ |
| | 1 | Matt | 1 | 1 | |
| | 1 | Matt | 2 | 1 | |
| | 1 | Matt | 3 | 1 | |
| | 2 | Simon | 4 | 2 | |
| | 2 | Simon | 5 | 2 | |
| | 3 | Jen | NULL | NULL | |
| +———-+——-+———+———+ |
Jen doesn’t have any posts, but because we’re LEFT JOINing, her users record is still included. When we then aggregate the results with GROUP BY and COUNT, MySQL sees that the results have one record so returns a count of 1.
A better way to do this (as suggested by Tom Davies) is instead of counting all records, only count post ids:
| SELECT users.user_id, COUNT(post_id) AS post_count |
| FROM users |
| LEFT JOIN posts ON posts.user_id = users.user_id |
| GROUP BY 1 |
| +———-+————+ |
| | user_id | post_count | |
| +———-+————+ |
| | 1 | 3 | |
| | 2 | 2 | |
| | 3 | 0 | |
| +———-+————+ |
Attempt 3: SUM/IF, and LEFT JOIN
Another way we can achieve what we want (for a single table join) is to use SUM/IF:
| SELECT users.user_id, SUM(IF(post_id IS NULL, 0, 1)) AS post_count |
| FROM users |
| LEFT JOIN posts ON posts.user_id = users.user_id |
| GROUP BY 1 |
| +———-+————+ |
| | user_id | post_count | |
| +———-+————+ |
| | 1 | 3 | |
| | 2 | 2 | |
| | 3 | 0 | |
| +———-+————+ |
Here we’re saying “When the post id is null, give it a 0, otherwise a 1, then sum the results” which gives us the correct count. This is actually I trick I learned from the real Jen while at Automattic :).
The problem arises when we need to join multiple tables:
| SELECT |
| users.user_id, |
| SUM(IF(post_id IS NULL, 0, 1)) AS post_count, |
| SUM(IF(page_id IS NULL, 0, 1)) AS page_count |
| FROM users |
| LEFT JOIN posts ON posts.user_id = users.user_id |
| LEFT JOIN pages ON pages.user_id = users.user_id |
| GROUP BY 1 |
| # or |
| SELECT |
| users.user_id, |
| COUNT(post_id) AS post_count, |
| COUNT(page_id) AS page_count |
| FROM users |
| LEFT JOIN posts ON posts.user_id = users.user_id |
| LEFT JOIN pages ON pages.user_id = users.user_id |
| GROUP BY 1 |
| +———-+————+————+ |
| | user_id | post_count | page_count | |
| +———-+————+————+ |
| | 1 | 3 | 0 | |
| | 2 | 4 | 4 | |
| | 3 | 0 | 5 | |
| +———-+————+————+ |
Now instead of Simon (user id 2) having 2 posts, he has 4 – what happened? Again, we can look at the un-aggregated results:
| SELECT * |
| FROM users |
| LEFT JOIN posts ON posts.user_id = users.user_id |
| LEFT JOIN pages ON pages.user_id = users.user_id |
| +———-+——-+———+———+———+———+ |
| | user_id | name | post_id | user_id | page_id | user_id | |
| +———-+——-+———+———+———+———+ |
| | 2 | Simon | 4 | 2 | 1 | 2 | |
| | 2 | Simon | 5 | 2 | 1 | 2 | |
| | 2 | Simon | 4 | 2 | 2 | 2 | |
| | 2 | Simon | 5 | 2 | 2 | 2 | |
| | 1 | Matt | 1 | 1 | NULL | NULL | |
| | 1 | Matt | 2 | 1 | NULL | NULL | |
| | 1 | Matt | 3 | 1 | NULL | NULL | |
| | 3 | Jen | NULL | NULL | 3 | 3 | |
| | 3 | Jen | NULL | NULL | 4 | 3 | |
| | 3 | Jen | NULL | NULL | 5 | 3 | |
| | 3 | Jen | NULL | NULL | 6 | 3 | |
| | 3 | Jen | NULL | NULL | 7 | 3 | |
| +———-+——-+———+———+———+———+ |
The problem is that each post gets joined with each page result. Simon has 2 posts, each of which gets joined with the 2 pages, so when we use COUNT it sees 4 results and returns that amount.
The solution: Subqueries and COALESCE
Here’s how I would solve it:
| SELECT |
| users.user_id, |
| COALESCE(post_count, 0) AS post_count, |
| COALESCE(page_count, 0) AS page_count |
| FROM users |
| LEFT JOIN ( |
| SELECT user_id, COUNT(*) AS post_count |
| FROM posts |
| GROUP BY user_id |
| ) post_counts ON post_counts.user_id = users.user_id |
| LEFT JOIN ( |
| SELECT user_id, COUNT(*) AS page_count |
| FROM pages |
| GROUP BY user_id |
| ) page_counts ON page_counts.user_id = users.user_id |
| +———-+————+————+ |
| | user_id | post_count | page_count | |
| +———-+————+————+ |
| | 1 | 3 | 0 | |
| | 2 | 2 | 2 | |
| | 3 | 0 | 5 | |
| +———-+————+————+ |
To understand how it works, lets focus on counting posts. The first subquery counts how many posts each user has if they have any posts:
| SELECT user_id, COUNT(*) AS post_count |
| FROM posts |
| GROUP BY user_id |
| +———-+————+ |
| | user_id | post_count | |
| +———-+————+ |
| | 1 | 3 | |
| | 2 | 2 | |
| +———-+————+ |
We can then LEFT JOIN users on this derived table:
| SELECT * |
| FROM users |
| LEFT JOIN ( |
| SELECT user_id, COUNT(*) AS post_count |
| FROM posts |
| GROUP BY user_id |
| ) post_counts ON post_counts.user_id = users.user_id |
| +———-+——-+———+————+ |
| | user_id | name | user_id | post_count | |
| +———-+——-+———+————+ |
| | 1 | Matt | 1 | 3 | |
| | 2 | Simon | 2 | 2 | |
| | 3 | Jen | NULL | NULL | |
| +———-+——-+———+————+ |
For the users with posts, the result has the post count. For the users without posts, the result is NULL. This is where COALESCE comes into play. COALESCE takes any number of arguments and returns the first non-NULL result:
| SELECT COALESCE(NULL, 0) |
| +——————–+ |
| | COALESCE(NULL, 0) | |
| +——————–+ |
| | 0 | |
| +——————–+ |
So we’re saying “If the user has a posts count, use that, otherwise use 0”.
We can then LEFT JOIN again on the pages table and do the same thing with the pages count to get the posts and pages count per user. Because each subquery only returns a max of one result per user, we don’t run into the issue we did earlier where posts get joined with pages to return the incorrect count.
Somewhat complicated, but the only way I know how to do it. If you know of a better way, please drop a comment below. Thanks!
Edit:
This simpler method also works:
| SELECT |
| users.user_id, |
| COUNT(DISTINCT post_id) AS post_count, |
| COUNT(DISTINCT page_id) AS page_count |
| FROM users |
| LEFT JOIN posts ON posts.user_id = users.user_id |
| LEFT JOIN pages ON pages.user_id = users.user_id |
| GROUP BY 1 |
| +———-+————+————+ |
| | user_id | post_count | page_count | |
| +———-+————+————+ |
| | 1 | 3 | 0 | |
| | 2 | 2 | 2 | |
| | 3 | 0 | 5 | |
| +———-+————+————+ |
By counting distinct post ids and page ids, we avoid counting NULLs and also avoid counting duplicates due to joining on posts and pages.