SQL GROUP BY Tutorial — Aggregation, HAVING, and the 5 Mistakes Most Analysts Make
GROUP BY is where most SQL learners cross from "I can write a SELECT" to "I can answer real business questions." It's also where most early bugs land — every analyst has produced a wrong number because they put a non-aggregated column in SELECT, filtered an aggregate in WHERE, or forgot that NULL is its own group. This guide walks the mental model, the grouping rule, HAVING vs WHERE, and the five mistakes that defeat most analysts — with runnable examples and a clear practice path.
Contents
- The mental model — collapsing rows into groups
- The grouping rule (the one rule)
- HAVING vs WHERE — the most-asked interview question
- GROUP BY on multiple columns
- The five aggregates and how they handle NULL
- The 5 mistakes that defeat most analysts
- Conditional aggregation — SUM(CASE WHEN)
- ROLLUP, CUBE, GROUPING SETS
- Performance notes
- The GROUP BY checklist
- Practice with real challenges
The mental model — collapsing rows into groups
Think of GROUP BY as a two-step process the engine runs in your head.
- Step 1 — partition. The engine reads every row from the source and drops it into a bucket based on the GROUP BY columns. All rows with the same
departmentvalue land in the same bucket. All rows with the same(department, year)pair land together if you grouped by both. - Step 2 — collapse. Each bucket becomes exactly one output row. The aggregate functions in your SELECT — COUNT, SUM, AVG, MIN, MAX — fold each bucket's rows down into single values. The output table has one row per bucket.
Once that picture is in your head, the grouping rule explains itself: any column you SELECT must either be in the GROUP BY (so every row in the bucket has the same value, and there's one obvious answer) or wrapped in an aggregate (so the bucket's many rows fold into one value).
SELECT department,
COUNT(*) AS headcount,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
Three departments in the table → three output rows. The engine partitions on department, then for each bucket it computes COUNT(*) and AVG(salary). The headcount and avg_salary columns aren't in GROUP BY because they're aggregates — they collapse the bucket. The department column is allowed because every row in a single bucket has the same department value.
The grouping rule (the one rule)
Every column in SELECT must be in GROUP BY, or wrapped in an aggregate function. No exceptions.
This is the single rule that, if you internalize it, prevents 80% of GROUP BY bugs. The reason: a single output row has to have a single value in every column, but a group of input rows often has many different values in a non-grouped column. The engine has no way to pick "the right one" — so it errors, or worse, it picks arbitrarily and lies to you.
SELECT department,
name, -- ⚠ ambiguous: which name?
COUNT(*) AS headcount
FROM employees
GROUP BY department;
-- Postgres / SQL Server / BigQuery: ERROR
-- Old MySQL: returns ARBITRARY name from each bucket (silent bug)
The lenient-MySQL trap: MySQL versions before 5.7 (and 5.7+ without only_full_group_by) silently picked an arbitrary value from the group, producing wrong results that looked plausible. Modern MySQL enables only_full_group_by by default. If you're inheriting an old codebase, audit every GROUP BY query for non-aggregated, non-grouped columns — they're hidden bugs.
The fix: either group by name as well (creating per-(department, name) buckets), or replace name with an aggregate that defines which name you want — MIN(name), MAX(name), or STRING_AGG(name, ', ') to concatenate them.
HAVING vs WHERE — the most-asked interview question
If you've ever interviewed for a SQL role, you've been asked the difference between WHERE and HAVING. Here's the answer that interviewers want, in two sentences:
- WHERE filters rows BEFORE grouping. It can reference any column in the source table but cannot reference aggregates because the groups don't exist yet.
- HAVING filters groups AFTER aggregation. It's the only place you can put a condition on COUNT, SUM, AVG, etc.
SELECT department,
COUNT(*) AS high_earners
FROM employees
WHERE salary > 100000 -- pre-group: drop low-salary rows
GROUP BY department
HAVING COUNT(*) > 5; -- post-group: drop small departments
Knowing the execution order makes this stick. SQL evaluates clauses in this order, regardless of how you write them:
| Step | Clause | What happens |
|---|---|---|
| 1 | FROM | Read source tables, apply joins |
| 2 | WHERE | Filter rows (no aggregates allowed) |
| 3 | GROUP BY | Partition surviving rows into buckets |
| 4 | HAVING | Filter buckets (aggregates allowed) |
| 5 | SELECT | Compute output columns + aggregates |
| 6 | ORDER BY | Sort (can reference SELECT aliases) |
| 7 | LIMIT | Truncate rows |
That's also why WHERE can't reference column aliases (the SELECT hasn't run yet) but ORDER BY can. And it's why HAVING COUNT(*) > 5 works while WHERE COUNT(*) > 5 errors.
A common shortcut question: "Can I always use HAVING instead of WHERE?" Technically yes for non-aggregate predicates — HAVING department = 'Eng' works — but it's slower because the engine groups every row and then throws away the wrong groups, instead of filtering rows before grouping. Use WHERE for row predicates and HAVING for group predicates. Performance and intent both line up.
GROUP BY on multiple columns
Adding columns to GROUP BY makes the buckets finer-grained. GROUP BY department produces one bucket per department; GROUP BY (department, year) produces one bucket per (department, year) pair.
SELECT department,
EXTRACT(YEAR FROM hire_date) AS hire_year,
COUNT(*) AS hires
FROM employees
GROUP BY department, EXTRACT(YEAR FROM hire_date)
ORDER BY department, hire_year;
Note the GROUP BY repeats the EXTRACT expression — most databases require the exact expression rather than the alias. Postgres, SQLite, and modern MySQL also accept positional references like GROUP BY 1, 2 (the first and second SELECT columns), but expressions are clearer and survive column reordering.
The five aggregates and how they handle NULL
Five aggregates cover ~95% of real work. Each ignores NULL except where noted.
| Function | Returns | NULL behavior |
|---|---|---|
| COUNT(*) | Row count | Counts every row, including all-NULL ones |
| COUNT(col) | Non-NULL count of col | Ignores NULLs in col |
| SUM(col) | Total | Skips NULLs; SUM of all-NULL is NULL |
| AVG(col) | Mean | Skips NULLs; AVG of all-NULL is NULL (not 0) |
| MIN/MAX(col) | Extreme value | Skips NULLs |
The COUNT(*) vs COUNT(col) gotcha: COUNT(*) answers "how many rows" while COUNT(email) answers "how many rows have a non-NULL email." If 30% of users have no email, the two queries return different numbers and that's the bug. Most analysts mean COUNT(*) 95% of the time.
Two more aggregates worth knowing: STRING_AGG (Postgres / SQL Server) or GROUP_CONCAT (MySQL / SQLite) concatenates a column's values per group, often with a separator. ARRAY_AGG (Postgres) packs them into an array. Both are useful for collapsing many rows into a single comma-separated value or array per group.
The 5 mistakes that defeat most analysts
Mistake 1: SELECTing a non-aggregated, non-grouped column
Already covered above — the grouping rule violation. Modern engines error; old MySQL silently lies. Fix: add the column to GROUP BY, wrap in an aggregate, or remove from SELECT.
Mistake 2: Filtering an aggregate in WHERE
SELECT department, COUNT(*) AS headcount
FROM employees
WHERE COUNT(*) > 5 -- ERROR: aggregates not allowed in WHERE
GROUP BY department;
Fix: move the aggregate predicate into HAVING. The grouping rule and the WHERE-vs-HAVING rule together explain almost every "my query won't run" complaint when first learning GROUP BY.
Mistake 3: Forgetting NULL is its own group
If you GROUP BY country and 50 rows have NULL in country, those 50 rows form a single NULL group with one output row. Beginners often expect NULL rows to be silently dropped — they aren't. Fix: filter explicitly with WHERE country IS NOT NULL if you want them excluded, or use COALESCE(country, 'Unknown') to bucket them with a label.
Mistake 4: Sorting with positional GROUP BY then renaming a column
SELECT hire_year, department, COUNT(*)
FROM employees
GROUP BY 1, 2; -- breaks if you swap column order in SELECT
Positional GROUP BY (numeric references to SELECT positions) is a tool for ad-hoc work. In production code or anything you'll edit later, name the columns or expressions explicitly. Future-you will swap two columns in SELECT and not realize the GROUP BY semantics changed silently.
Mistake 5: Using DISTINCT inside an aggregate to "fix" double counting
If your JOIN structure is wrong and rows are duplicated, COUNT(DISTINCT user_id) patches the symptom but hides a real bug. The right fix is usually to find the unintended fan-out — usually a many-to-many JOIN that should have been a subquery — and eliminate the duplication at its source. DISTINCT inside an aggregate is also slower because the engine has to deduplicate before counting. Use it when DISTINCT is genuinely the requirement (count of unique users), not as a JOIN bug bandage.
Conditional aggregation — SUM(CASE WHEN), the workhorse pattern
Conditional aggregation is putting a CASE expression inside an aggregate function. It's the single most useful pattern in business reporting because it lets you compute multiple per-condition metrics in a single query.
SELECT department,
COUNT(*) AS total_orders,
SUM(CASE WHEN status = 'shipped' THEN 1 ELSE 0 END) AS shipped_n,
SUM(CASE WHEN status = 'pending' THEN amount END) AS pending_revenue,
AVG(CASE WHEN status = 'pending' THEN priority END) AS avg_pending_priority
FROM orders
GROUP BY department;
Three things to notice. First, the CASE expression returns a value (or NULL) per row, then the aggregate folds it across the group. Second, when there's no ELSE clause the false branch returns NULL — and aggregates skip NULL — so AVG(CASE WHEN active THEN response_time END) averages only active rows. Third, you can mix conditional and unconditional aggregates in the same query; COUNT(*) totals all rows while the CASE-wrapped sums apply only to subsets.
The alternative — running multiple queries with different WHERE clauses and joining the results — is slower, harder to read, and more bug-prone. Conditional aggregation is also how SQL pivots row-shaped data into column-shaped data, before databases added explicit PIVOT operators.
ROLLUP, CUBE, GROUPING SETS — subtotals without UNION
Once you've used GROUP BY a while, you'll hit a need for subtotals: per-(country, city) row counts plus per-country totals plus a grand total. Three modifiers handle this:
- ROLLUP — adds hierarchical subtotals.
GROUP BY ROLLUP(country, city)gives you (country, city) rows, per-country subtotals (city = NULL), and a grand total (both NULL). - CUBE — adds every possible combination of subtotals.
GROUP BY CUBE(country, city)gives you per-(country, city), per-country, per-city across all countries, and the grand total. - GROUPING SETS — explicit list of grouping combinations.
GROUP BY GROUPING SETS ((country, city), (country))gives only those two levels, no others.
SELECT country, city, COUNT(*) AS users
FROM users
GROUP BY ROLLUP(country, city)
ORDER BY country NULLS LAST, city NULLS LAST;
-- Output: per-(country,city) rows, per-country subtotals, grand total
The subtotal rows have NULL in the rolled-up columns, which can collide with actual NULL values in your data. Use the GROUPING(col) function to tell them apart — GROUPING(city) = 1 means "this is a subtotal row" while GROUPING(city) = 0 with NULL city means "the original data has NULL here." Rare in practice but worth knowing.
Engine support note: Postgres, SQL Server, Oracle, BigQuery, Snowflake, and recent MySQL all support ROLLUP/CUBE/GROUPING SETS. SQLite doesn't — fall back to UNION ALL. Most real reports lean on ROLLUP and skip CUBE because hierarchical totals (region → country → city) are more common than every-combination subtotals.
Performance notes
GROUP BY is one of the more expensive operations in SQL. Two patterns cover most performance work:
- Index the GROUP BY columns. A B-tree index on
(department)lets the engine skip the sort step before grouping. Composite indexes help multi-column groups when the index column order matches the GROUP BY column order. - Filter before grouping with WHERE. Reducing row count with WHERE is dramatically cheaper than grouping all rows and HAVING-filtering the result. Most "slow GROUP BY" queries are actually "GROUP BY operating on too many rows" queries.
For very large fact tables, materialized views or pre-aggregated rollup tables become necessary — refreshing once an hour is much cheaper than computing the same aggregate per query. That's a separate topic; the index + WHERE pattern handles 90% of GROUP BY performance issues you'll hit before then.
The GROUP BY checklist (paste into code review)
Before merging any GROUP BY query, verify:
- Every column in SELECT is either in GROUP BY or wrapped in an aggregate
- Aggregate predicates are in HAVING, not WHERE
- Row predicates are in WHERE, not HAVING (cheaper)
- NULL groups are handled — either filtered out, COALESCE-bucketed, or intentional
- COUNT(*) vs COUNT(col) — you want the right one
- No
COUNT(DISTINCT)hiding a JOIN fan-out bug - GROUP BY columns are named, not positional (in production code)
- If subtotals are needed, ROLLUP / GROUPING SETS, not 5 UNION ALLs
Practice with real challenges
Reading is the easy part. The patterns above stick when you write them cold five times — that's why every SQL Quest aggregation challenge gives you a wrong-answer diagnostic instead of just "incorrect." Here are 4 challenges that drill GROUP BY at increasing difficulty:
📊 Aggregation track — 12+ challengesThen the related deep-dives:
Read the rule, write it cold five times.
The five mistakes above show up in nearly every analyst's first month on the job. Drilling GROUP BY in 12 progressive challenges burns the patterns into muscle memory. The wrong-answer diagnostic catches what the rule didn't.
Start the Aggregation Track — FreeFound a typo or want a topic covered? Email us.