SQL GROUP BY Tutorial — Aggregation, HAVING, and the 5 Mistakes Most Analysts Make

Published May 2026 · 13 min read · Drill 12+ aggregation challenges →

GROUP BY is where most SQL learners cross from "I can write a SELECT" to "I can answer real business questions." It's also where most early bugs land — every analyst has produced a wrong number because they put a non-aggregated column in SELECT, filtered an aggregate in WHERE, or forgot that NULL is its own group. This guide walks the mental model, the grouping rule, HAVING vs WHERE, and the five mistakes that defeat most analysts — with runnable examples and a clear practice path.

Contents

  1. The mental model — collapsing rows into groups
  2. The grouping rule (the one rule)
  3. HAVING vs WHERE — the most-asked interview question
  4. GROUP BY on multiple columns
  5. The five aggregates and how they handle NULL
  6. The 5 mistakes that defeat most analysts
  7. Conditional aggregation — SUM(CASE WHEN)
  8. ROLLUP, CUBE, GROUPING SETS
  9. Performance notes
  10. The GROUP BY checklist
  11. Practice with real challenges

The mental model — collapsing rows into groups

Think of GROUP BY as a two-step process the engine runs in your head.

  1. Step 1 — partition. The engine reads every row from the source and drops it into a bucket based on the GROUP BY columns. All rows with the same department value land in the same bucket. All rows with the same (department, year) pair land together if you grouped by both.
  2. Step 2 — collapse. Each bucket becomes exactly one output row. The aggregate functions in your SELECT — COUNT, SUM, AVG, MIN, MAX — fold each bucket's rows down into single values. The output table has one row per bucket.

Once that picture is in your head, the grouping rule explains itself: any column you SELECT must either be in the GROUP BY (so every row in the bucket has the same value, and there's one obvious answer) or wrapped in an aggregate (so the bucket's many rows fold into one value).

Toy example: per-department headcount and avg salary
SELECT department,
       COUNT(*) AS headcount,
       AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

Three departments in the table → three output rows. The engine partitions on department, then for each bucket it computes COUNT(*) and AVG(salary). The headcount and avg_salary columns aren't in GROUP BY because they're aggregates — they collapse the bucket. The department column is allowed because every row in a single bucket has the same department value.

The grouping rule (the one rule)

Every column in SELECT must be in GROUP BY, or wrapped in an aggregate function. No exceptions.

This is the single rule that, if you internalize it, prevents 80% of GROUP BY bugs. The reason: a single output row has to have a single value in every column, but a group of input rows often has many different values in a non-grouped column. The engine has no way to pick "the right one" — so it errors, or worse, it picks arbitrarily and lies to you.

Bug: name is neither grouped nor aggregated
SELECT department,
       name,                       -- ⚠ ambiguous: which name?
       COUNT(*) AS headcount
FROM employees
GROUP BY department;
-- Postgres / SQL Server / BigQuery: ERROR
-- Old MySQL: returns ARBITRARY name from each bucket (silent bug)

The lenient-MySQL trap: MySQL versions before 5.7 (and 5.7+ without only_full_group_by) silently picked an arbitrary value from the group, producing wrong results that looked plausible. Modern MySQL enables only_full_group_by by default. If you're inheriting an old codebase, audit every GROUP BY query for non-aggregated, non-grouped columns — they're hidden bugs.

The fix: either group by name as well (creating per-(department, name) buckets), or replace name with an aggregate that defines which name you want — MIN(name), MAX(name), or STRING_AGG(name, ', ') to concatenate them.

HAVING vs WHERE — the most-asked interview question

If you've ever interviewed for a SQL role, you've been asked the difference between WHERE and HAVING. Here's the answer that interviewers want, in two sentences:

Departments with more than 5 high-earners
SELECT department,
       COUNT(*) AS high_earners
FROM employees
WHERE salary > 100000          -- pre-group: drop low-salary rows
GROUP BY department
HAVING COUNT(*) > 5;            -- post-group: drop small departments

Knowing the execution order makes this stick. SQL evaluates clauses in this order, regardless of how you write them:

StepClauseWhat happens
1FROMRead source tables, apply joins
2WHEREFilter rows (no aggregates allowed)
3GROUP BYPartition surviving rows into buckets
4HAVINGFilter buckets (aggregates allowed)
5SELECTCompute output columns + aggregates
6ORDER BYSort (can reference SELECT aliases)
7LIMITTruncate rows

That's also why WHERE can't reference column aliases (the SELECT hasn't run yet) but ORDER BY can. And it's why HAVING COUNT(*) > 5 works while WHERE COUNT(*) > 5 errors.

A common shortcut question: "Can I always use HAVING instead of WHERE?" Technically yes for non-aggregate predicates — HAVING department = 'Eng' works — but it's slower because the engine groups every row and then throws away the wrong groups, instead of filtering rows before grouping. Use WHERE for row predicates and HAVING for group predicates. Performance and intent both line up.

GROUP BY on multiple columns

Adding columns to GROUP BY makes the buckets finer-grained. GROUP BY department produces one bucket per department; GROUP BY (department, year) produces one bucket per (department, year) pair.

Headcount per (department, hire-year)
SELECT department,
       EXTRACT(YEAR FROM hire_date) AS hire_year,
       COUNT(*) AS hires
FROM employees
GROUP BY department, EXTRACT(YEAR FROM hire_date)
ORDER BY department, hire_year;

Note the GROUP BY repeats the EXTRACT expression — most databases require the exact expression rather than the alias. Postgres, SQLite, and modern MySQL also accept positional references like GROUP BY 1, 2 (the first and second SELECT columns), but expressions are clearer and survive column reordering.

The five aggregates and how they handle NULL

Five aggregates cover ~95% of real work. Each ignores NULL except where noted.

FunctionReturnsNULL behavior
COUNT(*)Row countCounts every row, including all-NULL ones
COUNT(col)Non-NULL count of colIgnores NULLs in col
SUM(col)TotalSkips NULLs; SUM of all-NULL is NULL
AVG(col)MeanSkips NULLs; AVG of all-NULL is NULL (not 0)
MIN/MAX(col)Extreme valueSkips NULLs

The COUNT(*) vs COUNT(col) gotcha: COUNT(*) answers "how many rows" while COUNT(email) answers "how many rows have a non-NULL email." If 30% of users have no email, the two queries return different numbers and that's the bug. Most analysts mean COUNT(*) 95% of the time.

Two more aggregates worth knowing: STRING_AGG (Postgres / SQL Server) or GROUP_CONCAT (MySQL / SQLite) concatenates a column's values per group, often with a separator. ARRAY_AGG (Postgres) packs them into an array. Both are useful for collapsing many rows into a single comma-separated value or array per group.

The 5 mistakes that defeat most analysts

Mistake 1: SELECTing a non-aggregated, non-grouped column

Already covered above — the grouping rule violation. Modern engines error; old MySQL silently lies. Fix: add the column to GROUP BY, wrap in an aggregate, or remove from SELECT.

Mistake 2: Filtering an aggregate in WHERE

⚠ WRONG — aggregate in WHERE
SELECT department, COUNT(*) AS headcount
FROM employees
WHERE COUNT(*) > 5            -- ERROR: aggregates not allowed in WHERE
GROUP BY department;

Fix: move the aggregate predicate into HAVING. The grouping rule and the WHERE-vs-HAVING rule together explain almost every "my query won't run" complaint when first learning GROUP BY.

Mistake 3: Forgetting NULL is its own group

If you GROUP BY country and 50 rows have NULL in country, those 50 rows form a single NULL group with one output row. Beginners often expect NULL rows to be silently dropped — they aren't. Fix: filter explicitly with WHERE country IS NOT NULL if you want them excluded, or use COALESCE(country, 'Unknown') to bucket them with a label.

Mistake 4: Sorting with positional GROUP BY then renaming a column

Brittle: positional GROUP BY breaks if you reorder SELECT
SELECT hire_year, department, COUNT(*)
FROM employees
GROUP BY 1, 2;            -- breaks if you swap column order in SELECT

Positional GROUP BY (numeric references to SELECT positions) is a tool for ad-hoc work. In production code or anything you'll edit later, name the columns or expressions explicitly. Future-you will swap two columns in SELECT and not realize the GROUP BY semantics changed silently.

Mistake 5: Using DISTINCT inside an aggregate to "fix" double counting

If your JOIN structure is wrong and rows are duplicated, COUNT(DISTINCT user_id) patches the symptom but hides a real bug. The right fix is usually to find the unintended fan-out — usually a many-to-many JOIN that should have been a subquery — and eliminate the duplication at its source. DISTINCT inside an aggregate is also slower because the engine has to deduplicate before counting. Use it when DISTINCT is genuinely the requirement (count of unique users), not as a JOIN bug bandage.

Conditional aggregation — SUM(CASE WHEN), the workhorse pattern

Conditional aggregation is putting a CASE expression inside an aggregate function. It's the single most useful pattern in business reporting because it lets you compute multiple per-condition metrics in a single query.

Per-department: total, shipped count, avg priority of pending
SELECT department,
       COUNT(*) AS total_orders,
       SUM(CASE WHEN status = 'shipped' THEN 1 ELSE 0 END) AS shipped_n,
       SUM(CASE WHEN status = 'pending' THEN amount END) AS pending_revenue,
       AVG(CASE WHEN status = 'pending' THEN priority END) AS avg_pending_priority
FROM orders
GROUP BY department;

Three things to notice. First, the CASE expression returns a value (or NULL) per row, then the aggregate folds it across the group. Second, when there's no ELSE clause the false branch returns NULL — and aggregates skip NULL — so AVG(CASE WHEN active THEN response_time END) averages only active rows. Third, you can mix conditional and unconditional aggregates in the same query; COUNT(*) totals all rows while the CASE-wrapped sums apply only to subsets.

The alternative — running multiple queries with different WHERE clauses and joining the results — is slower, harder to read, and more bug-prone. Conditional aggregation is also how SQL pivots row-shaped data into column-shaped data, before databases added explicit PIVOT operators.

ROLLUP, CUBE, GROUPING SETS — subtotals without UNION

Once you've used GROUP BY a while, you'll hit a need for subtotals: per-(country, city) row counts plus per-country totals plus a grand total. Three modifiers handle this:

ROLLUP: subtotals in one query
SELECT country, city, COUNT(*) AS users
FROM users
GROUP BY ROLLUP(country, city)
ORDER BY country NULLS LAST, city NULLS LAST;
-- Output: per-(country,city) rows, per-country subtotals, grand total

The subtotal rows have NULL in the rolled-up columns, which can collide with actual NULL values in your data. Use the GROUPING(col) function to tell them apart — GROUPING(city) = 1 means "this is a subtotal row" while GROUPING(city) = 0 with NULL city means "the original data has NULL here." Rare in practice but worth knowing.

Engine support note: Postgres, SQL Server, Oracle, BigQuery, Snowflake, and recent MySQL all support ROLLUP/CUBE/GROUPING SETS. SQLite doesn't — fall back to UNION ALL. Most real reports lean on ROLLUP and skip CUBE because hierarchical totals (region → country → city) are more common than every-combination subtotals.

Performance notes

GROUP BY is one of the more expensive operations in SQL. Two patterns cover most performance work:

For very large fact tables, materialized views or pre-aggregated rollup tables become necessary — refreshing once an hour is much cheaper than computing the same aggregate per query. That's a separate topic; the index + WHERE pattern handles 90% of GROUP BY performance issues you'll hit before then.

The GROUP BY checklist (paste into code review)

Before merging any GROUP BY query, verify:

Practice with real challenges

Reading is the easy part. The patterns above stick when you write them cold five times — that's why every SQL Quest aggregation challenge gives you a wrong-answer diagnostic instead of just "incorrect." Here are 4 challenges that drill GROUP BY at increasing difficulty:

📊 Aggregation track — 12+ challenges

Then the related deep-dives:

Read the rule, write it cold five times.

The five mistakes above show up in nearly every analyst's first month on the job. Drilling GROUP BY in 12 progressive challenges burns the patterns into muscle memory. The wrong-answer diagnostic catches what the rule didn't.

Start the Aggregation Track — Free

Found a typo or want a topic covered? Email us.