I run into the “top‑N” problem almost every week: highest salaries, best‑selling products, most active users, peak latency windows. You can always sort and slice, but that’s a full reordering of everything even when you only need a handful of rows. In practice, that choice affects latency, memory pressure, and how easily you can explain the intent of your code to a teammate. When the dataset grows from a few thousand rows to a few million, the difference shows up fast.
That’s why I reach for pandas DataFrame.nlargest() whenever I want the biggest values without fully sorting the whole frame. It reads like a sentence, it’s concise, and it signals intent: I want the top N rows by column, not a complete ordering. I’ll show you how it behaves, how I handle ties, missing values, and mixed types, and how to apply it to multi‑column ranking and grouped results. I’ll also call out when I avoid it and pick a different approach, plus a few modern 2026 patterns that keep pipelines clear and reproducible.
Why nlargest exists and how it behaves
When you sort a DataFrame to get top rows, you’re asking pandas to do extra work: order every row, even if you plan to keep only the first five. DataFrame.nlargest() is designed for the common case where you only care about the highest few values. Think of it like pulling the top five resumes from a stack without fully ranking every candidate. You still scan the pile, but you don’t need to meticulously order every page.
The method operates on a DataFrame or Series. For DataFrames, you pass a column name (or list of columns) to rank by. The result is a DataFrame containing the N rows with the largest values in those columns. It keeps the original row labels, which is useful when you need to trace the records back to their source.
The core signature is:
DataFrame.nlargest(n, columns, keep='first')
That terse signature is deceptive in a good way. You tell pandas how many rows you want, which column(s) define “largest,” and how to break ties. The method returns a new DataFrame; it does not mutate in place. That makes it safe to use in pipelines and chainable expressions.
A rule I follow: if I only need the top N rows, I always start with nlargest. I only switch to sort_values when I need a complete ordering or I need a complex tie‑break across several columns with custom direction.
Core syntax and parameter behavior
Here’s the basic call and what each argument really means in practice:
- n: number of rows to return. This is an integer and must be >= 1.
- columns: the column or list of columns used for ranking.
- keep: how to handle duplicate values at the cut line. Options are 'first', 'last', or 'all'. With 'all', every row tied at the boundary is returned, so the result can contain more than n rows; I still prefer explicit tie‑breaks when the exact row set matters.
I like to demonstrate with a realistic dataset rather than toy placeholders. Here’s a small DataFrame with salaries, scores, and last update timestamps:
import pandas as pd

employees = pd.DataFrame(
    {
        'name': ['Amira', 'Luis', 'Chen', 'Priya', 'Noah', 'Zara', 'Mateo'],
        'salary': [125000, 98500, 125000, 132000, None, 110000, 95000],
        'score': [92, 87, 91, 95, 88, 90, 85],
        'updated_at': pd.to_datetime(
            ['2025-12-02', '2025-11-19', '2025-12-15', '2025-12-01', '2025-10-30', '2025-12-10', '2025-11-01']
        ),
    }
)

# Drop missing salaries before ranking
clean = employees.dropna(subset=['salary'])
top3 = clean.nlargest(3, 'salary')
print(top3)
This returns the top three rows by salary. Note the use of dropna with a subset list. I make that explicit because DataFrame.nlargest ignores NaN values in the ranking column, but leaving missing rows in can still produce confusing results if you later compare against the full dataset.
The keep parameter influences which rows make the cut when there are ties around the boundary. If you have duplicate salaries and you ask for the top three, one of those ties might be dropped. With keep='first', the first occurrence in the original order wins. With keep='last', the last occurrence wins. If you need deterministic tie‑breaks across multiple columns, prefer a multi‑column nlargest or a sort_values call with explicit ordering.
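To make keep concrete, here is a minimal sketch on a Series with a tie at the boundary (the values are invented for illustration):

```python
import pandas as pd

s = pd.Series([10, 30, 30, 30])

# keep='first' returns exactly n rows; the earliest tied occurrences win
print(s.nlargest(2))               # two of the three 30s

# keep='all' returns every row tied at the boundary, so more than n rows
print(s.nlargest(2, keep='all'))   # all three 30s
```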
Single‑column vs multi‑column ranking
Single‑column ranking is the most common usage, but multi‑column ranking is where nlargest becomes a quiet powerhouse. You can ask for the largest values by one column and then use a second column to break ties. The columns list is ordered by priority, similar to sort_values.
# Top 3 by salary, then by most recent update for ties
ranked = clean.nlargest(3, ['salary', 'updated_at'])
print(ranked[['name', 'salary', 'updated_at']])
This gives you the highest salaries, and if two people share a salary, the more recent update wins. I like this pattern when data can be stale or I need a stable tie‑break that matches business logic.
You can also pair nlargest with column selection for more readable pipelines:
(
    clean[['name', 'salary', 'score']]
    .nlargest(5, 'score')
)
That selection keeps the output focused and reduces the mental load for anyone scanning the results. It’s a tiny improvement, but in my experience it makes notebooks easier to review.
When I prefer sort_values instead
I still use sort_values when I need a full ordering or I need mixed directions (for example, highest revenue but lowest churn). DataFrame.nlargest doesn’t allow descending on one column and ascending on another in a single call. Here’s the safer approach in that scenario:
sorted_rows = clean.sort_values(
    by=['revenue', 'churn_rate'],
    ascending=[False, True]
).head(10)
If you need a full ranking or mixed direction, sort_values is the clear choice. That’s not a knock on nlargest; it’s just the wrong tool for that job.
Ties, NaN, and type pitfalls
Most bugs I see with nlargest come from ties, NaN handling, or mixed types. These are the edges I check first.
Ties at the boundary
If two rows share the same value and one of them lands at the Nth position, keep decides which survives. That’s fine for quick exploration but risky for reports that need stable selection. I avoid accidental tie drops by adding a second column for tie‑breaks:
# Tie-break on score when salaries are the same
ranked = clean.nlargest(3, ['salary', 'score'])
That makes the result deterministic and reflects real‑world logic: highest salary first, then highest score among the tied salaries.
Missing values
nlargest ignores NaN values in the ranking column. That sounds convenient, but it can hide missing data you actually need to surface. I prefer to be explicit by dropping rows for the specific column I rank on, and I leave other missing columns alone:
ranked = employees.dropna(subset=['salary']).nlargest(5, 'salary')
That makes intent obvious and keeps audits cleaner.
Mixed types and strings
If the ranking column is object dtype and mixes numbers and strings, nlargest will error or behave unexpectedly. I force numeric conversion when I’m ingesting CSV files or user input:
employees['salary'] = pd.to_numeric(employees['salary'], errors='coerce')
ranked = employees.dropna(subset=['salary']).nlargest(5, 'salary')
The errors='coerce' approach avoids crashes but can hide bad data. For production pipelines, I log the count of coerced values so you can track upstream issues.
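A lightweight way to do that logging, assuming the raw values arrive as strings:

```python
import pandas as pd

raw = pd.Series(['125000', '98500', 'oops', '110000'])
numeric = pd.to_numeric(raw, errors='coerce')

# Count values that became NaN only because of coercion
n_coerced = int(numeric.isna().sum() - raw.isna().sum())
print(f'{n_coerced} value(s) could not be converted')
```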
Datetime ranking
Datetime columns work well with nlargest, which makes “most recent N” trivial:
recent = employees.nlargest(3, 'updated_at')
This is one of those places where nlargest reads like a sentence and saves you from writing a sort + head combination.
Performance and scalability in real projects
On large datasets, nlargest is usually faster than sorting the whole DataFrame, especially when N is small relative to the total size. It uses selection logic that avoids a full ordering. For datasets in the hundreds of thousands to low millions of rows, I typically see nlargest return in the 10–80ms range for a single numeric column, while full sorting can land in the 40–250ms range depending on memory pressure and column width. Those are broad ranges on modern laptops and workstation CPUs, but the relative difference tends to hold.
When N is close to the total row count, the gap closes and sorting might be just as fast or faster. My rule of thumb: if N is less than about 10–20% of the total rows, nlargest is a good bet.
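If you want to check the trade-off on your own hardware rather than trust my ranges, a quick timeit sketch works; the data here is synthetic and the absolute numbers will differ by machine:

```python
import timeit

import numpy as np
import pandas as pd

# 500k synthetic rows with distinct values
df = pd.DataFrame({'value': np.random.default_rng(0).normal(size=500_000)})

t_top = timeit.timeit(lambda: df.nlargest(10, 'value'), number=20) / 20
t_sort = timeit.timeit(lambda: df.sort_values('value', ascending=False).head(10), number=20) / 20
print(f'nlargest: {t_top:.4f}s, sort + head: {t_sort:.4f}s')
```

Because the values are distinct, both calls select the same rows, so this doubles as a quick correctness check.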
Here’s a comparison table I use to decide between approaches:
| Approach | Typical time on 500k rows | Best use case |
| --- | --- | --- |
| nlargest | ~20–70ms | top‑N selection |
| sort_values + head | ~60–180ms | need sorted output |
| numpy.argpartition | ~10–50ms | very large arrays |
If you’re already in pandas, nlargest is the best blend of speed and clarity. I only drop to numpy for very tight loops or when I already have a NumPy array and want the fastest possible top‑N on a single column.
Modern 2026 workflow tip
I often pair nlargest with AI‑assisted data checks. For example, I’ll generate a short data profile that flags extreme values and then use nlargest to inspect the top few rows that triggered the flag. This reduces time spent hunting outliers and helps me explain why a model or report is skewed.
Real‑world patterns I use most often
Top‑N per group
You might need the top three transactions per customer or the highest revenue product per region. There are a few ways to do this. Here’s the one I recommend for clarity:
# Top 2 salaries per department
employees = pd.DataFrame(
    {
        'name': ['Amira', 'Luis', 'Chen', 'Priya', 'Noah', 'Zara', 'Mateo'],
        'department': ['Sales', 'Sales', 'Eng', 'Eng', 'Sales', 'Eng', 'Sales'],
        'salary': [125000, 98500, 125000, 132000, 105000, 110000, 95000],
    }
)

ranked = (
    employees
    .sort_values(['department', 'salary'], ascending=[True, False])
    .groupby('department')
    .head(2)
)
print(ranked)
Why not groupby().nlargest()? Because groupby().nlargest is Series‑only and can be awkward to merge back into the full DataFrame. The sort + groupby + head pattern is explicit and stable, especially when you need extra columns.
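For contrast, here is what the Series form looks like, reusing the department data above; note the MultiIndex output, which is why merging back into the full DataFrame gets awkward:

```python
import pandas as pd

employees = pd.DataFrame(
    {
        'name': ['Amira', 'Luis', 'Chen', 'Priya', 'Noah', 'Zara', 'Mateo'],
        'department': ['Sales', 'Sales', 'Eng', 'Eng', 'Sales', 'Eng', 'Sales'],
        'salary': [125000, 98500, 125000, 132000, 105000, 110000, 95000],
    }
)

# Works, but returns a Series keyed by a (department, row label) MultiIndex
per_dept = employees.groupby('department')['salary'].nlargest(2)
print(per_dept)
```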
Largest per category with tie‑breaks
When I need a single row per group with a tie‑break, I use multi‑column sorting first:
ranked = (
    employees
    .sort_values(['department', 'salary', 'name'], ascending=[True, False, True])
    .groupby('department')
    .head(1)
)
That makes ties deterministic and easy to audit. It’s not pure nlargest, but it follows the same top‑N intent.
Combining nlargest with filters
Filtering first is almost always faster and clearer:
high_score_recent = (
    employees
    .query('score >= 90')
    .nlargest(5, 'salary')
)
This pattern prevents large scans and keeps the top‑N selection focused on a relevant subset.
Common mistakes and how I avoid them
- Forgetting to drop NaN in the ranking column. I always call dropna with a subset list, even if nlargest ignores NaN. It avoids confusion later.
- Using nlargest when a full ordering is needed. If a report needs ranks for every row, I use sort_values and then assign ranks.
- Expecting nlargest to apply mixed sort directions. It doesn’t. I switch to sort_values when I need descending on one column and ascending on another.
- Relying on default tie handling. I add a tie‑break column or a sort order so I can explain exactly why one row beat another.
- Ranking on object dtype. I coerce numeric columns upfront and count how many values become NaN after conversion.
If you keep those five points in mind, nlargest becomes a reliable tool rather than a source of surprises.
When I avoid nlargest
I still like nlargest, but there are clear cases where I skip it:
- You need a full sorted output for downstream processing. In that case, sort_values is direct and less confusing.
- You need mixed sort direction. sort_values with ascending list is the right tool.
- You need stable ranking across multiple columns plus custom tie rules. Again, sort_values gives you more control.
- N is almost the same size as the dataset. At that point, a full sort is comparable in cost, and sort_values is easier to explain.
Here’s a practical decision guide I share with teams:
| Scenario | Best choice |
| --- | --- |
| Top N by a single column | nlargest |
| Top N with multi‑column tie‑breaks | nlargest with columns list |
| Full ordering needed | sort_values |
| Mixed sort directions or custom tie rules | sort_values |
A short end‑to‑end example
This example shows how I use nlargest in a small pipeline, including data checks and a final output that a report can consume.
import pandas as pd

sales = pd.DataFrame(
    {
        'order_id': [101, 102, 103, 104, 105, 106],
        'customer': ['Lena', 'Dario', 'Lena', 'Mina', 'Dario', 'Akash'],
        'total': [420.50, 1100.00, 210.00, 860.00, None, 775.25],
        'ordered_at': pd.to_datetime(
            ['2025-12-01', '2025-12-03', '2025-11-30', '2025-12-02', '2025-12-04', '2025-12-05']
        ),
    }
)

clean = sales.dropna(subset=['total'])
top_orders = (
    clean
    .nlargest(3, ['total', 'ordered_at'])
    .reset_index(drop=True)
)
print(top_orders)
I use reset_index here because this is going into a report where row labels aren’t useful. If I need the original row labels for debugging, I keep them and later use them to trace back to raw input.
Deeper internals: what nlargest is really doing
One reason I trust nlargest in production is that it maps to a clear selection approach. Instead of building a fully sorted order, pandas focuses on finding the top N values, which generally means a partial selection step internally. The exact implementation can shift across versions, but the idea stays consistent: don’t pay the cost of a full sort when you only need a small slice.
What that means for you as a user is:
- nlargest scales with N more gently than sort_values. When N is tiny relative to total rows, it’s usually much faster.
- nlargest is not a magic filter. It still has to scan the relevant column(s), so the full DataFrame is read, but not fully reordered.
- If you request multiple columns, pandas still evaluates the tie‑break logic but avoids a full multi‑column sort when possible.
I don’t need to know the internal algorithm to use it well, but I do keep the mental model: “scan and select, not fully rank.” That’s a good heuristic when you decide where it fits in a pipeline.
Extra practical examples for common business questions
I find nlargest really shines when I need answers that are both quick and clearly correct for stakeholders. Here are a few patterns that come up often.
1) Highest‑revenue orders in a time window
import pandas as pd

orders = pd.DataFrame(
    {
        'order_id': [201, 202, 203, 204, 205, 206, 207],
        'total': [150.25, 980.00, 320.00, 1200.50, 75.00, 640.00, 910.00],
        'ordered_at': pd.to_datetime(
            ['2025-12-01', '2025-12-02', '2025-12-02', '2025-12-03', '2025-12-03', '2025-12-04', '2025-12-04']
        ),
        'channel': ['web', 'web', 'mobile', 'web', 'mobile', 'web', 'mobile'],
    }
)

# Filter to last 3 days, then get top 3 orders by total
recent_top = (
    orders
    .query("ordered_at >= '2025-12-02'")
    .nlargest(3, 'total')
)
This is a good example of filter first, then rank. It’s fast, readable, and easy to justify in a report.
2) Most active users by event count
Sometimes you don’t even have a numeric column; you need to build the metric first.
events = pd.DataFrame(
    {
        'user_id': ['u1', 'u2', 'u1', 'u3', 'u1', 'u2', 'u4', 'u2', 'u3'],
        'event': ['click', 'view', 'click', 'view', 'purchase', 'click', 'view', 'click', 'click'],
        'event_at': pd.to_datetime(
            ['2025-12-01', '2025-12-01', '2025-12-01', '2025-12-02', '2025-12-02',
             '2025-12-02', '2025-12-03', '2025-12-03', '2025-12-03']
        ),
    }
)

activity = (
    events
    .groupby('user_id')
    .size()
    .rename('event_count')
    .reset_index()
)

most_active = activity.nlargest(2, 'event_count')
This shows a nice pattern: compute a metric with groupby, then use nlargest on the result. It makes “most active users” a simple and reliable step.
3) Top products within each category with stable tie‑breaks
products = pd.DataFrame(
    {
        'category': ['A', 'A', 'A', 'B', 'B', 'B'],
        'product': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
        'revenue': [1200, 1200, 900, 450, 450, 300],
        'updated_at': pd.to_datetime(
            ['2025-12-01', '2025-12-03', '2025-12-02', '2025-12-01', '2025-12-04', '2025-12-02']
        ),
    }
)

# Sort with tie-break, then take top 1 per category
best_per_category = (
    products
    .sort_values(['category', 'revenue', 'updated_at'], ascending=[True, False, False])
    .groupby('category')
    .head(1)
)
This keeps the tie‑break logic explicit and makes the selection reproducible even if the input order changes.
Edge cases you should test once and then automate
I like to run a few small tests when I introduce nlargest into a pipeline, then I keep those tests around as guardrails. Here are the edge cases I consider “must check.”
1) All values are NaN
df = pd.DataFrame({'value': [None, None, None]}, dtype='float64')
result = df.nlargest(2, 'value')
This should return an empty DataFrame. If your downstream code assumes at least one row, it will break. I usually protect against this by checking result.empty and handling it explicitly.
2) N larger than the number of rows
df = pd.DataFrame({'value': [10, 20]})
result = df.nlargest(5, 'value')
Pandas will just return all rows. That’s reasonable, but you should not assume the output has exactly N rows if you don’t control the input size.
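A one-line assertion captures that contract if downstream code cares about exact row counts:

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 20]})
result = df.nlargest(5, 'value')

# The output is capped by the available rows, not padded to n
assert len(result) == min(5, len(df))
print(len(result))  # 2
```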
3) Non‑numeric ranking columns
df = pd.DataFrame({'value': ['10', '20', '3']})
result = df.nlargest(2, 'value')
This often fails or sorts in unexpected ways if values are strings. Convert to numeric first to avoid subtle issues.
4) Multi‑column ranking with different dtypes
df = pd.DataFrame(
    {
        'score': [90, 90, 88],
        'updated_at': pd.to_datetime(['2025-12-01', '2025-12-03', '2025-12-02'])
    }
)
result = df.nlargest(2, ['score', 'updated_at'])
This works nicely, but if updated_at is still a string, the tie‑break becomes lexicographic, which is a recipe for mistakes. Always enforce dtypes before ranking.
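Enforcing the dtype is a one-liner, so I do it right before the ranking step:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'score': [90, 90, 88],
        'updated_at': ['2025-12-01', '2025-12-03', '2025-12-02'],  # still strings
    }
)

# Convert before ranking so the tie-break is chronological, not lexicographic
df['updated_at'] = pd.to_datetime(df['updated_at'])
result = df.nlargest(1, ['score', 'updated_at'])
print(result)
```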
nlargest in production pipelines: structure and clarity
A lot of production data pipelines have a simple pattern: ingest, clean, filter, aggregate, rank, output. nlargest fits into that pattern cleanly if you keep a few structure rules:
1) Validate dtypes early
2) Drop or fill NaN values intentionally
3) Filter before ranking
4) Rank with explicit tie‑breaks
5) Keep a small, relevant output for downstream steps
Here’s a compact pattern I like for scheduled reports:
import pandas as pd

sales = pd.read_parquet('sales.parquet')

# 1) Enforce numeric types early
sales['total'] = pd.to_numeric(sales['total'], errors='coerce')

# 2) Remove rows that can't be ranked
sales = sales.dropna(subset=['total'])

# 3) Limit to the last 30 days
cutoff = pd.Timestamp('today').normalize() - pd.Timedelta(days=30)
recent = sales.query('ordered_at >= @cutoff')

# 4) Rank with tie-break on most recent order
top_orders = recent.nlargest(10, ['total', 'ordered_at'])

# 5) Keep only report columns
report = top_orders[['order_id', 'customer', 'total', 'ordered_at']]
This is short, predictable, and easy to review. I often wrap this into a function so it can be reused across dashboards and scheduled jobs.
Practical tie‑break strategies I actually use
Here are three tie‑break choices I’ve found to be the most defensible in real projects:
- Most recent update wins: great for data that changes over time.
- Highest secondary score wins: good when you have a related metric that indicates quality or reliability.
- Stable ID ordering: helpful when you need deterministic results but no meaningful tie‑break exists.
Here’s how that looks in practice:
ranked = (
    employees
    .dropna(subset=['salary'])
    .nlargest(5, ['salary', 'updated_at', 'employee_id'])
)
This sequence makes the rule clear to anyone reading the code.
nlargest vs. rank: when you need positions, not just top rows
If you need to assign ranks to all rows, nlargest is the wrong tool. Use rank or sort_values. For example:
employees['salary_rank'] = (
    employees['salary']
    .rank(method='dense', ascending=False)
)
That gives every row a rank, which is useful for reporting or dashboard visuals. nlargest is best when you only need a small selection of rows and you want to skip the cost of ranking everything.
A quick note on stability and reproducibility
By default, keep='first' means "first occurrence wins." If your input order changes between runs (common in distributed pipelines), your results might shift. If you need stable output across runs, add an explicit tie‑break column and keep your ranking deterministic. I view that as a best practice for any report that might be audited later.
Advanced: using nlargest with windowed data
Sometimes you want “top N per time window.” Here’s a simple pattern for weekly windows:
sales = pd.DataFrame(
    {
        'order_id': [301, 302, 303, 304, 305, 306],
        'total': [220, 340, 150, 520, 180, 410],
        'ordered_at': pd.to_datetime(
            ['2025-11-25', '2025-11-27', '2025-12-01', '2025-12-02', '2025-12-06', '2025-12-07']
        ),
    }
)

sales['week'] = sales['ordered_at'].dt.to_period('W').astype(str)

weekly_top = (
    sales
    .sort_values(['week', 'total'], ascending=[True, False])
    .groupby('week')
    .head(2)
)
This is another case where a full nlargest per group is not as convenient as sorting by the group key, then using head.
Alternative approaches and when they are better
I’m a big fan of nlargest, but it’s not the only option. Here’s how I think about alternatives in everyday work:
sort_values + head
Best when you need a fully ordered output or mixed direction. It’s more flexible but can cost more time and memory for large tables.
numpy.argpartition
Best when you have a huge NumPy array and need the absolute fastest top‑N on a single column. The downside is readability and losing DataFrame context like index labels and mixed dtypes.
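A minimal sketch of that NumPy route, using invented values:

```python
import numpy as np

values = np.array([3.2, 9.1, 1.5, 7.7, 4.4, 8.8])
n = 3

# argpartition places the n largest values in the last n slots, unordered
idx = np.argpartition(values, -n)[-n:]
top = np.sort(values[idx])[::-1]
print(top)  # [9.1 8.8 7.7]
```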
SQL ORDER BY + LIMIT
If your data lives in a database, pushing the top‑N query to SQL is often best. Let the database handle the sorting, especially when it has indexes. I still use pandas for follow‑up processing, but I try not to pull millions of rows into memory just to take the top 10.
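A sketch of that pushdown, with an in-memory SQLite table standing in for the real database (table and column names are hypothetical):

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(':memory:')
pd.DataFrame(
    {'order_id': [1, 2, 3, 4], 'total': [420.5, 1100.0, 860.0, 775.25]}
).to_sql('sales', con, index=False)

# The database does the top-N work; pandas only ever sees two rows
top = pd.read_sql('SELECT * FROM sales ORDER BY total DESC LIMIT 2', con)
print(top)
```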
Streaming top‑N
If you’re processing data in chunks, you can keep a rolling top‑N using a heap. This is useful when data doesn’t fit into memory. It’s more complex than pandas but can be essential for very large datasets.
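A rolling top-N with heapq might look like this; the chunks here are invented stand-ins for something like pd.read_csv(..., chunksize=...):

```python
import heapq

import pandas as pd

chunks = [
    pd.DataFrame({'order_id': [1, 2], 'total': [420.5, 1100.0]}),
    pd.DataFrame({'order_id': [3, 4], 'total': [860.0, 775.25]}),
]

# Min-heap capped at size 2: the smallest candidate is always the one evicted
top = []
for chunk in chunks:
    for total, order_id in zip(chunk['total'], chunk['order_id']):
        heapq.heappush(top, (total, order_id))
        if len(top) > 2:
            heapq.heappop(top)

print(sorted(top, reverse=True))  # [(1100.0, 2), (860.0, 3)]
```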
Comparing nlargest to nsmallest for symmetry
When you learn nlargest, nsmallest is the obvious companion. They’re symmetric in syntax and behavior, so if you memorize one, you get the other for free. I use nsmallest for “lowest latency,” “smallest error,” or “minimum price” questions. The same tie‑break and NaN rules apply.
fastest = latency_data.nsmallest(5, 'p95_latency')
It’s just as readable and keeps your code consistent.
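Here is the same pattern as a runnable example, with a hypothetical latency table:

```python
import pandas as pd

latency_data = pd.DataFrame(
    {
        'endpoint': ['/home', '/search', '/cart', '/login', '/api', '/help'],
        'p95_latency': [120, 340, 95, 210, 60, 180],
    }
)

# Same syntax as nlargest, flipped direction
fastest = latency_data.nsmallest(3, 'p95_latency')
print(fastest)
```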
Debugging tips when nlargest gives a surprising result
When the output doesn’t match expectations, I run through this checklist:
1) Check dtype of ranking columns
2) Check missing values in ranking columns
3) Confirm whether ties exist at the boundary
4) Verify row order if keep='first' or 'last'
5) Reproduce with sort_values for comparison
That last step is helpful: if nlargest and sort_values + head disagree, the issue is almost always dtype or NaN handling.
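Step 5 of the checklist is easy to automate. This sketch deliberately starts with an object-dtype column to show the kind of disagreement it surfaces:

```python
import pandas as pd

df = pd.DataFrame({'value': ['10', '9', '100']})  # object dtype, as from a raw CSV

# String comparison ranks '9' above '100', which is rarely what anyone wants
print(df['value'].sort_values(ascending=False).tolist())  # ['9', '100', '10']

# After enforcing the dtype, nlargest and sort_values + head agree
df['value'] = pd.to_numeric(df['value'])
assert df.nlargest(2, 'value').equals(
    df.sort_values('value', ascending=False).head(2)
)
```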
A small helper function I reuse
When a pattern appears often, I wrap it in a function so I can enforce behavior consistently.
def top_n(df, n, columns, dropna=True):
    data = df
    if dropna:
        data = data.dropna(subset=columns if isinstance(columns, list) else [columns])
    return data.nlargest(n, columns)

# Usage
report_top = top_n(sales, 10, ['total', 'ordered_at'])
This helps keep the same dropna logic across notebooks and scripts. It also makes it easy to add logging around the number of rows dropped or to enforce data validation in one place.
How I document nlargest usage for teammates
Code clarity matters a lot in shared data projects. I usually add a brief comment or a docstring near the top‑N selection explaining the tie‑break rule. Something like:
- “Top 10 by revenue, tie‑break on most recent update.”
- “Top 5 by score, ties resolved by smallest customer_id for stability.”
This keeps the logic obvious and reduces confusion when someone sees a record that didn’t make the cut.
Final checklist before shipping a top‑N result
This is the checklist I run mentally when I finish a top‑N selection:
- Did I explicitly handle NaN values in the ranking columns?
- Are the ranking columns the correct dtype?
- Are tie‑breaks deterministic and aligned with the business question?
- Is N small relative to the dataset size (so nlargest makes sense)?
- Did I limit the output columns to what the downstream system needs?
If I can answer “yes” to those, I’m confident the selection will be stable and defensible.
Closing thoughts and next steps
When I need top‑N results, I reach for DataFrame.nlargest first because it reads like intent, avoids full ordering, and keeps my pipelines short. It shines when N is small, the ranking column is numeric or datetime, and ties are either unimportant or easy to resolve with a second column. The method isn’t magic, though. It can surprise you if you ignore NaN handling, leave columns as object dtype, or assume it supports mixed sort direction.
A simple way to decide: if the question is “Which N rows are biggest by this column?” use nlargest. If the question is “How do all rows rank?” use sort_values. If the question is “How do I break ties with custom logic?” use sort_values with explicit ordering. That three‑step mental check has saved me a lot of review cycles and a few production bugs.
If you want to deepen your workflow, start by adding a small data profile step before you rank: count missing values, check min/max, and confirm dtypes. Then build a repeatable top‑N helper function so your team can use the same logic across dashboards and scheduled reports. It’s a small investment that pays off in clarity, speed, and confidence.