Brantley Fights Functors

We Made the Isospectral Drums and it Went… Fine

2026-03-01T00:00:00+00:00

“Can One Hear the Shape of a Drum?”

This question was popularized back in the ’60s by the mathematician Mark Kac via an article published in the American Mathematical Monthly. If you haven’t heard the question before, I recommend Kac’s original article if you have some mathematical background, or the Wikipedia page if you don’t. Either way, I’ll summarize the meaning of the question here.

The Question

Say you have a drum. The head of the drum is a membrane stretched taut over a shell. The shape of that membrane is usually circular in real-world drums, but it doesn’t have to be; your drum head could be an ellipse, a square, or some weirder 2D shape.

Image by Paul Zoetemeijer

You might not be surprised to learn that different drum head shapes will produce different sounds. But you might be surprised to learn how tight the connection between the geometry and the sound actually is.

If you hit the drum, the membrane will vibrate. This vibration is what creates the sound of the drum. That sound can be broken down into a bunch of individual frequencies $f_1, f_2, f_3, \dots$. Here’s the surprising part. The set of frequencies does not depend on how hard you strike the drum or where you choose to strike it. Those variables matter for the amplitudes of those frequencies, but not the set of frequencies. The frequencies depend only on the geometry of the drum. So if you know the geometry of the drum, you can figure out the frequencies it is capable of producing.

\[\text{Geometry} \longrightarrow \text{Frequencies}\]

When Kac asked “Can one hear the shape of a drum,” he was asking this: can we reverse this arrow?

\[\text{Geometry} \overset{?}\longleftarrow \text{Frequencies}\]

That is, if I listen very closely to the sound of a drum, and I have very sensitive ears and perfect pitch so that I’m able to pick out all those frequencies $f_1,f_2,f_3,\dots$, can I in principle reverse-engineer the shape of the drum head that produced them? Is all of the geometric information encoded in those frequencies? Or is it the case that there are two drum heads that are different shapes, but nevertheless produce identical frequencies?¹

As an aside, why do mathematicians care so much about this question? It’s not because they’re especially interested in drums. It’s because “hearing the shape of a drum” can be reformulated as a natural question about the eigenvalues of the Laplace operator, which is an object that mathematicians just love. The Laplacian shows up in geometry, analysis, physics, topology, graph theory, data science, and on and on. The reason the Laplacian is related to drum vibrations is that the operator appears in the wave equation. If you want to play around with the wave equation in a data science context, check out my data sonification page!
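For readers who want to see the reformulation, here is a sketch of the standard idealized model. The symbols $\Omega$, $u$, $\lambda_n$, and $c$ are notation chosen for this sketch, not taken from Kac’s article. If the drum head occupies a region $\Omega$ of the plane and its edge is held fixed, the pure tones of the drum correspond to solutions of

\[-\Delta u = \lambda u \ \text{ in } \Omega, \qquad u = 0 \ \text{ on } \partial\Omega.\]

Each eigenvalue $\lambda_n$ contributes one frequency,

\[f_n = \frac{c}{2\pi}\sqrt{\lambda_n},\]

where $c$ is the speed of waves in the membrane, which is set by its tension and density. The eigenvalues depend only on the shape $\Omega$, so “hearing the shape of a drum” asks whether the list $\lambda_1 \le \lambda_2 \le \lambda_3 \le \dots$ determines $\Omega$.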

The question stood for over 20 years until the mathematicians Carolyn Gordon, David L. Webb, and Scott Wolpert published a paper with the wonderfully definitive title One cannot hear the shape of a drum. The paper describes (but does not explicitly depict) a pair of distinct shapes that, in theory, would produce identical frequencies if they were drum heads. Such a pair of shapes is called an isospectral pair.

Image credit: Wikipedia user Keenan Pepper. Based on a depiction of the Gordon-Webb-Wolpert pair by S. J. Chapman.

It’s been over 30 years since that paper was published. Since then, many more isospectral pairs have been discovered: pairs of drums that should theoretically produce identical frequencies.

I was shocked to learn that, to the best of my knowledge, no one has ever tried to actually build the drums.²

The Project

I recruited three undergraduate students at Ohio State via Cycle, an undergraduate research program at OSU: Yicheng Lin, Mohamed Musa, and Alex Theis. Together, we resolved to build the drums. Over the course of a school year, we met, learned about the underlying math, and iterated on drum designs.

Note the strangeness of the engineering task. Ideally, we would like to produce drums that we can strike and record, so that we can verify that they produce the same frequencies. For this project to work, our drums have to conform to the assumptions of the drum model as closely as possible. For instance, the math assumes that the head of the drum is a perfectly elastic membrane of uniform thickness, stretched uniformly in all directions, and perfectly fixed in place where it meets the shell. If that’s what the model assumes, we’ll have to make that happen if we want the process to work. It’s usually the task of the modeler to make their assumptions fit the real world as closely as is practical. It was now our task to make a real-world drum that conformed to the modeling assumptions.

We talked through many designs and materials. What we eventually landed on was this:

It’s more of a tambourine than a drum. Or maybe more of a sandwich than a tambourine.

For each drum, we 3D printed two discs, each half an inch thick. Each disc had a hole in the shape of one of the polygonal drum shapes. We stretched a latex balloon over one disc and sandwiched it under the other. We then clamped the sandwich together with wood clamps, which I inherited from my late grandfather, an accomplished carpenter. (Thank you, Grandpa!)

Reasons for our design decisions:

Latex balloons have a nice elasticity, which is an important part of the model. Also, Party City was going out of business at the time.
The model assumed that the edge of the membrane was fixed in place. We didn’t want the discs vibrating at all, so the discs were thick with a heavy infill.
Similarly, just stretching the balloon over one disc would leave the membrane free to lift up and away from the edge as it vibrated. Hence the sandwich.
Stretching the balloon uniformly in all directions would be easiest if we were stretching it over a circular object, hence the disc shape.
The model ignores everything but the head of the drum, so we didn’t bother to build a shell.
The more clamps, the slower the system lost its energy and the longer the membrane vibrated. We needed to get a good recording of the sound, so longer vibrations were important. This is why the four clamps were necessary.

The Results

We flicked the drums and they sounded… like strangely-stretched balloons.

Here is the sound from the first drum, which we called the “snake” drum:

Here’s the second drum, which we called the “cat” drum:

Note that the sounds from the drums are different pitches. This is expected. The pitch is determined in part by the amount of tension in the drum head, and we didn’t try to make the tensions match. Still, the relative pitches should match in theory. It was going to take more work to determine whether or not this was the case.
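To see why tension shifts the pitch without changing the relative pitches, here is a minimal sketch using a rectangular drum head as a stand-in. This is an assumption made for convenience: the rectangle has a simple closed-form spectrum, while our polygonal shapes do not. The dimensions and wave speeds below are hypothetical.

```python
import math

def rectangle_frequencies(a, b, c, count=6):
    """Lowest `count` frequencies of an ideal a-by-b rectangular membrane
    with fixed edges and wave speed c (c grows with the tension)."""
    modes = [
        (c / 2) * math.sqrt((m / a) ** 2 + (n / b) ** 2)
        for m in range(1, count + 1)
        for n in range(1, count + 1)
    ]
    return sorted(modes)[:count]

def relative(freqs):
    """Divide every frequency by the fundamental (the lowest one)."""
    return [f / freqs[0] for f in freqs]

loose = rectangle_frequencies(a=1.0, b=1.5, c=100.0)  # lower tension
tight = rectangle_frequencies(a=1.0, b=1.5, c=140.0)  # higher tension

# The absolute pitches differ, but the ratios to the fundamental agree.
for r_loose, r_tight in zip(relative(loose), relative(tight)):
    assert math.isclose(r_loose, r_tight)
print([round(r, 3) for r in relative(loose)])
```

Changing `c` rescales every frequency by the same factor, which is why comparing two drums at different tensions means comparing ratios rather than raw frequencies.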

We took the above recordings, which were made using a recorder with a high sample rate and bit depth, and visualized the underlying frequencies as a spectrogram using the program Wave Candy.

Drumroll please. (haha)

The horizontal axis represents time, and the vertical axis represents frequency. Each horizontal streak comes from a frequency peak that persisted after striking a drum. Each of the two patterns of streaks comes from striking one of the two drums.³

The spectrograms are… kinda similar? But not identical. What happened?

We expect that our biggest divergence from the model came from the way we stretched the balloons over the discs when forming the drums. The mathematical model assumes the drum head is under uniform tension in all directions, but our method for stretching the balloons over the discs was kind of haphazard. We expect we could get the spectrograms to match more closely by using a more precise method for stretching the balloons.

But alas, we used university money for the project, so the discs are now in a display cabinet in the math building. If anyone wants to give it a go, here are the 3D printer files:

Drum 1 (STL) | Drum 2 (STL)

Try it out! And if you do, please tell me about it!

Left to right: Yicheng Lin, Mohamed Musa, Brantley Vose (me), and Alex Theis.

If you’d like to say something, feel free to leave a comment on the related posts on mastodon or bluesky! Follow me on either one (or RSS) for more posts!

Many thanks to my friend Yang Yang for handling the 3D printing, and to Dr. Jim Fowler for design ideas. Thanks also to the Cycle program at Ohio State for funding the project.

Updated March 29, 2026:

Added sound files
Linked to the Sridhar and Kudrolli experiment

¹ The question is sometimes phrased by saying that a pair of counterexample drums should “sound the same”. This is not quite true. The geometry of a drum determines the set of frequencies that a drum can produce, not the relative amplitude of those frequencies. If you’ve ever heard muffled music through a wall, you know that the relative amplitudes of frequencies make a big difference in the way that something sounds. This is why I stick to the more precise phrase “produce identical frequencies”.
² While it’s true that no one has built the drums, someone has performed a physical experiment to verify their properties! Sridhar and Kudrolli, physicists at Northeastern, tested the spectral properties of these shapes by creating thin cavities shaped like the drum heads and observing how microwaves behaved inside. The details are in this paper (which unfortunately requires an institutional login).
³ As I mentioned, the geometry only determines the relative frequencies. The two drums produced different fundamentals when I struck them, so to make this image I pitched the second one up to get the spectrograms to match as closely as I could.

Playing Cards with Bayes’ Theorem

2025-12-02T00:00:00+00:00

Wizard 330 is a home-brewed variant of Wizard, a trick-taking card game. (Think Spades or Euchre. That sort of game.) Strategizing for the game raises an interesting problem that can be solved very cleanly with some probability. To understand the problem, you don’t need to understand the (admittedly very complicated) details of the game. All you need to know is the following:

Wizard 330 is played with a deck of 90 cards whose contents are unknown at the beginning of the game. The deck is a random subset of a larger library of 120 cards.
In particular, at the beginning of the game, the deck could have anywhere from zero to eight wizards, which are special, powerful cards. It is important for strategy to have a sense of how many wizards are in the deck.
The game has 15 rounds, and during each round, some number of cards are revealed. The first round reveals $1\times 6$ random cards, the next reveals $2\times 6$ random cards, up to the final round which reveals all $15\times 6 = 90$ cards.

This is what the eight wizards look like.

(The number of players, cards, and rounds actually varies from game to game, but this is a reasonable configuration so we’ll stick with it.)

From these three facts, you might correctly guess that it’s a good idea to look at the cards played each round and keep a running estimate of how many wizards are in the deck. The more wizards you see, the higher your guess should be.

This sounds like a job for statistics! We have an unknown parameter: the number of wizards in the deck, which we’ll call $w$. Round $k$ gives us $6k$ samples from the deck without replacement, though the deck is replaced and re-shuffled between rounds. Let’s cook up an estimator for the number of wizards!

Maximum Likelihood

When making a point-estimator, maximum likelihood is a good place to start. Unfortunately, the maximum likelihood estimator here is quite bad. For instance, it’s very common to see no wizards at all in the first round of the game. After all, there are at most eight wizards in the whole deck, and the first round only reveals six cards. If no wizards are seen in the first round, the maximum likelihood estimator for $w$ is zero, since the likelihood of seeing no wizards when $w=0$ is a flat 100%. However, zero-wizard games are phenomenally rare, so zero seems like a poor estimate.

Implicitly, when I talk about the rarity of zero-wizard games, I’m leaning on a prior belief about the number of wizards in the deck. This suggests we should do something Bayesian. Let’s try it!

Bayesian Estimator

Remember that our deck is a random subset of a larger 120-card library. This gives us a natural prior distribution for $w$; it should follow a hypergeometric distribution! After each round, we can use the number of wizards we saw to update the distribution using Bayes’ theorem. This gives us more than just a point-estimate; we get a probability distribution for the number of wizards. If we really want a point-estimate, we can always take the expectation of this distribution.

Let’s nail down some specifics. Let $w$ be the number of wizards in the deck and $X_j$ be the (random) number of wizards seen in round $j$. Suppose we observe the first $k$ rounds of play and see that $X_1=x_1,\dots, X_k=x_k$. Then an application of Bayes’ theorem tells us that the weight assigned to the hypothesis that $w=w_0$ should be proportional to

\[\mathbb P(w = w_0) \prod_{j=1}^k\mathbb P(X_j=x_j \mid w=w_0).\]

The distribution of $w$, and the distributions of each $X_j$ conditioned on $w$, are all hypergeometric distributions, so we can easily compute these values.
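Here is a minimal sketch of that calculation, using only the numbers stated above (a 120-card library with eight wizards, a 90-card deck, and $6j$ cards revealed in round $j$). It is not necessarily the script behind the plots, and the function names and the sample observations are hypothetical.

```python
from math import comb

LIBRARY, LIBRARY_WIZARDS, DECK, PER_ROUND = 120, 8, 90, 6

def hypergeom_pmf(successes, population, tagged, draws):
    """P(exactly `successes` tagged cards among `draws` cards drawn
    without replacement from `population` cards, `tagged` of which are tagged)."""
    ways = comb(tagged, successes) * comb(population - tagged, draws - successes)
    return ways / comb(population, draws)

def posterior(observed):
    """Posterior over w, where observed[j - 1] is the wizard count seen in round j.
    Assumes the observations are possible for at least one value of w."""
    weights = []
    for w0 in range(LIBRARY_WIZARDS + 1):
        weight = hypergeom_pmf(w0, LIBRARY, LIBRARY_WIZARDS, DECK)  # prior
        for j, x in enumerate(observed, start=1):
            weight *= hypergeom_pmf(x, DECK, w0, PER_ROUND * j)  # likelihood
        weights.append(weight)
    total = sum(weights)
    return [weight / total for weight in weights]

# Hypothetical first three rounds: 0, then 1, then 0 wizards seen.
dist = posterior([0, 1, 0])
print(sum(w * p for w, p in enumerate(dist)))  # posterior expectation of w
```

Calling `posterior([])` returns the prior itself, whose expectation is $90 \cdot 8 / 120 = 6$ wizards.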

I coded this up! Here are the running estimates using some real data from a game that turned out to have five wizards:

This is a bubble plot. Each column shows the distribution of wizard counts after a certain number of rounds. The game progresses from left to right. For instance, the first column shows the prior distribution; it’s the distribution for the number of wizards in the deck after zero rounds of play. After each round, we use the observed cards to update the distribution, creating the next column. The red line tracks the expectations of these distributions.

If we watch the columns from left to right, we see the distribution start out dispersed across the possibilities and slowly coalesce as we learn more and more about the deck. Finally, after the fifteenth round, every card has been revealed, so the distribution condenses at the true value of five. This gives us the big bubble in the last column.

In this example, this estimator leaves something to be desired. The estimates are consistently higher than the true value. The distributions favor a six-wizard outcome until the very end.

I think we can do better.

Better Bayesian Estimator

We are not actually using all of the information at our disposal. You see, wizards actually come in four distinct suits.

There are two wizards of each suit in the full 120-card library. The four suits are unnamed because they’re irrelevant for gameplay, so we just call them (from left to right in the image) sun, grape, bee, and seahorse wizards. If we keep track not just of how many wizards we see each round, but also how many wizards of each suit, we’ll get a better estimator!

We’ve bought ourselves a trickier math problem. We’re no longer estimating $w$, the number of wizards; we’re now estimating $\vec w$, a vector of length four, where the entry $\vec w_i$ is the number of wizards of the $i$th suit in the deck. (I usually hate the vector arrow notation like $\vec w$, but in this case it is helpful for telling vectors and scalars apart.) Our estimator is defined similarly to the previous one. If we observe the first $k$ rounds and find that the vectors of observed wizards are $\vec X_1 = \vec x_1,\dots,\vec X_k = \vec x_k$, then the hypothesis that the true vector of wizards is $\vec w = \vec w_0$ should have probability proportional to

\[\mathbb P(\vec w = \vec w_0) \prod_{j=1}^k\mathbb P(\vec X_j=\vec x_j \mid \vec w= \vec w_0).\]

The distributions involved here are a little more complicated, but still simple enough to have a name! The distribution of $\vec w$ and those of each $\vec X_j$ conditioned on $\vec w$ are all multivariate hypergeometric distributions.

Anyway, once we have a posterior distribution for $\vec w$, the number of wizards of each suit, we can easily sum up some probabilities to get a distribution for the total number of wizards.
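A minimal sketch of the multivariate version, again assuming the configuration described above (two wizards of each of four suits in a 120-card library, a 90-card deck, $6j$ cards revealed in round $j$). The function names and example observations are hypothetical, and this is not necessarily how my script is organized.

```python
from itertools import product
from math import comb, prod

LIBRARY, DECK, PER_ROUND = 120, 90, 6
SUITS, PER_SUIT = 4, 2  # two wizards of each of four suits in the library

def prior(w):
    """P(deck holds w[i] wizards of suit i): multivariate hypergeometric."""
    others = LIBRARY - SUITS * PER_SUIT
    ways = prod(comb(PER_SUIT, wi) for wi in w) * comb(others, DECK - sum(w))
    return ways / comb(LIBRARY, DECK)

def likelihood(x, w, draws):
    """P(seeing x[i] wizards of suit i among `draws` cards | deck holds w)."""
    ways = prod(comb(wi, xi) for wi, xi in zip(w, x))
    return ways * comb(DECK - sum(w), draws - sum(x)) / comb(DECK, draws)

def posterior_total(observed):
    """Posterior over the total wizard count; observed[j - 1] is the
    per-suit tuple of wizards seen in round j."""
    totals = [0.0] * (SUITS * PER_SUIT + 1)
    for w in product(range(PER_SUIT + 1), repeat=SUITS):
        weight = prior(w)
        for j, x in enumerate(observed, start=1):
            weight *= likelihood(x, w, PER_ROUND * j)
        totals[sum(w)] += weight  # sum suit-level weights into totals
    norm = sum(totals)
    return [t / norm for t in totals]

same_suit = posterior_total([(1, 0, 0, 0), (1, 0, 0, 0)])
two_suits = posterior_total([(1, 0, 0, 0), (0, 1, 0, 0)])
print(same_suit[1], two_suits[1])
```

The last two lines show where the extra information comes from. Seeing the same suit in two rounds is still consistent with a one-wizard deck, while seeing two different suits rules that out. The univariate estimator cannot tell these two histories apart.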

Let’s code this up and apply it to the same five-wizard game. For comparison, here are the estimates we got from the old univariate estimator, followed by the estimates from the new multivariate one.

The difference is subtle but critical. Whereas the old estimator failed to coalesce on the five-wizard outcome until the very end, the new estimator successfully guesses five wizards as the modal outcome after round ten!

How practical is this for Wizard 330 strategy? The actual probability calculations would be very difficult to do without a computer. Maybe there’s a simpler calculation that’s approximately correct but much easier to do in your head, similar to blackjack card-counting approaches. I’m not sure.

So what do I do in practice? Thankfully, the people I play Wizard 330 with love to over-analyze the game, so we just run my script on a laptop where everyone can see live round-by-round analysis of wizard distributions. Half the fun of Wizard 330 comes from the simple joy of over-analysis.

What’s So Great About Polynomials?

2025-10-22T00:00:00+00:00

American math education is notorious for introducing mathematical concepts long before it introduces the reason to care about them. Case in point: polynomials. You know them. They’re the functions that look like this:

\[f(x) = x^2 - x\]

or like this

\[f(x) = -3x^3 + x^2 - \frac{1}{2} x - \pi.\]

You probably learned about these when you took Algebra. If you’re like most students, you weren’t immediately given a very good reason to care!

In this post, I’ll try to patch this hole by motivating polynomials for a student at a low level. They’re not as unnatural as they might first appear; there is at least one good reason to care about polynomials!

Writing Down Functions is Hard

In your math education, you’ve probably spent a lot of time staring at weird functions like this:

or this:

Drawing weird squiggly functions is easy. But what if you actually wanted to write one of these down? Like, think about the function in the last picture here. How would you even start to write down a formula for that thing?

The thing I’m trying to draw your attention to here is a certain gap:

It’s way easier to draw a graph than to write down a formula.

If I give you some axes and a pen and ask you to draw a function, you have a lot of flexibility. But if I ask you to write a formula? Where do you start? It would be nice to have a similarly flexible way to write down formulas too.

Start Simple

Imagine I give you a little graphing calculator. You can type in a formula and it will graph it for you.

But the calculator is super simple. It doesn’t have many buttons, just the very basics:

The usual number buttons, which are 0 through 9 and a decimal point.
A “+” and “-” button (but no multiplication or division!)
I’ll even throw in parentheses (so long as you don’t use them to smuggle in any multiplication).

You can use those buttons however you want to come up with formulas, then press the “Go!” button to plot out the graph. Here. I made a detailed 3D rendering of what the calculator might look like:

Let’s try it out. You type in “1” and hit “Go!”. It makes a plot like

You type in “(6-8)+1.2”. That just simplifies to -0.8, so it makes a plot like

You quickly find that no matter what you do, all of your graphs are boring horizontal lines. This makes sense! The “formulas” you can type in don’t even include the variable $x$. Without more buttons, you can only make constant functions, like $f(x)=1$ or $f(x)=-0.8$. You can make (almost) any constant function you want! But nothing more.

Okay, I’m going to give you a new button now. It’s an $x$ button. It’s not multiplication! Don’t get greedy! It’s the variable $x$. You can now include $x$ in your formulas.

That unlocks some new possibilities! You punch in $1+x$ and hit “Go!”

It slopes! That’s new! You try $(x-1)+x-3.1x+5$.

As you keep messing around, you find that you can only make graphs that are straight lines. The functions you can make with this calculator are the linear functions! That’s something, but it’s not quite the level of flexibility we’re after. It’s still kind of boring.

Notice the pattern: each new capability we give to the calculator unlocks a larger class of functions for you. There is a tension between the simplicity of the calculator and the flexibility of the functions it lets you create.

Okay, I’m going to give you one more button. This will make the calculator a little more complicated, but this button will change everything.

Now you can go wild. You can type in things like

x*x
x*(3-x)
-(2.1-0.1x)*x-2x*x
(1-(2-(3-x)*x)*x)*x
x*x*x*x*x*x*x*3*x

and the calculator is spitting out graphs that are all over the place:

We started with very basic arithmetic operations. We added one button at a time, each one unlocking a little extra power. With that last button (multiplication), we hit a critical mass of possibilities. The functions that you have unlocked now are the polynomials.

The usual definition of “polynomial” is about being a sum of monomials or whatever. We’ve just laid out a very different but equivalent definition: polynomial functions in $x$ are the functions one can produce using addition, subtraction, multiplication, constants, and $x$.
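To make the equivalence a little more concrete, here is a small sketch. The `Poly` class is a hypothetical helper that stores a polynomial as its list of coefficients and supports only the calculator’s buttons: constants, $x$, addition, subtraction, and multiplication. Whatever you type, the result is still just a list of coefficients, which is to say a sum of monomials.

```python
class Poly:
    """A polynomial stored as coefficients [c0, c1, c2, ...] of 1, x, x^2, ..."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)

    @staticmethod
    def lift(value):
        # Treat plain numbers as constant polynomials.
        return value if isinstance(value, Poly) else Poly([value])

    def __add__(self, other):
        other = Poly.lift(other)
        n = max(len(self.coeffs), len(other.coeffs))
        a = self.coeffs + [0] * (n - len(self.coeffs))
        b = other.coeffs + [0] * (n - len(other.coeffs))
        return Poly([p + q for p, q in zip(a, b)])

    def __neg__(self):
        return Poly([-c for c in self.coeffs])

    def __sub__(self, other):
        return self + (-Poly.lift(other))

    def __rsub__(self, other):
        return Poly.lift(other) - self

    def __mul__(self, other):
        other = Poly.lift(other)
        out = [0] * (len(self.coeffs) + len(other.coeffs) - 1)
        for i, p in enumerate(self.coeffs):
            for j, q in enumerate(other.coeffs):
                out[i + j] += p * q  # x^i times x^j lands on x^(i+j)
        return Poly(out)

    __radd__ = __add__
    __rmul__ = __mul__

x = Poly([0, 1])  # the "x" button

print((x * (3 - x)).coeffs)                        # [0, 3, -1]
print(((1 - (2 - (3 - x) * x) * x) * x).coeffs)    # [0, 1, -2, 3, -1]
```

The first output reads as $3x - x^2$ and the second as $x - 2x^2 + 3x^3 - x^4$.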

Polynomials Can (Approximately) Do Anything

What do I mean when I say we’ve hit a “critical mass of possibilities” when we reach the polynomials? It’s more than just the fact that we can make wiggly functions. You can, in some sense, make any function you want using a polynomial.

Specifically I mean the following. You give me a drawing of a function. If that function doesn’t do anything crazy (like jumps or holes or weird stuff like that) then I can come up with a polynomial which matches the graph you drew as closely as you want. (This fact is called the Weierstrass approximation theorem.)
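Here is one way to watch this happen. The sketch below uses Bernstein polynomials, which are one classical recipe for building the approximating polynomials (there are others), and it works on the interval from 0 to 1. The `tent` function is a hypothetical stand-in for “a drawing you gave me”.

```python
from math import comb

def bernstein(f, n):
    """Return the degree-n Bernstein polynomial of f on the interval [0, 1]."""
    samples = [f(k / n) for k in range(n + 1)]

    def p(x):
        return sum(
            samples[k] * comb(n, k) * x**k * (1 - x) ** (n - k)
            for k in range(n + 1)
        )

    return p

def max_error(f, p, points=201):
    """Largest gap between f and p on an evenly spaced grid over [0, 1]."""
    grid = [i / (points - 1) for i in range(points)]
    return max(abs(f(t) - p(t)) for t in grid)

def tent(x):
    # A continuous "drawing" with a sharp corner at x = 0.5.
    return abs(x - 0.5)

# The worst-case gap shrinks as the degree grows.
for degree in (5, 20, 80):
    print(degree, max_error(tent, bernstein(tent, degree)))
```

The gap shrinks slowly for this recipe, especially near the corner, but it can be made as small as you like by raising the degree.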

In short, one good reason to care about polynomials is that they are the simplest class of functions that is still flexible enough to let you match (approximately!) any other reasonable function.

You might be thinking about a missing button on our calculator: “What about division?” Well, if you add a “$\div$” button to that calculator, you’ll get even more freedom. The functions you’ll unlock with that extra button are the rational functions! They can do things that polynomials can’t do, like shoot off to infinity in the middle of the graph.

Probabilities Are Less Real Than You Think

2024-12-15T00:00:00+00:00

“Sir, the possibility of successfully navigating an asteroid field is approximately three thousand seven hundred and twenty to one!” – C-3PO

Probability statements tend to carry some of the gravitas and authority of mathematics. I claim that we tend to take these statements too seriously. Real world events do not have intrinsic probabilities.

Let’s start with an anecdote.

A Story About A Big Number

I once worked as a counselor at the Ross program, which is a summer camp for mathematically inclined high school kids. (If you are such a high school kid, you should apply!) During the afternoons, the counselors would often collect around tables in the library to “grade,” meaning we would all collaborate to distract each other from our grading duties.

One “grading” afternoon, a counselor pulled a random book from the library shelf behind him. He found a checkout receipt inside with a transaction number printed on it. He laid it in the middle of the table.

“Do you guys think this number is prime?”

We all stared at the number for a moment. It was about twelve digits long. It wasn’t obviously not prime.

I don’t know if you’ve ever been in a room with a bunch of mathematicians who don’t want to do their work. If you have, it will come as no surprise that we spent an hour trying to determine whether this number was prime. Finally we gave up and punched it into an online primality tester. It was prime. Our distraction had ended.

“Huh. It’s prime,” one counselor mused. “What were the odds?”

Another pause.

“Hey yeah, what were the odds?”

On this new question, we successfully burnt another hour.

The question we had produced was “What was the probability that this specific number would be prime?” The interesting and odd thing about this question is that, at face value, it doesn’t make sense. The statement “$N$ is prime” is deterministic, not random. There are no probabilities to be seen here. At the same time, intuitively it makes perfect sense. We knew that large prime numbers were relatively rare, and our specific twelve-digit number felt randomly selected. So maybe there’s some way to make the question make sense? Here are some formalizations we kicked around.

What is the probability that $N$ is prime if $N$ is selected…

uniformly at random from all 12-digit numbers?
uniformly at random from all 12-digit numbers that aren’t divisible by 2 or 5 (since otherwise we wouldn’t have asked if it was prime)?
uniformly at random from all 12-digit numbers that aren’t divisible by 2, 5 or 3 (since these were all the divisors we checked for before we “really got into it”)?
from a Poisson distribution whose mean is our 12-digit number?

In other words, we needed to select a random model for our deterministic question. I think any one of the above models would be a reasonable choice, and they would all give different answers. We must conclude that there is no such thing as “the” probability that $N$ is prime. That probability will depend on modeling decisions.¹
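To get a feel for how much the first three models disagree, here is a rough sketch using the prime number theorem heuristic that about 1 in $\ln N$ integers near $N$ is prime. Two assumptions to flag: the value of `N` below is a hypothetical stand-in (I’m not reproducing the number from the receipt), and the density is evaluated near that single value rather than averaged over all 12-digit numbers.

```python
import math

def prime_density(n):
    """Prime number theorem heuristic: near n, about 1 in ln(n) integers is prime."""
    return 1 / math.log(n)

def density_given_no_small_factors(n, excluded_primes):
    """Same heuristic, restricted to integers divisible by none of excluded_primes.
    Every prime near n survives the restriction, so the density goes up."""
    surviving_fraction = math.prod(1 - 1 / p for p in excluded_primes)
    return prime_density(n) / surviving_fraction

N = 500_000_000_000  # hypothetical 12-digit stand-in, not the receipt's number

print(prime_density(N))                               # any 12-digit number
print(density_given_no_small_factors(N, [2, 5]))      # not divisible by 2 or 5
print(density_given_no_small_factors(N, [2, 3, 5]))   # not divisible by 2, 3, or 5
```

Under these assumptions the three answers come out to roughly 4%, 9%, and 14%. Same number, same question, three different “probabilities”.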

The Flipped Quarter

The prime number example shows that sometimes it doesn’t make sense to talk about “the” probability of an event until you’ve made some modeling decisions. Even the simplest real-world probability calculation depends on modeling decisions. For instance, suppose I take out a quarter and flip it into the air. What’s the probability that it comes up heads?

The simplest answer is, of course, 50%. Each of the two faces will come up with equal probability. Easy.

But wait, I seem to remember that a quarter isn’t perfectly balanced, and that one side has a roughly 51% probability of coming up. But I don’t remember which face, so I guess the answer is either 51% or 49% and I just don’t know which.

On the other hand, if I don’t remember which face is heavier, maybe I should factor in my uncertainty by taking the expected probability over both the heads-more-likely scenario and the tails-more-likely scenario. That will bring me back to an answer of 50%, but for more sophisticated reasons this time.
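Written out, and assuming I put equal weight on the two scenarios (that 50/50 split is itself a modeling choice), the calculation is

\[\mathbb P(\text{heads}) = \tfrac{1}{2}(0.51) + \tfrac{1}{2}(0.49) = 0.50.\]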

But wait, let’s back way up. I said that I’ve already flipped the coin in the air. A coin tumbling through the air is a classical system. If I had the time and enough information, I could calculate how the coin will fall. It’s no different than if the coin had already landed and I just haven’t looked at the result yet. So I guess it also makes sense to say the probability is either 0 or 1 and I just don’t know which.

As long as we’re getting into physics, if we want to be super pedantic, is this system actually deterministic? Isn’t there some basically-but-not-quite-zero probability $\varepsilon$ that some freak quantum event will reverse the outcome of my coin flip? So maybe we should amend our “0 or 1” answer to an answer of either $\varepsilon$ or $1-\varepsilon$.

You may have strong feelings about which model is the best, but in actuality, it depends! What do I want my probability to represent? The unknown starting conditions of a classical system? My uncertainty about the imbalance of the coin? Quantum randomness? This may sound like a Bayesian vs. frequentist question, but I don’t think it is. I could argue for any of the above models from either a Bayesian or a frequentist perspective.

Even the simplest of applications of probability, the flip of a coin, forces you to choose a model. There is no such thing as “the” probability that a quarter will come up heads. That probability only makes sense relative to a random model.

The Weather And Beyond

We’ve discussed quarter flipping. Now think about more complicated real-world systems about which we routinely make probability statements:

travel times
stock prices
elections
God forbid, the weather

If the probabilities involved with a simple quarter flip are already hiding implicit modeling decisions, how fraught must be our probability statements about something as complicated as the weather? If there’s no such thing as “the” probability of a coin coming up heads, then any discussion about “the” probability of rain tomorrow is just outlandish. In order to rigorously calculate such a probability, one would have to make a million judgement calls about how to formalize the question. If some weather prediction service claims that the probability of rain tomorrow is 11%, they may just mean that they ran their favorite stochastic model a hundred times and it rained in eleven of those simulations. A different stochastic model would produce a different, equally valid number.²

My point is not that no one should use the phrase “the” probability. My point is simply that, when someone discusses probabilities of real-world events, be aware that they have hopped over a conceptual gap. They have taken a messy, real-world system and made implicit decisions about how to model it with the formalism of probability theory. They have performed an act of translation, of interpretation, which requires real decisions. The number is not a fact about the world. It is a fact about a model.

I encourage you to take two lessons away from this post:

1) If you want to calculate probabilities of real-world events, nothing can save you from making modeling decisions.

2) If someone tells you a probability, know that it is only as true as the model from which it was calculated.

¹ In the end, we selected a model where $N$ was deterministic, but its primality was random, with probability determined by the asymptotic density of primes given by the prime number theorem (modified to ignore multiples of 2 and 5 since they are “obviously composite”).
² This post, especially the weather example, was inspired by discussions with Dr. Robert C. Williamson, who is fond of using scare quotes when discussing “the” probability of an event. He and his research group study, among other topics, the assumptions underlying applications of probability theory.

Paper Explainer: Geometry and Stability of Supervised Learning Problems

2024-03-16T00:00:00+00:00

Just released a new paper! In it, my coauthors and I try to make sense of some challenges in machine learning by creating a “space of all problems”. If you don’t know what that means, that’s okay! This post explains the big ideas for non-mathematicians.

What is Supervised Learning?

Suppose you’ve got some data on the IQ and SAT scores of a bunch of people, and the data looks like this:

(Note: I made this data up. Don’t believe it.) Using this data, can you use someone’s IQ score to get a rough estimate for their SAT score? Sure, you could fit a trendline to the data using some good ol’ linear regression. It’ll look something like this:

Now if you know someone’s IQ (say, 110), you can predict what their SAT score might be using the trendline (in this case, about 1207).

Congratulations! You just took part in supervised learning! You used an algorithm to…

take data about the relationship between two variables $x$ and $y$, and
use that data to choose a prediction function that maps any known $x$ to a prediction for $y$.

For the most part, that’s all that supervised learning is! Everything from linear regression to neural networks follows this same basic blueprint. Of course, neural networks use more complicated data to select fancier functions, but at heart it’s the same idea. (There are some extra details in supervised learning, like “How do I pick the actual function?” and “How do I know my chosen function is a good fit?” These are details we won’t need to worry about here.)
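The two steps above can be sketched in a few lines for the trendline case. The data points here are made up for illustration (they are not the points from the plot, so the prediction won’t reproduce the 1207 figure), and `fit_line` and `predict` are hypothetical names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Step 1: take data about the relationship between x and y (made-up values).
iq = [85, 95, 100, 110, 120, 130]
sat = [900, 1010, 1100, 1190, 1320, 1400]

# Step 2: use the data to choose a prediction function.
slope, intercept = fit_line(iq, sat)

def predict(x):
    """Map a known x to a prediction for y using the fitted trendline."""
    return slope * x + intercept

print(round(predict(110)))
```

Fancier methods swap out `fit_line` for something more elaborate, but the shape of the process (data in, prediction function out) stays the same.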

The “learning” in supervised learning is there because the algorithm is “learning” the relationship between your input and output variables, such as IQ and SAT scores. The “supervised” part is because the examples you give to the algorithm come with the correct answers. Each data point is an example input (the $x$ value) together with the “correct answer” (the $y$ value). The algorithm tries to come up with a relationship that matches the correct answers as closely as possible. If, instead, your algorithm just took a bunch of $x$ values and tried to find patterns without knowing the $y$ values, that would be an example of unsupervised learning, which we won’t get into here.

Supervised Learning Headaches

Let’s say you sit down to work on a supervised learning problem. In a perfect world, you would…

Have access to unlimited amounts of data
Have data with no noise or inaccuracies
Formulate the problem in a mathematically elegant way without having to worry about how hard anything is to compute
Be able to select the function that represents the actual true relationship between the inputs and outputs.

In practice, of course, you get none of these things. You instead must…

Make the best of less data than you want
Use noisy or inaccurate data
Approximate your elegant math with simpler math that is easier for computers to work with
Guess what kind of function will be a good fit and narrow your attention to those

For instance, in the IQ/SAT example above, we had very little data (only 40 points). Our data might be inaccurate because people might misremember, round, or embellish their scores. Or maybe the data is inaccurate because I asked people on, say, a college campus, who will tend to have higher test scores than the general population. Also, when we decided to fit a trendline, we were narrowing our focus to only linear functions. If the true relationship between the input and output doesn’t look like a straight line, this is a serious limitation.

We could actually think of these as two different problems. First there is the ideal, perfect-world problem. This is the problem you want to solve. There is also the corrupted, imperfect, real-world problem, which is the problem you get to solve. The problem you want to solve is certainly different than the problem you get to solve. How different will depend on the magnitude of the noise, bias, etc. We know that the ideal and actual problems are different, but we’d like some kind of guarantee that they aren’t too different.

Now, there’s a lot of research on how much a certain compromise will change a problem. “If I add X amount of Y kind of noise to a problem, it will change the problem in Z way.” Things like that. But those results tend to be pretty specific. They’re about what happens to a problem if you change one thing about the problem. But that’s not how it works in the real world; you gotta accept a whole bunch of compromises at once. Is there a good way to think about making a whole bunch of changes to your problem at once?

Thinking with Geometry

Let’s think about all these changes geometrically. Here’s a picture (adapted from the paper) to help us do so:

We start with the “ideal problem” on the left. That’s the perfect problem we’d like to get our hands on. Each compromise, represented by a red arrow, pushes the problem a little bit in some direction. Adding noise? That changes the problem. Approximating some hard math with easier math? That changes the problem too. Each of these changes pushes the problem, and those pushes add up, leaving us with the “actual problem” in the upper right.

This picture is a helpful visualization, but kind of vague. If we could make actual mathematical sense of this picture, we could use geometry to make sense of the difference between the ideal problem and the actual problem.

This is exactly what we do in the paper! We invent the space of supervised learning problems, in which each point represents a different problem. In particular, the “ideal problem” and “actual problem” are represented by points in this big space, just like in the picture above. Want to know how different those two problems are? Well, if you’ve got a good idea about how long each of the red arrows is, you’ve got a good idea about how far apart the problems can be! Geometry solves the problem!

Here’s the reason this is really useful: any kind of change to a problem could be represented with a red arrow. And you can chain together any number of the red arrows. So our framework can handle all sorts of simultaneous changes to a problem.
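In symbols, the chaining idea is just the triangle inequality. As a sketch (the notation here is shorthand for this post, not necessarily the paper’s): write $P_0$ for the ideal problem, $P_n$ for the actual problem, $P_1, \dots, P_{n-1}$ for the problems you pass through after each compromise, and $d$ for the distance between problems. Assuming $d$ obeys the triangle inequality, as any reasonable notion of distance does,

\[d(P_0, P_n) \le \sum_{i=1}^{n} d(P_{i-1}, P_i).\]

Each term on the right is the length of one red arrow, so bounding the arrows one at a time bounds the total gap between the ideal and actual problems.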

What’s Actually in the Paper?

For anyone interested, we’ll get into the technical weeds just a little bit more to explain how we did any of this. In the paper, we invent the space of supervised learning problems. To actually pull this off, we take inspiration from another field of math called optimal transport. The name sounds pretty dry, but that’s just because it’s a field of math that was founded to deal with problems in economics, and economists are famously dry people. All that you need to know about optimal transport is that it turns out to be really useful for doing geometry on things that don’t obviously have any geometry going on. Anybody can do geometry in 2D space, but only with optimal transport can you do geometry in, for instance, the space of spaces themselves! Very meta! Very cool!

Anyway, the good people in optimal transport theory have a tried-and-true method for putting geometry where it doesn’t belong. All we have to do is apply this method to supervised learning problems and presto! We’ve unlocked the power of geometry!

Most of the actual paper is dedicated to developing this geometry on the space of all supervised learning problems and showing that the geometry isn’t weird. Specifically, we prove that…

problems you’d expect to be close together actually are in our geometry, and
problems that are close together under our geometry really do deserve to be called “similar”.

If you’re interested in more details, check out the paper! The whole thing is pretty long, but the introduction is a more technical overview of the paper and it’s short.

Thanks for reading!

What Does an OSI Layer Even Mean?

2022-08-31T00:00:00+00:00

I had a tough time wrapping my head around the OSI network model. In my experience, the layers are always presented just as layers, without any explanation of what a layer actually represents or what it means for one layer to sit on top of another, or even what is being “modeled” by this model. Instead, people just say “Layer 4 includes things like TCP and UDP, it worries about letting processes talk, it introduces ports, yada yada yada.”

Eventually I figured it out, and in this post I want to lay out what I think is a useful perspective: a layer is a level of abstraction that solves a specific problem.

(Note: I’ll be talking about the simplified five-layer version of the model, as opposed to the more detailed seven-layer OSI model. While the specific layers are different, the ideas are the same.)

What is a Layer?

Consider two computers communicating through a network. Maybe they’re sending text messages, or video chatting, or requesting and serving webpages. There is a deep stack of technology on which this whole ongoing communication rests, so we have multiple scopes from which we can think about this communication process. For instance, if we are the users of the machines in question, we are probably thinking of this process from the most zoomed-out perspective we can: as data being sent from one application to another (e.g. Firefox is making requests to Apache, my Skype is sending live video to your Skype, etc.). This perspective is practical for an end-user. If, on the other hand, someone asked you if that process is magic, you would probably drop to a lower level of abstraction and say “Of course not, my laptop is just sending out a stream of pulses over the air to my wifi router.” These are two very different abstraction levels. There are also many additional levels in between these two extremes.

People who have to get their hands dirty with networking stuff have to be comfortable switching between these layers of abstraction. How do we keep them all straight in our head? This is where the OSI model comes in. Each layer in the OSI model represents a layer of abstraction. This fact gets muddled when people say that layers consist of protocols or data units.

In the example above, thinking about communication as “my program talking to your program” is thinking at the highest level of abstraction, which the OSI model calls the Application Layer. Thinking of the communication as pulses flying over the air and across wires is thinking at the lowest level of abstraction, called the Physical Layer. These layers are well-named. If we’re thinking on the Application level of abstraction, there’s no rubber-meets-the-road computery stuff to worry about, just my Skype application and yours, and the video they’re sending back and forth. Thinking this abstractly, we aren’t thinking about anything too nitty-gritty, least of all the physical bits flying through the air and across the internet. In this sense, the application layer covers the layers below it, which is why it is useful terminology to call it a “layer”. On the other hand, if we think about this process on the Physical layer of abstraction, the bits are all that matter. My laptop is sending bits across the air to my wifi router, which is sending bits to somewhere else via ethernet, etc.

For something as complicated as computer networks, multiple layers of abstraction are necessary. This is what makes the OSI model useful.

Why these specific layers? And where do the protocols come in?

The layers in the OSI model aren’t just arbitrarily chosen abstraction levels. There are five fundamental problems of networking, each happening at a different level of abstraction, and each layer of the OSI model is chosen to match up with one of those problems.

Let’s pretend we’re inventing computer networks from scratch and see what problems we run into. Let’s say my end goal is to send your Skype a video from my Skype.

I want to figure out how our Skype applications can talk.
Before I can do that, I need to figure out how to let two processes talk to each other.
Before I can do that, I need to figure out how to let two distant machines talk to each other across a network.
Before I can do that, I need to figure out how to let two connected machines talk to each other.
Before I can do that, I need to figure out what it even means for two physical objects to talk to each other.

As we go down the list, the entities that want to talk get less and less abstract. We of course must solve these problems from the bottom up. As we solve each one, we will make a library that handles our solution and lets us forget about all the details, effectively abstracting that problem away.

The solutions to each of these problems will follow this general format: just use the library I just came up with for the last problem, but with a bit of extra overhead information attached to solve my new problem. An agreement about the format for that extra information, along with any promises about how that information will be handled, is called a protocol.

For instance, let’s say I’ve solved the bottom problem on the list (the layer 1 problem) and I’m ready to move up to the next one (the layer 2 problem). I have a software library that will send bits over a wire. Great! It’s tempting to just say that I’ll communicate machine-to-machine by sending the necessary bits over a wire, but we’ll run into some practical problems. I can send you bits, but how do I tell you where one transmission stops and the next one starts? And what if a transmission is really big? How do I send that whole thing without overflowing your buffer? Also, how do I ask if you’re the machine that I think you are? The solution is to agree on a protocol; we make a contract about how many bits I can send you at once, how I tell you where transmissions start and stop, how we’re going to deal with names, etc., and we wrap each chunk of the transmission in a header and footer with all that extra information in an agreed-upon, precisely defined format. Presto! We have invented a layer 2 protocol and solved the fundamental problem that sits at this layer. We get to make a library implementing this brand new protocol, letting us forget about the details of our solution, and move on to the problem for layer 3. Rinse and repeat.
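Here is a toy sketch of what such an agreement could look like in Python. Everything about the format is made up for illustration (a 1-byte machine id and 2-byte length in the header, a 1-byte checksum in the footer, a tiny size limit); real layer 2 protocols like Ethernet are more involved.

```python
import struct

MAX_PAYLOAD = 4  # hypothetical limit on bytes per frame, tiny for demonstration

def make_frames(dest_id, data):
    """Split data into chunks and wrap each chunk in a header and footer."""
    frames = []
    for start in range(0, len(data), MAX_PAYLOAD):
        chunk = data[start:start + MAX_PAYLOAD]
        header = struct.pack("!BH", dest_id, len(chunk))  # who it is for, how long it is
        footer = struct.pack("!B", sum(chunk) % 256)      # very simple checksum
        frames.append(header + chunk + footer)
    return frames

def read_frame(frame, my_id):
    """Unwrap one frame. Return the payload, or None if it is not addressed to us."""
    dest_id, length = struct.unpack("!BH", frame[:3])
    payload = frame[3:3 + length]
    (checksum,) = struct.unpack("!B", frame[3 + length:3 + length + 1])
    if dest_id != my_id:
        return None
    if checksum != sum(payload) % 256:
        raise ValueError("corrupted frame")
    return payload

frames = make_frames(dest_id=7, data=b"hello world")
received = b"".join(read_frame(f, my_id=7) for f in frames)
print(received)
```

The header answers "are you the machine I think you are?" and "where does this chunk end?", the size limit keeps any one frame from overflowing a buffer, and the footer lets the receiver notice damage. Code written on top of `make_frames` and `read_frame` never has to think about those questions again.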

Each layer ends up getting its own protocol. In fact, most of the problems listed above will have multiple perfectly valid solutions which are useful in different circumstances, so most layers will get multiple alternative protocols to choose from, representing those alternative solutions, each with their own pros and cons. For instance, suppose I’m inventing a protocol to solve the 4th layer problem. If it’s really important in my situation that no data is lost in transit, my solution will be different than if speed is a higher priority. In the first case I might invent TCP, while the second might lead to UDP. One problem, two solutions, means two protocols at the same layer.

Once these problems are solved, I can send you Skype data. To do this in practice, we end up unraveling our solutions at each layer in reverse order. It might look something like this:

Skype packages up a chunk of my live video into Skype's proprietary format. It wants to send it to your Skype process. How will it get there? Not this layer's problem, so Skype calls a library and feeds it the chunk of video data. Skype's job is done.
The data is given a UDP header with the port number for your process so that when it gets to your computer, it gets to the Skype listening process and not your Plex server or whatever other network-enabled processes you have running on your machine. How will the data get to your machine? Not this layer's problem, so it calls a library. This layer's job is done.
THAT data, header and all, is wrapped in an IP header. This header has information about how to transport this data across the internet to your computer, but not data about where to go next. That's not this layer's problem, so it calls a library. This layer's job is done.
THAT data, double header and all, is wrapped up with ANOTHER header (and footer). This one is an Ethernet header (footer) with information about which physical machine this packet needs to get to next. Where physically is that physical machine? Which physical wire does my machine need to send this across? Not this layer's problem, so it calls a library. This layer's job is done.
THAT data, triple header and all, is translated from bits into little pulses and sent out of ethernet jack number 2. Our machine's job is now done; the data is out on the network now. Other machines will unwrap and rewrap the necessary headers to get it where it needs to go. There are no more problems to solve.
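The wrapping in the steps above can be sketched in a few lines of Python. The headers here are readable placeholders that I invented (real UDP, IP, and Ethernet headers are packed binary fields), and the port, IP address, and MAC address are hypothetical.

```python
def wrap(header, payload, footer=b""):
    """Each layer adds its own header (and maybe footer) around what it was handed."""
    return header + payload + footer

def unwrap(header, data, footer=b""):
    """The receiving side strips exactly what the matching layer added."""
    end = len(data) - len(footer)
    assert data.startswith(header) and data[end:] == footer
    return data[len(header):end]

# Placeholder headers, one per layer
UDP = b"[UDP dst_port=5000]"
IP = b"[IP dst=203.0.113.5]"
ETH_HEAD = b"[ETH dst=aa:bb:cc:dd:ee:ff]"
ETH_FOOT = b"[ETH checksum]"

video_chunk = b"<video bytes>"                    # application layer output
segment = wrap(UDP, video_chunk)                  # which process?
packet = wrap(IP, segment)                        # which machine, across the internet?
frame = wrap(ETH_HEAD, packet, ETH_FOOT)          # which machine is next?
bits = "".join(format(byte, "08b") for byte in frame)  # what goes on the wire

# The receiver peels the layers off in the opposite order.
recovered = unwrap(UDP, unwrap(IP, unwrap(ETH_HEAD, frame, ETH_FOOT)))
print(recovered)
```

Notice that no layer ever looks inside the payload it was given. That is the sense in which each layer abstracts away the ones around it.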

These five steps represent five perspectives on the data transmission process, in order of decreasing abstraction. Those five layers of abstraction are called, in order,

The Application Layer
The Transport Layer
The Network Layer
The Data Link Layer
The Physical Layer

And there you have it. The OSI model is born.

(The above Skype example is a simplification, since our PCs are not communicating directly but instead both communicating with a server.)

Related: For much more detail than you want on how computers communicate, check out What happens when….

The Logarithmic Black Sheep

2021-02-14T00:00:00+00:00

Behold, the integral power rule:

For any real $p$,

\[\int x^p\, dx = \frac{1}{p+1} x^{p+1} +C\]

Dependable. Ubiquitous. Cursed with an asterisk:

… unless $p=-1$, in which case

\[\int x^p\, dx = \ln(x)+C.\]

I have gotten used to this fact through years of working with integrals. All the same, it bothers me. Every time I am forced to check if my exponent falls into this one exceptional case, a voice somewhere in my head protests. “How does this make sense?? How do you get a continuous family of monomials except for the one case where, of all things, you get a logarithm!?”

This is not to say I don’t understand why $\int dx/x$ spits out a logarithm. I can see that the usual formula would force you to divide by 0 at $p=-1$, and I could even prove the antiderivative in a few lines. It’s just that it feels like an unexplained discontinuity at a point. I never learned how this logarithmic black sheep fits in with the big happy family of monomials.

Fitting the Logarithm into the Family

I determined today that I was going to figure this out. Can I make sense of this exceptional case? So I did what everyone should do when they are confused by real-valued functions: I opened Desmos. My first question was this: does the formula from the power rule converge in some way to $\ln$ as $p\to -1$? So I made a few graphs. I plotted the formula from the power rule

\[f_p(x) = \frac{1}{p+1}x^{p+1}\]

adding a slider for $p$, and plotted $\ln(x)$ as well for comparison. I pulled that slider for $p$ slowly to the left towards $-1$ and…

(Interactive version on Desmos)

I can hear a slide whistle when I watch that gif. It definitely diverges at every positive $x$. I mean, why wouldn’t it? The $x^{p+1}$ term is approaching $x^0=1$ while the denominator is going to 0, blowing the fraction up to infinity. This function $f_p$ definitely does not converge to $\ln$ as $p\to -1$.

But something about the picture was still tickling at my brain. While the values were diverging, the shape of the graph DID seem to approximate the shape of the logarithm more and more as $p$ got closer to $-1$. The vertical placement of that distinctive shape just got less and less correct as $p$ approached its target. This is exactly the kind of problem that can be solved with one last trick up our sleeve: the arbitrary constant. While $f_p$ doesn’t converge the way we want, maybe $f_p + C$ does converge for some choice of $C$. What we’ve done so far amounts to setting $C=0$. It might seem like adding a constant won’t change the convergence, but remember that it just needs to be a constant with respect to $x$. We can make $C$ depend on $p$. To signify this, we’ll refer to it as $C_p$.

So what value of $C_p$ do we choose to make $f_p(x) + C_p$ converge at all, hopefully to $\ln(x)$? We could just blindly try a few functions of $p$ to counteract the divergence, or we could be smart about it; what if we choose $C_p$ specifically so that $f_p +C_p$ always agrees with $\ln$ at some point? Anchor it down, so to speak. For instance, let’s choose $C_p$ in a way that makes sure $f_p(1)+C_p = 0=\ln(1)$ for every $p$. If you solve for $C_p$ in this equation, you get $C_p = - 1/(p+1)$. Returning to our friend Desmos, we can plot

\[f_p(x) + C_p = \frac{1}{p+1}x^{p+1} - \frac{1}{p+1}\]

and compare it to $\ln(x)$. Lo and behold…

(Interactive version on Desmos)

as we slide $p$ towards $-1$, we get a curve that lies tangent to $\ln$ at $x=1$, hugging it tighter and tighter as $p$ approaches its target. In other words, it appears that

\[\lim_{p\to -1} \left(f_p(x) + C_p\right) = \ln(x)\]

for all positive $x$. Indeed, this can be proven with a quick application of L’Hôpital’s rule. The apparent discontinuity of our exceptional case at $p=-1$ was not exceptional at all; it appears as a natural choice if you only choose a nice value for the arbitrary constant! Our black sheep $\ln$ really does fit into this family of antiderivatives quite nicely!
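If you’d rather check this numerically than squint at a graph, here is a small Python sketch. The function name `anchored_antiderivative` is my own label for $f_p(x) + C_p$, and the choice of $x = 5$ is arbitrary.

```python
import math

def anchored_antiderivative(x, p):
    """f_p(x) + C_p = (x**(p+1) - 1) / (p+1), defined for p != -1."""
    return (x ** (p + 1) - 1) / (p + 1)

x = 5.0
for p in [-0.5, -0.9, -0.99, -0.999, -0.9999]:
    value = anchored_antiderivative(x, p)
    gap = abs(value - math.log(x))
    print(f"p = {p:8.4f}   f_p(x) + C_p = {value:.6f}   gap to ln(x) = {gap:.6f}")
```

The gap shrinks as $p$ gets closer to $-1$, which is what the picture suggested.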

Deducing the Exception Using the Rule

We can go even further. Let’s suppose we didn’t know the antiderivative of $1/x$. Maybe we’ve figured out the general integral power rule, but aren’t sure what to do when $p=-1$. We could actually use the above argument to prove that the answer is $\ln(x)$. (The reader that loses their way in this argument is welcome to skip to the “Ta-da!” a few paragraphs down.)

Let’s try it out. The general integral power rule tells us that for any $x>0$,

\[\int_1^x t^p\, dt = \left.\frac{t^{p+1}}{p+1}\right\rvert_1^x = \frac{x^{p+1}}{p+1} - \frac{1}{p+1}.\]

Aha! Our chosen constant $C_p$ from earlier appears by itself clear as day! We already argued that the limit of this expression as $p\to -1$ is $\ln(x)$. So

\[\lim_{p\to -1}\int_1^x t^p\, dt = \ln(x).\]

(The concerned reader may wonder if $\ln$ appears in the evaluation of that limit because we smuggled in our knowledge that the antiderivative of $1/x$ is $\ln(x)$, the very fact that we are trying to prove. This is not the case; $\ln(x)$ appears when applying L’Hôpital’s rule in the derivative of $x^p$ with respect to $p$. The proof of this fact requires only that $\ln(x)$ is the inverse function to $e^x$ and no knowledge of the result we are proving.)
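For completeness, here is that limit written out. For a fixed $x>0$, the numerator $x^{p+1}-1$ and the denominator $p+1$ both go to 0 as $p\to -1$, so L’Hôpital’s rule applies, and we differentiate each with respect to $p$ (using $x^{p+1} = e^{(p+1)\ln(x)}$):

\[\lim_{p\to -1}\frac{x^{p+1}-1}{p+1} = \lim_{p\to -1}\frac{\frac{d}{dp}\left(x^{p+1}-1\right)}{\frac{d}{dp}\left(p+1\right)} = \lim_{p\to -1}\frac{x^{p+1}\ln(x)}{1} = x^{0}\ln(x) = \ln(x).\]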

Now since $t^p$ converges uniformly to $t^{-1}$ between $x$ and 1 as $p\to -1$, we can interchange the integral and the limit to get

\[\int_1^x t^{-1}\, dt = \int_1^x \lim_{p\to -1} t^p\, dt = \lim_{p\to -1}\int_1^x t^p\, dt = \ln(x).\]

Ta-da!

Of course, there is a much simpler proof that the antiderivative of $1/x$ is $\ln(x)$:

\[\begin{aligned} 1 = \frac{d}{dx} x & = \frac{d}{dx}e^{\ln(x)} \\\\ & = \frac{d}{dx}\ln(x) \cdot e^{\ln(x)} = \frac{d}{dx}\ln(x) \cdot x \end{aligned}\]

and dividing by $x$ gives that the derivative of $\ln(x)$ is $1/x$. This proof has the advantage of brevity and simplicity, since it uses only techniques that are familiar to Calculus students (the chain rule). Our proof above, however, provides something else.

Conclusion

Our new proof, while a bit longer, shows us something the shorter proof does not. It demonstrates that, not only does $\ln(x)$ fit into the family of monomial antiderivatives after all, it actually fits so nicely that we can deduce its existence just by considering the monomials around it. Not only is $\ln(x)$ not such a black sheep in the integral power rule, it is in fact the only function that can hold the whole collection of antiderivatives together as one big continuous family.

The Two Guards Riddle: A Couple of Insights

2020-08-05T00:00:00+00:00

Remember that riddle with the two doors and the two guards, where one always lies and one always tells the truth? I was reminded of it when it came up in a podcast recently. After some thought, I had a couple of insights.

Just to recap, the riddle goes like this.

You are trapped in a room with two exits. One leads out to freedom, while the other leads to certain death. You can’t tell the difference without taking the risk and stepping through a door. There are two guards in the room with you. These guards will happily let you step through a door. They have a peculiar quirk: one guard always lies, while the other always tells the truth, but again you don’t know which is which. They know which door leads out and which leads to death. You can ask ONE of the guards ONE yes-or-no question. Your goal, of course, is to use that one question to determine which door you should pass through and avoid certain death.

An example question that would not work: “Does the left door lead to death?” The guard will answer you, but you can’t tell if they are telling the truth or not. Your question has been wasted. Another bad question would be asking the guard something with an obvious answer, the goal being to discover if he is the liar (e.g. “Are you a guard?”). This will reveal who is the liar and who is not, but now you’ve spent your question and still know nothing about the doors. Hopefully you have a handle on the flavor of this riddle.

The standard solution is to ask “Would the other guard tell me the left door is dangerous?” If he says yes, the left door leads to safety. Otherwise, the right door does. If you haven’t heard this before, take a second to think through why this is true. It’s pretty clever.
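If you’d rather not trace the cases by hand, here is a small brute-force check in Python. The modeling is my own: a guard is just a flag saying whether he lies, and a liar flips whatever the true answer to a yes-or-no question is.

```python
from itertools import product

def guard_says(is_liar, true_answer):
    """A liar flips the true answer; a truth-teller reports it unchanged."""
    return (not true_answer) if is_liar else true_answer

def standard_question(asked_is_liar, left_is_dangerous):
    """Answer to: 'Would the other guard tell me the left door is dangerous?'"""
    other_is_liar = not asked_is_liar
    other_would_say = guard_says(other_is_liar, left_is_dangerous)
    return guard_says(asked_is_liar, other_would_say)

for asked_is_liar, left_is_dangerous in product([False, True], repeat=2):
    answer = standard_question(asked_is_liar, left_is_dangerous)
    chosen = "left" if answer else "right"            # yes -> left, no -> right
    safe = "right" if left_is_dangerous else "left"
    print("asked the liar:", asked_is_liar,
          "| answer:", "yes" if answer else "no",
          "| chosen:", chosen, "| safe:", safe)
```

In all four scenarios the chosen door matches the safe one. Exactly one lie gets applied no matter which guard you ask, so the answer is always the opposite of the truth about the left door.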

Now for the insights. I hope you find them as interesting as I did.

1: Spot the Liar or Leave. You Can’t Do Both.

I think most people come at this for the first time by trying to ask a tricky question that would both ask the guard which door is safe and simultaneously determine if that guard is the liar. An interesting feature of the standard solution is that when you’re done, you still don’t know which guard was the liar. This is actually a necessary consequence of asking the right question.

We can see this from an information standpoint. There are four possibilities that we are trying to sort out as the prisoner. Either guard could be the liar and either door could be the right one, which makes four possible combinations. To find out which of those four situations you are in, you’re going to need two bits of information. No way around that. However, if you ask a yes or no question, you’re going to get a single bit in response. You will never be able to both get the correct door and determine which guard is the liar. You’ve gotta pick one or the other, and the point of the riddle is that you get the former.

2: Another Solution that Sounds Less Tricky

The standard solution is a bit complicated and would fail what I’ll call the XKCD rule.

A much simpler solution is this: “If I asked you if your door was the correct one, what would you say?” It’s subtle because in everyday conversation, we would probably parse this as essentially asking “Which door is the correct one?” But it is not the same. The truthful guard will just tell you the truth. Nothing complicated there. But the lying guard has to think about it. He knows that if you asked him directly, he would lie. So to answer your question, he has to lie about that too and ends up telling you the truth about the door. He basically ends up telling two lies that cancel each other out. In either case, the guard will point you to the correct door and you are free to leave. Note that, in line with the earlier insight, you still don’t know which guard was the liar once you’re done.
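The double lie is easy to check by brute force. In this Python sketch (my own modeling, with a liar represented as someone who flips the true answer to any yes-or-no question), the nested question comes out truthful in every case.

```python
from itertools import product

def guard_says(is_liar, true_answer):
    """A liar flips the true answer; a truth-teller reports it unchanged."""
    return (not true_answer) if is_liar else true_answer

def nested_question(is_liar, door_is_correct):
    """Answer to: 'If I asked you if your door was the correct one, what would you say?'"""
    direct_answer = guard_says(is_liar, door_is_correct)  # what he would say if asked directly
    return guard_says(is_liar, direct_answer)             # his report of that answer

for is_liar, door_is_correct in product([False, True], repeat=2):
    answer = nested_question(is_liar, door_is_correct)
    print("liar:", is_liar, "| door is correct:", door_is_correct,
          "| says:", "yes" if answer else "no")
```

The answer always equals `door_is_correct`, and it is the same for both guards, which is why hearing it tells you nothing about who the liar is.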

A neat feature of this solution is that, while the standard solution requires two guards, this solution only needs the one. It would apply just as well to a similar riddle where there is only one guard who you are told either always lies or always tells the truth.

“Why Do I Need Complex Numbers?”

2020-04-27T00:00:00+00:00

I have heard students ask this question over and over in one form or another. A recent incarnation was featured in this video by 3Blue1Brown. I think this version strikes at the heart of the vendetta that many students have against complex numbers.

“What is an application that is impossible to achieve without complex numbers? Convince me that they are needed. They are fun to work with, and maybe they make things easier, but we can do without them.”

(Edited for grammar and spelling.)

At the heart of this question is a reasonable request. When introduced to a new mathematical concept, we should demand reasons, even if the answer is necessarily a bit delayed. High school students in the US are often introduced to complex numbers with little motivation. To them, this new number system can seem like a lot of complexity with little payoff. Maybe a few applications are presented, but one might argue they could be achieved without complex numbers, even if it requires sacrificing some elegance. Why put in the extra work?

I would like to formulate a satisfying response to this question. While one could try responding to this challenge with the fundamental theorem of algebra (which I would challenge anyone to state with clarity while avoiding complex numbers), I think this misses the heart of the question. To the general public, this is just another abstract theorem. What people want is a reason to care.

I think a better response would be to issue a challenge of my own: convince me that I need fractions. Why can’t the world get through life using only decimals? My banana bread recipe can call for 0.3̅ cups of milk. We can call the coin a 25 cent piece instead of a quarter. Bon Jovi can sing about being 50% of the way there. Who needs the extra complexity of introducing a second system for dealing with pieces of whole numbers?

The answers to the two challenges are the same: maybe you could do without, but why would you do that to yourself? For many applications, fractions are just so much simpler. If I’m making 50% of a recipe that calls for 0.333 cups of water, the math involved is much more cumbersome than if I am halving a recipe that calls for a third of a cup. It is much easier to remember “three sixteenths of an inch” than to retain the decimal 0.1875. And even if an exact answer is not needed, mentally dividing fractions is certainly faster and less error-prone than doing the same with decimals. There is a reason that the world uses fractions in everyday life. We use them not because of any task that is impossible without them, but because they make things easier.
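You can even watch the difference in Python, which ships with an exact fraction type in its standard library. This is just a small illustration of the recipe example; the variable names are mine.

```python
from fractions import Fraction

# Halving a third of a cup, exactly
third_cup = Fraction(1, 3)
half_recipe = third_cup / 2
print(half_recipe)

# The same task with a rounded decimal: the rounding error comes along for the ride
rounded_third = 0.333
print(rounded_third / 2, "vs", float(half_recipe))

# Three sixteenths of an inch as a decimal
print(float(Fraction(3, 16)))
```

The exact version gives 1/6 with no fuss, while the decimal version is already slightly off before you start measuring.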

If you agree that fractions are useful (and I hope you do), then you must agree that a mathematical object does not need to conquer the impossible in order to be worth learning. We should judge complex numbers by the same standard as fractions: do they make things significantly easier? This is the right question. And in the right setting, the answer is yes. They are natural tools for dealing with rotations and oscillations, or computations in plane geometry, to name a couple of broad examples. Complex numbers can simplify computations and enhance our understanding.
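Here is one small example of "easier," sketched in Python, which has complex numbers built in. Rotating a point in the plane about the origin is a single multiplication if the point is a complex number, versus a pair of trig formulas if it is not. The function names are my own.

```python
import cmath
import math

def rotate(point, angle):
    """Rotate a point (stored as a complex number) about the origin by angle radians."""
    return point * cmath.exp(1j * angle)

def rotate_with_trig(x, y, angle):
    """The same rotation using only real numbers."""
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

p = complex(3, 4)
q = rotate(p, math.pi / 2)                  # quarter turn counterclockwise
qx, qy = rotate_with_trig(3, 4, math.pi / 2)
print(q, (qx, qy))                          # same point, up to floating-point rounding
```

Both versions do the same arithmetic under the hood. The complex version just packages it so that composing two rotations is multiplying two numbers instead of re-deriving angle-addition formulas.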

To answer the original question more directly, I think the answer is no. Any direct applications of complex numbers could be achieved without them if you were willing to work hard enough to avoid them. But the same could be said of fractions, or decimals, or real numbers, or any mathematical structure. The reason we use any of these constructions is because they are the easy way out. They make things simpler to work with and understand, which is exactly what math is supposed to do.

A commenter on the above video with the username “Garbaz” addresses the question from an electrical engineering standpoint.

“[Once] my electrical engineering professors said that if mathematicians hadn’t come up with complex numbers, electrical engineers would have. Dealing with electrical circuits that involve capacitors, inductors (and alternating currents) without complex numbers is very difficult, having to deal with differential equations and trig identities, but if you interpret inductors & capacitors like resistors, but with an imaginary resistance, you get an incredibly beautiful and simple way to work with them. In general, there is pretty much no area of electrical engineering that does not benefit greatly from using complex numbers. Especially everything involving AC.”
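To give a taste of what Garbaz is describing, here is a minimal Python sketch using the standard textbook formulas for a resistor, inductor, and capacitor in series, where the inductor and capacitor get the "imaginary resistances" $j\omega L$ and $1/(j\omega C)$. The component values are hypothetical.

```python
import cmath
import math

def series_rlc_impedance(R, L, C, freq_hz):
    """Total impedance of a resistor, inductor, and capacitor in series."""
    omega = 2 * math.pi * freq_hz
    return R + 1j * omega * L + 1 / (1j * omega * C)

# Hypothetical components: 50 ohms, 0.1 henries, 10 microfarads
R, L, C = 50.0, 0.1, 1e-5

for freq in [60.0, 1000.0]:
    Z = series_rlc_impedance(R, L, C, freq)
    print(f"{freq:7.1f} Hz: |Z| = {abs(Z):8.2f} ohms, "
          f"phase = {math.degrees(cmath.phase(Z)):7.2f} degrees")

# At the resonant frequency the two imaginary parts cancel, leaving only R.
resonant_freq = 1 / (2 * math.pi * math.sqrt(L * C))
Z_res = series_rlc_impedance(R, L, C, resonant_freq)
print(Z_res)
```

The three components combine by plain addition, exactly like resistors in series, and the magnitude and phase of the result fall out of one complex number. Doing the same thing with only real numbers means solving a differential equation for each frequency.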

Fight Functors with Fire

2018-12-17T00:00:00+00:00

Once during undergrad, my topology professor was trying to prepare the students for an exam. He was an interesting guy. His look and disposition were unusually relaxed, and his clothes looked more like those of a teenager on summer vacation than those of a professor. Typical attire included shorts, flip flops, and a t-shirt referencing a cartoon from the ’90s. Less than 30 years old, he was by far my youngest instructor. He often told stories of his own mathematical struggles in grad school, which always seemed to end with his adviser yelling at him. No one could fault his enthusiasm; his excitement about topology was too pure for this world. I always enjoyed his lectures.

Needless to say, the students loved him.

To prepare for the upcoming exam, he’d collected some practice problems. Before diving in, he devoted a few precious minutes of lecture time to reading us a quote that he thought perfectly crystallized the way we should learn mathematics.

“Don’t just read it; fight it! Ask your own question, look for your own examples, discover your own proofs. Is the hypothesis necessary? Is the converse true? What happens in the classical special case? What about the degenerate cases? Where does the proof use the hypothesis?” - Paul Halmos

Welcome to Brantley Fights Functors, where I will chronicle some episodes in my fight against anything I try to learn on my own time. This could include neural networks, Galois groups, Pandas for Python, and forgetful functors. I believe in Halmos’ strategy; the only way to truly learn these concepts is to step into the ring and fight them for yourself, tooth and nail, until they surrender themselves to your intuition.

Wish me luck.