Email is the hardest easy problem, and I built a business in it

A follow-up to turning EmailEngine from open source into a business: the revenue grew, but the real work turned out to be fighting the endless swamp of email provider quirks.


Two and a half years ago, I wrote a post here called “How I turned my open source project into a business.” It did better than anything else I had published, and since then, people have kept asking what happened next.

That first post was about the change itself: dropping the permissive license, going commercial, and learning to sell to companies instead of hoping for donations. By the time I wrote it, that part had already happened. I had already made the jump.

What I did not yet know was what came after. It turns out that changing the license and charging money was not the hard part. The hard part was running the thing afterwards.

This is that post.

The numbers, since you’ll ask

Let me get the interesting part out of the way first.

When I wrote the first post, EmailEngine was doing about €6,100 a month in recurring revenue, or around $6,600 at the exchange rate back then. Today, Stripe tells me that EmailEngine is doing $15,991 a month, an annual run rate of about $192,000, across 204 paying companies.

So, in two and a half years, revenue has grown about two and a half times. A nice symmetry, even if it did not feel that neat from the inside.

Customers are now on every continent except Antarctica. I keep checking. Still nothing from Antarctica.

There is still exactly one employee. Me. No funding, no co-founder, no team. I do have a desk in a co-working office now, so technically I can no longer claim “no office.” In the two and a half years since that post, I have pushed around 1,100 commits and cut 123 releases. That is roughly a release a week, every week, for two and a half years.

This is not because I am especially disciplined. It is because email is a swamp, and the swamp keeps moving.

That is the actual subject of this post.

I thought the hard part was the business model

When I wrote the first post, I genuinely thought the hard part had been the business side: the license change, the pricing, the decision to sell to companies instead of hoping that random users would sponsor me.

That part felt like the big puzzle.

In hindsight, it was not.

The pricing was a weekend of agonising followed by two years of barely touching it. The license change was scary, but once it was done, it was done. The hard part, the part that never stops, is that EmailEngine sits between customers and other people’s mail systems.

And other people’s mail systems are a mess.

On paper, EmailEngine is simple. It connects to email accounts using IMAP, the Gmail API, or the Microsoft Graph API, keeps these accounts in sync, and exposes a clean REST API and webhooks. This means that customers do not have to deal with IMAP commands, Gmail API quirks, Graph subscriptions, OAuth flows, and all the other details themselves.

Someone might say that EmailEngine is “just an IMAP client with an API.”

Sure. And a submarine is just a boat with a lid.

Here is what the work actually looks like.

A connection that drops every 3.5 minutes, on purpose, silently

A customer in China reported that their 163.com accounts kept dropping. Sync would start, run fine for a few minutes, and then die. Every time. Like clockwork, at about the three-and-a-half-minute mark.

163.com runs Coremail and does not support IMAP IDLE, the command that lets the server push changes to the client. That is not unusual. Many IMAP servers do not support IDLE properly, or at all, so the fallback is to send a periodic NOOP command to keep the connection alive.

The NOOPs were going out. The server was responding to them. The logs looked fine.

The connection still died.

It turned out that 163.com keeps its own internal idle timer, and NOOP does not reset it. The server happily responds to the keepalive command and then disconnects the client anyway because, from its point of view, the connection was still idle.

The fix was to stop using NOOP as the keepalive command for that server and use STATUS instead, because STATUS counts as real activity.

Seven lines of code. Several hours of staring at packet logs that claimed nothing was wrong.
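For illustration, here is a minimal sketch of that kind of per-server keepalive choice. It is written in Python and is not EmailEngine's actual code: the function name, the host list, and the `imap.163.com` hostname are assumptions made for the example.

```python
# Hypothetical sketch: pick a keepalive command per server.
# The host list and function name are illustrative assumptions.

# Servers that (in this sketch) answer NOOP without resetting their idle timer.
NOOP_IGNORING_HOSTS = {"imap.163.com"}


def keepalive_command(host: str, supports_idle: bool, mailbox: str = "INBOX") -> str:
    """Return the IMAP command to use for keeping a connection alive."""
    if supports_idle:
        # The server pushes changes itself, so IDLE doubles as the keepalive.
        return "IDLE"
    if host.lower() in NOOP_IGNORING_HOSTS:
        # STATUS counts as real activity on these servers.
        return f'STATUS "{mailbox}" (MESSAGES UIDNEXT)'
    # Default fallback for servers without IDLE.
    return "NOOP"
```

The point of keeping this as a lookup is that the next strange server becomes one more entry, not one more code path.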

Multiply this by every weird mail server on earth. Yahoo cuts IDLE connections aggressively. Rambler breaks them constantly. Each server has some specific undocumented behaviour, and I only learn about it when someone’s accounts stop syncing and I have to figure out why.

The bug where nothing happened, so we missed everything

Here is another one. This is my favourite kind of bug: the kind where the code is doing exactly what it was told to do, and that is the problem.

To notice new mail efficiently, you do not want to re-scan the entire mailbox every time. You keep some state: how many messages were in the mailbox last time, what the next UID should be, and so on. When the server says that something changed, you compare the old state with the new state.

If the message count went up, you fetch the new messages.

Simple.

Except, what happens if one message arrives and one message is deleted in the same window?

The count stays the same.

The code sees “100 messages before, 100 messages now” and decides that nothing happened. But something did happen. A new email arrived, the customer’s automation should have fired, and there is no error anywhere because, as far as the code is concerned, the check passed.

The fix was to stop trusting the message count alone and compare multiple signals: message count, next UID, and modification sequence. If one signal stays the same because two operations cancel each other out, another one should still move.
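A rough sketch of that comparison, in Python and under assumed names (`MailboxState` and `mailbox_changed` are made up for the example, not taken from EmailEngine). The modification sequence is treated as optional because not every server reports one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MailboxState:
    messages: int                  # message count reported by the server
    uid_next: int                  # UID the next new message will receive
    highest_modseq: Optional[int]  # modification sequence, None if unsupported


def mailbox_changed(old: MailboxState, new: MailboxState) -> bool:
    """True if any tracked signal moved between two snapshots."""
    if old.messages != new.messages:
        return True
    if old.uid_next != new.uid_next:
        # A message arrived, even if a deletion kept the count level.
        return True
    if old.highest_modseq is not None and new.highest_modseq is not None:
        # Catches changes that touch neither the count nor the next UID.
        return old.highest_modseq != new.highest_modseq
    return False


def has_new_messages(old: MailboxState, new: MailboxState) -> bool:
    """New mail shows up as a higher next UID, whatever the count says."""
    return new.uid_next > old.uid_next
```

With one arrival and one deletion in the same window, the count stays at 100 on both sides, but the next UID still moves, so the change is no longer missed.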

Obvious in hindsight. Most of this work is obvious in hindsight.

The flood that fed itself

This one nearly took down whole accounts, and it is a good example of why reliability work is mostly about what happens when something else breaks.

EmailEngine keeps its sync state in Redis. If that state disappears for a folder, for example, because Redis was configured with an eviction policy and started dropping keys to save memory, the old code would reconnect, see a folder it had “never” seen before, and start indexing it from the first message.

That also meant firing a “new message” webhook for every existing email in that folder.

Thousands of webhooks. For messages that had already been synced long ago.

Now it gets worse. The flood of webhook jobs makes Redis use more memory. Redis, under the bad eviction policy, drops more keys to make room. More dropped keys mean more folders look new. More folders looking new means more webhook floods. More floods mean more memory pressure.

In the right unlucky conditions, the system became a machine for eating itself.

The fix had two parts.

First, recovery is now non-destructive. If EmailEngine notices that a folder’s sync state is missing, it rebuilds the index from the server’s current state and marks that as the new baseline. It does not emit thousands of fake “new message” events for old mail. Instead, it emits a single event saying that the folder state was reset.
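As a sketch of the idea only, assuming a much simpler state model than the real one: the stored state is reduced to a set of known UIDs, and the event names `message_new` and `folder_reset` are invented for the example.

```python
from typing import Dict, List, Optional, Set, Tuple


def sync_folder(
    stored_uids: Optional[Set[int]],
    server_uids: Set[int],
    folder: str,
) -> Tuple[Set[int], List[Dict]]:
    """Return the new baseline and the events to emit for one sync pass."""
    if stored_uids is None:
        # Sync state is missing (for example, evicted from Redis).
        # Adopt the server's current state as the baseline and emit
        # a single reset event instead of one event per old message.
        return set(server_uids), [{"event": "folder_reset", "folder": folder}]

    # Normal case: only UIDs not seen before count as new messages.
    new_uids = sorted(server_uids - stored_uids)
    events = [
        {"event": "message_new", "folder": folder, "uid": uid}
        for uid in new_uids
    ]
    return set(server_uids), events
```

The important property is that the missing-state branch emits a constant number of events, no matter how many messages the folder holds.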

Second, the dashboard now warns when Redis is configured to drop data. The real bug was not just the recovery behaviour. It was the bad Redis configuration that was invisible until it caused damage.
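The check itself can be small. Redis exposes its eviction behaviour through the `maxmemory-policy` setting (readable with `CONFIG GET maxmemory-policy`), and `noeviction` is the value under which it refuses writes instead of dropping keys. The function below is a hypothetical sketch of such a warning, not the dashboard's actual code, and it assumes the policy string has already been fetched.

```python
from typing import Optional

# The only policy under which Redis never evicts keys to free memory.
SAFE_POLICY = "noeviction"


def eviction_warning(maxmemory_policy: str) -> Optional[str]:
    """Return a warning if Redis may drop keys under memory pressure."""
    policy = maxmemory_policy.strip().lower()
    if policy == SAFE_POLICY:
        return None
    return (
        f"Redis maxmemory-policy is '{policy}': keys may be evicted under "
        f"memory pressure, which can silently discard sync state. "
        f"Consider '{SAFE_POLICY}'."
    )
```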

Sometimes the most useful feature is the one that points at the actual problem.

Gmail and Microsoft do not sit still either

Independent IMAP servers are the folklore part of email. The big providers are the treadmill.

The Gmail API has no equivalent of IMAP IDLE. Instead, you subscribe to push notifications. These notifications are supposed to tell you when something changes, but sometimes they just do not arrive.

So EmailEngine has a fallback. If Gmail has been suspiciously silent for too long, EmailEngine checks the account anyway. It is not a busy polling loop hammering the API. It is a canary for when the push channel has stopped being trustworthy.
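A minimal sketch of such a silence check, in Python. The 30-minute threshold and the function name are assumptions for the example; this post does not say how long "too long" actually is.

```python
from datetime import datetime, timedelta

# Assumed threshold, chosen only for illustration.
SILENCE_THRESHOLD = timedelta(minutes=30)


def needs_fallback_check(
    last_notification_at: datetime,
    now: datetime,
    threshold: timedelta = SILENCE_THRESHOLD,
) -> bool:
    """True if the push channel has been quiet long enough to distrust it."""
    return now - last_notification_at >= threshold
```

Every real notification resets `last_notification_at`, so a healthy account never triggers the fallback and the API only gets the extra request when the push channel has gone quiet.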

Microsoft has its own version of the same problem. Graph subscriptions for mail expire after 4,230 minutes, which is 70.5 hours. I have no idea why that number is so specific, but there it is.

So EmailEngine has to run a renewal system. It checks subscriptions, renews the ones close to expiry, retries failures with backoff, and uses locks so that two workers do not try to renew the same subscription at the same time.
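The two decisions at the core of such a system can be sketched in a few lines of Python. The 4,230-minute lifetime comes from above; the six-hour renewal margin, the backoff parameters, and the function names are assumptions for the example, and locking between workers is left out entirely.

```python
from datetime import datetime, timedelta

MAX_LIFETIME = timedelta(minutes=4230)  # 70.5 hours
RENEW_MARGIN = timedelta(hours=6)       # assumed safety margin before expiry


def due_for_renewal(
    expires_at: datetime,
    now: datetime,
    margin: timedelta = RENEW_MARGIN,
) -> bool:
    """True once a subscription is within the margin of its expiry time."""
    return expires_at - now <= margin


def backoff_delay(attempt: int, base: float = 30.0, cap: float = 3600.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), capped."""
    return min(cap, base * 2 ** attempt)
```

Renewing well before expiry matters because a failed renewal needs room for several retries before the subscription actually lapses.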

Microsoft also sends a notification when it knows that some notifications were missed. At that point, EmailEngine has to go back and figure out what was lost.

All of this exists so that an email landing in an Outlook inbox becomes a webhook on the customer’s server. None of it is visible when it works. All of it is the product.

And the ground keeps shifting. Google and Microsoft have spent the last few years killing off plain passwords and basic auth, pushing everyone into OAuth, then service accounts, and now keyless federated identity for customers who do not want long-lived Google service account keys sitting in Kubernetes secrets.

Every one of these migrations eventually lands on my desk. I shipped the keyless version a few weeks ago. I am sure the next migration is already being planned somewhere.

What the subscription actually buys

The EmailEngine source code is still readable on GitHub. Anyone can clone it, run it, and read every line. So it is fair to ask what 204 companies are paying for when the code is right there.

The answer is that they are not really paying for the code.

There is no secret algorithm. No special sauce. If anything, I just described several of the more interesting fixes in this post.

What customers are paying for is that someone keeps doing this work.

Nobody wakes up on a Saturday excited to fix 163.com’s idle timer. Nobody volunteers to babysit Microsoft’s 70.5-hour subscription expiry cycle. Nobody wants to spend their evening debugging why a Redis eviction policy caused a webhook flood of old messages.

I know, because during my open source years, I hoped people would help with that kind of work.

They did not.

The contributions that arrive for a free project are usually for the easy, fun, visible parts. The boring parts, the strange provider-specific bugs, the endless compatibility work, the “Google changed something again” work, all fall to the person whose name is on the project.

So that is the actual product.

Not just the code, but the promise that when something changes in Gmail, Outlook, Yahoo, Coremail, Redis, OAuth, or some random IMAP server, somebody whose livelihood depends on EmailEngine will deal with it.

That promise is hard to fund with donations and good vibes. Which, now that I think about it, is basically the moral of the first post again, just viewed from two and a half years later.

The boring stuff I am quietly proud of

If there is an engineering lesson in all of this, it is that most of the important decisions were about doing less, not more.

I do not use circuit breakers in the usual way. A circuit breaker is a pattern where a failing component gets temporarily disabled so that it does not keep causing damage. In many systems, this is a good idea.

In EmailEngine, it can be poison.

EmailEngine runs thousands of independent accounts through shared workers. If one customer’s broken mailbox trips a shared circuit breaker, then other customers can be affected by a problem that has nothing to do with them.

So the design is different. Accounts should fail alone, in their own corner, while other accounts keep working.
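One way to picture this, as a hypothetical Python sketch and not a description of how EmailEngine is built: failure tracking is keyed by account, so one broken mailbox can only ever pause itself. The class names and the failure limit are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AccountHealth:
    failures: int = 0
    paused: bool = False


class AccountRegistry:
    """Tracks failures per account; there is no shared breaker to trip."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self._health: Dict[str, AccountHealth] = {}

    def record_failure(self, account_id: str) -> None:
        health = self._health.setdefault(account_id, AccountHealth())
        health.failures += 1
        if health.failures >= self.max_failures:
            # Only this account is paused; others are untouched.
            health.paused = True

    def record_success(self, account_id: str) -> None:
        # A success clears the account's failure history.
        self._health[account_id] = AccountHealth()

    def can_run(self, account_id: str) -> bool:
        return not self._health.get(account_id, AccountHealth()).paused
```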

I also do not catch and ignore unexpected errors just to keep the process alive. If something truly unexpected bubbles all the way up, the worker should die loudly and restart cleanly. A process limping along in a state I did not anticipate is scarier than a process that crashes and starts again.
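In code, the rule mostly shows up as an absence. The Python sketch below uses an invented `AccountError` type to stand in for anticipated, account-scoped failures; anything else is deliberately not caught.

```python
from typing import Callable


class AccountError(Exception):
    """An anticipated failure scoped to one account (bad credentials, timeout)."""


def run_account_job(job: Callable[[], None]) -> str:
    try:
        job()
        return "ok"
    except AccountError as err:
        # Expected failure: report it and let other accounts continue.
        return f"account error: {err}"
    # There is no catch-all handler here on purpose. Anything unanticipated
    # propagates and takes the worker down, so that whatever supervises it
    # can restart it from a clean state.
```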

The general rule is not heroic. Do the dull, safe, legible thing. Make failures visible. Do not hide cracks with cleverness.

That is most of what reliability turned out to be. Not cleverness. Stubbornness.

Two and a half years in

So that is the sequel.

The business is bigger, the customer list is longer, and the work turned out to be mostly the work I did not see coming: a slow, never-ending argument with other people’s mail servers, conducted alone, for a couple of hundred companies that hopefully never have to think about any of it.

I still work at a one-person company, as the one person.

The swamp is still there.

So am I.

Ask me again in another two and a half years.