Niranjan
Essay · VIII · Mar 2023 · 4 min read

When the queue became a bus

Why we moved from BullMQ to RedPanda once a Go service had to consume from the bus, and the part that was just curiosity.

The first time it really hit was when our regional API, which we were writing in Go, needed to know whether a job in BullMQ had failed. There is no Go client for BullMQ. So the Go service opened a Redis connection and read BullMQ's internal keys directly, things like bull:executions:wait and bull:executions:failed, parsing the hash fields out by hand and polling, every second, for the job's current state.
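The shape of that hack, sketched from memory rather than the real code. The redisHashes interface here stands in for whatever Redis client the service actually used (go-redis, most likely), and the hash field names, finishedOn and failedReason, are BullMQ implementation details as I remember them, which is exactly the problem:

```go
package main

import (
	"fmt"
	"time"
)

// redisHashes is a stand-in for a real Redis client; only HGETALL
// matters for this sketch.
type redisHashes interface {
	HGetAll(key string) (map[string]string, error)
}

// bullJobKey builds the internal per-job hash key BullMQ keeps,
// e.g. "bull:executions:42". The "bull:" prefix and layout are
// undocumented internals the Go side had to hardcode.
func bullJobKey(queue, jobID string) string {
	return fmt.Sprintf("bull:%s:%s", queue, jobID)
}

// pollJobState reads the job hash once a second until a terminal
// field shows up. This is the shape of the loop, not the exact code.
func pollJobState(r redisHashes, queue, jobID string) (map[string]string, error) {
	for {
		fields, err := r.HGetAll(bullJobKey(queue, jobID))
		if err != nil {
			return nil, err
		}
		if fields["finishedOn"] != "" || fields["failedReason"] != "" {
			return fields, nil
		}
		time.Sleep(time.Second)
	}
}

// fakeRedis lets the sketch run without a broker or a real Redis.
type fakeRedis map[string]map[string]string

func (f fakeRedis) HGetAll(key string) (map[string]string, error) {
	return f[key], nil
}

func main() {
	r := fakeRedis{"bull:executions:42": {"failedReason": "timeout"}}
	fields, _ := pollJobState(r, "executions", "42")
	fmt.Println(fields["failedReason"]) // timeout
}
```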

It worked. The code was hideous.

I had been uncomfortable with the BullMQ shape for a while before that. We had picked it for the obvious reason: we were a Node.js shop, and BullMQ is the right answer when you need a job queue in Node.js, with retries & concurrency and a dashboard you can stare at. Recordings went into a queue, the supervisor consumed them, results came back into another queue, the API picked them up. The shape fit.

What it had become by late 2022 was an event bus. Step events from the supervisor, execution progress, results split across two messages because they were too big for one, status updates to the dashboard, failures going back to the API for retries. Most of them were events, passing through a job queue because the job queue was what we had.

Then came the regional API in Go, and the Redis screen-scraping hack. That was the moment I gave up on the shape.

We picked Kafka semantics. Real consumer groups, partition-keyed ordering, offsets that consumers held themselves, replay if we needed it, and clients in every language we cared about. Both sides got a proper Kafka client library, kafkajs on the Node side & kafka-go on the Go side, both speaking to the same brokers.
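The ordering guarantee hangs on the message key: the producer hashes the key to pick a partition, so everything keyed by the same execution ID lands on the same partition and stays in order. A minimal sketch of that mapping, using FNV-1a the way kafka-go's default Hash balancer does as far as I know (kafkajs defaults to murmur2, so if both sides produce to the same topic they have to agree on one scheme):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a message key to a partition: hash the key,
// take it modulo the partition count. Same key, same partition,
// so per-key ordering holds as long as the partition count is stable.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numPartitions
}

func main() {
	// All events for one execution share a key, so they share a
	// partition and their relative order is preserved.
	for _, event := range []string{"step.started", "step.finished", "execution.done"} {
		fmt.Printf("exec-7f3a %s -> partition %d\n", event, partitionFor("exec-7f3a", 12))
	}
}
```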

We picked RedPanda over Kafka because we did not want to run a JVM broker. A 3-node Kafka quorum with the JVM and ZooKeeper was a lot of baseline memory for what we were actually pushing through it. RedPanda is wire-compatible with Kafka: single binary, no JVM, no ZooKeeper. The clients talked to it without knowing the difference.

There were a lot of small things that BullMQ had given us for free that we now had to design. We came up with topic prefixes per environment, consumer group IDs that didn't collide between pods of the same service, and partitioning keys that kept events for the same execution in order on the same partition. We also put a small internal Go API in front of kafka-go, because the raw client is low-level and we wanted that complexity in one place. None of it was hard, but all of it was work that didn't ship a feature.
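The naming conventions can be sketched as a few helpers. The function names and the dot-separated scheme are illustrative, not the ones we shipped; the point is that environment, service, and fan-out mode each get a fixed slot in the name so collisions can't happen by accident:

```go
package main

import "fmt"

// topicName prefixes every topic with the environment, so staging
// and prod can share tooling without sharing streams.
func topicName(env, topic string) string {
	return fmt.Sprintf("%s.%s", env, topic)
}

// groupID is shared by every pod of one service consuming one topic:
// pods in the same group split the partitions between them instead
// of each receiving every message.
func groupID(service, topic string) string {
	return fmt.Sprintf("%s.%s", service, topic)
}

// broadcastGroupID gives each pod its own group, for the cases where
// every pod needs every message (dashboards, cache invalidation).
func broadcastGroupID(service, pod, topic string) string {
	return fmt.Sprintf("%s.%s.%s", service, pod, topic)
}

func main() {
	fmt.Println(topicName("prod", "execution.events"))       // prod.execution.events
	fmt.Println(groupID("regional-api", "execution.events")) // regional-api.execution.events
	fmt.Println(broadcastGroupID("dashboard", "pod-0", "execution.events"))
}
```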

The caveat is that BullMQ had cost us nothing to operate. We were already running Redis for other things, and BullMQ just sat on top of it. RedPanda was new infrastructure. Three nodes with persistent storage per environment for quorum, one cluster per env, all of it ours to keep alive. For a small startup, "we are now operating a Kafka-compatible broker" is not a small thing to take on.

I had doubts about it through the whole migration. The honest worry was that we were larping as a distributed-systems shop, building plumbing for a problem we did not yet have. I should be honest about the other thing too, which is that part of why I went ahead was that I had wanted to try RedPanda.

The bus held up. The doubts mostly went away on their own once it had been running for a few months. The Go service stopped reading BullMQ's Redis schema, and the events found their right shape. I'd still make the same call today, slightly less guilty about the part that was curiosity.

Thanks for reading. Questions, disagreements, or corrections are welcome.