NATS: You Need it Now!

2022-09-17

If you are running Kubernetes, or really any kind of microservice architecture, you will eventually run into challenges with communication and synchronization between your instances. To solve this, I recommend deploying an instance of NATS as part of your initial infrastructure setup. NATS is great because:

It's tiny, light-weight, and easy to run
A single instance will likely be sufficient for the needs of your whole cluster
It will be there, ready and waiting, when you need it
It solves the problem of one-to-many communication
It can be used to build extensible event-driven systems

What is NATS?

NATS is a light-weight, easy-to-deploy service that provides pub-sub functionality with very little fuss. It is a tiny application, written in Go, that listens on a port for connections from clients.

The NATS executable is a few MB in size and runs out of the box with sensible defaults. It has no dependencies or required configuration parameters. As a Kubernetes service, it can be deployed very easily with this yaml. With that simple deployment, your microservices can use NATS by connecting to nats://nats:4222.

Clients exchange messages by publishing and subscribing to subjects. For example, two clients could be subscribed to subject x. If any client publishes a message to subject x, all subscribed clients will receive that message.
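To make the fan-out behavior concrete, here is a minimal in-process sketch of subject-based pub-sub. It is a toy model, not NATS client code: the InMemoryBroker class and its methods are names I made up for illustration, and it only supports exact subject matches.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str, bytes], None]


class InMemoryBroker:
    """Hypothetical stand-in for a NATS server: exact subjects, no persistence."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, subject: str, handler: Handler) -> None:
        # Any number of clients may subscribe to the same subject.
        self._subs[subject].append(handler)

    def publish(self, subject: str, payload: bytes) -> int:
        # Forward the message to every current subscriber, then forget it.
        handlers = list(self._subs.get(subject, []))
        for handler in handlers:
            handler(subject, payload)
        return len(handlers)


broker = InMemoryBroker()
inbox_a: List[bytes] = []
inbox_b: List[bytes] = []

# Two clients subscribe to subject "x".
broker.subscribe("x", lambda subject, payload: inbox_a.append(payload))
broker.subscribe("x", lambda subject, payload: inbox_b.append(payload))

# One publish reaches both subscribers.
delivered = broker.publish("x", b"hello")
```

Note that a message published to a subject with no subscribers is simply dropped, which matches the fire-and-forget model described here.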

NATS Use-Cases

NATS can replace and streamline many service-to-service communication scenarios. The following sections describe a few of them:

Broadcast to All Instances of a Distributed Service

This was my first use for NATS. I had a deployment with multiple instances running in the cluster, and whenever a configuration change was made, I needed all instances to reload their configuration from a database.

To solve this problem, every instance of my service subscribes to myapp.refresh. When the configuration changes, I publish a message to that subject, and all instances will take action by reloading their configurations.
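The shape of this pattern can be sketched without a server. In the toy model below, the refresh message carries no data; it is only a signal, and each instance re-reads the source of truth when it arrives. The names config_db, Instance, and publish_refresh are hypothetical, and the list of handlers stands in for subscriptions to myapp.refresh.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the configuration database.
config_db: Dict[str, str] = {"log_level": "info"}


class Instance:
    """One running copy of the service."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.config = dict(config_db)  # loaded once at startup

    def on_refresh(self, _payload: bytes) -> None:
        # The message is only a trigger; the data comes from the database.
        self.config = dict(config_db)


# Stand-in for "every instance subscribes to myapp.refresh".
refresh_subscribers: List[Callable[[bytes], None]] = []
instances = [Instance(f"myapp-{i}") for i in range(3)]
for instance in instances:
    refresh_subscribers.append(instance.on_refresh)


def publish_refresh() -> None:
    # Stand-in for publishing an empty message to myapp.refresh.
    for handler in refresh_subscribers:
        handler(b"")


config_db["log_level"] = "debug"                       # the configuration changes
stale = [i.config["log_level"] for i in instances]     # instances have not noticed yet
publish_refresh()                                      # one broadcast
fresh = [i.config["log_level"] for i in instances]     # every instance reloaded
```

Keeping the payload empty means a missed or duplicated message can never leave an instance with wrong data, only with stale data until the next refresh.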

Ping-Pong

Want to get some information or a status report from all instances of your service? One instance (or any other client) acts as the requester and fetches information about all running instances like this:

All instances listen to myapp.ping
The requester starts listening to a unique temporary subject, myapp.pong.[UNIQUE_GUID] for example
The requester publishes a message such as replyto=myapp.pong.[UNIQUE_GUID] to the myapp.ping subject
Every instance listening to myapp.ping responds on the myapp.pong.[UNIQUE_GUID] subject with the relevant information

The instance that sent the ping cannot know in advance how many replies to expect, so it listens to the myapp.pong.[UNIQUE_GUID] subject for a fixed amount of time and then unsubscribes from it. It should only take a few milliseconds to receive messages from all listening instances.

Event-Driven Systems

The beauty of NATS is that multiple clients can subscribe to the same subject without any fancy configuration or setup. This can be very handy when building a future-proof system that can easily be extended. Take the following scenario for example:

Imagine you are running a microservice-based e-commerce system. One microservice handles payments and another one handles the front-end UI that customers see. The front-end might send a message requesting that a payment be processed (using NATS or a REST API), and then it might listen on a predetermined subject (payments.updates.[TXN_ID] for example) for a notification that the payment has completed.

Imagine now that you want to add a quota system that automatically updates inventory numbers whenever a purchase is made. You might be tempted to add that logic to either your front-end or payment microservice. However, this functionality doesn't logically fit into either of these services. With NATS, you could create a new microservice that subscribes to payments.updates.* to receive notifications of all payment updates (in NATS subjects, * matches exactly one dot-separated token). It could then perform the desired action, all without modifying any of the existing services.
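To show what a subscription like payments.updates.* does and does not match, here is a small sketch of NATS-style subject matching. The subject_matches function is my own illustrative helper, not part of any NATS library; it assumes well-formed subjects with no empty tokens, and it models the two NATS wildcards: * for exactly one token and > for one or more trailing tokens.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if `subject` matches the NATS-style `pattern`.

    "*" matches exactly one dot-separated token.
    ">" matches one or more tokens and is only valid as the last token.
    """
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")

    for i, token in enumerate(pattern_tokens):
        if token == ">":
            # Must be the final pattern token, with at least one token left to match.
            return i == len(pattern_tokens) - 1 and len(subject_tokens) > i
        if i >= len(subject_tokens):
            return False  # subject ran out of tokens before the pattern did
        if token != "*" and token != subject_tokens[i]:
            return False  # literal token mismatch

    # Without ">", pattern and subject must have the same number of tokens.
    return len(pattern_tokens) == len(subject_tokens)


# The new microservice's subscription sees every transaction's update...
one_level = subject_matches("payments.updates.*", "payments.updates.txn42")
# ...but not subjects that are deeper than one extra token.
too_deep = subject_matches("payments.updates.*", "payments.updates.txn42.refund")
# ">" would be the choice for catching everything under payments.
everything = subject_matches("payments.>", "payments.updates.txn42.refund")
```

This is why the subject hierarchy is worth planning early: a consistent layout such as payments.updates.[TXN_ID] lets later services pick exactly the slice of traffic they care about.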

Performance Concerns

A simple instance of NATS should be fine for most workloads. Some possible concerns might be:

Speed

Although using NATS involves an extra network hop compared to direct communication, remember that this all happens over an already open TCP connection (no handshake overhead), and that the communication will likely be between machines that are physically quite close to each other. The round-trip times I typically observe in my Digital Ocean cluster are less than 70ms (which means a one-way time of under 35ms).

Volume

You might be worried that a single instance won't be able to handle the number of services and messages that you need to send. But remember, a single instance of NATS should be able to easily handle thousands of simultaneous connections. Furthermore, NATS is fairly stateless and should not be demanding in terms of CPU or memory. It simply receives a message, forwards it to all the subscribers, and then forgets about it.

Reliability

What happens if NATS goes offline? What about network issues? These are valid concerns, but:

Statistically speaking, the smaller your cluster, the rarer outages are
For non-mission-critical applications, a small outage is unlikely to cause major issues
NATS doesn't really make this problem worse; it exists either way

If your reliability needs really aren't met by a simple NATS instance, there are solutions: ACKs and retries, periodic refreshes from a persisted source of truth, or running NATS in a high-availability configuration (also see the JetStream documentation).
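As a rough illustration of the first option, ACKs and retries, here is a minimal sketch of a sender that retries until the receiver acknowledges. Core NATS pub-sub delivers each message at most once, so this logic lives in the application. The publish_with_retry function, its parameters, and the flaky_send stand-in are all hypothetical; in a real service, the callable would publish the message and wait briefly for a reply.

```python
import time
from typing import Callable, List


def publish_with_retry(
    send_and_wait_for_ack: Callable[[], bool],
    max_attempts: int = 3,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call the sender until it reports an ACK or attempts run out.

    Waits base_delay, then 2x, 4x, ... between attempts (exponential backoff).
    """
    for attempt in range(1, max_attempts + 1):
        if send_and_wait_for_ack():
            return True
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return False


# Simulated sender: the first two attempts get no ACK, the third succeeds.
attempts: List[int] = []


def flaky_send() -> bool:
    attempts.append(len(attempts) + 1)
    return len(attempts) >= 3


# Record the delays instead of actually sleeping, to keep the example fast.
delays: List[float] = []
acked = publish_with_retry(flaky_send, sleep=delays.append)
```

Retries can deliver the same message more than once (for example, when the ACK is lost rather than the message), so receivers should treat messages idempotently.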

Conclusions

NATS is an easy-to-use service that provides extremely useful functionality for today's distributed microservices. It strikes the right balance between simplicity and performance to be useful for many applications, and it can grow as your needs do. I also highly recommend checking out this Changelog podcast episode about NATS.