
Nabeel Sulieman

Project Discontinued: Traefik Certs in Azure

2020-03-07

A few months ago, I wrote about a side project I intended to work on. The idea was to extend the community version of Traefik by implementing an ACME cert store backed by Azure Table Storage. I made some pretty good progress on the feature, but have finally decided not to continue.

The intermediate progress can be found here. If anyone wishes to continue this work, please have at it!

More details below.

Why Did I Stop Working on This?

Well, there are so many projects that I'd like to work on. For a while, this was at the top of my list, so I worked on it.

However, a few weeks back, my Kubernetes cluster had a major outage. Traefik stopped working all of a sudden and all my sites were unreachable. I tried for several hours to get it working, but just couldn't figure it out. Traefik was working fine using the Let's Encrypt staging endpoint, but it would then fail when switching to production.

I finally got tired of messing with Traefik and decided to look again at alternatives. I found a good nginx+CertManager tutorial, and this time (I had tried before), the setup went smoothly and I got all my sites up and running without Traefik. Once I was completely off of Traefik, I lost the motivation to continue work on it.

How far did I get?

I got pretty far. In terms of what's currently in the branch, here's what got done:

  • Wrote an Azure Table Storage client
  • Created a new implementation of all the required ACME cert store interfaces (backed by Azure Table Storage instead of local disk)
  • Hooked up required configuration parameters through Traefik's settings system

I was able to build and run this test version of Traefik in my cluster. The code is able to create certificates, save them in Azure storage, and reload them when the service restarts. It does occasionally throw errors that I haven't quite figured out, but that just needs a little more testing and debugging.

What's left to do?

The biggest feature that is not implemented yet is support for concurrency. Azure Table Storage has a great story for concurrency. ETags prevent multiple writers from overwriting each other's changes, and that model should work perfectly for multiple instances of Traefik.

When a certificate needs to be created or renewed, Traefik's behavior needs to be changed such that:

  • The multiple instances of Traefik attempt to "lock" the job
  • One instance will win the lock, while all the others will fail and back off
  • The instance that won the lock completes the task at hand and releases the lock
  • If the winning instance fails, the lock needs to expire so other instances can retry

There is also likely some work needed to let all instances know when a new certificate is created. This is necessary so that all instances of Traefik refresh their stores and use the latest certificates. However, this could be as simple as refreshing certs from Table Storage periodically. I don't think anything more complex is needed.

The final piece of the puzzle is testing. I did not write any unit tests and I don't know how much work is involved there. The code will of course also need some good old-fashioned exercise in different test environments to weed out all the bugs.

If You Want to Continue This Work

I believe the code is quite clean and readable. I tried to follow good coding standards and all that. The code is also lightweight in that it doesn't pull in any new dependencies.

So overall, if you want to work on this feature, I think my work could be a really good base to start from. I would also be willing to offer assistance or pair up with someone to finish this project. I'm just not willing to continue to work on it by myself at the moment.

Final Note

Although this didn't result in a finished product, I have no regrets over the many hours I spent working on it. It was a fantastic opportunity to build some Go language skills, and there are always useful lessons to learn when working in other people's code.

That said, I also think there's value in realizing when it's time to stop working on something. While the idea of persevering to the very end is highly romanticised, I don't think it's always the best thing to do. I would have loved to complete the project, but there are several side projects in my mind (or partially started) that I would like to work on. Besides the remaining feature development and testing, I'm sure there would be a lot of work to get this code approved and rolled into the official release of Traefik.

So all in all, I would say I got 80% of the benefit while putting in 20% of the work needed to complete this project.