CI/CD - Broken Systems


created 2026-04-01, updated 2026-04-25


Background

I've been using GitHub Actions for a while. If you want an entertaining read about its shortcomings, check out this blog post from Ian Duncan. My team has some pretty intense workloads on GitHub Actions, and we've stress tested the self-hosted runners in ways I wouldn't wish on anyone.

We started with EC2-deployed runners - lots of bugs. We switched to the ARC runners in Kubernetes - different bugs, one of which spun up hundreds of persistent volumes. Then GitHub officially adopted that library, and we traded some bugs for other bugs during the upgrade.

One thing has remained true through all of it - it has never been a joy to use.

I've used other platforms too. In my experience, the main problems still tend to exist. Pipeline configuration is some flavor of YAML that grows until I no longer want to look at it. Sharing code between pipelines is awkward at best. Running things locally is usually painful, and leads to me wanting to port everything to a bash script. Iterating on the design takes forever because of the lack of "runs anywhere" support. And, when something breaks, the debugging loop is often - push a commit, wait for a runner, read the logs, repeat. It has not been super pleasant.

Here's what gets me. We write software for a living. We build interesting, complex systems that handle immense load and are incredibly fault-tolerant. But when it comes to CI/CD, we stitch the YAML together, slap on some duct tape, and say a prayer?

For a long time, I've wanted to just run my pipelines locally. I've wanted dynamic pipelines that can make decisions. Instead I've been writing static YAML files where I have to trick the workflow into skipping steps - which requires either writing multiple workflows or registering a runner just to skip the step, so that it counts as a "success" for the following "needs" step. Whew!

Perhaps I'm looking under the wrong rocks, but how is there not a major CI/CD platform that is easy to use, and written in a regular programming language? Is it too difficult to do well? I can't imagine that would be the case. So - against my better judgement (you'd have to be crazy to think you can break into the world of CI/CD platforms) - I started building one. I had the pestering thought - if this were solvable, someone would have solved it already. But I also know what I want, I know my own struggles, and I feel pretty good at platform work. Maybe that's enough to start with.

What I'm building

The project is called Sparkwing. It's a self-hosted (or managed) CI/CD platform that runs on Kubernetes. Pipelines are written in Go instead of YAML, which means they're real programs - they can branch, loop, call functions, share code, and do anything else you'd expect from actual software.
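To make "real programs" concrete, here is a minimal sketch of a pipeline that branches and loops in plain Go. The Step type, echoStep, and BuildPlan are made up for illustration and are not Sparkwing's actual DSL; the point is only that the decision about what runs lives in ordinary code.

```go
package main

import (
	"fmt"
	"strings"
)

// Step is a hypothetical unit of work, not Sparkwing's real API.
type Step struct {
	Name string
	Run  func() error
}

// echoStep returns a Step that only prints its name; a real one would shell out.
func echoStep(name string) Step {
	return Step{Name: name, Run: func() error {
		fmt.Println("running:", name)
		return nil
	}}
}

// BuildPlan decides at runtime which steps belong in the pipeline.
func BuildPlan(branch string, changedServices []string) []Step {
	var plan []Step
	// Loop: one build and one test step per changed service.
	for _, svc := range changedServices {
		plan = append(plan, echoStep("build "+svc), echoStep("test "+svc))
	}
	// Branch: only main and release branches get a deploy step.
	if branch == "main" || strings.HasPrefix(branch, "release/") {
		plan = append(plan, echoStep("deploy"))
	}
	return plan
}

func main() {
	for _, step := range BuildPlan("main", []string{"api", "worker"}) {
		if err := step.Run(); err != nil {
			fmt.Println("failed:", step.Name, err)
			return
		}
	}
}
```

On a feature branch the deploy step is simply never added to the plan, so there is nothing to skip and nothing to fake as a success.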

Go compiles into a binary that can run just about anywhere. Dockerfiles are super portable too. The combination creates a "nearly runs anywhere" CI/CD system. It's pretty great. I've even tested running a Sparkwing pipeline inside GitHub Actions. Totally works (in single-node mode) - and I can still see the DAG, logs, etc. in the Sparkwing UI.

Getting the interface right is probably the most important thing. After a bunch of toil, I decided to write a DSL. Huuuuge improvement. Authoring pipelines became a breeze. AI agents can do it pretty much first try - and I'm still early in this process. Also, the DAG of nodes, and the inner DAG of steps, can be explained without running the pipeline. So agents have a ton of information to work with (it's good for humans too). I want this to be incredibly simple to use. Go (or any programming language) can add some overhead complexity. Admit it to yourself - YAML was never that bad to read. Well, that is, until it got more complex than a super simple docker build, test, push. The nice thing about Go is that a ton of people know it, it's very simple, and it's amazing for infrastructure work (which CI/CD largely is).
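The "explained without running" idea can be sketched roughly like this: if nodes declare their dependencies as data, a topological sort can print a valid execution order before anything executes. The Node type and Explain function are hypothetical names for illustration, not Sparkwing's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical pipeline node: a name, its dependencies, and its steps.
type Node struct {
	Name  string
	Needs []string
	Steps []string
}

// Explain returns node names in a valid execution order without running
// anything (Kahn's algorithm). It fails on duplicates, unknown deps, or cycles.
func Explain(nodes []Node) ([]string, error) {
	indegree := make(map[string]int, len(nodes))
	dependents := make(map[string][]string)
	for _, n := range nodes {
		if _, dup := indegree[n.Name]; dup {
			return nil, fmt.Errorf("duplicate node %q", n.Name)
		}
		indegree[n.Name] = 0
	}
	for _, n := range nodes {
		for _, dep := range n.Needs {
			if _, ok := indegree[dep]; !ok {
				return nil, fmt.Errorf("node %q needs unknown node %q", n.Name, dep)
			}
			indegree[n.Name]++
			dependents[dep] = append(dependents[dep], n.Name)
		}
	}
	var ready []string
	for name, d := range indegree {
		if d == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready) // keep the output deterministic
	var order []string
	for len(ready) > 0 {
		name := ready[0]
		ready = ready[1:]
		order = append(order, name)
		for _, next := range dependents[name] {
			indegree[next]--
			if indegree[next] == 0 {
				ready = append(ready, next)
			}
		}
		sort.Strings(ready)
	}
	if len(order) != len(nodes) {
		return nil, fmt.Errorf("dependency cycle detected")
	}
	return order, nil
}

func main() {
	nodes := []Node{
		{Name: "deploy", Needs: []string{"test", "lint"}, Steps: []string{"push image", "sync gitops"}},
		{Name: "test", Needs: []string{"build"}, Steps: []string{"go test ./..."}},
		{Name: "lint", Steps: []string{"go vet ./..."}},
		{Name: "build", Steps: []string{"go build ./..."}},
	}
	order, err := Explain(nodes)
	if err != nil {
		fmt.Println("invalid pipeline:", err)
		return
	}
	fmt.Println("plan:", order)
}
```

Because the plan is computed from declarations alone, a human or an agent can inspect it, and catch a cycle or a typo in a dependency name, before a single step runs.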

The thing I love about Sparkwing so far is that it was trivial to stand up a release pipeline that gives me sub-15s build-test-deploy times locally, and about the same when running in the cluster. ARE YOU KIDDING ME??? That's bloody insane! The caching system that I've built so far is bananas!! The immediate benefit was that I could cache things locally and use that for faster runs, but the fact that I replicated it in an ephemeral environment is pretty amazing. I'm really hoping this scales - I've tried to build it so that it would, but I haven't tested it rigorously yet.
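This post doesn't go into how that caching works, so the following is only a generic sketch of one common approach, content-addressed step caching, and not a description of Sparkwing's cache. The assumption is that a step's result can be reused whenever its command and input contents hash to the same key; cacheKey, Cache, and Do are hypothetical names, and the store is in-memory to keep the example small.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// cacheKey derives a deterministic key from a step's command and input contents.
func cacheKey(command string, inputs map[string][]byte) string {
	paths := make([]string, 0, len(inputs))
	for p := range inputs {
		paths = append(paths, p)
	}
	sort.Strings(paths) // map iteration order is random, so sort first

	h := sha256.New()
	// Length prefixes keep different inputs from hashing to the same byte stream.
	fmt.Fprintf(h, "cmd:%d:%s\n", len(command), command)
	for _, p := range paths {
		fmt.Fprintf(h, "file:%d:%s:%d:", len(p), p, len(inputs[p]))
		h.Write(inputs[p])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// Cache is a hypothetical in-memory result store keyed by cacheKey output.
type Cache struct {
	results map[string]string
}

func NewCache() *Cache {
	return &Cache{results: make(map[string]string)}
}

// Do runs fn only when no result is stored for key, and reports whether it was a hit.
func (c *Cache) Do(key string, fn func() (string, error)) (string, bool, error) {
	if out, ok := c.results[key]; ok {
		return out, true, nil
	}
	out, err := fn()
	if err != nil {
		return "", false, err // failures are never cached
	}
	c.results[key] = out
	return out, false, nil
}

func main() {
	inputs := map[string][]byte{"main.go": []byte("package main")}
	key := cacheKey("go build ./...", inputs)
	cache := NewCache()
	build := func() (string, error) { return "bin/app", nil }

	for i := 0; i < 2; i++ {
		out, hit, err := cache.Do(key, build)
		fmt.Println(out, hit, err)
	}
}
```

The same key works on a laptop and in an ephemeral runner as long as both can reach the store behind it, which is the property that makes local and in-cluster timings comparable in a design like this.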

I also love that it "runs anywhere". If you have permissions to ECR and EKS and your GitOps repo, you can run the pipeline from your Mac, making use of all of your local caching, and deploy straight to the dev cluster. Or prod, ya know, if GitHub Actions and Sparkwing are both down - you can deploy your working tree wherever you have permissions.

Before you go, I want you to think about this. You probably have logic shared across a bunch of pipelines. You might even have duplicate pipelines due to conditionals that are too costly to express in YAML. But imagine how much nicer that could be. What if you had full access to a real programming language, and the system could express the types of conditionals you need without fighting with YAML? You could DRY up a bunch of duplicated code while increasing clarity and improving runtime performance. Authoring pipelines becomes faster, safer, and cheaper.

Where it's at

I've been using Sparkwing for my own projects for a while now. It's ready for others to use, but I'm not committing to a stable API surface yet, so I wouldn't recommend it for anyone who isn't willing to put up with some code churn. Luckily, AI is really good at authoring Sparkwing pipelines, so the risk isn't too bad - and I'm trying to make the upgrade path pretty nice. Overall, there are rough edges and decisions I'm still working through. But it's insanely fast, and it has a full platform behind it, nice caching capabilities, and an awesome CLI. It works well enough that I don't want to go back.

I'll write more about the technical decisions and the things that surprised me along the way. For now I just wanted to say - I think CI/CD can be better than what we have, and I'm having a lot of fun trying to build it.

Sparkwing