FinOps is more than just a trendy word for us at Heureka Group. As we continue to rely on GCP services for our #OnePlatform, it's crucial that we take FinOps seriously. We want every developer to understand the logic behind FinOps so that we can all work more effectively. Please join me on my mission to chart our path towards explaining FinOps in Heureka Group.
The problem
As Heureka Group (h!g) progresses with its #OnePlatform project, we have more and more moving gears in our cloud provider of choice – GCP. The thing is, our bill goes up as well. All teams working in GCP are responsible for their workloads: not only for their lifecycle, reliability, and availability but also for their spending.
The way we work with GCP is that we have three environments (VPCs) with several Google projects in them. You can think of a Google project as a folder or a namespace inside the VPC. Thanks to this system, you can assign (several) projects to a team, and they are free to do what they need in their (permission and finance-wise) isolated sandbox.
Next to these projects, we, as an infrastructure team, provide several shared services like the Kubernetes cluster, the Grafana observability stack, and others. It is partly to save money but mainly to make the developers' work as easy as possible.
In the end, there are several questions:
- How do you bill shared services?
- How do teams check their GCP spending?
- How do teams check the other part of their spend (shared services)?
- How do we reliably prevent (or catch as soon as possible) a "fincident"?
I don't have an answer for you right now, but we are walking the path, and I will document our progress here.
The first attempt
We have spent time on this problem before. Last year, we tested the obvious solution – Kubecost. We tried Kubecost as is, and we also tried using some of its OSS components within our observability stack.
Kubecost, at the time we were trying it, was really pricey. Once you wanted some of the enterprise features, the price went through the roof. It also could not detect preemptible / spot instances and price them correctly, and those are the instances we mostly use. We use the Grafana observability stack, and the metrics we got from Kubecost were not friendly to work with. We built a solution around it, but there were many, many edge cases where the data was just completely wrong.
A hidden gem in budgeting
Nevertheless, a new fiscal year approached, and we had to come up with a number to cover the following year's cloud cost. And what’s more – preferably with monthly granularity! Last year, we just gave developers six tiers, asked them which tier each application would fall into, and estimated the price from there. We did not have any data from running in the cloud at that time. It was a disaster. The estimated final cost was huge! If I remember correctly, we cut the number in half, then subtracted something, and it just so happened that we guessed more or less correctly.
So what to do? We had one advantage this year: we already had data about what we are running in GCP and what it looks like.
For developers to be able to guess how the spend for their application will develop, it is vital to know how it looks now, and a brilliant idea occurred to me. Let’s just straightforwardly show them how their application behaves from a resources point of view so that they can estimate from there. This is how the CUs were born!
CUs
Our standard Kubernetes node pools have a CPU-to-RAM ratio of 1:4 (1 CPU per 4 GB of RAM), so one CU corresponds to 1 CPU or 4 GB of RAM. I went to our Mimir, took the Kubernetes metrics on CPU and RAM usage for each namespace for a given month, and calculated how many CUs each namespace used. The CUs were calculated separately for CPU and for RAM, and the higher value was used. For example, usage of 1 CPU and 6 GB of RAM resulted in 1.5 CU, because 6 GB / 4 GB = 1.5 is higher than 1 CPU / 1 = 1.
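The calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the constants, and the namespace names are made up for the example, and the inputs are assumed to be usage figures already aggregated over the month (how exactly the monthly numbers are pulled from Mimir is not covered here).

```python
# Assumption: one CU equals 1 CPU core or 4 GB of RAM (the 1:4 node ratio).
CPU_CORES_PER_CU = 1.0
RAM_GB_PER_CU = 4.0


def compute_units(cpu_cores: float, ram_gb: float) -> float:
    """Return the CUs for one namespace: the higher of the CPU and RAM values."""
    cpu_cus = cpu_cores / CPU_CORES_PER_CU
    ram_cus = ram_gb / RAM_GB_PER_CU
    return max(cpu_cus, ram_cus)


# Hypothetical monthly usage per namespace: (CPU cores, GB of RAM).
monthly_usage = {
    "checkout": (1.0, 6.0),  # RAM-bound: 6 / 4 = 1.5 CU
    "search": (3.0, 8.0),    # CPU-bound: 3 / 1 = 3.0 CU
}

cus_per_namespace = {
    namespace: compute_units(cpu, ram)
    for namespace, (cpu, ram) in monthly_usage.items()
}
print(cus_per_namespace)
```

Taking the maximum rather than the sum means a namespace is charged for whichever resource effectively occupies the node.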
We gave all teams the list of namespaces with their calculated CU usage and let them "guesstimate" what would happen in the next FY. Of course, this sounds too easy to be reliable. We also had to account for some teams that we knew would have a massive network spend or would need some other special things. However, these unique cases were handled manually.
Now, the budget is in money, not resource capacity. And there are other shared services besides Kubernetes compute, such as the observability stack mentioned earlier, so what to do? This was the easiest part!
We calculated the price of all shared services, then took the total number of CUs from the "model" month and divided the shared price by that total. We then applied this "price per CU" to the developers' estimates. The logic behind this is that the more CUs an application uses, the more logging, metrics, networking, etc., it will need. So all the shared stuff is something like a tax.
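As a rough sketch of that arithmetic: the function names and all numbers below are invented for illustration and are not our real costs.

```python
def price_per_cu(shared_cost: float, total_cus: float) -> float:
    """Spread the cost of all shared services over the CUs of the model month."""
    if total_cus <= 0:
        raise ValueError("total_cus must be positive")
    return shared_cost / total_cus


def estimated_budget(estimated_cus: float, cu_price: float) -> float:
    """Turn a team's CU estimate into money using the price per CU."""
    return estimated_cus * cu_price


# Hypothetical model month: shared services cost 10,000 and all namespaces
# together used 400 CUs.
cu_price = price_per_cu(shared_cost=10_000.0, total_cus=400.0)

# A team that expects to use 12 CUs per month next year.
team_budget = estimated_budget(estimated_cus=12.0, cu_price=cu_price)

print(cu_price)     # 25.0
print(team_budget)  # 300.0
```

Special cases, such as a team with unusually heavy network traffic, would still need to be added on top of this by hand.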
Ultimately, it did not go that smoothly because, like with the tiers, devs were overestimating, and we needed to cut the estimated price. This time, though, the cut was about ⅓, which is way better than before.
But it proved to be an approach that was comfortable for everybody. We succeeded in rolling it out to the teams. It was easily understandable and overall a success.
The way we want to go
So, we have a way to budget cloud spend. It is a semi-success but a significant improvement over last year. It is easy to calculate and does not require any special tooling. Why not use it for our FinOps observability?
In theory, it should be easy. Just create our own exporter, which would recalculate the CUs on the fly and publish them as another metric, and then work with the data! Easy-peasy, right? And once you have that, each team can implement their own alerting based on CUs/price!
This is the path we have chosen to walk. Next time, I will share more details about the implementation and how the reality check went.
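To make the idea more concrete, here is a minimal, standard-library-only sketch of what such an exporter could look like. It is not our implementation: the metric names, the `namespace` label, the port, the price, and the hard-coded usage in `fetch_usage` are all assumptions made for the example. A real exporter would query current usage from the metrics backend instead, and would also escape label values.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

CPU_CORES_PER_CU = 1.0  # assumption: 1 CU = 1 CPU core ...
RAM_GB_PER_CU = 4.0     # ... or 4 GB of RAM
PRICE_PER_CU = 25.0     # hypothetical price per CU


def compute_units(cpu_cores: float, ram_gb: float) -> float:
    """CUs are the higher of the CPU-based and RAM-based values."""
    return max(cpu_cores / CPU_CORES_PER_CU, ram_gb / RAM_GB_PER_CU)


def fetch_usage() -> dict:
    """Placeholder data: namespace -> (CPU cores, GB of RAM)."""
    return {"checkout": (1.0, 6.0), "search": (3.0, 8.0)}


def render_metrics(usage: dict, cu_price: float) -> str:
    """Render CUs and estimated cost in the Prometheus text format."""
    cu_lines = [
        "# HELP namespace_compute_units CUs used by a namespace.",
        "# TYPE namespace_compute_units gauge",
    ]
    cost_lines = [
        "# HELP namespace_estimated_cost CUs multiplied by the price per CU.",
        "# TYPE namespace_estimated_cost gauge",
    ]
    for namespace, (cpu, ram) in sorted(usage.items()):
        cus = compute_units(cpu, ram)
        label = f'{{namespace="{namespace}"}}'
        cu_lines.append(f"namespace_compute_units{label} {cus}")
        cost_lines.append(f"namespace_estimated_cost{label} {cus * cu_price}")
    return "\n".join(cu_lines + cost_lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Recalculates the metrics on every scrape."""

    def do_GET(self):
        body = render_metrics(fetch_usage(), PRICE_PER_CU).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9100) -> None:
    """Start the exporter; blocks until the process is stopped."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` would expose the metrics for scraping, and teams could then build dashboards and alerts on top of the two gauges.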