FinOps Starts With Engineers
By Jim Treinen, CEO & Co-Founder
As engineers, the cloud gives us the opportunity to work with tremendous flexibility and efficiency. It allows us to be responsive and to just get stuff done, but it also comes with a downside. When we use the public cloud, everything we do is a proxy for cost: every action we take, every resource we consume, and every service we use shows up on a bill.
In the early days of a new cloud deployment, this is often not an issue. The architecture and workloads are well understood, and the resulting costs are nominal. As the application and infrastructure evolve, there eventually comes a point where this is no longer true. In our experience, there is usually a slow, steady increase in consumption that goes relatively unnoticed, followed by a series of events that tips the scales, at which point all work stops to “fix the cloud cost problem”.
When the alarm bell rings, feature development stops. Our flow is interrupted, and we must divert our attention away from building and running our core application to focus on figuring out what happened.
While the public cloud vendors do make some basic tooling available for seeing what is driving up costs, truly understanding what happened generally involves collecting, storing, and analyzing vast amounts of detailed data. That work produces reports that are used once, then thrown away until the next cost crisis arises and the process repeats. At Strake, we refer to this phenomenon as the “Cycle of Pain”.
Anyone who has tackled this problem more than once knows that it is not enough to be aware that a problem exists. To fix a problem, you must truly understand its root cause. The only people who can ultimately understand what is driving cloud costs, at the level of detail required to fix them, are engineers. That’s why we’re here.
The Creation and Removal of Work
The trick to getting to root cause is uncovering how individual actions result in consumption relationships. This would be relatively easy if the cloud were static, but it is not. The elasticity of the cloud means the number, type, and nature of these relationships change over time; the very structure of what we are attempting to map and understand changes. Because of this, looking at a single resource on its own is of very limited value. We must understand the system the resource exists in, and how its behavior, performance, and utilization impact the behavior, performance, and utilization of every other system and resource with which it interacts. Until now, no solution has existed to collect and analyze all of the cost, audit, and operational data required to solve this problem. Hence the extra work for the engineering team, the lost feature velocity, and the constant interruptions to understand what’s going on.
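One way to picture those consumption relationships is as a directed graph, where an edge from one resource to another means activity on the first drives consumption (and therefore cost) on the second. The sketch below is a minimal illustration of that idea; the resource names and edges are hypothetical, not a real topology or anything specific to Strake.

```python
# Hypothetical consumption relationships: an edge (a -> b) means activity
# on resource `a` drives consumption, and therefore cost, on resource `b`.
edges = {
    "api-server": ["database", "object-storage"],
    "database":   ["block-storage"],
    "batch-job":  ["object-storage", "data-transfer"],
}

def downstream(resource, graph):
    """Every resource whose consumption `resource` can drive, transitively."""
    seen, stack = set(), [resource]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(downstream("api-server", edges)))
# → ['block-storage', 'database', 'object-storage']
```

In a real cloud this graph is not static: edges appear and disappear as instances scale and services change how they interact, which is exactly why a point-in-time view of a single resource tells you so little.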
As mentioned above, in the beginning this work is often done by hand in a one-off, ad-hoc manner. As the team and infrastructure grow, the problems generally grow as well, and in an attempt to minimize interruptions, we carve time out of our schedule, giving up feature velocity, to build our own data collection and analysis platforms that attempt to remove some of this toil.
Data architectures of this type are labor intensive to build and maintain, and they produce code that is itself subject to regular debugging and upkeep. We must now focus on accuracy and reconciliation instead of getting things done. Across an ever-expanding set of resources and services, the end result is more work and less ability to focus on our core job.
Strake removes that work.
The core of what we do is data collection, relationship mapping, aggregation, correlation, and deep time-series-based analysis that tells us where to start looking for problems. Think of it as automated triage. We collect, stitch together, and analyze both cost AND operational data (performance, utilization, lineage, and other variable factors that drive spend), and we present it in an interface designed specifically for engineers.
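The simplest form of that automated-triage idea is flagging the points in a cost time series that break from the baseline, so a human knows where to start digging. The sketch below shows the concept with a toy deviation check; the daily cost figures and the 1.5-standard-deviation threshold are illustrative assumptions, not Strake's actual analysis.

```python
import statistics

# Hypothetical daily cost series for one service, in USD. In practice this
# would come from billing exports joined with operational metrics.
daily_cost = [112, 108, 115, 110, 109, 111, 114, 113, 180, 176]

mean = statistics.mean(daily_cost)
stdev = statistics.stdev(daily_cost)

# Flag days whose cost deviates sharply from the baseline. This is a
# starting point for triage, not a root-cause answer on its own.
anomalies = [
    (day, cost)
    for day, cost in enumerate(daily_cost, start=1)
    if abs(cost - mean) > 1.5 * stdev
]
print(anomalies)
```

A check like this tells you *when* something changed; the harder part, and the reason the operational and relationship data matter, is explaining *why*.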
We collect in a manner that is easy to install, secure in transfer, and safe to access.
All of our data is indexed and searchable, and presented in a way that makes the underlying structure and patterns easy to understand and share.
Until now, most FinOps tools have been designed to increase the awareness of the Finance team. Once an issue is uncovered, the work of figuring out “why” is given to Engineering, and the cycle repeats.
The “why” consists of two parts. The stable, recurring cost of items such as compute nodes is amplified by the not-so-visible “other” variable costs that occur as a result of the compute execution. These costs can manifest as fluctuations in network traffic, I/O, storage, or interactions with a metered service. We think of these interactions as the rich Context (more on that soon) in which a resource is being consumed, and we examine this context to understand the consumption it drives in other parts of the cloud. A cloud-based application’s overall cost and efficiency are determined by the relationships between the resources being consumed, by how the nature of those relationships changes over time, and by how the consumption model changes, not only at the individual resource but in the aggregate.
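The stable-plus-variable split above can be made concrete with a toy decomposition of one workload's monthly bill. All of the rates and volumes below are made up for illustration; they are not real AWS prices or a real workload.

```python
# Toy decomposition of a workload's monthly bill into a stable recurring
# component and the variable "other" costs it drives. All figures are
# illustrative assumptions, not real cloud prices.

HOURS_PER_MONTH = 730

instance_hourly_rate = 0.192          # stable, recurring (the visible part)
network_egress_gb = 1_200             # variable, driven by traffic
egress_rate_per_gb = 0.09
storage_io_requests = 45_000_000      # variable, driven by workload behavior
io_rate_per_million = 0.20

stable = instance_hourly_rate * HOURS_PER_MONTH
variable = (network_egress_gb * egress_rate_per_gb
            + storage_io_requests / 1_000_000 * io_rate_per_million)
total = stable + variable

print(f"stable: ${stable:.2f}, variable: ${variable:.2f}, total: ${total:.2f}")
```

Even in this toy version, the variable costs are a large share of the total, and unlike the instance rate they move with traffic and workload behavior, which is why they are the harder half of the “why”.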
It’s All About Communication
After all the work of collecting, analyzing, interpreting, and understanding this data has been done, there is still the issue of communicating back to the numerous other stakeholders 1) what we have found, 2) what it means, and 3) what we are going to do about it. Historically, this has taken the form of copying and pasting charts and screenshots into slide decks, and either emailing or presenting the findings in an ad-hoc, work-intensive format.
Strake has been built with the ultimate goal of communication in mind. Our core vision is that everyone, from Engineering to the C-Suite, has a shared understanding of their cloud: Devs, Ops, the Product Team, Finance, and the Executive team. We create that shared understanding with the sole goal of making communication between these teams easy.
A natural evolution is occurring in the FinOps movement, where the focus and concentration of the practice is moving out of Finance and Business Operations, and into Engineering. Why is this? The only people who can actually impact cloud spend are engineers. It is for this reason that we are building tools to enable engineers to address the problem as efficiently as possible.
We would love to have you join us on this journey. Click here to join our beta.