Why We're Building Stategraph: Terraform State as a Distributed Systems Problem (stategraph.dev)

31 points by lawnchair 4 hours ago | 27 comments

eschatology 2 hours ago [-]

Hmm

I don’t see the state file as a complete downside. It is very simple and very easy to understand. It makes it easy to tell or predict what terraform will do given the current state and desired state.

Its simplicity makes troubleshooting easier: the state files are easy to read, manipulate, or repair in the event of drift, a mismatch, or a botched provider update.

With the solution proposed it feels like the state becomes a black box I shouldn’t put my hands in. I wonder how the troubleshooting scenarios change with it.

Personally, I haven’t run into the scaling issue described; at any given time there is usually only one entity working with the state file. We do use terragrunt for larger systems but it is manageable. ~1000 engineer org.

lawnchair 1 hours ago [-]

You are right that the simplicity of the state file is a strength and we do not want to lose that. One of our goals with Stategraph is to make state just as easy to inspect through both the command line and the UI.

Not every Terraform setup runs into scaling pain. The trouble tends to show up in larger repos with thousands of resources where teams share big chunks of infra. That is where global locks and full refreshes become a bottleneck and where we think graph semantics help.

philipallstar 11 minutes ago [-]

> inspect through both the command line

This is a bit worrying, though. Do you mean through regular tools like cat or vim, or do we have to install a stategraph-manager tool (and upgrade it ad nauseam) just to look at the state?

lawnchair 6 minutes ago [-]

Regular tools (jq, cat, etc.) still work. That ability doesn't go away.

sylens 59 minutes ago [-]

It's an interesting proposal because they correctly call out that segmenting state files by workspace/environment in a very judicious way causes its own issues as you approach scale or have to work across environments. There is an entire industry of tools and services that help to streamline this process for you, but it still feels very hacky.

I'm curious if this will be compatible with tools like Spacelift or Env Zero, or if they are going to build their own runner/agent to compete in that space.

lawnchair 57 minutes ago [-]

We are already in that space [0] though that's not the focus of this post. Working with teams at scale on orchestration is what pushed us to look deeper at state itself and eventually create this project.

0: https://terrateam.io

giveita 2 hours ago [-]

Not an expert, but don't microservices help with this? Each microservice has its own YAMLesque resource descriptor (TF, CloudFormation, whatever) and is managed independently. My team can add an SQS queue or an S3 bucket without locking your team.

I might be wrong regarding more sophisticated infra though.

mystifyingpoi 38 minutes ago [-]

It is the usual DRY/WET concern. Having microservices be completely independent and relying only on a shared message broker or service discovery has its benefits, but the cost is generally duplication. Things like "whitelist this inbound IP for all services" or "configure telemetry endpoint" often end up requiring N changes to N separate repos, and it becomes hell if you have to talk to N teams.

sausagefeet 2 hours ago [-]

Not necessarily. The guidance is to split your TF code across multiple states, which might feel like it makes sense, but for your microservices to communicate they need to share some base infrastructure, such as networking, so where does that live? Putting dependencies in their own state means that you lose the ability to understand how changing them impacts all of your infrastructure, because you have an information black hole at the boundary of their state.

With Stategraph, you'll get all the benefits and isolation of separate state files, but when you change resources, you'll get meaningful plans around all of the infrastructure they impact, not just the statically defined boundaries of a state file.
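To make "all of the infrastructure they impact" concrete, here is a minimal Python sketch of the general idea: walk the dependency graph from the changed resource to everything that transitively depends on it. This is an illustration only, not Stategraph's implementation; the resource names and the `impacted_by` helper are hypothetical.

```python
from collections import deque

# Hypothetical resources: each key depends on the resources in its list.
DEPENDS_ON = {
    "vpc.main": [],
    "subnet.private": ["vpc.main"],
    "db.orders": ["subnet.private"],
    "svc.checkout": ["subnet.private", "db.orders"],
    "bucket.logs": [],
}

def impacted_by(changed, depends_on):
    """Return the changed resource plus everything that transitively depends on it."""
    # Invert the edges so we can walk from a resource to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    # Breadth-first walk over the inverted edges.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(impacted_by("subnet.private", DEPENDS_ON)))
# ['db.orders', 'subnet.private', 'svc.checkout']
```

With split state files, the walk would stop at whichever file holds `subnet.private`; with one graph, `svc.checkout` shows up even if another team owns it, and the unrelated `bucket.logs` stays out of the plan.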

lawnchair 2 hours ago [-]

Author here. You are right that splitting by microservice reduces overlap. The problem is that shared resources such as VPCs, IAM, or databases never go away, so contention shows up there.

Splitting state files is the common workaround but that only creates new problems like cross state dependencies and orchestration glue. The real issue is the storage model which is a single JSON blob with a global lock. Treating state as a graph with proper concurrency control avoids contention while keeping a cohesive view of infrastructure.

spinningarrow 2 hours ago [-]

Do you have an example you can share?

We have about 30 services with each managing their own terraform state. We also have a shared infra repo managing some top level items. We haven’t run into any issues (with any regularity at least) that I can think of but I’m wondering if this could be a good tool for us as we grow and things become even more complex?

lawnchair 1 hours ago [-]

The pain really shows up when teams manage large sets of infrastructure in one place with thousands of resources. Even a small change forces a global refresh and a global lock, so you end up waiting on operations that have nothing to do with your change. Splitting reduces contention but fragments your view of the system. We want state to behave like the dependency graph it already is.

sausagefeet 2 hours ago [-]

Hey! One of the Stategraph developers here, and I can answer any questions. The major motivation is how small the scale is at which Terraform/Tofu starts to break down and creates work for users, who then have to refactor around performance issues that shouldn't exist. So we want a drop-in solution that just dissolves those issues without the user having to do anything.

anonymousDan 48 minutes ago [-]

Are there any statistics/analyses for the popularity of these different configuration management languages/frameworks (Terraform, Pulumi etc) in cloud settings? Trying to figure out which one(s) are worth learning.

pst 2 hours ago [-]

This is awesome. Having a single state for all resources in an environment is critical for keeping all the moving pieces in check and a core design aspect of Kubestack. But the growing state files quickly become a bottleneck. I'm definitely giving this a good test drive. Very excited.

sausagefeet 2 hours ago [-]

Thank you, that is great to hear! We're pushing pretty hard to get a pre-alpha out to get some foundations testable by the community.

dwroberts 2 hours ago [-]

If you use a tool like Atmos (https://atmos.tools/) you kind of fix this issue already for free - because it takes the place of the root module, it actually manages the state of each sub module separately (they each have their own individual state file rather than being converged into one).

lawnchair 2 hours ago [-]

I don't think it fixes it. Atmos makes splitting and managing multiple states easier, but it still splits the graph. It doesn't change the underlying execution model.

angio 1 hours ago [-]

How does this compare with Pulumi? AFAIK they also don't have a state file and rely on an external database to store state. Is your locking granularity better?

lawnchair 1 hours ago [-]

I don't know enough about Pulumi to make a fair comparison on locking granularity. Pulumi's model is pretty different from Terraform/OpenTofu in general and state management is only one part of that. We're focused on optimizing the Terraform execution model and making the state layer match the graph semantics it already uses.

johanneskanybal 1 hours ago [-]

I think that’s the article, but tl;dr: that’s only part of the problem and is already widely adopted, with mutexes in, say, Dynamo or whatever flavor you choose. This is about not having global locks or 10 arbitrary locks per subdomain, but rather figuring out the exact resources affected and locking only those.

Sounds very neat if you’re a big enough org.
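As a rough Python sketch of what "locking only those" could look like, assuming an in-memory lock table keyed by resource address (the `ResourceLocks` class and all names here are hypothetical, and a real backend would need a durable, shared store rather than a process-local dict):

```python
import threading

class ResourceLocks:
    """Toy in-memory lock table keyed by resource address (hypothetical)."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the table itself
        self._held = {}                 # resource address -> owner

    def try_acquire(self, owner, resources):
        """Atomically take all requested locks, or none if any is already held."""
        with self._guard:
            if any(r in self._held for r in resources):
                return False
            for r in resources:
                self._held[r] = owner
            return True

    def release(self, owner):
        """Drop every lock held by this owner."""
        with self._guard:
            self._held = {r: o for r, o in self._held.items() if o != owner}

locks = ResourceLocks()
print(locks.try_acquire("team-a", {"queue.orders", "svc.checkout"}))  # True
print(locks.try_acquire("team-b", {"bucket.logs"}))                   # True: disjoint set
print(locks.try_acquire("team-c", {"svc.checkout"}))                  # False: overlaps team-a
```

Under a single global lock, team-b would have waited on team-a even though the two changes touch nothing in common; here only team-c, whose change really overlaps, is blocked.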

cyberpunk 1 hours ago [-]

I mean take this with a grain of salt, it's purely anecdotal; but everywhere I've heard of that chose pulumi over tf is no longer using pulumi. I'd love to hear some opposing experiences to that though!

cedws 49 minutes ago [-]

I was in a platform team using Pulumi (TypeScript) for a while. An issue I observed is that the team members with weaker programming skills were contributing not-so-great changes, and parts of the codebase diverged in style. The Output type also took some time for us to get our heads round and it felt awkward to work with; we were having to chain a lot of calls and had callback hell sometimes.

We were all experienced with Go but at the time the Go SDK was very awkward, although I think some of that has been resolved with generics now. TF is less expressive but I think that’s actually better for 99% of cases.

arccy 2 hours ago [-]

so kind of like crossplane where each resource is managed individually?

tuananh 2 hours ago [-]

can it be a sqlite db in s3 with locking implemented with s3?

sausagefeet 2 hours ago [-]

Hello, Stategraph developer here, the answer is: probably not. That doesn't resolve the core issue of state being managed as a big blob.

tuananh 27 minutes ago [-]

but that big blob is a database. surely it's better than a json file right?
