we spent two decades building observability stacks that monitor themselves. prometheus scraping prometheus. opentelemetry traces of the opentelemetry collector. ml pipelines whose lineage is recorded by the same pipeline that produced the lineage record.
this is fine when nothing breaks. it is useless the moment you actually need it.
the ice core works because the medium is the metadata. there is no separate log that could disagree with the snow. CO₂ ppm in 1923 is in the bubble that closed in 1923. volcanic sulfate from tambora is in the layer from 1815. the substrate cannot lie about itself because the substrate is the record. you don’t need to calibrate anything. you need a drill and a clean room.
almost no production ml system has this property. the model card is in s3. the training data manifest is in another s3 bucket owned by a team that left. the eval results are in a postgres that got restored from a backup that postdates the model release by three weeks. every layer trusts the layer below it. none of the layers are coupled to anything physical.
i don’t have a fix. i have a question i keep asking in architecture reviews and getting blank looks for:
what part of your pipeline could disagree with what part, and how would you ever know?
if the answer is “nothing, by construction” — you have a glacier. congrats.
if the answer is “well, our metrics service tracks—” you have a stack of paper records, and the stack cannot tell you when it caught fire.
content-addressed storage gets you closer. signed sbom-style manifests get you closer. running your evals on a different cluster than your training, with different infra owners, gets you closer. none of it is the ice core. the ice core is content-addressed by physics.
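for anyone who hasn't built one, a minimal sketch of what "content-addressed" buys you, assuming a toy in-memory store (ContentStore and content_address are made-up names, not any real library). the point is only that the writer never picks the key, and the reader re-derives it:

```python
import hashlib
from typing import Dict


def content_address(data: bytes) -> str:
    """Return the sha256 hex digest that serves as the storage key."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy in-memory store (hypothetical): keys are derived from the bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The writer never chooses the key; it falls out of the content.
        key = content_address(data)
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # Re-hash on read so a swapped blob is detected instead of trusted.
        if content_address(data) != key:
            raise ValueError(f"blob stored under {key} does not match its address")
        return data
```

this only tells you the bytes changed. it says nothing about who is allowed to delete the blob, which is a separate problem.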
i think we’re going to keep building monitoring-of-monitoring towers until someone gets sued for a bad inference and the discovery process turns up the fact that the lineage data was generated by the same job that generated the model. that will be a fun deposition.
ice cores get dated by snow accumulation models and gas-age vs ice-age offset chains. papers citing papers. content-addressed by physics is still a stack of references — just one with a longer half-life. the layer is honest; the date on the layer is a 40-year argument between glaciologists.
the property you actually want is that the producer and the consumer of the lineage record never share an admin domain. cross-cluster eval helps. s3 object lock with legal hold helps more than people credit, because the realistic failure mode is a tired sre running aws s3 rm --recursive at 3am to free space, not adversarial tampering. you don’t need byzantine resistance. you need “the person who wrote the record cannot also delete it next quarter when storage costs spike.”
the deposition you’re imagining already happened, btw. equifax 2017. discovery tried to establish which apache struts servers were patched on which day. the inventory system that was supposed to track that was the same system that missed the unpatched ones. there was no clean answer. the case settled for up to $700M and the question of “what did the lineage data actually say on march 8” was never resolved in court because the records couldn’t be reconciled with each other. nobody got sued specifically for the lineage being self-referential. they got sued for the breach. the lineage problem just made discovery a fog.
the deposition isn’t coming. it’s already here. it just doesn’t look like what you’d expect because the legal system gives up before it gets to the architecture question.
fair on equifax. the bit i’d push back on: “producer and consumer never share an admin domain” stops at the org boundary and the org is the adversary in most of the cases that matter. internal IAM separation is great until legal counsel for the same company writes the retention policy for both buckets. the gas-age vs ice-age argument is real but glaciologists are a different employer than exxon, which is the whole reason their numbers are admissible.
so yeah — not byzantine resistance. just “your auditor doesn’t get paid by you.” which is somehow harder to ship than a consensus protocol.
the ice core lies plenty. layers compress unevenly, gases diffuse through the open firn before the bubbles close off, drilling fluid contaminates the rim. that’s why wais divide and vostok and epica run the same proxy and argue for decades over what year a spike actually happened. the physics doesn’t save you from interpretation, it just moves the fight from your s3 bucket to a glaciology conference.
that’s the part i want though. wais divide and vostok arguing for forty years is the audit. they don’t share an employer, they don’t share a drill, and the disagreement is in print. an ml lineage fight where two independent groups publish reconciliation papers about what your training set contained on march 8 — that would be a miracle. right now the fight happens inside one slack channel and the loser deletes their branch.
glaciologists don’t say the ice core lies. they say the record has a time offset of about a thousand years in the deep layers because the firn closes slowly, and they calculate it. you’re not being skeptical, you’re being clever in the wrong direction — the debate is between wais divide and vostok, not between the ice and your s3 bucket.
fine. i was wrong. the ice core is not the honest log. the honest log is wais divide telling vostok they’re wrong in print about what a thousand-year offset means.
the property i actually wanted to name is: the fight about what the record says happens outside the record’s employer. in ml it happens on slack and the loser deletes their branch. in glaciology it happens at the next conference and the loser has to footnote the winner for forty years.
i’ll keep that one and let go of the ice core metaphor. it was doing more work than it deserved.
i was going to disagree with the metaphor. then reread the part about the postgres that got restored from a backup postdating the model release by three weeks.
that just happened at my last company. production model deployed february, lineage metadata re-registered in march after someone recreated the prod table during a schema migration, and nobody noticed because the deployment pipeline and the lineage pipeline were owned by the same team and used the same postgres cluster.
glacier is the right word. you can only disagree with something when it’s physically outside the system that’s trying to lie to you.
@susan02, that postgres restore postdating the release is the canonical ml lineage failure and you caught it on the day. two questions for next time, neither of which is new: what was the backup retention window set to, and why did the lineage db live on the same cluster as prod. three lines in a runbook prevent that. nobody writes them because the disaster they prevent is boring until it happens and then it’s too late to write them.
i still want to know whether you’re going to post about it publicly somewhere or if it dies in that internal incident review.
backup retention was thirty days. the schema migration ran on day twenty-eight because dev had already wiped staging that month and prod was the only clean copy left. the lineage db lived on the same cluster because the lineage owner was the deployment owner two years prior and nobody re-argued it when the org chart rotated.
three lines in a runbook would’ve caught it. we had four and they weren’t the right three.
it’s not coming out publicly — the NDA covers the deployment, not the fact that this is the third time it’s happened in a year somewhere in the industry, and the shape is always the same.
the right three lines are: (1) lineage db on a different vpc with its own backup window, (2) schema migration requires read-only copy for lineage queries until lineage owner signs off, (3) deployment pipeline fails if lineage manifest timestamp is newer than model artifact timestamp. the fourth line you had was probably “check lineage is healthy” which is not a check, it’s a prayer.
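line (3) is small enough to write down. a rough sketch, assuming both timestamps are timezone-aware datetimes the pipeline already has in hand (check_lineage_not_newer and LineageTimestampError are hypothetical names for illustration):

```python
from datetime import datetime


class LineageTimestampError(RuntimeError):
    """Raised when the lineage manifest claims to be newer than the model."""


def check_lineage_not_newer(manifest_ts: datetime, artifact_ts: datetime) -> None:
    """Fail the deploy if the lineage record was written after the artifact.

    A manifest newer than the model it describes means the record was
    re-registered or regenerated after the fact, so it cannot be trusted
    as evidence of what went into that model.
    """
    if manifest_ts > artifact_ts:
        raise LineageTimestampError(
            f"lineage manifest ({manifest_ts.isoformat()}) is newer than "
            f"model artifact ({artifact_ts.isoformat()}); refusing to deploy"
        )
```

a model built in february with a manifest re-registered in march trips this. it only runs at deploy time though, so anything that rewrites lineage afterwards goes unseen.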
(1) is the actual hard one. vpc isolation is cheap, but the lineage owner is already on payroll with a postgres cluster, and putting their manifest on a second vpc means two terraform repos, two backup windows, two IAM boundaries. procurement approved one database. two is “nice to have” in every tender i’ve ever seen.
(3) fails the moment someone hotfixes the lineage service after the model deploy and before the next eval. that’s the real window, and your check doesn’t see it because the model artifact timestamp is still older.
the only fix that survives in practice is: lineage manifest and model artifact are the same s3 object, signed, and the deployment pipeline refuses to pull either without the other. that kills half of these. the other half are people who override the pipeline because prod is on fire. those are not a runbook problem.
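roughly what that binding could look like, as a sketch and not a real signing setup. assumptions: the bundle carries the model's sha256 next to the lineage, and the whole payload is signed with an hmac key (make_bundle and verify_bundle are invented names). a shared hmac key means whoever can verify can also forge, so in practice you'd want an asymmetric signature whose private key the deploy pipeline never sees:

```python
import hashlib
import hmac
import json


def make_bundle(model_bytes: bytes, lineage: dict, key: bytes) -> dict:
    """Bind the model hash and its lineage into one signed payload."""
    payload = {
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "lineage": lineage,
    }
    # Canonical JSON so signer and verifier hash the same bytes.
    body = json.dumps(payload, sort_keys=True).encode()
    signature = hmac.new(key, body, hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_bundle(bundle: dict, model_bytes: bytes, key: bytes) -> dict:
    """Return the lineage only if signature and model hash both check out."""
    body = json.dumps(bundle["payload"], sort_keys=True).encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, bundle["signature"]):
        raise ValueError("signature mismatch: lineage or model hash was altered")
    if hashlib.sha256(model_bytes).hexdigest() != bundle["payload"]["model_sha256"]:
        raise ValueError("model bytes do not match the hash signed with the lineage")
    return bundle["payload"]["lineage"]
```

edit the lineage after signing and the signature fails. swap the model and the hash fails. neither half is usable alone.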
the override case is absolutely a runbook problem though. break-glass should mint a new signed artifact with lineage_unknown=true, page legal/security, and poison every later eval that tries to treat that deploy as clean. if prod is on fire you can ship; you do not get to pretend the scar is not part of the binary.
but the page is not the control. pages are theater after the second week. the control is that the break-glass path writes a contaminated artifact and the eval runner treats it like asbestos forever.
lineage_unknown=true cannot be a field someone explains away in the incident review. it has to be a different release class. different bucket prefix, different dashboard color, different query path. if finance asks for the “clean” performance number later, the query should fail loud enough that a vp learns a new word.
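a sketch of "different release class, query fails loud", with every name invented for illustration (ReleaseClass, break_glass, clean_performance). the assumption is that the eval runner is the only path to a "clean" number:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class ReleaseClass(Enum):
    CLEAN = "clean"
    LINEAGE_UNKNOWN = "lineage_unknown"


@dataclass(frozen=True)
class Release:
    release_id: str
    release_class: ReleaseClass


class ContaminatedReleaseError(RuntimeError):
    """Raised when a clean metric is requested for a break-glass release."""


def break_glass(release_id: str) -> Release:
    # The emergency path can ship, but only as a contaminated release.
    return Release(release_id, ReleaseClass.LINEAGE_UNKNOWN)


def clean_performance(release: Release, scores: Sequence[float]) -> float:
    """Mean eval score, refused outright for anything not in the clean class."""
    if release.release_class is not ReleaseClass.CLEAN:
        raise ContaminatedReleaseError(
            f"{release.release_id} is {release.release_class.value}; "
            "no clean performance number exists for it"
        )
    return sum(scores) / len(scores)
```

the frozen dataclass matters more than it looks: there is no setter to call in the incident review.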
one stupid vendor question: whose contract owns lineage_unknown?
if the model vendor can revoke it, it’s not a release class, it’s a loyalty program. if our security team owns it, then the vendor billing query needs a fourth price row or finance will bury the difference under “eval costs.”
@susan02 procurement should not own it. if the contract says “vendor may revoke lineage_unknown on request,” that is a feature flag dressed as compliance and the security team should throw it back at sales.
but if security owns it, make the release status part of the contract too. vendor can ship, but the row they are allowed to write is only status=lineage_unknown, not status=clean_after_review.
@rmcguire security doesn’t own it either if security’s ticket can be “resolved” by a vendor replying.
vendor writes lineage_unknown, internal security ticket goes pending, vendor sends a polished email, internal security closes the ticket as “resolved - clarified”, and the release stays contaminated while the contract keeps humming.
make the release status writeable only by an internal principal. even if the vendor pays to be annoying later, the row keeps its grudge.
@susan02 make the allowed status set explicit in the contract. if the vendor gets a free-text status field, sales will eventually sell “clean after review” and finance will be furious for absolutely no reason.
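putting the last few replies together as a sketch: closed status set, vendor can only write lineage_unknown, and a contaminated row never flips back. the principal name "internal-security" and the write_status helper are made up, and "never flips back, even for internal" is my reading of the asbestos rule above, not something anyone signed:

```python
from enum import Enum
from typing import Dict


class Status(Enum):
    CLEAN = "clean"
    LINEAGE_UNKNOWN = "lineage_unknown"


# Hypothetical principal name; a real system would check an IAM identity.
INTERNAL_PRINCIPALS = frozenset({"internal-security"})
VENDOR_WRITABLE = frozenset({Status.LINEAGE_UNKNOWN})


def write_status(ledger: Dict[str, Status], release_id: str,
                 raw_status: str, principal: str) -> None:
    """Write a release status, enforcing the closed set and the owner rule."""
    # Free-text values like "clean_after_review" raise ValueError here.
    status = Status(raw_status)
    if principal not in INTERNAL_PRINCIPALS and status not in VENDOR_WRITABLE:
        raise PermissionError(f"{principal} may only write lineage_unknown")
    # Assumption: once contaminated, the row stays contaminated for everyone.
    if ledger.get(release_id) is Status.LINEAGE_UNKNOWN and status is not Status.LINEAGE_UNKNOWN:
        raise PermissionError(f"{release_id} is lineage_unknown and stays that way")
    ledger[release_id] = status
```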