Spreadsheet agents need a transaction log, not a screenshot

I do not trust a spreadsheet agent because it can move a cursor around a grid. That only proves the browser accepted some input.

The harder question starts after the edit: can the agent prove exactly what changed, what recalculated, what the user would see, and how to undo it?

I keep coming back to this while working on bilig. The project has a browser grid, but the real product boundary is not the grid. It is the evidence around a workbook mutation.

The screenshot trap

Screenshots are tempting because they look like proof. A cell has a value. A chart moved. A table is selected. The agent did something.

Except that screenshot cannot tell you whether the value is a literal or a formula result. It cannot tell you whether dependent formulas recalculated. It cannot see hidden sheets, named expressions, stale viewport state, saved workbook bytes, or whether an undo path still exists.

The failures I worry about are not dramatic. They look plausible.

The API says the write succeeded. The formula bar has the new value. The visible grid is one render tick behind. A copied range includes stale formatting. A summary cell looks right because the old value happened to match. You can lose a day arguing with pixels when the system should have carried a receipt from the start.

What the receipt should contain

For workbook edits, I want the agent to return a transaction record, not a caption.

At minimum, that record should say:

what operation it attempted,
which sheet and range it touched,
the workbook revision before and after the edit,
the authoritative readback after recalculation,
the rendered readback for the visible range,
any warnings from formulas, parsing, persistence, or rendering,
whether undo or restore was proven.

Those fields sound boring until something breaks. Then they are the difference between "the agent failed" and a useful diagnosis:

the edit never applied,
the edit applied but the formula is wrong,
the formula is right but the rendered grid is stale,
the view is right but the workbook was not persisted,
everything worked but undo was never verified.

Those are different bugs. If the tool flattens them into "done," the receipt is not good enough.

The grid is not the source of truth

Humans need a grid. We think spatially in spreadsheets. Selection, freeze panes, copy/paste, formula bars, resize handles, and formatting all matter.

Agents need a contract that is less visual and more auditable.

An agent should be able to say: write F12, recalculate, read F12, read the dependent total, save the workbook, reload it, compare the value, then render the range the user will inspect. That should be one auditable path, not a pile of browser gestures.

I care about @bilig/headless because it gives code a workbook surface: construct sheets, write ranges, set formulas, recalculate, persist, restore, and read values without pretending the browser is the database.

The browser still matters. It just should not be the only witness.

Real files keep everyone honest

Toy sheets make agents look better than they are.

Real workbooks have merged cells, weird number formats, volatile formulas, hidden sheets, external references, stale cached values, wide ranges, and formatting that carries meaning. They also contain all the little mistakes people make when a workbook has lived for years.

Corpus work makes that visible. A public workbook that fails to load is a repro, not a vibe. A skipped formula is not compatibility. A cached checkpoint is not proof unless it still matches the artifact it came from.

The more real files a spreadsheet system can survive, the smaller and more honest its claims get.

Undo is part of trust

Undo is not a nice-to-have for agents. It is the safety rail.

If the agent cannot prove it can restore the prior workbook state, the edit should be treated as risky even when the visible result looks fine. The restore path should be part of the transaction receipt: what token or snapshot exists, what range was restored, and whether the post-restore readback matched the original state.

This sounds strict until a small edit changes twenty downstream reports. A format change can make a negative number look harmless. A stale named range can make the visible cell correct while the downstream model is wrong.

Undo gives the agent a way to act without pretending every action is harmless.

Where this leaves spreadsheet agents

I am less interested in agents that can "use Excel" through screenshots.

I want agents that can work against a workbook the way a database client works against a database: named operations, explicit transactions, durable state, readback, warnings, rollback, and logs you can inspect after the fact.

The UI can still be beautiful. The agent can still use the rendered sheet as evidence for what the human will see. But the core claim has to be sharper:

I changed this workbook. Here is the operation. Here is the authoritative readback. Here is the rendered proof. Here is what did not verify. Here is how I can undo it.

Anything less is a screenshot with confidence.