Mar 9, 2026 - 9 MIN READ

Document the Parts of Your System People Need to Operate and Hand Off

If critical operating knowledge lives in one person’s head, your system is harder to support, change, and hand off than it looks. You should document the parts that matter first: ownership, rollback, failure modes, dependencies, and business context.

Bo Clifton

I was recently working with a team that has a critical service only one person fully understands. They've done a pretty good job of extracting knowledge about the service into documentation, but the documentation is not complete enough to let someone else operate the service if the primary owner were unavailable. That is a common problem, and it creates operational risk, handoff risk, and "bus factor" risk (that is, if the primary owner is hit by a bus, the knowledge goes with them). If the primary owner is out for a week, can someone else still run the service, recover it if it fails, or make safe changes to it? If the answer is no, you have a documentation problem that matters.

To be fair, some tribal knowledge is normal and not problematic.

A small team will always have informal knowledge: preferred debugging shortcuts, opinions about log filters, small setup tips, or the fastest way to inspect a table during development. You don't need a documentation program to capture every single small habit people build while doing real work.

But critical knowledge is different.

If your team can only run, recover, or safely change a system because one person remembers the sequence, the exceptions, and the business consequences, that is not harmless team memory. That is an operational dependency.

You should document that first.

Use one test for deciding what matters: if the person who knows it is out for a week, can someone else still operate the system and make safe decisions?

If the answer is no, you have a documentation problem.

An example: a simple invoice export job

Imagine you run a scheduled invoice export job every weekday at 6:00 a.m.

It collects approved invoices, transforms them into a vendor-specific file, drops that file into a storage location, and notifies finance that the export is ready. On good days, nobody thinks about it. On bad days, it becomes urgent fast.

The code may be well tested. That still does not tell a new owner what they need to know when the job fails at 6:10 a.m.

They need answers to questions like these:

Who owns the job?
What upstream systems must be healthy before it runs?
What counts as a successful run?
Is rerunning it safe, or will it create duplicates?
What is the rollback or recovery path?
Who in the business needs to know if the export is delayed?
Which odd behaviors are vendor quirks rather than defects?

That is the kind of knowledge you should extract first. It has operational value, handoff value, and business value.
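To make the rerun question concrete, here is a minimal sketch of what a "safe rerun" could mean for a job like this. It assumes the job keeps a record of the invoice IDs it has already exported; the function name, the ID format, and that record are all hypothetical, not details of any real export job.

```python
def plan_rerun(approved_ids: list[str], already_exported: set[str]) -> list[str]:
    """Return only the invoices a rerun still needs to export, in original order.

    Assumption: the job tracks exported invoice IDs somewhere durable.
    If it does not, a rerun may create duplicates, and the runbook should say so.
    """
    return [
        invoice_id
        for invoice_id in approved_ids
        if invoice_id not in already_exported
    ]


# A run failed after exporting the first invoice; rerun only the rest.
remaining = plan_rerun(["INV-1", "INV-2", "INV-3"], {"INV-1"})
```

Whether your job works this way is exactly the kind of fact a new owner cannot infer at 6:10 a.m. Whatever the real answer is, it belongs in the runbook.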

Documentation vs Tests: different jobs, different value

Tests and documentation do different jobs.

You still need tests. Tests tell you whether code behaves as expected. They protect correctness, regressions, and refactoring work. If the invoice export job calculates totals incorrectly or writes the wrong schema, tests should catch that.

Documentation matters in a different part of the system.

For the same invoice export job, tests usually don't tell you:

why the export runs at 6:00 a.m. instead of earlier
why finance wants failed runs escalated by 7:00 a.m.
which storage path downstream reporting depends on
whether you should rerun the whole batch or only a subset
what to do if the vendor is up but slow
which person or team can approve a temporary workaround
why the file naming convention cannot change casually

That is why documentation can matter more than tests in operations, rollback, ownership, handoffs, and business context.

This is not because documentation is inherently more important than tests. It is because tests answer a narrower question: “Does the code behave correctly?” Operations and handoffs require broader answers: “How does this work in production, who depends on it, and what should someone do when it goes wrong?”

A passing test suite is excellent news. It is not a recovery plan.

Where to begin?

As stated above, you don't need to document all tribal knowledge. You need to document the knowledge your team keeps paying for when it is missing.

Start with the parts that make the system operable by more than one person.

For the invoice export job, that usually means:

purpose and business impact
owner and backup owner
schedule and triggers
dependencies and prerequisites
failure modes and first checks
safe rerun or rollback steps
alerts, dashboards, and logs to inspect
downstream consumers and who to notify
known quirks that are intentional

That is enough to make the workflow survivable.

Don't start with lower-value material like personal preferences, one-off debugging tricks, or detailed history that has no current operational consequence. Those may be worth capturing later, but they should not come before the knowledge that keeps the system running.

A simple rule helps here: if missing information would slow incident response, make a handoff risky, or force people to guess during a production change, document it now.

3S: Stay Small and Specific

Most teams don't need a large documentation taxonomy. They need a few durable documents that answer the real questions people have.

For an operational workflow like the invoice export job, four document types usually cover most of the need.

Local first

Keep a concise README in the repository or service folder that explains what the job does, where it runs, how to start it locally if applicable, where the operational docs live, and who owns it.

This is the entry point, not the whole story.

A runbook for the critical workflow

If the invoice export fails, the runbook should tell someone exactly what to do under time pressure.

Keep it procedural:

symptoms
first checks
validation steps
safe rerun or rollback steps
escalation path
business notification rules

This is usually the highest-value document for operational knowledge.
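If you want a light guardrail, a small script can flag a runbook that is missing one of those sections. This is a sketch under assumptions: it treats each item in the list above as a Markdown heading, and the exact heading names are a convention you would choose, not a standard.

```python
import re

# Section names come from the list above. Requiring them as Markdown
# headings is an assumed convention for this sketch.
REQUIRED_SECTIONS = [
    "Symptoms",
    "First checks",
    "Validation steps",
    "Safe rerun or rollback steps",
    "Escalation path",
    "Business notification rules",
]

# Matches lines such as "## First checks" and captures the heading text.
HEADING_PATTERN = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(runbook_markdown: str) -> list[str]:
    """Return the required sections that have no matching heading."""
    headings = {
        match.group(1).lower()
        for match in HEADING_PATTERN.finditer(runbook_markdown)
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

A check like this only proves the headings exist. It cannot tell you whether the steps under them are correct, so it supports a review rather than replacing one.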

Why on earth did we do it this way? An architecture decision record

If the job writes a slightly awkward file format, tolerates a vendor delay, or avoids real-time processing for business reasons, record that decision.

You don't need a grand architecture archive. You need enough history to stop future owners from “fixing” constraints they don't understand.

An ownership note when handoffs matter

If ownership has changed or rotates, make it obvious who is responsible, what they are responsible for, and what they need to review regularly.

That is especially useful when a staff engineer moves on, a team reorganizes, or support responsibility spreads across several people.

That is the core set. You don't need a catalog of document types unless your operating reality actually demands it.

Keep docs close to the work, but don't let them get too heavy

If documentation explains how to run, change, deploy, or recover a system, keep it near the code or infrastructure it describes.

For the invoice export job, that usually means repo-local Markdown is enough:

README.md for entry-point context
docs/runbooks/invoice-export.md for incident handling
docs/adr/ for durable decisions
a small ownership note near the job or service docs

This keeps updates in the same pull request as the code change. That is usually the difference between documentation that lives and documentation that drifts.
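One way to keep that layout from quietly eroding is a tiny check you can run locally or in CI. The sketch below assumes the paths listed above; the function name is illustrative, and you would adjust the list to match your own repository.

```python
from pathlib import Path

# Paths mirror the layout described above. Treat them as assumptions
# and edit the list to fit your repository.
EXPECTED_DOCS = [
    "README.md",
    "docs/runbooks/invoice-export.md",
    "docs/adr",
]


def missing_docs(repo_root: str) -> list[str]:
    """Return the expected doc paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [relative for relative in EXPECTED_DOCS if not (root / relative).exists()]
```

Calling `missing_docs(".")` from the repository root returns an empty list when everything is in place. It checks presence only, not freshness, so it complements the same-pull-request habit rather than standing in for it.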

A separate docs site is worth it when your real problem is discovery, navigation, or publishing across many documents and many readers. If several teams need to browse shared standards, operational maps, or cross-repo documentation, a docs site becomes easier to use than raw repository browsing.

If you do have that problem, Retype is a reasonable option. It turns Markdown into a navigable documentation site, and its getting started guide is straightforward. If you want a simple publishing path, Retype also supports GitHub Pages.

But plain Markdown is often enough for a 5–15 person team. Don't add a docs platform just to feel organized.

There are also two clear exceptions to the “keep it in the repo” rule:

Sensitive documentation, such as credentials handling details, restricted network paths, or security response procedures, may belong in a more controlled location.
Cross-repo documentation, such as shared platform standards or incident rules used by many systems, may belong in a central docs location.

Even then, the repo should still point people to the source of truth. Don't make them guess where the important instructions live.

Use GitHub Copilot to enforce the habit, not replace the thinking

GitHub Copilot is useful here if you use it to reinforce a documentation workflow around real changes.

Don't ask it to invent operational truth it cannot know. Ask it to help you keep documented truth aligned with the work you are already doing.

A believable example looks like this:

You change the invoice export job so retries now stop after three failed attempts instead of five, and reruns must cover a smaller scope to avoid duplicate downstream records.

That change should not ship alone.

In the same pull request, you should update:

the runbook steps for failure handling
the ownership or escalation note if responsibilities changed
any decision record if the retry policy changed for a lasting reason
the README if the operator-facing behavior is now different

This is where Copilot can help. In the PR or working session, it can point out that the retry logic changed and suggest the related runbook update in the same change set. That is a useful workflow because it keeps code and documentation moving together.

If you want better consistency, start small:

Use repository instructions for rules that apply broadly, such as “operational workflow changes require runbook updates in the same PR.”
Use response customization and path-specific instructions for folders where the rules differ, such as infrastructure or operational docs.
Use agent skills only when you have a repeatable process worth encoding, such as converting incident notes into a standard runbook update.
Use custom agents only when a team benefits from a focused workflow with tighter boundaries.
Use the coding agent when you want this behavior inside the normal review and pull request loop rather than as an isolated chat exercise.
If you need to create a repeatable skill, GitHub also documents how to create skills.

You don't need all of that on day one.

For a small team, repository instructions and a clear PR expectation are often enough. Skills and custom agents become worth it when the team has enough repeated volume and variation that lightweight instructions are no longer keeping things consistent.

That is the boundary: use the simplest Copilot setup that helps the team update docs with the work. If your team has one repo, a handful of engineers, and straightforward workflows, a full agent setup is usually overkill.

Begin at the beginning: document the critical workflows first

For the invoice export job, a solid first runbook does not need to be impressive. It needs to answer the questions a capable engineer or operator will have when the usual owner is unavailable.

A first version might cover:

what the job does and why it matters
when it runs
where to check status
how to tell whether the run partially succeeded
whether rerunning is safe
how to recover if the vendor endpoint is slow or unavailable
who to notify if the export misses the finance deadline

That is enough to reduce operational risk immediately.

If you later add more detail, good. But don't wait for a perfect format before you capture the knowledge that keeps the workflow safe.

Make the next 30 days realistic

If your team has 5–15 people, don't launch a sweeping documentation initiative. Run a small, bounded effort tied to real systems.

Week 1: pick the risky workflows

Identify three to five workflows that would be painful to operate or hand off if the primary owner were unavailable.

Good candidates include:

scheduled exports
payment or invoice processing
deployment and rollback procedures
vendor integrations with known quirks
month-end or reporting workflows with business deadlines

Choose based on operational risk, not prestige.

Week 2: write the first useful docs

For each workflow, create only what is needed:

a short entry-point README or section in an existing one
one runbook
one brief decision record if the workflow has non-obvious constraints
an explicit owner and backup owner

This is enough for most small teams.

Week 3: move docs next to the work and close obvious gaps

Put repo-local docs where the people changing the system will actually see them. Add links from the relevant code, folders, or service entry points.

If you discover a document belongs somewhere restricted or cross-repo, move it there deliberately and leave a pointer behind.

Week 4: make updates part of delivery

Add a simple expectation to your pull request process: if an operational workflow, ownership rule, rollback path, or business-facing behavior changes, the relevant docs change in the same PR.
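If you want that expectation to be checkable rather than remembered, a sketch like the following can run against the list of files changed in a pull request. The folder names are hypothetical: which paths count as "operational" and where your docs live are assumptions you would replace with your own.

```python
# Hypothetical paths for illustration. Replace them with the folders that
# hold operational code and documentation in your repository.
OPERATIONAL_PREFIXES = ("jobs/invoice_export/", "deploy/")
DOC_PREFIXES = ("docs/runbooks/", "docs/adr/")
DOC_FILES = ("README.md",)


def needs_doc_update(changed_files: list[str]) -> bool:
    """Return True when operational files changed but no doc file did."""
    touched_ops = any(
        path.startswith(OPERATIONAL_PREFIXES) for path in changed_files
    )
    touched_docs = any(
        path.startswith(DOC_PREFIXES) or path in DOC_FILES
        for path in changed_files
    )
    return touched_ops and not touched_docs
```

You could feed it the output of `git diff --name-only` for the pull request. Treat a `True` result as a prompt for the reviewer, not a hard gate: some operational changes really do leave the docs accurate, and a path check cannot know that.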

If you use Copilot, reinforce that expectation with repository instructions and minimal path-specific guidance where needed.

That is enough to change the habit without creating a side project.

What you should do next

Pick one workflow that your team does not want to rediscover during an outage or handoff.

If the invoice export job fits that description, start there.

Write the first version of the runbook. Name the owner. Document the rerun and rollback rules. Record the business consequence of failure. Put the file next to the code or infrastructure that controls the job.

Then hold one line going forward: if the workflow changes, the docs change with it.

You don't need exhaustive documentation. You need durable documentation for the parts of the system other people must be able to operate, support, and inherit.

We can help!

If you want help identifying critical workflows, writing durable documentation, or setting up a Copilot workflow to keep docs updated, reach out and we can talk about how we can help your team.
