
If critical operating knowledge lives in one person’s head, your system is harder to support, change, and hand off than it looks. You should document the parts that matter first: ownership, rollback, failure modes, dependencies, and business context.
Bo Clifton
I was recently working with a team that has a critical service only one person fully understands. They've done a pretty good job of extracting knowledge about the service into documentation, but the documentation is not complete enough to let someone else operate the service if the primary owner were unavailable. This is a common problem, and it matters because it creates operational risk, handoff risk, and "bus factor" risk (that is, if the primary owner is hit by a bus, the knowledge goes with them). If the primary owner is out for a week, can someone else still run the service, recover it if it fails, or make safe changes to it? If the answer is no, you have a documentation problem that matters.
To be fair, some tribal knowledge is normal and not problematic.
A small team will always have informal knowledge: preferred debugging shortcuts, opinions about log filters, small setup tips, or the fastest way to inspect a table during development. You don't need a documentation program to capture every single small habit people build while doing real work.
But critical knowledge is different.
If your team can only run, recover, or safely change a system because one person remembers the sequence, the exceptions, and the business consequences, that is not harmless team memory. That is an operational dependency.
You should document that first.
Use one test for deciding what matters: if the person who knows it is out for a week, can someone else still operate the system and make safe decisions?
If the answer is no, you have a documentation problem.
Imagine you run a scheduled invoice export job every weekday at 6:00 a.m.
It collects approved invoices, transforms them into a vendor-specific file, drops that file into a storage location, and notifies finance that the export is ready. On good days, nobody thinks about it. On bad days, it becomes urgent fast.
The code may be well tested. That still does not tell a new owner what they need to know when the job fails at 6:10 a.m.
They need answers to questions like these:
That is the kind of knowledge you should extract first. It has operational value, handoff value, and business value.
Tests and documentation do different jobs.
You still need tests. Tests tell you whether code behaves as expected. They protect correctness, guard against regressions, and make refactoring safer. If the invoice export job calculates totals incorrectly or writes the wrong schema, tests should catch that.
Documentation matters in a different part of the system.
For the same invoice export job, tests usually don't tell you:
That is why documentation can matter more than tests in operations, rollback, ownership, handoffs, and business context.
This is not because documentation is inherently more important than tests. It is because tests answer a narrower question: “Does the code behave correctly?” Operations and handoffs require broader answers: “How does this work in production, who depends on it, and what should someone do when it goes wrong?”
A passing test suite is excellent news. It is not a recovery plan.
As stated above, you don't need to document all tribal knowledge. You need to document the knowledge your team keeps paying for when it is missing.
Start with the parts that make the system operable by more than one person.
For the invoice export job, that usually means:
That is enough to make the workflow survivable.
Don't start with lower-value material like personal preferences, one-off debugging tricks, or detailed history that has no current operational consequence. Those may be worth capturing later, but they should not come before the knowledge that keeps the system running.
A simple rule helps here: if missing information would slow incident response, make a handoff risky, or force people to guess during a production change, document it now.
Most teams don't need a large documentation taxonomy. They need a few durable documents that answer the real questions people have.
For an operational workflow like the invoice export job, four document types usually cover most of the need.
Keep a concise README in the repository or service folder that explains what the job does, where it runs, how to start it locally if applicable, where the operational docs live, and who owns it.
This is the entry point, not the whole story.
If the invoice export fails, the runbook should tell someone exactly what to do under time pressure.
Keep it procedural:
This is usually the highest-value document for operational knowledge.
If the job writes a slightly awkward file format, tolerates a vendor delay, or avoids real-time processing for business reasons, record that decision.
You don't need a grand architecture archive. You need enough history to stop future owners from “fixing” constraints they don't understand.
If ownership has changed or rotates, make it obvious who is responsible, what they are responsible for, and what they need to review regularly.
That is especially useful when a staff engineer moves on, a team reorganizes, or support responsibility spreads across several people.
That is the core set. You don't need a catalog of document types unless your operating reality actually demands it.
If documentation explains how to run, change, deploy, or recover a system, keep it near the code or infrastructure it describes.
For the invoice export job, that usually means repo-local Markdown is enough:
README.md for entry-point context
docs/runbooks/invoice-export.md for incident handling
docs/adr/ for durable decisions
This keeps updates in the same pull request as the code change. That is usually the difference between documentation that lives and documentation that drifts.
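If you want to confirm that this layout actually exists in a repository, a small check can do it. The following is a minimal Python sketch, assuming the three paths named above. The names `EXPECTED_DOCS` and `missing_docs` are hypothetical and do not come from any existing tool.

```python
from pathlib import Path

# Paths taken from the layout described above; adjust them to your repository.
EXPECTED_DOCS = [
    "README.md",
    "docs/runbooks/invoice-export.md",
    "docs/adr",
]


def missing_docs(repo_root, expected=EXPECTED_DOCS):
    """Return the expected documentation paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [path for path in expected if not (root / path).exists()]
```

Calling `missing_docs(".")` from the repository root returns an empty list when all three paths are present. It only checks that the files exist, not that their contents are current.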
A separate docs site is worth it when your real problem is discovery, navigation, or publishing across many documents and many readers. If several teams need to browse shared standards, operational maps, or cross-repo documentation, a docs site becomes easier to use than raw repository browsing.
If you do have that problem, Retype is a reasonable option. It turns Markdown into a navigable documentation site, and its getting started guide is straightforward. If you want a simple publishing path, Retype also supports GitHub Pages.
But plain Markdown is often enough for a 5–15 person team. Don't add a docs platform just to feel organized.
There are also two clear exceptions to the “keep it in the repo” rule:
Even then, the repo should still point people to the source of truth. Don't make them guess where the important instructions live.
GitHub Copilot is useful here if you use it to reinforce a documentation workflow around real changes.
Don't ask it to invent operational truth it cannot know. Ask it to help you keep documented truth aligned with the work you are already doing.
A believable example looks like this:
You change the invoice export job so failed retries now stop after three attempts instead of five, and reruns must happen on a smaller scope to avoid duplicate downstream records.
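To make that change concrete, here is a minimal Python sketch of what it might look like. It is an illustration under stated assumptions, not the real job: `export_batch` is a hypothetical callable standing in for the export, and `rerun_scope` assumes "smaller scope" means rerunning only the invoices that have not already been exported.

```python
import time

MAX_ATTEMPTS = 3  # previously 5 in this hypothetical example


class ExportFailed(Exception):
    """Raised when the export still fails after the final attempt."""


def run_with_retries(export_batch, invoice_ids, max_attempts=MAX_ATTEMPTS, delay_seconds=0.0):
    """Call export_batch(invoice_ids) until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return export_batch(invoice_ids)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise ExportFailed(f"export failed after {max_attempts} attempts") from last_error


def rerun_scope(requested_ids, already_exported_ids):
    """Return only the invoices not yet exported, preserving the requested order."""
    done = set(already_exported_ids)
    return [invoice_id for invoice_id in requested_ids if invoice_id not in done]
```

Neither the retry limit nor the rerun rule is visible to an operator unless someone writes it down. A person following an older runbook would wait for retries that never happen, or rerun the full export and create the duplicates the change was meant to prevent.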
That change should not ship alone.
In the same pull request, you should update:
README if the operator-facing behavior is now different
This is where Copilot can help. In the PR or working session, it can point out that the retry logic changed and suggest the related runbook update in the same change set. That is a useful workflow because it keeps code and documentation moving together.
If you want better consistency, start small:
You don't need all of that on day one.
For a small team, repository instructions and a clear PR expectation are often enough. Skills and custom agents become worth it when the team has enough repeated volume and variation that lightweight instructions are no longer keeping things consistent.
That is the boundary: use the simplest Copilot setup that helps the team update docs with the work. If your team has one repo, a handful of engineers, and straightforward workflows, a full agent setup is usually overkill.
For the invoice export job, a solid first runbook does not need to be impressive. It needs to answer the questions a capable engineer or operator will have when the usual owner is unavailable.
A first version might cover:
That is enough to reduce operational risk immediately.
If you later add more detail, good. But don't wait for a perfect format before you capture the knowledge that keeps the workflow safe.
If your team has 5–15 people, don't launch a sweeping documentation initiative. Run a small, bounded effort tied to real systems.
Identify three to five workflows that would be painful to operate or hand off if the primary owner were unavailable.
Good candidates include:
Choose based on operational risk, not prestige.
For each workflow, create only what is needed:
README or section in an existing one
This is enough for most small teams.
Put repo-local docs where the people changing the system will actually see them. Add links from the relevant code, folders, or service entry points.
If you discover a document belongs somewhere restricted or cross-repo, move it there deliberately and leave a pointer behind.
Add a simple expectation to your pull request process: if an operational workflow, ownership rule, rollback path, or business-facing behavior changes, the relevant docs change in the same PR.
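If you want a mechanical backstop for that expectation, a short script can flag pull requests that touch a workflow without touching its docs. This is a minimal Python sketch, not a finished tool. The `jobs/invoice_export/*` path, the `DOC_RULES` mapping, and the function name `missing_doc_updates` are all hypothetical, so replace them with your own layout.

```python
from fnmatch import fnmatch

# Hypothetical mapping from code locations to the docs that describe them.
DOC_RULES = {
    "jobs/invoice_export/*": ["docs/runbooks/invoice-export.md", "README.md"],
}


def missing_doc_updates(changed_files, rules=DOC_RULES):
    """Return {pattern: docs} for code patterns changed without any related doc change."""
    changed = set(changed_files)
    problems = {}
    for pattern, docs in rules.items():
        touched = any(fnmatch(path, pattern) for path in changed)
        documented = any(doc in changed for doc in docs)
        if touched and not documented:
            problems[pattern] = docs
    return problems
```

You could feed it the output of `git diff --name-only` for the pull request and treat a non-empty result as a prompt for the reviewer. It cannot tell whether a change actually affects operations, so it works better as a reminder than as a hard gate.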
If you use Copilot, reinforce that expectation with repository instructions and minimal path-specific guidance where needed.
That is enough to change the habit without creating a side project.
Pick one workflow that your team does not want to rediscover during an outage or handoff.
If the invoice export job fits that description, start there.
Write the first version of the runbook. Name the owner. Document the rerun and rollback rules. Record the business consequence of failure. Put the file next to the code or infrastructure that controls the job.
Then hold one line going forward: if the workflow changes, the docs change with it.
You don't need exhaustive documentation. You need durable documentation for the parts of the system other people must be able to operate, support, and inherit.
If you want help identifying critical workflows, writing durable documentation, or setting up a Copilot workflow to keep docs updated, reach out and we can talk about how we can help your team.