Engineering the score: how Horion weighs metrics, logs, and traces.

The first time we tried to grade an instrumented service we got a number that no one trusted. Engineers looked at the dashboard, shrugged, and went back to reading Datadog directly. The score was there, but it never showed up in a real argument.

That's the test we now use to evaluate any scoring system: does it survive contact with a code review? If two engineers who disagree about a PR can't use the score as shared ground truth, the score is decoration.

The three pillars

Horion grades a service on three pillars: metrics, logs, and traces. Each pillar produces a 0–100 sub-score, and the overall score is the weighted mean of the three. The weights are configurable per service; the defaults, 35 / 30 / 35, are where we landed after testing against a few dozen real codebases.
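To make the arithmetic concrete, here is a minimal sketch of that weighted mean. The function name, the dictionary layout, and the reading of 35 / 30 / 35 as metrics / logs / traces in that order are assumptions made for illustration, not Horion's actual code.

```python
# Assumed mapping: the defaults are read in the order the pillars are listed.
DEFAULT_WEIGHTS = {"metrics": 35, "logs": 30, "traces": 35}


def overall_score(sub_scores, weights=DEFAULT_WEIGHTS):
    """Return the weighted mean of per-pillar sub-scores (each 0-100)."""
    if set(sub_scores) != set(weights):
        raise ValueError("sub_scores and weights must cover the same pillars")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total means weights do not have to add up to 100.
    return sum(sub_scores[p] * weights[p] for p in weights) / total_weight


# (80*35 + 60*30 + 90*35) / 100
print(overall_score({"metrics": 80, "logs": 60, "traces": 90}))  # 77.5
```

Normalising by the sum of the weights keeps a per-service override such as 1 / 1 / 2 valid without any rescaling.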

What goes into a pillar is more interesting than the weight. For metrics, the rubric checks for:

A clear set of SLIs declared per endpoint.
Histogram-based latency instead of averages.
Cardinality budgets that don't blow up under traffic.
Metric names that follow a stable naming convention.

Each criterion is a yes/no with an explanation. The pillar score is the share of criteria that pass. No magic.
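As a rough sketch of "share of criteria that pass", assuming a small record type per criterion: the Criterion class, its field names, and the short criterion names below are hypothetical, and the pass/fail values are invented sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str     # short, stable identifier (hypothetical naming)
    passed: bool  # the yes/no verdict
    reason: str   # the explanation shown to the reviewer


def pillar_score(criteria):
    """Return the percentage of criteria that pass, on a 0-100 scale."""
    if not criteria:
        raise ValueError("a pillar needs at least one criterion")
    passed = sum(1 for c in criteria if c.passed)
    return 100.0 * passed / len(criteria)


# Sample verdicts for the four metrics criteria listed above.
metrics_criteria = [
    Criterion("slis-per-endpoint", True, "SLIs declared for every endpoint"),
    Criterion("histogram-latency", True, "latency recorded as histograms"),
    Criterion("cardinality-budget", False, "unbounded label on one counter"),
    Criterion("naming-convention", True, "names follow the convention"),
]

print(pillar_score(metrics_criteria))  # 75.0
```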

Why the score finally argued back

The unlock wasn't a smarter model. It was making the rubric legible. Every criterion has a short name, a short reason, and a link to the offending file and line. When the engine drops a service from 78 to 64, you can read the diff and see exactly which three criteria flipped.
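One way to picture that diff is a comparison of two runs keyed by criterion name. This is only an illustrative sketch: the Finding type, the flipped_criteria helper, the criterion names, and the file path are all made up for the example and are not taken from Horion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    name: str      # short criterion name
    passed: bool   # verdict in this run
    reason: str    # short explanation
    location: str  # "path:line" of the offending code (hypothetical format)


def flipped_criteria(before, after):
    """Return findings from `after` whose verdict differs from `before`."""
    previous = {f.name: f for f in before}
    changes = []
    for finding in after:
        old = previous.get(finding.name)
        # Criteria present in only one run are ignored in this sketch.
        if old is not None and old.passed != finding.passed:
            changes.append(finding)
    return changes


before = [
    Finding("histogram-latency", True, "latency recorded as histograms", ""),
    Finding("naming-convention", True, "names follow the convention", ""),
]
after = [
    Finding("histogram-latency", False, "mean latency gauge added", "svc/handlers.py:42"),
    Finding("naming-convention", True, "names follow the convention", ""),
]

for f in flipped_criteria(before, after):
    status = "pass" if f.passed else "fail"
    print(f"{f.name}: now {status} ({f.reason}) at {f.location}")
```

Keying the comparison on the short name is what makes a change in the number traceable to specific criteria rather than to the aggregate.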

A score nobody can read is a score nobody can fight.

That changed how reviewers used Horion. The score stopped being a status light and started being a checklist people could push back on — sometimes correctly.

What we threw away

A few things we tried and removed:

Free-form LLM grading of telemetry quality. Too noisy across runs.
Composite scores that mixed pillars before showing them. People couldn't tell why the number moved.
A "code quality" pillar. Out of scope. Horion is about observability, not linting.

The current rubric is boring on purpose. Boring rubrics are the ones engineers trust to argue with.