Why the Raw Layer Should Be Your Analytics Contract

Most analytics setups start the same way. Someone drops an SDK into the game, picks a dashboard tool, and within an afternoon there are charts. DAU is going up and to the right. Everyone's happy.

Then, six months later, the questions get harder. "What did our day-1 retention look like for the cohort that came from the Tuesday update, before we changed the tutorial?" And the answer comes back: we can't tell, because the dashboard only kept the aggregate, and the event that mattered was never defined back then.

This is the moment teams discover the thing nobody warned them about: the dashboard was never the source of truth. The raw events were — and they're gone.

This article is about treating the raw event layer as a contract: the one stable thing your entire analytics stack agrees on. It matters whether you're a product manager trying to answer a question that didn't exist last quarter, or an engineer who has to keep the pipeline from collapsing under its own weight.

What "raw layer" actually means

A raw event is the unmodified record of something that happened, captured as close to the source as possible, with no aggregation, no roll-up, and no interpretation baked in.

In a game, a raw event might be:

{
  "event": "level_completed",
  "player_id": "a3f9...",
  "level_id": 14,
  "attempts": 3,
  "duration_ms": 84210,
  "coins_spent": 120,
  "device": "iPhone13,2",
  "app_version": "2.4.1",
  "timestamp": "2026-05-20T14:03:22Z"
}

That's it. One record, one thing that happened. The raw layer is the durable, append-only store of millions of these, exactly as they arrived, kept around indefinitely.

What it is not: it's not "level 14 completion rate = 62%." That number is a derivative — something you compute later, from the raw layer, when you need it. The completion rate is an opinion. The raw events are the facts.
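To make the fact-versus-derivative split concrete, here is a minimal sketch of a completion rate computed on demand from raw events. The `level_started` event name, the sample players, and the definition itself (players who completed divided by players who started) are assumptions for illustration; a different definition would be a different function over the same events.

```python
# Hypothetical raw events. "level_started" is an assumed event name that
# does not appear in the article; only "level_completed" does.
raw_events = [
    {"event": "level_started", "player_id": "p1", "level_id": 14},
    {"event": "level_completed", "player_id": "p1", "level_id": 14},
    {"event": "level_started", "player_id": "p2", "level_id": 14},
    {"event": "level_started", "player_id": "p3", "level_id": 14},
    {"event": "level_completed", "player_id": "p3", "level_id": 14},
    {"event": "level_started", "player_id": "p4", "level_id": 15},
]


def players_with(events, event_name, level_id):
    """Set of players who emitted `event_name` for the given level."""
    return {
        e["player_id"]
        for e in events
        if e["event"] == event_name and e["level_id"] == level_id
    }


def completion_rate(events, level_id):
    """One possible definition: completers divided by starters."""
    started = players_with(events, "level_started", level_id)
    completed = players_with(events, "level_completed", level_id)
    if not started:
        return None  # nobody started the level, so the rate is undefined
    return len(started & completed) / len(started)


rate = completion_rate(raw_events, 14)
print(f"level 14 completion rate: {rate:.0%}")
```

Changing the definition (per attempt instead of per player, say) means editing `completion_rate`, not re-collecting anything.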

The distinction sounds academic until you realize where most analytics products draw the line. Hosted product analytics platforms (Amplitude, Firebase Analytics, and the rest) are typically built around the derivatives. They ingest your events, compute the metrics they think you'll want, and hand you a polished interface on top. The raw layer often lives behind an export feature you don't fully control, or it's aggregated away to keep the vendor's storage costs down.

When the raw layer is somebody else's implementation detail, you've signed a contract you didn't read.

The contract idea, made concrete

In software, a contract is an agreement about an interface that both sides can rely on, even as the implementation behind it changes. APIs have contracts. Type systems enforce contracts. The whole point is that you can change everything on one side without breaking the other, as long as the contract holds.

Analytics needs the same idea, and the natural place to put the contract is the raw layer.

Here's the shape of it:

Upstream (your game client, your backend) agrees to emit events in a known schema. That's the producer's side of the contract.
The raw layer durably stores every event, unmodified, in an open format. That's the contract surface — the stable thing.
Downstream (dashboards, cohort analyses, ML features, finance reports) all read from the raw layer, computing whatever they need. That's the consumer's side.
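The producer's side of the contract can be sketched as a small check run before an event is emitted. The required fields and their types below are assumptions chosen to match the sample event above, not a schema taken from any real pipeline.

```python
# Hypothetical contract: the fields every event must carry, and their types.
REQUIRED_FIELDS = {
    "event": str,
    "player_id": str,
    "timestamp": str,
}


def contract_violations(event: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the event conforms."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}"
            )
    return problems


good = {
    "event": "level_completed",
    "player_id": "p1",
    "timestamp": "2026-05-20T14:03:22Z",
    "level_id": 14,  # extra fields are allowed; the contract only sets a minimum
}
bad = {"event": "level_completed", "player_id": 42}

print(contract_violations(good))
print(contract_violations(bad))
```

The check only sets a floor. Extra fields pass through untouched, which keeps the stored event faithful to what the producer sent.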

The magic is in what this lets you change freely. Your dashboard tool can be swapped. Your metric definitions can be rewritten. A new question can be answered by reprocessing history — because the history is still there, in its original form. None of that touches the contract.

Compare that to the common setup where the analytics vendor is the contract. Change vendors and you start from zero: no history, new event semantics, a migration project nobody scoped. The contract was never yours to hold.

This is the core of how a self-hosted pipeline like Rawbbit is structured: events come in over HTTP, get buffered, and land as partitioned Parquet files in object storage you own. That Parquet layer is explicitly the system of record. Everything downstream — BigQuery external tables, modeling in SQLMesh, dashboards in Metabase — reads from it and can be rebuilt from it. The raw layer is the contract; the rest is replaceable.
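As a rough illustration of what "partitioned" can mean, the sketch below derives a storage prefix from an event's name and timestamp. The key layout (`event=.../date=.../hour=...`) and the `raw` prefix are hypothetical, chosen in the common key=value directory style; they are not Rawbbit's documented layout.

```python
from datetime import datetime, timezone


def partition_path(event: dict, prefix: str = "raw") -> str:
    """Build a hypothetical key=value partition prefix for one event."""
    # Parse the UTC timestamp format used in the sample event above.
    ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    ts = ts.replace(tzinfo=timezone.utc)
    return f"{prefix}/event={event['event']}/date={ts:%Y-%m-%d}/hour={ts:%H}"


sample = {"event": "level_completed", "timestamp": "2026-05-20T14:03:22Z"}
print(partition_path(sample))
```

Partitioning by date lets a query engine skip files outside the time range a question asks about, which is most of what keeps scans over a growing history affordable.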

Why a product manager should care

If you're a PM, the raw layer isn't an infrastructure detail. It's the difference between "we can answer that" and "we'd have to start collecting it and check back in a month."

New questions about old data. The single most valuable property of a raw layer is that it lets you answer questions you didn't know you'd have. Say you ship a new economy in your game and three weeks later someone asks whether players who bought the starter pack churned less. If you only kept aggregate metrics, the cohort definition you need now didn't exist then — you're stuck. If you kept raw events, you write a query against history and have the answer this afternoon. The raw layer turns "we didn't track that" into "we didn't report that yet."
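The starter-pack question might look something like the sketch below, run here against an in-memory SQLite table so it is self-contained. The table layout, the `item` column, the sample rows, and the churn definition (no events at or after a cutoff date) are all assumptions; a real warehouse query would follow the same shape with its own names.

```python
import sqlite3

# Hypothetical raw events: (player_id, event, item, timestamp).
rows = [
    ("p1", "session_start", None, "2026-05-01T10:00:00Z"),
    ("p1", "purchase", "starter_pack", "2026-05-02T11:00:00Z"),
    ("p1", "session_start", None, "2026-05-25T09:00:00Z"),
    ("p2", "session_start", None, "2026-05-01T12:00:00Z"),
    ("p2", "purchase", "starter_pack", "2026-05-03T12:30:00Z"),
    ("p3", "session_start", None, "2026-05-01T08:00:00Z"),
    ("p3", "session_start", None, "2026-05-04T08:00:00Z"),
    ("p4", "session_start", None, "2026-05-02T19:00:00Z"),
    ("p5", "session_start", None, "2026-05-02T07:00:00Z"),
    ("p5", "session_start", None, "2026-05-20T07:00:00Z"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (player_id TEXT, event TEXT, item TEXT, ts TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# The cohort is defined now, long after the events were collected.
CHURN_BY_COHORT = """
WITH buyers AS (
    SELECT DISTINCT player_id
    FROM events
    WHERE event = 'purchase' AND item = 'starter_pack'
),
last_seen AS (
    SELECT player_id, MAX(ts) AS last_ts
    FROM events
    GROUP BY player_id
)
SELECT
    CASE WHEN b.player_id IS NOT NULL THEN 'bought' ELSE 'did_not_buy' END AS cohort,
    COUNT(*) AS players,
    SUM(CASE WHEN l.last_ts < :cutoff THEN 1 ELSE 0 END) AS churned
FROM last_seen AS l
LEFT JOIN buyers AS b ON b.player_id = l.player_id
GROUP BY cohort
ORDER BY cohort
"""

# Same-format ISO 8601 UTC strings compare correctly as plain text.
result = conn.execute(CHURN_BY_COHORT, {"cutoff": "2026-05-15T00:00:00Z"}).fetchall()
for cohort, players, churned in result:
    print(cohort, players, churned)
```

Nothing about the cohort had to exist when the events were recorded; only the purchase and session events did.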

Metrics you can trust because you can audit them. Every PM has been in the meeting where two dashboards disagree about the same number. With a raw layer, there's a tiebreaker: both numbers are derived, and you can trace each one back to the events. The disagreement becomes a definition problem ("you're counting trial users, I'm not") instead of a mystery. The raw layer makes metrics arguable in the good sense — debatable against a shared set of facts.

Funnels and retention as queries, not features. On a hosted platform, a funnel is a feature the vendor built. If the funnel you need isn't expressible in their UI, you're out of luck. With raw events in a warehouse, a funnel is just a SQL query over timestamps — any funnel, including the weird four-step one with a branch in the middle that your specific game needs. You trade a point-and-click builder for unlimited expressiveness. Whether that trade is worth it depends on whether you have someone who can write the SQL, which is exactly the kind of honest tradeoff worth naming out loud.
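To show how little machinery a funnel needs, here is a small sketch that counts how many players reached each step in order, using nothing but event names and timestamps. The step names and sample events are hypothetical, and the same logic could equally be written as SQL in a warehouse.

```python
from collections import defaultdict


def funnel_counts(events, steps):
    """Count players who reached each step, requiring the steps in order."""
    by_player = defaultdict(list)
    for e in events:
        by_player[e["player_id"]].append(e)

    reached = [0] * len(steps)
    for player_events in by_player.values():
        # Same-format ISO 8601 UTC strings sort chronologically as text.
        player_events.sort(key=lambda e: e["timestamp"])
        next_step = 0
        for e in player_events:
            if next_step < len(steps) and e["event"] == steps[next_step]:
                reached[next_step] += 1
                next_step += 1
    return dict(zip(steps, reached))


# Hypothetical events and step names.
events = [
    {"player_id": "p1", "event": "tutorial_started", "timestamp": "2026-05-01T10:00:00Z"},
    {"player_id": "p1", "event": "tutorial_completed", "timestamp": "2026-05-01T10:05:00Z"},
    {"player_id": "p1", "event": "first_purchase", "timestamp": "2026-05-01T10:20:00Z"},
    {"player_id": "p2", "event": "tutorial_started", "timestamp": "2026-05-01T11:00:00Z"},
    {"player_id": "p2", "event": "tutorial_completed", "timestamp": "2026-05-01T11:07:00Z"},
    # p3 purchased before starting the tutorial, so the purchase is out of order.
    {"player_id": "p3", "event": "first_purchase", "timestamp": "2026-05-01T12:00:00Z"},
    {"player_id": "p3", "event": "tutorial_started", "timestamp": "2026-05-01T12:10:00Z"},
]

steps = ["tutorial_started", "tutorial_completed", "first_purchase"]
counts = funnel_counts(events, steps)
print(counts)
```

A branch in the middle, or a time limit between steps, is a few more lines in `funnel_counts` rather than a feature request to a vendor.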

Why a developer should care

If you're the engineer who owns the pipeline, the raw layer is the thing that keeps your architecture from turning into a tangle of irreversible decisions.

Reprocessing is your undo button. Aggregations have bugs. You'll define a retention metric, ship it, and three months later discover the window was off by a day. If your pipeline aggregates on ingest and throws away the raw events, that bug is now baked into three months of history with no way to fix it. If you keep raw events, you fix the transformation and recompute. The raw layer means your modeling logic is never a one-way door.
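The off-by-a-day scenario can be sketched as follows: the same stored activity is run through a buggy retention window and then a corrected one. The install dates, activity sets, and function name are invented for illustration.

```python
from datetime import date, timedelta

# Hypothetical raw facts: install date and the set of active days per player.
installs = {
    "p1": date(2026, 5, 1),
    "p2": date(2026, 5, 1),
    "p3": date(2026, 5, 1),
}
active_days = {
    "p1": {date(2026, 5, 1), date(2026, 5, 2)},
    "p2": {date(2026, 5, 1), date(2026, 5, 2)},
    "p3": {date(2026, 5, 1), date(2026, 5, 3)},
}


def day_n_retention(installs, active_days, offset_days):
    """Share of installers who were active exactly `offset_days` after install."""
    retained = sum(
        1
        for player, installed in installs.items()
        if installed + timedelta(days=offset_days) in active_days.get(player, set())
    )
    return retained / len(installs)


# The shipped metric used the wrong offset for "day 1".
buggy_d1 = day_n_retention(installs, active_days, offset_days=2)

# Because the raw activity is still stored, the fix is a recompute.
fixed_d1 = day_n_retention(installs, active_days, offset_days=1)

print(f"buggy: {buggy_d1:.0%}, fixed: {fixed_d1:.0%}")
```

If only the buggy aggregate had been stored, there would be no way to get from the first number to the second.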

Schema-on-read buys you time. Forcing every event into a rigid schema at ingest time feels disciplined, but it punishes you the first time the game team adds a field you didn't anticipate. Landing raw events in a flexible format lets you defer the hard schema decisions to read time, where they're cheap to change. (Parquet is column-oriented, and because each file carries its own schema, later files can add fields without rewriting earlier ones.) You validate and enrich lightly at the collector, store faithfully, and shape the data into modeled tables downstream, where a mistake costs a query rerun, not a data-loss incident.
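A minimal sketch of the read-time side, using JSON lines in place of Parquet so it runs with the standard library alone. The `difficulty` field and its default are hypothetical; the point is that the default is applied when reading, and the stored records are never rewritten.

```python
import json

# Stored exactly as they arrived. The second event comes from a newer client
# that added a hypothetical "difficulty" field.
raw_lines = [
    '{"event": "level_completed", "player_id": "p1", "level_id": 14}',
    '{"event": "level_completed", "player_id": "p2", "level_id": 14, "difficulty": "hard"}',
]

# Read-time defaults: cheap to change, because nothing on disk depends on them.
READ_DEFAULTS = {"difficulty": "normal"}


def read_event(line: str) -> dict:
    """Parse one stored event and fill in fields older clients did not send."""
    stored = json.loads(line)
    return {**READ_DEFAULTS, **stored}  # stored values win over defaults


shaped = [read_event(line) for line in raw_lines]
for event in shaped:
    print(event["player_id"], event["difficulty"])
```

If "normal" later turns out to be the wrong default for old events, the fix is a one-line change to `READ_DEFAULTS` and a rerun.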

Buffering decouples your failure modes. A practical detail: putting a durable message broker (NATS JetStream, in Rawbbit's case) between the collector and the writer means a spike in event traffic — say, a launch day — doesn't take down ingestion. The collector accepts and acknowledges fast; the writer drains the queue at its own pace into Parquet. The raw-layer-as-contract design naturally separates "accept the event" from "store the event," and that separation is what keeps you online when the game goes viral on a Saturday night.
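The accept/store split can be illustrated with an in-process queue standing in for the broker. This is only a sketch of the shape: `queue.Queue` is in-memory and not durable, unlike a real message broker, and the function names, the batch size, and the `None` shutdown sentinel are all assumptions.

```python
import queue
import threading

buffer = queue.Queue()  # stand-in for the broker; NOT durable
stored_batches = []     # stand-in for files written to storage


def accept(event: dict) -> str:
    """Collector side: enqueue and acknowledge immediately."""
    buffer.put(event)
    return "accepted"


def writer(batch_size: int) -> None:
    """Writer side: drain the queue at its own pace, flushing in batches."""
    batch = []
    while True:
        event = buffer.get()
        if event is None:  # sentinel: no more events are coming
            break
        batch.append(event)
        if len(batch) >= batch_size:
            stored_batches.append(batch)
            batch = []
    if batch:
        stored_batches.append(batch)  # flush the final partial batch


writer_thread = threading.Thread(target=writer, args=(4,))
writer_thread.start()

# A burst of traffic: every event is acknowledged without waiting on storage.
acks = [accept({"event": "session_start", "seq": i}) for i in range(10)]

buffer.put(None)
writer_thread.join()

print([len(batch) for batch in stored_batches])
```

The collector's latency depends only on the enqueue, so a slow or briefly unavailable writer shows up as a longer queue rather than as rejected events.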

Portability is leverage. When raw events are open-format files in your own object storage, your warehouse is a query engine pointed at your files, not a vault holding your data hostage. BigQuery reads the Parquet today; if you switch warehouses tomorrow, the files don't move. Open format plus your-own-storage is what makes "we could migrate if we had to" a real statement instead of a hope.

What this looks like in a game studio

Abstractions are easier to trust with a concrete story, so here's one.

A mobile studio ships a mid-core RPG. Early on, they wire up a hosted analytics SDK, get their DAU and session-length charts, and move on. Six months in, three things happen in the same week:

Finance asks for revenue per acquisition cohort, split by the campaign that brought each player in. The hosted tool tracks revenue and tracks campaigns, but not the join the finance team wants, and the export is gated to a higher tier.
The design team wants to know whether the players who hit the level-30 difficulty wall are the same ones who stop spending — a custom funnel the platform's UI can't express.
Event volume crosses the platform's free-tier ceiling, and the upgrade quote arrives with a per-event price that scales with exactly the growth they were hoping for.

A studio running a raw layer answers all three without drama. The acquisition-cohort revenue join is a SQL query against raw purchase and install events. The difficulty-wall-versus-spending question is another query correlating level_failed events at level 30 with subsequent purchase events. And the volume ceiling isn't a billing event at all: raw events landing as Parquet in their own bucket cost cloud-storage prices, which are measured in cents per gigabyte per month, not dollars per million events.

The raw layer didn't make the studio smarter. It made the studio's history queryable, which amounts to the same thing when the question is about the past — and in analytics, the question is almost always about the past.

There's a gamedev-specific reason this matters more than in most domains: games generate enormous volumes of high-cardinality behavioral events, and the questions worth asking — retention curves, economy balance, progression funnels, whale identification, churn prediction — are almost all retrospective and exploratory. They're exactly the questions a fixed set of pre-aggregated metrics can't anticipate. Holding the raw layer is how a small game team keeps the option to ask them.

The tradeoff, stated honestly

Keeping a raw layer isn't free, and pretending otherwise would undercut the whole argument.

You take on operational ownership: a pipeline to run, storage to pay for, and the responsibility for the SQL that turns raw events into the metrics your team reads. A hosted platform hands you the cohort builder and the retention dashboard on day one; a raw-layer setup hands you the foundation and expects you to build (or query) the views on top. If nobody on the team is comfortable with SQL and a warehouse, that foundation is a liability, not an asset.

The honest framing is this: hosted product analytics optimizes for time-to-first-insight. A raw layer optimizes for never losing the ability to get an insight later. Early and small, the first matters more. As you grow — more volume, more money, more retrospective questions, more reasons to control your own data — the second starts to dominate.

Plenty of teams run both: the raw layer as the durable contract, with a subset of events forwarded into a hosted tool for the product team's day-to-day. The two aren't enemies. But only one of them is a contract you actually hold.

The one-sentence version

If you remember nothing else: aggregates are opinions you can always recompute, but only if you kept the facts — and the raw layer is where the facts live.

Everything downstream is replaceable. The raw layer is the thing worth owning.

Rawbbit is a self-hosted event pipeline built around exactly this idea: raw events land as Parquet in your own storage and become the contract your analytics stack reads from. For a concrete example of how this principle plays out in Rawbbit's architecture, see ClickHouse as a Query Layer for Raw Game Events. If you're weighing a self-hosted raw layer against a hosted platform, the Rawbbit vs Amplitude comparison walks through the tradeoff in detail, and How Rawbbit Works shows the full pipeline.
