ClickHouse as a Query Layer for Raw Game Events

ClickHouse as a Query Layer for Raw Game Events

Most game analytics questions eventually become raw-event questions.

A product manager may ask which acquisition cohort retained better after a new tutorial update. A live-ops team may want to know what players did before leaving an event. A data analyst may need to compare IAP conversion for two versions of a starter pack offer. The question changes, but the pattern stays the same: the team needs to go back to the underlying event stream and ask a custom question.

This is where the query layer starts to matter.

Rawbbit is built around a simple idea: the raw event layer should be the durable contract. Events are accepted over HTTP, buffered, and written as Parquet files in object storage. Those Parquet files are the source of truth. Query engines sit downstream from that layer and can be changed, rebuilt, or replaced without changing the ingestion contract.

BigQuery external tables are one useful query path on top of that raw layer. ClickHouse is another.

This article explains why ClickHouse is now supported as a query layer for raw game events, when it makes sense, and how it fits into the broader Rawbbit architecture.

The query layer problem

Game studios rarely have a stable analytics workload.

During quiet periods, a team may run a few daily reports and occasional cohort checks. During soft launch, an update rollout, or a UA test, the workload changes completely. Analysts start exploring player behavior repeatedly. Live-ops teams need fresh answers. Product teams ask more specific questions every day.

That kind of work is difficult to reduce to a fixed dashboard.

Hosted analytics tools are useful for standard product questions, but game teams often need custom analysis over raw events. BigQuery can answer those questions well, especially if the data is partitioned correctly. But BigQuery’s cost model is scan-based: every ad-hoc query has a cost based on the amount of data scanned.

For occasional analysis, this is usually fine. For repeated exploration over large event volumes, the cost model becomes part of the architecture.

ClickHouse solves a different version of the problem. Instead of paying per query scan, a studio can load frequently queried event data into local ClickHouse tables and run repeated queries against infrastructure it already sized for that workload.

The important point is not that ClickHouse replaces BigQuery everywhere. The important point is that both can be downstream query layers over the same raw Parquet record.

How Rawbbit stores the raw layer

Rawbbit’s ingestion path is intentionally simple:

Producer → Collector API → NATS JetStream → Raw Writer → Parquet in object storage

The collector accepts event batches over HTTP. NATS JetStream provides a durable buffer. The raw writer lands events as Parquet files in object storage such as S3-compatible storage or GCS.

The Parquet files are the durable boundary of the system. They are not an internal cache and not a temporary export. They are the long-term record of what happened in the game.

That matters because every downstream query layer becomes optional.

A studio can query the Parquet files through BigQuery external tables. It can load them into ClickHouse. It can inspect them locally with DuckDB. It can later add Trino, Spark, or another system. The raw data does not need to move just because the query engine changes.

This is the core architectural rule:

The raw event layer is the contract. The query engine is important, but replaceable.

Two query paths over the same data

Rawbbit supports two practical query paths on top of raw Parquet.

BigQuery external tables

With BigQuery external tables, BigQuery reads Parquet files directly from object storage. The data does not need to be loaded into native BigQuery storage before it can be queried.

This is a good fit when a studio is already on Google Cloud, already has BigQuery knowledge, or wants to use BigQuery-specific features such as BigQuery ML. It also keeps operations simple: there is no ClickHouse server to run.

The tradeoff is cost behavior. BigQuery charges based on bytes scanned. Good partitioning and query discipline can keep this efficient, but frequent ad-hoc exploration over large event volumes can still add up.

ClickHouse

With ClickHouse, raw Parquet can be used in two ways.

The first option is to query Parquet files directly using ClickHouse’s s3() table function. This is useful for inspection, backfills, and simple lakehouse-style access.

The second option is to load raw events into local MergeTree tables. This is the path that matters for repeated analytics workloads. Once events are loaded into ClickHouse tables, dashboards and ad-hoc queries can use local storage, ordering keys, and indexes instead of scanning raw Parquet from object storage every time.

This creates a serving layer optimized for frequent reads, while the raw Parquet remains the source of truth.

If ClickHouse is rebuilt, upgraded, moved, or removed, the studio can reload from the same Parquet files. The serving copy may change. The raw record does not.

Why this matters for game studios

The query layer is not just a technical preference. It shapes how freely a studio can ask questions during the periods when analytics matter most.

Launches and traffic spikes

Game releases, content updates, and UA tests often create unpredictable spikes in event volume. During those windows, teams ask more questions, not fewer.

With a scan-based query model, every large exploratory query has a direct cost. With ClickHouse loaded from raw Parquet, frequent questions can hit local tables instead. The studio still pays for infrastructure, but the marginal cost of each additional query is not tied to a separate scan meter.

This is especially useful for live-ops teams that need to iterate quickly during launch windows.

Long retention analysis

Games often need retention analysis across long windows: D1, D7, D30, D60, and sometimes longer. They also need to compare behavior across acquisition cohorts, countries, app versions, economy changes, and major content updates.

Those queries can become expensive if they repeatedly scan large ranges of raw data. In ClickHouse, a well-modeled MergeTree table can make repeated cohort and retention queries much faster and more predictable.

The exact performance depends on schema, partitioning, ordering keys, hardware, and query shape. But the operating model is different: the studio sizes ClickHouse for the workload instead of paying per scan for every exploratory question.

Custom funnels and game-specific segments

Game analytics often goes beyond standard funnels.

A team may need to ask:

  • Which players completed tutorial step 4 in under 90 seconds?
  • Which players saw a starter pack offer, skipped it, then converted after a later offer?
  • Which users churned after failing the same level three times?
  • Which acquisition cohorts monetized after interacting with a specific live event?

These are SQL questions over raw events. A generic product analytics UI may cover part of the workflow, but game-specific analysis usually needs direct access to the event stream.

ClickHouse is strong for this kind of repeated analytical querying, especially when the data is modeled around common access patterns.

Cost predictability

A live game with 500k DAU can produce very different analytics workloads depending on event design, session length, retention, and how often analysts query the data.

For that reason, there is no universal VM size or universal monthly cost. But the cost model is still meaningfully different.

With BigQuery, more exploration usually means more bytes scanned. With ClickHouse, the studio primarily pays for the server and storage it chooses to run. For many small and mid-size teams, that predictability can be more valuable than a fully managed query engine.

A concrete studio scenario

Imagine a mid-size mobile studio running a freemium puzzle game during soft launch.

The game produces around 5 million events per day. The team wants to answer questions like:

  • Which acquisition cohort has the best D7 retention this week?
  • What is the IAP conversion rate for players who saw the new starter pack offer versus the old one?
  • Among players who churned on day 1, what was the last event before they quit?

A hosted analytics tool may answer the first question easily. The second may require custom setup. The third usually requires raw event access.

With a Rawbbit + BigQuery setup, all three questions can be answered from the raw Parquet layer through external tables. This works well, but frequent ad-hoc exploration can increase scan costs.

With a Rawbbit + ClickHouse setup, the same raw Parquet files can be loaded into local MergeTree tables. Analysts and dashboards query ClickHouse for fast repeated access, while the raw Parquet remains available for BigQuery or any other downstream tool.

The architecture is not locked to one query engine. It keeps the raw record stable and lets the serving layer match the workload.

When BigQuery is the better path

BigQuery external tables are often the right first query layer.

Use BigQuery when:

  • the studio already runs on Google Cloud;
  • the team already knows BigQuery;
  • query volume is low to moderate;
  • most analysis is occasional rather than repeated every day;
  • BigQuery ML or other native BigQuery features are important;
  • the team does not want to operate ClickHouse.

This path keeps the stack simpler. For many teams, that is the right tradeoff.

A query layer should match the current workload, not an imagined future workload.

When ClickHouse is the better path

ClickHouse becomes more useful when the studio has frequent analytical reads over the same event data.

Use ClickHouse when:

  • analysts run many ad-hoc queries every day;
  • live-ops dashboards need fast refreshes;
  • the team wants predictable infrastructure cost instead of per-query scan cost;
  • the studio is comfortable operating self-hosted services;
  • the workload benefits from local tables, ordering keys, and repeated reads;
  • the team wants a query layer that is not tied to a single cloud provider.

ClickHouse is not free operationally. It adds a server that needs monitoring, backups, upgrades, and basic database ownership. But for teams with the right workload and skill set, it can be a high-leverage serving layer over raw events.

When both make sense

The BigQuery and ClickHouse paths are not mutually exclusive.

A common setup is:

  • BigQuery for one-off analysis, ML workflows, or cloud-native data work;
  • ClickHouse for live dashboards and frequent ad-hoc exploration;
  • raw Parquet as the shared source of truth behind both.

In this model, ClickHouse may contain a serving copy of the data for performance. That is data duplication for query speed, not a second source of truth.

The source of truth remains the Parquet layer. If ClickHouse needs to be rebuilt, it can be reloaded from raw storage. If BigQuery is used for a separate workflow, it can read the same raw record.

This is the main benefit of treating the raw layer as a contract.

The honest tradeoff

Adding ClickHouse means adding operational responsibility.

Even a simple Docker Compose setup still needs monitoring, disk planning, backups, and version upgrades. Someone on the team needs to understand ClickHouse tables, ingestion jobs, and query patterns. Poor schemas can still produce poor performance.

A two-person indie team may not need this. If Firebase Analytics, GA4, or BigQuery external tables are still good enough, adding ClickHouse too early may create unnecessary work.

ClickHouse becomes more attractive when the pain is clear: frequent queries, repeated dashboard reads, large raw event volumes, cost sensitivity, or a need to avoid tying the analytics serving layer to one managed warehouse.

Infrastructure should be added when it removes more friction than it creates.

Current Rawbbit status

Rawbbit currently focuses on the raw event pipeline and downstream query layers:

  • HTTP event collection through the Collector API;
  • durable buffering through NATS JetStream;
  • raw event storage as Parquet files;
  • BigQuery external-table access;
  • ClickHouse query-layer setup and loader patterns.

The next product gap is game-engine SDK support. Native SDKs for Unity and Godot are in progress, because that is what many game teams expect before adopting an analytics stack deeply.

But SDKs are not required to start sending events.

Even today, a game client, backend, or lightweight wrapper can send event batches directly to Rawbbit’s Collector API over HTTP using POST /v1/events:batch. The SDKs are about making that integration cleaner: batching, retries, platform conventions, local buffering, and developer experience. They do not unlock the ingestion path itself.

Rawbbit is Apache 2.0 and runs inside the studio’s own cloud account. The raw data stays in the studio’s storage. There is no telemetry requirement and no vendor service that needs to receive the game’s event stream.

The one-sentence version

ClickHouse gives Rawbbit users a fast, cost-predictable serving layer for repeated analytics queries, while raw Parquet remains the durable source of truth.