Materialize integration almost done (knock on wood)

I’ve been working for almost a week now on getting Materialize integrated with Findka. My most recent takeaway is that running Kafka, ZooKeeper and the Confluent Schema Registry (plus Biff, Crux and Materialize) on a single machine is in fact too much for my poor commodity laptop. I almost ordered some parts to build a real dev machine, but that was when I still thought a lot of the slowdown I was experiencing was coming from other parts of Findka.

In reality, it was all coming from Kafka and friends. Materialize doesn’t actually require Kafka. You can create sources and sinks with regular files. But for some reason which partly escapes me now, I decided early on to go for embedded Kafka instead. I guess I thought that running an embedded instance from the same JVM as my app wouldn’t add significant overhead. And using the file source/sink in production felt like a hack, or something. I don’t know. I can see now that was a pretty dumb decision.

The good news is that I’ve almost finished converting the integration to use files instead of Kafka, and it’s nice and lightweight. And in ten years when Findka has enough traffic to warrant horizontal scaling, I already have all the code for switching to Kafka.

Here’s an example of what using the integration will be like. On Findka you’ll notice some stats in the sidebar (unless you’re on mobile):

Currently I calculate these with a cron job (and by cron job, I mean I download the data set and run various analytical queries on it a few times a week). With Materialize, the stats will be up-to-date all the time. It’ll require a few pieces of code.

First, you have to define some SQL queries. Biff will handle writing all the Crux transactions into a file, which you can use as a Materialize source:

create materialized source docs_source
from file 'data/findka/mz-io/source'
with (tail = true)
format text;

create materialized view all_docs as
select
  x->>'biff.mz/id' as id,
  (x->'biff.mz/tx-id')::int as tx_id,
  x
from (select text::jsonb as x from docs_source);

create materialized view valid_docs as
select id, max(tx_id) as tx_id
from all_docs
group by id;

create materialized view docs as
select * from all_docs
natural join valid_docs;

This creates a docs view which contains all the documents currently in your Crux DB. Materialize’s file sources only let you append records; you can’t update or delete. So we have to do that ourselves. This code will be pretty much the same for any Biff app, so I might have Biff run it for you.
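To make the append-only model concrete, here’s a small Python sketch of what’s going on (the doc shape matches the `biff.mz/id` and `biff.mz/tx-id` keys in the SQL above, but the writer function and sample data are stand-ins, not Biff internals). Every write appends a full copy of the document, and the equivalent of the `valid_docs`/`docs` views keeps only the row with the highest transaction ID per document ID:

```python
import json

# Simulate Biff appending one JSON document per line to the source file.
# Updates are new lines with the same id and a higher tx id; the old
# lines stay in the file, just like Materialize's append-only file source.
log = []

def append_doc(doc_id, tx_id, doc):
    log.append(json.dumps({"biff.mz/id": doc_id, "biff.mz/tx-id": tx_id, **doc}))

append_doc("stats-1", 1, {"content-type": "music"})
append_doc("stats-1", 2, {"content-type": "book"})   # update: same id, newer tx
append_doc("stats-2", 2, {"content-type": "music"})

# Equivalent of the valid_docs + docs views: for each id, keep only
# the row with the highest tx id.
latest = {}
for line in log:
    row = json.loads(line)
    key = row["biff.mz/id"]
    if key not in latest or row["biff.mz/tx-id"] > latest[key]["biff.mz/tx-id"]:
        latest[key] = row

docs = sorted(latest.values(), key=lambda r: r["biff.mz/id"])
print([(d["biff.mz/id"], d["content-type"]) for d in docs])
# → [('stats-1', 'book'), ('stats-2', 'music')]
```

The file still contains all three lines; only the query result forgets the stale version of `stats-1`.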

(I think this was one of the main reasons I went with Kafka at first: the Kafka source supports upsert, so you don’t have to define these views. That also allows Kafka to do compaction, removing old documents that aren’t valid anymore, which I thought would be desirable for performance.)

From there, we can write some views for calculating the stats. Here are the ones that count the number of items of each content type:

create materialized view if not exists items as
select
  x->>'content-type' as content_type,
  x->>'provider' as provider,
  x->'provider-id' as provider_id,
  x->'crux.db/id' as id,
  x->>'title' as title,
  x->>'author' as author,
  x->>'url' as url,
  x->>'image' as image,
  x->>'description' as description
from docs
where x->>'content-type' is not null;

create materialized view if not exists n_content_types as
select content_type, count(*)
from items
group by content_type
order by count(*) desc;
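The aggregation is just a group-and-count. A rough Python equivalent, over a few hypothetical rows (in Materialize these come from the `docs` view, and the result updates incrementally instead of being recomputed):

```python
from collections import Counter

# Hypothetical items rows, standing in for the items view.
items = [
    {"content_type": "music"},
    {"content_type": "music"},
    {"content_type": "book"},
]

# Equivalent of: select content_type, count(*) from items
#                group by content_type order by count(*) desc;
n_content_types = Counter(row["content_type"] for row in items).most_common()
print(n_content_types)
# → [('music', 2), ('book', 1)]
```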

You (or Biff) will also have to define a sink. With the Kafka integration, it looked like this:

create sink if not exists n_content_types_sink
from n_content_types
into kafka broker 'localhost:9092' topic 'n_content_types'
format avro using confluent schema registry 'http://localhost:8081';

But I’ll need to update this to use a file sink instead.

Besides that, you have to subscribe to the sink from the frontend:

(defderivations
  ...
  mz-subs [{:table :n_content_types}
           ; A more nuanced example:
           #_{:table :n_content_types
              :where [[:content_type "music"]]}]
  ...)

And then you have to define some server-side rules to authorize the subscription:

{:biff.mz/rules
 {:n_content_types
  (fn [{:keys [session/uid doc] :as env}]
    ; Return true if the current user
    ; is allowed to see the given document.
    true)}}
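Conceptually, a rule is just a predicate that every outgoing record passes through before being sent to a subscribed client. Here’s a Python sketch of that idea (the function and table names are illustrative, not Biff’s actual internals):

```python
# Each rule is a predicate from an env (user id + document) to a boolean.
# n_content_types is public stats, so everyone may see it; a hypothetical
# private_table only sends a doc to the user who owns it.
rules = {
    "n_content_types": lambda env: True,
    "private_table":   lambda env: env["doc"].get("owner") == env["uid"],
}

def authorized(table, uid, doc):
    """Return True if uid may receive doc from the given table's sink."""
    rule = rules.get(table)
    return rule is not None and rule({"uid": uid, "doc": doc})

print(authorized("n_content_types", "alice", {"content_type": "music", "count": 7}))
# → True
print(authorized("private_table", "alice", {"owner": "bob"}))
# → False
```

Tables without a rule are denied by default, which is the safe choice for anything pushed to the frontend.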

And that’s it (please hold your applause).

I’m planning to have this running in production soon (hopefully today). However, it won’t be available in Biff for a couple of weeks: I’ll have to do a lot of code cleanup and write documentation, and before that happens I need to get back to Findka feature dev for a bit. I prioritized the Materialize integration now because I’d like to use it for analytical queries in Findka, but it has started to suck up a little too much time.

Anyway, it looks like I’ll have to find a different excuse to upgrade my computer.