Reading list

Papers, projects, and ideas I've found genuinely interesting. A map of how I think about the problem space — distributed systems, query languages, logic programming, and the future of data infrastructure.

Distributed systems · logic programming · query languages

This cluster started with Eve and kept pulling on threads — Datalog, incremental computation, reactive systems. It's the area I find most intellectually alive in systems research right now.

Eve

Maybe the single most influential discovery of my career. Eve was a strange, wonderful, and sadly abandoned programming language / runtime / IDE / cloud platform that proposed ways of building software unlike anything I'd seen. Code blocks are embedded in documents (like a Jupyter notebook, but more fundamental). Each block is a self-contained handler that queries program state, manipulates it, and writes it back, and each block executes independently every step, ordered only by dataflow dependencies. Discovering Eve, then tracing what it was founded on and what it led to, sent me down a rabbit hole I haven't fully climbed out of.
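A rough way to picture that execution model, as a minimal Python sketch rather than Eve syntax or its actual runtime: state is a set of (entity, attribute, value) facts, each block reads a snapshot and returns new facts, and a loop reruns every block until nothing new appears. The fact shape, the block names, and the `run` helper are all assumptions made up for illustration.

```python
# Program state is a set of (entity, attribute, value) facts (an assumed shape).

def greet_block(state):
    """Query: every entity tagged 'guest' that has a name. Write: a greeting fact."""
    guests = {e for (e, a, v) in state if a == "tag" and v == "guest"}
    names = {e: v for (e, a, v) in state if a == "name"}
    return {(e, "greeting", f"Hello, {names[e]}!") for e in guests if e in names}


def notify_block(state):
    """Query: every entity with a greeting. Write: a 'notified' fact."""
    return {(e, "notified", True) for (e, a, _) in state if a == "greeting"}


def run(state, blocks):
    """Run all blocks against the same snapshot each step until nothing new is derived."""
    state = set(state)
    steps = 0
    while True:
        derived = set().union(*(block(state) for block in blocks))
        if derived <= state:
            return state, steps
        state |= derived
        steps += 1


initial = {
    ("p1", "tag", "guest"),
    ("p1", "name", "Ada"),
    ("p2", "name", "Bob"),  # not tagged as a guest, so never greeted
}

# The blocks are listed "backwards" on purpose: notify_block only produces
# output once greet_block's facts exist, regardless of listing order.
final, steps = run(initial, [notify_block, greet_block])
```

The listing order of the blocks does not matter here: `notify_block` simply has nothing to match until a greeting exists, which is one way to read "ordered only by dataflow dependencies."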

Dedalus: Datalog in Time and Space · slides

A framework for reasoning about time and "space" (i.e. different machines) in distributed systems, using pure Datalog. The key idea: endow every proposition p(A, B) with a time variable t, so p(A, B, t) means "p is true of A and B at timestamp t." Intuitive in retrospect, powerful in practice. From 2009, predating most of what people now call "streaming". This is what Eve is founded on, and it shows up in a lot of subsequent distributed systems research.
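To make the timestamp idea concrete, here is a small Python sketch (not Dedalus syntax) in which every fact carries an explicit time field and a persistence-style rule carries `link` facts forward one tick unless a deletion fact says otherwise. The relation names `link` and `del_link` and the helpers `step` and `links_at` are hypothetical.

```python
# Facts carry an explicit timestamp as their last field: (name, a, b, t).

def step(facts, t):
    """Derive the facts that hold at time t + 1 from the facts that hold at time t."""
    now = {f for f in facts if f[3] == t}
    deleted = {(a, b) for (name, a, b, _) in now if name == "del_link"}
    nxt = set()
    for (name, a, b, _) in now:
        # Roughly: link(A, B, T+1) :- link(A, B, T), not del_link(A, B, T).
        if name == "link" and (a, b) not in deleted:
            nxt.add(("link", a, b, t + 1))
    return nxt


def links_at(facts, t):
    """All link pairs that are true at timestamp t."""
    return sorted((a, b) for (name, a, b, ts) in facts if name == "link" and ts == t)


facts = {
    ("link", "n1", "n2", 0),
    ("link", "n2", "n3", 0),
    ("del_link", "n1", "n2", 1),  # a deletion observed at time 1
}

for t in range(3):
    facts |= step(facts, t)
```

Nothing is ever mutated: "deleting" a link just means the rule stops deriving it for later timestamps, so `links_at(facts, 1)` still contains both links while `links_at(facts, 2)` contains only `("n2", "n3")`.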

Naiad: A Timely Dataflow System

A framework for data-parallel, incremental, reactive computation: when source data changes, outputs update automatically, with work proportional to the size of the change rather than a full recompute. The ideas here, from Murray, McSherry, and their coauthors, became the basis for Differential Dataflow, which was ultimately commercialized in Materialize. Feldera pursues the same goal with the closely related DBSP model.
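The "work proportional to the change" idea can be sketched in a few lines. This is not Naiad's or Differential Dataflow's API, just a hand-rolled word count that consumes `(word, diff)` input changes and emits changes to its output; the class and method names are hypothetical.

```python
from collections import Counter


class IncrementalWordCount:
    """Maintains word counts from a stream of (word, diff) input changes."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, changes):
        """Apply input changes and return the output changes they cause.

        Output changes are ((word, count), diff) pairs: -1 retracts an old
        result row, +1 asserts a new one. Only words mentioned in `changes`
        are touched, so the work scales with the batch, not the total input.
        """
        touched = {}
        for word, diff in changes:
            touched.setdefault(word, self.counts[word])  # remember the old count once
            self.counts[word] += diff
        out = []
        for word, old in touched.items():
            new = self.counts[word]
            if new == 0:
                del self.counts[word]
            if new == old:
                continue
            if old:
                out.append(((word, old), -1))
            if new:
                out.append(((word, new), +1))
        return out


wc = IncrementalWordCount()
first = wc.apply([("a", +1), ("b", +1), ("a", +1)])
second = wc.apply([("a", -1)])
```

After the second call, only the row for `"a"` is retracted and re-asserted; the row for `"b"` is never looked at.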

Categorical Query Language (CQL)

A relational query language grounded in category theory. I read it as "SQL with a better type system and compile-time integrity checking, plus some category-theory machinery underneath." It has a built-in theorem prover, and the foreign key handling — where you dereference a FK like an object field and the join is implicit — is elegant. On pure semantics it's hard to beat. Room to improve on syntax and accessibility.

AP5

A Common Lisp extension from 1990 (older than me!) providing in-memory relations, a query language, consistency rules, and triggers. A surprisingly complete model for a relational programming language. Another example of what a "better SQL" or "SQL/programming language hybrid" could look like.

Cell language

Another relational programming language — clean and concise syntax for manipulating binary relations, highly normalized schemas (around 6NF). Compiles to C++/C#/Java rather than machine code or bytecode: the intended use is generating classes and functions you incorporate into a larger project. An interesting design tradeoff.

New Directions in Cloud Programming

Makes the case for programming languages and models where the execution environment is a collection of cloud resources — not just an interpreter or binary on a single machine. Synthesizes ideas from Dedalus, Timely Dataflow, and others. I'm fully on board with the premise: the right abstractions for distributed systems haven't been invented yet, and this paper is a good map of the territory.

Software performance

Parsing Logs 230x Faster With Rust

Hugely influential on how I think about performance. Beyond just motivating me to learn Rust, it changed how I think about memory, allocation, and what's achievable when you take performance seriously. It also shaped my skepticism toward many "big data" systems: a lot of them are slow not because the data is big, but because the tooling is bloated. If your data isn't as large as you think, Spark is probably not the right answer.

Serverless compute for data

A thread I explored for a while: could you build a genuinely good distributed data processing system on top of FaaS primitives — Lambda or equivalent — to get massive parallelism without cluster overhead?

A lot of this is from ~10 years ago(!), but kept for historical interest:

PyWren

The clearest early implementation of the idea: scale out embarrassingly-parallel Python workloads via Lambda. The motivating question from the paper — "Why is there no cloud button?" — still feels unanswered. A student should be able to push a button and have their existing optimized single-machine code run in the cloud. We still don't really have that.

Serverless Data Analytics with Flint

A Spark execution engine on top of Lambda that plugs in via the SchedulerBackend interface, so you keep the same Spark API. It uses SQS to move shuffle data and intermediate results between stages, and that exchange is the hard part of any serverless map-reduce architecture.

Corral: A Serverless MapReduce Framework

A serverless MapReduce framework written in Go. It does well on filtering and aggregation but falls flat on joins. The root cause is the same one behind every serverless shuffle problem: functions have no efficient intermediate store through which to exchange data. Other attempts (Flint, etc.) work around this with SQS or ElastiCache.
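To see why joins lean so heavily on the intermediate store, here is a minimal single-process sketch of a hash-partitioned shuffle join. It is not Corral's or Flint's code: the `store` dictionary stands in for whatever external service (a queue, a cache, object storage) the functions would have to use, and every name in it is hypothetical.

```python
from collections import defaultdict
from zlib import crc32


def partition_of(key, n):
    """Stable hash so that equal join keys always land in the same partition."""
    return crc32(str(key).encode()) % n


def map_phase(rows, side, n, store):
    """Each mapper writes every row to the partition chosen by its join key."""
    for key, value in rows:
        store[partition_of(key, n)].append((side, key, value))


def reduce_phase(partition):
    """Each reducer joins only the rows that landed in its own partition."""
    left = defaultdict(list)
    for side, key, value in partition:
        if side == "L":
            left[key].append(value)
    return [
        (key, lv, value)
        for side, key, value in partition
        if side == "R"
        for lv in left.get(key, [])
    ]


users = [(1, "ada"), (2, "bob")]
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "ink")]

N = 3
store = defaultdict(list)  # stand-in for the external intermediate store
map_phase(users, "L", N, store)
map_phase(orders, "R", N, store)

shuffled = sum(len(p) for p in store.values())
joined = sorted(row for p in store.values() for row in reduce_phase(p))
```

Every input row, matched or not, passes through `store` once (`shuffled` equals the total number of input rows), whereas a filter or a pre-aggregated sum can shrink the data before anything leaves the mapper. That is the cost an external store has to absorb.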

Serverless Computing: One Step Forward, Two Steps Back

A principled critique of the serverless-for-data-processing approach. The main problems: (1) no data locality — you ship data to code, not code to data; (2) lambdas aren't network-addressable and can't communicate directly, forcing all coordination through a secondary service; (3) no control over hardware, so no GPU or SIMD acceleration. Worth reading alongside the above.

One objection to these objections, which the authors themselves note in the paper, is that designing around these constraints could inspire more innovative techniques. For example, the lack (or exorbitant cost) of communication between Lambdas, and of any control over the order in which they execute, incentivizes [disorderly programming](https://disorderlylabs.github.io/), high-level declarative languages, and similar approaches.