A thread I explored for a while: could you build a genuinely good distributed data processing system on top of FaaS primitives — Lambda or equivalent — to get massive parallelism without cluster overhead?
PyWren
project
The clearest early implementation of the idea: scale out embarrassingly-parallel Python workloads via Lambda. The motivating question from the paper — "Why is there no cloud button?" — still feels unanswered. A student should be able to push a button and have their existing optimized single-machine code run in the cloud. We still don't really have that.
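The "cloud button" model boils down to a `map` over a plain function and a list of inputs. Here is a minimal local sketch of that shape, using the standard library's thread pool as a stand-in for Lambda invocations. The `cloud_map` and `word_count` names are hypothetical and are not PyWren's API; as far as I recall, PyWren's own executor exposes a similar `map` call that returns futures.

```python
from concurrent.futures import ThreadPoolExecutor


def cloud_map(func, inputs, max_workers=8):
    """Hypothetical stand-in for a serverless map.

    Each call to func would be one function invocation in the cloud;
    here a local thread pool plays that role. Results come back in
    input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, inputs))


def word_count(text):
    # Ordinary single-machine code: no cluster or framework awareness.
    return len(text.split())


docs = ["a b c", "d e", "f"]
counts = cloud_map(word_count, docs)
```

The point of the sketch is that `word_count` needs no changes to run in parallel, which only works because the tasks are independent of each other.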
Serverless Data Analytics with Flint
paper
A Spark execution engine on top of Lambda that plugs in via the SchedulerBackend interface, so you keep the same Spark API. It uses SQS to carry shuffle data and intermediate results; handling the shuffle is the hard part of any serverless map-reduce architecture.
Corral: A Serverless MapReduce Framework
blog post · Go · open source
A serverless MapReduce framework written in Go. It does well on filtering and aggregation but falls flat on joins, for the same root reason that shuffles are hard in every serverless design: there is no efficient intermediate store. Other attempts (Flint, etc.) work around this with SQS or ElastiCache.
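To make the shuffle problem concrete, here is a rough sketch of a map-reduce job in which workers cannot talk to each other, so every intermediate record passes through an external store. `IntermediateStore` is a hypothetical in-memory stand-in for S3, SQS, or a cache, and the partitioning scheme is an assumption chosen for determinism; none of this is taken from Flint or Corral.

```python
from collections import defaultdict


class IntermediateStore:
    """Hypothetical stand-in for an external store; counts requests."""

    def __init__(self):
        self.objects = {}
        self.puts = 0
        self.gets = 0

    def put(self, key, value):
        self.puts += 1
        self.objects[key] = value

    def get(self, key):
        self.gets += 1
        return self.objects.get(key, [])


def partition(key, num_reducers):
    # Deterministic partitioner (str hash() is salted per process).
    return sum(key.encode()) % num_reducers


def map_task(mapper_id, records, num_reducers, store):
    # Bucket records by reducer, then write one object per reducer.
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition(key, num_reducers)].append((key, value))
    for r in range(num_reducers):
        store.put((mapper_id, r), buckets[r])


def reduce_task(reducer_id, num_mappers, store):
    # Read this reducer's object from every mapper and sum per key.
    totals = defaultdict(int)
    for m in range(num_mappers):
        for key, value in store.get((m, reducer_id)):
            totals[key] += value
    return dict(totals)


def run_job(splits, num_reducers):
    store = IntermediateStore()
    for m, records in enumerate(splits):
        map_task(m, records, num_reducers, store)
    result = {}
    for r in range(num_reducers):
        result.update(reduce_task(r, len(splits), store))
    return result, store


splits = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("a", 1)],
]
result, store = run_job(splits, num_reducers=2)
```

With M mappers and R reducers this layout issues M × R writes and M × R reads against the store, so 1,000 mappers and 1,000 reducers mean a million requests in each direction. A join is the same pattern applied to two inputs at once, which is why it hits the same wall.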
Serverless Computing: One Step Forward, Two Steps Back
CIDR 2019 · Hellerstein et al. · paper
A principled critique of the serverless-for-data-processing approach. The main problems: (1) no data locality, so you ship data to code rather than code to data; (2) Lambdas aren't network-addressable and can't communicate directly, forcing all coordination through a secondary service; (3) no control over hardware, so no GPU or SIMD acceleration. Worth reading alongside the above.
One objection to these objections, which the authors themselves note in the paper, is that designing around these constraints could inspire more innovative techniques. For example, the lack (or exorbitant cost) of communication between Lambdas, and of any control over the order in which they execute, incentivizes [disorderly programming](https://disorderlylabs.github.io/), high-level declarative languages, etc.