I wrote an SQL engine in Python

I've been working on Quokka for a year now. Before I started, I was on leave from Stanford working at a startup, writing assembly to accelerate machine learning primitives. After a while I realized that while I might be able to make a living doing that, what most clients want to speed up isn't ML model inference or training, but data pipelines. After all, most ML in industry today seems to be lightweight models applied to heavily engineered features, not GPT-3 on Common Crawl.

Almost all feature engineering today is done in SQL/DataFrames, with a generous helping of user-defined functions (UDFs) that encode business logic. Think fraud detection, search recommendations, personalization pipelines. In model training, materializing these features is often the bottleneck, especially if the actual model being used is not a neural network. People today either use a managed feature platform like Tecton or Feathr.ai, or roll their own pipelines with SparkSQL. With robust UDF support, SparkSQL seems to be the de facto standard for these feature engineering workloads and is used under the hood by virtually every managed feature platform. (Unless you're on GCP, in which case BigQuery is also a strong contender.)
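
To make the shape of these workloads concrete, here's a minimal sketch of a SparkSQL pipeline with a Pandas UDF. The table, column names, and the feature itself are hypothetical, not from any particular production system:

```python
# Hypothetical fraud-detection feature: how unusual is a transaction's
# amount relative to the account's historical average? The business
# logic lives in plain Python/Pandas, wrapped as a Spark pandas_udf.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def amount_zscore(amount: pd.Series, avg: pd.Series, std: pd.Series) -> pd.Series:
    # Vectorized over a batch of rows at a time.
    return (amount - avg) / (std + 1e-9)

features = (
    spark.table("transactions")  # hypothetical source table
         .withColumn("amt_zscore",
                     amount_zscore("amount", "avg_amount", "std_amount"))
)
```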

Of course, these problems only arise when you have Big Data (>100GB), can't just use Pandas, and need a distributed framework like SparkSQL. Having lost all the money I made from my shitcoin and stock market startup, I returned to my PhD program to build a better distributed query engine, Quokka, to speed up these feature engineering workloads. Going in, I had several goals:

1. Easy to install and run, especially for distributed deployments.
2. Good support for Python UDFs that may involve numpy, sklearn, or even PyTorch.
3. At least 2x the performance of SparkSQL. Otherwise, what are we doing here?
4. Fault tolerance. This may not matter for small-scale frameworks, but it's essential for TB-scale jobs that run for hours and are painful to restart. SparkSQL can recover from job failures caused by Spot Instance preemptions, and Quokka should be able to as well.

The first two goals strongly suggest that Python is the language of choice for Quokka. PySpark supports Python UDFs reasonably well, but a myriad of downsides stem from its Java backend: you need to sudo install all required Python packages on EMR, Python UDF error messages don't display correctly, there's no precise control of intra-UDF parallelism, etc. While each issue seems minor on its own, these footguns are extremely annoying to native Python data scientists like me, whose last experience with Java was AP Computer Science.

I've gotten some pushback on this: I know major tech players that maintain UDF libraries in Scala/Java, and senior engineers who claim that Java isn't that bad and that all serious engineers should know it anyway. My counterargument:

I got a computer science degree from MIT without writing a single line of Java, and I know many who have done the same. I want to empower data scientists without formal computer science training, and judging by the tutorial videos available on YouTube, their first language of choice is unlikely to be Java. Have you ever wondered why Tensorflow4j exists? Would you even want to learn it instead of just writing PyTorch?

But how do you build a distributed engine in Python? After all, Python wasn't known for its distributed prowess... until Ray came along. I won't waste space here describing how amazing it is, but it's basically Akka in Python that actually works. It lets you easily instantiate a custom Python class object on a remote worker machine and call its methods, which is pretty much all you need to build a distributed query engine. Ray also lets you spin up remote clusters with a few lines of code and manage arbitrary Python dependencies programmatically, which neatly satisfied my first two goals.
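
Here's a minimal sketch of that actor pattern, assuming a toy TaskNode class (hypothetical, not Quokka's actual internals): Ray instantiates the class on a remote worker and lets you call its methods asynchronously.

```python
import ray

ray.init()  # or ray.init(address="auto") to join an existing cluster

@ray.remote
class TaskNode:
    """Toy stand-in for a query-engine worker."""
    def __init__(self, partition_id):
        self.partition_id = partition_id
        self.buffered = []

    def execute(self, batch):
        # A real engine would apply an operator here (filter, join, ...).
        self.buffered.append(batch)
        return len(self.buffered)

# Place one actor per worker; Ray schedules them across the cluster.
nodes = [TaskNode.remote(i) for i in range(4)]
futures = [node.execute.remote({"rows": 1000}) for node in nodes]
print(ray.get(futures))  # -> [1, 1, 1, 1]
```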

Well, what about performance? Python is famous for being slow, and that reputation is well deserved. However, Python's slowness works in its favor: because it's so slow, people have built amazing open source libraries in C or Rust that speed up common operations, like numpy, Pandas, or Polars! If you lean on these libraries as much as possible, your code can actually perform extremely well: e.g. a data analysis workflow implemented using only columnar Pandas APIs will beat a hand-coded Java or even C program more often than not.
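
As a toy illustration of that claim, compare the same aggregation written against columnar Pandas APIs versus a row-by-row Python loop. The vectorized version dispatches to compiled code and is typically orders of magnitude faster; exact speedups vary by machine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(100_000),
                   "qty": np.random.randint(1, 10, 100_000)})

# Columnar: one vectorized expression over whole columns, executed in C.
revenue = (df["price"] * df["qty"]).sum()

# Row-wise: a pure-Python loop over the same data -- easily 100x slower.
revenue_slow = 0.0
for _, row in df.iterrows():
    revenue_slow += row["price"] * row["qty"]

assert np.isclose(revenue, revenue_slow)
```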

Specifically, for a distributed query engine...
