
The Pipeline Academy Awards 2021

The end of the year has arrived, closing out our first full year in operation. 2021 was the year of the Cambrian explosion in data engineering tooling, yet you don't have to be a data scientist to be certain that 90% of these tools will be gone within a couple of years, and for good reason: like it or not, most of them are solutions to fictional problems.

Whether you are a seasoned data architect or a newcomer to the world of data infrastructure, you already know that a major part of your work is about curating the right tools for your business, your team and your data. One of the key assets our participants leave the course with is a framework for making informed decisions about data infrastructure tooling. It considers timeless software engineering practices, state-of-the-art technologies, business goals and even sustainability factors, but under the hood it is very much powered by common sense.

This framework (and the desire to avoid major migration projects) is what keeps data engineers from being misled by the shockwave of marketing messages trying to convince them that a new and shiny tool is a must-have in... wait for it...

THE MODERN DATA STACK.

We deal with these tooling decisions on a daily basis. We cover a lot of the tools in the course: we teach them, love them, update them and use them. And we have favourites that stand out.

The 2021 Pipeline Academy Awards (The Pipies) are brought to you by Pipeline Academy. It's our way of praising the teams that built meaningful software solutions that serve a real purpose, the ones that are most likely here to stay to make the lives of data engineers better.

Separating the wheat from the chaff

Generally what we value:

  • the tool is open source: you can run it yourself if you have the resources and the expertise,

  • the tool has a paid hosted option for when you don't have the means to host it on your own,

  • the hosted option has a generous free tier to get started with,

  • good tutorials and documentation are provided,

  • the learning curve is not steep.

Caveat: Apple M1 hell is real; support for the new chips is not really there yet.

The 2021 Pipies go to…

Data Acquisition: Airbyte

Self-definition: Open-Source Data Integration Pipelines | ELT. Our choice for EL: it plays nice with dbt, is a bit heavy on the resource side, and uses Docker as a Lambda :)
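To give a sense of how it slots into a pipeline, here is a minimal sketch that triggers a sync against Airbyte's Config API from Python. The URL, connection ID and response shape are assumptions for illustration; check the API docs for your Airbyte version.

```python
import requests

# Assumptions: a local Airbyte instance on its default port and an
# already-configured connection. Replace both values with your own.
AIRBYTE_URL = "http://localhost:8000/api/v1"
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"

# Trigger a manual sync (the E and L of ELT) for one connection.
response = requests.post(
    f"{AIRBYTE_URL}/connections/sync",
    json={"connectionId": CONNECTION_ID},
    timeout=30,
)
response.raise_for_status()
print(response.json()["job"]["status"])  # e.g. "running"
```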

Telemetry: Quix

Self-definition: Real-time stream processing PaaS. Whatever takes the pain out of Kafka is a friend of ours. A European player with McLaren expertise, supported by Project A.

ETL/ELT: Prefect

Self-definition: The New Standard in Dataflow Automation. "Airflow is ****, yes", hence our post Why Not Airflow? A product that keeps evolving in a great way, already at its second-generation workflow engine, Orion.
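For a flavour of the developer experience, here is a minimal sketch of a flow using the Orion-style decorators; the task names and toy data are our own invention.

```python
from prefect import flow, task

@task
def extract() -> list:
    # Stand-in for pulling rows from a source system.
    return [1, 2, 3]

@task
def transform(rows: list) -> list:
    return [row * 2 for row in rows]

@task
def load(rows: list) -> None:
    print(f"loaded {len(rows)} rows")

@flow
def etl():
    # Plain Python calls; Prefect tracks state, retries and observability.
    load(transform(extract()))

if __name__ == "__main__":
    etl()
```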

Data Modeling: dbt

Self-definition: Transform data in your warehouse. The whale that rules the T in ELT and coined the term Analytics Engineering. dbt encodes best practices, a product built on experience solving actual problems; the jury is still out on how it scales for BIG data, though. Still, it should be good enough for 90% of BI teams.

Data Warehousing: ClickHouse

Self-definition: fast open-source OLAP DBMS. From Russia with love, written in C++, an order of magnitude faster than almost anything else, especially anything whose name starts with Apache.
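As a minimal sketch of what talking to it from Python looks like, assuming a local server and the clickhouse-driver package; the table and data are invented.

```python
from datetime import datetime
from clickhouse_driver import Client

# Assumes a ClickHouse server listening on localhost:9000 (native protocol).
client = Client(host="localhost")

# MergeTree is ClickHouse's workhorse engine for analytical workloads.
client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        ts DateTime,
        user_id UInt64,
        event String
    ) ENGINE = MergeTree()
    ORDER BY ts
""")

client.execute(
    "INSERT INTO events (ts, user_id, event) VALUES",
    [(datetime(2021, 12, 1, 12, 0), 42, "signup")],
)

print(client.execute("SELECT event, count() FROM events GROUP BY event"))
```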

Data Quality: Great Expectations

Self-definition: a shared, open standard for data quality. The Janus-faced power tool, marrying the best of machine readability and human oversight. Outstanding learning curve and workflow, though deployment can get heavy because of its massive dependencies.
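The fastest way to get a feel for it is the pandas shortcut; a minimal sketch with an invented CSV and hypothetical column names.

```python
import great_expectations as ge

# ge.read_csv returns a pandas DataFrame augmented with expectation methods.
# "orders.csv" and its columns are hypothetical.
df = ge.read_csv("orders.csv")

# Each expectation is both executable validation and human-readable documentation.
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

# Run everything declared so far and inspect the overall verdict.
results = df.validate()
print(results.success)
```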

Clouds: Pulumi

Self-definition: Modern Infrastructure as Code. Infrastructure as actual code, not config files: no YAML horror, and you can write it in Python. Period.
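To illustrate, a minimal sketch of a Pulumi program in Python that declares an S3 bucket; it assumes a Pulumi project configured with the AWS provider, and the resource names are our own.

```python
import pulumi
from pulumi_aws import s3

# Declare a bucket as a plain Python object; running `pulumi up`
# diffs the desired state against reality and applies the change.
bucket = s3.Bucket(
    "raw-data",
    acl="private",
    tags={"team": "data-engineering"},
)

# Stack outputs surface resource attributes to other stacks and to the CLI.
pulumi.export("bucket_name", bucket.id)
```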

Deployment: Fly

Self-definition: Deploy app servers close to your users. Dark magic that deploys your Docker images in no time.

Serving Data: Streamlit

Self-definition: The fastest way to build and share data apps. Shiny for Python: it straps a beautiful interactive interface onto your Python logic.
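A minimal sketch of such a data app, with generated stand-in data; save it as app.py and run `streamlit run app.py`.

```python
import numpy as np
import pandas as pd
import streamlit as st

st.title("Daily events")

# Every widget interaction re-runs the script top to bottom;
# that is the whole programming model.
days = st.slider("Days to show", min_value=7, max_value=90, value=30)

# Fake data standing in for whatever your pipeline serves.
df = pd.DataFrame({
    "day": pd.date_range("2021-01-01", periods=days),
    "events": np.random.default_rng(0).integers(100, 200, size=days),
}).set_index("day")

st.line_chart(df)
```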

Honorable mentions:

Computation: Saturn Cloud

Self-definition: Data Science & Machine Learning with Dask & GPUs. We still like the old tagline better: It's like Spark. Except you won't hate yourself.
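Saturn's bread and butter is hosted Dask; for the uninitiated, a minimal local sketch of the pandas-like API, with invented file names and columns.

```python
import dask.dataframe as dd

# Dask mirrors the pandas API but builds a lazy task graph over partitions.
df = dd.read_csv("events-*.csv")  # hypothetical partitioned CSV files

# Nothing runs until .compute(); the same graph executes on a laptop
# or, via Saturn Cloud, on a multi-node cluster with GPUs.
daily = df.groupby("day")["events"].sum()
print(daily.compute())
```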

CI/CD: GitHub Actions

Self-definition: Automate your workflow from idea to production. The one tool we genuinely feel “how could we live without it?” about; see also git scraping.

Serving Data: Datasette

Self-definition: An open source multi-tool for exploring and publishing data. Speaking of Simon Willison: it showcases the power of SQLite, puts an interface on top of your data, and can be Dockerized with a few lines of code.
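A minimal sketch of the workflow: build a SQLite file with Python's standard library, then point Datasette at it. The table and rows are our own; the final command runs in a shell.

```python
import sqlite3

# Build a tiny SQLite database; this single file is all Datasette needs.
conn = sqlite3.connect("tools.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS pipies (year INTEGER, category TEXT, winner TEXT)"
)
conn.executemany(
    "INSERT INTO pipies VALUES (?, ?, ?)",
    [(2021, "Data Acquisition", "Airbyte"), (2021, "Data Warehousing", "ClickHouse")],
)
conn.commit()
conn.close()

# Then, from a shell:
#   datasette serve tools.db
# ...and you get a browsable, queryable web UI over the file.
```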


The Pipies 2021

Big shout-out and thank you to all the teams responsible for these products; it's a pleasure for us to have so many amazing options to choose from.

If you haven't yet, we recommend that every data professional check out these tools and compare them with the solutions they already know. Let us know what you think about the winners, and feel free to send us your preferred data tools and updates from this year!

See you in 2022, when The Pipies return.

Disclosure: Pipeline Academy is not financially affiliated with any of the above organisations.