Pipeline Data Engineering Academy

The Data Janitor Letters - November 2020

Data engineering salon. News and interesting reads about the world of data.

Marketers are Addicted to Bad Data
Jacques Corby-Tuech, Marketing Operations Manager, CyberSmart

36% of people in the UK use an adblocker.



AWS S3 — Disaster recovery using versioning and objects metadata
Jacek Małyszko, Data Engineer, Fandom

Accidental removal of data on S3 is something that no Data Engineer on AWS wants to be involved in.
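As a quick illustration of the mechanism the post builds on, here is a minimal boto3 sketch (bucket name and key are placeholders, not from the article) that enables versioning and restores an accidentally deleted object by removing its delete marker:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"                   # placeholder bucket name
KEY = "raw/events/2020-11-01.parquet"       # placeholder object key

# Turn on versioning so every overwrite or delete keeps the previous version around.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# After an accidental delete, S3 leaves a "delete marker" as the latest version.
# Removing that marker makes the most recent real version visible again.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)
for marker in versions.get("DeleteMarkers", []):
    if marker["Key"] == KEY and marker["IsLatest"]:
        s3.delete_object(Bucket=BUCKET, Key=KEY, VersionId=marker["VersionId"])
        print(f"Removed delete marker {marker['VersionId']}; object restored.")
```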


ClickHouse, Redshift and 2.5 Billion Rows of Time Series Data
Brandon Harris, Cloud + Analytics, Discover Financial

In this post I show you how to synthesize billions of rows of true time series data with an autoregressive component, and then explore it with ClickHouse, a big-data-scale OLAP RDBMS, all on AWS.
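For a flavour of what "time series with an autoregressive component" means in practice, here is a minimal NumPy/pandas sketch (the size, coefficient and schema are illustrative, not the ones from the post):

```python
import numpy as np
import pandas as pd

# AR(1) process: each value depends on the previous one plus noise.
# x_t = phi * x_{t-1} + eps_t, with |phi| < 1 for a stationary series.
rng = np.random.default_rng(42)
n, phi = 1_000_000, 0.8                     # illustrative size and coefficient
noise = rng.normal(0.0, 1.0, size=n)

x = np.empty(n)
x[0] = noise[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + noise[t]

df = pd.DataFrame({
    "ts": pd.date_range("2020-01-01", periods=n, freq="S"),
    "value": x,
})
# From here the data can be written out and bulk-loaded into ClickHouse or Redshift.
df.to_parquet("ar1_series.parquet", index=False)
```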


Python and Parquet performance optimization using Pandas, PySpark, PyArrow, Dask, fastparquet and AWS S3
Russell Jurney, Principal Consultant, Data Syndrome

This post outlines how to use all common Python libraries to read and write Parquet format while taking advantage of columnar storage, columnar compression and data partitioning.
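As a small taste of the techniques covered, a hedged pandas/PyArrow sketch (paths and column names are made up for illustration) that writes a compressed, partitioned Parquet dataset and then reads back only the columns and partition it needs:

```python
import pandas as pd

# Tiny example DataFrame; the post works with far larger data on S3.
df = pd.DataFrame({
    "event_date": ["2020-11-01", "2020-11-01", "2020-11-02"],
    "user_id": [1, 2, 3],
    "amount": [9.99, 4.50, 12.00],
})

# Columnar compression plus hive-style partitioning on disk (or an s3:// path via s3fs).
df.to_parquet(
    "events_parquet/",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["event_date"],
)

# Column pruning and partition filtering: read only what the query needs.
subset = pd.read_parquet(
    "events_parquet/",
    engine="pyarrow",
    columns=["user_id", "amount"],
    filters=[("event_date", "=", "2020-11-01")],
)
print(subset)
```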


ClickHouse Capacity Estimation Framework
Oxana Kharitonova, SRE, Cloudflare

Our current insertion rate is about 90M rows per second.
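To put a number like that in perspective, a back-of-the-envelope Python sketch (the per-row size is an assumption, not Cloudflare's figure):

```python
# Rough capacity arithmetic for a 90M rows/second insertion rate.
rows_per_second = 90_000_000
bytes_per_row = 100                 # assumed average compressed row size
seconds_per_day = 86_400

rows_per_day = rows_per_second * seconds_per_day
bytes_per_day = rows_per_day * bytes_per_row

print(f"{rows_per_day:,} rows/day")              # ~7.8 trillion rows per day
print(f"{bytes_per_day / 1024**4:.0f} TiB/day")  # ~707 TiB/day at the assumed 100 B/row
```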


How to Build a Production Grade Workflow with SQL Modelling
Michelle Ark, Senior Data Engineer, Shopify

Currently, we have a warehouse consisting of over 100 models, and this validation step takes about two minutes.
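The validation step referenced here can be sketched in a few lines; this is a generic, hypothetical version (the file layout, in-memory database and EXPLAIN-based dry run are assumptions, not Shopify's implementation) that checks every model's SQL parses before it ships:

```python
import sqlite3
from pathlib import Path

# Hypothetical layout: one SELECT statement per file under models/.
MODELS_DIR = Path("models")

conn = sqlite3.connect(":memory:")   # stand-in for the real warehouse connection

failures = []
for model in sorted(MODELS_DIR.glob("*.sql")):
    sql = model.read_text()
    try:
        # EXPLAIN parses and plans the query without executing it,
        # a cheap dry-run-style check of each model.
        conn.execute(f"EXPLAIN {sql}")
    except sqlite3.Error as exc:
        failures.append((model.name, str(exc)))

for name, error in failures:
    print(f"FAILED {name}: {error}")
raise SystemExit(1 if failures else 0)
```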


The State of Open-Source Data Integration and ETL
John Lafleur, Co-Founder, Airbyte

Is an open-source (OSS) approach more relevant than a commercial software approach in addressing the data integration problem?