This is the first of what I hope is a series of articles to define the life of a data engineer; what it is, the daily workflow and the struggles, some key concepts, etc. Any feedback on my writing/insights would be appreciated and will be included in future posts; if all goes to plan, the last article should be infinitely better than the first. Thanks!
- Stéphane Burwash
Monday morning, 8h45 am.
You’re at the coffee machine, enjoying a nice cup with your coworkers and sharing the same platitudes as you always do - “How was the weekend?”, “What did you do? … Oh, that’s amazing! Setting up furniture? So exciting …” You know the drill: just going through the motions before starting another week of work.
Through the door comes the data engineer. You can recognize him from the helmet that would look great on anyone else but somehow seems to swallow his entire head, as well as the e-scooter that he carries under his arm. While popular with a fair few businessmen and being a very “in” mode of transport, in his hands, it just seems to cement the word “nerd” already firmly tattooed on his aura.
After dropping off his laptop, he walks into the kitchen to grab a cup of coffee - large, black americano, as always. He says hi to everyone, happy to join a conversation, but not great at starting one - he has that sort of open shyness that afflicts many developers.
Monday, 8h58 am.
You’re in the middle of describing your picnic at Parc Lafontaine (it’s in Montreal, you should go - it’s nice) when suddenly the data engineer looks at his phone and rushes out of the kitchen, leaving a “shit, I’m late” in his wake. As you watch him depart, sit at his desk, put his overly large headphones on and join his first Google Meet of the day, you stop and ask yourself: “Wait - what does this guy do again? Like, I know he looks busy a lot, and that his job probably has something to do with AI/data/<insert_buzzword_here>, but what does he actually DO? What are his tasks, his routines, his issues, and his successes? And why does he always have to swear so much?”
Data Engineering - What Is It?
Before answering the question “What does a data engineer do?”, we first need to establish “What is a data engineer?”.
In classic modern blog post fashion, we asked ChatGPT to answer this question for us:
A data engineer is a professional who designs, develops, and manages the infrastructure and systems required to handle large volumes of data. They play a crucial role in the data lifecycle, ensuring that data is collected, stored, processed, and made accessible for analysis and decision-making purposes.
…
In summary, data engineers bridge the gap between data sources and data consumers by building and maintaining the infrastructure necessary for efficient data processing and analysis. Their work is fundamental to enable organizations to leverage data effectively and derive meaningful insights from it.
While this seems like a very straightforward answer, it doesn’t give us much in terms of concrete responsibilities. This is by design, as the term “data engineer” is incredibly fluid and can encompass many responsibilities depending on the business’s data needs.
Therefore, it is easier to qualify a data engineer not by what they do, but by what their mandate is - bridging the gap between data sources and data consumers. Let’s look at a concrete example of how this can be done.
Data in a company - an end-to-end pipeline
Everyone has a different analogy for explaining the data team pipeline/workflow. I was going to use the obvious choice - crude oil to refined gasoline, but then I remembered that I love to cook, so we’re going to use a kitchen instead for our setting.
Our story will take place in a small falafel shop in the Mile End neighbourhood of Montreal, Falafel FooBar. When you order at FFB and sit down to eat, all you can see is the refined, finished meal; a beautiful falafel sandwich. That is because you are the client or end-user; you do not need to have an understanding of where the chickpeas were sourced or how the fries were cut, all you want is a delicious sandwich. But there are a few steps between raw flour and pita bread.
But let’s say you’re a very curious client - you WANT to know how your sandwich is made, and make sure no chickpea cows were harmed in the process of making your sandwich. First stop - the counter.
Behind the counter, you can see that someone is assembling the sandwich for you; no cutting of potatoes or making of batter here. Only a few bowls with freshly cooked falafels, hot pitas, and all the toppings your heart may desire. Why?
Time - It would be inefficient to have the person at the counter make fresh batter for every client.
Training / Responsibilities - It is not necessary for the employee to KNOW how the pita is made to serve it; all they need is to be able to assemble a delicious sandwich, have summary information about the product on hand, and know who to ask if a nosy client (yourself) has questions that are a bit too picky / targeted.
Standardization & quality control - We want to make sure every client gets the same, top-quality end products (falafels, pita, hummus, etc.) in their sandwich. If every person at the counter had to make their own products, there could be wild discrepancies from one sandwich to the next; with a standardized product, we can more easily ensure a great experience for every client. This also allows us to perform quality control on all outgoing products.
But you’re a particularly nosy client, and you want to see how the falafel is actually made.
Depending on the size of the restaurant, you could have only 1 person in the back doing all of the work, or you could have a small/medium/large team where everyone has different specialties & responsibilities.
In this particular falafel shop, we have 2 employees in the back. One’s role is more “resource-based”, and the other’s more “assembly-based”.
A resource expert in the kitchen is known as a pantry chef. Their role is to make sure the restaurant has all the ingredients it needs to perform properly. Their responsibilities include:
Source management - Order all necessary ingredients.
Scheduling & Capacity - Manage deliveries of ingredients so that they come at a regular pace and the restaurant always has everything it needs. This includes being ready to accept a variable number of orders at any time.
Ensure source freshness - Ensure that only the freshest ingredients go into our sandwiches, and understand the shelf life of the different ingredients - flour goes bad a lot more slowly than tomatoes.
Ensure source ingredient quality - Make sure that there are no rocks in our chickpeas or worms in our onions.
Source integration & exploration - Source new products depending on the needs of the client - oh, they like hot chilis now? Let’s find them some hot chilis then.
Staging ingredients - Prepare the ingredients to ease the preparation of the core blocks (falafel, etc.). This can take the form of measuring out the flour, soaking the chickpeas, chopping the onions, etc.
On the flip side, our assembly manager’s role is to turn the ingredients provided by the pantry chef into the end products to be used by the people at the counter. Their responsibilities include:
Creation of end products - Assembling the raw ingredients in order to create easy-to-use products (chickpeas + herbs + gram flour + baking powder + spices = falafel).
Establish quality checks for business use case - Quality checks for a falafel are different from the quality checks for an onion; we need to make sure it’s the right size, shape and crispiness. These are more subjective checks, and it takes real business knowledge & understanding of the client’s needs to set them.
Liaison between front & back of the kitchen - The pantry chef cannot know that we are missing tomatoes in the front without being told, and the counter employee does not necessarily understand that heirloom tomatoes are much harder and more expensive to source than regular beefsteak ones, even though they would taste slightly better. The assembly manager understands both the front and the back of the restaurant, and can therefore ensure that needs are communicated well across the entire kitchen.
With these 2 new additions, our kitchen schema would be complete:
And here we have it, the entire restaurant pipeline, from chickpea to delicious sandwich in your mouth.
So, thanks for the analogy, but how does this apply to data engineering?
At this point, it’s basically term substitution.
The counter employees represent your business intelligence team - they’re the ones that take your finished data objects (falafel, pita, deal, project, …) and turn them into insights (sandwich, platter, gross margin, team productivity, …).
The pantry chef is your classic data engineer; their mandate is to provide a variety of high-quality, fresh and available staged data (weighed flour, soaked chickpeas, chopped onions). In this case, staged data refers to “vetted and slightly prepared data”. It still resembles the source data but is now turned into a basic building block that can be used by the rest of the kitchen.
Depending on the workload, they can also take on a lot of assembly work to help out the assembly manager, who represents the analytics engineer in your modern data team. “Analytics engineer” is a term coined by dbt Labs for an employee whose role is to transform the staged data into data products that can be used by the BI team. This requires a deep understanding of both the source data and the business objects that need to be turned into data products. The role is fairly new and has historically been filled by either data engineering or business intelligence.
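For the code-minded, here is a rough Python sketch of that hand-off, using only the standard library. Everything in it is an assumption made up for illustration: the `RawOrder` shape, the function names and the checks themselves. A real pipeline would lean on a warehouse and dedicated tooling, but the division of labour is the same: the data engineer checks freshness and quality and stages the rows, and the analytics engineer turns the staged rows into a data product.

```python
from collections import defaultdict
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RawOrder:
    """One row from a source system. The shape is hypothetical."""
    order_id: str
    item: str
    price_cents: Optional[int]
    loaded_at: datetime


def is_fresh(rows: list[RawOrder], now: datetime, max_age: timedelta) -> bool:
    """Freshness check: the newest row must be no older than max_age."""
    if not rows:
        return False
    newest = max(row.loaded_at for row in rows)
    return now - newest <= max_age


def passes_quality(row: RawOrder) -> bool:
    """Quality check: no rocks in the chickpeas (missing ids or prices)."""
    return (
        bool(row.order_id)
        and row.price_cents is not None
        and row.price_cents >= 0
    )


def stage(rows: list[RawOrder]) -> list[RawOrder]:
    """Data engineer: keep the vetted rows and lightly tidy the item name."""
    return [
        replace(row, item=row.item.strip().lower())
        for row in rows
        if passes_quality(row)
    ]


def revenue_by_item(staged: list[RawOrder]) -> dict[str, int]:
    """Analytics engineer: turn staged rows into a small data product."""
    totals: dict[str, int] = defaultdict(int)
    for row in staged:
        totals[row.item] += row.price_cents
    return dict(totals)
```

The BI team (our counter employees) would only ever see the output of `revenue_by_item`; they never need to know that some rows were thrown out along the way, only who to ask if a nosy client wants to know.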
With this said, here is our final data team pipeline:
Of course, there are many other roles within a data team; data scientist, data product manager, data director, …, but to understand data engineering, this should be sufficient.
In summary
If you have to retain anything from this article, here would be the main elements:
Data engineering is a broad practice which can take many different forms depending on your place of work, the requirements of the data team, the volume of data, etc.
It usually revolves around the extraction, storage & management of source data.
Analogies are AWESOME and you should use them more in your everyday life.
We also looked into the responsibilities of a typical data engineer, which usually revolve around:
Source management
Scheduling & Capacity
Ensure source freshness
Ensure source quality
Source integration & exploration
Staging data for downstream transformation
I am fully conscious that this is a gross simplification of the role and does not go into the specifics (day-to-day, responsibilities, key concepts, etc.). I’ll try to get through those one by one in the following articles.
Thanks for reading!