Hiring for Data Engineering in 2024
If a “Data Engineer” or “Machine Learning Engineer” were the same as a “Software Engineer”, then interviewing them the same way would make sense.
But they are not.
They have different responsibilities. They do different things. They handle different bugs. They think about different problems. Hence the different title.
So what should a “Data Engineer” or “Machine Learning Engineer” do?
Make things as easy as possible for those who consume the data
Ensure the quality of the data that others consume
Work with software engineers to ensure good data practices
And what additional responsibilities should a “Machine Learning Engineer” have?
Support people in the “Data Scientist” role.
Stand up and maintain the infrastructure for tracking experiments.
Handle
So how would you evaluate people for these kinds of tasks?
I talk to a lot of different practitioners in this role. One of the common complaints is getting hit with the typical "leetcode" or "gotcha" questions that are given to software engineers. In many cases, these questions are about signaling to the interviewer that you have practiced coding interviews in the same way that they did. It is a bit of a handshake that you came from the same background.
This rarely gives useful information about competence in the role.
In this case, the result of this kind of question is getting a "data person" who is actually just like them. Unfortunately, that will probably not get them closer to solving the problems that made them put up a req for a data engineer.
Given the stakes involved with an interview - that you may be onboarding someone who can bring great upside or cause great damage - it is worth being very deliberate about the kinds of questions that are asked of candidates.
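One way to evaluate understanding is via Bloom's Taxonomy, which ranks levels of understanding from recalling knowledge at the bottom, through comprehension, application, and analysis, up to synthesis and evaluation at the top.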
Wikipedia and Google have made trivia contests seem a bit pointless. What is the point of memorizing facts that can be trivially looked up in seconds?
Regrettably, many companies barely climb one or two levels higher in the taxonomy in how they evaluate candidates. They ask questions covering a small number of patterns that interviewees mostly memorize. After the interviews are over, the candidates promptly forget the details of fiddling with binary trees or solving mazes until the next time they have to go through this dance.
A more informed approach, especially for an actually "senior" role, would be to test at the "synthesis" and "evaluation" levels.
After all, this is one reason why people are so excited about modern AI capabilities from techniques such as RAG. We have long been able to retrieve structured information (trivia and knowledge), but now we have programs that can evaluate and synthesize information.
I have interviewed over 120 different candidates across different roles. Perhaps it was my time as a piano teacher or calculus lecturer that made me think carefully about evaluating the knowledge and capabilities of the candidates.
Here are some questions that I think could be useful for interviewing data engineers:
"Design a data pipeline that does X"
"How do you feel about this stack? What would you change?"
"How would you choose between using X and Y tools?"
"What are some practices that you have strong opinions about?"
"What are some ways in which you have helped a team with data?"
"What is something in the data space that you are excited about?"
"What are some ways in which you have handled data that changes over time?"