Ask Jet Schuett: the road to a self-taught Data Engineer

Sertis
13 min read · Jul 25, 2022


From data analyst to self-taught data engineer, Jet Schuett started as one and ended up as the other (so far). During one project as a data analyst, he discovered a hidden passion for data warehousing, data pipelining, and pipeline orchestration, along with an enthusiasm for maintaining a system. Those are the obsessions of a data engineer, not an analyst. So he decided to transition into data engineering. The big question is: how did he become a self-taught data engineer? That is what we will find out in the very first Ask the Experts series, “Ask Jet Schuett: the road to a self-taught Data Engineer”.

For those who dream of becoming an outstanding data engineer, Jet’s experience of working through all the learning and practice, and the firsthand lessons he learned along the way, will truly help you.

As Jet puts it in this interview, the essence of success on the road to becoming a self-taught data engineer is to “know it well”: know the languages, the tools, and the values. As for the why and the how, you will have to discover those for yourself in the interview.

What is it like being a data engineer?

When people ask me what I do, the one-liner answer I give them is either “I write code to move data around” or “I make the data available when and where people need it”. The former is a descriptive answer — what I do, as a matter of fact, as one would observe me on a given day, whereas the latter is more of a normative one — the purpose of why my role exists in the first place.

Recently, I helped organize an event where I was tasked with coordinating the order of service and the presentation slides throughout the event. My prayer for the service was that it would run so smoothly that the guests wouldn’t even notice I was there. I think that is a good description of a successful data engineering project as well — that the data “just works” such that people forget there is a team of engineers working to support that.

What do you do currently?

I’m on a few different projects, at different stages and with different deliverables. The end goal (what the client sees and cares about) of most of the projects I’m on is a set of dashboards that use data from the data warehouse to help the client make decisions. At Sertis, this is mostly done by the data analysts, who bring in domain knowledge and business acumen to speak the client’s language and visualize the data in the way that best helps answer their business questions. But of course, the data in these warehouses has to be updated, transformed, aggregated, and maintained on a regular basis. This is where I come in — developing “stuff” so that these processes are as automated and as resilient to errors as possible.

In terms of volume, I currently manage three production-stage data warehouses, each of which has tens to hundreds of tables, ranging in size from hundreds of GBs to tens of TBs, and has a few dozen pipelines attached that are run on a daily, weekly, and monthly basis. The sheer number of moving parts highlights the importance of making things automated and error-resilient, and of trying to be proactive in identifying issues so that they are mitigated before the client sees them.

How did you get here?

I started out as a data analyst at Sertis about four years ago. Soon after, I became part of a demand forecasting project for a major FMCG manufacturer in Thailand, whose development period lasted roughly two years. Towards the end of the development phase, when productionization of the models was imminent, one of my colleagues and I started looking into things like data warehousing, data pipelining, and pipeline orchestration — all the things necessary to maintain a system that generates outputs for the client on a regular basis. That is when I realized I enjoyed working as a data engineer. So after that project wrapped up, I transitioned into focusing on data warehouses and data pipelines.

What do I need to become a data engineer?

I think there are three sides to a data engineer that one needs to master: The languages, the tools, and the value proposition. The languages are the programming skills one needs to do stuff with, e.g. Python and SQL; the tools are the domain-specific libraries that will get common tasks done faster and with fewer errors; the value proposition — the most important and most difficult in my opinion — is why your product matters.

The Language

On the language aspect, mastery of a programming language is a must to be proficient in data engineering. Different companies use different systems that work better in different languages. At Sertis, we mostly use Python; other companies might work more in Java or C++. But get started with one, and know it well — like, really well. By that I mean trying to know the internals of how that language works. Obscure bugs, weird errors, and inefficient implementations, among other things, should be much less common with this knowledge in mind.
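
One classic example of the kind of obscure bug that this knowledge helps avoid is Python’s mutable-default-argument behavior (a toy illustration for this article, not code from any of the projects mentioned here):

```python
def append_row(row, batch=[]):
    # The default list is created once, at function definition time,
    # and is then shared by every call that relies on the default.
    batch.append(row)
    return batch

print(append_row("a"))  # ['a']
print(append_row("b"))  # ['a', 'b'] -- surprising if you expected a fresh list

def append_row_fixed(row, batch=None):
    # The idiomatic fix: use None as a sentinel and build the list per call.
    if batch is None:
        batch = []
    batch.append(row)
    return batch
```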

The best way to learn a language, I think, is to skim through a couple of hours’ worth of tutorials/documentation and then start working on a few toy projects. Think of a task that interests you at the moment, and think about how to accomplish it using the language at hand. In my case, I was interested in trading in the financial markets, so my motivation to learn Python was to create a system to analyze financial information and automate my trades. The resources I used to get started were YouTube tutorials by Sentdex. I have also come to find articles by Real Python and deep dives by James Murphy and Corey Schafer to be incredibly enlightening, fun, and useful for more advanced topics.

It is also good practice to code the way other professionals code — i.e. following a certain “accepted” way of how code in a language should look. For Python, we call it being “pythonic”. Not only does it make the code look more professional, but the accepted norms are there for a reason: they usually make code maintenance easier, reduce the chances of bugs, and make for smoother collaboration with others. For some languages, the community has already codified these norms into something one can simply follow. For Python, the most explicit materials on this are PEP 8 (the Python style guide) and PEP 20 (the Zen of Python), but I find the documentation of the standard library modules to be very enlightening as well.
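
A small illustration of what “pythonic” tends to mean in practice (a toy example, not production code):

```python
names = ["ada", "grace", "katherine"]

# Works, but not idiomatic: manual index bookkeeping.
upper = []
for i in range(len(names)):
    upper.append(names[i].upper())

# "Pythonic": let the language handle the bookkeeping.
upper = [name.upper() for name in names]

# enumerate() when you genuinely need the position as well.
for position, name in enumerate(names, start=1):
    print(f"{position}. {name}")
```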

In addition to a programming language (Python, Java, etc.), proficiency in SQL is absolutely an advantage, if not a requirement, for being a versatile data engineer. The work of a data engineer necessarily involves manipulating large amounts of data, which is, for the most part, best done by a database system. I also believe the kind of SQL that data engineers use is not the same as the kind of SQL software engineers and other developers use. For them, most data easily fits into memory; the application only works with a few rows of data — most likely a single row — at a time; and the database is simply a way to manage state. Not so with data engineers. Databases are the core of what we do. We manipulate millions, if not billions, of rows at a time. Looping through each row is out of the question, and the data probably doesn’t fit into memory anyway. Efficiency is of paramount importance: writing an efficient query can mean shortening the execution time from a day to less than an hour. Writing SQL this way requires a different mindset from coding in other languages, because we don’t think in terms of functions and classes or loops and control flow, but in terms of rows and columns, i.e. vectorized operations and immutable data schemas.
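
To make that difference in mindset concrete, here is a toy sketch using SQLite from Python (the table and numbers are invented for illustration; real warehouses and engines will differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("A", 10.0), ("A", 25.5), ("B", 7.25), ("B", 3.0), ("B", 12.0)],
)

# Row-by-row thinking: pull everything into the application and loop.
# Fine for five rows, hopeless for millions.
totals = {}
for store, amount in conn.execute("SELECT store, amount FROM sales"):
    totals[store] = totals.get(store, 0) + amount

# Set-based thinking: describe the result and let the engine compute it.
totals = dict(conn.execute("SELECT store, SUM(amount) FROM sales GROUP BY store"))
print(totals)  # {'A': 35.5, 'B': 22.25}
```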

A data scientist colleague of mine once had to manipulate certain data using a Spark cluster. He wrote a query that got the job done but took a very long time to complete, so he asked me to help him optimize it. The result was an order-of-magnitude improvement in execution time. He then asked me, “why are data engineers so proficient in SQL?”, to which I responded, “why are data scientists so good at model training and optimization?”. It’s basically what we do. It’s what we need to do well.

That being said, it’s important to note that different databases have different SQL flavors, but a strong grasp of the common SELECT statement and of how the database engine processes a query will go a long way in dealing with most database systems. With that understanding, whenever I work on a new database engine that uses a different SQL flavor from what I’m used to, I simply read through the language documentation to see the syntax differences I need to be mindful of and the built-in functions and extra features I can utilize.

To get started, I think w3schools is a good resource for learning SQL. Their interactive exercises are very helpful for authoring queries and quickly testing the results without needing to set up a database of your own. Once you are familiar with the general syntax, I’d suggest practicing aggregating and filtering their toy data in various ways — perhaps come up with a business question of your own. The main thing about proficiency in SQL is becoming familiar with how to work with data and with a different programming paradigm.

The Tools

The second thing to learn is the tools — the domain-specific libraries that help us perform common tasks better. In my work, I can name a few different categories of tools — a non-exhaustive list — that I utilize on a regular basis: tools to move data, tools to store data, tools to manipulate data, tools to govern data, tools to orchestrate pipelines, and tools to ensure that everything works together well — and to let me know when it does not. These tools might be libraries. They might be cloud services. They might be an ecosystem of products that function together. The field is huge and rapidly expanding. Of course, the choice of platform — on-premise or on the cloud — and the programming language the company uses go a long way in dictating what tools are available and what needs to be learned, but again the basic goal is the same: know the tools, and know them well — well enough to understand why they behave the way they do, what they are good at and where they fall short, and how to squeeze every bit of capability out of them. For example, knowing Pandas well can mean the difference between a one-liner that manipulates a range of dates and ten lines of code that do the same thing, and understanding how Airflow parses DAG files can save a lot of time and irritation when figuring out why creating a DAG with a dynamic structure does not work.
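
For instance, the date-range example might look something like the following sketch (the specific dates are arbitrary):

```python
import pandas as pd
from datetime import date, timedelta

# The long way: build each date by hand.
dates = []
current = date(2022, 7, 1)
while current <= date(2022, 7, 31):
    dates.append(current)
    current += timedelta(days=1)

# The one-liner, once you know the library well.
dates = pd.date_range("2022-07-01", "2022-07-31", freq="D")
```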

Since the field is so huge, where does one start? If you already know what you’ll need, then begin there. But if not, here are some ideas to help you figure out a starting point:

  1. Data storage: learn about the different file formats, particularly columnar ones like ORC and Parquet files, and definitely learn about cloud storage services (GCS, S3, etc.)
  2. Data manipulation: outside of SQL, do learn Pandas — and master it. It’s really powerful in so many ways, especially if your data can fit in memory. More data can fit in memory than you might expect.
  3. Pipeline orchestration: my favorite tool is Airflow. It’s easy to understand (although there are a few sharp edges, like its infamous execution date logic), very extensible, and comes pretty much complete with all the features I need in most pipeline projects. Other tools you might come across are Prefect and NiFi. A minimal Airflow sketch follows right after this list.
  4. Data governance: this is quite format- and use case-specific. I use JSON Schema quite a lot because it’s quite versatile — a lot of data, especially the data we need to validate, can usually be described as dicts and lists; a small validation example also follows the list. In a relational database context, using constraints, triggers, and indexes can help ensure the quality of your data. I think it’s also good to know, even if only in theory, about data lineage, slowly changing dimensions (SCD), and change data capture (CDC).
  5. Monitoring: building a system does not stop at making it work. The key is for it to work when you need it to, and to know when it stops working. Monitoring the health of your system is very important for its reliability. Monitoring solutions come in different forms depending on how you deploy your system. Docker Compose comes with a health check feature to do something automatically if your system stops responding to a handshake; the Python-based Supervisord can also monitor processes and restart them automatically when they die. If you need something more full-featured, then Prometheus might be your next step. Cloud platforms also come with built-in monitoring and alerting services for cloud infrastructure.
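
As referenced in the orchestration item above, here is a minimal sketch of what an Airflow DAG can look like (the DAG name, schedule, and tasks are hypothetical placeholders, not one of the pipelines described in this interview):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder: pull raw data from the source system.
    print("extracting raw data")

def load():
    # Placeholder: write the transformed rows into the warehouse.
    print("loading rows into the warehouse")

with DAG(
    dag_id="ingest_sales",            # hypothetical pipeline name
    start_date=datetime(2022, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # run extract first, then load
```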

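And here is the small JSON Schema example mentioned in the data governance item, using the jsonschema library (the record layout is invented for illustration):

```python
from jsonschema import ValidationError, validate

# A hypothetical schema for one incoming record.
record_schema = {
    "type": "object",
    "properties": {
        "store": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
    },
    "required": ["store", "amount"],
}

good = {"store": "A", "amount": 10.0}
bad = {"store": "A", "amount": "ten"}

validate(instance=good, schema=record_schema)  # passes silently

try:
    validate(instance=bad, schema=record_schema)
except ValidationError as err:
    print(f"rejected record: {err.message}")
```
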
The Value Proposition

Lastly, and what I think is most important, is making my products valuable to the user. Not that I go about selling them to my colleagues, but if the very purpose of my role is to help them do their jobs more easily, then I had better be sure that what I build is actually useful to them. The question here is not one of how, but of what: what to build; what features are important; what values are fundamental to the system I’m building; what will make it reliable. Imagine building a house: an architect can know all about laying bricks and mixing cement and calculating tensile strengths and whatnot, but if the house the architect builds has only a bedroom and a living room, without a kitchen or a bathroom, then it wouldn’t be much use as a place of residence, would it? The same goes for a data warehouse or a data pipeline: if the code base follows all the coding best practices and is incredibly efficient in its implementation, but the data users cannot easily interact with the data, or the data is riddled with errors because all the data quality checks were ignored, then it wouldn’t be of much use either. It might be surprising, but in my experience, I mostly receive only a one-sentence “what I need” from the users; all the other things that make a system useful to them I had to come up with by talking to them, observing their work, understanding what is important to them and where their pains are, and getting feedback from them.

So how does one build a “value proposition” into a system? I’d say it begins with the attitude: that we are here to serve. We are here to help make the job of a data user (in my case mostly the data analysts) easier. With that attitude comes the action: How should I design the schemas, the tables, and the relationships in the database to best suit the needs of my users? What data quality checks should be in place to give the users peace of mind that what they see is as correct as we could possibly verify? What can I do to let the users know if there are problems with data ingestion on a particular day? What’s the easiest way for them to run queries against the data we have? There is no one-size-fits-all answer to these questions, so I put myself in the shoes of my users — their day-to-day activities, their technical competency and what they are comfortable with, and what they are trying to achieve — and then tailor a solution to their needs. The key here is to go the extra mile. I have learned that it’s the little things that differentiate an easy-to-use, reliable system from one that isn’t.

Final Thoughts

So how does one become a data engineer? Know the language, know the tools, and know them really well. Start small, build a toy project, then try expanding on it by incorporating more and more of what you are learning. But the most important thing, I think, is the attitude. We are here to serve. I ask myself: what do my users do, and how can I make my system serve them better? Faithfulness and integrity are key here: am I being faithful to the ones I’m serving if I know there is something I can do for them but don’t, just because I’m not “required” to do it? And since a lot of people depend on the data I provide being correct and complete, I cannot be a good data engineer if I see a problem in the system or the data and do not work to remediate it, even if that is difficult and tedious. At the end of the day, I look forward to the heavenly accolade that says, “Well done, good and faithful servant. You have been faithful over little; I will set you over much. Enter into the joy of your Master.” With this mindset, you’ll be building systems that are useful to your users, and along the way you’ll learn new things that differentiate you and your work, and you’ll be able to look back and rejoice in the works your hands have made. Soli Deo gloria!

Written by: Jet Schuett

Explore yourself and learn by doing at a leading technology company surrounded by talented people from around the globe. Check out our open positions at the link below: https://www.careers.sertiscorp.com/jobs


Written by Sertis

Leading big data and AI-powered solution company https://www.sertiscorp.com/
