Data engineering projects often look alike in their structure, as shown in Figure 2: the client has source data that needs to be extracted, transformed, and loaded (ETL) into an analytic data store. This analytic data store is used to inform data models whose results are then presented visually to users.
Figure 2: Typical Data Engineering Pipeline
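To make the shape of this pipeline concrete, the following is a minimal sketch that uses only the Python standard library. Everything in it is an illustrative assumption rather than part of any particular client project: the inline CSV stands in for source data, an in-memory SQLite database stands in for the analytic data store, and the names `extract`, `transform`, `load`, and `revenue_by_region` are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real project would read client files or databases.
SOURCE_CSV = """order_id,region,amount
1,east,10.50
2,west,4.25
3,east,7.00
"""


def extract(text):
    """Extract: read raw rows (all values are strings) from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and normalize region names."""
    return [
        (int(row["order_id"]), row["region"].upper(), float(row["amount"]))
        for row in rows
    ]


def load(records, conn):
    """Load: write the cleaned records into the analytic data store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def revenue_by_region(conn):
    """Stand-in for the data model step: aggregate results for presentation."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


# An in-memory SQLite database plays the role of the analytic data store.
conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
summary = revenue_by_region(conn)
```

Keeping each stage behind its own function is one way to preserve the overall structure while the details change: a different source format or target store would mean replacing `extract` or `load` without touching the other stages.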
However, while the structure of data engineering projects rarely changes, the details often do.
Client environments can range from entirely cloud-based platforms such as AWS or Azure to bespoke on-premises data centers, or some mix of the two. Source data stores can include file storage systems with documents in dozens of formats, SQL databases from multiple vendors, and specialized NoSQL solutions. The client might also have existing software licenses that they would like to use, or existing models and ETL processes that must be integrated.
The data engineer must be ready to handle any combination of these components by being pragmatic, principled, and practiced: