When a data project grows too complex for a single script to handle, data engineers, scientists and developers start looking for a better way to manage multiple types of data processes.
This is exactly what happened to me when I started: I found a great project called Luigi (open sourced by Spotify). Luigi is a data orchestration framework that lets you both reuse data processes and define dependencies between them. Frameworks like Luigi allow developers to create performant, complex data pipelines (combinations of multiple processes and data sources). The concept isn’t new, but it is key to building a successful data platform.
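To make the dependency idea concrete, here’s a minimal sketch of two Luigi tasks (the task names, file paths and placeholder logic are illustrative, not from a real pipeline):

```python
import luigi


class ExtractSales(luigi.Task):
    """Pull raw sales data for one day to a local file."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw/sales_{self.date}.csv")

    def run(self):
        with self.output().open("w") as out:
            out.write("order_id,amount\n")  # stand-in for a real extract


class TransformSales(luigi.Task):
    """Runs only after ExtractSales; Luigi resolves the dependency."""
    date = luigi.DateParameter()

    def requires(self):
        return ExtractSales(date=self.date)  # the dependency declaration

    def output(self):
        return luigi.LocalTarget(f"data/clean/sales_{self.date}.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read())  # stand-in for a real transform
```

Running something like `luigi --module pipeline TransformSales --date 2020-01-01 --local-scheduler` executes both tasks in order and skips any task whose output already exists; that idempotency is a big part of what makes the processes reusable.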
Data pipelines, in combination with other features (data warehouse modelling, monitoring, etc.), can be used to create a modern data platform. However, a data platform is much more than this, from a technical, business and cultural perspective.
Some of the common technical mistakes include:
This isn’t always an advantage. When breaking monolithic systems apart, bear in mind that the parts may still need to communicate. That adds its own complexity, which in many situations can outweigh the benefit of separation. However…
Data processes are very loosely coupled, if coupled at all. This makes them ideal candidates for separation. Why should the environment that runs a SQL query be the same as the one that trains a machine learning model?
This has led the team at Datasparq to create individual services for the cloud data products in our architecture. Combined with “single source of truth” best practices, this has enabled us to build new projects quickly.
The diagram below shows how different types of data stakeholder interact with one of our services. The selected service, “warehouse operations”, handles all processes carried out by the data warehouse (queries, table creation, truncation, etc.):
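As a rough sketch of what calling such a service might look like, consider the snippet below. The endpoint URL, route, payload shape and function name are all assumptions made for illustration; this is not Datasparq’s actual API.

```python
"""Illustrative only: the URL, route and payload shape are assumed."""
import requests

WAREHOUSE_OPS_URL = "https://warehouse-ops.example.com"  # hypothetical endpoint


def run_query(sql: str) -> dict:
    """Ask the warehouse-operations service to run a query.

    Callers (orchestrators, analysts' tools, ML jobs) only need this
    contract; they never touch the warehouse engine directly, so the
    engine can change without touching every pipeline.
    """
    resp = requests.post(
        f"{WAREHOUSE_OPS_URL}/query",
        json={"sql": sql},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()


# A pipeline step and an analyst's notebook can share the same call:
run_query("CREATE TABLE staging.orders AS SELECT * FROM raw.orders")
```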
We’ve made a trade-off here: a slight increase in the initial development of processes, some of which come pre-built in classic orchestration solutions.
However, it’s been worth it; we can now: