Everyone starts somewhere
When a data project outgrows what a single script can handle, data engineers, scientists and developers start looking for a better way to manage multiple types of data processes.
This is exactly what happened to me when I started. I found a great project called Luigi (open-sourced by Spotify). Luigi is a data orchestration framework which gives you the ability both to reuse data processes and to define dependencies between them. Using frameworks like Luigi allows developers to create performant, complex data pipelines (combinations of multiple processes / data sources). This concept isn’t new, but it is key to building a successful data platform.
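To make the dependency idea concrete, here is a minimal sketch in plain Python of what frameworks like Luigi provide: tasks declare what they require, and a runner executes those dependencies first. This is a conceptual illustration only, not Luigi’s actual API (real Luigi tasks subclass `luigi.Task` and declare `requires()` and `output()` methods); the task names and runner are invented for the example.

```python
# Conceptual sketch of dependency-driven task execution, as provided by
# orchestration frameworks like Luigi. The Task class and runner here are
# illustrative, not Luigi's real API.

class Task:
    def requires(self):
        return []          # upstream tasks that must complete first

    def run(self, results):
        raise NotImplementedError

def execute(task, done=None):
    """Run a task after recursively running everything it requires."""
    if done is None:
        done = {}
    if type(task) in done:         # each task runs at most once
        return done[type(task)]
    upstream = [execute(dep, done) for dep in task.requires()]
    done[type(task)] = task.run(upstream)
    return done[type(task)]

class ExtractSales(Task):
    def run(self, results):
        return [10, 20, 5]         # stand-in for reading a source system

class TotalSales(Task):
    def requires(self):
        return [ExtractSales()]    # dependency: extract before aggregating

    def run(self, results):
        return sum(results[0])

total = execute(TotalSales())      # ExtractSales runs first, then TotalSales
```

Because tasks only declare *what* they depend on, the same extract step can be reused by any number of downstream pipelines, which is the reuse benefit described above.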
Data Platforms are more than a collection of processes
Data pipelines, in combination with other features (data warehouse modelling, monitoring, etc), can be used to create a modern data platform. However, a data platform is much more than this, from a technical, business and cultural perspective.
Some of the common mistakes (technical) include:
- Overhead of maintenance. Every system carries a degree of maintenance (yes, even serverless products); you can accept it, or accept the consequences. Data processes are modified and contributed to frequently, so regular security, library, monitoring and functionality updates are essential. Cloud providers reduce these overheads via managed applications (Google Cloud Composer) or serverless solutions (Azure Data Factory), but they do not remove them entirely.
- Extendibility. How will we manage requirements from teams with different needs than ours? Change is inevitable. Business users will want to build valuable complex processes and share results with others.
- Slow to implement. Platform projects can run for long periods, by which point the business processes they were built for may have moved on.
- Lack of documentation. A data platform needs use cases and business users to drive data-driven decisions and create products in order to generate value; without these it is powerless. If people can’t find and understand the data within it, its potential to generate value is seriously limited.
Services break large problems into smaller, reusable parts
This isn’t always an advantage. When breaking monolithic systems apart, bear in mind parts may still need to communicate. This can add other forms of complexity, which in a number of situations may outweigh the benefit of separation. However…
Data processes are very loosely coupled, if coupled at all. This makes them ideal candidates for separation. Why should the environment that runs a SQL query be the same as the one that trains a machine learning model?
This led the team at Datasparq to create individual services for the cloud data products in our architecture. Combined with “single source of truth” best practices, this has enabled us to build new projects quickly.
The diagram below shows how different types of data stakeholders interact with one of our services. The selected service, “warehouse operations”, handles all processes carried out by the data warehouse (queries, table creation, truncation, etc.):
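As a rough illustration of the pattern, a pipeline step would send a small structured request to the warehouse-operations service rather than talking to the warehouse directly. The endpoint shape, operation names and payload fields below are assumptions for illustration, not Datasparq’s actual API:

```python
# Hypothetical request builder for a "warehouse operations" style service.
# Operation names and payload fields are invented for this sketch.
import json

def build_request(operation, table, sql=None):
    """Serialise one warehouse operation as a JSON payload."""
    payload = {"operation": operation, "table": table}
    if sql is not None:
        payload["sql"] = sql
    return json.dumps(payload)

# A pipeline step might POST payloads like these to the service:
run_query = build_request(
    "run_query", "sales.daily_totals",
    sql="SELECT date, SUM(amount) FROM sales.orders GROUP BY date",
)
truncate = build_request("truncate_table", "staging.orders")
```

The point of the pattern is that every caller, whether an orchestration pipeline or a business user’s tool, goes through the same narrow interface, so warehouse logic lives in exactly one place.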
No data solution is perfect
We’ve made a trade-off here: a slight increase in the initial development of processes, some of which come pre-built in classic orchestration solutions.
However, it’s been worth it. We can now:
- Replicate key services between multiple projects instantly, reducing development time and cost
- Run services on cloud serverless products, reducing maintenance and running cost across projects
- Innovate with new cloud products & techniques the moment they are available, snapping new services into development pipelines
- Follow metadata “single-source” best practice, allowing business users access to data catalogues whilst using them for our development activities (transformations, schema definitions, etc)