Don’t let your data lake become a data swamp
By Jason Bissell & Calvin Hoon April 24, 2018
- A successful system of real-time business insights starts with a system of trust
- Now everyone wants to create content, add context, enrich data, and share it with others
IN AN always-on, competitive business environment, organisations are looking to gain an edge through digital transformation. Consequently, many companies feel a sense of urgency to transform across all areas of their enterprise, from manufacturing to business operations, in the constant pursuit of innovation and process efficiency.
Data is at the heart of all these digital transformation projects. It is the critical component that drives smarter decision-making by empowering business users to eliminate gut feelings, unclear hypotheses, and false assumptions.
As a result, many organisations believe building a massive data lake is the ‘silver bullet’ for delivering real-time business insights. In fact, according to a survey by CIO review from IDG, 75% of business leaders believe their future success will be driven by their organisation’s ability to make the most of their information assets. However, only 4% of these organisations said they have set up a data-driven approach to successfully benefit from their information.
Is your data lake becoming more of a hindrance than an enabler?
The reality is that all these new initiatives and technologies come with a unique set of generated data, which creates additional complexity in the decision-making process. To cope with the growing volume and complexity of data and alleviate IT pressure, some are migrating to the cloud.
But this transition — in turn — creates other issues. For example, once data is made more broadly available via the cloud, more employees want access to that information.
A growing number and variety of business roles are looking to extract value from increasingly diverse data sets, faster than ever. This puts pressure on IT organisations to deliver real-time data access that serves the diverse needs of business users looking to apply real-time analytics to their everyday jobs.
However, it’s not just about better analytics — business users also frequently want tools that allow them to prepare, share, and manage data.
To minimise tension and friction between IT and business departments, moving raw data to one place where everybody can access it sounded like a good move. The term “data lake” was coined by James Dixon in 2010; he envisaged a large body of raw data in a more natural state, where different users come to examine it, delve into it, or extract samples from it.
However, organisations are increasingly realising that the time and effort spent building massive data lakes has frequently made things worse: poor data governance and management have turned many of them into so-called “data swamps”.
Bad data clogging up the machinery
Just as data warehouses failed to manage data analytics a decade ago, data lakes will become “data swamps” if companies don’t manage them correctly.
Putting all your data in a single place won’t in and of itself solve a broader data access problem. Leaving data uncontrolled, un-enriched, unqualified, and unmanaged will dramatically reduce the benefits of a data lake, as only a limited number of experts with a unique set of skills will be able to use it properly.
A successful system of real-time business insights starts with a system of trust. To illustrate the negative impact of bad data and bad governance, let’s take a look at what happened in the Dieselgate scandal.
The Dieselgate emissions scandal highlighted the difference between real-world and official air pollutant emissions data. In this case, the issue was not a problem of data quality, but of ethics, since some car manufacturers misled the measurement system by injecting fake data. This resulted in fines for car manufacturers running into the tens of billions of dollars and in consumers losing faith in the industry. After all, how can consumers trust the performance of cars now that they know the system of measurement has been intentionally tampered with?
The takeaway in the context of an enterprise data lake is that its value will depend on the level of trust employees have in the data contained in the lake. Failing to control data accuracy and quality within the lake will create mistrust amongst employees, seed doubt about the competency of IT, and jeopardise the whole data value chain, which then negatively impacts overall company performance.
A cloud data warehouse to deliver trusted insights for the masses
Leading firms believe governed cloud data lakes offer a way to overcome some of these traditional data lake stumbling blocks. The following four-step approach helps modernise a cloud data warehouse while providing better insight across the entire organisation.
- Unite all data sources and reconcile them: Make sure the organisation has the capacity to integrate a wide array of data sources, formats and sizes. Storing a wide variety of data in one place is the first step, but it’s not enough. Bridging data pipelines and reconciling them is another way to gain the capacity to manage insights. Verify the company has a cloud-enabled data management platform combining rich integration capabilities and cloud elasticity to process high data volumes at a reasonable price.
- Accelerate trusted insights to the masses: Efficiently manage data with cloud data integration solutions that help prepare, profile, cleanse, and mask data while monitoring data quality over time regardless of file format and size. When coupled with cloud data warehouse capabilities, data integration can enable companies to create trusted data for access, reporting, and analytics in a fraction of the time and cost of traditional data warehouses.
- Collaborative data governance to the rescue: The old model of a data value chain, where data is produced solely by IT in data warehouses and consumed by business users, is no longer valid. Now everyone wants to create content, add context, enrich data, and share it with others. Take the example of the internet and a knowledge platform such as Wikipedia, where everybody can contribute, moderate and create new entries in the encyclopedia. In the same way Wikipedia established collaborative governance, companies should instil collaborative governance in their organisation by delegating the appropriate role-based authority or access rights to citizen data scientists, line-of-business experts, and data analysts.
- Democratise data access and encourage users to be part of the Data Value Chain: Without making people accountable for what they’re doing, analysing, and operating, there is little chance that organisations will succeed in implementing the right data strategy across business lines. Thus, you need to build a continuous Data Value Chain where business users contribute, share, and enrich the data flow in combination with a cloud data warehouse multi-cluster architecture that will accelerate data usage by load balancing data processing across diverse audiences.
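As a rough illustration of the steps above, the following Python sketch reconciles two small data sources, cleanses and profiles the result, and masks a sensitive field depending on the user's role. It is a minimal sketch under stated assumptions, not a description of any particular product: the record fields, the role name `data_steward`, and every function name are hypothetical and chosen only for this example. Hashing an email address is also only a simple form of pseudonymisation, not full anonymisation.

```python
import hashlib
from collections import Counter

# Hypothetical records from two sources; field names are illustrative only.
crm_rows = [
    {"customer_id": "C1", "email": "Ann@Example.com ", "country": "SG"},
    {"customer_id": "C2", "email": "", "country": "MY"},
]
billing_rows = [
    {"customer_id": "C1", "email": "ann@example.com", "country": "SG"},
    {"customer_id": "C3", "email": "bob@example.com", "country": None},
]


def cleanse(row):
    """Trim whitespace, lower-case emails, and turn empty strings into None."""
    cleaned = {}
    for key, value in row.items():
        if isinstance(value, str):
            value = value.strip()
            if key == "email":
                value = value.lower()
            value = value or None
        cleaned[key] = value
    return cleaned


def reconcile(*sources):
    """Merge sources on customer_id; later sources only fill missing fields."""
    merged = {}
    for source in sources:
        for row in map(cleanse, source):
            target = merged.setdefault(row["customer_id"], {})
            for key, value in row.items():
                if target.get(key) is None:
                    target[key] = value
    return list(merged.values())


def profile(rows):
    """Count missing values per field as a very simple quality metric."""
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                missing[key] += 1
    return dict(missing)


def mask_email(email):
    """Replace an email with a short one-way hash (pseudonymisation only)."""
    if email is None:
        return None
    return hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]


# Hypothetical role list: only these roles may see unmasked emails.
ROLES_WITH_RAW_ACCESS = {"data_steward"}


def view_for(role, rows):
    """Return the rows as the given role is allowed to see them."""
    if role in ROLES_WITH_RAW_ACCESS:
        return [dict(row) for row in rows]
    return [{**row, "email": mask_email(row["email"])} for row in rows]


customers = reconcile(crm_rows, billing_rows)
quality_report = profile(customers)
analyst_view = view_for("analyst", customers)
```

Here `quality_report` gives a per-field count of missing values that could be monitored over time, while `view_for` shows how role-based rights can decide who sees raw values and who sees masked ones.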
In summary, think of data as the next strategic asset. Right now, it’s more like a hidden treasure at the bottom of many companies. Once modernised, shared and processed, data will reveal its true value, delivering better and faster insights to help companies get ahead of the competition.
Jason Bissell is the general manager of Asia Pacific and Japan at Talend Inc while Calvin Hoon is the regional VP of sales, Asia Pacific at Talend Inc.