Data lake technology has drawn the attention of organizations that need a place to hold massive amounts of raw data until it is needed for analytics applications.
The data lake storage market is set to grow rapidly, and providers tout benefits such as storage scalability and cost savings.
“While it remains an emerging solution, data lake storage is an increasingly popular approach to data architecture,” said Gene Locklear, AI research scientist at Sentient Digital, a technology solutions provider that serves government and commercial clients.
Unlike a data warehouse, which requires data to be cleaned and structured into a predefined schema before it is loaded, a data lake stores data in its native format. This capability eliminates the need to restructure data before organizations use it for various types of analytics.
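The schema-on-read idea behind that distinction can be sketched in a few lines. This is a minimal, hypothetical illustration using a local directory as a stand-in for lake storage; the file names and event fields are invented for the example:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical mini "lake": raw events land in their native JSON form,
# with no upfront schema imposed at write time.
lake = Path(tempfile.mkdtemp()) / "raw" / "events"
lake.mkdir(parents=True)

# Ingest: write the record exactly as it arrived (no restructuring).
event = {"user": "alice", "action": "login", "meta": {"ip": "10.0.0.1"}}
(lake / "event-0001.json").write_text(json.dumps(event))

# Analytics: structure is applied only at read time, pulling just the
# fields a given query needs -- schema-on-read.
records = [json.loads(p.read_text()) for p in lake.glob("*.json")]
actions = [(r["user"], r["action"]) for r in records]
print(actions)  # [('alice', 'login')]
```

A warehouse would instead force the `meta` sub-object into fixed columns before loading; here the raw record is kept whole and each consumer decides what to extract.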
To choose the right provider, organizations must understand both their requirements and their options. Learn where data lakes prove most advantageous and the key buying points of this technology. Then compare the main features of major data lake providers.
Benefits of data lake storage
The technology’s primary beneficiaries include sales, marketing and customer support organizations, said Radhakrishnan Rajagopalan, global head of technology services at IT consulting and services company Mindtree.
“A data lake brings together diverse data onto a single unified data platform, enabling agile decision-making,” Rajagopalan said.
Data lakes are scalable, which enables adopters to store data in a relatively inexpensive manner. The technology also helps to decommission legacy analytics applications, which frees up capital and resources.
“It also allows companies that perhaps had legacy applications and databases to move the data to a more cost-effective storage mechanism,” said Craig Kelly, vice president of analytics at IT consulting firm Syntax.
Businesses increasingly make decisions based on insights derived from data.
“For many companies, data lakes are more economical than data warehousing, especially where speed of data retrieval is key,” Rajagopalan said.
Picking a data lake provider
Selecting a provider hinges on the type of storage platform — on premises or cloud — as well as the organization’s data governance and data types.
Data lake hosting. On-premises data lakes are most effective when the adopter invests in long-term infrastructure — including storage space, power, hardware and software — as well as the talent necessary for running the systems, Rajagopalan said. Data lakes in the cloud are best for organizations that want to outsource and need a nimble infrastructure.
Security. A match with the organization’s security and accessibility profile is the most important attribute to look for in a data lake cloud storage provider.
“There’s an inevitable tradeoff between security and ease of access and processing,” Locklear said. “If you’re working with a provider that emphasizes safety versus ease of use, or vice versa, in contrast with your priorities, you’re going to struggle from day one.”
Ensure that the provider encrypts data, both in transit and at rest.
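On AWS, for example, at-rest encryption can be made the bucket default rather than a per-upload choice (transit encryption is handled by TLS on the service endpoints). A sketch of the relevant CLI call, with a hypothetical bucket name and assuming credentials that allow `s3:PutEncryptionConfiguration`:

```shell
# Hypothetical bucket; applies SSE-KMS to all new objects by default.
aws s3api put-bucket-encryption \
  --bucket example-data-lake \
  --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms"}}]}'
```

This is a configuration fragment, not a complete setup; other providers expose equivalent controls under different names.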
Data handling. The data lake should easily handle all data types, whether structured, semistructured or unstructured.
“Organizations will generate all forms exponentially,” Kelly said.
Examples of data lake providers
Many major storage technology vendors, including IBM and HPE, can help enterprises build an on-premises data lake. Microsoft Azure and AWS are the largest cloud-based data lake providers.
- Data Lake on AWS combines the core AWS cloud services needed to tag, search, share, analyze and govern subsets of data, according to the vendor. Features include a managed storage layer, encryption at rest through AWS Key Management Service and data access flexibility.
- HPE touts its Apollo 4200 Gen10 Plus storage server as a building block for a modern data lake. It suits data-centric workloads and features NVMe flash, persistent memory, and high throughput and low latency required for in-place analytics, according to HPE.
- IBM offers data lake deployment through its Power and Spectrum Scale products. Organizations can choose from on-premises, cloud and hybrid options. Through a partnership with Cloudera, IBM provides data governance, security and analytics.
- Microsoft Azure Data Lake is a cloud service that stores and analyzes petabyte-size files and trillions of objects. Microsoft manages the Data Lake product. It includes data encryption at rest and in motion, multifactor authentication and auditing.
Many smaller players — including Dremio and Databricks with Delta Lake — are also entering the market, potentially leading to a wider supply of options at lower prices.
“As current trends continue into 2022, a wholesale migration to data lake storage may well be in the cards,” Locklear said.
Data lake storage challenges

Data lake adopters face problems, particularly cost issues, with storing, updating and retrieving massive amounts of data. Much of that data goes unused.
“Companies become data hoarders,” Locklear said. “Nobody wants to be the one who says, ‘delete it.’”
Meanwhile, without constant, vigilant management, data lakes can gradually become less effective.
“The risk is the so-called ‘data graveyard,’ where potentially relevant data becomes lost among unnecessary files, skewing metrics and analytics,” Locklear said.