Dremio Dart Initiative to optimize SQL queries across open data lakes
Dremio today launched an initiative to make it simpler to deploy its namesake in-memory SQL engine across the multiple open storage platforms that make up a data lake, and to run queries of any size twice as fast.
The goal of the Dremio Dart Initiative is to increase the performance of those SQL queries by a factor of five in the next 12 months, said Tomer Shiran, company founder and chief product officer at Dremio.
Rather than copying data into a proprietary data warehouse deployed in the cloud or in an on-premises environment, Shiran said it’s more cost-effective to create a data lake and use Dremio’s SQL engine to directly access data stored in, for example, Amazon S3 or Azure Data Lake Storage (ADLS). That approach also eliminates the need to aggregate tables, extract data, or employ a separate online analytical processing (OLAP) cube to structure data in a way that is compatible with SQL, Shiran added.
Dremio is now allowing IT teams to deploy unlimited table sizes spanning an unlimited number of partitions and files, with all data becoming near-instantaneously available as it is added to the data lake. Dremio is also adding support for manifest-based metadata and version management to manage those larger data sets.
In general, Dremio is trying to make it simpler for end users who have SQL expertise to launch queries directly against a data lake, and to enable users of business intelligence (BI) applications to launch queries via the Dremio engine without having to know SQL. Regardless of the approach, the goal is to make sure organizations are not locked into a proprietary data warehouse deployed in a cloud in the same way they are in an on-premises IT environment. “They are not in love with that being their future again,” Shiran said.
Optimizing queries for the data lake
As part of the Dart Initiative, Dremio is optimizing query planning by automatically gathering statistics about the underlying data, which enables the Dremio query optimizer to determine the optimal execution path for any given query before data is loaded.
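To make the idea of statistics-driven planning concrete, here is a minimal Python sketch. It is not Dremio's optimizer: the `TableStats` class, the uniform-distribution selectivity estimate, and the `choose_build_side` helper are all illustrative assumptions. It shows only the general principle that row counts and distinct-value counts gathered ahead of time let a planner pick a cheaper plan before reading any data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Hypothetical per-table statistics a planner might collect."""
    name: str
    row_count: int
    distinct_values: dict  # column name -> number of distinct values


def estimate_rows(stats, equality_column=None):
    """Estimate rows surviving an optional `column = constant` filter.

    Assumes values are uniformly distributed, so an equality filter
    keeps roughly row_count / distinct_values rows.
    """
    if equality_column is None:
        return float(stats.row_count)
    ndv = stats.distinct_values.get(equality_column, 1)
    return stats.row_count / max(ndv, 1)


def choose_build_side(left, right, left_filter=None, right_filter=None):
    """Pick the smaller estimated input as the hash-join build side."""
    left_rows = estimate_rows(left, left_filter)
    right_rows = estimate_rows(right, right_filter)
    return left.name if left_rows <= right_rows else right.name


orders = TableStats("orders", 1_000_000, {"status": 5, "order_id": 1_000_000})
customers = TableStats("customers", 50_000, {"customer_id": 50_000})

# A filter on a low-cardinality column still leaves orders the larger input.
plan_a = choose_build_side(orders, customers, left_filter="status")
# A filter on a unique column shrinks orders to about one row.
plan_b = choose_build_side(orders, customers, left_filter="order_id")
```

The same statistics lead to different decisions depending on the filter, which is why a planner without them has to guess.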
Dremio is also adding support for query plan caching, which eliminates both overhead and latency for repeated queries. In addition, the company is adding a high-performance compiler that enables much larger and more complex SQL statements, and it is employing machine learning algorithms to reduce the amount of compute resources required to run SQL queries. Cloud storage read operations make up 30% to 60% of query execution costs in some workloads, Dremio says, and the company is reducing the amount of data read from cloud object storage by enhancing the scan filter pushdown capabilities it provides.
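The two techniques named here, plan caching and scan filter pushdown, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dremio's implementation: the `DataFile`, `prune_files`, and `PlanCache` names are hypothetical, the cache is keyed on the exact query text, and pushdown is modeled as skipping files whose recorded minimum and maximum values cannot match the filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical file entry with min/max statistics for one column."""
    path: str
    min_value: int
    max_value: int


def prune_files(files, lower, upper):
    """Keep only files whose [min, max] range overlaps [lower, upper].

    Files outside the range are never read from object storage.
    """
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


class PlanCache:
    """Reuse a plan when the same query text is seen again.

    Keyed on the exact (stripped) SQL text for simplicity; a real
    engine would likely key on a parsed, normalized form.
    """

    def __init__(self, planner):
        self._planner = planner
        self._plans = {}
        self.misses = 0

    def get_plan(self, sql):
        key = sql.strip()
        if key not in self._plans:
            self.misses += 1  # planning work happens only on a miss
            self._plans[key] = self._planner(key)
        return self._plans[key]


files = [
    DataFile("part-a.parquet", 1, 100),
    DataFile("part-b.parquet", 101, 200),
    DataFile("part-c.parquet", 201, 300),
]
# A filter such as `id BETWEEN 150 AND 250` needs only two of the three files.
to_scan = [f.path for f in prune_files(files, 150, 250)]

cache = PlanCache(planner=lambda sql: ("plan-for", sql))
first = cache.get_plan("SELECT * FROM t WHERE id BETWEEN 150 AND 250")
second = cache.get_plan("SELECT * FROM t WHERE id BETWEEN 150 AND 250")
```

In this toy example one of the three files is skipped entirely and the planner runs once for two identical queries, which is the kind of saving the read-cost figures above are meant to motivate.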
The company is also broadening its support for SQL to include support for additional functions, operators, and SQL grammar constructs. The Dart Initiative provides a significant boost in the performance of end-user queries with complex expressions by extending the capabilities of Gandiva, a toolkit that enables vectorized execution on modern processors within the in-memory buffers in Apache Arrow, an open source columnar data format Dremio co-created.
Finally, Dremio has enhanced its ability to support the orchestrated refresh of hundreds of reflections employed by BI applications within multi-tenant environments. Future Dart Initiative phases will advance acceleration management through improved refresh granularity, consistency across related reflections, and improved refresh monitoring and restartability.
Those capabilities complement technologies such as the open source Apache Iceberg table format, which Dremio employs to enable multiple instances of its engines to work together, and Project Nessie, which brings Git-based semantics and workflows to data lakes, Shiran said.
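As a rough illustration of what Git-style semantics mean for a data lake, the Python sketch below models a catalog whose branches each map table names to a snapshot identifier. It does not use the Project Nessie or Apache Iceberg APIs; the `Catalog` class, its method names, and the snapshot IDs are hypothetical, and the merge shown simply lets the source branch win with no conflict detection.

```python
class Catalog:
    """Toy catalog: each branch maps table name -> snapshot id."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of the source branch's pointers;
        # no table data is copied.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, snapshot_id):
        # Point the table at a new snapshot on this branch only.
        updated = dict(self.branches[branch])
        updated[table] = snapshot_id
        self.branches[branch] = updated

    def merge(self, source, target="main"):
        # Simplification: source pointers overwrite target pointers.
        merged = dict(self.branches[target])
        merged.update(self.branches[source])
        self.branches[target] = merged


catalog = Catalog()
catalog.commit("main", "sales", "snap-1")

# Work on an isolated branch; readers of main keep seeing snap-1.
catalog.create_branch("etl")
catalog.commit("etl", "sales", "snap-2")
main_before_merge = catalog.branches["main"]["sales"]

# Publish the change to main in one step.
catalog.merge("etl")
main_after_merge = catalog.branches["main"]["sales"]
```

Because branches hold only pointers to snapshots, changes can be staged and validated in isolation and then made visible to all readers at once.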
It may be a while before organizations fully appreciate the degree to which open data lakes might obviate the need for proprietary data warehouses. The challenge is distinguishing between what truly constitutes a data lake versus just another data warehouse that happens to be deployed in a cloud.