Talk abstract: The top two problems for data warehouse users today are data quality and staleness. While building a reliable data pipeline is inherently hard, many of today's problems stem from the complex data architectures that organizations deploy. These architectures contain many systems—data lakes, message queues, and warehouses—that data must pass through, and each transfer step adds delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high SQL performance similar to a data warehouse. Thousands of organizations, including the largest Internet companies, are now using lakehouses to replace separate data lake, warehouse, and streaming systems and to deliver high-quality data faster internally. I'll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks and is now used at thousands of companies.
Speaker bio: Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley and has worked on other widely used open source data analytics and AI software, including MLflow and Delta Lake. At Stanford, he is part of the DAWN lab, which focuses on infrastructure for machine learning. Matei's research has been recognized with the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the US government on early-career scientists and engineers.