Our visionary researchers and founders have repeatedly demonstrated their ability to pioneer advancements in distributed systems, AI, Machine Learning, and data analytics.
Scalytics Connect is a next-generation data management framework that powers best-of-breed implementations of deep learning and machine learning applications in federated environments. Scalytics Connect uses our own custom version of Federated Learning to deliver market-leading results in a single deployment.
Since 2015, we have been regularly contributing our research to the community and publishing in high-quality journals.
Browse our library; all papers are listed in reverse chronological order, newest first. Bookmark this page to stay up to date on our latest articles and research. Another excellent resource is our blog, where our experts discuss the most recent advancements in AI, ML, and data processing.
Explore Apache Wayang, a groundbreaking open-source data analytics framework that unites various data processing platforms, optimizing performance and reducing costs. Dive into the paper for insights on Wayang’s architecture and its seamless, integrated user experience.
Our approach identifies DBMS-supported operations and translates them into SQL, leveraging DBMSes to accelerate data science workloads. The optimization target is twofold: first, to improve data loading by reducing the amount of data transferred between runtimes.
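The core pushdown idea can be sketched in a few lines. This is a minimal, self-contained illustration (not the paper's implementation): a filter that the DBMS supports is translated into SQL so only qualifying rows leave the database, instead of transferring the whole table and filtering in the analytics runtime. The table and column names are hypothetical.

```python
import sqlite3

# Toy setup: a table living inside a DBMS (here, in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("a", 1.0), ("a", 9.5), ("b", 3.2), ("b", 8.8)])

# Naive plan: transfer every row, then filter in the application runtime.
all_rows = conn.execute("SELECT sensor, value FROM readings").fetchall()
app_side = [v for (_, v) in all_rows if v > 5.0]

# Pushed-down plan: the filter (and projection) run inside the DBMS,
# so far less data crosses the runtime boundary.
pushed = [v for (v,) in
          conn.execute("SELECT value FROM readings WHERE value > 5.0")]

assert sorted(app_side) == sorted(pushed)  # same result, less data moved
```

The result is identical either way; the win is that the pushed-down plan moves two values across the boundary instead of four full rows.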
Earth observation (EO) is a prime instrument for monitoring land and ocean processes, studying the dynamics at work, and taking the pulse of our planet. This article gives a bird's eye view of the essential scientific tools and approaches informing and supporting the transition from raw EO data to usable EO-based information.
The processing of geo-distributed data is subject to data transfer regulations. In this paper, we present our work on a federated data processing system that can comply with these regulations. We also present research challenges and opportunities for the system to make compliance a truly first-class citizen.
Cost-based optimization is widely known to suffer from a major weakness: administrators spend a significant amount of time tuning the associated cost models. This problem only gets exacerbated in cross-platform settings, as there are many more parameters that need to be tuned.
Data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source cross-platform system that copes with these new requirements.
Data analytics are moving beyond the limits of a single data processing platform. A cross-platform query optimizer is necessary to enable applications to run their tasks over multiple platforms efficiently and in a platform-agnostic manner.
Although big data processing has become dramatically easier over the last decade, there has not been matching progress in big data debugging. It is estimated that users spend more than 50% of their time debugging their big data applications.
Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms.
Many of today’s applications need several data processing platforms for complex analytics. Thus, recent systems have taken steps towards supporting cross-platform data analytics. Yet, current cross-platform systems lack ease of use, which is crucial for their adoption.
There is a zoo of data processing platforms which help users and organizations to extract value out of their data. Although each of these platforms excels in specific aspects, users typically end up running their data analytics on suboptimal platforms.
Today, organizations typically perform tedious and costly tasks to juggle their code and data across different data processing platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging, because it requires deep expertise across all the available data processing platforms.
Inequality joins, which join relations on inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join, to the use of efficient indices such as B+-tree, R∗-tree, and Bitmap.
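The intuition behind sort-based acceleration of inequality joins can be shown with a simplified, single-condition sketch (this is an illustration only, not the algorithm proposed in the paper): joining R and S on R.a < S.b naively costs O(|R|·|S|), but sorting S once lets each R-tuple locate its matches with a binary search.

```python
import bisect

# Hypothetical join attribute values for relations R and S,
# joined on the inequality condition R.a < S.b.
R = [1, 4, 7]
S = [2, 5, 6, 9]

# Naive nested-loop inequality join: O(|R| * |S|) comparisons.
naive = [(a, b) for a in R for b in S if a < b]

# Sort-based variant: sort S once, then for each a in R binary-search
# the first position where S_sorted[i] > a; everything after it matches.
S_sorted = sorted(S)
fast = []
for a in R:
    i = bisect.bisect_right(S_sorted, a)  # first index with S_sorted[i] > a
    fast.extend((a, b) for b in S_sorted[i:])

assert sorted(naive) == sorted(fast)
```

Every comparison after the binary search is guaranteed to match, which is what makes sortedness (and, by extension, sorted indices) so effective for inequality predicates.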
As the use of machine learning (ML) permeates into diverse application domains, there is an urgent need to support a declarative framework for ML. Ideally, a user will specify an ML task in a high-level and easy-to-use language and the framework will invoke the appropriate algorithms and system configurations to execute it.
The world is fast moving towards a data-driven society where data is the most valuable asset. Organizations need to perform very diverse analytic tasks using various data processing platforms. In doing so, they face many challenges; chiefly, platform dependence, poor interoperability, and poor performance when using multiple platforms.
Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases Rheem, a framework that provides multi-platform task execution for such applications.
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions.
Inequality joins, which join relational tables on inequality conditions, are used in various applications. Optimization methods for joins in database systems range widely, from algorithms such as sort-merge join and band join, to various indices such as B+-tree, R∗-tree, and Bitmap.