CEDEs: Continuously Evolving Distributed Ensembles
Machine Learning (ML) has advanced significantly over the past decade, as has our relationship with data. Data are now diverse, produced in vast volumes, and stored, processed and used in real time. This evolution raises significant technical and research challenges that remain open [11]: learning from large datasets, learning from high-speed streaming data, learning from different types of data, learning from data with uncertainty and incompleteness, and learning from low-value and diverse data.
The research challenges are evident in an analysis of the recent literature [1,4,11]. The technical challenges are visible in companies struggling to use data effectively for decision support. The PI has first-hand experience with both, namely through his two latest projects: 3SLM and NEURAT. In the former, in co-promotion with Compta, he dealt with the challenges of using large volumes of streaming data from public street lighting networks to detect anomalies in real time, and implemented an autonomous management mechanism focused on energy efficiency. In the latter, with Petapilot and in cooperation with the Portuguese Tax Authority, he is addressing similar challenges in the context of financial fraud detection. The project consists of the implementation of an interactive ML system, in which fiscal data and auditor feedback are streaming sources that change over time and require models to adapt. Both projects faced the same core challenges.
To address these challenges, this proposal puts forward the concept of Continuously Evolving Distributed Ensembles (CEDEs) for supervised ML problems, motivated by insights from the team's recent research and applied work. CEDEs is founded on the notion of Ensemble Learning [17,18], in which a complex model is built by combining the predictions of multiple, simpler base models. However, it introduces several key innovations.
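To make the underlying principle concrete, the following minimal sketch (in Python, with purely illustrative names that are not part of any CEDEs API) shows an ensemble prediction obtained by majority vote over simple base models:

    # Ensemble Learning in a nutshell: the ensemble's prediction is the
    # combination (here, a majority vote) of the predictions of simpler base models.
    from collections import Counter

    def ensemble_predict(base_models, x):
        votes = [predict(x) for predict in base_models]   # one vote per base model
        return Counter(votes).most_common(1)[0][0]        # majority vote

    # Three trivial threshold "models", used only for illustration
    base_models = [lambda x: x > 0.3, lambda x: x > 0.5, lambda x: x > 0.7]
    print(ensemble_predict(base_models, 0.6))             # True (2 of 3 models agree)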
First, CEDEs will be distributed by design, relying on a stack of open-source technologies for data acquisition, storage and processing. Ensembles will be implemented in a distributed manner, with each base model trained on the cluster node where its data reside. This will naturally distribute and parallelize ML tasks while minimizing data transfer overheads across the cluster. This federated approach will also make CEDEs suitable for multi-organizational contexts, in which each organization retains data access rights and data privacy while still benefiting from a large, shared model.
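As an illustration of this data-locality principle, the sketch below (hypothetical names; threads merely stand in for cluster nodes) trains one base model per data partition and moves only the fitted models, never the raw data:

    # Conceptual sketch of the distributed/federated training idea: each base
    # model is fitted where its data partition lives; only the trained models
    # (not the data) are exchanged. ThreadPoolExecutor only simulates nodes here.
    from concurrent.futures import ThreadPoolExecutor
    from sklearn.tree import DecisionTreeClassifier

    def train_local(partition):
        """Would run on the node holding `partition`; no raw data is transferred."""
        X, y = partition
        return DecisionTreeClassifier(max_depth=3).fit(X, y)

    def build_distributed_ensemble(partitions):
        # In a real cluster each task is scheduled on the node that stores the
        # data (data locality); threads stand in for nodes in this sketch.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(train_local, partitions))

Because only fitted models cross node boundaries, the same pattern extends to the multi-organizational, privacy-preserving setting described above.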
CEDEs also proposes the notion of continuous multi-objective Ensemble optimization. Most existing approaches use a one-shot, static pruning and integration policy for the Ensemble, defined when the base models are trained [16,18]. In streaming scenarios, however, both the models and the pruning/integration policy may become outdated as data concepts and properties change. CEDEs will include a mechanism that continuously fine-tunes this policy in order to maximize the useful life of the Ensemble. Moreover, instead of relying only on diversity and accuracy metrics, as current approaches do [16-18], CEDEs will assess and propose novel optimization objectives, including data meta-features (e.g. quality metrics, information-theoretic features) and data/model age. This more comprehensive and dynamic approach will minimize the need to re-train base models or the whole Ensemble.
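A hedged sketch of what such a continuously re-tuned, multi-objective pruning step could look like is given below; the specific objectives, weights and re-tuning schedule are illustrative assumptions, not the final CEDEs design:

    # Multi-objective pruning sketch: each base model is scored on several
    # objectives (e.g. accuracy, diversity, model age, data-quality meta-features)
    # and only the top-k are kept in the Ensemble.
    def prune_ensemble(models, objectives, weights, k):
        """objectives: dict of name -> callable(model) returning a score in [0, 1]."""
        def score(model):
            return sum(weights[name] * fn(model) for name, fn in objectives.items())
        return sorted(models, key=score, reverse=True)[:k]

    # In a streaming setting this selection is re-run (and `weights` re-tuned)
    # whenever a new data chunk arrives, instead of being applied only once.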
The last disruptive aspect of CEDEs is the use of meta-learning to predict model performance. This exploits the large number of base models trained in a CEDEs instance by maintaining a large, shared meta-dataset that maps meta-features of the input data to model performance metrics. This meta-dataset is then used to train multiple meta-models that predict different performance metrics of future models trained on given data. In turn, these meta-models contribute to the efficient management of cluster resources by allowing nodes to predict whether a candidate model would outperform the one currently in use, and thus decide whether it is worth training.
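The sketch below illustrates this meta-learning loop under simplifying assumptions (the meta-features, the accuracy values and the choice of regressor are all illustrative):

    # Meta-learning sketch: rows of the shared meta-dataset map meta-features of
    # past data chunks to the observed performance of the models trained on them;
    # a meta-model then estimates the performance of a candidate model *before*
    # it is trained, so a node can decide whether training is worthwhile.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Hypothetical meta-features: [n_rows, n_features, class_entropy, missing_ratio]
    meta_X = np.array([[10000, 25, 0.90, 0.01],
                       [50000, 12, 0.45, 0.10],
                       [ 5000, 60, 0.70, 0.05]])
    meta_y = np.array([0.88, 0.74, 0.81])     # observed accuracy of past base models

    meta_model = RandomForestRegressor(n_estimators=50).fit(meta_X, meta_y)

    predicted_acc = meta_model.predict([[20000, 30, 0.80, 0.02]])[0]
    train_new_model = predicted_acc > 0.85    # accuracy of the model currently in use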
From a research perspective, CEDEs will make several contributions, including the implementation and assessment of various optimization techniques, optimization objectives and base models, and the use of meta-learning for predicting model performance. These contributions will be disseminated through several high-impact publications.
From an applied point of view, CEDEs will also make a significant contribution by proposing a distributed architecture for federated ML in streaming scenarios, based on open-source technologies. CEDEs will be made publicly available as a containerized, easily deployable prototype. This applied nature is also expected to open further opportunities for projects in co-promotion with companies that could benefit from CEDEs, contributing to an actual transfer of knowledge from academia to industry, a transfer that too often fails to happen.