The mission of PROTEUS is to investigate and develop ready-to-use, scalable online machine learning algorithms and real-time interactive visual analytics for extremely large data sets and data streams.


PROTEUS builds on an optimized implementation of combined batch and stream processing, on top of which scalable real-time processes are then built. The developed algorithms and techniques will form a library to be integrated into an enhanced version of Apache Flink, the EU Big Data platform. PROTEUS will contribute to the Big Data area by addressing fundamental challenges related to the scalability and responsiveness of analytics capabilities. The requirements are defined by an industrial use case in steelmaking. The techniques developed in PROTEUS are, however, general, flexible and portable to all domains based on data streams.


In particular, the project will go beyond the current state of the art by making the following original contributions:
  • New strategies for real-time hybrid computation over batch data and data streams.
  • Real-time scalable machine learning for the analytics of massive, high-velocity and complex data streams.
  • Real-time interactive visual analytics for Big Data.
  • Implementation of the new advances on top of Apache Flink.
  • Real-world industrial validation of the technology developed.

The PROTEUS impact is manifold:

  • strategic, by reducing the gap with, and the dependency on, US technology, and by empowering the EU Big Data industry through the enrichment of the EU platform Apache Flink;
  • economic, by fostering the development of new skills, job positions and opportunities that contribute to economic growth;
  • industrial, by considering real-world requirements from industry and by validating the outcome in an operational setting; and
  • scientific, by developing original hybrid and streaming analytic architectures that enable scalable online machine learning strategies and advanced interactive visualisation techniques applicable to general data streams in other domains.