
PEOPLE@HES-SO
Directory of staff and competencies


Zapater Sancho Marina

Associate Professor (HES)

Main competencies

High Performance Computing

Embedded Systems

Energy efficiency

Power Management

Deep Learning

Thermal modelling

Novel architectures


Main contract

Associate Professor (HES)

Office: A21

Haute école d'Ingénierie et de Gestion du Canton de Vaud
Route de Cheseaux 1, 1400 Yverdon-les-Bains, CH
HEIG-VD
Institute
ReDS - Institut Reconfigurable & Embedded Digital Systems
BSc in Computer Science and Communication Systems - Haute école d'Ingénierie et de Gestion du Canton de Vaud
  • Calcul numérique et accélération matérielle (CNM)
  • Intelligence Artificielle pour les Systèmes Autonomes (IAA)
  • Architecture des Ordinateurs (ARO)

Ongoing

C4liTwin: Machine-Learning self-calibration and modelling of robotic arms using Digital Twins
AGP

Role: Main applicant

Funding: Innosuisse; Trimos

Project description: The C4liTwin project (pronounced "calitwin") brings together machine learning and digital twin technology to develop the framework and algorithms required to self-calibrate the C4 robotic arms of Trimos, creating a digital twin that models the arm behaviour and enables automatic re-calibration.
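As a rough illustration of the underlying idea only (not the project's actual algorithm), the Python sketch below learns a residual error model of a simulated arm from reference measurements and uses it as a "digital twin" to correct new readings; the error shape and all constants are invented:

```python
# Illustrative sketch: a toy "digital twin" of a measuring arm's systematic
# error, fitted from reference measurements and used for re-calibration.
import numpy as np

rng = np.random.default_rng(0)

def arm_measure(true_pos):
    """Simulated raw arm reading: true position plus an invented systematic bend."""
    return true_pos + 5e-3 * np.sin(true_pos) + rng.normal(0, 1e-4, true_pos.shape)

# Calibration run: measure artifacts whose true positions are known.
true_ref = np.linspace(0.0, 1.0, 50)        # reference positions (m)
raw_ref = arm_measure(true_ref)             # what the arm reports
residual = raw_ref - true_ref               # systematic error to model

# "Digital twin" of the error: least-squares polynomial fit of the residual.
twin_error = np.poly1d(np.polyfit(raw_ref, residual, deg=5))

# Self-calibration: subtract the predicted error from new raw readings.
true_new = np.linspace(0.1, 0.9, 20)
raw_new = arm_measure(true_new)
corrected = raw_new - twin_error(raw_new)

print("max |error| before:", np.max(np.abs(residual)))
print("max |error| after :", np.max(np.abs(corrected - true_new)))
```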

Research team within HES-SO: Brunet Yorick, Chacun Guillaume, Extrat Bastien, Zapater Sancho Marina, Akeddar Mehdi

Academic partners: ReDS; Zapater Sancho Marina, ReDS

Project duration: 15.11.2023 - 30.11.2025

Total project budget: 398'080 CHF

Status: Ongoing

Midgard: Virtual Memory for Post-Moore Servers

Role: Partner

Applicant(s): EPFL

Project description:

The goal of this project is to support the Midgard technology in the gem5 simulator. Midgard proposes to rethink the overall virtual memory technology of current servers by exposing a global but sparse intermediate address space in a coherent cache hierarchy. Midgard eliminates TLBs by offering direct hardware translation from existing Operating System (OS) Virtual Memory (VM) software abstractions (called VMAs), performing page-level translations only when accessing physical memory or I/O. As such, in contrast to state-of-the-art page-based VM, Midgard's overall address translation overhead decreases as cache hierarchy capacity increases. By eliminating the need for deep TLB hierarchies, Midgard not only reclaims the TLB silicon provisioning, but also offers orders of magnitude faster address translation, shootdown, and access control creation/revocation compared to conventional page-based VM.

Prof. Zapater is affiliated faculty in the Midgard project, a joint project between EPFL, Yale University and the University of Edinburgh. Within this project she collaborates with EPFL in providing simulation support for Midgard in gem5. 

Firstly, because Midgard requires the use of multi-level TLBs, simulating Midgard on gem5 requires setting up the whole simulation environment in the latest gem5 version, namely gem5-22. However, the gem5-X simulator developed at EPFL only supports older (2019) gem5 versions, which do not include adequate support for Midgard. Therefore, within this project we plan to perform the developments required to port the main features of gem5-X to gem5-22, creating a new release named gem5-X-22, in which all Midgard support will be released.

Secondly, we will create the necessary models in gem5 to enable simulating Midgard at the hardware and architectural levels. We will also develop the necessary framework to enable the creation of cache-coherent Midgard-compliant accelerators. We will do so by supporting the enhancement of the ALPINE simulation framework, adequately integrating it into gem5. 

Finally, to showcase and fully simulate Midgard, we will focus on a proof-of-concept Midgard-compliant OS implementation in SO3, running in gem5. SO3 is a Linux-based simple operating system used for teaching and research at the REDS institute. It provides all the base functionalities of a Linux kernel while being lightweight and easy to modify. The goal will be to provide Midgard support in SO3, while maintaining compatibility with Linux-based systems, to allow advances in the proposal of cache-coherent accelerators.
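For intuition only, the toy Python sketch below mimics the flavour of such a scheme: virtual addresses are translated through a handful of VMA-like ranges into a flat intermediate address space, and page-granularity translation is deferred to the memory side. The layout and all numbers are invented and unrelated to Midgard's real design:

```python
# Toy model: range-based (VMA-level) translation to an intermediate address
# space, with page-level translation deferred to the "memory side".
import bisect

# Each VMA: (virtual_base, size, intermediate_base); sorted by virtual_base.
VMAS = [
    (0x0000_0000, 0x1_0000, 0x8000_0000),   # e.g. code
    (0x0010_0000, 0x4_0000, 0x9000_0000),   # e.g. heap
]
starts = [v[0] for v in VMAS]

def to_intermediate(vaddr):
    """Range lookup instead of a per-page TLB: O(log #VMAs), not O(#pages)."""
    i = bisect.bisect_right(starts, vaddr) - 1
    if i < 0:
        raise ValueError("unmapped")
    base, size, inter = VMAS[i]
    if vaddr >= base + size:
        raise ValueError("unmapped")
    return inter + (vaddr - base)

# Page-level translation happens only when "physical memory" is accessed.
PAGE = 4096
page_table = {}   # intermediate page -> physical page (filled lazily)

def to_physical(iaddr, next_phys=[0]):
    ipage, off = divmod(iaddr, PAGE)
    if ipage not in page_table:              # allocate on first touch
        page_table[ipage] = next_phys[0]
        next_phys[0] += 1
    return page_table[ipage] * PAGE + off

ia = to_intermediate(0x0010_2345)
print(hex(ia), hex(to_physical(ia)))
```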

 

Research team within HES-SO: Zapater Sancho Marina

Academic partners: EPFL

Project duration: 01.10.2022 - 31.12.2024

Project website: https://midgard.epfl.ch

Status: Ongoing

ECO4AI - efficient Edge-to-Cloud workload allOcation for Artificial Intelligence applications

Role: Main applicant

Funding: HES-SO - call for projects for young researchers ("Appel à projets jeunes chercheurs")

Project description:

The main goal of ECO4AI is to propose workload allocation techniques that efficiently distribute workload between the edge and the cloud in a transparent way for AI-based IoT applications, allowing an efficiency increase (in terms of performance per watt). This will be accomplished by exploiting the underlying heterogeneous capabilities of novel edge and cloud architectures, and by proposing elastic edge-cloud resource allocation and management techniques.

The project exploits open hardware architectures such as RISC-V and proposes a hardware/software ecosystem that will be released open source, increasing visibility and impact.

ECO4AI will focus on three different use cases that play a key role today in the AI-based IoT scenario: (1) video surveillance and object tracking; (2) autonomous driving, which represents an important opportunity for edge-edge and edge-cloud cooperation; and (3) e-Health, more specifically bio-signal monitoring for cardiac diseases and epileptic seizures.
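A minimal sketch of the allocation idea, with entirely invented task profiles and capacities: each AI task is placed where its estimated performance per watt is highest, subject to a simple edge capacity limit:

```python
# Hedged sketch of edge-vs-cloud placement by performance-per-watt.
# tasks: (name, edge_perf ops/s, edge_power W, cloud_perf ops/s, cloud_power W)
tasks = [
    ("object_tracking", 2e8, 5.0, 2e9, 80.0),
    ("lane_detection",  1e8, 4.0, 4e9, 90.0),
    ("ecg_monitoring",  5e7, 1.0, 1e9, 60.0),
]
EDGE_CAPACITY = 3e8   # ops/s the edge device can sustain (assumed)

edge_load, placement = 0.0, {}
for name, e_perf, e_pow, c_perf, c_pow in tasks:
    edge_eff, cloud_eff = e_perf / e_pow, c_perf / c_pow   # perf per watt
    if edge_eff >= cloud_eff and edge_load + e_perf <= EDGE_CAPACITY:
        placement[name], edge_load = "edge", edge_load + e_perf
    else:
        placement[name] = "cloud"

print(placement)
```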

Research team within HES-SO: Zapater Sancho Marina

Project duration: 01.01.2022 - 30.06.2023

Total project budget: 100'000 CHF

Status: Ongoing

Completed

An Edge-to-Cloud platform for semi-supervised multi-source data labelling to enable horse health diagnosis.
AGP

Role: Main applicant

Funding: HES-SO Rectorate; Alogo Analysis SA

Project description: The main goal of this project is to tackle objectives (1) and (2) presented above. For this purpose, we plan to develop an edge-to-cloud server platform enabling the collection, synchronization, and labeling of real-time accelerometry data from Alogo Move Pro together with video data captured by a high-resolution smartphone camera. The accelerometry and video data will be synchronized, and the video will be pre-processed via machine learning techniques to adequately crop, refocus and resize the images of horses, as well as to automatically select regions of interest (running, jumping, etc.). An interactive Graphical User Interface (GUI), developed in collaboration with Alogo, will enable fast yet accurate data labeling, allowing horse experts and veterinarians to label and assess horse condition. The overarching goal is to create a platform and an interactive visualization and labelling tool enabling the creation of a dataset that will be used in future projects to develop AI algorithms able to diagnose horse condition.
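As a toy illustration of the synchronization step (rates and timestamps invented, unrelated to the real sensors), the sketch below aligns each video frame with the nearest accelerometry sample by timestamp:

```python
# Toy alignment of two streams sampled on different clocks.
import numpy as np

acc_t = np.arange(0.0, 10.0, 1 / 200)       # 200 Hz accelerometer clock
acc = np.random.default_rng(1).normal(size=acc_t.size)
frame_t = np.arange(0.0, 10.0, 1 / 30)      # 30 fps camera clock

# For each frame, index of the closest accelerometry sample.
idx = np.searchsorted(acc_t, frame_t)
idx = np.clip(idx, 1, acc_t.size - 1)
idx -= np.abs(frame_t - acc_t[idx - 1]) < np.abs(frame_t - acc_t[idx])

pairs = list(zip(frame_t[:5], acc[idx[:5]]))  # (frame time, matched sample)
print(pairs)
```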

Research team within HES-SO: Convers Anthony, Chacun Guillaume, Zapater Sancho Marina, Akeddar Mehdi

Academic partners: ReDS; Zapater Sancho Marina, ReDS

Project duration: 01.05.2024 - 30.04.2025

Total project budget: 55'000 CHF

Status: Completed

C4libre: Machine-Learning self-calibration of the C4 robotic arm
AGP

Role: Main applicant

Funding: Innosuisse

Project description: The main goal is to propose novel machine learning-based algorithms to enable self-calibration of robotic coordinate measuring machines (CMMs). CMMs are micrometer-accurate robotic arms capable of providing high-precision measurements of industrial component parts. Calibration is a mandatory step that remains very time-consuming and costly. In the specific case of the C4 robotic arm of Trimos, the calibration process must ensure a maximum measurement error below the 8 um threshold for the overall volume under measure, while automating most of the process to reduce human-related costs. The calibration algorithms proposed in this project imply a significant departure from current techniques. We will build on formal mathematical models that describe the arm behavior (using trigonometry) and propose evolutionary techniques to tune the static and dynamic corrections and attain sub-8 um errors. Evolutionary techniques (genetic algorithms and genetic programming) have great potential in situations where we know the underlying governing physical/mathematical laws of the system, but the dynamics are too complex to be described formally and we benefit from a data-driven approach. Moreover, we know that the placement of the artifacts required for calibration plays an important role in achieving a uniform error across the overall volume under test. Therefore, we also plan to analyze the error distribution to understand the impact of artifact placement on error.
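A bare-bones genetic-algorithm sketch of this idea, with an invented fitness function standing in for the real arm model (the hidden "ideal" corrections and all constants are made up), might look as follows:

```python
# Toy genetic algorithm: evolve correction parameters that minimise the
# worst-case measurement error, targeting a sub-8 um maximum error.
import numpy as np

rng = np.random.default_rng(42)
TRUE = np.array([12e-6, -7e-6, 3e-6])        # hidden "ideal" corrections

def max_error(params):
    """Worst-case residual over the measured volume for given corrections."""
    return np.max(np.abs(params - TRUE)) + 1e-6   # 1 um noise floor

pop = rng.normal(0, 20e-6, size=(40, 3))     # initial population
for gen in range(60):
    fitness = np.array([max_error(p) for p in pop])
    parents = pop[np.argsort(fitness)[:10]]  # selection: keep the best 10
    children = []
    for _ in range(30):                      # crossover + mutation
        a, b = parents[rng.integers(10, size=2)]
        mask = rng.random(3) < 0.5
        children.append(np.where(mask, a, b) + rng.normal(0, 2e-6, 3))
    pop = np.vstack([parents, children])

best = min(pop, key=max_error)
print(f"max error: {max_error(best) * 1e6:.2f} um (target < 8 um)")
```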

Research team within HES-SO: Extrat Bastien, Zapater Sancho Marina, Akeddar Mehdi

Academic partners: ReDS

Project duration: 29.03.2023 - 28.09.2023

Total project budget: 15'000 CHF

Status: Completed

2025

LionHeart: a layer-based mapping framework for heterogeneous systems with analog in-memory computing tiles
Scientific article ArODES

Corey Lammie, Yuxuan Wang, Flavio Ponzina, Joshua Klein, Hadjer Benmeziane, Marina Zapater Sancho, Irem Boybat, Abu Sebastian, Giovanni Ansaloni, David Atienza

IEEE Transactions on Emerging Topics in Computing,  2025, to be published, 1-13

Link to publication

Abstract:

When arranged in a crossbar configuration, resistive memory devices can be used to execute Matrix-Vector Multiplications (MVMs), the most dominant operation of many Machine Learning (ML) algorithms, in constant time complexity. Nonetheless, when performing computations in the analog domain, novel challenges are introduced in terms of arithmetic precision and stochasticity, due to non-ideal circuit and device behaviour. Moreover, these non-idealities have a temporal dimension, resulting in a degrading application accuracy over time. Facing these challenges, we propose a novel framework, named LionHeart, to obtain hybrid analog-digital mappings to execute Deep Learning (DL) inference workloads using heterogeneous accelerators. The accuracy-constrained mappings derived by LionHeart showcase, across different Convolutional Neural Networks (CNNs) and one transformer-based network, high accuracy and potential for speedup. The results of the full system simulations highlight runtime reductions and energy efficiency gains that exceed 6× with respect to a fully digital floating-point implementation, under a user-defined accuracy threshold.
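For intuition, a greedy toy version of accuracy-constrained hybrid mapping (all numbers invented; the actual LionHeart framework is far more involved) could look like this:

```python
# Greedy sketch: move layers to analog tiles, biggest speedup first, while
# the estimated accuracy stays above a user-defined threshold.
layers = [  # (name, speedup if mapped to AIMC, accuracy drop if mapped, in %)
    ("conv1", 8.0, 0.2), ("conv2", 6.0, 0.4),
    ("conv3", 5.0, 1.1), ("fc",    3.0, 0.1),
]
BASE_ACC, THRESHOLD = 92.0, 91.0   # digital accuracy and user constraint (%)

acc, mapping = BASE_ACC, {}
for name, speedup, drop in sorted(layers, key=lambda l: -l[1]):
    if acc - drop >= THRESHOLD:    # accept analog mapping if still accurate
        mapping[name], acc = "analog", acc - drop
    else:
        mapping[name] = "digital"

print(mapping, f"estimated accuracy: {acc:.1f}%")
```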

2024

Bank on compute-near-memory: design space exploration of processing-near-bank architectures
Scientific article ArODES

Rafael Medina, Giovanni Ansaloni, Marina Zapater, Alexandre Levisse, Saeideh Alinezhad Chamazcoti, Timon Evenblij, Dwaipayan Biswas, Francky Catthoor, David Atienza

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,  2024, 43, 11, 4117-4129

Link to publication

Abstract:

Near-DRAM computing strategies advocate for providing computational capabilities close to where data is stored. Although this paradigm can effectively address the memory-to-processor communication bottleneck, it also presents new challenges: the strict resource constraints in the memory periphery demand careful tailoring of architectural elements. We herein propose a novel framework and methodology to explore compute-near-memory designs that interface to DRAM memory banks, demonstrating the area, energy, and performance tradeoffs subject to the architectural configuration. We exemplify this methodology by conducting two studies on compute-near-bank designs: 1) analyzing the interaction between control and data resources, and 2) exploring the integration of processing units with different DRAM standards. According to our study, the optimal size ratios between instruction and data capacity vary from 2× to 4× across benchmarks from representative application domains. The retrieved Pareto-optimal solutions from our framework improve state-of-the-art designs, e.g., achieving a 50% performance increase on matrix operations with 15% energy overhead relative to the FIMDRAM design. In addition, the exploration of DRAM shows the interplay between available internal bandwidth, performance, and area overhead. For example, a threefold increase in bandwidth raises performance by 47% across workloads at a 34% extra area cost.
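The Pareto-filtering step of such an exploration can be sketched in a few lines (design points invented):

```python
# Keep the designs not dominated on (performance, energy, area),
# where higher performance and lower energy/area are better.
designs = [  # (name, performance, energy, area)
    ("A", 1.00, 1.00, 1.00), ("B", 1.50, 1.15, 1.34),
    ("C", 1.20, 0.95, 1.10), ("D", 0.90, 1.20, 1.05),
]

def dominates(x, y):
    """x dominates y: no worse in every metric, strictly better in one."""
    ge = x[1] >= y[1] and x[2] <= y[2] and x[3] <= y[3]
    gt = x[1] > y[1] or x[2] < y[2] or x[3] < y[3]
    return ge and gt

pareto = [d for d in designs
          if not any(dominates(o, d) for o in designs if o is not d)]
print([d[0] for d in pareto])   # -> ['A', 'B', 'C']; D is dominated by A
```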

Which coupled is best coupled? An exploration of AIMC tile interfaces and load balancing for CNNs
Scientific article ArODES

Joshua Klein, Irem Boybat, Giovanni Ansaloni, Marina Zapater, David Atienza

IEEE Transactions on Parallel and Distributed Systems,  2024, 35, 10, 1780-1795

Link to publication

Abstract:

Due to stringent energy and performance constraints, edge AI computing often employs heterogeneous systems that utilize both general-purpose CPUs and accelerators. Analog in-memory computing (AIMC) is a well-known AI inference solution that overcomes computational bottlenecks by performing matrix-vector multiplication operations (MVMs) in constant time. However, the tiles of AIMC-based accelerators are limited by the number of weights they can hold. State-of-the-art research often sizes neural networks to AIMC tiles (or vice-versa), but does not consider cases where AIMC tiles cannot cover the whole network due to lack of tile resources or the network size. In this work, we study the trade-offs of available AIMC tile resources, neural network coverage, AIMC tile proximity to compute resources, and multi-core load balancing techniques. We first perform a study of single-layer performance and energy scalability of AIMC tiles in the two most typical AIMC acceleration targets: dense/fully-connected layers and convolutional layers. This study guides the methodology with which we approach parameter allocation to AIMC tiles in the context of large edge neural networks, both where AIMC tiles are close to the CPU (tightly-coupled) and cannot share resources across the system, and where AIMC tiles are far from the CPU (loosely-coupled) and can employ workload stealing. We explore the performance and energy trends of six modern CNNs using different methods of load balancing for differently-coupled system configurations with variable AIMC tile resources. We show that, by properly distributing workloads, AIMC acceleration can be made highly effective even on under-provisioned systems. As an example, 5.9x speedup and 5.6x energy gains were measured on an 8-core system, for a 41% coverage of neural network parameters.

Intermediate address space: virtual memory optimization of heterogeneous architectures for cache-resident workloads
Scientific article ArODES

Qunyou Liu, Darong Huang, Luis Costero, Marina Zapater, David Atienza

ACM Transactions on Architecture and Code Optimization,  2024, 21, 3, article no. 50, pp. 1-23

Link to publication

CloudProphet: a machine learning-based performance prediction for public clouds
Scientific article ArODES

Darong Huang, Luis Costero, Ali Pahlevan, Marina Zapater, David Atienza

IEEE Transactions on Sustainable Computing,  2024, 9, 4, 661-676

Link to publication

Abstract:

Computing servers have played a key role in developing and processing emerging compute-intensive applications in recent years. Consolidating multiple virtual machines (VMs) inside one server to run various applications introduces severe competition for limited resources among VMs. Many techniques such as VM scheduling and resource provisioning are proposed to maximize the cost-efficiency of the computing servers while alleviating the performance interference between VMs. However, these management techniques require accurate performance prediction of the application running inside the VM, which is challenging to obtain in the public cloud due to the black-box nature of the VMs. From this perspective, this paper proposes a novel machine learning-based performance prediction approach for applications running in the cloud. To achieve high-accuracy predictions for black-box VMs, the proposed method first identifies the running application inside the virtual machine. It then selects highly correlated runtime metrics as the input of the machine learning approach to accurately predict the performance level of the cloud application. Experimental results with state-of-the-art cloud benchmarks demonstrate that our proposed method outperforms existing prediction methods by more than 2× in terms of the worst prediction error. In addition, we successfully tackle the challenge of performance prediction for applications with variable workloads by introducing the performance degradation index, which other comparison methods fail to consider. The workflow versatility of the proposed approach has been verified with different modern servers and VM configurations.
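A rough sketch of that pipeline on synthetic data, correlation-based metric selection followed by a least-squares predictor (the paper's actual models and metrics are more sophisticated):

```python
# Select the runtime metrics most correlated with performance,
# then fit a linear predictor on them.
import numpy as np

rng = np.random.default_rng(7)
n, metrics = 200, 6
X = rng.normal(size=(n, metrics))            # runtime metrics (e.g. IPC, misses)
perf = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 0.1, n)  # hidden relation

# Rank metrics by |correlation| with the performance target.
corr = np.array([abs(np.corrcoef(X[:, j], perf)[0, 1]) for j in range(metrics)])
top = np.argsort(corr)[-2:]                  # keep the 2 most correlated

# Least-squares fit on the selected metrics only.
A = np.column_stack([X[:, top], np.ones(n)])
w, *_ = np.linalg.lstsq(A, perf, rcond=None)

rms = np.sqrt(np.mean((A @ w - perf) ** 2))
print("selected metrics:", top, " RMS error:", rms)
```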

2023

HackRF + GNU Radio: A software-defined radio to teach communication theory
Scientific article ArODES

Alberto A. Del Barrio, José P. Manzano, Victor M. Maroto, Álvaro Villarín, Josué Pagán, Marina Zapater, José Ayala, Román Hermida

The International Journal of Electrical Engineering & Education,  2023, 60, 1, 23-40

Link to publication

Abstract:

In this paper, an alternative to the traditional methodology for signal processing-like subjects is proposed. These are subjects that require a deep mathematical and theoretical basis, but whose practical goal is often not emphasized, which drives students to lose interest in the subject. Thus, a software-defined radio environment is proposed to provide a more practical view of the subject. This solution consists of an open hardware-software platform able to capture and process a wide range of frequencies. HackRF is the hardware component, while GNU Radio provides the graphical support for this device. The tests performed with a set of 36 students have revealed that they are more satisfied with this framework than with a traditional equation-based environment such as Matlab. Furthermore, their scores in the exams also support the suitability of the proposed platform.

2022

ALPINE: analog in-memory acceleration with tight processor integration for deep learning
Scientific article ArODES

Joshua Klein, Irem Boybat, Yasir Qureshi, Martino Dazzi, Alexandre Levisse, Giovanni Ansaloni, Marina Zapater, Abu Sebastian, David Atienza

IEEE Transactions on Computers,  2023, vol. 72, no. 7, pp. 1985 - 1998

Link to publication

Abstract:

Analog in-memory computing (AIMC) cores offer significant performance and energy benefits for neural network inference with respect to digital logic (e.g., CPUs). AIMCs accelerate matrix-vector multiplications, which dominate these applications' run-time. However, AIMC-centric platforms lack the flexibility of general-purpose systems, as they often have hard-coded data flows and can only support a limited set of processing functions. With the goal of bridging this gap in flexibility, we present a novel system architecture that tightly integrates analog in-memory computing accelerators into multi-core CPUs in general-purpose systems. We developed a powerful gem5-based full system-level simulation framework, ALPINE, built into the gem5-X simulator, which enables an in-depth characterization of the proposed architecture. ALPINE allows the simulation of the entire computer architecture stack from major hardware components to their interactions with the Linux OS. Within ALPINE, we have defined a custom ISA extension and a software library to facilitate the deployment of inference models. We showcase and analyze a variety of mappings of different neural network types, and demonstrate up to 20.5x/20.8x performance/energy gains with respect to a SIMD-enabled ARM CPU implementation for convolutional neural networks, multi-layer perceptrons, and recurrent neural networks.

Thermal and voltage-aware performance management of 3D MPSoCs with flow cell arrays and integrated SC converters
Scientific article ArODES

Halima Najibi, Giovanni Ansaloni, Marina Zapater, Miroslav Vasic, David Atienza

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,  2023, vol. 42, no. 1, pp. 2-15

Link to publication

Abstract:

Flow cell arrays (FCAs) concurrently provide efficient on-chip liquid cooling and electrochemical power generation. This technology is especially promising for three-dimensional multi-processor systems-on-chip (3D MPSoCs) realized in deeply scaled technologies, which present very challenging power and thermal requirements. Indeed, FCAs effectively improve power delivery network (PDN) performance, particularly if switched capacitor (SC) converters are employed to decouple the flow cell and system-on-chip voltages, allowing each to operate at its optimal point. Nonetheless, the design of FCA-based solutions entails non-obvious considerations and trade-offs, stemming from their dual role in governing both the thermal and power delivery characteristics of 3D MPSoCs. Showcasing them in this paper, we explore multiple FCA design configurations and demonstrate that this technology can decrease the temperature of a heterogeneous 3D MPSoC by 78 °C, and its total power consumption by 46%, compared to a high-performance cold-plate based liquid cooling solution. At the same time, FCAs enable up to 90% voltage drop recovery across dies, using SC converters occupying a small fraction of the chip area. Such outcomes provide an opportunity to boost 3D MPSoC computing performance by increasing the operating frequency of dies. Leveraging these results, we introduce a novel temperature and voltage-aware model predictive control (MPC) strategy that optimizes power efficiency during run-time. We achieve application-wide speedups of up to 16% on various machine learning (ML), data mining, and other high-performance benchmarks while keeping the 3D MPSoC temperature below 83 °C and voltage drops below 5%.

Reinforcement learning-based joint reliability and performance optimization for hybrid-cache computing servers
Scientific article ArODES

Darong Huang, Ali Pahlevan, Luis Costero, Marina Zapater, David Atienza

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,  2022, vol. 41, no. 12, pp. 5596-5609

Link to publication

Abstract:

Computing servers play a key role in the development and processing of emerging compute-intensive applications. However, they need to operate efficiently from an energy perspective, while maximizing the performance and lifetime of the hottest server components (i.e., cores and cache). Previous methods focused either on improving energy efficiency by adopting new hybrid-cache architectures, including resistive random-access memory (RRAM) and static random-access memory (SRAM) at the hardware level, or on exploring trade-offs between lifetime limitations and performance of multi-core processors under stable workload conditions. Therefore, no work has so far proposed a co-optimization method with hybrid-cache-based server architectures for real-life dynamic scenarios taking into account scalability, performance, lifetime reliability, and energy efficiency at the same time. In this paper, we first formulate a reliability model for the hybrid-cache architecture to enable precise lifetime reliability management and energy efficiency optimization. We also include the performance and energy overheads of cache switching, and optimize the benefits of hybrid-cache usage for better energy efficiency and performance. Then, we propose a runtime Q-Learning-based reliability management and performance optimization approach for multi-core microprocessors with the hybrid-cache architecture, jointly incorporated with a dynamic preemptive priority queue management method that improves overall task performance by aiming to respect task end-time limits. Experimental results show that our proposed method achieves up to 44% average performance (i.e., task execution time) improvement, while maintaining the whole-system design lifetime beyond 5 years, when compared to the latest state-of-the-art energy efficiency optimization and reliability management methods for computing servers.
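As a toy illustration of the Q-learning component only (states, actions, and rewards are invented and far simpler than the paper's), the sketch below learns which cache/frequency action to prefer in each thermal state:

```python
# Tabular Q-learning over a fake thermal environment.
import random

random.seed(0)
STATES = ["cool", "warm", "hot"]
ACTIONS = ["sram_high_f", "sram_low_f", "rram_low_f"]
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

def step(state, action):
    """Fake environment: reward = performance minus a penalty when hot."""
    perf = {"sram_high_f": 1.0, "sram_low_f": 0.6, "rram_low_f": 0.5}[action]
    penalty = 1.5 if (state == "hot" and action == "sram_high_f") else 0.0
    return perf - penalty, random.choice(STATES)

state = "cool"
for _ in range(5000):
    if random.random() < EPS:                       # epsilon-greedy explore
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    reward, nxt = step(state, action)
    best_next = max(Q[(nxt, a)] for a in ACTIONS)   # Q-learning update
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = nxt

for s in STATES:                                    # learned per-state policy
    print(s, "->", max(ACTIONS, key=lambda a: Q[(s, a)]))
```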

Energy-aware task scheduling in data centers using an application signature
Scientific article ArODES

Juan Carlos Salinas-Hilburg, Marina Zapater, José M. Moya, José L. Ayala

Computers & Electrical Engineering,  2022, vol. 97, article no. 107630

Link to publication

Abstract:

Data centers are power-hungry facilities. Energy-aware task scheduling approaches are of utmost importance to improve energy savings in data centers, although they need to know beforehand the energy consumption of the applications that will run in the servers. This is usually done through a full profiling of the applications, which is not feasible in long-running application scenarios due to the long execution times. In the present work we use an application signature that allows estimating the energy without the need to execute the application completely. We use different scheduling approaches together with the information of the application signature to improve the makespan of the scheduling process and therefore improve the energy savings in data centers. We evaluate the accuracy of using the application signature by comparing against an oracle method, obtaining an error below 1.5% and Compression Ratios of around 39.7 to 45.8.

2021

Gem5-X: a many-core heterogeneous simulation platform for architectural exploration and optimization
Scientific article ArODES

Yasir Mahmood Qureshi, William Andrew Simon, Marina Zapater, Katzalin Olcoz, David Atienza

ACM Transactions on Architecture and Code Optimization,  2021, vol. 18, no 4, article no. 44, pp. 1-27

Link to publication

Abstract:

The increasing adoption of smart systems in our daily life has led to the development of new applications with varying performance and energy constraints, and suitable computing architectures need to be developed for these new applications. In this article, we present gem5-X, a system-level simulation framework, based on gem5, for architectural exploration of heterogeneous many-core systems. To demonstrate the capabilities of gem5-X, real-time video analytics is used as a case study. It is composed of two kernels, namely video encoding and image classification using convolutional neural networks (CNNs). First, we explore through gem5-X the benefits of the latest 3D high bandwidth memory (HBM2) in different architectural configurations. Then, using a two-step exploration methodology, we develop a new optimized clustered-heterogeneous architecture with HBM2 in gem5-X for the video analytics application. In this proposed clustered-heterogeneous architecture, an ARMv8 in-order cluster with an in-cache computing engine executes the video encoding kernel, giving 20% performance and 54% energy benefits compared to baseline ARM in-order and Out-of-Order systems, respectively. Furthermore, thanks to gem5-X, we conclude that ARM Out-of-Order clusters with HBM2 are the best choice to run visual recognition using CNNs, as they outperform a DDR4-based system by up to 30% both in terms of performance and energy savings.

Interpreting deep learning models for epileptic seizure detection on EEG signals
Scientific article ArODES

Valentin Gabeff, Tomas Teijeiro, Marina Zapater, Leila Cammoun, Sylvie Rheims, Philippe Ryvlin, David Atienza

Artificial Intelligence in Medicine,  2021, vol. 117, article no. 102084

Link to publication

Abstract:

While Deep Learning (DL) is often considered the state of the art for Artificial Intelligence-based medical decision support, it remains sparsely implemented in clinical practice and poorly trusted by clinicians due to insufficient interpretability of neural network models. We have approached this issue in the context of online detection of epileptic seizures by developing a DL model from EEG signals, and associating certain properties of the model behavior with the expert medical knowledge. This has conditioned the preparation of the input signals, the network architecture, and the post-processing of the output in line with the domain knowledge. Specifically, we focused the discussion on three main aspects: (1) how to aggregate the classification results on signal segments provided by the DL model into a larger time scale, at the seizure level; (2) what are the relevant frequency patterns learned in the first convolutional layer of different models, and their relation with the delta, theta, alpha, beta and gamma frequency bands on which the visual interpretation of EEG is based; and (3) the identification of the signal waveforms with larger contribution towards the ictal class, according to the activation differences highlighted using the DeepLIFT method. Results show that the kernel size in the first layer determines the interpretability of the extracted features and the sensitivity of the trained models, even though the final performance is very similar after post-processing. Also, we found that amplitude is the main feature leading to an ictal prediction, suggesting that a larger patient population would be required to learn more complex frequency patterns. Still, our methodology was successfully able to generalize patient inter-variability for the majority of the studied population with a classification F1-score of 0.873 and detecting 90% of the seizures.
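Aspect (1) can be illustrated with a minimal post-processing sketch (probabilities and thresholds invented): smooth per-segment ictal probabilities, threshold them, and keep only events of a minimum duration:

```python
# Toy segment-to-seizure aggregation: moving average + threshold + min length.
import numpy as np

probs = np.array([.1, .2, .8, .9, .85, .3, .9, .1, .1, .95, .9, .9, .2])
SMOOTH, THR, MIN_LEN = 3, 0.5, 2   # window (segments), threshold, min event length

kernel = np.ones(SMOOTH) / SMOOTH
smoothed = np.convolve(probs, kernel, mode="same")   # moving average
ictal = smoothed > THR

events, start = [], None
for i, flag in enumerate(ictal.tolist() + [False]):  # sentinel closes open runs
    if flag and start is None:
        start = i
    elif not flag and start is not None:
        if i - start >= MIN_LEN:
            events.append((start, i - 1))             # segment index range
        start = None

print("detected seizure-level events (segment ranges):", events)
```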

Multi-agent reinforcement learning for hyperparameter optimization of convolutional neural networks
Scientific article ArODES

Arman Iranfar, Marina Zapater, David Atienza

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,  2022, vol. 41, no. 4, pp. 1034-1047

Link to publication

Abstract:

Nowadays, Deep Convolutional Neural Networks (DCNNs) play a significant role in many application domains, such as computer vision, medical imaging, and image processing. Nonetheless, designing a DCNN able to defeat the state of the art is a manual, challenging, and time-consuming task, due to the extremely large design space that results from the large number of layers and their corresponding hyperparameters. In this work, we address the challenge of performing hyperparameter optimization of DCNNs through a novel Multi-Agent Reinforcement Learning (MARL)-based approach, eliminating the human effort. In particular, we adapt Q-learning and define learning agents per layer to split the design space into independent smaller design sub-spaces such that each agent fine-tunes the hyperparameters of the assigned layer with respect to a global reward. Moreover, we provide a novel formation of Q-tables along with a new update rule that facilitates agents' communication. Our MARL-based approach is data-driven and able to consider an arbitrary set of design objectives and constraints. We apply our MARL-based solution to different well-known DCNNs, including GoogLeNet, VGG, and U-Net, and various datasets for image classification and semantic segmentation. Our results show that, compared to the original CNNs, the MARL-based approach can reduce the model size, training time, and inference time by up to, respectively, 83x, 52%, and 54%, without any degradation in accuracy. Moreover, our approach is very competitive with state-of-the-art neural architecture search methods in terms of the designed CNN accuracy and its number of parameters, while significantly reducing the optimization cost.

3D-ICE 3.0: efficient nonlinear MPSoC thermal simulation with pluggable heat sink models
Scientific article ArODES

Frederico Terraneo, Alberto Leva, William Fornaciari, Marina Zapater, David Atienza

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,  2022, vol. 41, no. 4, pp. 1062-1075

Link to publication

Abstract:

The increasing power density in modern high-performance multi-processor systems-on-chip (MPSoCs) is fueling a revolution in thermal management. On the one hand, thermal phenomena are becoming a critical concern, making accurate and efficient simulation a necessity. On the other hand, a variety of physically heterogeneous solutions are coming into play: liquid, evaporative, thermoelectric cooling, and more. A new generation of simulators, with unprecedented flexibility, is thus required. In this paper, we present 3D-ICE 3.0, the first thermal simulator to allow for accurate nonlinear descriptions of complex and physically heterogeneous heat dissipation systems, while preserving the efficiency of the latest compact modeling frameworks at the silicon die level. 3D-ICE 3.0 allows designers to extend the thermal simulator with new heat sink models while simplifying the time-consuming step of model validation. Support for nonlinear dynamic models is included, for instance to accurately represent variable coolant flows. Our results present validated models of a commercial water heat sink and an air heat sink plus fan that achieve an average error below 1 °C and simulate, respectively, up to 3x and 12x faster than the real physical phenomena.
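The "pluggable heat sink" idea can be conveyed with a single-node toy model (made-up constants; 3D-ICE itself solves full 3D grids): the sink is just a function handed to the integrator, so nonlinear sink models plug in without touching the solver:

```python
# Forward-Euler integration of dT/dt = (P - q_sink(T)) / C for one thermal node.
def simulate(power_w, sink, t_amb=25.0, c_j_per_k=2.0, dt=0.01, steps=2000):
    T = t_amb
    for _ in range(steps):
        T += dt * (power_w - sink(T, t_amb)) / c_j_per_k
    return T

# Two pluggable sink models: a linear air sink and a flow-dependent one.
def air_sink(T, t_amb, r_k_per_w=0.8):
    return (T - t_amb) / r_k_per_w

def liquid_sink(T, t_amb, flow_lpm=1.0):
    # toy nonlinearity: thermal conductance grows with coolant flow rate
    return (T - t_amb) * (0.5 + 1.2 * flow_lpm)

print("air   :", round(simulate(30.0, air_sink), 1), "°C")
print("liquid:", round(simulate(30.0, liquid_sink), 1), "°C")
```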

ECOGreen: electricity cost optimization for green datacenters in emerging power markets
Scientific article ArODES

Ali Pahlevan, Marina Zapater, Ayse K. Coskun, David Atienza

IEEE Transactions on Sustainable Computing,  2021, vol. 6, no. 2, pp. 289 - 305

Link to publication

Abstract:

Modern datacenters need to tackle efficiently the increasing demand for computing resources while minimizing energy usage and monetary costs. Power market operators have recently introduced emerging demand-response programs, in which electricity consumers regulate their power usage following provider requests to reduce monetary costs. Among different programs, regulation service (RS) reserves are particularly promising for datacenters due to the high credit gain possibilities and datacenters' flexibility in regulating their power consumption. Therefore, it is essential to develop bidding strategies for datacenters to participate in emerging power markets together with power management policies that are aware of power market requirements at runtime. In this paper we propose ECOGreen, a holistic strategy to jointly optimize the datacenter RS problem and virtual machine (VM) allocation that satisfies the hour-ahead power market constraints in the presence of electrical energy storage (EES) and renewable energy. We first find the best power and reserve bidding values as well as the number of active servers in a fast analytical way that works well in practice. Then, we present an online adaptive policy that modulates datacenter power consumption by controlling VMs CPU resource limits and efficiently utilizing demand-side EES and renewable power, while guaranteeing quality-of-service (QoS) constraints. Our results demonstrate that ECOGreen can provide 76 percent of the datacenter power consumption on average as reserves to the market, due to largely operating on renewable sources and EES. This translates into ECOGreen saving up to 71 percent electricity costs when compared to other state-of-the-art datacenter electricity cost minimization techniques that participate in the power market.

Fast energy estimation framework for long-running applications
Scientific article ArODES

Juan Carlos Salinas-Hilburg, Marina Zapater, José M. Moya, José L. Ayala

Future Generation Computer Systems,  2021, vol. 115, pp. 20-33

Link to publication

Abstract:

The computation power in data center facilities is increasing significantly. This brings with it an increase of power consumption in data centers. Techniques such as power budgeting or resource management are used in data centers to increase energy efficiency. These techniques require to know beforehand the energy consumption throughout a full profiling of the applications. This is not feasible in scenarios with long-running applications that have long execution times. To tackle this problem we present a fast energy estimation framework for long-running applications. The framework is able to estimate the dynamic CPU and memory energy of the application without the need to perform a complete execution. For that purpose, we leverage the concept of application signature. The application signature is a reduced version, in terms of execution time, of the original application. Our fast energy estimation framework is validated with a set of long-running applications and obtains RMS values of 11.4% and 12.8% for the CPU and memory energy estimation errors, respectively. We define the concept of Compression Ratio as an indicator of the acceleration of the energy estimation process. Our framework is able to obtain Compression Ratio values in the range of 10.1 to 191.2.
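The arithmetic behind the signature idea is simple; a back-of-the-envelope sketch with invented numbers:

```python
# Run only a short signature phase, scale its measured energy by the known
# runtime ratio, and report the Compression Ratio of the estimation.
full_runtime_s = 3600.0          # long-running application
signature_runtime_s = 30.0       # reduced version actually executed
signature_energy_j = 4200.0      # measured during the signature run

# Energy estimate: assume the signature reproduces the application's
# average power, so energy scales linearly with runtime.
estimated_energy_j = signature_energy_j * (full_runtime_s / signature_runtime_s)

# Compression Ratio: how much faster the estimate is than full profiling.
compression_ratio = full_runtime_s / signature_runtime_s

print(f"estimated energy: {estimated_energy_j / 1e3:.0f} kJ, "
      f"Compression Ratio: {compression_ratio:.1f}x")
```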

COCKTAIL: multi-core co-optimization framework with proactive reliability management
Scientific article ArODES

Darong Huang, Ali Pahlevan, Marina Zapater, David Atienza

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,  2022, vol. 41, no. 2, pp. 386-399

Link to publication

Abstract:

High performance computing (HPC) servers aim to meet an increase in the number and complexity of tasks and, consequently, to address the energy efficiency challenge. In addition to energy efficiency, it is essential to manage the lifetime limitations of power-hungry server components (e.g., cores and cache), hence avoiding server failure before the end of the expected lifetime period. Traditional approaches focus on either using hybrid caches to reduce the leakage power of traditional static random-access memory (SRAM) cache, and thus increase energy efficiency, or the trade-off between the lifetime and performance of multi-core processors. However, these approaches fall short in terms of flexibility and applicability for HPC tasks requiring multi-parametric optimization including quality-of-service (QoS), lifetime reliability, and energy efficiency. As a result, in this paper we propose COCKTAIL, a holistic strategy framework to jointly optimize the energy efficiency of multi-core server processors and task performance in the HPC context, while guaranteeing lifetime reliability. First, we analyze the best cache technology among traditional SRAM and resistive random-access memory (RRAM), within the context of hybrid cache architectures, to improve energy efficiency and manage cache endurance limits with respect to task requirements. Second, we introduce a novel efficient proactive queue optimization policy to reorder HPC tasks for execution considering their end time and possible reliability effects on the use of the hybrid caches. Third, we present a dynamic model predictive control (MPC)-based reliability management method to maximize task performance, by controlling the frequency, temperature, and target lifetime of the server processor. Our results demonstrate that, while consuming similar energy, COCKTAIL provides up to 60% QoS improvement when compared to the latest state-of-the-art energy optimization and reliability management techniques in the HPC context. Moreover, our strategy guarantees a design lifetime longer than 5 years for the whole HPC system.

2020

Resource management for power-constrained HEVC transcoding using reinforcement learning
Scientific article ArODES

Luis Costero, Arman Iranfar, Marina Zapater, Francisco D. Igual, Katzalin Olcoz, David Atienza

IEEE Transactions on Parallel and Distributed Systems,  2020, vol. 31, no. 12

Link to publication

Abstract:

The advent of online video streaming applications and services, along with users' demand for high-quality content, requires High Efficiency Video Coding (HEVC), which provides higher video quality and more compression at the cost of increased complexity. On one hand, HEVC exposes a set of dynamically tunable parameters to provide trade-offs among Quality-of-Service (QoS), performance, and power consumption of multi-core servers in the video providers' data center. On the other hand, resource management of modern multi-core servers is in charge of adapting system-level parameters, such as operating frequency and multithreading, to deal with concurrent applications and their requirements. Therefore, efficient multi-user HEVC streaming necessitates joint adaptation of application- and system-level parameters. Nonetheless, dealing with such a large and dynamic design space is challenging and difficult to address through conventional resource management strategies. Thus, in this work, we develop a multi-agent Reinforcement Learning framework to jointly adjust application- and system-level parameters at runtime to satisfy the QoS of multi-user HEVC streaming in power-constrained servers. In particular, the design space, composed of all design parameters, is split into smaller independent sub-spaces. Each design sub-space is assigned to a particular agent so that it can explore it faster, yet accurately. The benefits of our approach are revealed in terms of adaptability and quality (with up to 4× improvements in terms of QoS when compared to a static resource management scheme), and learning time (6× faster than an equivalent mono-agent implementation). Finally, we show that the power-capping techniques formulated outperform hardware-based power capping with respect to quality.

BLADE: an in-cache computing architecture for edge devices
Scientific article ArODES

William Andrew Simon, Yasir Mahmood Qureshi, Marco Rios, Alexandre Levisse, Marina Zapater

IEEE Transactions on Computers,  2020, vol. 69, no. 9, pp. 1349 - 1363

Link to publication

Abstract:

Area and power-constrained edge devices are increasingly utilized to perform compute-intensive workloads, necessitating increasingly area- and power-efficient accelerators. In this context, in-SRAM computing performs hundreds of parallel operations on spatially local data common in many emerging workloads, while reducing power consumption due to data movement. However, in-SRAM computing faces many challenges, including integration into the existing architecture, arithmetic operation support, data corruption at high operating frequencies, inability to run at low voltages, and low area density. To meet these challenges, this article introduces BLADE, a BitLine Accelerator for Devices on the Edge. BLADE is an in-SRAM computing architecture that utilizes local wordline groups to perform computations at a frequency 2.8× higher than state-of-the-art in-SRAM computing architectures. BLADE is integrated into the cache hierarchy of low-voltage edge devices, and simulated and benchmarked at the transistor, architecture, and software abstraction levels. Experimental results demonstrate performance/energy gains over an equivalent NEON accelerated processor for a variety of edge device workloads, namely cryptography (4× performance gain/6× energy reduction), video encoding (6×/2×), and convolutional neural networks (3×/1.5×), while maintaining the highest frequency/energy ratio (up to 2.2 GHz at 1 V) of any conventional in-SRAM computing architecture, and a low area overhead of less than 8 percent.
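Functionally (ignoring all circuit-level details of BLADE), sensing two activated wordlines yields bitwise combinations of the stored rows, which can be emulated in a couple of lines:

```python
# Pure software emulation of the bitline-computing effect: activating two
# wordlines and sensing the bitlines yields bitwise AND (and NOR on the
# complementary bitline) of the stored rows in a single access.
import numpy as np

rows = np.array([0b1100_1010, 0b1010_0110], dtype=np.uint8)  # two SRAM rows

def bitline_op(a_row, b_row):
    a, b = rows[a_row], rows[b_row]
    return {"and": a & b, "nor": ~(a | b)}   # both "sensed" at once

res = bitline_op(0, 1)
print(f"AND: {res['and']:08b}  NOR: {res['nor']:08b}")
```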

The RECIPE approach to challenges in deeply heterogeneous high performance systems
Scientific article ArODES

Giovanni Agosta, William Fornaciari, David Atienza, Ramon Canal, Alessandro Cilardo, Marina Zapater

Microprocessors and Microsystems,  2020, vol. 77, 103185

Link to publication

Abstract:

RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computation that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches to maximizing hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. We address it through hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of the application latency and hardware reliability obtained respectively through timing analysis and modeling thermal properties and mean-time-to-failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case.

Genome sequence alignment - design space exploration for optimal performance and energy architectures
Scientific article ArODES

Yasir Mahmood Qureshi, Jose Manuel Herruzo, Marina Zapater, Katzalin Olcoz, Sonia Gonzalez-Navarro

IEEE Transactions on Computers,  2020, vol. 14, no. 8, pp. 1-14

Link to publication

Abstract:

Next generation workloads, such as genome sequencing, have an astounding impact in the healthcare sector. Sequence alignment, the first step in genome sequencing, has experienced recent breakthroughs, which resulted in next generation sequencing (NGS). As NGS applications are memory bounded with random memory access patterns, we propose the use of high bandwidth memories like 3D stacked HBM2, instead of traditional DRAMs like DDR4, along with energy efficient compute cores to improve both performance and energy efficiency. Three state-of-the-art NGS applications, Bowtie2, BWA-MEM and HISAT2, are used as case studies to explore and optimize NGS computing architectures. Then, using the gem5-X architectural simulator, we obtain an overall 68% performance improvement and 71% energy savings using HBM2 instead of DDR4. Furthermore, we propose an architecture based on ARMv8 cores and demonstrate that 16 ARMv8 64-bit OoO cores with HBM2 outperforms 32-cores of Intel Xeon Phi Knights Landing (KNL) processor with 3D stacked memory. Moreover, we show that by using frequency scaling we can achieve up to 59% and 61% energy savings for ARM in-order and OoO cores, respectively. Lastly, we show that many ARMv8 in-order cores at 1.5GHz match the performance of fewer OoO cores at 2GHz, while attaining 4.5x energy savings.

2019

MAGNETIC: multi-agent machine learning-based approach for energy efficient dynamic consolidation in data centers
Scientific article ArODES

Kawsar Haghshenas, Ali Pahlevan, Marina Zapater, Siamak Mohammadi, David Atienza

IEEE Transactions on Services Computing,  Early Access

Link to publication

Abstract:

Improving the energy efficiency of data centers while guaranteeing Quality of Service (QoS), together with detecting performance variability of servers caused by either hardware or software failures, are two of the major challenges for efficient resource management of large-scale cloud infrastructures. Previous works in the area of dynamic Virtual Machine (VM) consolidation are mostly focused on addressing the energy challenge, but fall short in proposing comprehensive, scalable, and low-overhead approaches that jointly tackle energy efficiency and performance variability. Moreover, they usually assume over-simplistic power models, and fail to accurately consider all the delay and power costs associated with VM migration and host power mode transition. These assumptions are no longer valid in modern servers executing heterogeneous workloads and lead to unrealistic or inefficient results. In this paper, we propose a centralized-distributed low-overhead failure-aware dynamic VM consolidation strategy to minimize energy consumption in large-scale data centers. Our approach selects the most adequate power mode and frequency of each host during runtime using a distributed multi-agent Machine Learning (ML) based strategy, and migrates the VMs accordingly using a centralized heuristic. Our Multi-AGent machine learNing-based approach for Energy efficienT dynamIc Consolidation (MAGNETIC) is implemented in a modified version of the CloudSim simulator, considers the energy and delay overheads associated with host power mode transition and VM migration, and is evaluated using power traces collected from various workloads running in real servers and resource utilization logs from cloud data center infrastructures. Results show how our strategy reduces data center energy consumption by up to 15% compared to other works in the state-of-the-art (SoA), guaranteeing the same QoS and reducing the number of VM migrations and host power mode transitions by up to 86% and 90%, respectively. Moreover, it shows better scalability than all other approaches, taking less than 0.7% time overhead to execute for a data center with 1500 VMs. Finally, our solution is capable of detecting host performance variability due to failures, automatically migrating VMs from failing hosts and draining them from workload.

2018

Machine learning-based quality-aware power and thermal management of multistream HEVC encoding on multicore servers
Scientific article ArODES

Arman Iranfar, Marina Zapater, David Atienza

IEEE Transactions on Parallel and Distributed Systems,  2018, vol. 29, no. 10, pp. 2268 - 2281

Link to publication

Abstract:

The emergence of video streaming applications, together with the users' demand for high-resolution contents, has led to the development of new video coding standards, such as High Efficiency Video Coding (HEVC). HEVC provides high efficiency at the cost of increased complexity. This higher computational burden results in increased power consumption in current multicore servers. To tackle this challenge, algorithmic optimizations need to be accompanied by content-aware application-level strategies, able to reduce power while meeting compression and quality requirements. In this paper, we propose a machine learning-based power and thermal management approach that dynamically learns and selects the best encoding configuration and operating frequency for each of the videos running on multicore servers, by using information from frame compression, quality, encoding time, power, and temperature. In addition, we present a resolution-aware video assignment and migration strategy that reduces the peak and average temperature of the chip while maintaining the desirable encoding time. We implemented our approach in an enterprise multicore server and evaluated it under several common scenarios for video providers. On average, compared to a state-of-the-art technique, for the most realistic scenario, our approach improves BD-PSNR and BD-rate by 0.54 dB and 8 percent, respectively, and reduces the encoding time, power consumption, and average temperature by 15.3, 13, and 10 percent, respectively. Moreover, our proposed approach enhances BD-PSNR and BD-rate compared to the HEVC Test Model (HM) by 1.19 dB and 24 percent, respectively, without any encoding time degradation, when power and temperature constraints are relaxed.

Exploring manycore architectures for next-generation HPC systems through the MANGO approach
Scientific article ArODES

José Flich, Giovanni Agosta, Philipp Ampletzer, David Atienza Alonso, Carlo Brandolese, Marina Zapater

Microprocessors and Microsystems,  2018, vol. 61, pp. 154-170

Link to publication

Abstract:

The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated.

Integrating heuristic and machine-learning methods for efficient virtual machine allocation in data centers
Scientific article ArODES

Ali Pahlevan, Xiaoyu Qu, Marina Zapater, David Atienza

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,  2018, vol. 37, no. 8, pp. 1667 - 1680

Link to publication

Abstract:

Modern cloud data centers (DCs) need to tackle efficiently the increasing demand for computing resources and address the energy efficiency challenge. Therefore, it is essential to develop resource provisioning policies that are aware of virtual machine (VM) characteristics, such as CPU utilization and data communication, and applicable in dynamic scenarios. Traditional approaches fall short in terms of flexibility and applicability for large-scale DC scenarios. In this paper, we propose a heuristic- and a machine learning (ML)-based VM allocation method and compare them in terms of energy, quality of service (QoS), network traffic, migrations, and scalability for various DC scenarios. Then, we present a novel hyper-heuristic algorithm that exploits the benefits of both methods by dynamically finding the best algorithm, according to a user-defined metric. For optimality assessment, we formulate an integer linear programming (ILP)-based VM allocation method to minimize energy consumption and data communication, which obtains optimal results, but is impractical at runtime. Our results demonstrate that the ML approach provides up to 24% server-to-server network traffic improvement and reduces execution time by up to 480× compared to conventional approaches, for large-scale scenarios. On the contrary, the heuristic outperforms the ML method in terms of energy and network traffic for reduced scenarios. We also show that the heuristic and ML approaches have up to 6% energy consumption overhead compared to ILP-based optimal solution. Our hyper-heuristic integrates the strengths of both the heuristic and the ML methods by selecting the best one during runtime.
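The hyper-heuristic selection step can be skeletonized as follows (toy allocators and metric weights invented; the real methods are full VM placement algorithms): run both candidate methods on the current scenario and keep the winner according to a user-defined metric:

```python
# Skeleton of runtime algorithm selection by a user-defined metric.
def heuristic_alloc(vms):
    # placeholder result: strong on small scenarios in this toy setup
    return {"energy": 80 + len(vms) // 10, "traffic": 40 + len(vms) // 5}

def ml_alloc(vms):
    # placeholder result: strong on large scenarios in this toy setup
    return {"energy": 110, "traffic": 60 + len(vms) // 100}

def user_metric(result, w_energy=0.7, w_traffic=0.3):
    return w_energy * result["energy"] + w_traffic * result["traffic"]

def hyper_heuristic(vms):
    candidates = {"heuristic": heuristic_alloc(vms), "ml": ml_alloc(vms)}
    best = min(candidates, key=lambda k: user_metric(candidates[k]))
    return best, candidates[best]

print(hyper_heuristic(vms=list(range(50))))     # small DC -> heuristic wins
print(hyper_heuristic(vms=list(range(2000))))   # large DC -> ML wins
```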

Power transmission and workload balancing policies in eHealth mobile cloud computing scenarios
Scientific article ArODES

Josué Pagán, Marina Zapater, José L. Ayala

Future Generation Computer Systems,  2018, vol. 78, no. 2, pp. 587-601

Link to publication

Abstract:

The Internet of Things (IoT) holds big promises for healthcare, especially in proactive personal eHealth. Prediction of symptomatic crises in chronic diseases in the IoT scenario leads to the deployment of ambulatory monitoring systems. A major concern for these systems is the amount of data to be processed and the intelligent management of energy consumption. The huge amount of data generated by these systems requires the high computing capabilities only available in Data Centers. This paper presents a real case of prediction in the eHealth scenario, devoted to neurological disorders. The presented case study focuses on the migraine headache, a disease that affects around 15% of the European population. This paper extrapolates results from real data and simulations in a study where migraine patients are monitored using an unobtrusive Wireless Body Sensor Network. Low-power techniques such as on-node signal processing and radio policies are applied in the monitoring nodes to extend node autonomy and save energy. Workload balancing policies are carried out in the coordinator nodes and Data Centers to reduce the computational burden in these facilities and minimize their energy consumption. Our results show average savings of €288 million in this eHealth scenario applied to only 2% of European migraine sufferers, in addition to savings of €1272 million due to the benefits of migraine prediction.

PowerCool :
Wissenschaftlicher Artikel ArODES
simulation of cooling and powering of 3D MPSoCs with integrated flow cell arrays

Artem Aleksandrovich Andreev, Arvind Sridhar, Mohamed M. Sabry, Marina Zapater, Patrick Ruch

IEEE Transactions on Computers,  2018, vol. 67, no. 1, pp. 73 - 85

Link zur Publikation

Zusammenfassung:

Integrated Flow-Cell Arrays (FCAs) represent a combination of integrated liquid cooling and on-chip power generation, converting chemical energy of the flowing electrolyte solutions to electrical energy. The FCA technology provides a promising way to address both heat removal and power delivery issues in 3D Multiprocessor Systems-on-Chips (MPSoCs). In this paper we motivate the benefits of FCA in 3D MPSoCs via a qualitative analysis and explore the capabilities of the proposed technology using our extended PowerCool simulator. PowerCool is a tool that performs combined compact thermal and electrochemical simulation of 3D MPSoCs with inter-tier FCA-based cooling and power generation. We validate our electrochemical model against experimental data obtained using a micro-scale FCA, and extend PowerCool with a compact thermal model (3D-ICE) and subthreshold leakage estimation. We show the sensitivity of FCA cooling and power generation to the design-time (FCA geometry) and run-time (fluid inlet temperature, flow rate) parameters. Our results show that we can optimize the FCA to keep maximum chip temperature below 95 °C for an average chip power consumption of 50 W/cm² while generating up to 3.6 W per cm² of chip area.

2017

Reconsidering the performance of DEVS modeling and simulation environments using the DEVStone benchmark
Wissenschaftlicher Artikel ArODES

José L. Risco-Martin, Saurabh Mittal, Juan Carlos Fabero Jiménez, Marina Zapater, Román Hermida Correa

SIMULATION,  2017, vol. 93, no. 6, pp. 459–476

Link zur Publikation

Zusammenfassung:

The discrete event system specification formalism, which supports hierarchical and modular model composition, has been widely used to understand, analyze and develop a variety of systems. Discrete event system specification has been implemented in various languages and platforms over the years. The DEVStone benchmark was conceived to generate a set of models with varied structure and behavior, and to automate the evaluation of the performance of discrete event system specification-based simulators. However, DEVStone is still in a preliminary phase and more model analysis is required. In this paper, we revisit DEVStone, introducing new equations to compute the number of events triggered. We also introduce a new benchmark with central processing unit and memory requirements similar to those of the most complex benchmark in DEVStone, but with an easier implementation that is more analytically tractable. Finally, we compare both the performance and memory footprint of five different discrete event system specification simulators on two different hardware platforms.

2016

Runtime data center temperature prediction using grammatical evolution techniques
Wissenschaftlicher Artikel ArODES

Marina Zapater, José L. Risco-Martín, Patricia Arroba, José L. Ayala, José M. Moya

Applied Soft Computing,  2016, vol. 49, pp. 94-107

Link zur Publikation

Zusammenfassung:

Data Centers are huge power consumers, both because of the energy required for computation and the cooling needed to keep servers below thermal redlining. The most common technique to minimize cooling costs is increasing data room temperature. However, to avoid reliability issues, and to enhance energy efficiency, there is a need to predict the temperature attained by servers under variable cooling setups. Due to the complex thermal dynamics of data rooms, accurate runtime data center temperature prediction has remained an important challenge. By using Grammatical Evolution techniques, this paper presents a methodology for the generation of temperature models for data centers and the runtime prediction of CPU and inlet temperature under variable cooling setups. As opposed to time-costly Computational Fluid Dynamics techniques, our models do not need specific knowledge about the problem, can be used in arbitrary data centers, can be re-trained if conditions change, and have negligible overhead during runtime prediction. Our models have been trained and tested using traces from real Data Center scenarios. Our results show how we can fully predict the temperature of the servers in a data room, with prediction errors below 2 °C and 0.5 °C for CPU and server inlet temperature, respectively.
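
To make the approach concrete, here is a deliberately tiny sketch of the grammar-to-expression mapping at the core of Grammatical Evolution; the grammar, features, and fitness loop are invented for illustration and are far simpler than the paper's setup.

```python
# Toy Grammatical Evolution: a genome of integers is mapped through a small
# grammar to an arithmetic expression over (made-up) sensor features, then
# scored against measured CPU temperatures.
import random

GRAMMAR = {
    "<expr>": [["<expr>", "+", "<expr>"], ["<expr>", "*", "<expr>"],
               ["<var>"], ["<const>"]],
    "<var>": [["inlet_temp"], ["cpu_power"], ["fan_speed"]],
    "<const>": [["0.5"], ["2.0"]],
}

def decode(genome, symbol="<expr>", depth=0):
    """Map a genome (list of ints) to an expression string, depth-limited."""
    if symbol not in GRAMMAR:
        return symbol, genome
    rules = GRAMMAR[symbol]
    if symbol == "<expr>" and depth >= 4:
        rules = rules[2:]                # force terminal-producing rules when deep
    rule = rules[genome[0] % len(rules)]
    genome = genome[1:] + genome[:1]     # rotate codons instead of consuming
    parts = []
    for sym in rule:
        text, genome = decode(genome, sym, depth + 1)
        parts.append(text)
    return " ".join(parts), genome

def fitness(genome, samples):
    expr, _ = decode(genome)
    err = sum((eval(expr, {}, dict(feats)) - meas) ** 2 for feats, meas in samples)
    return err / len(samples), expr

random.seed(0)
samples = [({"inlet_temp": 24.0, "cpu_power": 80.0, "fan_speed": 0.6}, 55.0),
           ({"inlet_temp": 27.0, "cpu_power": 120.0, "fan_speed": 0.8}, 68.0)]
population = [[random.randrange(100) for _ in range(16)] for _ in range(60)]
mse, expr = min((fitness(g, samples) for g in population), key=lambda t: t[0])
print(f"best expression: {expr}  (MSE {mse:.2f})")
```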

2015

Leakage-aware cooling management for improving server energy efficiency
Wissenschaftlicher Artikel ArODES

Marina Zapater, Ozan Tuncer, José L. Ayala, José M. Moya, Kalyan Vaidyanathan

IEEE Transactions on Parallel and Distributed Systems,  2015, vol. 26, no. 10, pp. 2764 - 2777

Link zur Publikation

Zusammenfassung:

The computational and cooling power demands of enterprise servers are increasing at an unsustainable rate. Understanding the relationship between computational power, temperature, leakage, and cooling power is crucial to enable energy-efficient operation at the server and data center levels. This paper develops empirical models to estimate the contributions of static and dynamic power consumption in enterprise servers for a wide range of workloads, and analyzes the interactions between temperature, leakage, and cooling power for various workload allocation policies. We propose a cooling management policy that minimizes the server energy consumption by setting the optimum fan speed during runtime. Our experimental results on a presently shipping enterprise server demonstrate that including leakage awareness in workload and cooling management provides additional energy savings without any impact on performance.
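
The core trade-off above, fan power versus temperature-dependent leakage, fits in a few lines. A minimal sketch follows, with an invented thermal and power model; all coefficients are illustrative, not measured.

```python
# Leakage-aware fan control in miniature: pick the fan speed that minimizes
# total server power = dynamic + leakage(T) + fan. Toy models throughout.
def steady_temp(dyn_power_w, fan_rpm):
    return 40.0 + 0.5 * dyn_power_w - 0.004 * fan_rpm   # hotter at high power

def leakage_w(temp_c):
    return 5.0 * 1.04 ** (temp_c - 40.0)      # leakage grows ~exponentially with T

def fan_w(fan_rpm):
    return 5e-11 * fan_rpm ** 3               # fan power is roughly cubic in speed

def best_fan_speed(dyn_power_w, speeds=range(1000, 8001, 500)):
    total = lambda rpm: dyn_power_w + leakage_w(steady_temp(dyn_power_w, rpm)) + fan_w(rpm)
    return min(speeds, key=total)

for p in (40.0, 80.0, 120.0):
    print(f"{p:.0f} W dynamic -> optimal fan speed {best_fan_speed(p)} rpm")
```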

Self-organizing maps versus growing neural gas in detecting anomalies in data centres
Wissenschaftlicher Artikel ArODES

Marina Zapater, David Fraga, Pedro Malagón, Zorana Bankovic, José M. Moya

Logic Journal of IGPL,  2015, vol. 23, no. 3, pp. 495–505

Link zur Publikation

Zusammenfassung:

Reliability is one of the key performance factors in data centres. The out-of-scale energy costs of these facilities lead data centre operators to increase the ambient temperature of the data room to decrease cooling costs. However, increasing ambient temperature reduces the safety margins and can result in a higher number of anomalous events. Anomalies in the data centre need to be detected as soon as possible to optimize cooling efficiency and mitigate the harmful effects over servers. This article proposes the usage of clustering-based outlier detection techniques coupled with a trust and reputation system engine to detect anomalies in data centres. We show how self-organizing maps and growing neural gas can be applied to detect cooling and workload anomalies, respectively, in a real data centre scenario with very good detection and isolation rates, in a way that is robust to the malfunction of the sensors that gather server and environmental information.
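
The clustering-based outlier detection can be illustrated with a tiny self-organizing map: fit it to "normal" sensor readings, then flag readings whose distance to their best-matching unit exceeds a threshold learned from the training data. Data, map size, and threshold below are toy choices, not the article's setup.

```python
import numpy as np

def train_som(data, n_units=16, epochs=30, lr0=0.5, radius0=4.0):
    rng = np.random.default_rng(0)
    units = data[rng.choice(len(data), n_units)]        # init from samples
    for e in range(epochs):
        lr = lr0 * (1 - e / epochs)
        radius = max(radius0 * (1 - e / epochs), 1.0)
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(((units - x) ** 2).sum(axis=1))
            dist = np.abs(np.arange(n_units) - bmu)     # 1-D map topology
            h = np.exp(-(dist ** 2) / (2 * radius ** 2))
            units += lr * h[:, None] * (x - units)
    return units

def anomaly_scores(units, data):
    return np.sqrt(((data[:, None, :] - units[None]) ** 2).sum(-1)).min(axis=1)

rng = np.random.default_rng(1)
normal = rng.normal([24.0, 55.0], [1.0, 3.0], size=(500, 2))   # (inlet T, CPU T)
som = train_som(normal)
threshold = np.quantile(anomaly_scores(som, normal), 0.99)
suspicious = np.array([[24.5, 56.0], [35.0, 80.0]])            # second row is a fault
print(anomaly_scores(som, suspicious) > threshold)             # expected: [False  True]
```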

Energy-aware policies in ubiquitous computing facilities
Buchkapitel ArODES

Marina Zapater, Patricia Arroba, José Luis Ayala Rodrigo, Katzalin Olcoz Herrero, José Manuel Moya Fernandez

Cloud computing with e-science applications  (20 p.). 2015,  Boca Raton : CRC Press

Link zur Publikation

Zusammenfassung:

This chapter provides a vision of the increasing energy problem in computing facilities, with a focus on cloud computing under the new computational paradigms, and proposes solutions from a global, multilayer perspective, describing a novel system architecture, power models, and optimization algorithms. Researchers have done a massive amount of work to address the issues and provide energy-aware computing environments. Consolidation allows reducing the number of operating servers to process the same workload, minimizing the static consumption, which leads to operating server set and turn-off policies. Cloud computing, mobile cloud computing, and even modern high-performance computing all start with data centers, even as we dream of a world in which anyone may sell their excess computing capacity as virtualized resources to anyone else, or in which ubiquitously sensed information is processed by a center kilometers away from its source.

Comparative study of meta-heuristic 3D floorplanning algorithms
Wissenschaftlicher Artikel ArODES

Alfredo Cuesta-Infante, J. Manuel Colmenar, Zorana Bankovic, José L. Risco-Martín, Marina Zapater

Neurocomputing,  2015, vol. 150, part A, pp. 67-81

Link zur Publikation

Zusammenfassung:

The constant need to improve performance has driven the invention of 3D chips. The improvement is achieved through the reduction of wire length, which results in decreased interconnection delay. However, 3D stacks dissipate heat less effectively because of their inner layers, which leads to increased temperature and the appearance of hot spots. This problem can be mitigated through appropriate floorplanning. For this reason, in this work we present and compare five different solutions for floorplanning of 3D chips. Each solution uses a different representation, and all are based on meta-heuristic algorithms: three of them on simulated annealing and the other two on evolutionary algorithms. The results show that all the solutions are highly capable of optimizing temperature and wire length, exhibiting significant improvements compared to the benchmark floorplans.
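
As a flavor of the simulated-annealing variants compared here, the sketch below places blocks on a 3D grid and minimizes a cost mixing wire length with a crude thermal proxy via a classic Metropolis loop. The cost model and move set are deliberately simplistic placeholders, not the paper's representations.

```python
import math, random

def cost(pos, nets, power):
    wl = 0
    for net in nets:                     # half-perimeter wire length, in 3D
        pts = [pos[b] for b in net]
        wl += sum(max(p[d] for p in pts) - min(p[d] for p in pts) for d in range(3))
    # crude thermal proxy: hot blocks on the middle tier dissipate worst
    heat = sum(power[b] for b, (_, _, z) in pos.items() if z == 1)
    return wl + 0.5 * heat

def anneal(blocks, nets, power, grid=(4, 4, 3), iters=5000, t0=10.0):
    random.seed(1)
    cells = [(x, y, z) for x in range(grid[0])
                       for y in range(grid[1]) for z in range(grid[2])]
    place = dict(zip(blocks, random.sample(cells, len(blocks))))
    best, best_c = dict(place), cost(place, nets, power)
    cur_c, t = best_c, t0
    for i in range(iters):
        a, b = random.sample(blocks, 2)            # move: swap two blocks
        place[a], place[b] = place[b], place[a]
        new_c = cost(place, nets, power)
        if new_c < cur_c or random.random() < math.exp((cur_c - new_c) / t):
            cur_c = new_c                          # accept (Metropolis rule)
            if new_c < best_c:
                best, best_c = dict(place), new_c
        else:
            place[a], place[b] = place[b], place[a]  # reject: undo swap
        t = t0 * (1.0 - i / iters) + 1e-3            # linear cooling schedule
    return best, best_c

blocks = ["cpu0", "cpu1", "cache", "gpu", "io"]
nets = [("cpu0", "cache"), ("cpu1", "cache"), ("gpu", "cache"), ("io", "cpu0")]
power = {"cpu0": 5.0, "cpu1": 5.0, "cache": 1.0, "gpu": 8.0, "io": 0.5}
print(anneal(blocks, nets, power))
```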

Enhancing regression models for complex systems using evolutionary techniques for feature engineering
Wissenschaftlicher Artikel ArODES

Patricia Arroba, José L. Risco-Martín, Marina Zapater, José M. Moya, José L. Ayala

Journal of Grid Computing,  2015, vol. 13, pp. 409–423

Link zur Publikation

Zusammenfassung:

This work proposes an automatic methodology for modeling complex systems. Our methodology is based on the combination of Grammatical Evolution and classical regression to obtain an optimal set of features that form part of a linear and convex model. This technique provides both Feature Engineering and Symbolic Regression in order to infer accurate models without requiring designer effort or expertise. As advanced Cloud services become mainstream, the contribution of data centers to the overall power consumption of modern cities is growing dramatically. These facilities consume from 10 to 100 times more power per square foot than typical office buildings. Modeling the power consumption of these infrastructures is crucial to anticipate the effects of aggressive optimization policies, but accurate and fast power modeling is a complex challenge for high-end servers that analytical approaches have not yet satisfied. For this case study, our methodology minimizes the error in power prediction. The work has been tested using real Cloud applications, resulting in an average power estimation error of 3.98%. Our work improves the prospects of deriving energy-efficient policies in Cloud data centers and is applicable to other computing environments with similar characteristics.
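
A compressed illustration of combining evolved features with a classical convex fit: candidate nonlinear features (stand-ins for what Grammatical Evolution would discover) are scored by the error of the ordinary-least-squares model built on top of them, and the best subset wins. All data and feature names are synthetic.

```python
import itertools
import numpy as np

def ols_rmse(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sqrt(np.mean((X @ coef - y) ** 2))

rng = np.random.default_rng(0)
util = rng.uniform(0.1, 1.0, 200)            # CPU utilization
freq = rng.uniform(1.0, 3.0, 200)            # GHz
power = 50 + 30 * util * freq ** 2 + rng.normal(0, 1, 200)   # "measured" power

features = {                                  # candidate feature pool
    "u": util, "f": freq, "u*f": util * freq,
    "u*f^2": util * freq ** 2, "f^2": freq ** 2,
}
best = min(
    (c for k in (1, 2) for c in itertools.combinations(features, k)),
    key=lambda c: ols_rmse(
        np.column_stack([np.ones(200)] + [features[n] for n in c]), power),
)
print("selected:", best)   # 'u*f^2' should appear in the winning subset
```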

2014

A novel energy-driven computing paradigm for e-health scenarios
Wissenschaftlicher Artikel ArODES

Marina Zapater, Patricia Arroba, José L. Ayala, José M. Moya, Katzalin Olcoz

Future Generation Computer Systems,  2014, vol. 34, pp. 138-154

Link zur Publikation

Zusammenfassung:

A first-rate e-Health system saves lives, provides better patient care, allows complex but useful epidemiologic analysis and saves money. However, there may also be concerns about the costs and complexities associated with e-Health implementation, and the energy footprint of the highly demanding computing facilities must be addressed. This paper proposes a novel and evolved computing paradigm that: (i) provides the required computing and sensing resources; (ii) allows population-wide diffusion; (iii) exploits the storage, communication and computing services provided by the Cloud; (iv) tackles the energy-optimization issue as a first-class requirement, taking it into account during the whole development cycle. The novel computing concept and the multi-layer top-down energy-optimization methodology obtain promising results in a realistic scenario for cardiovascular tracking and analysis, making Home Assisted Living a reality.

2012

GreenDisc :
Buchkapitel ArODES
a HW/SW energy optimization framework in globally distributed computation

Marina Zapater, José L. Ayala, Jose M. Moya

Ubiquitous Computing and Ambient Intelligence  (8 p.). 2012,  Heidelberg : Springer

Link zur Publikation

Zusammenfassung:

In the near future, wireless sensor networks (WSNs) will experience broad, large-scale deployment (millions of nodes at the national level), with multiple information sources per node and very specific signal processing requirements. In parallel, the broad deployment of WSNs facilitates the definition and execution of ambitious studies, with a large input data set and high computational complexity. These computing requirements, often heterogeneous and driven on demand, can only be satisfied by high-performance Data Centers (DCs). The high economic and environmental impact of DC energy consumption calls for aggressive energy optimization policies; the need for such policies has been identified, but effective ones have yet to be proposed. In this context, this paper presents the following ongoing research lines and the results obtained. In the field of WSNs: energy optimization in the processing nodes from different abstraction levels, including reconfigurable application-specific architectures, efficient customization of the memory hierarchy, energy-aware management of the wireless interface, and design automation for signal processing applications. In the field of DCs: energy-optimal workload assignment policies in heterogeneous DCs, resource management policies with energy consciousness, and efficient cooling mechanisms that will cooperate in the minimization of the electricity bill of the DCs that process the data provided by the WSNs.

Ubiquitous green computing techniques for high demand applications in smart environments
Wissenschaftlicher Artikel ArODES

Marina Zapater, Cesar Sanchez, Jose L. Ayala, Jose M. Moya, José L. Risco-Martín

Sensors,  2012, vol. 12, no. 8, pp. 10659-10677

Link zur Publikation

Zusammenfassung:

Ubiquitous sensor network deployments, such as the ones found in Smart City and Ambient Intelligence applications, impose constantly increasing computational demands in order to process data and offer services to users. The nature of these applications implies the usage of data centers. Research has paid much attention to the energy consumption of the sensor nodes in WSN infrastructures. However, it is the supercomputing facilities that present the higher economic and environmental impact, due to their very high power consumption. The latter problem, however, has been disregarded in the field of smart environment services. This paper proposes an energy-minimization workload assignment technique, based on heterogeneity and application-awareness, that redistributes low-demand computational tasks from high-performance facilities to idle nodes with low and medium resources in the WSN infrastructure. These allocation policies, although non-optimal, reduce the energy consumed by the whole infrastructure as well as the total execution time.

Compiler optimizations as a countermeasure against side-channel analysis in MSP430-based devices
Wissenschaftlicher Artikel ArODES

Pedro Malagón, Juan-Mariano de Goyeneche, Marina Zapater, José M. Moya, Zorana Bankovic

Sensors,  2012, vol. 12, no. 6, pp. 7994-8012

Link zur Publikation

Zusammenfassung:

Ambient Intelligence (AmI) requires devices everywhere: dynamic and massively distributed networks of low-cost nodes that, among other data, manage private information or control restricted operations. The MSP430, a 16-bit microcontroller, is used in WSN platforms such as the TelosB. Physical access to devices cannot be restricted, so attackers target them in order to gain access to the network. Side-channel analysis (SCA) easily exploits leakages from the execution of encryption algorithms that are dependent on critical data to guess the key value. In this paper we present an evaluation framework that facilitates the analysis of the effects of compiler and backend optimizations on the resistance against statistical SCA. We propose an optimization-based software countermeasure that can be used in current low-cost devices to radically increase resistance against statistical SCA, analyzed with the new framework.

2024

Cross-layer exploration of 2.5D energy-efficient heterogeneous chiplets integration :
Konferenz ArODES
from system simulation to open hardware

Anna Burdina, Gabriel Catel Torres, Davide Schiavone, Miguel Peón-Quirós, Giovanni Ansaloni, David Atienza, Marina Zapater

Proceedings of the ISLPED '24: Proceedings of the 29th ACM/IEEE International Symposium on Low Power Electronics and Design

Link zur Konferenz

Zusammenfassung:

In the past decade, computing systems have significantly increased in complexity and power consumption. Nowadays, heterogeneous multi-processor systems-on-chip (MPSoCs) integrate many computing cores. Heterogeneous MPSoCs often comprise general-purpose processors and a variety of accelerators, thus supporting specialized functions for the target application domain to minimize overall energy when executing a specific task. The ensuing architectural design space is, therefore, increasingly multi-dimensional, especially in light of upcoming 2.5D/3D chiplet integration, which, on the one hand, allows unprecedented system integration possibilities but, on the other, exacerbates data transfer bottlenecks and affects overall power consumption significantly. To traverse such a space in search of high-performance/high-efficiency solutions, we introduce a cross-layer approach combining fast exploration using virtual systems with modular open-hardware design frameworks. This paper showcases how these two approaches effectively cross-fertilize: detailed hardware designs are essential in calibrating performance, power and temperature models, and validating simulation outcomes. Conversely, full-system simulation is crucial for projecting the impact of design choices towards complex but energy-efficient heterogeneous multi-processor architectures.

DroneBandit :
Konferenz ArODES
multi-armed contextual bandits for collaborative edge-to-cloud inference in resource-constrained nanodrones

Guillaume Chacun, Mehdi Akeddar, Thomas Rieder, Bruno Da Rocha Carvalho, Marina Zapater

Proceedings of GLSVLSI '24: Proceedings of the Great Lakes Symposium on VLSI 2024

Link zur Konferenz

Zusammenfassung:

In recent years, Artificial Intelligence (AI) has seen a remarkable expansion. Traditionally, applications primarily relied on edge computing, close to data sources. Today, however, edge computing approaches are outstripped by modern AI demands. In this paper, we introduce a novel framework together with an online decision algorithm based on multi-armed contextual bandits (DroneBandit) for the dynamic allocation of inference tasks between edge and cloud, to increase inference performance. Our test environment consists of a resource-constrained nanodrone equipped with a custom DNN for obstacle detection, able to achieve 80% accuracy and 71% F2-score. DroneBandit runs on the drone and chooses the optimal cutting point for each iteration on the fly. The decision is based on predicted back-end delays (i.e. data transfer and inference time on the cloud) and observed front-end delays (i.e. inference time on the edge). DroneBandit achieves 83% accuracy and 89% Top-3 accuracy in predicting the ideal cutting point in simulations. Our framework demonstrates enhanced task allocation adaptability, enabling efficient computation offloading and edge computing reliance in varying network conditions. Our experiments, conducted on a nanodrone with an ARM CPU and a GAP8 RISC-V accelerator, incorporate quantization and optimization, showcasing efficient obstacle detection in dynamic scenarios.
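
The bandit loop described above can be sketched compactly: each "arm" is a DNN cutting point, the reward signal is the total (front-end plus back-end) delay, and an epsilon-greedy rule picks the arm. The delay numbers below are synthetic stand-ins, not the paper's algorithm or measurements.

```python
import random

class CutPointBandit:
    def __init__(self, n_cuts, epsilon=0.1):
        self.eps = epsilon
        self.n = [0] * n_cuts               # pulls per arm
        self.mean = [0.0] * n_cuts          # running mean total delay per arm

    def choose(self):
        if random.random() < self.eps or 0 in self.n:
            return random.randrange(len(self.n))   # explore / warm-up
        return min(range(len(self.n)), key=lambda a: self.mean[a])

    def update(self, arm, total_delay):
        self.n[arm] += 1
        self.mean[arm] += (total_delay - self.mean[arm]) / self.n[arm]

random.seed(0)
bandit = CutPointBandit(n_cuts=4)
for step in range(200):
    arm = bandit.choose()
    front = [5, 12, 20, 30][arm] * random.uniform(0.9, 1.1)   # edge inference, ms
    back = [40, 25, 10, 4][arm] * random.uniform(0.5, 2.0)    # transfer + cloud, ms
    bandit.update(arm, front + back)
print("best cutting point:", bandit.mean.index(min(bandit.mean)))
```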

2023

REMOTE :
Konferenz ArODES
re-thinking task mapping on wireless 2.5D systems-on-package for hotspot removal

Rafael Medina, Darong Huang, Giovanni Ansaloni, Marina Zapater, David Atienza

Proceedings of the 2023 IFIP/IEEE 31st International Conference on Very Large Scale Integration (VLSI-SoC)

Link zur Konferenz

Zusammenfassung:

2.5D Systems-on-Package (SoPs) are composed of several chiplets placed on an interposer. They are becoming increasingly popular as they enable easy integration of electronic components in the same package and high fabrication yields. Nevertheless, they introduce a new bottleneck in inter-chiplet communication, which must be routed through the interposer. Such a constraint favors mapping related tasks on computing cores within the same chiplet, leading to thermal hotspots. In-package wireless technology holds promise to relax this constraint, because integrated wireless antennas provide low-latency, high-bandwidth communication paths that bypass the interposer bottleneck. In this work, we propose a new task mapping heuristic that leverages in-package wireless technology to improve the thermal behavior of 2.5D SoPs executing complex applications. Combining system simulation and thermal modeling, our results show that we can distribute computation in wireless 2.5D SoPs to reduce peak temperatures by up to 24% through task mapping with a negligible performance impact.

Validating full-system RISC-V simulator :
Konferenz ArODES
a systematic approach

Karan Pathak, Joshua Klein, Giovanni Ansaloni, Marina Zapater, David Atienza

Proceedings of RISC-V Summit Europe

Link zur Konferenz

Zusammenfassung:

RISC-V-based Systems-on-Chip (SoCs) are witnessing a steady rise in adoption in both industry and academia. However, the limited support for Linux-capable full-system simulators hampers the development of the RISC-V ecosystem. We address this by validating a full system-level simulator, gXR5 (gem5-eXtensions for RISC-V), against the SiFive HiFive Unleashed SoC, to ensure performance statistics are representative of actual hardware. This work also enriches existing methodologies for validating the gXR5 simulator against hardware by proposing a systematic component-level calibration approach. The simulator error for selected SPEC CPU2017 applications is reduced from 44% to 24% just by calibrating the CPU. We show that this systematic component-level calibration approach is accurate, fast (in terms of simulation time), and generic enough to drive future validation efforts.

An opensource framework for edge-to-cloud inference on resource-constrained RISC-V systems
Konferenz ArODES

Mehdi Akeddar, Thomas Rieder, Guillaume Chacun, Bruno Da Rocha Carvalho, Marina Zapater

Proceedings of the RISC-V Summit Europe, 5-9th June 2023, Barcelona, Spain

Link zur Konferenz

Zusammenfassung:

In recent years we have witnessed increasing adoption of RISC-V-based systems to run Artificial Intelligence (AI) inference tasks. This trend extends to visual navigation, where major players are starting to adopt RISC-V for autonomous driving. Still, RISC-V-based edge devices fall short of the performance requirements of complex AI inference. Our work tackles these challenges by proposing an open-source framework for the transparent distribution of visual navigation inference tasks between edge and cloud for resource-constrained RISC-V edge devices. Our framework automates the partitioning of ONNX and TFLite models between a RISC-V-accelerated nanodrone equipped with a GAP8 system-on-chip and a cloud server. Our results showcase how partial inference improves performance compared to drone-only inference.
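
For the ONNX side, a split at a chosen cutting point can be reproduced with the standard onnx.utils.extract_model helper. The model file and tensor name below are hypothetical placeholders; the paper's framework automates the choice of the cut.

```python
import onnx
from onnx.utils import extract_model

MODEL, CUT = "navigation.onnx", "backbone/block3_out"   # placeholder names

model = onnx.load(MODEL)
inputs = [i.name for i in model.graph.input]
outputs = [o.name for o in model.graph.output]

# edge part: graph inputs -> intermediate tensor at the cutting point
extract_model(MODEL, "edge_part.onnx", inputs, [CUT])
# cloud part: intermediate tensor -> original outputs
extract_model(MODEL, "cloud_part.onnx", [CUT], outputs)
```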

System-level exploration of in-package wireless communication for multi-chiplet platforms
Konferenz ArODES

Rafael Medina, Joshua Kein, Giovanni Ansaloni, Marina Zapater, Sergi Abadal, Eduard Alarcón, David Atienza

Proceedings of the 28th Asia and South Pacific Design Automation Conference (ASPDAC '23)

Link zur Konferenz

Zusammenfassung:

Multi-Chiplet architectures are being increasingly adopted to support the design of very large systems in a single package, facilitating the integration of heterogeneous components and improving manufacturing yield. However, chiplet-based solutions have to cope with limited inter-chiplet routing resources, which complicate the design of the data interconnect and the power delivery network. Emerging in-package wireless technology is a promising strategy to address these challenges, as it allows implementing flexible chiplet interconnects while freeing package resources for power supply connections. To assess the capabilities of such an approach and its impact from a full-system perspective, herein we present an exploration of the performance of in-package wireless communication, based on dedicated extensions to the gem5-X simulator. We consider different Medium Access Control (MAC) protocols, as well as applications with different runtime profiles, showcasing that current in-package wireless solutions are competitive with wired chiplet interconnects. Our results show how in-package wireless solutions can outperform wired alternatives when running artificial intelligence workloads, achieving up to a 2.64× speed-up when running deep neural networks (DNNs) on a chiplet-based system with 16 cores distributed in four clusters.

2022

Accurate thermal modeling of heterogeneous multi-core processors
Konferenz ArODES

Darong Huang, Luis Costero, Federico Terraneo, Marina Zapater, David Atienza

Proceedings of HSSB Workshop - ISCA 2022

Link zur Konferenz

Zusammenfassung:

The increasing power density of modern multi-core processors using deep nano-scale technologies has entailed severe thermal issues for such chips. Indeed, industry's heterogeneous chip design trends exacerbate transient non-uniform thermal hotspots in next-generation processors. Different cooling solutions are coming into play to alleviate this situation, such as liquid, evaporative, or thermoelectric cooling, among others. Hence, a new generation of thermal simulators with unprecedented flexibility is required to include such cooling technologies in the modeling of nano-scale IC designs. This work presents 3D-ICE 3.1, the first thermal simulator designed for fully customized non-uniform modeling and accurate co-simulation with different heat dissipation systems.

2021

Exact neural networks from inexact multipliers via Fibonacci weight encoding
Konferenz ArODES

William Andrew Simon, Valérian Ray, Alexandre Levisse, Giovanni Ansaloni, Marina Zapater, David Atienza

Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC)

Link zur Konferenz

Zusammenfassung:

Edge devices must support computationally demanding algorithms, such as neural networks, within tight area/energy budgets. While approximate computing may alleviate these constraints, limiting induced errors remains an open challenge. In this paper, we propose a hardware/software co-design solution via an inexact multiplier, reducing area/power-delay-product requirements by 73/43%, respectively, while still computing exact results when one input is a Fibonacci encoded value. We introduce a retraining strategy to quantize neural network weights to Fibonacci encoded values, ensuring exact computation during inference. We benchmark our strategy on Squeezenet 1.0, DenseNet-121, and ResNet-18, measuring accuracy degradations of only 0.4/1.1/1.7%.
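
The retraining step quantizes each weight to a "Fibonacci encoded" value, i.e. an integer whose binary representation contains no two adjacent ones, so the inexact multiplier never triggers its error case. A toy version of that projection (sign handling and the actual retraining loop omitted):

```python
def is_fib_encoded(n: int) -> bool:
    return (n & (n >> 1)) == 0            # no two adjacent set bits

def nearest_fib_encoded(n: int) -> int:
    for delta in range(n + 2):            # search outward from n
        for cand in (n - delta, n + delta):
            if cand >= 0 and is_fib_encoded(cand):
                return cand

# 8-bit magnitude examples: values with adjacent ones snap to a nearby
# Fibonacci-encoded value, e.g. 44 (0b00101100) -> 42 (0b00101010).
for w in (3, 7, 44, 108, 200):
    q = nearest_fib_encoded(w)
    print(f"{w:4d} ({w:08b}) -> {q:4d} ({q:08b})")
```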

Architecting more than Moore :
Konferenz ArODES
wireless plasticity for massive heterogeneous computer architectures (WiPLASH)

Joshua Klein, Alexandre Levisse, Giovanni Ansaloni, David Atienza, Marina Zapater, Martino Dazzi, Geethan Karunaratne, Irem Boybat, Abu Sebastian, Davide Rossi, Francesco Conti, Elana Pereira de Santana, Peter Haring Bolívar, Mohamed Saeed, Renato Negra, Zhenxing Wang, Kun-Ta Wang, Max C. Lemme, Akshay Jain, Robert Guirado, Hamidreza Taghvaee, Sergi Abadal

Proceedings of the 18th ACM International Conference on Computing Frontiers (CF'21)

Link zur Konferenz

Zusammenfassung:

This paper presents the research directions pursued by the WiPLASH European project, pioneering on-chip wireless communications as a disruptive enabler towards next-generation computing systems for artificial intelligence (AI). We illustrate the holistic approach driving our research efforts, which encompass expertise and abstraction levels ranging from the physical design of embedded graphene antennas to the system-level evaluation of wirelessly communicating heterogeneous systems.

2020

A hybrid cache HW/SW stack for optimizing neural network runtime, power and endurance
Konferenz ArODES

Alexandre Levisse, Marina Zapater, David Atienza

Proceedings of 28th IFIP/IEEE International Conference on Very Large Scale Integration

Link zur Konferenz

Zusammenfassung:

Hybrid caches consisting of both SRAM and emerging Non-Volatile Random Access Memory (eNVRAM) bitcells increase cache capacity and reduce power consumption by taking advantage of eNVRAM's small area footprint and low leakage energy. However, they also inherit eNVRAM's drawbacks, including long write latency and limited endurance. To mitigate these drawbacks, many works propose heuristic strategies to allocate memory blocks into SRAM or eNVRAM arrays at runtime based on block content or access pattern. In contrast, this work presents a HW/SW Stack for Hybrid Caches (SHyCache), consisting of a hybrid cache architecture and a supporting programming model reminiscent of those that enable GP-GPU acceleration. With SHyCache, application variables can be allocated explicitly to the eNVRAM cache, eliminating the need for heuristics and reducing cache access time, power consumption, and area overhead, while maintaining maximal cache utilization efficiency and ease of programming. SHyCache improves performance for applications such as neural networks, which contain large numbers of invariant weight values with high read/write access ratios that can be explicitly allocated to the eNVRAM array. We simulate SHyCache on the gem5-X architectural simulator and demonstrate its utility by benchmarking a range of cache hierarchy variations using three neural networks, namely Inception v4, ResNet-50, and SqueezeNet 1.0. We demonstrate a design space that can be exploited to optimize performance, power consumption, or endurance, depending on the expected use case of the architecture, while demonstrating maximum performance gains of 1.7/1.4/1.3x and power consumption reductions of 5.1/5.2/5.4x, for Inception/ResNet/SqueezeNet, respectively.

Towards Deeply Scaled 3D MPSoCs with Integrated Flow Cell Array Technology
Konferenz ArODES

Halima Najibi, Alexandre Levisse, Marina Zapater, Sabry Aly Mohamed M., David Atienza

Proceedings of GLSVLSI '20: Proceedings of the 2020 on Great Lakes Symposium on VLSI, September 2020, Online

Link zur Konferenz

Zusammenfassung:

Deeply-scaled three-dimensional (3D) Multi-Processor Systems-on-Chip (MPSoCs) enable high performance and massive communication bandwidth for next-generation computing. However, as process nodes shrink, temperature-dependent leakage dramatically increases, and thermal and power management becomes problematic. In this context, Integrated Flow Cell Array (FCA) technology, which consists of inter-tier microfluidic channels, combines on-chip electrochemical power generation and liquid cooling of 3D MPSoCs. When connected to the power delivery networks (PDNs) of dies, FCAs provide an additional current compensating the voltage drop (IR-drop). In this paper, we evaluate for the first time how the IR-drop reduction and cooling capabilities of FCAs scale with advanced CMOS processes. We develop a framework to quantify the system-level impact of FCAs at technology nodes from 22 nm to 3 nm. Our results show that, across all considered nodes, FCAs reduce the peak temperature of a multi-core processor (MCP) and a Machine Learning (ML) accelerator by over 22°C and 35°C, respectively, compared to off-chip direct liquid cooling. Moreover, the low operation voltages and high temperatures at advanced nodes improve FCA power generation by up to 2×. Hence, FCAs allow keeping the IR-drop below 5% for both the MCP and the ML accelerator, saving over 10% of TSV-reserved area, as opposed to using a High-Performance Computing (HPC) MPSoC liquid cooling solution.

Enabling optimal power generation of flow cell arrays in 3D MPSoCs with on-chip switched capacitor converters
Konferenz ArODES

Halima Najibi, Jorge Hunter, Alexandre Levisse, Marina Zapater, Miroslav Vasic, David Atienza

Proceedings of 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 6-8 July 2020, Limassol, Cyprus

Link zur Konferenz

Zusammenfassung:

Flow cell arrays (FCAs) provide efficient on-chip liquid cooling and electrochemical power generation capabilities in three-dimensional multi-processor systems-on-chip (3D MPSoCs). When connected to the power delivery networks (PDNs) of chips, the current flowing between FCA electrodes partially supplies logic gates and compensates over 20% Vdd drop in high-performance 3D systems. However, the operation voltages of CMOS technologies are generally higher than the voltage corresponding to maximal FCA power generation. Hence, directly connecting FCAs to 3D MPSoC power grids results in sub-optimal performance. In this paper, we design an on-chip direct current to direct current (DC-DC) converter to improve FCA power generation in high-performance 3D MPSoCs. We use switched capacitor (SC) technology and explore different design space parameters to achieve minimal area requirement and maximal power extraction. The proposed converter enables a stable and optimal voltage between FCA electrodes. Furthermore, it allows us to dynamically control FCA connectivity to 3D PDNs and switch off power extraction during chip inactivity. We show that regulated FCAs generate up to 123% higher power with respect to the case in which they are directly connected to 3D PDNs. In addition, connecting multiple flow cells to a single optimized converter reduces the area requirement down to 1.26%, while maintaining IR-drop below 5%. Finally, we show that activity-based dynamic FCA switching extends electrolyte lifetime by over 1.8× and 4.5× for processor duty cycles of 50% and 20%, respectively.

Dynamic thermal management with proactive fan speed control through reinforcement learning
Konferenz ArODES

Arman Iranfar, Federico Terraneo, Gabor Csordas, Marina Zapater, William Fornaciari

Proceedings of the 2020 Design, Automation & Test in Europe Conference

Link zur Konferenz

Zusammenfassung:

Dynamic Thermal Management (DTM) has become a major challenge since it directly affects Multiprocessor Systems-on-Chip (MPSoC) performance, power consumption, and reliability. In this work, we propose a transient fan model, enabling adaptive fan speed control simulation for efficient DTM. Our model is validated through a thermal test chip, achieving less than 2°C error in the worst case. With multiple fan speeds, however, the DTM design space grows significantly, which can ultimately make conventional solutions impractical. We address this challenge through a reinforcement learning-based solution to proactively determine the number of active cores, the operating frequency, and the fan speed. The proposed solution is able to reduce fan power by up to 40% compared to a DTM with constant fan speed, with less than 1% performance degradation. Also, compared to a state-of-the-art DTM technique, our solution improves performance by up to 19% for the same fan power.
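
The flavor of such a reinforcement-learning DTM can be shown with a tabular Q-learning agent whose state is a discretized chip temperature and whose action jointly sets (active cores, frequency, fan level). Every numeric model below is a toy stand-in, not the paper's formulation.

```python
import random

ACTIONS = [(c, f, s) for c in (2, 4) for f in (1.2, 2.0) for s in (0.3, 0.6, 1.0)]

def step(temp, cores, freq, fan):
    heat = 0.5 * cores * freq ** 2                          # toy power model
    temp = max(0.9 * temp + heat - 8.0 * fan + 3.0, 25.0)   # toy thermal model
    # reward: performance minus fan power, with a thermal-violation penalty
    reward = cores * freq - 2.0 * fan ** 3 - (10.0 if temp > 80.0 else 0.0)
    return temp, reward

random.seed(0)
Q, alpha, gamma, eps = {}, 0.2, 0.9, 0.1
temp = 50.0
for _ in range(20000):
    s = min(int(temp // 10), 9)                             # discretize state
    qs = Q.setdefault(s, [0.0] * len(ACTIONS))
    a = (random.randrange(len(ACTIONS)) if random.random() < eps
         else max(range(len(ACTIONS)), key=qs.__getitem__))
    temp, r = step(temp, *ACTIONS[a])
    s2 = min(int(temp // 10), 9)
    q2 = Q.setdefault(s2, [0.0] * len(ACTIONS))
    qs[a] += alpha * (r + gamma * max(q2) - qs[a])          # Q-learning update
# learned (cores, GHz, fan) per temperature band
print({s: ACTIONS[max(range(len(ACTIONS)), key=q.__getitem__)] for s, q in sorted(Q.items())})
```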

2019

An associativity-agnostic in-cache computing architecture optimized for multiplication
Konferenz ArODES

Marco Rios, William Simon, Alexandre Levisse, Marina Zapater, David Atienza

Proceedings of the 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC)

Link zur Konferenz

Zusammenfassung:

With the spread of cloud services and the Internet of Things, machine learning and artificial intelligence based analytics are becoming pervasive in our everyday life. However, an efficient deployment of these data-intensive services requires performing computations closer to the edge. In this context, in-cache computing, based on bitline computing, is promising for executing data-intensive algorithms in an energy-efficient way by mitigating data movement in the cache hierarchy and exploiting data parallelism. Nevertheless, previous in-cache computing architectures contain serious circuit-level deficiencies (i.e., low bitcell density, data corruption risks, and limited performance) and thus report high latency for multiplication, a key operation for machine learning and deep learning. Moreover, no previous work addresses the issue of way misalignment, strongly constraining data placement to avoid losing performance gains. In this work we drastically improve the previously proposed BLADE architecture for in-cache computing to efficiently support multiplication operations by enhancing the local bitline circuitry, enabling associativity-agnostic operations as well as in-place shifting inside local bitline groups. We implemented and simulated the proposed architecture in 28 nm bulk CMOS technology from TSMC, validating its functionality and extracting its performance, area, and energy per operation. Then, we designed a behavioral model of the proposed architecture to assess its performance with respect to the latest BLADE architecture. We show a 17.5% area and 22% energy reduction thanks to the proposed LG optimization. Finally, for 16-bit multiplication, we demonstrate 44% cycle-count, 47% energy, and 41% performance gains versus BLADE, and show that 4 embedded shifts is the best trade-off between energy, area, and performance.

A machine learning-based framework for throughput estimation of time-varying applications in multi-core servers
Konferenz ArODES

Arman Iranfar, Wellington Silva De Souza, Marina Zapater, Katzalin Olcoz, Samuel Xavier de Souza

Proceedings of the 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC)

Link zur Konferenz

Zusammenfassung:

Accurate workload prediction and throughput estimation are key to efficient proactive power and performance management of multi-core platforms. Although hardware performance counters available on modern platforms contain important information about application behavior, employing them efficiently is not straightforward when dealing with time-varying applications, even when these have iterative structures. In this work, we propose a machine learning-based framework for workload prediction and throughput estimation using hardware events. Our framework enables throughput estimation over the various available system configurations, namely, the number of parallel threads and the operating frequency. In particular, we first employ workload clustering and classification techniques along with Markov chains to predict the next workload for each available system configuration. Then, the predicted workload is used to estimate the next expected throughput through a machine learning-based regression model. A comparison with the state of the art demonstrates that our framework is able to improve Quality of Service (QoS) by 3.4x, while consuming 15% less power thanks to the more accurate throughput estimation.
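
A condensed view of that pipeline: cluster per-interval hardware-counter vectors into workload phases, learn a Markov transition matrix over phases to predict the next one, then estimate throughput for the predicted phase with a regressor. The data is synthetic and scikit-learn is used for brevity; the paper trains one regressor per system configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
counters = rng.normal(size=(300, 4))                  # e.g. IPC, cache misses, ...
throughput = counters @ [2.0, -1.0, 0.5, 0.0] + rng.normal(0, 0.1, 300)

# 1) workload phases from counter vectors
phases = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(counters)

# 2) first-order Markov chain over phase labels
T = np.zeros((3, 3))
for a, b in zip(phases[:-1], phases[1:]):
    T[a, b] += 1
T /= T.sum(axis=1, keepdims=True).clip(min=1)

# 3) throughput regressor (one per configuration in the paper; one shown here)
reg = LinearRegression().fit(counters, throughput)

current = phases[-1]
predicted_phase = int(np.argmax(T[current]))
centroid = counters[phases == predicted_phase].mean(axis=0)
est = reg.predict(centroid.reshape(1, -1))[0]
print(f"next phase {predicted_phase}, expected throughput {est:.2f}")
```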

A QoS and container-based approach for energy saving and performance profiling in multi-core servers
Konferenz ArODES

Arman Iranfar, Anderson Silva, Marina Zapater, Samuel Xavier de Souza

Proceedings of the 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC)

Link zur Konferenz

Zusammenfassung:

In this work we present ContainEnergy, a new performance evaluation and profiling tool that uses software containers to perform application runtime assessment, providing energy and performance profiling data. It is focused on energy efficiency for next generation workloads and IT infrastructure.

A product engine for energy-efficient execution of binary neural networks using resistive memories
Konferenz ArODES

João Vieira, Edouard Giacomin, Yasir Qureshi, Marina Zapater, Xifan Tang

Proceedings of the 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC)

Link zur Konferenz

Zusammenfassung:

The need for running complex Machine Learning (ML) algorithms, such as Convolutional Neural Networks (CNNs), in edge devices, which are highly constrained in terms of computing power and energy, makes it important to execute such applications efficiently. The situation has led to the popularization of Binary Neural Networks (BNNs), which significantly reduce execution time and memory requirements by representing the weights (and possibly the data being operated on) using only one bit. Because approximately 90% of the operations executed by CNNs and BNNs are convolutions, a significant part of the memory transfers consists of fetching the convolutional kernels. Such kernels are usually small (e.g., 3×3 operands), and particularly in BNNs redundancy is expected. Therefore, equal kernels can be mapped to the same memory addresses, requiring significantly less memory to store them. In this context, this paper presents a custom Binary Dot Product Engine (BDPE) for BNNs that exploits the features of Resistive Random-Access Memories (RRAMs). This new engine allows accelerating the execution of the inference phase of BNNs. The novel BDPE locally stores the most used binary weights and performs binary convolution using computing capabilities enabled by the RRAMs. The system-level gem5 architectural simulator was used together with a C-based ML framework to evaluate the system's performance and obtain power results. Results show that this novel BDPE improves performance by 11.3%, energy efficiency by 7.4% and reduces the number of memory accesses by 10.7% at a cost of less than 0.3% additional die area, when integrated with a 28 nm Fully Depleted Silicon On Insulator ARMv8 in-order core, in comparison to a fully-optimized baseline of YoloV3 XNOR-Net running on an unmodified Central Processing Unit.
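
The arithmetic at the core of a binary dot-product engine is easy to show in software: pack ±1 values into bitmasks, then dot(a, w) = n − 2·popcount(a XOR w). The kernel-redundancy point from the abstract is mirrored below by interning equal kernels so each distinct bit pattern is stored once; everything else about the RRAM hardware is, of course, not captured here.

```python
def pack(bits):                        # bits: sequence of +1/-1 values
    word = 0
    for i, b in enumerate(bits):
        if b > 0:
            word |= 1 << i
    return word

def binary_dot(a_word, w_word, n):
    return n - 2 * bin(a_word ^ w_word).count("1")   # XNOR-popcount identity

kernels = [[1, -1, 1, 1, -1, 1, -1, -1, 1],          # two equal 3x3 kernels
           [1, -1, 1, 1, -1, 1, -1, -1, 1],
           [-1, -1, 1, 1, 1, 1, -1, 1, -1]]
store = {}                                           # bit pattern -> storage slot
slots = [store.setdefault(pack(k), len(store)) for k in kernels]
print("storage slots:", slots)                       # -> [0, 0, 1]

activation = [1, 1, -1, 1, -1, 1, 1, -1, 1]
a = pack(activation)
for k in kernels:
    print(binary_dot(a, pack(k), 9))
```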

A design framework for thermal-aware power delivery network in 3D MPSoCs with integrated flow cell arrays
Konferenz ArODES

Halima Najibi, Alexandre Levisse, Marina Zapater

Proceedings of the 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)

Link zur Konferenz

Zusammenfassung:

Integrated Flow Cell Array (FCA) technology promises to address the power delivery and heat dissipation challenges in three-dimensional Multi-Processor Systems-on-Chips (3D MPSoCs) by providing combined inter-tier liquid cooling and power generation capabilities. In this paper, we present for the first time a design framework to accurately model the temperature-aware power delivery network in 3D MPSoCs, and quantify the effects of FCAs on the voltage drop (IR-drop). This framework estimates the power generation variation along FCAs due to voltage and temperature, in the case of uniform and non-uniform power maps from several real processor traces. Furthermore, we explore different 3D MPSoC configurations to quantify their power delivery requirements. Our results show that FCAs improve the IR-drop with respect to state-of-the-art design methods by up to 53% and 30% for dies with a power consumption of 60 W and 190 W, respectively, while maintaining their peak temperatures below 52°C, and at no additional Through Silicon Via (TSV) area overhead. In addition, as the presence of high power density regions (hotspots) can decrease the FCAs' IR-drop reduction by up to 21% with respect to the average value, we present a scalable TSV placement optimization methodology using the proposed framework. This methodology minimizes the IR-drop at hotspots and guarantees an optimal and uniform exploitation of the IR-drop reduction benefits of FCAs.

A fast, reliable and wide-voltage-range in-memory computing architecture
Konferenz ArODES

William Simon, Juan Galicia, Alexandre Levisse, Marina Zapater, David Atienza

Proceedings of the 56th Annual Design Automation Conference 2019

Link zur Konferenz

Zusammenfassung:

As the computational complexity of applications on the consumer market, such as high-definition video encoding and deep neural networks, becomes ever more demanding, novel ways to efficiently compute data-intensive workloads are being explored. In this context, In-Memory Computing (IMC) solutions, and particularly bitline computing in SRAM, appear promising as they mitigate one of the most energy-consuming aspects of computation: data movement. While IMC architectural-level characteristics have been defined by the research community, only a few works so far have explored the implementation of such memories at a low level. Furthermore, these proposed solutions are either slow (<1 GHz), area-hungry (10T SRAM), or suffer from read disturb and corruption issues. Overall, there is no extensive design study considering realistic assumptions at the circuit level. In this work we propose a fast (up to 2.2 GHz), 6T SRAM-based, reliable (no read disturb issues), and wide-voltage-range (from 0.6 to 1 V) IMC architecture using local bitlines. Beyond standard read and write, the proposed architecture can perform copy, addition and shift operations at the array level. As addition is the slowest operation, we propose a modified carry chain adder, providing a 2× carry propagation improvement. The proposed architecture is validated using a 28 nm bulk high-performance technology PDK with CMOS variability and post-layout simulations. High-density SRAM bitcells (0.127 μm²) enable an area efficiency of 59.7% for a 256×128 array, on par with current industrial standards.

BLADE :
Konferenz ArODES
a bitline accelerator for devices on the edge

William Andrew Simon, Yasir Mahmood Qureshi, Alexandre Levisse, Marina Zapater, David Atienza

Proceedings of the 2019 on Great Lakes Symposium on VLSI

Link zur Konferenz

Zusammenfassung:

The increasing ubiquity of edge devices in the consumer market, along with their ever more computationally expensive workloads, necessitate corresponding increases in computing power to support such workloads. In-memory computing is attractive in edge devices as it reuses preexisting memory elements, thus limiting area overhead. Additionally, in-SRAM Computing (iSC) efficiently performs computations on spatially local data found in a variety of emerging edge device workloads. We therefore propose, implement, and benchmark BLADE, a BitLine Accelerator for Devices on the Edge. BLADE is an iSC architecture that can perform massive SIMD-like complex operations on hundreds to thousands of operands simultaneously. We implement BLADE in 28nm CMOS and demonstrate its functionality down to 0.6V, lower than any conventional state-of-the-art iSC architecture. We also benchmark BLADE in conjunction with a full Linux software stack in the gem5 architectural simulator, providing a robust demonstration of its performance gain in comparison to an equivalent embedded processor equipped with a NEON SIMD co-processor. We benchmark BLADE with three emerging edge device workloads, namely cryptography, high efficiency video coding, and convolutional neural networks, and demonstrate 4x, 6x, and 3x performance improvement, respectively, in comparison to a baseline CPU/NEON processor at an equivalent power budget.

Gem5-X :
Konferenz ArODES
a gem5-based system level simulation framework to optimize many-core platforms

Yasir Mahmood Qureshi, William Andrew Simon, Marina Zapater, David Atienza, Katzalin Olcoz

Proceedings of the 2019 Spring Simulation Conference

Link zur Konferenz

Zusammenfassung:

The rapid expansion of online-based services requires novel energy- and performance-efficient architectures to meet power and latency constraints. Fast architectural exploration has become a key enabler of architectural innovation. In this paper, we present gem5-X, a gem5-based system-level simulation framework, and a methodology to optimize many-core systems for performance and power. As real-life case studies of many-core server workloads, we use real-time video transcoding and image classification using convolutional neural networks (CNNs). Gem5-X allows us to identify bottlenecks and evaluate the potential benefits of architectural extensions such as in-cache computing and 3D-stacked High Bandwidth Memory (HBM). For real-time video transcoding, we achieve a 15% speed-up using in-order cores with in-cache computing when compared to a baseline in-order system, and 76% energy savings when compared to an Out-of-Order system. When using HBM, we further accelerate real-time transcoding and CNNs by up to 7% and 8%, respectively.

Definition of a transparent constraint-based modeling and simulation layer for the management of complex systems
Konferenz ArODES

Kevin Henares, José L. Risco-Martín, Marina Zapater

Proceedings of the Theory of Modeling and Simulation Symposium 2019

Link zur Konferenz

Zusammenfassung:

Modeling and Simulation (M&S) is one of the most multifaceted topics present today in both industry and academia. However, M&S is entering a new paradigm: systems are becoming more complex, and new simulation needs arise that have to be studied. As a consequence, the way in which we perform M&S must be adapted, providing new ideas and tools. In this paper, we propose a rule-based constraint evaluator, which facilitates the validation and verification of complex models in a transparent manner. The constraint definition process is completely independent of the model development process because (a) the set of constraints is defined once the model has been developed, and (b) constraints are validated at simulation time. The proposed Constraint M&S architecture has been built using the Discrete Event System Specification (DEVS) formalism and has been tested on a validated data center simulation model.
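
The gist of such a transparent constraint layer: constraints are declared after the model exists and are evaluated against the model state at simulation time, without touching the model code. A minimal rule-based checker, with invented constraint names and state fields:

```python
class ConstraintLayer:
    def __init__(self):
        self.rules = []                       # (name, predicate) pairs

    def add(self, name, predicate):
        self.rules.append((name, predicate))

    def check(self, state, time):
        for name, pred in self.rules:
            if not pred(state):
                print(f"[t={time}] constraint violated: {name}")

# Model development stays independent: the simulator just calls check() each step.
layer = ConstraintLayer()
layer.add("inlet temp below 27C", lambda s: s["inlet_temp"] < 27.0)
layer.add("no server overloaded", lambda s: max(s["util"]) <= 1.0)

for t, state in enumerate([{"inlet_temp": 25.0, "util": [0.7, 0.9]},
                           {"inlet_temp": 28.5, "util": [0.7, 1.2]}]):
    layer.check(state, t)
```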

Enhancing two-phase cooling efficiency through thermal-aware workload mapping for power-hungry servers
Konferenz ArODES

Arman Iranfar, Ali Pahlevan, Marina Zapater, David Atienza

Proceedings of the 2019 Design, Automation & Test in Europe Conference & Exhibition

Link zur Konferenz

Zusammenfassung:

The power density, and consequently the power hungriness, of server processors is growing by the day. Traditional air cooling systems fail to cope with such high heat densities, whereas single-phase liquid cooling still requires a high mass flow rate, high pumping power, and a large facility size. In contrast, in a micro-scale gravity-driven thermosyphon attached on top of a processor, the refrigerant absorbs the heat and turns into a two-phase mixture. The vapor-liquid mixture exchanges heat with a coolant at the condenser side, returns to the liquid state, and descends thanks to gravity, eliminating the need for pumping power. However, similar to other cooling technologies, thermosyphon efficiency can vary considerably with workload performance requirements and thermal profile, as well as with platform features such as packaging and die floorplan. In this work, we first address the workload- and platform-aware design of a two-phase thermosyphon. Then, we propose a thermal-aware workload mapping strategy considering the potential and limitations of a two-phase thermosyphon to further minimize hot spots and spatial thermal gradients. Our experiments, performed on an 8-core Intel Xeon E5 CPU, reveal on average up to 10°C reduction in thermal hot spots and 45% reduction in the maximum spatial thermal gradient on the die. Moreover, our design and mapping strategy are able to decrease the chiller cooling power by at least 45%.

MAMUT :
Konferenz ArODES
multi-agent reinforcement learning for efficient real-time multi-user video transcoding

Luis Costero, Arman Iranfar, Marina Zapater, Francisco D. Igual, Katzalin Olcoz

Proceedings of the 2019 Design, Automation & Test in Europe Conference & Exhibition

Link zur Konferenz

Zusammenfassung:

Real-time video transcoding has recently emerged as a valid alternative to address the ever-increasing demand for video content in server infrastructures in current multi-user environments. High Efficiency Video Coding (HEVC) makes efficient online transcoding feasible as it enhances user experience by providing an adequate video configuration, reduces pressure on the network, and minimizes inefficient and costly video storage. However, the computational complexity of HEVC, together with its myriad of configuration parameters, raises challenges for power management, throughput control, and Quality of Service (QoS) satisfaction. This is particularly challenging in multi-user environments where multiple users with different resolution demands and bandwidth constraints need to be served simultaneously. In this work, we present MAMUT, a multi-agent machine learning approach to tackle these challenges. Our proposal breaks the design space composed of run-time adaptation of the transcoder and system parameters into smaller sub-spaces that can be explored in a reasonable time by individual agents. While working cooperatively, each agent is in charge of learning and applying the optimal values for internal HEVC and system-wide parameters. In particular, MAMUT dynamically tunes the Quantization Parameter, selects the number of threads per video, and sets the operating frequency with throughput and video quality objectives under compression and power consumption constraints. We implement MAMUT on an enterprise multicore server and compare equivalent scenarios to state-of-the-art alternative approaches. The obtained results reveal that MAMUT consistently attains up to 8× improvement in terms of FPS violations (and thus Quality of Service), as well as a 24% power reduction and faster, more accurate adaptation to both the video contents and the available resources.

2018

Fast energy estimation through partial execution of HPC applications
Konferenz ArODES

Juan Carlos Salinas-Hilburg, Marina Zapater, José M. Moya, José L. Ayala

Proceedings of the 29th International Conference on Application-specific Systems, Architectures and Processors

Link zur Konferenz

Zusammenfassung:

In order to optimize the energy use of servers in Data Centers, techniques such as power capping or power budgeting are usually deployed. These techniques rely on the prediction of the power and execution time of applications. These data are obtained via dynamic profiling, which requires a full execution of the application. This is not feasible in High Performance Computing (HPC) applications with long execution times. In this paper, we present a methodology to estimate the dynamic CPU and memory energy consumption of an application without executing it completely. Our methodology merges static code analysis information and dynamic profiling via the partial execution of the application. We do so by leveraging the concept of application signature, defined as a reduced version of the application in terms of execution time and power profile. We validate our methodology with a set of CPU-intensive and memory-intensive benchmarks and multi-threaded applications on a presently shipping enterprise server. Our energy estimation methodology shows an overall error below 8.0% when compared to the dynamic energy of the whole execution of the application. Also, our energy estimation methodology allows estimating the energy of multi-threaded applications with an RMSE of 12.7% when compared to the dynamic energy of the complete parallel execution.
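
The extrapolation idea is simple to state in code: run only an instrumented slice of the application, measure its average power and its share of the total work, and scale up. The sketch below is a bare-bones illustration of that arithmetic, not the paper's methodology; the iteration counts stand in for what static code analysis would provide.

```python
def estimate_energy(avg_power_w, slice_seconds, slice_iters, total_iters):
    """Extrapolate full-run dynamic energy from a partial execution."""
    est_runtime_s = slice_seconds * (total_iters / slice_iters)
    return avg_power_w * est_runtime_s        # joules

# e.g. a 30 s slice covering 5 of 400 solver iterations at 210 W average
print(f"{estimate_energy(210.0, 30.0, 5, 400):.0f} J")   # -> 504000 J
```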

Design optimization of 3D multi-processor system-on-chip with integrated flow cell arrays
Konferenz ArODES

Artem Andreev, Fulya Kaplan, Marina Zapater, Ayse K. Coskun, David Atienza

Proceedings of the International Symposium on Low Power Electronics and Design 2018

Link zur Konferenz

Zusammenfassung:

Integrated flow cell array (FCA) is an emerging technology, targeting the cooling and power delivery challenges of modern 2D/3D Multi-Processor Systems-on-Chip (MPSoCs). In FCA, electrolytic solutions are pumped through microchannels etched in the silicon of the chips, removing heat from the system, while, at the same time, generating power on-chip. In this work, we explore the impact of FCA system design on various 3D architectures and propose a methodology to optimize a 3D MPSoC with integrated FCA to run a given workload in the most energy-efficient way. Our results show that an optimized configuration can save up to 50% energy with respect to sub-optimal 3D MPSoC configurations.

Design of a two-phase gravity-driven micro-scale thermosyphon cooling system for high-performance computing data centers
Conference ArODES

André Seuret, Arman Iranfar, Marina Zapater, John Thome, David Atienza

Proceedings of the 17th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems

Link to the conference

Abstract:

Next-generation High-Performance Computing (HPC) systems need to provide outstanding performance with unprecedented energy efficiency while maintaining servers at safe thermal conditions. Air cooling presents important limitations when employed in HPC infrastructures. Instead, two-phase on-chip cooling combines the small footprint area and large heat exchange surface of micro-channels with extremely high heat transfer performance, and allows for waste heat recovery. When relying on gravity to drive the flow to the heat sink, the system is called a closed-loop two-phase thermosyphon. Previous research work either focused on the development of large-scale proof-of-concept thermosyphon demonstrators, or on the development of numerical models able to simulate their operation. In this work, we present a new ultra-compact micro-scale thermosyphon design for high heat flux components. We manufactured a working 8 cm tall prototype tailored to Virtex 7 FPGAs with a heat spreader area of 45 mm × 45 mm, and we validate its performance via measurements. The results are compared to our simulator and accurately match the thermal performance of the thermosyphon, with an error of less than 3.5%. Our prototype is able to work over the full power range of the Virtex 7, dissipating up to 60 W while keeping chip temperature below 60°C. The prototype will next be deployed in a 10 kW rack as part of an HPC prototype, with an expected Power Usage Effectiveness (PUE) below 1.05.
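
As a back-of-the-envelope check of the reported operating point (assuming, hypothetically, a ~25°C ambient):

```python
# Dissipating 60 W while staying below 60 degrees C with ~25 degrees C ambient
# implies a junction-to-ambient thermal resistance of at most ~0.58 K/W.
P_MAX, T_CHIP_MAX, T_AMBIENT = 60.0, 60.0, 25.0      # ambient is an assumption

r_th_required = (T_CHIP_MAX - T_AMBIENT) / P_MAX      # K/W
print(f"required R_th <= {r_th_required:.2f} K/W")    # 0.58 K/W

# With such an R_th, chip temperature scales roughly linearly with power:
for p in (15, 30, 45, 60):
    print(p, "W ->", round(T_AMBIENT + r_th_required * p, 1), "degrees C")
```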

A machine learning-based strategy for efficient resource management of video encoding on heterogeneous MPSoCs
Conference ArODES

Arman Iranfar, William Andrew Simon, Marina Zapater, David Atienza

Proceedings of the 2018 IEEE International Symposium on Circuits and Systems

Link to the conference

Abstract:

The design of new streaming systems is becoming a major area of research to deploy services targeted in the Internet-of-Things (IoT) era. In this context, the new High Efficiency Video Coding (HEVC) standard provides high efficiency and scalability of quality at the cost of increased computational complexity for edge nodes, which is a new challenge for the design of IoT systems. The usage of hardware acceleration in conjunction with general-purpose cores in Multiprocessor Systems-on-Chip (MPSoCs) is a promising solution to create heterogeneous computing systems that manage the complexity of real-time streaming for high-end IoT systems, achieving higher throughput and power efficiency when compared to conventional processors alone. Furthermore, Machine Learning (ML) provides a promising means to efficiently use this next generation of heterogeneous MPSoC designs that the EDA industry is developing, by dynamically optimizing system performance under diverse requirements such as frame resolution, search area, operating frequency and stream allocation. In this work, we propose an ML-based approach for stream allocation and Dynamic Voltage and Frequency Scaling (DVFS) management on a heterogeneous MPSoC composed of ARM cores and FPGA fabric containing hardware accelerators for the motion estimation of HEVC encoding. Our experiments on a Zynq7000 SoC show 20% higher throughput when compared to state-of-the-art streaming systems for next-generation IoT devices.
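
A toy allocator can illustrate the kind of policy the ML approach learns; the greedy heuristic, capacities, and DVFS levels below are invented stand-ins, not the paper's learned policy:

```python
# Illustrative greedy allocator: send motion estimation of incoming streams to
# the FPGA accelerators while any are free, fall back to ARM cores otherwise,
# and pick the lowest DVFS level that still sustains each stream's load.
FPGA_SLOTS = 2
FREQS_GHZ = [0.6, 0.8, 1.0]                  # Zynq-like ARM DVFS levels

def allocate(streams):
    """streams: list of per-stream loads (frames/s of motion estimation)."""
    plan, fpga_used = [], 0
    for load in sorted(streams, reverse=True):   # heaviest streams first
        if fpga_used < FPGA_SLOTS:
            plan.append((load, "fpga", None))
            fpga_used += 1
        else:
            # lowest ARM frequency that sustains the load, assuming
            # (hypothetically) 30 fps of capacity per GHz
            freq = next((f for f in FREQS_GHZ if f * 30.0 >= load),
                        FREQS_GHZ[-1])
            plan.append((load, "arm", freq))
    return plan

for load, unit, freq in allocate([28.0, 24.0, 17.0, 12.0]):
    print(f"{load:5.1f} fps -> {unit}" + (f" @ {freq} GHz" if freq else ""))
```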

Online efficient bio-medical video transcoding on MPSoCs through content-aware workload allocation
Conference ArODES

Arman Iranfar, Ali Pahlevan, Marina Zapater, Martin Zagar, Mario Kovac

Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition

Link to the conference

Abstract:

Bio-medical image processing in the field of telemedicine, and in particular the definition of systems that allow medical diagnostics in a collaborative and distributed way, is experiencing undeniable growth. Due to the high quality of bio-medical videos and the subsequent large volumes of data generated, enabling medical diagnosis on-the-go requires efficiently transcoding and streaming the stored videos in real time, without quality loss. However, online video transcoding is a highly demanding, computationally intensive task, and its efficient management in Multiprocessor Systems-on-Chip (MPSoCs) poses an important challenge. In this work, we propose an efficient motion- and texture-aware frame-level parallelization approach to enable online medical imaging transcoding on MPSoCs for next-generation video encoders. By exploiting the unique characteristics of bio-medical videos and of the medical procedures that enable diagnosis, we split frames into tiles based on their motion and texture, deciding the most adequate level of parallelization. Then, we employ the available encoding parameters to satisfy the required video quality and compression. Moreover, we propose a new fast motion search algorithm for bio-medical videos that drastically reduces the computational complexity of the encoder, thus achieving the frame rates required for online transcoding. Finally, we heuristically allocate the threads to the most appropriate available resources and set the operating frequency of each one. We evaluate our work on an enterprise multicore server, achieving online medical imaging with 1.6× higher throughput and 44% less power consumption when compared to state-of-the-art techniques.
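
A minimal sketch of the splitting decision, with invented thresholds and scoring standing in for the paper's motion/texture analysis:

```python
# Frames with more motion/texture get more tiles, i.e. a deeper level of
# frame parallelization. Weights and thresholds are illustrative only.

def tiles_for_frame(motion_score, texture_score):
    """Map per-frame motion/texture scores (0..1) to a tile count."""
    activity = 0.6 * motion_score + 0.4 * texture_score
    if activity < 0.25:
        return 1          # static, flat content: no need to parallelize
    if activity < 0.5:
        return 2
    if activity < 0.75:
        return 4
    return 8              # highly active frame: split aggressively

for m, t in [(0.1, 0.2), (0.4, 0.5), (0.9, 0.8)]:
    print(f"motion={m} texture={t} -> {tiles_for_frame(m, t)} tiles")
```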

Energy proportionality in near-threshold computing servers and cloud data centers: consolidating or not?
Conference ArODES

Ali Pahlevan, Yasir Mahmood Qureshi, Marina Zapater, Andrea Bartolini, Davide Rossi

Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition

Link to the conference

Abstract:

Cloud Computing aims to efficiently tackle the increasing demand for computing resources, and its popularity has led to a dramatic increase in the number of computing servers and data centers worldwide. However, as an effect of post-Dennard scaling, computing servers have become power-limited, and new system-level approaches must be used to improve their energy efficiency. This paper first presents an accurate power modelling characterization for a new server architecture based on the FD-SOI process technology for near-threshold computing (NTC). Then, we explore the existing energy vs. performance trade-offs when virtualized applications with different CPU utilization and memory footprint characteristics are executed. Finally, based on this analysis, we propose a novel dynamic virtual machine (VM) allocation method that exploits the knowledge of VM characteristics together with our accurate server power model for next-generation NTC-based data centers, while guaranteeing quality of service (QoS) requirements. Our results demonstrate the inefficiency of current workload consolidation techniques for new NTC-based data center designs, and show how our proposed method provides up to 45% energy savings when compared to state-of-the-art consolidation-based approaches.
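
A toy comparison can make the consolidation question concrete; all power coefficients below are invented, and the QoS side is assumed away:

```python
# Pack eight light VMs onto one nominal-voltage server, or spread them across
# energy-proportional near-threshold (NTC) servers. Coefficients are made up.

def server_power(util, near_threshold):
    """Idle floor + linear dynamic term; NTC mode is far more proportional."""
    if near_threshold:
        return 5.0 + 60.0 * util        # tiny static floor at low voltage
    return 80.0 + 120.0 * util          # high static floor at nominal voltage

vm_utils = [0.15] * 8                   # eight light virtual machines

# Consolidate: all VMs on a single nominal server (capped at 100% utilization).
consolidated = server_power(min(sum(vm_utils), 1.0), near_threshold=False)

# Spread: one VM per NTC server, assuming QoS is still met at low frequency.
spread = sum(server_power(u, near_threshold=True) for u in vm_utils)

print(f"consolidated (nominal): {consolidated:.0f} W")   # 200 W
print(f"spread (NTC):           {spread:.0f} W")         # 112 W
```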

2017

A machine learning-based approach for power and thermal management of next-generation video coding on MPSoCs
Conference ArODES

Arman Iranfar, Marina Zapater, David Atienza

Proceedings of the 2017 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)

Link to the conference

Abstract:

High Efficiency Video Coding (HEVC) provides high efficiency at the cost of increased computational complexity followed by increased power consumption and temperature of current Multi-Processor Systems-on-Chip (MPSoCs). In this paper, we propose a machine learning-based power and thermal management approach that dynamically learns the best encoder configuration and core frequency for each of the several video streams running in an MPSoC, using information from frame compression, quality, performance, total power and temperature. We implement our approach in an enterprise multicore server and compare it against state-of-the-art techniques. Our approach improves video quality and performance by 17% and 11%, respectively, while reducing average temperature by 12%, without degrading compression or increasing power.

MANGO: exploring manycore architectures for next-generation HPC systems
Conference ArODES

José Flich, Giovanni Agosta, Philipp Ampletzer, David Atienza, Carlo Brandolese, Marina Zapater

Proceedings of the 2017 Euromicro Conference on Digital System Design (DSD)

Link to the conference

Abstract:

The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated.

Thermal characterization of next-generation workloads on heterogeneous MPSoCs
Conference ArODES

Arman Iranfar, Terraneo, William Andrew Simon, Leon Dragic, Marina Zapater, Igor Piljic

Proceedings of the 2017 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation

Link to the conference

Abstract:

Next-generation High-Performance Computing (HPC) applications need to tackle outstanding computational complexity while meeting latency and Quality-of-Service constraints. Heterogeneous Multi-Processor Systems-on-Chip (MPSoCs), equipped with a mix of general-purpose cores and reconfigurable fabric for custom acceleration of computational blocks, are key in providing the flexibility to meet the requirements of next-generation HPC. However, heterogeneity brings new challenges to efficient chip thermal management. In this context, accurate and fast thermal simulators are becoming crucial to understand and exploit the trade-offs brought by heterogeneous MPSoCs. In this paper, we first thermally characterize a next-generation HPC workload, the online video transcoding application, using a highly-accurate Infra-Red (IR) microscope. Second, we extend the 3D-ICE thermal simulation tool with a new generic heat spreader model capable of accurately reproducing package surface temperature, with an average error of 6.8% for the hot spots of the chip. Our model is used to characterize the thermal behaviour of the online transcoding application when running on a heterogeneous MPSoC. Moreover, by using our detailed thermal system characterization we are able to explore different application mappings as well as the thermal limits of such heterogeneous platforms.

SFIDE: a simulation infrastructure for data centers
Conference ArODES

Ignacio Penas, Marina Zapater, José L. Risco-Martín, José L. Ayala

Proceedings of the 2017 Summer Simulation Multi-Conference

Link to the conference

Abstract:

Data centers are huge power consumers and have very high operational costs. Both industry and academia have proposed strategies at multiple levels (server, room layout, cooling, workload allocation, etc.) to increase the efficiency of these facilities. Testing the impact of variables so different in nature, such as layout or workload allocation, can only be managed by simulators. Current simulation infrastructures are either focused on data room thermal dynamics, or target only a specific stage of data center operations, such as workload allocation. Moreover, they are geared for specific use-cases such as HPC or cloud computing. In this paper we present a data center modeling and simulation framework, for both HPC and cloud applications, to assess data center performance, thermal behavior, energy efficiency and operational cost. Our goal is to show the possibilities of the current data center modeling and simulation framework. Furthermore, as we provide a fully configurable, flexible and scalable infrastructure, any kind of policy, data center size or workload amount could easily be implemented over the simulator. We also provide the data sets used to validate our models and policies, obtained from real servers and data centers, so as to enable researchers to test their strategies in a realistic setup.

2016

Unsupervised power modeling of co-allocated workloads for energy efficiency in data centers
Conference ArODES

Juan C. Salinas-Hilburg, Marina Zapater, José L. Risco-Martín, José M. Moya, José L. Ayala

Proceedings of the 2016 Conference on Design, Automation & Test in Europe

Link to the conference

Abstract:

Data centers are huge power consumers and their energy consumption keeps rising despite the efforts to increase energy efficiency. A great body of research is devoted to reducing the computational power of these facilities by applying techniques such as power budgeting and power capping in servers. Such techniques rely on models to predict the power consumption of servers. However, estimating overall server power for arbitrary applications running co-allocated in multithreaded servers is not a trivial task. In this paper, we use Grammatical Evolution techniques to predict the dynamic power of the CPU and memory subsystems of an enterprise server using the hardware counters of each application. On top of our dynamic power models, we use fan and temperature-dependent leakage power models to obtain the overall server power. To train and test our models we use real traces from a presently shipping enterprise server under a wide set of sequential and parallel workloads running at various frequencies. We prove that our model is able to predict the power consumption of two different tasks co-allocated in the same server, keeping the error below 8W. For the first time in the literature, we develop a methodology able to combine the hardware counters of two individual applications and estimate overall server power consumption without running the co-allocated applications. Our results show a prediction error below 12W, which represents 7.3% of the overall server power, outperforming previous approaches in the state of the art.
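
A fixed linear model can stand in for the evolved one to illustrate the counter-combination step; the coefficients, counter names and workloads below are hypothetical:

```python
# Combine the standalone hardware-counter profiles of two applications and
# feed the merged profile to a server power model, without ever running the
# co-allocated pair. (The paper evolves the model with Grammatical Evolution;
# here a fixed linear model stands in for it.)
COEFF = {"instructions": 4e-9, "llc_misses": 3e-7, "mem_bw": 1.5e-8}
P_IDLE = 55.0   # W, hypothetical idle floor

def power(counters):
    return P_IDLE + sum(COEFF[k] * v for k, v in counters.items())

def merge(a, b):
    """Naive merge: counter rates add when the two apps share the server."""
    return {k: a[k] + b[k] for k in a}

app1 = {"instructions": 2.0e9, "llc_misses": 1.0e6, "mem_bw": 5.0e8}
app2 = {"instructions": 1.2e9, "llc_misses": 8.0e6, "mem_bw": 2.0e9}

print(f"app1 alone: {power(app1):.1f} W")
print(f"app2 alone: {power(app2):.1f} W")
print(f"co-allocated estimate: {power(merge(app1, app2)):.1f} W")
```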

Towards near-threshold server processors
Conference ArODES

Ali Pahlevan, Javier Picorel, Arash Pourhabibi Zarandi, Davide Rossi, Marina Zapater

Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition

Link to the conference

Abstract:

The popularity of cloud computing has led to a dramatic increase in the number of data centers in the world. Ever-increasing computational demands along with the slowdown in technology scaling have ushered in an era of power-limited servers. Techniques such as near-threshold computing (NTC) can be used to improve energy efficiency in the post-Dennard scaling era. This paper describes an architecture based on the FD-SOI process technology for near-threshold operation in servers. Our work explores the trade-offs in energy and performance when running a wide range of applications found in private and public clouds, ranging from traditional scale-out applications, such as web search or media streaming, to virtualized banking applications. Our study demonstrates the benefits of near-threshold operation and proposes several directions to synergistically increase the energy proportionality of a near-threshold server.

2015

Dynamic workload and cooling management in high-efficiency data centers
Conference ArODES

Marina Zapater, Ata Turk, José M. Moya, José L. Ayala, Ayse K. Coskun

Proceedings of the sixth International Green and Sustainable Computing Conference

Link to the conference

Abstract:

Energy efficiency research in data centers has traditionally focused on raised-floor air-cooled facilities. As rack power density increases, traditional cooling is being replaced by close-coupled systems that provide enhanced airflow and cooling capacity. This work presents a model for close-coupled data centers with free cooling, and explores the power consumption trade-offs in these facilities as outdoor temperature changes throughout the year. Using this model, we propose a technique that jointly allocates workload and controls cooling in a power-efficient way. Our technique is tested with configuration parameters, power traces, and weather data collected from real-life data centers, and application profiles obtained from enterprise servers. Results show that our joint workload allocation and cooling policy provides 5% reduction in overall data center energy consumption, and up to 24% peak power reduction, leading to a 6% decrease in the electricity costs without affecting performance.
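
The free-cooling trade-off reduces to a small decision rule; the thresholds and power fractions below are invented for illustration:

```python
# As outdoor temperature changes, pick free cooling when it can hold the
# supply-air setpoint, otherwise fall back to the chiller, and account for
# cooling power when placing load. All figures are illustrative.

def cooling_power(it_power_w, t_outdoor_c, t_setpoint_c=24.0):
    if t_outdoor_c <= t_setpoint_c - 4.0:      # enough margin: free cooling
        return 0.05 * it_power_w               # fans/pumps only
    return 0.30 * it_power_w                   # chiller engaged

for t_out in (5, 15, 22, 30):
    it = 100_000.0                              # 100 kW of IT load
    cool = cooling_power(it, t_out)
    print(f"{t_out:>2} degrees C outdoor -> cooling {cool/1000:.0f} kW, "
          f"PUE ~ {(it + cool) / it:.2f}")
```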

Power-awareness and smart-resource management in embedded computing systems
Conference ArODES

M. D. Santambrogio, José L. Ayala, Simone Campanoni, Marina Zapater

Proceedings of the 2015 International Conference on Hardware/Software Codesign and System Synthesis

Link to the conference

Abstract:

Resources such as transistor and memory counts, the level of integration and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960s. It is therefore time for a fresh approach to the way systems are designed and used.

Using grammatical evolution techniques to model the dynamic power consumption of enterprise servers
Conference ArODES

Juan C. Salinas-Hilburg, Marina Zapater, José L. Risco-Martín, José M. Moya, José L. Ayala

Proceedings of the 9th International Conference on Complex, Intelligent, and Software Intensive Systems

Link to the conference

Abstract:

The increasing demand for computational resources has led to a significant growth of data center facilities. A major concern has appeared regarding energy efficiency and consumption in servers and data centers. The use of flexible and scalable server power models is a must in order to enable proactive energy optimization strategies. This paper proposes the use of Evolutionary Computation to obtain a model for server dynamic power consumption. To accomplish this, we collect a significant number of server performance counters for a wide range of sequential and parallel applications, and obtain a model via Genetic Programming techniques. Our methodology enables the unsupervised generation of models for arbitrary server architectures, in a way that is robust to the type of application being executed in the server. With our generated models, we are able to predict the overall server power consumption for arbitrary workloads, outperforming previous approaches in the state-of-the-art.

A trust and reputation system for energy optimization in cloud data centers
Conference ArODES

Ignacio Aransay, Marina Zapater, Patricia Arroba, José M. Moya

Proceedings of the 8th International Conference on Cloud Computing

Link to the conference

Abstract:

The increasing success of Cloud Computing applications and online services has contributed to the unsustainability of data center facilities in terms of energy consumption. Higher resource demand has increased the electricity required by computation and cooling resources, leading to power shortages and outages, especially in urban infrastructures. Current energy reduction strategies for Cloud facilities usually disregard the data center topology, the contribution of cooling consumption and the scalability of optimization strategies. Our work tackles the energy challenge by proposing a temperature-aware VM allocation policy based on a Trust-and-Reputation System (TRS). A TRS meets the requirements of inherently distributed environments such as data centers, and allows the implementation of autonomous and scalable VM allocation techniques. For this purpose, we model the relationships between the different computational entities, synthesizing this information into one single metric. This metric, called reputation, is then used to optimize the allocation of VMs in order to reduce energy consumption. We validate our approach with a state-of-the-art Cloud simulator using real Cloud traces. Our results show considerable reductions in energy consumption, reaching up to 46.16% savings in computing power and 17.38% savings in cooling, without QoS degradation, while keeping servers below thermal redlining. Moreover, our results show the limitations of the PUE ratio as a metric for energy efficiency. To the best of our knowledge, this paper is the first approach combining Trust-and-Reputation systems with Cloud Computing VM allocation.
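
A minimal sketch of a reputation metric and allocation rule in this spirit; the scoring weights and update rule are invented, not the paper's TRS:

```python
# Each host's reputation blends its recent thermal and energy behaviour into
# one metric; new VMs go to the most reputable host, which keeps hot or
# inefficient machines from attracting further load.

class Host:
    def __init__(self, name):
        self.name, self.reputation = name, 0.5

    def report(self, temp_c, watts_per_vm):
        """Update reputation from the latest monitoring sample (EWMA)."""
        thermal = max(0.0, 1.0 - temp_c / 80.0)        # cooler is better
        energy = max(0.0, 1.0 - watts_per_vm / 400.0)  # frugal is better
        sample = 0.5 * thermal + 0.5 * energy
        self.reputation = 0.8 * self.reputation + 0.2 * sample

hosts = [Host("h1"), Host("h2"), Host("h3")]
hosts[0].report(45.0, 180.0)   # cool and efficient
hosts[1].report(72.0, 250.0)   # near thermal redline
hosts[2].report(55.0, 320.0)   # power hungry

target = max(hosts, key=lambda h: h.reputation)
print("allocate next VM to:", target.name)   # h1
```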

2014

Server power modeling for run-time energy optimization of cloud computing facilities
Conference ArODES

Patricia Arroba, José L. Risco-Martín, Marina Zapater, José M. Moya, José L. Ayala, Katzalin Olcoz

Proceedings of the 6th International Conference on Sustainability in Energy and Buildings / SEB-14; Energy Procedia

Link to the conference

Abstract:

As advanced Cloud services are becoming mainstream, the contribution of data centers to the overall power consumption of modern cities is growing dramatically. The average consumption of a single data center is equivalent to the energy consumption of 25,000 households. Modeling the power consumption of these infrastructures is crucial to anticipate the effects of aggressive optimization policies, but accurate and fast power modeling is a complex challenge for high-end servers not yet satisfied by analytical approaches. This work proposes an automatic method, based on Multi-Objective Particle Swarm Optimization, for the identification of power models of enterprise servers in Cloud data centers. Our approach, as opposed to previous procedures, does not only consider workload consolidation for deriving the power model, but also incorporates other non-traditional factors, such as the static power consumption and its dependence on temperature. Our experimental results show that we reach slightly better models than classical approaches, while simultaneously simplifying the power model structure and thus the number of sensors needed, which is very promising for short-term energy prediction. This work, validated with real Cloud applications, broadens the possibilities of deriving efficient energy saving techniques for Cloud facilities.
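
A single-objective particle swarm on synthetic data can illustrate the mechanics (the paper's method is multi-objective and works on real traces):

```python
import random

# Fit P = c0 + c1*util + c2*temp to synthetic measurements with plain PSO.
random.seed(1)
TRUE = (60.0, 120.0, 0.8)
data = [(u, t, TRUE[0] + TRUE[1] * u + TRUE[2] * t + random.gauss(0, 1.0))
        for u in (0.1, 0.3, 0.5, 0.7, 0.9) for t in (30, 45, 60)]

def mse(c):
    return sum((c[0] + c[1] * u + c[2] * t - p) ** 2
               for u, t, p in data) / len(data)

N, DIM, W, C1, C2 = 30, 3, 0.7, 1.5, 1.5
pos = [[random.uniform(0, 150) for _ in range(DIM)] for _ in range(N)]
vel = [[0.0] * DIM for _ in range(N)]
pbest = [p[:] for p in pos]              # personal bests
gbest = min(pbest, key=mse)              # global best

for _ in range(300):
    for i in range(N):
        for d in range(DIM):
            vel[i][d] = (W * vel[i][d]
                         + C1 * random.random() * (pbest[i][d] - pos[i][d])
                         + C2 * random.random() * (gbest[d] - pos[i][d]))
            pos[i][d] += vel[i][d]
        if mse(pos[i]) < mse(pbest[i]):
            pbest[i] = pos[i][:]
    gbest = min(pbest, key=mse)

print("fitted coefficients:", [round(c, 2) for c in gbest])
```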

2013

A cyber-physical approach to combined HW-SW monitoring for improving energy efficiency in data centers
Conference ArODES

Josué Pagán, Marina Zapater, Oscar Cubo, Patricia Arroba, Vicente Martín Ayuso

Proceedings of the XVIII Conference on the Design of Circuits and Integrated Systems

Link to the conference

Abstract:

High-Performance Computing, Cloud computing and next-generation applications such as e-Health or Smart Cities have dramatically increased the computational demand of data centers. The huge energy consumption, increasing levels of CO2 and the economic costs of these facilities represent a challenge for industry and researchers alike. Recent research trends propose the usage of holistic optimization techniques to jointly minimize data center computational and cooling costs from a multilevel perspective. This paper presents an analysis of the parameters needed to integrate the data center in a holistic optimization framework and leverages the usage of Cyber-Physical systems to gather workload, server and environmental data via software techniques and by deploying a non-intrusive Wireless Sensor Network (WSN). This solution tackles data sampling, retrieval and storage from a reconfigurable perspective, reducing the amount of data generated for optimization by 68% without information loss, doubling the lifetime of the WSN nodes and allowing runtime energy minimization techniques in a real scenario.
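
One common way to obtain this kind of lossy-but-bounded data reduction is send-on-delta (dead-band) sampling; whether the paper uses exactly this rule is not stated, so treat the sketch as an illustration:

```python
# A node only transmits a sample when it differs from the last transmitted
# value by more than a dead-band, cutting traffic (and radio energy) without
# losing significant information. The threshold below is illustrative.

def send_on_delta(samples, dead_band):
    sent, last = [], None
    for t, value in enumerate(samples):
        if last is None or abs(value - last) > dead_band:
            sent.append((t, value))
            last = value
    return sent

temps = [24.0, 24.1, 24.1, 24.2, 25.1, 25.1, 26.3, 26.3, 26.2, 24.0]
sent = send_on_delta(temps, dead_band=0.5)
print(f"transmitted {len(sent)}/{len(temps)} samples:", sent)
```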

Leakage and temperature aware server control for improving energy efficiency in data centers
Conference ArODES

Marina Zapater, José L. Ayala, José M. Moya, Kalyan Vaidyanathan, Kenny Gross

Proceedings of the 2013 Design, Automation & Test in Europe Conference & Exhibition

Link to the conference

Abstract:

Reducing the energy consumption for computation and cooling in servers is a major challenge considering the data center energy costs today. To ensure energy-efficient operation of servers in data centers, the relationship among computational power, temperature, leakage, and cooling power needs to be analyzed. By means of an innovative setup that enables monitoring and controlling the computing and cooling power consumption separately on a commercial enterprise server, this paper studies temperature-leakage-energy tradeoffs, obtaining an empirical model for the leakage component. Using this model, we design a controller that continuously seeks and settles at the optimal fan speed to minimize the energy consumption for a given workload. We run a customized dynamic load-synthesis tool to stress the system. Our proposed cooling controller achieves up to 9% energy savings and 30W reduction in peak power in comparison to the default cooling control scheme.
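
The trade-off the controller exploits can be reproduced with toy models: fan power grows with speed while leakage falls with the resulting temperature, so total power has an interior minimum. All curves below are invented stand-ins for the paper's empirical models:

```python
# Sweep fan speed and pick the one minimizing total (dynamic + leakage + fan)
# power for a fixed workload. Every constant here is illustrative.

def chip_temp(fan_rpm, dyn_power_w):
    return 30.0 + dyn_power_w * 2000.0 / fan_rpm      # faster fan -> cooler

def leakage(temp_c):
    return 5.0 * 1.04 ** (temp_c - 40.0)              # leakage grows with temp

def fan_power(fan_rpm):
    return 2.0 * (fan_rpm / 1000.0) ** 3              # cubic fan law

def total_power(fan_rpm, dyn_power_w=50.0):
    t = chip_temp(fan_rpm, dyn_power_w)
    return dyn_power_w + leakage(t) + fan_power(fan_rpm)

best = min(range(1500, 6001, 100), key=total_power)
print("optimal fan speed:", best, "rpm,",
      f"total power: {total_power(best):.1f} W")
```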

2012

Leveraging heterogeneity for energy minimization in data centers
Conference ArODES

Marina Zapater, José L. Ayala, José M. Moya

Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Link to the conference

Abstract:

Energy consumption in data centers is nowadays a critical objective because of its dramatic environmental and economic impact. Over the last years, several approaches have been proposed to tackle the energy/cost optimization problem, but most of them have failed to provide an analytical model targeting both the static and dynamic optimization domains of complex heterogeneous data centers. This paper proposes and solves an optimization problem for the energy-driven configuration of a heterogeneous data center. It also proposes a new mechanism for task allocation and workload distribution. The combination of both approaches outperforms previously published results in the field of energy minimization in heterogeneous data centers and opens a promising area of research.

2010

System simulation platform for the design of the SORU reconfigurable coprocessor
Conference ArODES

Marina Zapater, Pedro Malagón, Zorana Bankovic, José M. Moya, Juan-Mariano de Goyeneche

Proceedings of the 25th Conference on Design of Circuits and Integrated Systems

Link to the conference

Abstract:

This paper presents the system-level simulation platform we have implemented to design and evaluate the SORU reconfigurable vector coprocessor, aimed at enhancing the security of embedded systems. The simulator interfaces a low-level virtual machine (LLVM) with a SystemC TLM 2.0 model of the rest of the system and a low-level SystemC model of the coprocessor. The results show that we can simulate more than 80K coprocessor operations per second, with power estimation accurate enough to perform simulated power analysis attacks. The resulting simulation platform is also flexible enough to allow very fast and easy changes to any part of the system.

Avoiding side-channel attacks in embedded systems with non-deterministic branches
Conference ArODES

Pedro Malagón, Juan-Mariano de Goyeneche, Marina Zapater, José M. Moya

Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems IWIA 2010

Link to the conference

Abstract:

In this paper, we suggest handling security in embedded systems by introducing a small architectural change. We propose the use of a non-deterministic branch instruction to generate non-determinism in the execution of encryption algorithms. Non-determinism makes side-channel attacks much more difficult. The experimental results show at least three orders of magnitude improvement in resistance to statistical side-channel attacks for a custom AES implementation, while enhancing its performance at the same time. Compared with previous countermeasures, this architectural-level hiding countermeasure is trivial to integrate in current embedded processor designs and offers similar resistance to side-channel attacks, while maintaining power consumption similar to the unprotected processor.
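
A software analogue conveys the hiding principle (the actual countermeasure is an ISA-level non-deterministic branch, not Python): randomizing the order of independent operations on each run misaligns the power traces that statistical attacks must average.

```python
import random

# Conceptual sketch only: processing the 16 AES state bytes in a random order
# each run misaligns power traces, which is what defeats simple statistical
# side-channel attacks.

SBOX = list(range(256))          # placeholder table; the real AES S-box differs
random.shuffle(SBOX)             # (fixed, public permutation in real AES)

def sub_bytes_nondeterministic(state):
    """Apply the S-box to all 16 bytes in a freshly randomized order."""
    out = [0] * 16
    order = list(range(16))
    random.shuffle(order)        # different order on every execution
    for i in order:
        out[i] = SBOX[state[i]]
    return out

state = list(range(16))
print(sub_bytes_nondeterministic(state))
```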
