Project description:
The C4liTwin project (pronounced "calitwin") brings together machine learning and virtual twin technology to develop the framework and algorithms required to self-calibrate the C4 robotic arms of Trimos, creating a digital twin that models the arm's behaviour and enables automatic re-calibration.
Research team within HES-SO:
Extrat Bastien
Zapater Sancho Marina
Akeddar Mehdi
Academic partners: ReDS
Project duration:
15.11.2023 - 30.11.2025
Total project budget: 398'080 CHF
Project description:
The goal of this project is to support the Midgard technology in the Gem5 simulator. Midgard proposes to rethink the overall virtual memory technology of current servers by exposing a global but sparse intermediate address space in a coherent cache hierarchy. Midgard eliminates TLBs by offering a direct translation in hardware from existing Operating System (OS) Virtual Memory (VM) software abstractions (called VMAs) and performs page-level translations only when accessing physical memory or I/O. As such, in contrast to state-of-the-art page-based VM, Midgard's overall address translation overhead decreases with an increase in cache hierarchy capacity. By eliminating the need for deep TLB hierarchies, Midgard not only reclaims the TLB silicon provisioning, but also offers orders of magnitude faster address translation, shootdown and access control creation/revocation as compared to conventional page-based VM.
Prof. Zapater is affiliated faculty in the Midgard project, a joint project between EPFL, Yale University and the University of Edinburgh. Within this project she collaborates with EPFL in providing simulation support for Midgard in gem5.
Firstly, because Midgard requires the use of multi-level TLBs, simulating Midgard in gem5 requires setting up the entire simulation environment on the latest gem5 version, namely gem5-22. However, the gem5-X simulator developed at EPFL only supports older (2019) gem5 versions, which do not include adequate support for Midgard. Therefore, within this project we plan to perform the developments required to port the main features of gem5-X to gem5-22, creating a new release named gem5-X-22, in which all Midgard support will be released.
Secondly, we will create the necessary models in gem5 to enable simulating Midgard at the hardware and architectural levels. We will also develop the necessary framework to enable the creation of cache-coherent Midgard-compliant accelerators. We will do so by supporting the enhancement of the ALPINE simulation framework, adequately integrating it into gem5.
Finally, to showcase and fully simulate Midgard, we will focus on a proof-of-concept Midgard-compliant OS implementation in SO3, running in gem5. SO3 is a Linux-based simple operating system used for teaching and research at the REDS institute. It provides all the base functionalities of a Linux kernel while being lightweight and easy to modify. The goal will be to provide Midgard support in SO3, while maintaining compatibility with Linux-based systems, to allow advances in the proposal of cache-coherent accelerators.
Research team within HES-SO:
Zapater Sancho Marina
Academic partners: EPFL
Project duration:
01.10.2022 - 31.12.2024
Project site URL:
HES-SO - Appel à projets jeunes chercheurs
The main goal of ECO4AI is to propose workload allocation techniques that efficiently distribute workload between the edge and the cloud in a transparent way for AI-based IoT applications, allowing an efficiency increase (in terms of performance per watt). This will be accomplished by exploiting the underlying heterogeneous capabilities of novel edge and cloud architectures, and by proposing elastic edge-cloud resource allocation and management techniques.
The project exploits open hardware architectures such as RISC-V and proposes a hardware/software ecosystem that will be released open-source, increasing visibility and impact.
ECO4AI will focus on three different use cases that play a key role today in the AI-based IoT scenario: (1) video surveillance and object tracking; (2) autonomous driving, which represents an important opportunity for edge-edge and edge-cloud cooperation; and (3) e-Health, more specifically bio-signal monitoring for cardiac diseases and epileptic seizures.
Project duration:
01.01.2022 - 30.06.2023
Total project budget: 100'000 CHF
Project description:
The main goal is to propose novel machine-learning-based algorithms to enable self-calibration of robotic coordinate measuring machines (CMMs). CMMs are micrometer-accurate robotic arms capable of providing high-precision measurements of industrial component parts. Calibration is a mandatory step that remains, as of today, very time-consuming and costly. In the specific case of the C4 robotic arm of Trimos, the calibration process must ensure a maximum measurement error below the 8 µm threshold for the overall volume under measure, while automating most of the process to reduce human-related costs.
The calibration algorithms proposed in this project imply a significant departure from current techniques. We will build on formal mathematical models that describe the arm behavior (using trigonometry) and propose evolutionary techniques to tune the static and dynamic corrections and attain sub-8 µm errors. Evolutionary techniques (genetic algorithms and genetic programming) have great potential in situations where we know the underlying governing physical/mathematical laws of the system, but the dynamics are too complex to be described formally and we benefit from a data-driven approach. Moreover, we know that the placement of the artifacts required for calibration plays an important role in achieving a uniform error across the overall volume under test. Therefore, we also plan to analyze the error distribution to understand the impact of artifact placement on error.
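As a rough illustration of the evolutionary tuning loop described above, the sketch below evolves a vector of static joint corrections until the worst-case residual over a set of sample poses drops below the 8 µm target. The arm model, true offsets, and GA constants are invented stand-ins for illustration, not the project's actual trigonometric model.

```python
import random

random.seed(0)  # reproducible toy run

# Hypothetical stand-in: the "true" arm deviates from the ideal model by
# fixed static offsets per joint (in mm); calibration must recover them.
TRUE_OFFSETS = [0.013, -0.007, 0.004]
POSES = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(50)]

def max_error(corrections):
    """Worst-case residual (mm) over the sample poses for a candidate."""
    return max(abs(sum((t - c) * p for t, c, p in zip(TRUE_OFFSETS, corrections, pose)))
               for pose in POSES)

def evolve(pop_size=40, generations=200, target=0.008):  # 8 um = 0.008 mm
    pop = [[random.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=max_error)
        if max_error(pop[0]) < target:
            break  # worst-case error is under the 8 um threshold
        parents = pop[:pop_size // 2]  # elitist selection: keep the best half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            # averaging crossover plus a small Gaussian mutation per gene
            children.append([(x + y) / 2 + random.gauss(0, 0.002)
                             for x, y in zip(a, b)])
        pop = parents + children
    return pop[0], max_error(pop[0])

best_corrections, residual = evolve()
```

In the real setting the fitness function would be driven by measurements of calibration artifacts rather than a known synthetic offset, and the error distribution over the volume (not just its maximum) would feed back into artifact placement.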
Research team within HES-SO:
Zapater Sancho Marina
Akeddar Mehdi
Project duration:
29.03.2023 - 28.09.2023
Total project budget: 15'000 CHF
Alberto A. Del Barrio, José P. Manzano, Victor M. Maroto, Álvaro Villarín, Josué Pagán, Marina Zapater, José Ayala, Román Hermida
The International Journal of Electrical Engineering & Education,
60, 1, 23-40
In this paper, an alternative to the traditional methodology of signal-processing-related subjects is proposed. These are subjects that require a deep mathematical and theoretical basis, but whose practical goal is often not emphasized, which drives students to lose interest in the subject. Thus, a software-defined radio environment is proposed to provide a more practical view of the subject. This solution consists of an open hardware-software platform able to capture and process a wide range of frequencies. HackRF is the hardware component, while GNU Radio provides the graphical support for this device. The tests performed with a set of 36 students have revealed that they are more satisfied with this framework than with a traditional equation-based environment such as Matlab. Furthermore, their exam scores also support the suitability of the proposed platform.
Joshua Klein, Irem Boybat, Yasir Qureshi, Martino Dazzi, Alexandre Levisse, Giovanni Ansaloni, Marina Zapater, Abu Sebastian, David Atienza
IEEE Transactions on Computers,
2023, vol. 72, no. 7, pp. 1985 - 1998
Analog in-memory computing (AIMC) cores offer significant performance and energy benefits for neural network inference with respect to digital logic (e.g., CPUs). AIMC cores accelerate matrix-vector multiplications, which dominate these applications' run-time. However, AIMC-centric platforms lack the flexibility of general-purpose systems, as they often have hard-coded data flows and can only support a limited set of processing functions. With the goal of bridging this gap in flexibility, we present a novel system architecture that tightly integrates analog in-memory computing accelerators into multi-core CPUs in general-purpose systems. We also developed ALPINE, a powerful full-system simulation framework built on the gem5-X simulator, which enables an in-depth characterization of the proposed architecture. ALPINE allows the simulation of the entire computer architecture stack, from major hardware components to their interactions with the Linux OS. Within ALPINE, we have defined a custom ISA extension and a software library to facilitate the deployment of inference models. We showcase and analyze a variety of mappings of different neural network types, and demonstrate up to 20.5x/20.8x performance/energy gains with respect to a SIMD-enabled ARM CPU implementation for convolutional neural networks, multi-layer perceptrons, and recurrent neural networks.
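As a back-of-the-envelope illustration of why AIMC accelerates inference, the sketch below maps a matrix-vector product onto fixed-size "crossbar" tiles: each tile's dot products would happen concurrently in the analog domain, and only the partial-sum accumulation stays digital. The tile size and quantization scale are assumptions for illustration, not ALPINE's actual parameters.

```python
# Illustrative only: emulate mapping an MVM onto small "analog" tiles.
TILE = 4  # assumed crossbar dimension

def quantize(w, scale=127):
    """Map float weights in [-1, 1] to integer conductance levels (assumed 8-bit)."""
    return [[round(x * scale) for x in row] for row in w]

def tile_mvm(tile_rows, x_slice):
    # One crossbar tile: every dot product would occur "in one shot" in analog.
    return [sum(w * x for w, x in zip(row, x_slice)) for row in tile_rows]

def aimc_mvm(W, x):
    """Tile the matrix, run each tile 'in analog', accumulate digitally."""
    n_rows, n_cols = len(W), len(W[0])
    out = [0] * n_rows
    for r0 in range(0, n_rows, TILE):
        for c0 in range(0, n_cols, TILE):
            rows = [W[r][c0:c0 + TILE] for r in range(r0, min(r0 + TILE, n_rows))]
            partial = tile_mvm(rows, x[c0:c0 + TILE])
            for i, p in enumerate(partial):
                out[r0 + i] += p  # digital accumulation of per-tile partials
    return out

Wq = quantize([[0.1 * ((i + j) % 5 - 2) for j in range(6)] for i in range(5)])
x = [1, -1, 2, 0, 3, 1]
result = aimc_mvm(Wq, x)
```

The tiled result is exactly the quantized matrix-vector product; the performance argument is that each `tile_mvm` call would be a single constant-time analog operation rather than a loop of multiply-accumulates.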
Halima Najibi, Giovanni Ansaloni, Marina Zapater, Miroslav Vasic, David Atienza
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
2023, vol. 42, no. 1, pp. 2-15
Flow cell arrays (FCAs) concurrently provide efficient on-chip liquid cooling and electrochemical power generation. This technology is especially promising for three-dimensional multi-processor systems-on-chip (3D MPSoCs) realized in deeply scaled technologies, which present very challenging power and thermal requirements. Indeed, FCAs effectively improve power delivery network (PDN) performance, particularly if switched-capacitor (SC) converters are employed to decouple the flow cell and system-on-chip voltages, allowing each to operate at its optimal point. Nonetheless, the design of FCA-based solutions entails non-obvious considerations and trade-offs, stemming from their dual role in governing both the thermal and power delivery characteristics of 3D MPSoCs. Showcasing these in this paper, we explore multiple FCA design configurations and demonstrate that this technology can decrease the temperature of a heterogeneous 3D MPSoC by 78 °C, and its total power consumption by 46%, compared to a high-performance cold-plate-based liquid cooling solution. At the same time, FCAs enable up to 90% voltage drop recovery across dies, using SC converters occupying a small fraction of the chip area. Such outcomes provide an opportunity to boost 3D MPSoC computing performance by increasing the operating frequency of the dies. Leveraging these results, we introduce a novel temperature- and voltage-aware model predictive control (MPC) strategy that optimizes power efficiency at run-time. We achieve application-wide speedups of up to 16% on various machine learning (ML), data mining, and other high-performance benchmarks while keeping the 3D MPSoC temperature below 83 °C and voltage drops below 5%.
Darong Huang, Ali Pahlevan, Luis Costero, Marina Zapater, David Atienza
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
2022, vol. 41, no. 12, pp. 5596-5609
Computing servers have played a key role in the development and processing of emerging compute-intensive applications in recent years. However, they need to operate efficiently from an energy perspective, while maximizing the performance and lifetime of the hottest server components (i.e., cores and cache). Previous methods focused either on improving energy efficiency by adopting new hybrid-cache architectures, combining resistive random-access memory (RRAM) and static random-access memory (SRAM) at the hardware level, or on exploring trade-offs between the lifetime limitations and performance of multi-core processors under stable workload conditions. No work has so far proposed a co-optimization method with hybrid-cache-based server architectures for real-life dynamic scenarios, taking into account scalability, performance, lifetime reliability, and energy efficiency at the same time. In this paper, we first formulate a reliability model for the hybrid-cache architecture to enable precise lifetime reliability management and energy efficiency optimization. We also include the performance and energy overheads of cache switching, and optimize the benefits of hybrid-cache usage for better energy efficiency and performance. Then, we propose a runtime Q-Learning-based reliability management and performance optimization approach for multi-core microprocessors with the hybrid-cache architecture, jointly incorporating a dynamic preemptive priority queue management method to improve overall task performance by aiming to respect tasks' end-time limits. Experimental results show that our proposed method achieves up to 44% average performance (i.e., task execution time) improvement, while maintaining the whole system design lifetime longer than 5 years, when compared to the latest state-of-the-art energy efficiency optimization and reliability management methods for computing servers.
Juan Carlos Salinas-Hilburg, Marina Zapater, José M. Moya, José L. Ayala
Computers & Electrical Engineering,
2022, vol. 97, article no. 107630
Data centers are power-hungry facilities. Energy-aware task scheduling approaches are of utmost importance to improve energy savings in data centers, although they need to know beforehand the energy consumption of the applications that will run on the servers. This is usually done through a full profiling of the applications, which is not feasible in long-running application scenarios due to the long execution times. In the present work we use an application signature that allows estimating the energy without the need to execute the application completely. We use different scheduling approaches together with the information of the application signature to improve the makespan of the scheduling process and therefore improve the energy savings in data centers. We evaluate the accuracy of using the application signature by comparing against an oracle method, obtaining an error below 1.5% and Compression Ratios around 39.7 to 45.8.
Yasir Mahmood Qureshi, William Andrew Simon, Marina Zapater, Katzalin Olcoz, David Atienza
ACM Transactions on Architecture and Code Optimization,
2021, vol. 18, no 4, article no. 44, pp. 1-27
The increasing adoption of smart systems in our daily life has led to the development of new applications with varying performance and energy constraints, and suitable computing architectures need to be developed for them. In this article, we present gem5-X, a system-level simulation framework, based on gem5, for architectural exploration of heterogeneous many-core systems. To demonstrate the capabilities of gem5-X, real-time video analytics is used as a case study. It is composed of two kernels, namely video encoding and image classification using convolutional neural networks (CNNs). First, we explore through gem5-X the benefits of the latest 3D high-bandwidth memory (HBM2) in different architectural configurations. Then, using a two-step exploration methodology, we develop a new optimized clustered-heterogeneous architecture with HBM2 in gem5-X for the video analytics application. In this proposed clustered-heterogeneous architecture, an ARMv8 in-order cluster with an in-cache computing engine executes the video encoding kernel, giving 20% performance and 54% energy benefits compared to baseline ARM in-order and Out-of-Order systems, respectively. Furthermore, thanks to gem5-X, we conclude that ARM Out-of-Order clusters with HBM2 are the best choice to run visual recognition using CNNs, as they outperform DDR4-based systems by up to 30% both in terms of performance and energy savings.
Valentin Gabeff, Tomas Teijeiro, Marina Zapater, Leila Cammoun, Sylvie Rheims, Philippe Ryvlin, David Atienza
Artificial Intelligence in Medicine,
2021, vol. 117, article no. 102084
While Deep Learning (DL) is often considered the state of the art for Artificial Intelligence-based medical decision support, it remains sparsely implemented in clinical practice and poorly trusted by clinicians due to the insufficient interpretability of neural network models. We have approached this issue in the context of online detection of epileptic seizures by developing a DL model from EEG signals, and associating certain properties of the model behavior with the expert medical knowledge. This has conditioned the preparation of the input signals, the network architecture, and the post-processing of the output in line with the domain knowledge. Specifically, we focused the discussion on three main aspects: (1) how to aggregate the classification results on signal segments provided by the DL model into a larger time scale, at the seizure level; (2) what are the relevant frequency patterns learned in the first convolutional layer of different models, and their relation with the delta, theta, alpha, beta, and gamma frequency bands on which the visual interpretation of EEG is based; and (3) the identification of the signal waveforms with larger contribution towards the ictal class, according to the activation differences highlighted using the DeepLIFT method. Results show that the kernel size in the first layer determines the interpretability of the extracted features and the sensitivity of the trained models, even though the final performance is very similar after post-processing. Also, we found that amplitude is the main feature leading to an ictal prediction, suggesting that a larger patient population would be required to learn more complex frequency patterns. Still, our methodology was successfully able to generalize patient inter-variability for the majority of the studied population, with a classification F1-score of 0.873, detecting 90% of the seizures.
Arman Iranfar, Marina Zapater, David Atienza
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
2022, vol. 41, no. 4, pp. 1034-1047
Nowadays, Deep Convolutional Neural Networks (DCNNs) play a significant role in many application domains, such as computer vision, medical imaging, and image processing. Nonetheless, designing a DCNN able to beat the state of the art is a manual, challenging, and time-consuming task, due to the extremely large design space resulting from the large number of layers and their corresponding hyperparameters. In this work, we address the challenge of performing hyperparameter optimization of DCNNs through a novel Multi-Agent Reinforcement Learning (MARL)-based approach, eliminating the human effort. In particular, we adapt Q-learning and define learning agents per layer to split the design space into independent smaller sub-spaces, such that each agent fine-tunes the hyperparameters of its assigned layer with respect to a global reward. Moreover, we provide a novel formulation of Q-tables along with a new update rule that facilitates agents' communication. Our MARL-based approach is data-driven and able to consider an arbitrary set of design objectives and constraints. We apply our MARL-based solution to different well-known DCNNs, including GoogLeNet, VGG, and U-Net, and various datasets for image classification and semantic segmentation. Our results show that, compared to the original CNNs, the MARL-based approach can reduce model size, training time, and inference time by up to 83x, 52%, and 54%, respectively, without any degradation in accuracy. Moreover, our approach is very competitive with state-of-the-art neural architecture search methods in terms of the designed CNN's accuracy and number of parameters, while significantly reducing the optimization cost.
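A toy rendering of the per-layer agent scheme: each agent keeps a Q-table over only its own layer's hyperparameter choices and updates it from the single shared global reward, so no agent ever explores the full joint design space. The search space, reward shape, and constants below are invented for illustration and are much simpler than the paper's Q-table formulation.

```python
import random

random.seed(1)  # reproducible toy run

# Invented toy search space: each "layer" agent picks a filter count.
CHOICES = [16, 32, 64, 128]
N_LAYERS, ALPHA, EPS = 3, 0.05, 0.2

def global_reward(config):
    # Hypothetical accuracy proxy peaking at 64 filters per layer, with a
    # small model-size penalty; agents only ever see this single scalar.
    return -sum((c - 64) ** 2 for c in config) / 1e3 - sum(config) / 1e4

q = [[0.0] * len(CHOICES) for _ in range(N_LAYERS)]  # one Q-table per agent

for episode in range(2000):
    # Each agent picks its own layer's value epsilon-greedily.
    actions = [random.randrange(len(CHOICES)) if random.random() < EPS
               else max(range(len(CHOICES)), key=lambda a: q[l][a])
               for l in range(N_LAYERS)]
    r = global_reward([CHOICES[a] for a in actions])
    for l, a in enumerate(actions):
        q[l][a] += ALPHA * (r - q[l][a])  # stateless update on the shared reward

best = [CHOICES[max(range(len(CHOICES)), key=lambda a: q[l][a])]
        for l in range(N_LAYERS)]
```

The point of the decomposition is that each agent's table has only `len(CHOICES)` entries instead of `len(CHOICES) ** N_LAYERS` for a monolithic agent, which is what makes the search tractable as the number of layers grows.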
Frederico Terraneo, Alberto Leva, William Fornaciari, Marina Zapater, David Atienza
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
2022, vol. 41, no. 4, pp. 1062-1075
The increasing power density in modern high-performance multi-processor systems-on-chip (MPSoCs) is fueling a revolution in thermal management. On the one hand, thermal phenomena are becoming a critical concern, making accurate and efficient simulation a necessity. On the other hand, a variety of physically heterogeneous solutions are coming into play: liquid, evaporative, thermoelectric cooling, and more. A new generation of simulators, with unprecedented flexibility, is thus required. In this paper, we present 3D-ICE 3.0, the first thermal simulator to allow accurate nonlinear descriptions of complex and physically heterogeneous heat dissipation systems, while preserving the efficiency of the latest compact modeling frameworks at the silicon die level. 3D-ICE 3.0 allows designers to extend the thermal simulator with new heat sink models while simplifying the time-consuming step of model validation. Support for nonlinear dynamic models is included, for instance to accurately represent variable coolant flows. Our results present validated models of a commercial water heat sink and an air heat sink plus fan that achieve an average error below 1 °C and simulate, respectively, up to 3x and 12x faster than the real physical phenomena.
Ali Pahlevan, Marina Zapater, Ayse K. Coskun, David Atienza
IEEE Transactions on Sustainable Computing,
2021, vol. 6, no. 2, pp. 289 - 305
Modern datacenters need to tackle efficiently the increasing demand for computing resources while minimizing energy usage and monetary costs. Power market operators have recently introduced emerging demand-response programs, in which electricity consumers regulate their power usage following provider requests to reduce monetary costs. Among different programs, regulation service (RS) reserves are particularly promising for datacenters due to the high credit gain possibilities and datacenters' flexibility in regulating their power consumption. Therefore, it is essential to develop bidding strategies for datacenters to participate in emerging power markets together with power management policies that are aware of power market requirements at runtime. In this paper we propose ECOGreen, a holistic strategy to jointly optimize the datacenter RS problem and virtual machine (VM) allocation that satisfies the hour-ahead power market constraints in the presence of electrical energy storage (EES) and renewable energy. We first find the best power and reserve bidding values as well as the number of active servers in a fast analytical way that works well in practice. Then, we present an online adaptive policy that modulates datacenter power consumption by controlling VMs CPU resource limits and efficiently utilizing demand-side EES and renewable power, while guaranteeing quality-of-service (QoS) constraints. Our results demonstrate that ECOGreen can provide 76 percent of the datacenter power consumption on average as reserves to the market, due to largely operating on renewable sources and EES. This translates into ECOGreen saving up to 71 percent electricity costs when compared to other state-of-the-art datacenter electricity cost minimization techniques that participate in the power market.
Juan Carlos Salinas-Hilburg, Marina Zapater, José M. Moya, José L. Ayala
Future Generation Computer Systems,
2021, vol. 115, pp. 20-33
The computation power in data center facilities is increasing significantly, bringing with it an increase in data center power consumption. Techniques such as power budgeting or resource management are used in data centers to increase energy efficiency, but they require knowing beforehand the energy consumption of the applications, obtained through a full profiling of them. This is not feasible in scenarios with long-running applications that have long execution times. To tackle this problem we present a fast energy estimation framework for long-running applications. The framework is able to estimate the dynamic CPU and memory energy of an application without the need to perform a complete execution. For that purpose, we leverage the concept of the application signature: a reduced version, in terms of execution time, of the original application. Our fast energy estimation framework is validated with a set of long-running applications and obtains RMS error values of 11.4% and 12.8% for the CPU and memory energy estimates, respectively. We define the concept of Compression Ratio as an indicator of the acceleration of the energy estimation process. Our framework is able to obtain Compression Ratio values in the range of 10.1 to 191.2.
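The core of the signature approach reduces to a few lines: meter a short, representative slice of the run, then scale the measured energy by the Compression Ratio. The power numbers and helper names below are illustrative stand-ins, not the paper's measurement framework.

```python
# Illustrative sketch of signature-based energy estimation.

def run_and_meter(duration_s, avg_power_w):
    """Stand-in for executing a workload while sampling its power draw."""
    return duration_s * avg_power_w  # energy in joules

def estimate_energy(signature_s, full_s, signature_power_w):
    """Extrapolate full-run energy from a short signature run."""
    compression_ratio = full_s / signature_s  # paper reports ~10x to ~190x
    signature_energy = run_and_meter(signature_s, signature_power_w)
    return signature_energy * compression_ratio

# A 30 s signature standing in for a 90 min run at an assumed ~85 W average:
est_joules = estimate_energy(signature_s=30, full_s=5400, signature_power_w=85)
```

The real framework estimates CPU and memory energy separately and must build a signature that preserves the original application's behavior, which is where the reported 11-13% RMS errors come from; the extrapolation step itself is this simple.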
Darong Huang, Ali Pahlevan, Marina Zapater, David Atienza
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
2022, vol. 41, no. 2, pp. 386-399
High-performance computing (HPC) servers must handle an increasing number and complexity of tasks and, consequently, address the energy efficiency challenge. In addition to energy efficiency, it is essential to manage the lifetime limitations of power-hungry server components (e.g., cores and cache), hence avoiding server failure before the end of its design lifetime. Traditional approaches focus either on using hybrid caches to reduce the leakage power of traditional static random-access memory (SRAM) caches, and thus increase energy efficiency, or on the trade-off between the lifetime and performance of multi-core processors. However, these approaches fall short in flexibility and applicability for HPC tasks in terms of multi-parametric optimization, including quality-of-service (QoS), lifetime reliability, and energy efficiency. As a result, in this paper we propose COCKTAIL, a holistic strategy framework to jointly optimize the energy efficiency of multi-core server processors and task performance in the HPC context, while guaranteeing lifetime reliability. First, we analyze the best cache technology among traditional SRAM and resistive random-access memory (RRAM), within the context of hybrid cache architectures, to improve energy efficiency and manage cache endurance limits with respect to task requirements. Second, we introduce a novel efficient proactive queue optimization policy to reorder HPC tasks for execution, considering their end times and the possible reliability effects of using the hybrid caches. Third, we present a dynamic model predictive control (MPC)-based reliability management method to maximize task performance by controlling the frequency, temperature, and target lifetime of the server processor. Our results demonstrate that, while consuming similar energy, COCKTAIL provides up to 60% QoS improvement when compared to the latest state-of-the-art energy optimization and reliability management techniques in the HPC context. Moreover, our strategy guarantees a design lifetime longer than 5 years for the whole HPC system.
Luis Costero, Arman Iranfar, Marina Zapater, Francisco D. Igual, Katzalin Olcoz, David Atienza
IEEE Transactions on Parallel and Distributed Systems,
2020, vol. 31, no. 12
The advent of online video streaming applications and services, along with users' demand for high-quality content, requires High Efficiency Video Coding (HEVC), which provides higher video quality and more compression at the cost of increased complexity. On the one hand, HEVC exposes a set of dynamically tunable parameters to provide trade-offs among Quality-of-Service (QoS), performance, and power consumption of multi-core servers in the video providers' data centers. On the other hand, resource management of modern multi-core servers is in charge of adapting system-level parameters, such as operating frequency and multithreading, to deal with concurrent applications and their requirements. Therefore, efficient multi-user HEVC streaming necessitates joint adaptation of application- and system-level parameters. Nonetheless, dealing with such a large and dynamic design space is challenging and difficult to address through conventional resource management strategies. Thus, in this work, we develop a multi-agent Reinforcement Learning framework to jointly adjust application- and system-level parameters at runtime to satisfy the QoS of multi-user HEVC streaming in power-constrained servers. In particular, the design space, composed of all design parameters, is split into smaller independent sub-spaces. Each design sub-space is assigned to a particular agent so that it can explore it faster, yet accurately. The benefits of our approach are revealed in terms of adaptability and quality (with up to 4× improvements in terms of QoS when compared to a static resource management scheme), and learning time (6× faster than an equivalent mono-agent implementation). Finally, we show that the power-capping techniques formulated outperform hardware-based power capping with respect to quality.
William Andrew Simon, Yasir Mahmood Qureshi, Marco Rios, Alexandre Levisse, Marina Zapater
IEEE Transactions on Computers,
2020, vol. 69, no. 9, pp. 1349 - 1363
Area- and power-constrained edge devices are increasingly utilized to perform compute-intensive workloads, necessitating increasingly area- and power-efficient accelerators. In this context, in-SRAM computing performs hundreds of parallel operations on spatially local data common in many emerging workloads, while reducing the power consumption due to data movement. However, in-SRAM computing faces many challenges, including integration into the existing architecture, arithmetic operation support, data corruption at high operating frequencies, inability to run at low voltages, and low area density. To meet these challenges, this article introduces BLADE, a BitLine Accelerator for Devices on the Edge. BLADE is an in-SRAM computing architecture that utilizes local wordline groups to perform computations at a frequency 2.8× higher than state-of-the-art in-SRAM computing architectures. BLADE is integrated into the cache hierarchy of low-voltage edge devices, and simulated and benchmarked at the transistor, architecture, and software abstraction levels. Experimental results demonstrate performance/energy gains over an equivalent NEON-accelerated processor for a variety of edge device workloads, namely cryptography (4× performance gain/6× energy reduction), video encoding (6×/2×), and convolutional neural networks (3×/1.5×), while maintaining the highest frequency/energy ratio (up to 2.2 GHz at 1 V) of any conventional in-SRAM computing architecture, and a low area overhead of less than 8 percent.
Giovanni Agosta, William Fornaciari, David Atienza, Ramon Canal, Alessandro Cilardo, Marina Zapater
Microprocessors and Microsystems,
2020, vol. 77, 103185
RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches to maximize hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. We address it through hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of application latency and hardware reliability, obtained respectively through timing analysis and through modeling the thermal properties and mean time to failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case.
Yasir Mahmood Qureshi, Jose Manuel Herruzo, Marina Zapater, Katzalin Olcoz, Sonia Gonzalez-Navarro
IEEE Transactions on Computers,
2020, vol. 14, no. 8, pp. 1-14
Next-generation workloads, such as genome sequencing, have an astounding impact on the healthcare sector. Sequence alignment, the first step in genome sequencing, has experienced recent breakthroughs, which resulted in next-generation sequencing (NGS). As NGS applications are memory-bound with random memory access patterns, we propose the use of high-bandwidth memories like 3D-stacked HBM2, instead of traditional DRAMs like DDR4, along with energy-efficient compute cores to improve both performance and energy efficiency. Three state-of-the-art NGS applications, Bowtie2, BWA-MEM, and HISAT2, are used as case studies to explore and optimize NGS computing architectures. Then, using the gem5-X architectural simulator, we obtain an overall 68% performance improvement and 71% energy savings using HBM2 instead of DDR4. Furthermore, we propose an architecture based on ARMv8 cores and demonstrate that 16 ARMv8 64-bit OoO cores with HBM2 outperform 32 cores of the Intel Xeon Phi Knights Landing (KNL) processor with 3D-stacked memory. Moreover, we show that by using frequency scaling we can achieve up to 59% and 61% energy savings for ARM in-order and OoO cores, respectively. Lastly, we show that many ARMv8 in-order cores at 1.5 GHz match the performance of fewer OoO cores at 2 GHz, while attaining 4.5x energy savings.
Kawsar Haghshenas, Ali Pahlevan, Marina Zapater, Siamak Mohammadi, David Atienza
IEEE Transactions on Services Computing,
Improving the energy efficiency of data centers while guaranteeing Quality of Service (QoS), together with detecting performance variability of servers caused by either hardware or software failures, are two of the major challenges for efficient resource management of large-scale cloud infrastructures. Previous works in the area of dynamic Virtual Machine (VM) consolidation are mostly focused on addressing the energy challenge, but fall short in proposing comprehensive, scalable, and low-overhead approaches that jointly tackle energy efficiency and performance variability. Moreover, they usually assume over-simplistic power models, and fail to accurately account for all the delay and power costs associated with VM migration and host power mode transitions. These assumptions are no longer valid in modern servers executing heterogeneous workloads and lead to unrealistic or inefficient results. In this paper, we propose a centralized-distributed, low-overhead, failure-aware dynamic VM consolidation strategy to minimize energy consumption in large-scale data centers. Our approach selects the most adequate power mode and frequency of each host during runtime using a distributed multi-agent Machine Learning (ML) based strategy, and migrates the VMs accordingly using a centralized heuristic. Our Multi-AGent machine learNing-based approach for Energy efficienT dynamIc Consolidation (MAGNETIC) is implemented in a modified version of the CloudSim simulator that considers the energy and delay overheads associated with host power mode transitions and VM migration, and is evaluated using power traces collected from various workloads running on real servers, together with resource utilization logs from cloud data center infrastructures.
Results show that our strategy reduces data center energy consumption by up to 15% compared to other works in the state-of-the-art (SoA), guaranteeing the same QoS and reducing the number of VM migrations and host power mode transitions by up to 86% and 90%, respectively. Moreover, it shows better scalability than all other approaches, incurring less than 0.7% time overhead to execute for a data center with 1500 VMs. Finally, our solution is capable of detecting host performance variability due to failures, automatically migrating VMs away from failing hosts and draining them of workload.
IEEE Transactions on Parallel and Distributed Systems,
2018, vol. 29, no. 10, pp. 2268 - 2281
The emergence of video streaming applications, together with the users' demand for high-resolution contents, has led to the development of new video coding standards, such as High Efficiency Video Coding (HEVC). HEVC provides high efficiency at the cost of increased complexity. This higher computational burden results in increased power consumption in current multicore servers. To tackle this challenge, algorithmic optimizations need to be accompanied by content-aware application-level strategies, able to reduce power while meeting compression and quality requirements. In this paper, we propose a machine learning-based power and thermal management approach that dynamically learns and selects the best encoding configuration and operating frequency for each of the videos running on multicore servers, by using information from frame compression, quality, encoding time, power, and temperature. In addition, we present a resolution-aware video assignment and migration strategy that reduces the peak and average temperature of the chip while maintaining the desirable encoding time. We implemented our approach in an enterprise multicore server and evaluated it under several common scenarios for video providers. On average, compared to a state-of-the-art technique, for the most realistic scenario, our approach improves BD-PSNR and BD-rate by 0.54 dB and 8 percent, respectively, and reduces the encoding time, power consumption, and average temperature by 15.3, 13, and 10 percent, respectively. Moreover, our proposed approach enhances BD-PSNR and BD-rate compared to the HEVC Test Model (HM) by 1.19 dB and 24 percent, respectively, without any encoding time degradation, when power and temperature constraints are relaxed.
José Flich, Giovanni Agosta, Philipp Ampletzer, David Atienza Alonso, Carlo Brandolese, Marina Zapater
Microprocessors and Microsystems,
2018, vol. 61, pp. 154-170
The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated.
Ali Pahlevan, Xiaoyu Qu, Marina Zapater, David Atienza
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
2018, vol. 37, no. 8, pp. 1667 - 1680
Modern cloud data centers (DCs) need to tackle efficiently the increasing demand for computing resources and address the energy efficiency challenge. Therefore, it is essential to develop resource provisioning policies that are aware of virtual machine (VM) characteristics, such as CPU utilization and data communication, and applicable in dynamic scenarios. Traditional approaches fall short in terms of flexibility and applicability for large-scale DC scenarios. In this paper, we propose a heuristic- and a machine learning (ML)-based VM allocation method and compare them in terms of energy, quality of service (QoS), network traffic, migrations, and scalability for various DC scenarios. Then, we present a novel hyper-heuristic algorithm that exploits the benefits of both methods by dynamically finding the best algorithm, according to a user-defined metric. For optimality assessment, we formulate an integer linear programming (ILP)-based VM allocation method to minimize energy consumption and data communication, which obtains optimal results, but is impractical at runtime. Our results demonstrate that the ML approach provides up to 24% server-to-server network traffic improvement and reduces execution time by up to 480× compared to conventional approaches, for large-scale scenarios. Conversely, the heuristic outperforms the ML method in terms of energy and network traffic for reduced scenarios. We also show that the heuristic and ML approaches have up to 6% energy consumption overhead compared to the ILP-based optimal solution. Our hyper-heuristic integrates the strengths of both the heuristic and the ML methods by selecting the best one during runtime.
Josué Pagán, Marina Zapater, José L. Ayala
Future Generation Computer Systems,
2018, vol. 78, no. 2, pp. 587-601
The Internet of Things (IoT) holds big promises for healthcare, especially in proactive personal eHealth. Prediction of symptomatic crises in chronic diseases in the IoT scenario leads to the deployment of ambulatory monitoring systems. These systems raise major concerns about the amount of data to be processed and the intelligent management of energy consumption. The huge amount of data generated by these systems requires high computing capabilities only available in Data Centers. This paper presents a real case of prediction in the eHealth scenario, devoted to neurological disorders. The presented case study focuses on migraine headache, a disease that affects around 15% of the European population. This paper extrapolates results from real data and simulations in a study where migraine patients are monitored using an unobtrusive Wireless Body Sensor Network. Low-power techniques, such as on-node signal processing and radio policies that extend node autonomy and save energy, are applied in the monitoring nodes. Workload balancing policies are carried out in the coordinator nodes and Data Centers to reduce the computational burden in these facilities and minimize their energy consumption. Our results show average savings of €288 million in this eHealth scenario applied to only 2% of European migraine sufferers, in addition to savings of €1272 million due to the benefits of migraine prediction.
Artem Aleksandrovich Andreev, Arvind Sridhar, Mohamed M. Sabry, Marina Zapater, Patrick Ruch
IEEE Transactions on Computers,
2018, vol. 67, no. 1, pp. 73 - 85
Integrated Flow-Cell Arrays (FCAs) represent a combination of integrated liquid cooling and on-chip power generation, converting the chemical energy of flowing electrolyte solutions into electrical energy. The FCA technology provides a promising way to address both heat removal and power delivery issues in 3D Multiprocessor Systems-on-Chip (MPSoCs). In this paper we motivate the benefits of FCA in 3D MPSoCs via a qualitative analysis and explore the capabilities of the proposed technology using our extended PowerCool simulator. PowerCool is a tool that performs combined compact thermal and electrochemical simulation of 3D MPSoCs with inter-tier FCA-based cooling and power generation. We validate our electrochemical model against experimental data obtained using a micro-scale FCA, and extend PowerCool with a compact thermal model (3D-ICE) and subthreshold leakage estimation. We show the sensitivity of FCA cooling and power generation to design-time (FCA geometry) and run-time (fluid inlet temperature, flow rate) parameters. Our results show that we can optimize the FCA to keep the maximum chip temperature below 95 °C for an average chip power consumption of 50 W/cm² while generating up to 3.6 W per cm² of chip area.
José L. Risco-Martin, Saurabh Mittal, Juan Carlos Fabero Jiménez, Marina Zapater, Román Hermida Correa
2017, vol. 93, no. 6, pp. 459–476
The discrete event system specification (DEVS) formalism, which supports hierarchical and modular model composition, has been widely used to understand, analyze and develop a variety of systems. DEVS has been implemented in various languages and platforms over the years. The DEVStone benchmark was conceived to generate a set of models with varied structure and behavior, and to automate the evaluation of the performance of DEVS-based simulators. However, DEVStone is still in a preliminary phase and more model analysis is required. In this paper, we revisit DEVStone, introducing new equations to compute the number of events triggered. We also introduce a new benchmark with central processing unit and memory requirements similar to those of the most complex benchmark in DEVStone, but with an easier implementation that is more manageable analytically. Finally, we compare both the performance and memory footprint of five different DEVS simulators on two different hardware platforms.
Marina Zapater, José L. Risco-Martín, Patricia Arroba, José L. Ayala, José M. Moya
Applied Soft Computing,
2016, vol. 49, pp. 94-107
Data Centers are huge power consumers, both because of the energy required for computation and the cooling needed to keep servers below thermal redlining. The most common technique to minimize cooling costs is increasing the data room temperature. However, to avoid reliability issues and to enhance energy efficiency, there is a need to predict the temperature attained by servers under variable cooling setups. Due to the complex thermal dynamics of data rooms, accurate runtime data center temperature prediction has remained an important challenge. By using Grammatical Evolution techniques, this paper presents a methodology for the generation of temperature models for data centers and the runtime prediction of CPU and inlet temperature under variable cooling setups. As opposed to time-costly Computational Fluid Dynamics techniques, our models do not need specific knowledge about the problem, can be used in arbitrary data centers, can be re-trained if conditions change, and have negligible overhead during runtime prediction. Our models have been trained and tested using traces from real Data Center scenarios. Our results show how we can fully predict the temperature of the servers in a data room, with prediction errors below 2 °C and 0.5 °C in CPU and server inlet temperature, respectively.
Marina Zapater, Ozan Tuncer, José L. Ayala, José M. Moya, Kalyan Vaidyanathan
IEEE Transactions on Parallel and Distributed Systems,
2015, vol. 26, no. 10, pp. 2764 - 2777
The computational and cooling power demands of enterprise servers are increasing at an unsustainable rate. Understanding the relationship between computational power, temperature, leakage, and cooling power is crucial to enable energy-efficient operation at the server and data center levels. This paper develops empirical models to estimate the contributions of static and dynamic power consumption in enterprise servers for a wide range of workloads, and analyzes the interactions between temperature, leakage, and cooling power for various workload allocation policies. We propose a cooling management policy that minimizes the server energy consumption by setting the optimum fan speed during runtime. Our experimental results on a presently shipping enterprise server demonstrate that including leakage awareness in workload and cooling management provides additional energy savings without any impact on performance.
Marina Zapater, David Fraga, Pedro Malagón, Zorana Bankovic, José M. Moya
Logic Journal of IGPL,
2015, vol. 23, no. 3, pp. 495–505
Reliability is one of the key performance factors in data centres. The out-of-scale energy costs of these facilities lead data centre operators to increase the ambient temperature of the data room to decrease cooling costs. However, increasing ambient temperature reduces the safety margins and can result in a higher number of anomalous events. Anomalies in the data centre need to be detected as soon as possible to optimize cooling efficiency and mitigate the harmful effects over servers. This article proposes the usage of clustering-based outlier detection techniques coupled with a trust and reputation system engine to detect anomalies in data centres. We show how self-organizing maps or growing neural gas can be applied to detect cooling and workload anomalies, respectively, in a real data centre scenario with very good detection and isolation rates, in a way that is robust to the malfunction of the sensors that gather server and environmental information.
Marina Zapater, Patricia Arroba, José Luis Ayala Rodrigo, Katzalin Olcoz Herrero, José Manuel Moya Fernandez
Cloud computing with e-science applications
(20 p.). 2015,
Boca Raton : CRC Press
This chapter provides a vision of the increasing energy problem in computing facilities, with a focus on cloud computing under the new computational paradigms, and proposes solutions from a global, multilayer perspective, describing a novel system architecture, power models, and optimization algorithms. Researchers have done a massive amount of work to address these issues and provide energy-aware computing environments. Consolidation allows reducing the number of operating servers to process the same workload, minimizing static consumption, which leads to operating server set and turn-off policies. Whether we dream of a world in which anyone can sell their excess computing capacity as virtualized resources to anyone else, or one where ubiquitously sensed information is processed by a center kilometers away from its source, cloud computing, mobile cloud computing, and even modern high-performance computing all start with data centers.
Alfredo Cuesta-Infante, J. Manuel Colmenar, Zorana Bankovic, José L. Risco-Martín, Marina Zapater
2015, vol. 150, part A, pp. 67-81
The constant need to improve performance has brought about the invention of 3D chips. The improvement is achieved through the reduction of wire length, which results in decreased interconnection delay. However, 3D stacks have worse heat dissipation due to the inner layers, which leads to increased temperature and the appearance of hotspots. This problem can be mitigated through appropriate floorplanning. For this reason, in this work we present and compare five different solutions for the floorplanning of 3D chips. Each solution uses a different representation, and all are based on meta-heuristic algorithms: three are based on simulated annealing, while the other two are based on evolutionary algorithms. The results show the great capability of all the solutions in optimizing temperature and wire length, as they all exhibit significant improvements compared to the benchmark floorplans.
Patricia Arroba, José L. Risco-Martín, Marina Zapater, José M. Moya, José L. Ayala
Journal of Grid Computing,
2015, vol. 13, pp. 409–423
This work proposes an automatic methodology for modeling complex systems. Our methodology is based on the combination of Grammatical Evolution and classical regression to obtain an optimal set of features that form part of a linear and convex model. This technique provides both Feature Engineering and Symbolic Regression in order to infer accurate models without requiring significant effort or designer expertise. As advanced Cloud services become mainstream, the contribution of data centers to the overall power consumption of modern cities is growing dramatically. These facilities consume from 10 to 100 times more power per square foot than typical office buildings. Modeling the power consumption of these infrastructures is crucial to anticipate the effects of aggressive optimization policies, but accurate and fast power modeling is a complex challenge for high-end servers not yet satisfied by analytical approaches. For this case study, our methodology minimizes the error in power prediction. This work has been tested using real Cloud applications, resulting in an average error in power estimation of 3.98%. Our work improves the possibilities of deriving energy-efficient policies in Cloud data centers, and is applicable to other computing environments with similar characteristics.
Marina Zapater, Patricia Arroba, José L. Ayala, José M. Moya, Katzalin Olcoz
Future Generation Computer Systems,
2014, vol. 34, pp. 138-154
A first-rate e-Health system saves lives, provides better patient care, allows complex but useful epidemiologic analyses, and saves money. However, there may also be concerns about the costs and complexities associated with e-Health implementation, and about the need to reduce the energy footprint of the highly demanding computing facilities involved. This paper proposes a novel and evolved computing paradigm that: (i) provides the required computing and sensing resources; (ii) allows population-wide diffusion; (iii) exploits the storage, communication and computing services provided by the Cloud; (iv) tackles the energy-optimization issue as a first-class requirement, taking it into account during the whole development cycle. The novel computing concept and the multi-layer top-down energy-optimization methodology obtain promising results in a realistic scenario for cardiovascular tracking and analysis, making Home Assisted Living a reality.
Marina Zapater, José L. Ayala, Jose M. Moya
Ubiquitous Computing and Ambient Intelligence
(8 p.). 2012,
Heidelberg : Springer
In the near future, wireless sensor networks (WSNs) will experience broad, large-scale deployment (millions of nodes in the national area) with multiple information sources per node, and with very specific requirements for signal processing. In parallel, the broad deployment of WSNs facilitates the definition and execution of ambitious studies, with large input data sets and high computational complexity. These computation resources, very often heterogeneous and driven on demand, can only be satisfied by high-performance Data Centers (DCs). The high economic and environmental impact of energy consumption in DCs requires aggressive energy optimization policies. The need for such policies has been identified, but they have not yet been successfully proposed. In this context, this paper presents the following ongoing research lines and obtained results. In the field of WSNs: energy optimization in the processing nodes from different abstraction levels, including reconfigurable application-specific architectures, efficient customization of the memory hierarchy, energy-aware management of the wireless interface, and design automation for signal processing applications. In the field of DCs: energy-optimal workload assignment policies in heterogeneous DCs, resource management policies with energy consciousness, and efficient cooling mechanisms that will cooperate in minimizing the electricity bill of the DCs that process the data provided by the WSNs.
Marina Zapater, Cesar Sanchez, Jose L. Ayala, Jose M. Moya, José L. Risco-Martín
2012, vol. 12, no. 8, pp. 10659-10677
Ubiquitous sensor network deployments, such as those found in Smart City and Ambient Intelligence applications, impose constantly increasing computational demands in order to process data and offer services to users. The nature of these applications implies the use of data centers. Research has paid much attention to the energy consumption of the sensor nodes in WSN infrastructures. However, supercomputing facilities are the ones presenting a higher economic and environmental impact due to their very high power consumption, a problem that has been disregarded in the field of smart environment services. This paper proposes an energy-minimization workload assignment technique, based on heterogeneity and application-awareness, that redistributes low-demand computational tasks from high-performance facilities to idle nodes with low and medium resources in the WSN infrastructure. The proposed allocation policies reduce the energy consumed by the whole infrastructure and the total execution time.
Pedro Malagón, Juan-Mariano de Goyeneche, Marina Zapater, José M. Moya, Zorana Bankovic
2012, vol. 12, no. 6, pp. 7994-8012
Ambient Intelligence (AmI) requires devices everywhere: dynamic and massively distributed networks of low-cost nodes that, among other data, manage private information or control restricted operations. The MSP430, a 16-bit microcontroller, is used in WSN platforms such as the TelosB. Physical access to devices cannot be restricted, so attackers consider them targets of malicious attacks aimed at obtaining access to the network. Side-channel analysis (SCA) easily exploits leakages from the execution of encryption algorithms that are dependent on critical data in order to guess the key value. In this paper we present an evaluation framework that facilitates the analysis of the effects of compiler and backend optimizations on resistance against statistical SCA. We propose an optimization-based software countermeasure that can be used in current low-cost devices to radically increase resistance against statistical SCA, which we analyze with the new framework.
Rafael Medina, Darong Huang, Giovanni Ansaloni, Marina Zapater, David Atienza
Proceedings of the 2023 IFIP/IEEE 31st International Conference on Very Large Scale Integration (VLSI-SoC)
2.5D Systems-on-Package (SoPs) are composed of several chiplets placed on an interposer. They are becoming increasingly popular, as they enable easy integration of electronic components in the same package and high fabrication yields. Nevertheless, they introduce a new bottleneck in inter-chiplet communication, which must be routed through the interposer. Such a constraint favors mapping related tasks on computing cores within the same chiplet, leading to thermal hotspots. In-package wireless technology holds promise to relax this constraint, because integrated wireless antennas provide low-latency and high-bandwidth communication paths, thus bypassing the interposer bottleneck. In this work, we propose a new task mapping heuristic that leverages in-package wireless technology to improve the thermal behavior of 2.5D SoPs executing complex applications. Combining system simulation and thermal modeling, our results show that we can distribute computation in wireless 2.5D SoPs to reduce peak temperatures by up to 24% through task mapping with a negligible performance impact.
Karan Pathak, Joshua Klein, Giovanni Ansaloni, Marina Zapater, David Atienza
Proceedings of RISC-V Summit Europe
RISC-V-based Systems-on-Chip (SoCs) are witnessing a steady rise in adoption in both industry and academia. However, the limited support for Linux-capable full-system-level simulators hampers the development of the RISC-V ecosystem. We address this by validating a full-system-level simulator, gXR5 (gem5-eXtensions for RISC-V), against the SiFive HiFive Unleashed SoC, to ensure its performance statistics are representative of actual hardware. This work also enriches existing methodologies for validating the gXR5 simulator against hardware by proposing a systematic component-level calibration approach. The simulator error for selected SPEC CPU2017 applications drops from 44% to 24% just by calibrating the CPU. We show that this systematic component-level calibration approach is accurate, fast (in terms of simulation time), and generic enough to drive future validation efforts.
Mehdi Akeddar, Thomas Rieder, Guillaume Chacun, Bruno Da Rocha Carvalho, Marina Zapater
Proceedings of the RISC-V Summit Europe, 5-9th June 2023, Barcelona, Spain
In recent years we have witnessed an increasing adoption of RISC-V-based systems to run Artificial Intelligence (AI) inference tasks. This trend extends to visual navigation, where major players are starting to adopt RISC-V for autonomous driving. Still, RISC-V-based edge devices fall short of the performance requirements of complex AI inference. Our work tackles these challenges by proposing an open-source framework for the transparent distribution of visual navigation inference tasks between edge and cloud for resource-constrained RISC-V edge devices. Our framework automates the partitioning of ONNX and TFLite models between a RISC-V-accelerated nanodrone equipped with a GAP8 system-on-chip and a cloud server. Our results showcase how partial inference improves the performance achieved by drone-only inference.
Rafael Medina, Joshua Klein, Giovanni Ansaloni, Marina Zapater, Sergi Abadal, Eduard Alarcón, David Atienza
Proceedings of the 28th Asia and South Pacific Design Automation Conference (ASPDAC '23)
Multi-Chiplet architectures are being increasingly adopted to support the design of very large systems in a single package, facilitating the integration of heterogeneous components and improving manufacturing yield. However, chiplet-based solutions have to cope with limited inter-chiplet routing resources, which complicate the design of the data interconnect and the power delivery network. Emerging in-package wireless technology is a promising strategy to address these challenges, as it makes it possible to implement flexible chiplet interconnects while freeing package resources for power supply connections. To assess the capabilities of such an approach and its impact from a full-system perspective, herein we present an exploration of the performance of in-package wireless communication, based on dedicated extensions to the gem5-X simulator. We consider different Medium Access Control (MAC) protocols, as well as applications with different runtime profiles, showcasing that current in-package wireless solutions are competitive with wired chiplet interconnects. Our results show how in-package wireless solutions can outperform wired alternatives when running artificial intelligence workloads, achieving up to a 2.64× speed-up when running deep neural networks (DNNs) on a chiplet-based system with 16 cores distributed in four clusters.
Darong Huang, Luis Costero, Federico Terraneo, Marina Zapater, David Atienza
Proceedings of HSSB Workshop - ISCA 2022
The increasing power density of modern multi-core processors built in deep nano-scale technologies has entailed severe thermal issues for such chips. Indeed, the industry's heterogeneous chip design trends exacerbate transient non-uniform thermal hotspots in next-generation processors. Different cooling solutions are coming into play to alleviate this situation, such as liquid, evaporative, or thermoelectric cooling, among others. Hence, a new generation of thermal simulators with unprecedented flexibility is required to include such cooling technologies in the modeling of nano-scale IC designs. This work presents 3D-ICE 3.1, the first thermal simulator designed for fully customized non-uniform modeling and accurate co-simulation with different heat dissipation systems.
William Andrew Simon, Valérian Ray, Alexandre Levisse, Giovanni Ansaloni, Marina Zapater, David Atienza
Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC)
Edge devices must support computationally demanding algorithms, such as neural networks, within tight area/energy budgets. While approximate computing may alleviate these constraints, limiting induced errors remains an open challenge. In this paper, we propose a hardware/software co-design solution via an inexact multiplier, reducing area/power-delay-product requirements by 73/43%, respectively, while still computing exact results when one input is a Fibonacci encoded value. We introduce a retraining strategy to quantize neural network weights to Fibonacci encoded values, ensuring exact computation during inference. We benchmark our strategy on Squeezenet 1.0, DenseNet-121, and ResNet-18, measuring accuracy degradations of only 0.4/1.1/1.7%.
Joshua Klein, Alexandre Levisse, Giovanni Ansaloni, David Atienza, Marina Zapater, Martino Dazzi, Geethan Karunaratne, Irem Boybat, Abu Sebastian, Davide Rossi, Francesco Conti, Elana Pereira de Santana, Peter Haring Bolívar, Mohamed Saeed, Renato Negra, Zhenxing Wang, Kun-Ta Wang, Max C. Lemme, Akshay Jain, Robert Guirado, Hamidreza Taghvaee, Sergi Abadal
Proceedings of the 18th ACM International Conference on Computing Frontiers (CF'21)
This paper presents the research directions pursued by the WiPLASH European project, pioneering on-chip wireless communications as a disruptive enabler towards next-generation computing systems for artificial intelligence (AI). We illustrate the holistic approach driving our research efforts, which encompass expertises and abstraction levels ranging from physical design of embedded graphene antennas to system-level evaluation of wirelessly-communicating heterogeneous systems.
Alexandre Levisse, Marina Zapater, David Atienza
Proceedings of 28th IFIP/IEEE International Conference on Very Large Scale Integration
Hybrid caches consisting of both SRAM and emerging Non-Volatile Random Access Memory (eNVRAM) bitcells increase cache capacity and reduce power consumption by taking advantage of eNVRAM’s small area footprint and low leakage energy. However, they also inherit eNVRAM’s drawbacks, including long write latency and limited endurance. To mitigate these drawbacks, many works propose heuristic strategies to allocate memory blocks into SRAM or eNVRAM arrays at runtime based on block content or access pattern. In contrast, this work presents a HW/SW Stack for Hybrid Caches (SHyCache), consisting of a hybrid cache architecture and supporting programming model, reminiscent of those that enable GP-GPU acceleration, in which application variables can be allocated explicitly to the eNVRAM cache, eliminating the need for heuristics and reducing cache access time, power consumption, and area overhead while maintaining maximal cache utilization efficiency and ease of programming. SHyCache improves performance for applications such as neural networks, which contain large numbers of invariant weight values with high read/write access ratios that can be explicitly allocated to the eNVRAM array. We simulate SHyCache on the gem5-X architectural simulator and demonstrate its utility by benchmarking a range of cache hierarchy variations using three neural networks, namely, Inception v4, ResNet-50, and SqueezeNet 1.0. We demonstrate a design space that can be exploited to optimize performance, power consumption, or endurance, depending on the expected use case of the architecture, while demonstrating maximum performance gains of 1.7/1.4/1.3x and power consumption reductions of 5.1/5.2/5.4x, for Inception/ResNet/SqueezeNet, respectively.
Halima Najibi, Alexandre Levisse, Marina Zapater, Mohamed M. Sabry Aly, David Atienza
Proceedings of GLSVLSI '20, the 2020 Great Lakes Symposium on VLSI, September 2020, Online
Deeply-scaled three-dimensional (3D) Multi-Processor Systems-on-Chip (MPSoCs) enable high performance and massive communication bandwidth for next-generation computing. However, as process nodes shrink, temperature-dependent leakage dramatically increases, and thermal and power management become problematic. In this context, Integrated Flow Cell Array (FCA) technology, which consists of inter-tier microfluidic channels, combines on-chip electrochemical power generation and liquid cooling of 3D MPSoCs. When connected to the power delivery networks (PDNs) of dies, FCAs provide an additional current that compensates the voltage drop (IR-drop). In this paper, we evaluate for the first time how the IR-drop reduction and cooling capabilities of FCAs scale with advanced CMOS processes. We develop a framework to quantify the system-level impact of FCAs at technology nodes from 22nm to 3nm. Our results show that, across all considered nodes, FCAs reduce the peak temperature of a multi-core processor (MCP) and a Machine Learning (ML) accelerator by over 22°C and 35°C, respectively, compared to off-chip direct liquid cooling. Moreover, the low operating voltages and high temperatures at advanced nodes improve FCA power generation by up to 2×. Hence, FCAs keep the IR-drop below 5% for both the MCP and the ML accelerator, saving over 10% of TSV-reserved area compared to a High-Performance Computing (HPC) MPSoC liquid cooling solution.
Halima Najibi, Jorge Hunter, Alexandre Levisse, Marina Zapater, Miroslav Vasic, David Atienza
Proceedings of 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 6-8 July 2020, Limassol, Cyprus
Flow cell arrays (FCAs) provide efficient on-chip liquid cooling and electrochemical power generation capabilities in three-dimensional multi-processor systems-on-chip (3D MPSoCs). When connected to the power delivery networks (PDNs) of chips, the current flowing between FCA electrodes partially supplies logic gates and compensates over 20% Vdd drop in high-performance 3D systems. However, the operating voltages of CMOS technologies are generally higher than the voltage corresponding to maximal FCA power generation. Hence, directly connecting FCAs to 3D MPSoC power grids results in sub-optimal performance. In this paper, we design an on-chip direct current to direct current (DC-DC) converter to improve FCA power generation in high-performance 3D MPSoCs. We use switched capacitor (SC) technology and explore different design space parameters to achieve minimal area requirement and maximal power extraction. The proposed converter enables a stable and optimal voltage between FCA electrodes. Furthermore, it allows us to dynamically control FCA connectivity to 3D PDNs and switch off power extraction during chip inactivity. We show that regulated FCAs generate up to 123% higher power than when directly connected to 3D PDNs. In addition, connecting multiple flow cells to a single optimized converter reduces the area requirement down to 1.26%, while maintaining IR-drop below 5%. Finally, we show that activity-based dynamic FCA switching extends electrolyte lifetime by over 1.8X and 4.5X for processor duty-cycles of 50% and 20%, respectively.
Arman Iranfar, Federico Terraneo, Gabor Csordas, Marina Zapater, William Fornaciari
Proceedings of the 2020 Design, Automation & Test in Europe Conference
Dynamic Thermal Management (DTM) has become a major challenge since it directly affects Multiprocessor Systems-on-Chip (MPSoCs) performance, power consumption, and reliability. In this work, we propose a transient fan model, enabling adaptive fan speed control simulation for efficient DTM. Our model is validated on a thermal test chip, achieving less than 2°C error in the worst case. With multiple fan speeds, however, the DTM design space grows significantly, which can ultimately make conventional solutions impractical. We address this challenge with a reinforcement learning-based solution that proactively determines the number of active cores, the operating frequency, and the fan speed. The proposed solution reduces fan power by up to 40% compared to a DTM with constant fan speed, with less than 1% performance degradation. Also, compared to a state-of-the-art DTM technique, our solution improves performance by up to 19% for the same fan power.
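The reinforcement-learning control loop described in this abstract can be caricatured as a tabular Q-learning agent. The sketch below is purely illustrative and not the paper's implementation: the state encoding, the action grid over (cores, frequency, fan speed), and the reward weights are all hypothetical.

```python
import random
from collections import defaultdict

# Toy Q-learning sketch of proactive DTM. States, actions, and reward
# weights are hypothetical placeholders, not the paper's formulation.
ACTIONS = [(cores, freq, fan)
           for cores in (2, 4, 8)          # active cores
           for freq in (1.2, 1.8, 2.4)     # operating frequency (GHz)
           for fan in (0.2, 0.6, 1.0)]     # normalized fan speed

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
q_table = defaultdict(float)               # (state, action) -> value

def choose_action(state):
    """Epsilon-greedy choice of (cores, frequency, fan speed)."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

def reward(perf_loss, fan_power):
    """Penalize performance degradation and fan power (weights made up)."""
    return -perf_loss - 0.5 * fan_power

def update(state, action, r, next_state):
    """One-step Q-learning update."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    q_table[(state, action)] += ALPHA * (r + GAMMA * best_next
                                         - q_table[(state, action)])

# Toy training loop over a single hypothetical thermal state.
random.seed(1)
state = ("hot", "high_load")
for _ in range(50):
    action = choose_action(state)
    update(state, action, reward(perf_loss=0.05, fan_power=action[2]), state)
```

In a real DTM loop, the state would come from temperature sensors and performance counters, and the reward from measured fan power and application throughput.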
Marco Rios, William Simon, Alexandre Levisse, Marina Zapater, David Atienza
Proceedings of the 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC)
With the spread of cloud services and the Internet of Things, machine learning and artificial intelligence based analytics are becoming pervasive in our everyday life. However, efficient deployment of these data-intensive services requires performing computations closer to the edge. In this context, in-cache computing, based on bitline computing, is promising for executing data-intensive algorithms in an energy-efficient way by mitigating data movement in the cache hierarchy and exploiting data parallelism. Nevertheless, previous in-cache computing architectures contain serious circuit-level deficiencies (i.e., low bitcell density, data corruption risks, and limited performance) and thus report high multiplication latency, a key operation for machine learning and deep learning. Moreover, no previous work addresses the issue of way misalignment, which strongly constrains data placement in order to preserve performance gains. In this work we drastically improve the previously proposed BLADE architecture for in-cache computing to efficiently support multiplication operations by enhancing the local bitline circuitry, enabling associativity-agnostic operations as well as in-place shifting inside local bitline groups. We implemented and simulated the proposed architecture in 28nm bulk CMOS technology from TSMC, validating its functionality and extracting its performance, area, and energy per operation. Then, we designed a behavioral model of the proposed architecture to assess its performance with respect to the latest BLADE architecture. We show 17.5% area and 22% energy reductions thanks to the proposed LG optimization. Finally, for 16-bit multiplication, we demonstrate 44% cycle-count, 47% energy, and 41% performance gains versus BLADE, and show that four embedded shifts is the best trade-off between energy, area, and performance.
Arman Iranfar, Wellington Silva De Souza, Marina Zapater, Katzalin Olcoz, Samuel Xavier de Souza
Accurate workload prediction and throughput estimation are key to efficient proactive power and performance management of multi-core platforms. Although the hardware performance counters available on modern platforms contain important information about application behavior, employing them efficiently is not straightforward when dealing with time-varying applications, even those with iterative structures. In this work, we propose a machine learning-based framework for workload prediction and throughput estimation using hardware events. Our framework enables throughput estimation over the various available system configurations, namely, the number of parallel threads and the operating frequency. In particular, we first employ workload clustering and classification techniques along with Markov chains to predict the next workload for each available system configuration. Then, the predicted workload is used to estimate the next expected throughput through a machine learning-based regression model. A comparison with the state of the art demonstrates that our framework improves Quality of Service (QoS) by 3.4x, while consuming 15% less power thanks to the more accurate throughput estimation.
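The two-stage scheme this abstract describes can be sketched roughly: a maximum-likelihood Markov chain over clustered workload phases predicts the next phase, and a model maps the predicted phase plus a system configuration to an expected throughput. Everything below is invented for illustration: the cluster names, the trace, and the linear placeholder standing in for the trained regression model.

```python
from collections import defaultdict

class WorkloadPredictor:
    """Maximum-likelihood Markov chain over clustered workload phases."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, prev_cluster, next_cluster):
        # Learn transition counts from a trace of clustered phases.
        self.counts[prev_cluster][next_cluster] += 1

    def predict(self, cluster):
        # Most likely next cluster; fall back to staying in the current one.
        nxt = self.counts[cluster]
        return max(nxt, key=nxt.get) if nxt else cluster

def estimate_throughput(cluster, n_threads, freq_ghz, model):
    """Placeholder regression: per-cluster base rate scaled by configuration."""
    return model[cluster] * n_threads * freq_ghz

# Hypothetical trace of clustered phases ("mem"-bound vs. "cpu"-bound).
pred = WorkloadPredictor()
for a, b in [("mem", "cpu"), ("mem", "cpu"), ("cpu", "mem"), ("mem", "mem")]:
    pred.observe(a, b)

nxt = pred.predict("mem")   # "cpu": 2 of the 3 observed transitions from "mem"
tput = estimate_throughput(nxt, n_threads=4, freq_ghz=2.0,
                           model={"cpu": 10.0, "mem": 4.0})
```

In the paper the prediction runs per system configuration; a linear scaling with threads and frequency, as above, ignores saturation effects that a trained regression would capture.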
Arman Iranfar, Anderson Silva, Marina Zapater, Samuel Xavier de Souza
In this work we present ContainEnergy, a new performance evaluation and profiling tool that uses software containers to perform application runtime assessment, providing energy and performance profiling data. It focuses on energy efficiency for next-generation workloads and IT infrastructure.
João Vieira, Edouard Giacomin, Yasir Qureshi, Marina Zapater, Xifan Tang
The need for running complex Machine Learning (ML) algorithms, such as Convolutional Neural Networks (CNNs), in edge devices, which are highly constrained in terms of computing power and energy, makes it important to execute such applications efficiently. This situation has led to the popularization of Binary Neural Networks (BNNs), which significantly reduce execution time and memory requirements by representing the weights (and possibly the data being operated on) using only one bit. Because approximately 90% of the operations executed by CNNs and BNNs are convolutions, a significant part of the memory transfers consists of fetching the convolutional kernels. Such kernels are usually small (e.g., 3×3 operands), and, particularly in BNNs, redundancy is expected. Therefore, equal kernels can be mapped to the same memory addresses, requiring significantly less memory to store them. In this context, this paper presents a custom Binary Dot Product Engine (BDPE) for BNNs that exploits the features of Resistive Random-Access Memories (RRAMs). This new engine accelerates the execution of the inference phase of BNNs. The novel BDPE locally stores the most used binary weights and performs binary convolution using computing capabilities enabled by the RRAMs. The system-level gem5 architectural simulator was used together with a C-based ML framework to evaluate the system's performance and obtain power results. Results show that this novel BDPE improves performance by 11.3% and energy efficiency by 7.4%, and reduces the number of memory accesses by 10.7%, at a cost of less than 0.3% additional die area, when integrated with a 28 nm Fully Depleted Silicon On Insulator ARMv8 in-order core, in comparison to a fully-optimized baseline of YoloV3 XNOR-Net running on an unmodified Central Processing Unit.
Halima Najibi, Alexandre Levisse, Marina Zapater
Proceedings of the 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
Integrated Flow Cell Array (FCA) technology promises to address the power delivery and heat dissipation challenges in three-dimensional Multi-Processor Systems-on-Chip (3D MPSoCs) by providing combined inter-tier liquid cooling and power generation capabilities. In this paper, we present for the first time a design framework to accurately model the temperature-aware power delivery network in 3D MPSoCs and quantify the effects of FCAs on the voltage drop (IR-drop). This framework estimates the power generation variation along FCAs due to voltage and temperature, for both uniform and non-uniform power maps from several real processor traces. Furthermore, we explore different 3D MPSoC configurations to quantify their power delivery requirements. Our results show that FCAs improve the IR-drop with respect to state-of-the-art design methods by up to 53% and 30% for dies with a power consumption of 60W and 190W, respectively, while maintaining their peak temperatures below 52°C, at no additional Through Silicon Via (TSV) area overhead. In addition, as the presence of high power density regions (hotspots) can decrease the FCAs' IR-drop reduction by up to 21% with respect to the average value, we present a scalable TSV placement optimization methodology using the proposed framework. This methodology minimizes the IR-drop at hotspots and guarantees an optimal and uniform exploitation of the IR-drop reduction benefits of FCAs.
William Simon, Juan Galicia, Alexandre Levisse, Marina Zapater, David Atienza
Proceedings of the 56th Annual Design Automation Conference 2019
As the computational complexity of applications on the consumer market, such as high-definition video encoding and deep neural networks, becomes ever more demanding, novel ways to efficiently compute data-intensive workloads are being explored. In this context, In-Memory Computing (IMC) solutions, and particularly bitline computing in SRAM, appear promising as they mitigate one of the most energy-consuming aspects of computation: data movement. While IMC architectural-level characteristics have been defined by the research community, only a few works so far have explored the implementation of such memories at a low level. Furthermore, these proposed solutions are either slow (<1GHz), area hungry (10T SRAM), or suffer from read disturb and corruption issues. Overall, there is no extensive design study considering realistic assumptions at the circuit level. In this work we propose a fast (up to 2.2GHz), 6T SRAM-based, reliable (no read disturb issues), wide voltage range (from 0.6 to 1V) IMC architecture using local bitlines. Beyond standard read and write, the proposed architecture can perform copy, addition, and shift operations at the array level. As addition is the slowest operation, we propose a modified carry chain adder, providing a 2× carry propagation improvement. The proposed architecture is validated using a 28nm bulk high-performance technology PDK with CMOS variability and post-layout simulations. High-density SRAM bitcells (0.127μm²) enable an area efficiency of 59.7% for a 256×128 array, on par with current industrial standards.
William Andrew Simon, Yasir Mahmood Qureshi, Alexandre Levisse, Marina Zapater, David Atienza
Proceedings of the 2019 on Great Lakes Symposium on VLSI
The increasing ubiquity of edge devices in the consumer market, along with their ever more computationally expensive workloads, necessitate corresponding increases in computing power to support such workloads. In-memory computing is attractive in edge devices as it reuses preexisting memory elements, thus limiting area overhead. Additionally, in-SRAM Computing (iSC) efficiently performs computations on spatially local data found in a variety of emerging edge device workloads. We therefore propose, implement, and benchmark BLADE, a BitLine Accelerator for Devices on the Edge. BLADE is an iSC architecture that can perform massive SIMD-like complex operations on hundreds to thousands of operands simultaneously. We implement BLADE in 28nm CMOS and demonstrate its functionality down to 0.6V, lower than any conventional state-of-the-art iSC architecture. We also benchmark BLADE in conjunction with a full Linux software stack in the gem5 architectural simulator, providing a robust demonstration of its performance gain in comparison to an equivalent embedded processor equipped with a NEON SIMD co-processor. We benchmark BLADE with three emerging edge device workloads, namely cryptography, high efficiency video coding, and convolutional neural networks, and demonstrate 4x, 6x, and 3x performance improvement, respectively, in comparison to a baseline CPU/NEON processor at an equivalent power budget.
Yasir Mahmood Qureshi, William Andrew Simon, Marina Zapater, David Atienza, Katzalin Olcoz
Proceedings of the 2019 Spring Simulation Conference
The rapid expansion of online-based services requires novel energy- and performance-efficient architectures to meet power and latency constraints. Fast architectural exploration has become a key enabler of architectural innovation. In this paper, we present gem5-X, a gem5-based system-level simulation framework, and a methodology to optimize many-core systems for performance and power. As real-life case studies of many-core server workloads, we use real-time video transcoding and image classification using convolutional neural networks (CNNs). gem5-X allows us to identify bottlenecks and evaluate the potential benefits of architectural extensions such as in-cache computing and 3D stacked High Bandwidth Memory (HBM). For real-time video transcoding, we achieve a 15% speed-up using in-order cores with in-cache computing when compared to a baseline in-order system, and 76% energy savings when compared to an out-of-order system. When using HBM, we further accelerate real-time transcoding and CNNs by up to 7% and 8%, respectively.
Kevin Henares, José L. Risco-Martín, Marina Zapater
Proceedings of the Theory of Modeling and Simulation Symposium 2019
Modeling and Simulation (M&S) is one of the most multifaceted topics present today in both industry and academia. However, a new M&S paradigm is emerging: systems are becoming more complex, and new simulation needs arise that have to be studied. As a consequence, the way in which we perform M&S must be adapted, providing new ideas and tools. In this paper, we propose a rule-based constraints evaluator, which facilitates the validation and verification of complex models in a transparent manner. The constraints definition process is completely independent of the model development process because (a) the set of constraints is defined once the model has been developed, and (b) constraints are validated at simulation time. The proposed Constraint M&S architecture has been built using the Discrete Event System Specification (DEVS) formalism and has been tested on a validated data center simulation model.
Arman Iranfar, Ali Pahlevan, Marina Zapater, David Atienza
Proceedings of the 2019 Design, Automation & Test in Europe Conference & Exhibition
The power density and, consequently, the power demands of server processors are growing by the day. Traditional air cooling systems fail to cope with such high heat densities, whereas single-phase liquid cooling still requires a high mass flow rate, high pumping power, and a large facility size. In contrast, in a micro-scale gravity-driven thermosyphon attached on top of a processor, the refrigerant, absorbing the heat, turns into a two-phase mixture. The vapor-liquid mixture exchanges heat with a coolant at the condenser side, returns to the liquid state, and descends thanks to gravity, eliminating the need for pumping power. However, similar to other cooling technologies, thermosyphon efficiency can vary considerably with respect to workload performance requirements and thermal profile, in addition to platform features such as packaging and die floorplan. In this work, we first address the workload- and platform-aware design of a two-phase thermosyphon. Then, we propose a thermal-aware workload mapping strategy considering the potential and limitations of a two-phase thermosyphon to further minimize hot spots and spatial thermal gradients. Our experiments, performed on an 8-core Intel Xeon E5 CPU, reveal on average up to a 10°C reduction in thermal hot spots and a 45% reduction in the maximum spatial thermal gradient on the die. Moreover, our design and mapping strategy are able to decrease the chiller cooling power by at least 45%.
Luis Costero, Arman Iranfar, Marina Zapater, Francisco D. Igual, Katzalin Olcoz
Real-time video transcoding has recently emerged as a valid alternative to address the ever-increasing demand for video content in server infrastructures in current multi-user environments. High Efficiency Video Coding (HEVC) makes efficient online transcoding feasible as it enhances user experience by providing the adequate video configuration, reduces pressure on the network, and minimizes inefficient and costly video storage. However, the computational complexity of HEVC, together with its myriad of configuration parameters, raises challenges for power management, throughput control, and Quality of Service (QoS) satisfaction. This is particularly challenging in multi-user environments where multiple users with different resolution demands and bandwidth constraints need to be served simultaneously. In this work, we present MAMUT, a multi-agent machine learning approach to tackle these challenges. Our proposal breaks the design space composed of run-time adaptation of the transcoder and system parameters into smaller sub-spaces that can be explored in a reasonable time by individual agents. While working cooperatively, each agent is in charge of learning and applying the optimal values for internal HEVC and system-wide parameters. In particular, MAMUT dynamically tunes the Quantization Parameter, selects the number of threads per video, and sets the operating frequency with throughput and video quality objectives under compression and power consumption constraints. We implement MAMUT on an enterprise multicore server and compare equivalent scenarios to state-of-the-art alternative approaches. The obtained results reveal that MAMUT consistently attains up to 8× improvement in terms of FPS violations (and thus Quality of Service) and 24% power reduction, as well as faster and more accurate adaptation to both the video contents and the available resources.
Juan Carlos Salinas-Hilburg, Marina Zapater, José M. Moya, José L. Ayala
Proceedings of the 29th International Conference on Application-specific Systems, Architectures and Processors
In order to optimize the energy use of servers in Data Centers, techniques such as power capping or power budgeting are usually deployed. These techniques rely on the prediction of the power and execution time of applications. These data are obtained via dynamic profiling, which requires a full execution of the application. This is not feasible in High Performance Computing (HPC) applications with long execution times. In this paper, we present a methodology to estimate the dynamic CPU and memory energy consumption of an application without executing it completely. Our methodology merges static code analysis information and dynamic profiling via the partial execution of the application. We do so by leveraging the concept of an application signature, defined as a reduced version of the application in terms of execution time and power profile. We validate our methodology with a set of CPU-intensive and memory-intensive benchmarks and multi-threaded applications on a presently shipping enterprise server. Our energy estimation methodology shows an overall error below 8.0% when compared to the dynamic energy of the whole execution of the application. It also allows us to estimate the energy of multi-threaded applications with an RMSE of 12.7% when compared to the dynamic energy of the complete parallel execution.
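The application-signature idea in this abstract can be caricatured in a few lines: a short partial run yields an average power and a per-iteration time, which are extrapolated to a full run whose iteration count would come from static code analysis. All numbers and function names below are hypothetical; the paper's actual methodology is considerably richer.

```python
def signature_profile(power_samples_w, iters_profiled, elapsed_s):
    """Reduce a partial execution to an average power and per-iteration time."""
    avg_power_w = sum(power_samples_w) / len(power_samples_w)
    time_per_iter_s = elapsed_s / iters_profiled
    return avg_power_w, time_per_iter_s

def estimate_energy(avg_power_w, time_per_iter_s, total_iters):
    """Extrapolate dynamic energy (J) without running the whole application."""
    return avg_power_w * time_per_iter_s * total_iters

# Hypothetical partial run: 100 of 10,000 loop iterations profiled in 5 s.
avg_p, t_iter = signature_profile([38.0, 42.0, 40.0],
                                  iters_profiled=100, elapsed_s=5.0)
energy_j = estimate_energy(avg_p, t_iter, total_iters=10_000)  # 40 W over 500 s
```

The key design choice the paper motivates is that the signature must be representative of the full power profile; a naive average over a single phase, as sketched here, would miss phase changes in real applications.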
Artem Andreev, Fulya Kaplan, Marina Zapater, Ayse K. Coskun, David Atienza
Proceedings of the International Symposium on Low Power Electronics and Design 2018
Integrated flow cell array (FCA) is an emerging technology, targeting the cooling and power delivery challenges of modern 2D/3D Multi-Processor Systems-on-Chip (MPSoCs). In FCA, electrolytic solutions are pumped through microchannels etched in the silicon of the chips, removing heat from the system, while, at the same time, generating power on-chip. In this work, we explore the impact of FCA system design on various 3D architectures and propose a methodology to optimize a 3D MPSoC with integrated FCA to run a given workload in the most energy-efficient way. Our results show that an optimized configuration can save up to 50% energy with respect to sub-optimal 3D MPSoC configurations.
Arman Iranfar, William Andrew Simon, Marina Zapater, David Atienza
Proceedings of the 2018 IEEE International Symposium on Circuits and Systems
The design of new streaming systems is becoming a major area of research to deploy services targeted in the Internet-of-Things (IoT) era. In this context, the new High Efficiency Video Coding (HEVC) standard provides high efficiency and scalability of quality at the cost of increased computational complexity for edge nodes, which is a new challenge for the design of IoT systems. Using hardware acceleration in conjunction with general-purpose cores in Multiprocessor Systems-on-Chip (MPSoCs) is a promising way to create heterogeneous computing systems that manage the complexity of real-time streaming for high-end IoT systems, achieving higher throughput and power efficiency than conventional processors alone. Furthermore, Machine Learning (ML) provides a promising solution to efficiently use this next generation of heterogeneous MPSoC designs that the EDA industry is developing, by dynamically optimizing system performance under diverse requirements such as frame resolution, search area, operating frequency, and stream allocation. In this work, we propose an ML-based approach for stream allocation and Dynamic Voltage and Frequency Scaling (DVFS) management on a heterogeneous MPSoC composed of ARM cores and FPGA fabric containing hardware accelerators for the motion estimation of HEVC encoding. Our experiments on a Zynq7000 SoC show 20% higher throughput when compared to state-of-the-art streaming systems for next-generation IoT devices.
André Seuret, Arman Iranfar, Marina Zapater, John Thome, David Atienza
Proceedings of 17th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems
Next-generation High-Performance Computing (HPC) systems need to provide outstanding performance with unprecedented energy efficiency while maintaining servers at safe thermal conditions. Air cooling presents important limitations when employed in HPC infrastructures. Instead, two-phase on-chip cooling combines the small footprint area and large heat exchange surface of micro-channels with extremely high heat transfer performance, and allows for waste heat recovery. When relying on gravity to drive the flow to the heat sink, the system is called a closed-loop two-phase thermosyphon. Previous research work either focused on the development of large-scale proof-of-concept thermosyphon demonstrators, or on the development of numerical models able to simulate their operation. In this work, we present a new ultra-compact microscale thermosyphon design for high heat flux components. We manufactured a working 8 cm-tall prototype tailored for Virtex 7 FPGAs with a heat spreader area of 45 mm × 45 mm, and we validate its performance via measurements. The results accurately match the thermal performance predicted by our simulator, with an error of less than 3.5%. Our prototype is able to work over the full power range of the Virtex 7, dissipating up to 60 W of power while keeping chip temperature below 60°C. The prototype will next be deployed in a 10 kW rack as part of an HPC prototype, with an expected Power Usage Effectiveness (PUE) below 1.05.
Arman Iranfar, Ali Pahlevan, Marina Zapater, Martin Zagar, Mario Kovac
Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition
Bio-medical image processing in the field of telemedicine, and in particular the definition of systems that allow medical diagnostics in a collaborative and distributed way, is experiencing undeniable growth. Due to the high quality of bio-medical videos and the consequently large volumes of data generated, enabling medical diagnosis on-the-go makes it imperative to efficiently transcode and stream the stored videos in real time, without quality loss. However, online video transcoding is a highly demanding, computationally intensive task, and its efficient management on Multiprocessor Systems-on-Chip (MPSoCs) poses an important challenge. In this work, we propose an efficient motion- and texture-aware frame-level parallelization approach to enable online medical imaging transcoding on MPSoCs for next-generation video encoders. By exploiting the unique characteristics of bio-medical videos and of the medical procedures that enable diagnosis, we split frames into tiles based on their motion and texture, deciding the most adequate level of parallelization. Then, we employ the available encoding parameters to satisfy the required video quality and compression. Moreover, we propose a new fast motion search algorithm for bio-medical videos that drastically reduces the computational complexity of the encoder, thus achieving the frame rates required for online transcoding. Finally, we heuristically allocate the threads to the most appropriate available resources and set the operating frequency of each one. We evaluate our work on an enterprise multicore server, achieving online medical imaging transcoding with 1.6x higher throughput and 44% less power consumption when compared to state-of-the-art techniques.
Ali Pahlevan, Yasir Mahmood Qureshi, Marina Zapater, Andrea Bartolini, Davide Rossi
Cloud Computing aims to efficiently tackle the increasing demand for computing resources, and its popularity has led to a dramatic increase in the number of computing servers and data centers worldwide. However, as an effect of post-Dennard scaling, computing servers have become power-limited, and new system-level approaches must be used to improve their energy efficiency. This paper first presents an accurate power modelling characterization for a new server architecture based on the FD-SOI process technology for near-threshold computing (NTC). Then, we explore the existing energy vs. performance trade-offs when virtualized applications with different CPU utilization and memory footprint characteristics are executed. Finally, based on this analysis, we propose a novel dynamic virtual machine (VM) allocation method that exploits the knowledge of VM characteristics together with our accurate server power model for next-generation NTC-based data centers, while guaranteeing quality of service (QoS) requirements. Our results demonstrate the inefficiency of current workload consolidation techniques for new NTC-based data center designs, and show how our proposed method provides up to 45% energy savings when compared to state-of-the-art consolidation-based approaches.
Proceedings of the 2017 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)
High Efficiency Video Coding (HEVC) provides high efficiency at the cost of increased computational complexity, and consequently increased power consumption and temperature of current Multi-Processor Systems-on-Chip (MPSoCs). In this paper, we propose a machine learning-based power and thermal management approach that dynamically learns the best encoder configuration and core frequency for each of the several video streams running on an MPSoC, using information from frame compression, quality, performance, total power, and temperature. We implement our approach on an enterprise multicore server and compare it against state-of-the-art techniques. Our approach improves video quality and performance by 17% and 11%, respectively, while reducing average temperature by 12%, without degrading compression or increasing power.
José Flich, Giovanni Agosta, Philipp Ampletzer, David Atienza, Carlo Brandolese, Marina Zapater
Proceedings of the 2017 Euromicro Conference on Digital System Design (DSD)
Arman Iranfar, Federico Terraneo, William Andrew Simon, Leon Dragic, Marina Zapater, Igor Piljic
Proceedings of the 2017 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation
Next-generation High-Performance Computing (HPC) applications need to tackle outstanding computational complexity while meeting latency and Quality-of-Service constraints. Heterogeneous Multi-Processor Systems-on-Chip (MPSoCs), equipped with a mix of general-purpose cores and reconfigurable fabric for custom acceleration of computational blocks, are key in providing the flexibility to meet the requirements of next-generation HPC. However, heterogeneity brings new challenges to efficient chip thermal management. In this context, accurate and fast thermal simulators are becoming crucial to understand and exploit the trade-offs brought by heterogeneous MPSoCs. In this paper, we first thermally characterize a next-generation HPC workload, the online video transcoding application, using a highly-accurate Infra-Red (IR) microscope. Second, we extend the 3D-ICE thermal simulation tool with a new generic heat spreader model capable of accurately reproducing package surface temperature, with an average error of 6.8% for the hot spots of the chip. Our model is used to characterize the thermal behaviour of the online transcoding application when running on a heterogeneous MPSoC. Moreover, by using our detailed thermal system characterization we are able to explore different application mappings as well as the thermal limits of such heterogeneous platforms.
Ignacio Penas, Marina Zapater, José L. Risco-Martín, José L. Ayala
Proceedings of the 2017 Summer Simulation Multi-Conference
Data centers are huge power consumers and have very high operational costs. Both industry and academia have proposed strategies at multiple levels (server, room layout, cooling, workload allocation, etc.) to increase the efficiency of these facilities. Testing the impact of variables so different in nature, such as layout or workload allocation, can only be managed by simulators. Current simulation infrastructures are either focused on data room thermal dynamics, or target only a specific stage of data center operations, such as workload allocation. Moreover, they are geared for specific use-cases such as HPC or cloud computing. In this paper we present a data center modeling and simulation framework, for both HPC and cloud applications, to assess data center performance, thermal behavior, energy efficiency and operational cost. Our goal is to show the possibilities of the current data center modeling and simulation framework. Furthermore, as we provide a fully configurable, flexible and scalable infrastructure, any kind of policy, data center size or workload amount could easily be implemented over the simulator. We also provide the data sets used to validate our models and policies, obtained from real servers and data centers, so as to enable researchers to test their strategies in a realistic setup.
Juan C. Salinas-Hilburg, Marina Zapater, José L. Risco-Martín, José M. Moya, José L. Ayala
Proceedings of the 2016 Conference on Design, Automation & Test in Europe
Data centers are huge power consumers and their energy consumption keeps on rising despite the efforts to increase energy efficiency. A great body of research is devoted to the reduction of the computational power of these facilities, applying techniques such as power budgeting and power capping in servers. Such techniques rely on models to predict the power consumption of servers. However, estimating overall server power for arbitrary applications when running co-allocated in multithreaded servers is not a trivial task. In this paper, we use Grammatical Evolution techniques to predict the dynamic power of the CPU and memory subsystems of an enterprise server using the hardware counters of each application. On top of our dynamic power models, we use fan and temperature-dependent leakage power models to obtain the overall server power. To train and test our models we use real traces from a presently shipping enterprise server under a wide set of sequential and parallel workloads running at various frequencies. We prove that our model is able to predict the power consumption of two different tasks co-allocated in the same server, keeping error below 8W. For the first time in literature, we develop a methodology able to combine the hardware counters of two individual applications, and estimate overall server power consumption without running the co-allocated application. Our results show a prediction error below 12W, which represents 7.3% of the overall server power, outperforming previous approaches in the state of the art.
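The decomposition described above, dynamic power from combined hardware counters plus fan/temperature-dependent leakage, can be sketched as follows. The counter names, coefficients and additive combination rule are illustrative assumptions; the paper's actual GE-evolved expressions are not reproduced here:

```python
def dynamic_power(counters, coeffs):
    """Toy stand-in for a GE-evolved expression: a weighted sum of
    hardware-counter rates (e.g. instructions/s, LLC misses/s)."""
    return sum(coeffs.get(name, 0.0) * rate for name, rate in counters.items())

def leakage_power(temp_c, fan_rpm, a=20.0, b=0.15, c=-0.001):
    # Illustrative fan- and temperature-dependent static power model (watts).
    return a + b * temp_c + c * fan_rpm

def coallocated_power(counters_a, counters_b, coeffs, temp_c, fan_rpm):
    """Estimate overall server power for two co-allocated applications
    from their individually profiled counters (combined additively here,
    an assumption for illustration)."""
    combined = {k: counters_a.get(k, 0.0) + counters_b.get(k, 0.0)
                for k in set(counters_a) | set(counters_b)}
    return dynamic_power(combined, coeffs) + leakage_power(temp_c, fan_rpm)
```

The key point the sketch captures is that each application is profiled alone, and the co-allocated estimate never requires running the combined workload.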
Ali Pahlevan, Javier Picorel, Arash Pourhabibi Zarandi, Davide Rossi, Marina Zapater
Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition
The popularity of cloud computing has led to a dramatic increase in the number of data centers in the world. The ever-increasing computational demands along with the slowdown in technology scaling has ushered an era of power-limited servers. Techniques such as near-threshold computing (NTC) can be used to improve energy efficiency in the post-Dennard scaling era. This paper describes an architecture based on the FD-SOI process technology for near-threshold operation in servers. Our work explores the trade-offs in energy and performance when running a wide range of applications found in private and public clouds, ranging from traditional scale-out applications, such as web search or media streaming, to virtualized banking applications. Our study demonstrates the benefits of near-threshold operation and proposes several directions to synergistically increase the energy proportionality of a near-threshold server.
Marina Zapater, Ata Turk, José M. Moya, José L. Ayala, Ayse K. Coskun
Proceedings of the sixth International Green and Sustainable Computing Conference
Energy efficiency research in data centers has traditionally focused on raised-floor air-cooled facilities. As rack power density increases, traditional cooling is being replaced by close-coupled systems that provide enhanced airflow and cooling capacity. This work presents a model for close-coupled data centers with free cooling, and explores the power consumption trade-offs in these facilities as outdoor temperature changes throughout the year. Using this model, we propose a technique that jointly allocates workload and controls cooling in a power-efficient way. Our technique is tested with configuration parameters, power traces, and weather data collected from real-life data centers, and application profiles obtained from enterprise servers. Results show that our joint workload allocation and cooling policy provides 5% reduction in overall data center energy consumption, and up to 24% peak power reduction, leading to a 6% decrease in the electricity costs without affecting performance.
M. D. Santambrogio, José L. Ayala, Simone Campanoni, Marina Zapater
Proceedings of the 2015 International Conference on Hardware/Software Codesign and System Synthesis
Resources such as quantities of transistors and memory, the level of integration and the speed of components have increased dramatically over the years. Even though the technologies have improved, we continue to apply outdated approaches to our use of these resources. Key computer science abstractions have not changed since the 1960s. Therefore, it is time for a fresh approach to the way systems are designed and used.
Proceedings of the 9th International Conference on Complex, Intelligent, and Software Intensive Systems
The increasing demand for computational resources has led to a significant growth of data center facilities. A major concern has appeared regarding energy efficiency and consumption in servers and data centers. The use of flexible and scalable server power models is a must in order to enable proactive energy optimization strategies. This paper proposes the use of Evolutionary Computation to obtain a model for server dynamic power consumption. To accomplish this, we collect a significant number of server performance counters for a wide range of sequential and parallel applications, and obtain a model via Genetic Programming techniques. Our methodology enables the unsupervised generation of models for arbitrary server architectures, in a way that is robust to the type of application being executed in the server. With our generated models, we are able to predict the overall server power consumption for arbitrary workloads, outperforming previous approaches in the state-of-the-art.
Ignacio Aransay, Marina Zapater, Patricia Arroba, José M. Moya
Proceedings of the 8th International Conference on Cloud Computing
The increasing success of Cloud Computing applications and online services has contributed to the unsustainability of data center facilities in terms of energy consumption. Higher resource demand has increased the electricity required by computation and cooling resources, leading to power shortages and outages, especially in urban infrastructures. Current energy reduction strategies for Cloud facilities usually disregard the data center topology, the contribution of cooling consumption and the scalability of optimization strategies. Our work tackles the energy challenge by proposing a temperature-aware VM allocation policy based on a Trust-and-Reputation System (TRS). A TRS meets the requirements for inherently distributed environments such as data centers, and allows the implementation of autonomous and scalable VM allocation techniques. For this purpose, we model the relationships between the different computational entities, synthesizing this information in one single metric. This metric, called reputation, is used to optimize the allocation of VMs in order to reduce energy consumption. We validate our approach with a state-of-the-art Cloud simulator using real Cloud traces. Our results show considerable reduction in energy consumption, reaching up to 46.16% savings in computing power and 17.38% savings in cooling, without QoS degradation while keeping servers below thermal redlining. Moreover, our results show the limitations of the PUE ratio as a metric for energy efficiency. To the best of our knowledge, this paper is the first approach in combining Trust-and-Reputation systems with Cloud Computing VM allocation.
Patricia Arroba, José L. Risco-Martín, Marina Zapater, José M. Moya, José L. Ayala, Katzalin Olcoz
Proceedings of the 6th International Conference on Sustainability in Energy and Buildings / SEB-14 ; Energy Procedia
As advanced Cloud services are becoming mainstream, the contribution of data centers in the overall power consumption of modern cities is growing dramatically. The average consumption of a single data center is equivalent to the energy consumption of 25,000 households. Modeling the power consumption for these infrastructures is crucial to anticipate the effects of aggressive optimization policies, but accurate and fast power modeling is a complex challenge for high-end servers not yet satisfied by analytical approaches. This work proposes an automatic method, based on Multi-Objective Particle Swarm Optimization, for the identification of power models of enterprise servers in Cloud data centers. Our approach, as opposed to previous procedures, does not only consider the workload consolidation for deriving the power model, but also incorporates other non-traditional factors like the static power consumption and its dependence on temperature. Our experimental results show that we reach slightly better models than classical approaches, while simultaneously simplifying the power model structure and thus the number of sensors needed, which is very promising for short-term energy prediction. This work, validated with real Cloud applications, broadens the possibilities to derive efficient energy saving techniques for Cloud facilities.
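The model-identification idea above can be illustrated with a much simplified, single-objective particle swarm (the paper uses a multi-objective variant and a richer model). The model form, coefficients and PSO constants below are assumptions for exposition:

```python
import random

def pso_fit(samples, iters=200, n_particles=20, seed=0):
    """Fit P = a + b*util + c*temp to measured (util, temp, power) samples
    with a basic single-objective PSO. Illustrative only: the actual work
    identifies richer multi-objective server power models."""
    rnd = random.Random(seed)

    def err(p):
        a, b, c = p
        return sum((a + b * u + c * t - pw) ** 2 for u, t, pw in samples)

    pos = [[rnd.uniform(-1, 1) for _ in range(3)] for _ in range(n_particles)]
    vel = [[0.0] * 3 for _ in range(n_particles)]
    best = [list(p) for p in pos]          # personal bests
    gbest = list(min(best, key=err))       # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(3):
                r1, r2 = rnd.random(), rnd.random()
                # Inertia + cognitive + social terms (constants illustrative).
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (best[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if err(pos[i]) < err(best[i]):
                best[i] = list(pos[i])
        gbest = list(min(best + [gbest], key=err))
    return gbest
```

The fitted coefficients play the role of the identified power model; in the paper, temperature-dependent static power is one of the non-traditional terms the identification has to capture.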
Josué Pagán, Marina Zapater, Oscar Cubo, Patricia Arroba, Vicente Martín Ayuso
Proceedings of the XVIII Conference on the Design of Circuits and Integrated Systems
High-Performance Computing, Cloud computing and next-generation applications such as e-Health or Smart Cities have dramatically increased the computational demand of Data Centers. The huge energy consumption, increasing levels of CO2 and the economic costs of these facilities represent a challenge for industry and researchers alike. Recent research trends propose the usage of holistic optimization techniques to jointly minimize Data Center computational and cooling costs from a multilevel perspective. This paper presents an analysis on the parameters needed to integrate the Data Center in a holistic optimization framework and leverages the usage of Cyber-Physical systems to gather workload, server and environmental data via software techniques and by deploying a non-intrusive Wireless Sensor Network (WSN). This solution tackles data sampling, retrieval and storage from a reconfigurable perspective, reducing the amount of data generated for optimization by 68% without information loss, doubling the lifetime of the WSN nodes and allowing runtime energy minimization techniques in a real scenario.
Marina Zapater, José L. Ayala, José M. Moya, Kalyan Vaidyanathan, Kenny Gross
Proceedings of the 2013 Design, Automation & Test in Europe Conference & Exhibition
Reducing the energy consumption for computation and cooling in servers is a major challenge considering the data center energy costs today. To ensure energy-efficient operation of servers in data centers, the relationship among computational power, temperature, leakage, and cooling power needs to be analyzed. By means of an innovative setup that enables monitoring and controlling the computing and cooling power consumption separately on a commercial enterprise server, this paper studies temperature-leakage-energy tradeoffs, obtaining an empirical model for the leakage component. Using this model, we design a controller that continuously seeks and settles at the optimal fan speed to minimize the energy consumption for a given workload. We run a customized dynamic load-synthesis tool to stress the system. Our proposed cooling controller achieves up to 9% energy savings and 30W reduction in peak power in comparison to the default cooling control scheme.
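The controller's search for the energy-optimal fan speed can be pictured as a simple hill climb over allowed speeds: raising the fan speed costs fan power but cools the chip and cuts leakage, so total power is minimised somewhere in between. The power model below is illustrative, not the paper's empirical leakage model:

```python
def optimal_fan_speed(total_power, speeds):
    """Seek the fan speed minimising total server power by hill climbing
    over a sorted list of allowed speeds (assumes a unimodal trade-off)."""
    i = len(speeds) // 2                   # start mid-range
    while True:
        here = total_power(speeds[i])
        # Probe the neighbouring speeds.
        candidates = [(j, total_power(speeds[j]))
                      for j in (i - 1, i + 1) if 0 <= j < len(speeds)]
        j, best = min(candidates, key=lambda x: x[1])
        if best >= here:                   # no neighbour improves: settled
            return speeds[i]
        i = j

def example_power(rpm):
    # Illustrative convex trade-off: fan power grows ~rpm^3 while
    # leakage shrinks as the chip runs cooler at higher rpm.
    fan = 1e-9 * rpm ** 3
    leakage = 60.0 * (2000.0 / rpm)
    return 100.0 + fan + leakage
```

In the paper the equivalent of `total_power` comes from separately monitored computing and cooling power on the instrumented server, with the leakage component supplied by the empirical temperature-dependent model.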
Marina Zapater, José L. Ayala, José M. Moya
Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
Energy consumption in data centers is nowadays a critical objective because of its dramatic environmental and economic impact. Over the last years, several approaches have been proposed to tackle the energy/cost optimization problem, but most of them have failed to provide an analytical model targeting both the static and dynamic optimization domains for complex heterogeneous data centers. This paper proposes and solves an optimization problem for the energy-driven configuration of a heterogeneous data center. It also proposes a new mechanism for task allocation and distribution of workload. The combination of both approaches outperforms previously published results in the field of energy minimization in heterogeneous data centers and opens a promising area of research.
Marina Zapater, Pedro Malagón, Zorana Bankovic, José M. Moya, Juan-Mariano de Goyeneche
Proceedings of the 25th Conference on Design of Circuits and Integrated Systems
This paper presents the system-level simulation platform we have implemented to design and evaluate the SORU reconfigurable vector coprocessor, aimed at enhancing the security of embedded systems. The simulator interfaces a low-level virtual machine (LLVM) with a SystemC TLM 2.0 model of the rest of the system, and a low-level SystemC model of the coprocessor. The results show that we can simulate more than 80K coprocessor operations per second, with power estimation accurate enough to perform simulated power-analysis attacks. The resulting simulation platform is also flexible enough to allow very fast and easy changes to any part of the system.
Pedro Malagón, Juan-Mariano de Goyeneche, Marina Zapater, José M. Moya
Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems IWIA 2010
In this paper, we suggest handling security in embedded systems by introducing a small architectural change. We propose the use of a non-deterministic branch instruction to generate non-determinism in the execution of encryption algorithms. Non-determinism makes side-channel attacks much more difficult. The experimental results show at least three orders of magnitude improvement in resistance to statistical side-channel attacks for a custom AES implementation, while enhancing its performance at the same time. Compared with previous countermeasures, this architectural-level hiding countermeasure is trivial to integrate into current embedded processor designs, offers similar resistance to side-channel attacks, and maintains similar power consumption to the unprotected processor.