# Kalousis Alexandros

### Full Professor (HES)

Office: F 124

Campus Battelle, Rue de la Tambourine 17, 1227 Carouge, CH

#### Main competencies

- Machine Learning
- Deep Learning

Ongoing

**Role:** Co-applicant(s)

**Applicant(s):** Armand Stéphane, Laboratoire de Cinésiologie Willy Taillard, Hôpitaux Universitaires de Genève

**Funding:**
SNSF

**Project description:**

The aim of SimGait is to create a musculoskeletal model of the human body with neural control, able to model healthy and impaired gait, for example due to cerebral palsy. The model consists of a dynamics model of the motion of the legs and trunk, actuated by muscle forces. Machine learning methods will be used to predict a patient's gait from their clinical data using a data-driven model, as well as to learn controllers that imitate the gait of individual patients. The overarching goal is to create models that allow medical doctors to explore the effect that different treatments would have on the gait of any given patient.

**Research team within HES-SO:**
Kalousis Alexandros

**Academic partners:** Ijspeert Auke Jan, Laboratoire de biorobotique EPFL - STI - IBI - BIOROB, Lausanne; Armand Stéphane, Laboratoire de Cinésiologie Willy Taillard, Hôpitaux Universitaires de Genève

**Project duration:**
01.09.2018 - 31.08.2022

**Total project budget:** 2'115'000 CHF

**Status:** Ongoing

Completed

**Role:** Main applicant

**Funding:**
European Commission

**Project description:**
The purpose of the RAWFIE initiative is to create a federation of different network testbeds that will work together to make their resources available under a common framework. Specifically, it aims at delivering a unique, mixed experimentation environment across the space and technology dimensions. RAWFIE will integrate numerous testbeds for experimenting in vehicular (road), aerial and maritime environments. A Vehicular Testbed (VT) will deal with Unmanned Ground Vehicles (UGVs), while an Aerial Testbed (AT) and a Maritime Testbed (MT) will deal with Unmanned Aerial Vehicles (UAVs) and Unmanned Surface Vehicles (USVs) respectively. The RAWFIE consortium includes all the possible actors of this highly challenging experimentation domain, from technology creators to integrators and facility owners. The basic idea behind the RAWFIE effort is the automated, remote operation of a large number of robotic devices (UGVs, UAVs, USVs) for the purpose of assessing the performance of different technologies in the networking, sensing and mobile/autonomic application domains. RAWFIE will feature a significant number of UxV nodes, exposing a vast test infrastructure to the experimenter. All these items will be managed by a central controlling entity, programmed per case, that will fully overview and drive the operation of the respective mechanisms (e.g., auto-pilots, remote-controlled ground vehicles). Internet connectivity will be extended to the mobile units to enable remote (over-the-air) programming, control and data collection. Support software for experiment management, data collection and post-analysis will be virtualized to enable experimentation from anywhere in the world. The vision of Experimentation-as-a-Service (EaaS) will be promoted through RAWFIE. The IoT paradigm will be fully adopted and further refined to support highly dynamic node architectures.

**Research team within HES-SO:**
Kalousis Alexandros, Blonde Lionel, Ramapuram Jason Emmanuel

**Academic partners:** Informatique de gestion (539)

**Project duration:**
01.01.2015 - 31.03.2019

**Total project budget:** 623'271 CHF

**Status:** Completed

**Role:** Main applicant

**Funding:**
CTI; Firmenich

**Project description:**
We aim to develop rational solutions for improving product performance and differentiation through data-driven and computational approaches. Statistical learning algorithms embedding side information are studied, with the objective of designing reliable models for assessing product properties and qualities.

**Research team within HES-SO:**
Kalousis Alexandros, Strasser Pablo, Aminanmu Maolaaisha, Lavda Frantzeska

**Academic partners:** Informatique de gestion (539)

**Project duration:**
01.01.2017 - 28.02.2019

**Status:** Completed

**Role:** Main applicant

**Funding:**
HES-SO Rectorat

**Project description:**
We will develop new methods and tools for forecasting multivariate time series that exploit hidden structures in the data to improve the accuracy of the forecasts. These shall be able to cope with time series from various application areas, such as finance and economics, transport, or electricity supply and demand, where large numbers of indicators evolve in parallel with non-trivial and possibly unstable relationships.

**Research team within HES-SO:**
Kalousis Alexandros, Gregorova Magda

**Academic partners:** Informatique de gestion (539)

**Project duration:**
01.11.2016 - 01.05.2018

**Total project budget:** 158'030 CHF

**Status:** Completed

**Role:** Main applicant

**Funding:**
HES-SO Rectorat

**Project description:**
We will use the specific application area of air traffic forecasting to direct our research in learning multivariate time series models regularized by data-driven yet meaningful constraints. The objective is to develop algorithms that reduce dimensionality and improve the predictive performance of the models by exploiting additional knowledge from domain theory, the specific spatial structure of the data (flow network), or knowledge learned in a multi-task setting.

**Research team within HES-SO:**
Kalousis Alexandros, Gregorova Magda

**Academic partners:** Informatique de gestion (539)

**Project duration:**
01.11.2013 - 30.04.2015

**Total project budget:** 157'000 CHF

**Status:** Completed

**Role:** Main applicant

**Funding:**
HES-SO Rectorat; Informatique de gestion (539); FNS (Swiss National Science Foundation)

**Project description:**
Kernel and metric learning have become very active research fields in machine learning over the last years. Although they have developed as distinct research fields, they share common elements. One of the most popular approaches to kernel learning is learning a linear combination of a set of kernels, usually identified as Multiple Kernel Learning (MKL); this essentially corresponds to learning a block-diagonal transformation of the feature space induced by the concatenation of the feature spaces that correspond to the basis kernels. On the metric learning side, probably the most prominent approach is learning a Mahalanobis distance in some feature space, which in fact corresponds to learning a linear transformation of that feature space. So both methods learn linear feature transformations, where in the case of MKL the learned transformation has a specific structure. Many metric learning methods are kernelized, which raises the issue of which kernel one should use for a given problem; nevertheless, there is no work so far that tries to combine metric learning with MKL. On the other hand, since MKL learns a special form of linear transformation over the concatenation feature space, one could use metric learning techniques to learn such linear transformations, or more general forms of them. In fact, one can use metric learning techniques to learn linear transformations over the feature space induced by any kernel. At the same time, and despite the increasing popularity of metric learning methods, no method exists so far that scales well with increasing problem sizes, i.e. large feature space dimensionality and large numbers of instances, while retaining good generalization performance.

In the present proposal we want to take a step towards addressing the issues briefly described above. The work is divided into two work packages. In the first work package we link metric learning and kernel learning methods, using tools from one domain in the other and vice versa. Its work is divided into two main tasks. In the first task we will combine metric learning with MKL, i.e. learn metrics over kernels learned by MKL; we will explore different metric parametrizations, which will lead to different learning problems. In the second task we will go in the opposite direction and use metric learning ideas for kernel learning. More precisely, we will learn linear transformations of the feature space induced by some kernel, whether this is a learned kernel or a standard single kernel. By learning a linear transformation of the feature space, we are in fact learning, in the general case, a new non-linear, quadratic, kernel. We will experiment with different objective functions in order to learn the linear transformations. The second work package, on metric learning methods that scale well to large datasets, also has two main tasks. In the first we will explore the use of stochastic gradient descent methods to improve the scalability of metric learning. In the second task we will go to the extreme case of metric learning and learn a linear transformation of rank one, in order to make metric learning possible for very large datasets. At first sight, learning a rank-one metric might seem too restrictive, limiting its application to simple learning problems; nevertheless, by kernelizing it, we can apply it to learning problems of any complexity.
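
The equivalence between Mahalanobis metric learning and learning a linear transformation, which both work packages build on, can be stated compactly (standard notation, added here for clarity, not taken from the proposal text):

```latex
d_{\mathbf{M}}(\mathbf{x},\mathbf{y})
  = \sqrt{(\mathbf{x}-\mathbf{y})^{\top}\mathbf{M}(\mathbf{x}-\mathbf{y})},
\qquad \mathbf{M}\succeq 0,
\qquad \mathbf{M}=\mathbf{L}^{\top}\mathbf{L}
  \;\Rightarrow\;
  d_{\mathbf{M}}(\mathbf{x},\mathbf{y})
  = \lVert\mathbf{L}\mathbf{x}-\mathbf{L}\mathbf{y}\rVert_{2}.
```

Learning the metric matrix M is thus learning the linear map L, and a rank-one metric corresponds to a single projection direction L = wᵀ, which is what makes the very-large-dataset case tractable.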

**Research team within HES-SO:**
Kalousis Alexandros

**Project duration:**
01.10.2011 - 31.08.2013

**Total project budget:** 56'820 CHF

**Status:** Completed

2022

**Abstract:**

Despite the recent success of reinforcement learning in various domains, these approaches remain, for the most part, deterringly sensitive to hyper-parameters and are often riddled with essential engineering feats allowing their success. We consider the case of off-policy generative adversarial imitation learning, and perform an in-depth review, qualitative and quantitative, of the method. We show that forcing the learned reward function to be local Lipschitz-continuous is a sine qua non condition for the method to perform well. We then study the effects of this necessary condition and provide several theoretical results involving the local Lipschitzness of the state-value function. We complement these guarantees with empirical evidence attesting to the strong positive effect that the consistent satisfaction of the Lipschitzness constraint on the reward has on imitation performance. Finally, we tackle a generic pessimistic reward preconditioning add-on spawning a large class of reward shaping methods, which makes the base method it is plugged into provably more robust, as shown in several additional theoretical guarantees. We then discuss these through a fine-grained lens and share our insights. Crucially, the guarantees derived and reported in this work are valid for any reward satisfying the Lipschitzness condition, nothing is specific to imitation. As such, these may be of independent interest.
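
As an illustration of the central Lipschitzness condition, here is a minimal PyTorch sketch of a gradient penalty that pushes a learned reward toward local Lipschitz-continuity around sampled state-action pairs; the architecture, names and coefficients are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small reward network r(s, a); the architecture is illustrative."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def gradient_penalty(reward_net, obs, act, target=1.0):
    """Penalize the gradient norm of the reward w.r.t. its inputs, pushing
    the reward toward local Lipschitz-continuity around (obs, act)."""
    obs = obs.clone().requires_grad_(True)
    act = act.clone().requires_grad_(True)
    r = reward_net(obs, act).sum()
    g_obs, g_act = torch.autograd.grad(r, (obs, act), create_graph=True)
    grad_norm = torch.cat([g_obs, g_act], dim=-1).norm(2, dim=-1)
    return ((grad_norm - target) ** 2).mean()

# usage inside the discriminator/reward update (lambda_gp: hyper-parameter):
# loss = adversarial_loss + lambda_gp * gradient_penalty(reward_net, obs, act)
```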

*Sensors*,
2022, Vol. 22, no. 3, article no. 1088

**Abstract:**

Despite the great attention that the research community has paid to the creation of novel indoor positioning methods, a rather limited volume of work has focused on the confidence that Indoor Positioning Systems (IPS) assign to the position estimates they produce. The concept of dynamically estimating the accuracy of the position estimates provided by an IPS has been sporadically studied in the literature of the field. Recently, this concept has also started being studied in the context of outdoor positioning systems of the Internet of Things (IoT) based on Low-Power Wide-Area Networks (LPWANs). What is problematic is that consistent comparison of the proposed methods is quasi-nonexistent: new methods rarely use previous ones as baselines; often only a small number of evaluation metrics are reported, and different publications report different metrics; the use of open data is rare; and the publication of open code is absent. In this work, we present an open-source, reproducible benchmarking framework for evaluating and consistently comparing various methods of Dynamic Accuracy Estimation (DAE). This work reviews the relevant literature, presenting commonalities and differences in a consistent terminology and discussing baselines and evaluation metrics. Moreover, it evaluates multiple methods of DAE using open data, open code, and a rich set of relevant evaluation metrics. This is the first work aiming to establish the state of the art of methods of DAE determination in IPS and in LPWAN positioning systems, through an open, transparent, holistic, reproducible, and consistent evaluation of the methods proposed in the relevant literature.

2020

**Abstract:**

Lifelong learning is the problem of learning multiple consecutive tasks in a sequential manner, where knowledge gained from previous tasks is retained and used to aid future learning over the lifetime of the learner. It is essential towards the development of intelligent machines that can adapt to their surroundings. In this work we focus on a lifelong learning approach to unsupervised generative modeling, where we continuously incorporate newly observed distributions into a learned model. We do so through a student-teacher Variational Autoencoder architecture which allows us to learn and preserve all the distributions seen so far, without the need to retain the past data nor the past models. Through the introduction of a novel cross-model regularizer, inspired by a Bayesian update rule, the student model leverages the information learned by the teacher, which acts as a probabilistic knowledge store. The regularizer reduces the effect of catastrophic interference that appears when we learn over sequences of distributions. We validate our model’s performance on sequential variants of MNIST, FashionMNIST, PermutedMNIST, SVHN and Celeb-A and demonstrate that our model mitigates the effects of catastrophic interference faced by neural networks in sequential learning scenarios.
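
A minimal sketch of the student-teacher mechanism, assuming encoder/decoder callables that return Gaussian posterior parameters; the paper's cross-model regularizer is derived from a Bayesian update rule, so the plain KL term below is only indicative:

```python
import torch
from torch.distributions import Normal, kl_divergence

def cross_model_regularizer(student_enc, teacher_enc, teacher_dec,
                            n_samples=64, z_dim=8):
    """Replay inputs from the frozen teacher and pull the student's
    posterior toward the teacher's posterior on those inputs."""
    with torch.no_grad():
        z = torch.randn(n_samples, z_dim)     # sample the latent prior
        x_replay = teacher_dec(z)             # teacher as probabilistic knowledge store
        mu_t, std_t = teacher_enc(x_replay)   # teacher posterior parameters
    mu_s, std_s = student_enc(x_replay)       # student posterior parameters
    kl = kl_divergence(Normal(mu_s, std_s), Normal(mu_t, std_t))
    return kl.sum(dim=-1).mean()

# total student loss on the current task (lambda_cm: hyper-parameter):
# loss = elbo_current_task + lambda_cm * cross_model_regularizer(...)
```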

**Abstract:**

One of the major shortcomings of variational autoencoders is the inability to produce generations from the individual modalities of data originating from mixture distributions. This is primarily due to the use of a simple isotropic Gaussian as the prior for the latent code in the ancestral sampling procedure for data generation. In this paper, we propose a novel formulation of variational autoencoders, the conditional prior VAE (CP-VAE), with a two-level generative process for the observed data, in which a continuous variable z and a discrete variable c are introduced in addition to the observed variables x. By learning data-dependent conditional priors, the new variational objective naturally encourages a better match between the posterior and prior conditionals, and the learning of latent categories encoding the major source of variation of the original data in an unsupervised manner. By sampling the continuous latent code from the data-dependent conditional priors, we are able to generate new samples from the individual mixture components corresponding to the multimodal structure of the original data. Moreover, we unify and analyse our objective under different independence assumptions for the joint distribution of the continuous and discrete latent variables. We provide an empirical evaluation on one synthetic dataset and three image datasets, FashionMNIST, MNIST, and Omniglot, illustrating the generative performance of our new model compared to multiple baselines.
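
The two-level ancestral sampling that the data-dependent conditional priors enable can be sketched as follows; module names and sizes are illustrative, and the decoder network is omitted:

```python
import torch
import torch.nn as nn

class ConditionalPrior(nn.Module):
    """p(z | c): a learnable Gaussian prior per discrete category c.
    Sizes and names are illustrative, not the paper's code."""
    def __init__(self, n_cats=10, z_dim=16):
        super().__init__()
        self.mu = nn.Embedding(n_cats, z_dim)
        self.log_std = nn.Embedding(n_cats, z_dim)

    def sample(self, c):
        mu, std = self.mu(c), self.log_std(c).exp()
        return mu + std * torch.randn_like(std)

# two-level ancestral sampling: c ~ p(c), z ~ p(z | c), x ~ p(x | z, c)
prior = ConditionalPrior()
c = torch.randint(0, 10, (32,))   # choose mixture components
z = prior.sample(c)               # continuous code from the conditional prior
# x = decoder(z, c)               # decoder network omitted in this sketch
```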

2018

*Journal of Biomedical Semantics*,
2018, vol. 9, no. 21, pp. 1-20

**Abstract:**

Background: While representation learning techniques have shown great promise in application to a number of different NLP tasks, they have had little impact on the problem of ontology matching. Unlike past work that has focused on feature engineering, we present a novel representation learning approach that is tailored to the ontology matching task. Our approach is based on embedding ontological terms in a high-dimensional Euclidean space. This embedding is derived on the basis of a novel phrase retrofitting strategy through which semantic similarity information becomes inscribed onto fields of pre-trained word vectors. The resulting framework also incorporates a novel outlier detection mechanism based on a denoising autoencoder that is shown to improve performance. Results: An ontology matching system derived using the proposed framework achieved an F-score of 94% on an alignment scenario involving the Adult Mouse Anatomical Dictionary and the Foundational Model of Anatomy ontology (FMA) as targets. This compares favorably with the best performing systems on the Ontology Alignment Evaluation Initiative anatomy challenge. We performed additional experiments on aligning FMA to NCI Thesaurus and to SNOMED CT based on a reference alignment extracted from the UMLS Metathesaurus. Our system obtained overall F-scores of 93.2% and 89.2% for these experiments, thus achieving state-of-the-art results. Conclusions: Our proposed representation learning approach leverages terminological embeddings to capture semantic similarity. Our results provide evidence that the approach produces embeddings that are especially well tailored to the ontology matching task, demonstrating a novel pathway for the problem.

2016

**Abstract:**

Recommendation systems often rely on point-wise loss metrics such as the mean squared error. However, in real recommendation settings only few items are presented to a user. This observation has recently encouraged the use of rank-based metrics. LambdaMART is the state-of-the-art algorithm in learning to rank that relies on such a metric. Motivated by the fact that very often the users' and items' descriptions as well as the preference behavior can be well summarized by a small number of hidden factors, we propose a novel algorithm, LambdaMART matrix factorization (LambdaMART-MF), that learns latent representations of users and items using gradient boosted trees. The algorithm factorizes LambdaMART by defining relevance scores as the inner product of the learned representations of the users and items. We regularise the learned latent representations so that they reflect the user and item manifolds as these are defined by their original feature-based descriptors and the preference behavior. We also propose to use a weighted variant of NDCG to reduce the penalty for similar items with large rating discrepancy. We experiment on two very different recommendation datasets, meta-mining and movies-users, and evaluate the performance of LambdaMART-MF, with and without regularization, in the cold start setting as well as in the simpler matrix completion setting. The experiments show that the factorization of LambdaMART brings significant performance improvements both in the cold start and the matrix completion settings. The incorporation of regularisation seems to have a smaller performance impact.
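
The factorized scoring at the core of LambdaMART-MF reduces to an inner product of learned user and item representations; in the toy numpy sketch below, random matrices stand in for the gradient-boosted-tree outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 5, 8, 3

# In LambdaMART-MF the latent factors are produced by gradient boosted
# trees applied to user/item descriptors; random matrices stand in here.
U = rng.normal(size=(n_users, d))      # user latent representations
V = rng.normal(size=(n_items, d))      # item latent representations

scores = U @ V.T                       # relevance score = inner product
ranking = np.argsort(-scores, axis=1)  # per-user ranking, input to NDCG
print(ranking[0])                      # top-ranked items for the first user
```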

2015

**Abstract:**

The Data Mining OPtimization Ontology (DMOP) has been developed to support informed decision-making at various choice points of the data mining process. The ontology can be used by data miners and deployed in ontology-driven information systems. The primary purpose for which DMOP has been developed is the automation of algorithm and model selection through semantic meta-mining, which makes use of an ontology-based meta-analysis of complete data mining processes in view of extracting patterns associated with mining performance. To this end, DMOP contains detailed descriptions of data mining tasks (e.g., learning, feature selection), data, algorithms, hypotheses such as mined models or patterns, and workflows. A development methodology was used for DMOP, including items such as competency questions and foundational ontology reuse. Several non-trivial modeling problems were encountered and, due to the complexity of the data mining details, the ontology requires the use of the OWL 2 DL profile. DMOP was successfully evaluated for semantic meta-mining and used in constructing the Intelligent Discovery Assistant, deployed in the popular data mining environment RapidMiner.

2014

*Journal of Artificial Intelligence Research*, November 2014, vol. 51, pp. 605-644

2011

*In: Jankowski, Norbert (ed.). Meta-learning in computational intelligence. Berlin: Springer, 2011, pp. 273-315*

2023

**Abstract:**

Humans are able to quickly adapt to new situations, learn effectively with limited data, and create unique combinations of basic concepts. In contrast, generalizing to out-of-distribution (OOD) data and achieving combinatorial generalization are fundamental challenges for machine learning models. To address these challenges, we propose BtVAE, a method that employs supervised conditional VAE models to achieve combinatorial generalization in certain scenarios and consequently to generate out-of-distribution (OOD) data. Unlike previous approaches that use new factors of variation during testing, our method uses only existing attributes from the training data, but in ways that were not seen during training (e.g., small objects during training and large objects during testing). We first learn a latent representation of the in-distribution inputs and pass this representation to a conditional decoder, conditioning on some OOD attribute values, to generate implicit OOD samples. These generated samples are then translated back to the original in-distribution inputs, conditioning on the actual attribute values. To ensure that the generated OOD samples have the specified OOD attribute values, a predictor is introduced. By training with OOD attribute values, the decoder learns to produce the correct output for unseen combinations, resulting in a model that is not only able to reconstruct OOD data but also to manipulate it and to generate samples conditioned on unseen combinations of attribute values.

**Abstract:**

The combination of deep neural nets and theory-driven models (deep grey-box models) can be advantageous due to the inherent robustness and interpretability of the theory-driven part. Deep grey-box models are usually learned with a regularized risk minimization that prevents the theory-driven part from being overwritten and ignored by the deep neural net. However, an estimation of the theory-driven part obtained by uncritically optimizing a regularizer can hardly be trustworthy if we are not sure which regularizer is suitable for the given data, which may affect interpretability. Toward a trustworthy estimation of the theory-driven part, we should analyze the behavior of regularizers in order to compare different candidates and to justify a specific choice. In this paper, we present a framework that allows us to empirically analyze the behavior of a regularizer, with a slight change in the architecture of the neural net and the training objective.

2022

**Abstract:**

We consider the problem of modelling high-dimensional distributions and generating new examples of data with complex relational feature structure coherent with a graph skeleton. The model we propose tackles the problem of generating the data features constrained by the specific graph structure of each data point by splitting the task into two phases. In the first, it models the distribution of features associated with the nodes of the given graph; in the second, it complements the edge features conditionally on the node features. We follow the strategy of implicit distribution modelling via a generative adversarial network (GAN) combined with a permutation-equivariant message passing architecture operating over the sets of nodes and edges. This enables generating the feature vectors of all the graph objects in one go (in two phases), as opposed to the much slower one-by-one generation of sequential models, avoids the expensive graph matching procedures usually needed for likelihood-based generative models, and uses the network capacity efficiently by being insensitive to the particular node ordering in the graph representation. To the best of our knowledge, this is the first method that models the feature distribution along the graph skeleton, allowing for the generation of annotated graphs with user-specified structures. Our experiments demonstrate the ability of our model to learn complex structured distributions through quantitative evaluation over three annotated graph datasets.

2021

**Abstract:**

Integrating physics models within machine learning models holds considerable promise toward learning robust models with improved interpretability and abilities to extrapolate. In this work, we focus on the integration of incomplete physics models into deep generative models. In particular, we introduce an architecture of variational autoencoders (VAEs) in which a part of the latent space is grounded by physics. A key technical challenge is to strike a balance between the incomplete physics and trainable components such as neural networks for ensuring that the physics part is used in a meaningful manner. To this end, we propose a regularized learning method that controls the effect of the trainable components and preserves the semantics of the physics-based latent variables as intended. We not only demonstrate generative performance improvements over a set of synthetic and real-world datasets, but we also show that we learn robust models that can consistently extrapolate beyond the training distribution in a meaningful manner. Moreover, we show that we can control the generative process in an interpretable manner.

*Proceedings of the 11th International Conference on Indoor Positioning and Indoor Navigation*

**Abstract:**

The movement advocating for a more transparent and reproducible science has placed the issue of research reproducibility at the center of attention of various stakeholders related to academic research. Universities, funding institutions and publishers have started changing long established policies with the goal to encourage and support best practices for rigorous and transparent science making. Regarding the field of indoor positioning, there is a lack of standard evaluation procedures that would enable consistent comparisons. Moreover, the practices of Open Data and Open Source are on the verge of gaining popularity within the community of the field. This work, after presenting an extensive introduction to the landscape of research reproducibility and providing the viewpoint of the research community of Indoor Positioning, proceeds to its primary contribution: to provide a concrete set of suggestions that could accelerate the pace of the Indoor Positioning research community towards becoming a discipline of reproducible research.

*Proceedings of the 11th International Conference on Indoor Positioning and Indoor Navigation (IPIN)*

**Abstract:**

The proliferation of data-demanding machine learning methods has brought to light the need for methodologies which can enlarge the size of training datasets with simple, rule-based methods. In line with this concept, the fingerprint augmentation scheme proposed in this work aims to augment the fingerprint datasets used to train positioning models. The proposed method utilizes fingerprints recorded in spatial proximity in order to perform fingerprint augmentation, creating new fingerprints which combine the features of the original ones. The method of composing the new, augmented fingerprints is inspired by the crossover and mutation operators of genetic algorithms. The ProxyFAUG method aims to improve the achievable positioning accuracy of fingerprint datasets by introducing a rule-based, stochastic, proximity-based method of fingerprint augmentation. The performance of ProxyFAUG is evaluated in an outdoor Sigfox setting using a public dataset. The best performing published positioning method on this dataset is improved by 40% in terms of median error and 6% in terms of mean error with the use of the augmented dataset. The analysis of the results indicates a systematic and significant performance improvement at the lower error quartiles, as indicated by the improvement of the median error.
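
A sketch of the proximity-based crossover/mutation idea, as one illustrative reading of the genetic-style operators (the published operator details may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_pair(fp_a, fp_b, mutation_scale=1.0):
    """Combine two fingerprints recorded in spatial proximity:
    uniform crossover of their RSSI features, then Gaussian mutation
    of a small fraction of the features."""
    mask = rng.random(fp_a.shape) < 0.5    # crossover: pick a parent per feature
    child = np.where(mask, fp_a, fp_b)
    mutate = rng.random(fp_a.shape) < 0.1  # mutation mask
    return child + mutate * rng.normal(0.0, mutation_scale, fp_a.shape)

# two nearby fingerprints (RSSI values in dBm); positions assumed close
fp1 = np.array([-95.0, -110.0, -120.0, -87.0])
fp2 = np.array([-97.0, -108.0, -118.0, -90.0])
print(augment_pair(fp1, fp2))
```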

*Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases*

**Abstract:**

In this work, we want to learn to model the dynamics of similar yet distinct groups of interacting objects. These groups follow common physical laws but exhibit specificities that are captured through a vectorial description. We develop a model that allows us to do conditional generation for any such group given its vectorial description. Unlike previous work on learning dynamical systems, which can only do trajectory completion and requires part of the trajectory dynamics to be provided as input at generation time, we do generation using only the conditioning vector, with no access to trajectories at generation time. We evaluate our model in the setting of modeling human gait and, in particular, pathological human gait.

**Abstract:**

The primary expectation from positioning systems is for them to provide users with reliable estimates of their position. An additional piece of information that can greatly help users utilize position estimates is the level of uncertainty that a positioning system assigns to each position estimate it produces. The concept of dynamically estimating the accuracy of position estimates of fingerprinting positioning systems has been sporadically discussed over the last decade in the literature of the field, where mainly handcrafted rules based on domain knowledge have been proposed. The emergence of IoT devices and the proliferation of data from Low Power Wide Area Networks (LPWANs) have facilitated the conceptualization of data-driven methods for determining the estimated certainty over position estimates. In this work, we analyze the data-driven approach of determining the Dynamic Accuracy Estimation (DAE), considering it in the broader context of a positioning system. More specifically, with the use of a public LoRaWAN dataset, the current work analyses the repartition of the available training set between the tasks of determining the location estimates and the DAE, the concept of selecting a subset of the most reliable estimates, and the impact that the spatial distribution of the data has on the accuracy of the DAE. The work provides a wide overview of the data-driven approach to DAE determination in the context of the overall design of a positioning system.

*Proceedings of the Ninth International Conference on Learning Representations (ICLR 2021), Neural Compression Workshop*

**Abstract:**

We consider the problem of learned transform compression, where we learn both the transform and the probability distribution over the discrete codes. We utilize a soft relaxation of the quantization operation to allow for back-propagation of gradients and employ vector (rather than scalar) quantization of the latent codes. Furthermore, we apply a similar relaxation in the code probability assignments, enabling direct optimization of the code entropy. To the best of our knowledge, this approach is completely novel. We conduct a set of proof-of-concept experiments confirming the potency of our approaches.
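
The soft relaxation of vector quantization can be sketched as a softmax over negative distances to the codebook, which keeps both the codewords and a code-entropy proxy differentiable; the names and the temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def soft_vector_quantize(z, codebook, tau=1.0):
    """Soft relaxation of vector quantization: instead of a hard nearest-
    codeword assignment, use a softmax over negative squared distances so
    that gradients flow to both the encoder output z and the codebook."""
    # z: (batch, d); codebook: (K, d)
    d2 = torch.cdist(z, codebook) ** 2      # squared distances to codewords
    probs = F.softmax(-d2 / tau, dim=-1)    # soft code assignments
    z_q = probs @ codebook                  # convex combination of codewords
    # entropy of the average code distribution: a differentiable rate proxy
    p_bar = probs.mean(dim=0)
    entropy = -(p_bar * (p_bar + 1e-9).log()).sum()
    return z_q, entropy

z = torch.randn(16, 8)
codebook = torch.randn(32, 8, requires_grad=True)
z_q, H = soft_vector_quantize(z, codebook)
```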

*Proceedings of the Ninth International Conference on Learning Representations (ICLR 2021)*

**Abstract:**

Episodic and semantic memory are critical components of the human memory model. The theory of complementary learning systems (McClelland et al., 1995) suggests that the compressed representation produced by a serial event (episodic memory) is later restructured to build a more generalized form of reusable knowledge (semantic memory). In this work, we develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory via a hierarchical latent variable model. We take inspiration from traditional heap allocation and extend the idea of locally contiguous memory to the Kanerva Machine, enabling a novel differentiable block-allocated latent memory. In contrast to the Kanerva Machine, we simplify the process of memory writing by treating it as a fully feed-forward deterministic process, relying on the stochasticity of the read key distribution to disperse information within the memory. We demonstrate that this allocation scheme improves performance in memory-conditional image generation, resulting in new state-of-the-art conditional likelihood values on binarized MNIST (≤41.58 nats/image) and binarized Omniglot (≤66.24 nats/image), as well as presenting competitive performance on CIFAR10, DMLab Mazes, Celeb-A and ImageNet32×32.

2020

*Proceedings of the Neural Information Processing Systems Online Conference 2020*

**Abstract:**

Despite recent advances, goal-directed generation of structured discrete data remains challenging. For problems such as program synthesis (generating source code) and materials design (generating molecules), finding examples which satisfy desired constraints or exhibit desired properties is difficult. In practice, expensive heuristic search or reinforcement learning algorithms are often employed. In this paper, we investigate the use of conditional generative models which directly attack this inverse problem, by modeling the distribution of discrete structures given properties of interest. Unfortunately, the maximum likelihood training of such models often fails with the samples from the generative model inadequately respecting the input properties. To address this, we introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward. We avoid high-variance score-function estimators that would otherwise be required by sampling from an approximation to the normalized rewards, allowing simple Monte Carlo estimation of model gradients. We test our methodology on two tasks: generating molecules with user-defined properties and identifying short python expressions which evaluate to a given target value. In both cases, we find improvements over maximum likelihood estimation and other baselines.

2019

**Abstract:**

Neural networks typically need huge amounts of training data to obtain reasonably generalizable results. A common approach is to artificially generate samples by using prior knowledge of the data properties or other relevant domain knowledge. However, if the assumptions on the data properties are not accurate or the domain knowledge is irrelevant to the task at hand, one may end up degrading learning performance by using such augmented data, in comparison to simply training on the limited available dataset. We propose a critical data augmentation method using feature side-information, which is obtained from domain knowledge and provides detailed information about the features' intrinsic properties. Most importantly, we introduce an instance-wise quality checking procedure on the augmented data, which filters out irrelevant or harmful augmented data before it enters the model. We validated this approach on both synthetic and real-world datasets, specifically in a scenario where the data augmentation is done based on a task-independent, unreliable source of information. The experiments show that the introduced critical data augmentation scheme helps avoid the performance degradation resulting from incorporating wrong augmented data.

*Proceedings of the tenth International Conference on Indoor Positioning and Indoor Navigation*

**Abstract:**

Fingerprinting techniques, which are a common method for indoor localization, have recently been applied with success in outdoor settings. In particular, the communication signals of Low Power Wide Area Networks (LPWAN) such as Sigfox have been used for localization. In this rather recent field of study, few publicly available datasets, which would facilitate the consistent comparison of different positioning systems, exist so far. In the current study, a published dataset of RSSI measurements on a Sigfox network deployed in Antwerp, Belgium is used to analyse the appropriate selection of preprocessing steps and to tune the hyperparameters of a kNN fingerprinting method. Initially, the tuning of hyperparameter k for a variety of distance metrics, and the selection of efficient data transformation schemes proposed by relevant works, is presented. In addition, accuracy improvements are achieved in this study by a detailed examination of the appropriate adjustment of the parameters of the data transformation schemes tested, and of the handling of out-of-range values. With the appropriate tuning of these factors, the achieved mean localization error was 298 meters, and the median error was 109 meters. To facilitate the reproducibility of tests and the comparability of results, the code and the train/validation/test split used in this study are made available.
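
For reference, the core of a kNN fingerprinting estimator is compact; the sketch below uses plain Euclidean distance in signal space and an assumed out-of-range substitution constant, both of which are exactly the kind of tuning choices the study examines:

```python
import numpy as np

OUT_OF_RANGE = -200.0   # value substituted for base stations that were not heard

def knn_position(query, fingerprints, positions, k=5):
    """Plain kNN fingerprinting: average the positions of the k training
    fingerprints closest to the query in RSSI space. The out-of-range
    constant, the distance metric and k are all tuning choices."""
    d = np.linalg.norm(fingerprints - query, axis=1)   # Euclidean in signal space
    nearest = np.argsort(d)[:k]
    return positions[nearest].mean(axis=0)

# toy Sigfox-like data: rows are fingerprints, columns are base stations
fps = np.array([[-95.0, OUT_OF_RANGE, -110.0],
                [-90.0, -120.0, OUT_OF_RANGE],
                [-97.0, OUT_OF_RANGE, -112.0]])
pos = np.array([[51.20, 4.40], [51.21, 4.42], [51.19, 4.41]])  # lat, lon
print(knn_position(np.array([-96.0, OUT_OF_RANGE, -111.0]), fps, pos, k=2))
```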

**Abstract:**

GAIL is a recent successful imitation learning architecture that exploits the adversarial training procedure introduced in GANs. Albeit successful at generating behaviours similar to those demonstrated to the agent, GAIL suffers from a high sample complexity in the number of interactions it has to carry out in the environment in order to achieve satisfactory performance. We dramatically shrink the amount of interactions with the environment necessary to learn well-behaved imitation policies, by up to several orders of magnitude. Our framework, operating in the model-free regime, exhibits a significant increase in sample-efficiency over previous methods by simultaneously a) learning a self-tuned adversarially-trained surrogate reward and b) leveraging an off-policy actor-critic architecture. We show that our approach is simple to implement and that the learned agents remain remarkably stable, as shown in our experiments that span a variety of continuous control tasks. Video visualisations available at: \url{https://youtu.be/-nCsqUJnRKU}.

2018

*Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS) 2018*

**Abstract:**

Continual learning is the ability to sequentially learn over time by accommodating knowledge while retaining previously learned experiences. Neural networks can learn multiple tasks when trained on them jointly, but cannot maintain performance on previously learned tasks when tasks are presented one at a time. This problem is called catastrophic forgetting. In this work, we propose a classification model that learns continuously from sequentially observed tasks while preventing catastrophic forgetting. We build on the lifelong generative capabilities of [10] and extend them to the classification setting by deriving a new variational bound on the joint log-likelihood log p(x, y).

*Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD)*

**Abstract:**

We propose a new method for input variable selection in nonlinear regression. The method is embedded into a kernel regression machine that can model general nonlinear functions, not being a priori limited to additive models. This is the first kernel-based variable selection method applicable to large datasets. It sidesteps the typical poor scaling properties of kernel methods by mapping the inputs into a relatively low-dimensional space of random features. The algorithm discovers the variables relevant for the regression task together with learning the prediction model, through learning the appropriate nonlinear random feature maps. We demonstrate the outstanding performance of our method on a set of large-scale synthetic and real datasets.
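
A minimal sketch of regression on random Fourier features, the low-dimensional map the method builds on; here the per-variable scales are fixed, whereas the method learns them (driving a scale to zero removes the variable). All names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, W, b):
    """phi(x) = sqrt(2/D) * cos(x W + b): a random feature map whose inner
    products approximate an RBF kernel (Rahimi & Recht style)."""
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

n, p, D = 200, 10, 100
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)  # 2 relevant inputs

# per-variable scales gate the inputs into the feature map; here fixed to
# ones, while the method learns them to perform variable selection
gamma = np.ones(p)
W = rng.normal(size=(p, D)) * gamma[:, None]
b = rng.uniform(0, 2 * np.pi, size=D)

Phi = random_fourier_features(X, W, b)
w = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(D), Phi.T @ y)  # ridge fit
print(np.mean((Phi @ w - y) ** 2))   # training error of the sketch
```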

**Abstract:**

We investigate structured sparsity methods for variable selection in regression problems where the target depends nonlinearly on the inputs. We focus on general nonlinear functions not limiting a priori the function space to additive models. We propose two new regularizers based on partial derivatives as nonlinear equivalents of group lasso and elastic net. We formulate the problem within the framework of learning in reproducing kernel Hilbert spaces and show how the variational problem can be reformulated into a more practical finite dimensional equivalent. We develop a new algorithm derived from the ADMM principles that relies solely on closed forms of the proximal operators. We explore the empirical properties of our new algorithm for Nonlinear Variable Selection based on Derivatives (NVSD) on a set of experiments and confirm favourable properties of our structured-sparsity models and the algorithm in terms of both prediction and variable selection accuracy.
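
One of the closed-form proximal operators an ADMM scheme of this kind can rely on is block soft-thresholding, the prox of the group-lasso norm; a minimal sketch:

```python
import numpy as np

def prox_group_lasso(v, lam):
    """Closed-form proximal operator of lam * ||v||_2 (block
    soft-thresholding): shrink the group's norm by lam, or zero it out."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm) * v

# shrink one group of (e.g. partial-derivative) coefficients
print(prox_group_lasso(np.array([3.0, 4.0]), lam=2.5))   # scaled toward zero
print(prox_group_lasso(np.array([0.3, 0.4]), lam=2.5))   # zeroed out entirely
```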

*Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*

2017

**Abstract:**

We present a new method for forecasting systems of multiple interrelated time series. The method learns the forecast models together with discovering leading indicators from within the system that serve as good predictors, improving the forecast accuracy, as well as a cluster structure of the predictive tasks around these. The method is based on the classical linear vector autoregressive model (VAR) and links the discovery of the leading indicators to inferring sparse graphs of Granger causality. We formulate a new constrained optimisation problem to promote the desired sparse structures across the models and the sharing of information amongst the learning tasks in a multi-task manner. We propose an algorithm for solving the problem and demonstrate, on a battery of synthetic and real-data experiments, the advantages of our new method over baseline VAR models as well as the state-of-the-art sparse VAR learning methods.
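
A toy sketch of the sparse-VAR idea: fitting a first-order model with a group penalty on coefficient columns, so that a zeroed column encodes an absent Granger link. This illustrative proximal gradient loop is not the constrained optimisation algorithm of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_sparse_var1(X, lam=0.1, lr=0.5, steps=200):
    """Fit x_t ~ A x_{t-1} with a group-lasso penalty on the columns of A:
    a column j shrunk to zero means series j predicts none of the others."""
    T, k = X.shape
    past, future = X[:-1], X[1:]
    A = np.zeros((k, k))
    for _ in range(steps):
        grad = (past @ A.T - future).T @ past / (T - 1)   # squared-loss gradient
        A -= lr * grad
        for j in range(k):                                # block soft-thresholding
            norm = np.linalg.norm(A[:, j])
            A[:, j] = 0.0 if norm <= lr * lam else (1 - lr * lam / norm) * A[:, j]
    return A

# simulate four series with exactly two Granger links: 1 -> 0 and 3 -> 2
k, T = 4, 2000
A_true = np.zeros((k, k))
A_true[0, 1], A_true[2, 3] = 0.8, -0.5
X = np.zeros((T, k))
for t in range(1, T):
    X[t] = X[t - 1] @ A_true.T + rng.normal(size=k)
print(np.round(fit_sparse_var1(X), 2))   # recovers the sparse column structure
```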

*Proceedings of 14th ACS/IEEE International Conference on Computer Systems and Applications AICCSA 2017*

**Abstract:**

Nowadays, advances in technology make data volumes grow at a fast pace. In many application fields, this trend especially concerns the dimensionality of the data: there are cases where the number of features reaches thousands or tens of thousands, while the number of instances is much smaller. This phenomenon is known as the curse of dimensionality, and it results in modest classification performance and feature selection instability. In order to deal with this issue, we propose a new feature selection approach that makes use of background knowledge about some dimensions known to be more relevant, as a means of directing the feature selection process. In this approach, prior knowledge about some features is used to learn new relevant features in a semi-supervised manner. Experiments on three high-dimensional datasets show promising results on both classification performance and the stability of feature selection.

*Proceedings of the European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases*

**Abstract:**

Traditional linear methods for forecasting multivariate time series are not able to satisfactorily model the non-linear dependencies that may exist in non-Gaussian series. We build on the theory of learning vector-valued functions in the reproducing kernel Hilbert space and develop a method for learning prediction functions that accommodate such non-linearities. The method not only learns the predictive function but also the matrix-valued kernel underlying the function search space, directly from the data. Our approach is based on learning multiple matrix-valued kernels, each of them composed of a set of input kernels and a set of output kernels learned in the cone of positive semi-definite matrices. In addition to superior predictive performance in the presence of strong non-linearities, our method also recovers the hidden dynamic relationships between the series and thus offers a new alternative to existing graphical Granger techniques.

**Abstract:**

Very often, features come with their own vectorial descriptions which provide detailed information about their properties. We refer to these vectorial descriptions as feature side-information. In the standard learning scenario, the input is represented as a vector of features, and the feature side-information is most often ignored or used only for feature selection prior to model fitting. We believe that feature side-information, which carries information about the intrinsic properties of the features, can help improve model prediction if used in a proper way during the learning process. In this paper, we propose a framework that allows for the incorporation of the feature side-information during the learning of very general model families, in order to improve the prediction performance. We control the structures of the learned models so that they reflect the features' similarities as these are defined on the basis of the side-information. We perform experiments on a number of benchmark datasets which show significant predictive performance gains, over a number of baselines, as a result of the exploitation of the side-information.
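
One simple instance of controlling model structure through feature similarities is a Laplacian penalty that pulls the weights of similar features together; the ridge-style sketch below is an illustrative instance of the framework, not the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_with_side_info(X, y, S, lam=1.0):
    """Least-squares fit whose regularizer couples the weights of similar
    features: sum_ij S_ij (w_i - w_j)^2 = 2 w^T L w, with L the Laplacian
    of the feature-similarity graph built from the side-information."""
    L = np.diag(S.sum(axis=1)) - S                  # graph Laplacian
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

n, p = 100, 4
X = rng.normal(size=(n, p))
w_true = np.array([1.0, 1.0, -2.0, -2.0])           # features 0~1 and 2~3 behave alike
y = X @ w_true + 0.1 * rng.normal(size=n)
S = np.array([[0, 1, 0, 0], [1, 0, 0, 0],           # side-information similarities
              [0, 0, 0, 1], [0, 0, 1, 0]], float)
print(np.round(fit_with_side_info(X, y, S), 2))
```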

2016

*Learning in High Dimensions with Structure, Workshop proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016)*

**Abstract:**

Very often, features come with their own vectorial descriptions which provide detailed information about their properties. We refer to these vectorial descriptions as feature side-information. The feature side-information is most often ignored or used only for feature selection prior to model fitting. In this paper, we propose a framework that allows for the incorporation of feature side-information during the learning of very general model families. We control the structures of the learned models so that they reflect the features' similarities as these are defined on the basis of the side-information. We perform experiments on a number of benchmark datasets which show significant predictive performance gains, over a number of baselines, as a result of the exploitation of the side-information.

2015

*Proceedings of the Time Series Workshop of the 29th Neural Information Processing Systems conference, NIPS-2015*

**Abstract:**

We develop a functional learning approach to modelling systems of time series which preserves the ability of standard linear time-series models (VARs) to uncover the Granger-causality links in between the series of the system while allowing for richer functional relationships. We propose a framework for learning multiple output-kernels associated with multiple input-kernels over a structured input space and outline an algorithm for simultaneous learning of the kernels with the model parameters with various forms of regularization including non-smooth sparsity inducing norms. We present results of synthetic experiments illustrating the benefits of the described approach.

*Advances in Neural Information Processing Systems 28 : Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada*

**Abstract:**

Space-time is a profound concept in physics. This concept was shown to be useful for dimensionality reduction. We present basic definitions with interesting counter-intuitions. We give theoretical propositions to show that space-time is a more powerful representation than Euclidean space. We apply this concept to manifold learning for preserving local information. Empirical results on nonmetric datasets show that more information can be preserved in space-time.

*Proceedings of the Demand Forecasting Workshop of the 32nd International Conference on Machine Learning*

**Abstract:**

We consider the problem of forecasting multiple time series across multiple cross-sections based solely on the past observations of the series. We propose to use panel vector autoregressive model to capture the inter-dependencies on the past values of the multiple series. We restrict the panel vector autoregressive model to exclude the cross-sectional relationships and propose a method to learn models with sparse Granger-causality structures coherent across the panel sections. The method extends the concepts of group variable selection and support union recovery into the panel setting by extending the group lasso penalty (Yuan & Lin, 2006) into matrix output regression setting with 3d-tensor of model parameters.

*Journal of Machine Learning Research: Proceedings of the 32nd International Conference on Machine Learning, pp. 49-58, 2015*

**Abstract:**

We study parametric unsupervised mixture learning. We measure the loss of intrinsic information from the observations to complex mixture models, and then to simple mixture models. We present a geometric picture, where all these representations are regarded as free points in the space of probability distributions. Based on minimum description length, we derive a simple geometric principle to learn all these models together. We present a new learning machine with theories, algorithms, and simulations.

2014

*In: Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, June 21-26, 2014, pp. 370-378*

2013

*In: Proceedings of the 30th International Conference on Machine Learning, 2013, Atlanta, USA, pp. 169-177*

*In: Proceedings of Neural Information Processing 2013, Workshop on Output Representation Learning, pp. 1-6*

2012

*In: Machine learning and knowledge discovery in databases, European Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part I. Lecture notes in computer science, vol. 7523, 2012, pp. 223-236*

*In: Proceedings of the 12th IEEE International Conference on Data Mining (ICDM 2012), Brussels, Belgium, December 11-13, 2012*

*In: KDD '12 Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 913-921*

*In: Advances in neural information processing systems, vol. 25: 26th Annual Conference on Neural Information Processing Systems 2012, NIPS 2012, Nevada, USA, December 3-8*

*In: Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), April 21-23, 2012, La Palma, Canary Islands, 2012, vol. 22, pp. 308-317*

*In: Advances in Neural Information Processing Systems, vol. 24: 25th Annual Conference on Neural Information Processing Systems 2011, Granada, Spain, December 12-14, 2011*
