PEOPLE@HES-SO – Directory and Skills Repository


Popescu-Belis Andrei


HES Associate Professor

Main competences

Natural language processing

Machine translation

Human-machine dialogue

Information retrieval

Machine Learning


Main contract

HES Associate Professor

Phone: +41 24 557 62 99

Office: B40

Haute école d'Ingénierie et de Gestion du Canton de Vaud
Route de Cheseaux 1, 1400 Yverdon-les-Bains, CH
HEIG-VD
Institute
IICT - Institut des Technologies de l'Information et de la Communication
BSc in Computer Science and Communication Systems - Haute école d'Ingénierie et de Gestion du Canton de Vaud
  • Natural language processing
  • Data preparation and characterization
  • Unsupervised machine learning
  • C/C++ programming
  • Algorithms and data structures
MSc HES-SO in Engineering - HES-SO Master
  • Text Analysis

Completed

Digital Lyric
AGP

Role: Principal investigator

Funding: UNIL - Faculté des lettres

Project description: The goal of this project is to design two systems for computer-assisted poem generation for an exhibition called Digital Lyric, held at the Château de Morges in spring 2020. HEIG-VD is a research partner of UNIL, through Prof. Antonio Rodriguez.

Research team within HES-SO: Perez Uribe Andres, Luthier Gabriel, Popescu-Belis Andrei, Ramirez Atrio Alejandro

Academic partners: IICT

Project duration: 01.05.2019 - 30.06.2020

Total project budget: 23'937 CHF

Status: Completed

2013

Multi-factor segmentation for topic visualization and recommendation: the MUST-VIS system
Scientific article

Chidansh Amitkumar Bhatt, Andrei Popescu-Belis, Maryam Habibi, Sandy Ingram, Stefano Masneri, Fergus McInnes, Nikolaos Pappas, Oliver Schreer

Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 365-368

Link to the publication

Abstract:

This paper presents the MUST-VIS system for the MediaMixer/VideoLectures.NET Temporal Segmentation and Annotation Grand Challenge. The system allows users to visualize a lecture as a series of segments represented by keyword clouds, with relations to other similar lectures and segments. Segmentation is performed using a multi-factor algorithm which takes advantage of the audio (through automatic speech recognition and word-based segmentation) and video (through the detection of actions such as writing on the blackboard). The similarity across segments and lectures is computed using a content-based recommendation algorithm. Overall, the graph-based representation of segment similarity appears to be a promising and cost-effective approach to navigating lecture databases.

Processing and linking audio events in large multimedia archives: The EU inEvent project
Scientific article

Sandy Ingram, Hervé Bourlard, Marc Ferras, Nikolaos Pappas, Andrei Popescu-Belis, Steve Renals, Fergus McInnes, Peter J. Bell, Maël Guillemot

First Workshop on Speech, Language and Audio in Multimedia, 2013

Link to the publication

Abstract:

In the inEvent EU project, we aim at structuring, retrieving, and sharing large archives of networked, and dynamically changing, multimedia recordings, mainly consisting of meetings, videoconferences, and lectures. More specifically, we are developing an integrated system that performs audiovisual processing of multimedia recordings, and labels them in terms of interconnected “hyper-events” (a notion inspired by hyper-texts). Each hyper-event is composed of simpler facets, including audio-video recordings and metadata, which are then easier to search, retrieve and share. In the present paper, we mainly cover the audio processing aspects of the system, including speech recognition, speaker diarization and linking (across recordings), the use of these features for hyper-event indexing and recommendation, and the search portal.

2023

Assessing the importance of frequency versus compositionality for subword-based tokenization in NMT
ArODES conference paper

Benoist Wolleb, Romain Silvestri, Giorgos Vernikos, Ljiljana Dolamic, Andrei Popescu-Belis

Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 12–15 June 2023, Tampere, Finland

Link to the conference

Abstract:

Subword tokenization is the de-facto standard for tokenization in neural language models and machine translation systems. Three advantages are frequently put forward in favor of subwords: shorter encoding of frequent tokens, compositionality of subwords, and ability to deal with unknown words. As their relative importance is not entirely clear yet, we propose a tokenization approach that enables us to separate frequency (the first advantage) from compositionality, thanks to the use of Huffman coding, which tokenizes words using a fixed amount of symbols. Experiments with CS-DE, EN-FR and EN-DE NMT show that frequency alone accounts for approximately 90% of the BLEU scores reached by BPE, hence compositionality has less importance than previously thought.
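The frequency-only tokenization described above can be illustrated with ordinary binary Huffman coding, which gives frequent words shorter codes regardless of any subword structure. This is a minimal sketch, not the authors' implementation (the paper uses a fixed-size symbol alphabet rather than the two symbols used here):

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build binary Huffman codes: frequent words get shorter codes.

    Illustrative sketch only -- the paper's tokenizer encodes words over
    a larger, fixed-size symbol alphabet, but the principle is the same:
    code length depends on frequency, not on subword compositionality.
    """
    # Heap entries: (frequency, tiebreaker, {word: partial code}).
    heap = [(f, i, {w: ""}) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Prefix the two subtrees with distinct symbols and merge them.
        merged = {w: "0" + code for w, code in c1.items()}
        merged.update({w: "1" + code for w, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

corpus = "the cat sat on the mat the cat".split()
codes = huffman_codes(Counter(corpus))
# "the" is the most frequent word, so its code is among the shortest
assert len(codes["the"]) <= min(len(c) for c in codes.values())
```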

A simplified training pipeline for low-resource and unsupervised machine translation
ArODES conference paper

Alex R. Atrio, Alexis Allemann, Ljiljana Dolamic, Andrei Popescu-Belis

Proceedings of the 6th Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)

Link to the conference

Abstract:

Training neural MT systems for low-resource language pairs or in unsupervised settings (i.e. with no parallel data) often involves a large number of auxiliary systems. These may include parent systems trained on higher-resource pairs and used for initializing the parameters of child systems, multilingual systems for neighboring languages, and several stages of systems trained on pseudo-parallel data obtained through back-translation. We propose here a simplified pipeline, which we compare to the best submissions to the WMT 2021 Shared Task on Unsupervised MT and Very Low Resource Supervised MT. Our pipeline only needs two parents, two children, and one round of back-translation for low-resource directions (two for unsupervised ones), and obtains better or similar scores when compared to more complex alternatives.

GPoeT: a language model trained for rhyme generation on synthetic data
ArODES conference paper

Andrei Popescu-Belis, Alex R. Atrio, Bastien Bernath, Étienne Boisson, Teo Ferrari, Xavier Theimer-Lienhardt, Giorgos Vernikos

Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Link to the conference

Abstract:

Poem generation with language models requires the modeling of rhyming patterns. We propose a novel solution for learning to rhyme, based on synthetic data generated with a rule-based rhyming algorithm. The algorithm and an evaluation metric use a phonetic dictionary and the definitions of perfect and assonant rhymes. We fine-tune a GPT-2 English model with 124M parameters on 142 MB of natural poems and find that this model generates consecutive rhymes infrequently (11%). We then fine-tune the model on 6 MB of synthetic quatrains with consecutive rhymes (AABB) and obtain nearly 60% of rhyming lines in samples generated by the model. Alternating rhymes (ABAB) are more difficult to model because of longer-range dependencies, but they are still learnable from synthetic data, reaching 45% of rhyming lines in generated samples.
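The perfect-rhyme definition used by the evaluation metric (identical phonemes from the last stressed vowel onward) can be sketched with a CMUdict-style phonetic dictionary. The few dictionary entries below are hand-written for the example; the paper's exact dictionary and rhyme rules may differ:

```python
# Tiny hand-written phonetic dictionary in ARPAbet notation, where the
# suffix "1" marks primary stress (hypothetical subset for this sketch).
PHONES = {
    "light":  ["L", "AY1", "T"],
    "night":  ["N", "AY1", "T"],
    "lights": ["L", "AY1", "T", "S"],
    "day":    ["D", "EY1"],
}

def rhyme_part(word):
    """Phonemes from the last stressed vowel (stress mark '1') onward."""
    phones = PHONES[word]
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):
            return tuple(phones[i:])
    return tuple(phones)

def perfect_rhyme(w1, w2):
    """Two distinct words rhyme perfectly if their rhyme parts match."""
    return w1 != w2 and rhyme_part(w1) == rhyme_part(w2)

assert perfect_rhyme("light", "night")
assert not perfect_rhyme("light", "day")
```

A generated quatrain can then be scored by checking `perfect_rhyme` on the final words of the line pairs required by the AABB or ABAB scheme.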

2022

On the interaction of regularization factors in low-resource neural machine translation
ArODES conference paper

Alex R. Atrio, Andrei Popescu-Belis

Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

Link to the conference

Abstract:

We explore the roles and interactions of the hyper-parameters governing regularization, and propose a range of values applicable to low-resource neural machine translation. We demonstrate that default or recommended values for high-resource settings are not optimal for low-resource ones, and that more aggressive regularization is needed when resources are scarce, in proportion to their scarcity. We explain our observations by the generalization abilities of sharp vs. flat basins in the loss landscape of a neural network. Results for four regularization factors corroborate our claim: batch size, learning rate, dropout rate, and gradient clipping. Moreover, we show that optimal results are obtained when using several of these factors, and that our findings generalize across datasets of different sizes and languages.

Constrained language models for interactive poem generation
ArODES conference paper

Andrei Popescu-Belis, Alex R. Atrio, Valentin Minder, Aris Xanthos, Gabriel Luthier, Simon Mattei, Antonio Rodriguez

Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)

Link to the conference

Abstract:

This paper describes a system for interactive poem generation, which combines neural language models (LMs) for poem generation with explicit constraints that can be set by users on form, topic, emotion, and rhyming scheme. LMs cannot learn such constraints from the data, which is scarce with respect to their needs even for a well-resourced language such as French. We propose a method to generate verses and stanzas by combining LMs with rule-based algorithms, and compare several approaches for adjusting the words of a poem to a desired combination of topics or emotions. An approach to automatic rhyme setting using a phonetic dictionary is proposed as well. Our system has been demonstrated at public events, and log analysis shows that users found it engaging.
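One simple way to adjust word choice toward a desired topic, in the spirit of the word-adjustment step described above, is to boost the language model's probabilities for on-topic words and renormalize. The probabilities, lexicon, and boost factor below are invented for illustration; the actual system combines a neural LM with rule-based algorithms:

```python
def rescore(candidates, topic_words, boost=2.0):
    """Multiply the probability of on-topic candidates by `boost`,
    then renormalize so the result is again a distribution."""
    scores = {w: p * (boost if w in topic_words else 1.0)
              for w, p in candidates.items()}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}

# Hypothetical next-word distribution from a language model.
lm_probs = {"moon": 0.2, "car": 0.5, "star": 0.3}
adjusted = rescore(lm_probs, topic_words={"moon", "star"})
assert adjusted["moon"] > lm_probs["moon"]  # on-topic words gain mass
```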

2021

Small batch sizes improve training of low-resource neural MT
ArODES conference paper

Alex R. Atrio, Andrei Popescu-Belis

Proceedings of ICON 2021: 18th International Conference on Natural Language Processing

Link to the conference

Abstract:

We study the role of an essential hyperparameter that governs the training of Transformers for neural machine translation in a low-resource setting: the batch size. Using theoretical insights and experimental evidence, we argue against the widespread belief that batch size should be set as large as allowed by the memory of the GPUs. We show that in a low-resource setting, a smaller batch size leads to higher scores in a shorter training time, and argue that this is due to better regularization of the gradients during training.

Subword mapping and anchoring across languages
ArODES conference paper

Giorgos Vernikos, Andrei Popescu-Belis

Findings of the Association for Computational Linguistics: EMNLP 2021

Link to the conference

Abstract:

State-of-the-art multilingual systems rely on shared vocabularies that sufficiently cover all considered languages. To this end, a simple and frequently used approach makes use of subword vocabularies constructed jointly over several languages. We hypothesize that such vocabularies are suboptimal due to false positives (identical subwords with different meanings across languages) and false negatives (different subwords with similar meanings). To address these issues, we propose Subword Mapping and Anchoring across Languages (SMALA), a method to construct bilingual subword vocabularies. SMALA extracts subword alignments using an unsupervised state-of-the-art mapping technique and uses them to create cross-lingual anchors based on subword similarities. We demonstrate the benefits of SMALA for cross-lingual natural language inference (XNLI), where it improves zero-shot transfer to an unseen language without task-specific data, but only by sharing subword embeddings. Moreover, in neural machine translation, we show that joint subword vocabularies obtained with SMALA lead to higher BLEU scores on sentences that contain many false positives and false negatives.
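The anchoring idea can be illustrated with a toy mutual-nearest-neighbour search over already-mapped subword embeddings. The vocabularies and vectors below are invented for the example; SMALA's actual alignment relies on an unsupervised mapping technique:

```python
from math import sqrt

# Hypothetical 2-D subword embeddings, assumed already mapped into a
# common space (real embeddings are high-dimensional).
src = {"ho": (1.0, 0.0), "use": (0.0, 1.0), "kat": (0.7, 0.7)}
tgt = {"haus": (0.9, 0.1), "katze": (0.6, 0.8), "und": (-0.5, 0.5)}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def anchors(src, tgt):
    """Return subword pairs that are mutual nearest neighbours,
    a simple criterion for cross-lingual anchor candidates."""
    best_tgt = {s: max(tgt, key=lambda t: cosine(src[s], tgt[t])) for s in src}
    best_src = {t: max(src, key=lambda s: cosine(src[s], tgt[t])) for t in tgt}
    return sorted((s, t) for s, t in best_tgt.items() if best_src[t] == s)

print(anchors(src, tgt))  # mutual pairs such as ("kat", "katze")
```

Non-mutual matches (here, "use" pointing at "katze") are discarded, which filters out the false positives the paper warns about.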

The IICT-Yverdon system for the WMT 2021 unsupervised MT and very low resource supervised MT task
ArODES conference paper

Àlex R. Atrio, Gabriel Luthier, Axel Fahy, Giorgos Vernikos, Andrei Popescu-Belis, Ljiljana Dolamic

Proceedings of EMNLP 2021 6th Conference on Machine Translation (WMT21)

Link to the conference

Abstract:

In this paper, we present the systems submitted by our team from the Institute of ICT (HEIG-VD / HES-SO) to the Unsupervised MT and Very Low Resource Supervised MT task. We first study the improvements brought to a baseline system by techniques such as back-translation and initialization from a parent model. We find that both techniques are beneficial and suffice to reach performance that compares with more sophisticated systems from the 2020 task. We then present the application of this system to the 2021 task for low-resource supervised Upper Sorbian (HSB) to German translation, in both directions. Finally, we present a contrastive system for HSB-DE in both directions, and for unsupervised German to Lower Sorbian (DSB) translation, which uses multi-task training with various training schedules to improve over the baseline.

2020

A consolidated dataset for knowledge-based question generation using predicate mapping of linked data
ArODES conference paper

Johanna Melly, Gabriel Luthier, Andrei Popescu-Belis

Proceedings of the 16th Joint ACL - ISO Workshop on Interoperable Semantic Annotation (ISA-16), 12 May 2020, Marseille, France

Link to the conference

Abstract:

In this paper, we present the ForwardQuestions data set, made of human-generated questions related to knowledge triples. This data set results from the conversion and merger of the existing SimpleDBPediaQA and SimpleQuestionsWikidata data sets, including the mapping of predicates from DBPedia to Wikidata, and the selection of ‘forward’ questions as opposed to ‘backward’ ones. The new data set can be used to generate novel questions given an unseen Wikidata triple, by replacing the subjects of existing questions with the new one and then selecting the best candidate questions using semantic and syntactic criteria. Evaluation results indicate that the question generation method using ForwardQuestions improves the quality of questions by about 20% with respect to a baseline not using ranking criteria.
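The subject-replacement scheme can be sketched as follows. The questions and predicate names are invented examples, not entries from the ForwardQuestions data set, and the real method additionally ranks the candidates with semantic and syntactic criteria:

```python
# Hypothetical store of human-written questions indexed by predicate,
# each paired with the subject it was originally written about.
questions = {
    "birthPlace": [("Einstein", "Where was Einstein born?")],
    "author": [("Hamlet", "Who wrote Hamlet?")],
}

def generate(predicate, new_subject):
    """Produce candidate questions for a new triple subject by reusing
    existing questions with the same predicate and swapping subjects.
    (The paper then ranks candidates; this sketch returns them all.)"""
    candidates = []
    for old_subject, question in questions.get(predicate, []):
        candidates.append(question.replace(old_subject, new_subject))
    return candidates

assert generate("birthPlace", "Marie Curie") == ["Where was Marie Curie born?"]
```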

2018

Self-attentive residual decoder for neural machine translation
ArODES conference paper

Lesly Miculicich Werlen, Nikolaos Pappas, Dhananjay Ram, Andrei Popescu-Belis

Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Link to the conference

Abstract:

Neural sequence-to-sequence networks with attention have achieved remarkable performance for machine translation. One of the reasons for their effectiveness is their ability to capture relevant source-side contextual information at each time-step prediction through an attention mechanism. However, the target-side context is solely based on the sequence model which, in practice, is prone to a recency bias and lacks the ability to capture effectively non-sequential dependencies among words. To address this limitation, we propose a target-side attentive residual recurrent network for decoding, where attention over previous words contributes directly to the prediction of the next word. The residual learning facilitates the flow of information from the distant past and is able to emphasize any of the previously translated words, hence it gains access to a wider context. The proposed model outperforms a neural MT baseline as well as a memory and self-attention network on three language pairs. The analysis of the attention learned by the decoder confirms that it emphasizes a wider context, and that it captures syntactic-like structures.

Machine translation of low-resource spoken dialects: strategies for normalizing Swiss German
ArODES conference paper

Pierre-Edouard Honnet, Andrei Popescu-Belis, Claudiu Musat, Michael Baeriswyl

Proceedings of the 11th Edition of the Language Resources and Evaluation Conference, Miyazaki (Japan), 7-12 May 2018

Link to the conference

Abstract:

The goal of this work is to design a machine translation (MT) system for a low-resource family of dialects, collectively known as Swiss German, which are widely spoken in Switzerland but seldom written. We collected a significant number of parallel written resources to start with, up to a total of about 60k words. Moreover, we identified several other promising data sources for Swiss German. Then, we designed and compared three strategies for normalizing Swiss German input in order to address the regional diversity. We found that character-based neural MT was the best solution for text normalization. In combination with phrase-based statistical MT, our solution reached a 36% BLEU score when translating from the Bernese dialect. This value, however, decreases as the testing data becomes more distant from the training data, geographically and topically. These resources and normalization techniques are a first step towards full MT of Swiss German dialects.
