Porting a poem generation system from French to English
The CR-PO system generates French poetry using language models. Users can then adapt these poems to their liking by changing the theme and emotion of the poem and by generating rhymes that follow a given scheme. The goal of this BSc thesis was to port the CR-PO system to English, to improve the system's quality, and to allow users to enter the first words of a poem.
To this end, the main achievements of this BSc thesis are:
- The creation of multiple corpora of English poetry: a general corpus used for poem generation and specific corpora for adjustment to themes and emotions.
  - The general corpus is extracted from Project Gutenberg, a library of free eBooks.
  - The specific corpora are built by collecting smaller corpora, either found online or obtained by web scraping, and training classifiers on them to sort the general corpus into 5 topic-specific and 3 emotion-specific sub-corpora.
- The training of language models based on more advanced technologies than those of the French CR-PO system, namely GPT-2 and RoBERTa.
  - GPT-2 replaces TextGenRNN and yields much better poem generation results. GPT-2 is also used to evaluate the possible rhymes and select the best one.
  - RoBERTa replaces CamemBERT and allows theme and emotion modification in English.
- The creation or adaptation of linguistic resources used by the system, such as a phonetic dictionary or a list of words with their association scores for specific emotions or themes.
- The adaptation of the project's code to integrate the new models and resources.
- A new feature that lets users optionally supply the beginning of the poem.
- The translation of the user interface into English.
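To make the rhyme step in the list above more concrete, here is a minimal sketch of how a phonetic dictionary could be used to find rhyming candidates and how a scoring function could pick one of them. Everything in it is an assumption made for illustration: the tiny `PHONES` dictionary (ARPAbet-style phonemes), the helper names `rhyme_part`, `rhymes` and `best_rhyme`, and the `score` callable, which merely stands in for a language-model score such as one computed with GPT-2. It is not the CR-PO implementation.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical miniature phonetic dictionary (ARPAbet-style phonemes,
# with stress digits on vowels). The format of the real CR-PO resource
# is not described in this abstract.
PHONES: Dict[str, List[str]] = {
    "night": ["N", "AY1", "T"],
    "light": ["L", "AY1", "T"],
    "delight": ["D", "IH0", "L", "AY1", "T"],
    "moon": ["M", "UW1", "N"],
    "june": ["JH", "UW1", "N"],
    "stone": ["S", "T", "OW1", "N"],
}


def rhyme_part(word: str) -> Tuple[str, ...]:
    """Return the phonemes from the last stressed vowel to the end of the word.

    Returns an empty tuple when the word is not in the dictionary.
    """
    phones = PHONES.get(word.lower())
    if phones is None:
        return ()
    # Scan backwards for a vowel carrying primary (1) or secondary (2) stress.
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":
            return tuple(phones[i:])
    # No stressed vowel found: fall back to the whole pronunciation.
    return tuple(phones)


def rhymes(a: str, b: str) -> bool:
    """Two distinct known words rhyme if their rhyme parts are identical."""
    part_a, part_b = rhyme_part(a), rhyme_part(b)
    return a.lower() != b.lower() and part_a != () and part_a == part_b


def best_rhyme(
    target: str,
    candidates: Sequence[str],
    score: Callable[[str], float],
) -> Optional[str]:
    """Keep the candidates that rhyme with `target` and return the best-scored one.

    `score` is a placeholder for a language-model score of the candidate in context.
    """
    valid = [w for w in candidates if rhymes(target, w)]
    return max(valid, key=score) if valid else None
```

For example, `best_rhyme("night", ["moon", "delight", "light"], score=len)` keeps only "delight" and "light" and then applies the toy scorer. In a real system, `score` would be replaced by a function that rates how well each candidate fits the line, and the dictionary by a full pronunciation resource.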
As a result, the system now has a working English version with generally better results than the French version. These results were made possible by the use of cutting-edge language models such as GPT-2 and RoBERTa and by the creation of larger datasets.
Student: Teo Ferrari
Year: 2022
Department: TIC
Degree programme: Computer Science and Communication Systems (formerly Computer Science), Software specialization
Study mode: Full-time
Supervisor: Andrei Popescu-Belis
Institute: IICT