Fully funded NLP/ML PhD position (M/F): Language modeling under distribution shifts

Location: Orsay, Île-de-France
Job Type: Full-time
Deadline: 17 Jun 2024

24 May 2024
Job Information
Organisation/Company
CNRS
Department
Laboratoire Interdisciplinaire des Sciences du Numérique
Research Field
Engineering
Computer science
Mathematics
Researcher Profile
First Stage Researcher (R1)
Country
France
Application Deadline
17 Jun 2024 - 23:59 (UTC)
Type of Contract
Temporary
Job Status
Full-time
Hours Per Week
35
Offer Starting Date
1 Oct 2024
Is the job funded through the EU Research Framework Programme?
Not funded by an EU programme
Is the job related to a staff position within a Research Infrastructure?
No

Offer Description

The PhD will take place at Université Paris-Saclay (LISN), with the possibility of joining the laboratory of one of the other supervisors, depending on the candidate's geographical preferences.

The candidate will meet with their supervisors at least once per week.

This PhD is part of the InExtenso project led by Karën Fort: https://anr-inextenso.loria.fr/
As such, the candidate will be able to collaborate with a large team of researchers, including other PhD students, and will attend project meetings and conferences in France and abroad.

The candidate will have access to the GPU clusters of Université Paris-Saclay. Additional resources may be provided by the Jean-Zay supercomputer.

Large language models (LLMs) and deep contextual embeddings (e.g., BERT and its variants) are trained on large corpora, usually extracted from various web sources. These models may produce undesirable outputs, especially due to distribution shifts between the training data and downstream applications: web corpora may not be representative of actual use cases. Although it is possible to train a model from scratch or to fine-tune existing models on specialized datasets, this requires a substantial engineering effort by large teams of researchers (see, for example, CroissantLLM [1] and SaulLM-7B [2]).

In the case of text generation, a common approach is to align the model with human preferences, either via reinforcement learning [3] or via minimization of a supervised loss [4]. Unfortunately, this requires datasets annotated with human preferences. In practice, for cost reasons, these datasets are often collected ahead of training, using outputs of models other than the one being aligned. In other words, common alignment methods do not benefit from online feedback [5].
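
As an illustration, the supervised approach of [4] (Direct Preference Optimization) reduces alignment to a classification-style loss over preference pairs. A minimal PyTorch sketch, assuming sequence-level log-probabilities have already been computed for the trainable policy and for a frozen reference model (function and argument names are ours):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Each argument: tensor of shape (batch,) holding the total
    # log-probability of the preferred/rejected response under
    # the trainable policy or the frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Binary classification: the preferred response should win.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()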

The aim of this PhD project is to propose novel methods to build distributionally robust LLMs and/or BERT-like models. Instead of relying on datasets labeled with human preferences, we will explore the use of existing resources such as databases, lexicons, and linguistic knowledge. Depending on the candidate's interests, downstream applications may include:

- adaptation to specialized domains (biomedical data, legal data);
- improving multilingual support (in particular for languages under-represented in training data);
- mitigating gender bias;
- etc.

To this end, two research directions can be considered:

- training/fine-tuning strategies that could be used to improve the robustness of models (see for example [6, 7, 8] and the first sketch below);
- test-time modifications of the model that allow adding preferences and constraints on the generated outputs ([9, 10, 11]; see the second sketch below).
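
For the first direction, group distributionally robust optimization [8] is a concrete starting point: instead of minimizing the average loss, the optimizer up-weights the groups (domains, languages, demographic slices) on which the model currently does worst. A minimal sketch of the online group-weight update of [8] (class and variable names are ours):

import torch

class GroupDRO:
    def __init__(self, n_groups, eta=0.01):
        self.q = torch.ones(n_groups) / n_groups  # weight per group
        self.eta = eta                            # step size on the weights

    def loss(self, group_losses):
        # group_losses: tensor of shape (n_groups,), mean loss per group
        # on the current batch. Exponentiated-gradient ascent pushes
        # weight towards the currently worst groups.
        self.q = self.q * torch.exp(self.eta * group_losses.detach())
        self.q = self.q / self.q.sum()
        # Weighted loss; gradients flow through group_losses only.
        return (self.q * group_losses).sum()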

Priority will be given to methods that are frugal in terms of computing resources.
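
Test-time approaches are attractive in this respect, since they steer a frozen model at decoding time without any retraining. As a deliberately simplified illustration of where such constraints plug in (far simpler than the methods of [9, 10, 11]), a Hugging Face LogitsProcessor that forbids a given set of token ids during generation:

from transformers import LogitsProcessor

class ForbidTokensProcessor(LogitsProcessor):
    def __init__(self, forbidden_ids):
        self.forbidden_ids = list(forbidden_ids)

    def __call__(self, input_ids, scores):
        # scores: (batch, vocab_size) next-token logits; setting a
        # logit to -inf gives that token zero probability.
        scores[:, self.forbidden_ids] = float("-inf")
        return scores

Such a processor can be passed to model.generate() through a LogitsProcessorList, leaving the model weights untouched.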

The candidate is expected to have either a strong computer science background with an interest in applied mathematics, or a strong mathematical background with some knowledge of deep learning and of Python and PyTorch. As the goal of the project is to propose novel methods, the candidate is expected to be able to develop their own code and to hack into libraries such as PyTorch and Hugging Face's transformers, i.e., the project will require going beyond the simple use of existing tools.

Outcomes of this project are expected to be published in the main natural language processing conferences/journals (*ACL/EMNLP/TACL) and/or main machine learning conferences/journals (NeurIPS/ICLR/ICML/AISTATS/TMLR).

References

[1] CroissantLLM: A Truly Bilingual French-English Language Model (Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, François Yvon, André F.T. Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo) https://arxiv.org/abs/2402.00786

[2] SaulLM-7B: A pioneering Large Language Model for Law (Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre F. T. Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado, Michael Desa) https://arxiv.org/abs/2403.03883

[3] Deep reinforcement learning from human preferences (Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei) https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad50…

[4] Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn) https://arxiv.org/abs/2305.18290

[5] Direct Language Model Alignment from Online AI Feedback (Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, Mathieu Blondel) https://arxiv.org/abs/2402.04792

[6] Distributionally Robust Models with Parametric Likelihood Ratios (Paul Michel, Tatsunori Hashimoto, Graham Neubig) https://arxiv.org/abs/2204.06340

[7] Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets (Zhang-Wei Hong, Aviral Kumar, Sathwik Karnik, Abhishek Bhandwaldar, Akash Srivastava, Joni Pajarinen, Romain Laroche, Abhishek Gupta, Pulkit Agrawal) https://proceedings.neurips.cc/paper_files/paper/2023/file/0ff3502bb295…

[8] Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization (Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, Percy Liang) https://arxiv.org/abs/1911.08731

[9] Tractable Control for Autoregressive Language Generation (Honghua Zhang, Meihua Dang, Nanyun Peng, Guy Van den Broeck) https://arxiv.org/abs/2304.07438

[10] Gradient-Based Constrained Sampling from Language Models (Sachin Kumar, Biswajit Paria, Yulia Tsvetkov) https://aclanthology.org/2022.emnlp-main.144/

[11] Structured Voronoi Sampling (Afra Amini, Li Du, Ryan Cotterell) https://arxiv.org/abs/2306.03061


Requirements
Research Field
Engineering
Education Level
Master Degree or equivalent

Research Field
Computer science
Education Level
Master Degree or equivalent

Research Field
Mathematics
Education Level
Master Degree or equivalent

Languages
FRENCH
Level
Basic

Research Field
Engineering
Years of Research Experience
None

Research Field
Computer science
Years of Research Experience
None

Research Field
Mathematics
Years of Research Experience
None

Additional Information
Website for additional job details
https://emploi.cnrs.fr/Offres/Doctorant/UMR9015-PIEZWE-007/Default.aspx

Work Location(s)
Number of offers available
1
Company/Institute
Laboratoire Interdisciplinaire des Sciences du Numérique
Country
France
City
ORSAY CEDEX

Where to apply
Website
https://emploi.cnrs.fr/Candidat/Offre/UMR9015-PIEZWE-007/Candidater.aspx

Contact
City
ORSAY CEDEX
Website
http://www.lisn.upsaclay.fr
