PhD in Natural Language Processing: Statistical analyses of lexical distributions with an...

Updated: over 1 year ago
Job Type: Full-time
Deadline: 18 Oct 2022

This thesis is a partnership between the Laboratoire Interdisciplinaire des Sciences du Numérique (LISN, Université Paris-Saclay) and the International Research Laboratory on Learning Systems (ILLS, Montréal). A joint supervision with McGill University or the École de Technologie Supérieure (ETS) of Montreal under the co-direction of Pablo Piantanida (Director of the ILLS) is planned. The PhD student will share the academic year between LISN at Paris-Saclay University and ILLS in Montreal, which will facilitate collaborations with other researchers from Canadian institutions involved in ILLS (MILA, ETS, McGill University).

The doctoral school at the Université Paris-Saclay will be the ICST doctoral school, Pôle B (Data, knowledge, learning and interactions).

Forged texts and misinformation are ongoing, pervasive issues, amplified by biased software that reinforces only our existing opinions in the name of a “better”, more seamless user experience. On social media platforms, such texts are used by rogue states, businesses and individuals to spread misinformation, amplify doubts about factual data, or tarnish competitors and adversaries, thereby enhancing their own strategic or economic positions. This spread may result from different factors and incentives; however, each poses the same fundamental problem for humanity: confusion about what is true and what is false.

Deep learning models for large-scale text generation such as GPT-3 have seen widespread use in recent years owing to their superior performance over traditional generation methods, demonstrating an ability to produce text of great quality, coherence and relevance that is sometimes hard to distinguish from human productions. These models generate text via an autoregressive procedure that samples from a distribution learnt to mimic the "true" distribution of human-written texts. Malicious uses of these technologies thus constitute a major threat to truthful information.

Artificial text detection can be viewed as a special case of anomaly detection, broadly defined as the task of identifying examples that deviate from regular ones to a degree that arouses suspicion. Current research on anomaly detection largely focuses either on deep classifiers (e.g., out-of-distribution detection, adversarial attacks) or relies on the output of large language models (LMs) when labels are unavailable. Although these lines of research are appealing, they do not scale without requiring large amounts of compute. Additionally, these methods make the fundamental assumptions that (1) the statistical information needed to identify anomalies is available in the trained model, and (2) the model's uncertainty can be trusted, which is typically not the case, as illustrated in the presence of even a small shift in the input distribution. LM-based approaches also perform poorly on large text fragments, as may be needed in practical applications (e.g., novel, story or news generation), because of the fixed-length context used when training the language model.
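The anomaly-scoring idea underlying LM-based detection can be sketched with a toy stand-in: score a text by its average negative log-likelihood under a model fitted on "regular" text, and treat unusually surprising texts as suspicious. The unigram model, smoothing scheme and example sentences below are illustrative assumptions, not the actual methods studied in the thesis.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens, alpha=1.0):
    """Fit an additively smoothed unigram model on 'regular' text."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen tokens
    return lambda tok: (counts.get(tok, 0) + alpha) / (total + alpha * vocab)

def anomaly_score(tokens, prob):
    """Average negative log-likelihood: higher means more 'surprising'."""
    return sum(-math.log(prob(t)) for t in tokens) / len(tokens)

regular = "the cat sat on the mat and the dog lay on the rug".split()
prob = train_unigram(regular)

in_domain = "the dog sat on the mat".split()
off_domain = "quantum flux capacitor overload".split()

# Text far from the training distribution receives a higher score.
assert anomaly_score(off_domain, prob) > anomaly_score(in_domain, prob)
```

A real detector would replace the unigram model with a large autoregressive LM, but the scoring principle is the same, and it inherits the limitations noted above (trust in model uncertainty, fixed-length context).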

This PhD thesis focuses on developing hybrid anomaly detection methods combining deep neural network based techniques with linguistically inspired word frequency distributions. Most research on language models to date focuses on sentence-level processing and fails to capture long-range dependencies at the discourse level. Instead, we will leverage word frequency distributions and information measures to characterize long documents, which incorporate a very large number of rare words; this often leads to strange statistical phenomena, such as mean frequencies that systematically keep changing as the number of observations increases. Advanced concepts from statistics and information measures are necessary to analyse word frequency distributions and to capture document-level information. Extensive experiments on real-world datasets will be carried out to demonstrate the viability of our approach.
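The non-convergent mean frequency mentioned above is a classic large-number-of-rare-events effect: under a Zipf-like vocabulary, the number of distinct word types grows sublinearly with sample size, so the mean frequency per type (tokens divided by types) keeps increasing instead of stabilizing. A minimal sketch, assuming a synthetic Zipfian vocabulary rather than real corpus data:

```python
import random

random.seed(0)

# Synthetic Zipfian population: word of rank r has weight 1/r.
V = 50_000
population = list(range(1, V + 1))
weights = [1.0 / r for r in population]

def mean_frequency(n_tokens):
    """Mean frequency = tokens / distinct types in a sample of n tokens."""
    sample = random.choices(population, weights=weights, k=n_tokens)
    return n_tokens / len(set(sample))

# The statistic drifts upward with sample size rather than converging.
small = mean_frequency(1_000)
large = mean_frequency(100_000)
assert small < large
```

This is why standard estimators that assume stable means break down on long documents, motivating the dedicated statistical and information-theoretic tools the thesis will develop.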
