
Job offer

  • Posted on: 25 March 2026

Research Internship / Priming Preference Elicitation with Goal-Conditioned Reinforcement Learning (F/M)


Job Information

Organisation/Company
Inria, the French national research institute for the digital sciences
Research Field
Computer science
Researcher Profile
First Stage Researcher (R1)
Application Deadline
Country
France
Type of Contract
Temporary
Job Status
Full-time
Hours Per Week
38.5
Offer Starting Date
Is the job funded through the EU Research Framework Programme?
Not funded by an EU programme
Reference Number
2026-09717
Is the Job related to staff position within a Research Infrastructure?
No

Offer Description

At Scool, Gautron (2022) turned a high-fidelity crop simulator into an RL environment. In this problem, an AI advises a farmer throughout a harvesting season, deciding daily how much the farmer should water, fertilize, and so on, with the goal of striking a balance between several criteria, such as yield and nitrate pollution, under varying weather conditions. By running an off-the-shelf deep RL algorithm such as PPO (Schulman et al., 2017), Gautron (2022) showed that RL can find more efficient solutions than human expert policies. However, the main drawback of the current decision support system is that it provides recommendations under a pre-defined trade-off between the different criteria (such as yield, pollution or workload) and thus cannot adapt to the varying needs of individual farmers. An existing solution in the literature is to wrap a preference elicitation mechanism around an RL solver, allowing non-RL-expert users to tune the reward function to their needs while only interacting with the AI at a very abstract level. This is so-called preference-based RL (PbRL, Wirth et al. (2017)), also known as RL from human feedback (RLHF). These methods have seen a recent surge of popularity, as they were shown to be useful for training large language models (Ouyang et al., 2022). An illustration of the RLHF framework is given in Figure 1, and a survey can be found in Wirth et al. (2017).
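As a toy illustration of the PbRL loop described above, the sketch below fits a linear reward model to pairwise preferences under a Bradley-Terry choice model. All quantities (trajectory features, the hidden "user", the query set) are synthetic stand-ins invented for the example, not gym-dssat data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trajectory features (e.g. summaries of yield, pollution,
# workload over a season) -- illustrative only, not gym-dssat outputs.
n_traj, n_feat = 50, 3
features = rng.normal(size=(n_traj, n_feat))

# A hidden "user" whose preferences the reward model tries to recover.
true_w = np.array([1.0, 0.5, 0.2])
utility = features @ true_w

# Simulate pairwise preference queries: the user prefers the
# higher-utility trajectory (noiseless responses, for clarity).
pairs = rng.integers(0, n_traj, size=(200, 2))
prefs = (utility[pairs[:, 0]] > utility[pairs[:, 1]]).astype(float)

# Fit a linear reward model by maximizing the Bradley-Terry
# log-likelihood with plain gradient ascent.
w = np.zeros(n_feat)
for _ in range(500):
    diff = features[pairs[:, 0]] - features[pairs[:, 1]]
    p = 1.0 / (1.0 + np.exp(-diff @ w))  # P(first item preferred)
    w += 0.5 * diff.T @ (prefs - p) / len(pairs)

# The learned reward should rank trajectories like the hidden utility.
corr = np.corrcoef(features @ w, utility)[0, 1]
print(f"rank agreement (corr): {corr:.3f}")
```

In a full PbRL loop this reward-fitting step would alternate with policy optimization, which is precisely the costly part the internship aims to accelerate.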

Despite continuous efforts to improve PbRL algorithms (Hu et al., 2024; Zhu et al., 2025; Driss et al.), in their current state they remain inadequate for real-world applications such as the aforementioned task. One limitation of current PbRL methods is that they perform a costly deep-RL policy optimization step between query rounds, making the overall interaction potentially last several hours. The idea of this M2 internship is to exploit a specificity of the task, namely that preferences can be expressed as proximity to a goal (a vector containing a target average crop yield, amount of used fertilizer, etc.), and to use an unsupervised training phase with goal-conditioned RL (Liu et al., 2022) to learn quantities and models that can speed up PbRL. These models include, for instance, a prior over possible goals, a set of pre-computed queries, and a goal-conditioned policy able to reach target goals and adapt faster to specific user preferences.
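The goal-as-preference idea can be sketched concretely: outcomes reached during a (simulated) unsupervised phase are used to fit a crude goal prior, a user's preference is a negative distance to a target vector, and an infeasible user target is projected back onto the prior. All dimensions and numbers below are illustrative assumptions, not gym-dssat values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Outcome vectors (yield, fertilizer used, irrigation) that a
# goal-conditioned policy might reach during an unsupervised training
# phase -- synthetic stand-ins for simulator rollouts.
achieved_goals = rng.normal(loc=[8.0, 120.0, 300.0],
                            scale=[1.0, 15.0, 40.0], size=(1000, 3))

# A crude "goal prior": a Gaussian fitted to achieved outcomes,
# describing which targets look reachable before any user interaction.
mu = achieved_goals.mean(axis=0)
cov = np.cov(achieved_goals, rowvar=False)

def goal_reward(outcome, goal):
    """User preference expressed as (negative) distance to a goal."""
    return -np.linalg.norm(np.asarray(outcome) - np.asarray(goal))

def project_to_prior(target, n_candidates=5000):
    """Replace an infeasible user target with the nearest goal sampled
    from the prior, i.e. the closest outcome believed achievable."""
    candidates = rng.multivariate_normal(mu, cov, size=n_candidates)
    dists = np.linalg.norm(candidates - target, axis=1)
    return candidates[np.argmin(dists)]

# A user asks for an outcome far outside what the simulator delivers;
# the prior snaps it back to something reachable.
user_target = np.array([15.0, 0.0, 100.0])
feasible = project_to_prior(user_target)
print("requested:", user_target, "-> feasible:", np.round(feasible, 1))
```

A goal-conditioned policy trained in the unsupervised phase could then be conditioned directly on the projected goal, avoiding a fresh deep-RL optimization for each user.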

For more information, please see Scool's job offer website https://team.inria.fr/scool/job-offers/

Main Activities:

  • Perform a literature review of PbRL and goal-conditioned RL and propose a model of users and user responses in PbRL from a goal-conditioned RL perspective.
  • Use existing goal-conditioned RL algorithms on the gym-dssat task to learn goal-conditioned policies and a prior over achievable goals.
  • Propose an approach that uses the goal-conditioned policy and the goal prior to speed up query generation and policy optimization in PbRL. Compare against existing PbRL baselines and evaluate improvements in compute time and reduction in the number of queries.

Optional Activities:

  • Review the preference-elicitation literature with a focus on Bayesian approaches such as (Viappiani and Boutilier, 2010).
  • With the previously learned prior and user response model, develop a Bayesian query generation mechanism and evaluate its performance against existing query selection mechanisms in PbRL.
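One way such a Bayesian query-generation mechanism might look, under strong simplifying assumptions (a small discrete set of goal hypotheses and a Bradley-Terry response model, both invented for this sketch), is greedy selection of the pairwise query that minimizes the expected posterior entropy over goals:

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete hypothesis space: candidate goal vectors with a uniform
# prior (in practice this would come from the learned goal prior).
goals = rng.normal(size=(8, 3))
prior = np.full(len(goals), 1.0 / len(goals))

# Candidate items for a pairwise query (e.g. trajectory summaries).
items = rng.normal(size=(20, 3))

def response_prob(goal, a, b, beta=2.0):
    """Bradley-Terry user model: P(prefer item a over item b | goal),
    with utility defined as negative distance to the goal."""
    ua = -np.linalg.norm(items[a] - goal)
    ub = -np.linalg.norm(items[b] - goal)
    return 1.0 / (1.0 + np.exp(-beta * (ua - ub)))

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def expected_posterior_entropy(a, b):
    """Expected entropy of the goal posterior after asking query (a, b)."""
    likes_a = np.array([response_prob(g, a, b) for g in goals])
    p_a = (prior * likes_a).sum()            # marginal P(answer = a)
    post_a = prior * likes_a / p_a
    post_b = prior * (1 - likes_a) / (1 - p_a)
    return p_a * entropy(post_a) + (1 - p_a) * entropy(post_b)

# Greedy selection: ask the pair whose answer is most informative.
pairs = [(a, b) for a in range(len(items)) for b in range(a + 1, len(items))]
best = min(pairs, key=lambda q: expected_posterior_entropy(*q))
print("most informative query:", best)
```

Pre-computing such queries offline, using the goal prior learned in the unsupervised phase, is one way the interaction time with the user could be reduced.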

Requirements

Languages
FRENCH
Level
Basic
Languages
ENGLISH
Level
Good

Additional Information

Benefits
  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave (on a full time annual basis): 7 weeks of annual leave
  • Possibility of teleworking and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage

In accordance with current regulations

Selection process

Please send your CV and cover letter

Website for additional job details

Work Location(s)

Number of offers available
1
Company/Institute
Inria
Country
France
City
Villeneuve d'Ascq

Contact

City
LE CHESNAY CEDEX
Website
Street
Domaine de Voluceau - Rocquencourt
Postal Code
78153
