- JOB
- France
Job Information
- Organisation/Company
- Inria, the French national research institute for the digital sciences
- Research Field
- Computer science
- Researcher Profile
- First Stage Researcher (R1)
- Application Deadline
- Country
- France
- Type of Contract
- Temporary
- Job Status
- Full-time
- Hours Per Week
- 38.5
- Offer Starting Date
- Is the job funded through the EU Research Framework Programme?
- Not funded by an EU programme
- Reference Number
- 2026-09717
- Is the Job related to staff position within a Research Infrastructure?
- No
Offer Description
At Scool, Gautron (2022) has turned a high-fidelity crop simulator into an RL environment. In this problem, an AI advises a farmer throughout a harvesting season, deciding daily how much the farmer should water, fertilize, and so on, with the goal of striking a balance between several criteria, such as yield or nitrate pollution, under varying weather conditions. By running an off-the-shelf deep RL algorithm such as PPO (Schulman et al., 2017), Gautron (2022) showed that RL can find more efficient solutions than human expert policies. However, the main drawback of the current decision support system is that it provides recommendations under a pre-defined trade-off between the different criteria (such as yield, pollution, or workload) and thus cannot adapt to the varying needs of individual farmers. An existing solution in the literature is to wrap an RL solver around a preference elicitation mechanism, allowing non-RL-expert users to tune the reward function to their needs while interacting with the AI only at a very abstract level. This is so-called preference-based RL (PbRL, Wirth et al. (2017)), also known as RL from human feedback (RLHF). These methods have seen a recent surge of popularity, as they were shown to be useful for training large language models (Ouyang et al., 2022). An illustration of the RLHF framework is given in Figure 1, and a survey can be found in Wirth et al. (2017).
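To make the PbRL idea concrete, here is a minimal sketch (not taken from the job offer) of the reward-learning step: trajectories are summarized by hypothetical feature vectors (yield, pollution, workload), a simulated user answers pairwise queries, and a linear reward model is fit with the Bradley-Terry preference model, a common choice in the PbRL literature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption for illustration: each season's trajectory is summarized by a
# feature vector (yield, nitrate pollution, workload), and the user's hidden
# utility is linear in these features.
true_w = np.array([1.0, -0.5, -0.2])  # hypothetical user: values yield, dislikes pollution/work

def user_prefers(f_a, f_b):
    # Simulated user feedback: prefers the trajectory with higher hidden utility.
    return f_a @ true_w > f_b @ true_w

# Fit reward weights w from pairwise preferences under the Bradley-Terry
# model: P(a preferred over b) = sigmoid(w . (f_a - f_b)).
w = np.zeros(3)
lr = 0.5
for _ in range(2000):
    f_a, f_b = rng.normal(size=3), rng.normal(size=3)  # two candidate trajectories
    y = 1.0 if user_prefers(f_a, f_b) else 0.0
    d = f_a - f_b
    p = 1.0 / (1.0 + np.exp(-w @ d))
    w += lr * (y - p) * d  # gradient ascent on the preference log-likelihood

# The learned reward direction should align with the hidden user utility.
cos = w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w))
```

In a full PbRL loop, a policy optimization step (e.g. PPO) would run on the learned reward between query rounds; that step is precisely the costly part this internship aims to speed up.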
Despite continuous efforts to improve PbRL algorithms (Hu et al., 2024; Zhu et al., 2025; Driss et al.), in their current state they remain inadequate for real-world applications such as the aforementioned task. One of the limitations of current PbRL methods is that they use a costly policy optimization step with deep RL between each query round, making the overall interaction last potentially several hours. The idea of this M2 internship is to exploit a specificity of the task, namely that preferences can be expressed as proximity to a goal (a vector containing a target average crop yield, amount of fertilizer used, etc.), and to use an unsupervised training phase with goal-conditioned RL (Liu et al., 2022) to learn quantities and models that can speed up PbRL. These models include, for instance, a prior over possible goals, a set of pre-computed queries, and a goal-conditioned policy able to reach target goals and adapt faster to specific user preferences.
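The "preferences as proximity to a goal" idea above can be sketched as a goal-conditioned reward. The formulation below is an illustrative assumption, not the internship's prescribed method: outcomes and goals are vectors of season-level targets, and the reward is the negative distance to the goal.

```python
import numpy as np

def goal_conditioned_reward(outcome, goal):
    # Negative Euclidean distance to the goal vector: the reward is
    # maximized (zero) when the season's outcome matches the target.
    return -np.linalg.norm(np.asarray(outcome, dtype=float)
                           - np.asarray(goal, dtype=float))

# Hypothetical goal: target yield and fertilizer use for one season.
goal = np.array([8.0, 1.2])

# An outcome close to the goal earns a higher reward than a distant one,
# so a policy conditioned on g can be reused for any user-specified target.
r_near = goal_conditioned_reward([7.5, 1.5], goal)
r_far = goal_conditioned_reward([5.0, 3.0], goal)
```

A policy trained on many sampled goals during the unsupervised phase could then be specialized to a particular user's goal without retraining from scratch, which is the intended source of speed-up.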
For more information, please see Scool's job offer website https://team.inria.fr/scool/job-offers/
Main Activities:
- Perform a literature review of PbRL and goal-conditioned RL and propose a model of users and user responses in PbRL from a goal-conditioned RL perspective.
- Use existing goal-conditioned RL algorithms on the gym-dssat task to learn goal-conditioned policies and a prior over achievable goals.
- Propose an approach for using the goal-conditioned policy and the goal prior to speed up query generation and policy optimization in PbRL. Compare against existing PbRL baselines and evaluate improvements in compute time and reductions in the number of queries.
Optional Activities:
- Review the literature on preference elicitation with a focus on Bayesian approaches such as (Viappiani and Boutilier, 2010).
- With the previously learned prior and user response model, develop a Bayesian query generation mechanism and evaluate its performance against existing query selection mechanisms in PbRL.
Where to apply
Requirements
- Languages
- FRENCH
- Level
- Basic
- Languages
- ENGLISH
- Level
- Good
Additional Information
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave (on a full time annual basis): 7 weeks of annual leave
- Possibility of teleworking and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage
In accordance with current regulations
Please send your CV and cover letter
- Website for additional job details
Work Location(s)
- Number of offers available
- 1
- Company/Institute
- Inria
- Country
- France
- City
- Villeneuve d'Ascq
- Geofield
Contact
- City
- LE CHESNAY CEDEX
- Website
- Street
- Domaine de Voluceau - Rocquencourt
- Postal Code
- 78153