Bryan de Oliveira

Hi! I'm Bryan, a 29-year-old AI researcher and PhD student in Computer Science at the Federal University of Goiás (UFG), based in Goiânia, Brazil. I work at CEIA and AKCIT, where I lead interdisciplinary research teams at the intersection of Deep Reinforcement Learning and decision-making under uncertainty.

My work centers on agents that make good decisions under uncertainty — combining Deep RL, information-seeking behavior, and Bayesian methods. At AKCIT, I lead research on two main fronts: how agents can reason and act under uncertainty across different abstraction levels, from language models seeking information through dialogue to embodied agents planning in physical environments; and the empirical and theoretical foundations of modern RL algorithms. I also lead a team working on autonomous humanoid robots, covering teleoperation and sim-to-real transfer. Through CEIA, I additionally lead applied R&D with industry partners across energy, finance, and software development.

My research expertise spans Deep Reinforcement Learning, representation learning, world models, and autonomous decision-making. I have a proven ability to structure and execute complex R&D projects from conception to publication, complemented by extensive engineering experience developing and deploying ML systems at scale. This background gives me insight into both the theoretical foundations and the practical constraints of AI systems.

I'm deeply interested in the fundamental questions of artificial intelligence and neuroscience — particularly how we can develop AI systems that learn better representations and world models to effectively plan and adapt in uncertain environments. I believe in a principled, interdisciplinary approach to AI research that combines theory with rigorous empirical validation and real-world impact.

CONV-TO-BENCH: Evaluating Language Models via User–Assistant Dialogues in Code Tasks

ICLR 2026 — DATA-FM Workshop · April 2026 Benchmark Design

VM dos Santos, AC Castro, SLS Toledo, BML Calura, LCM Menezes, RCR Mata, TWL Soares, BLM de Oliveira

CONV-TO-BENCH creates evaluation benchmarks from raw user–assistant interaction logs in code tasks, enabling model assessment on realistic multi-turn dialogue data.

Partial Reasoning in Language Models: Search and Refinement Guided by Uncertainty

AAAI 2026 — LaCATODA Workshop · January 2026 RL under Uncertainty

ML da Luz, B Brandão, LGB Martins, G Oliveira, BLM de Oliveira, LC Melo, et al.

We propose partial reasoning in language models guided by uncertainty, using search and refinement to selectively truncate chain-of-thought while preserving performance.

Do Reasoning Models Ask Better Questions? A Formal Information-Theoretic Analysis on Multi-Turn LLM Games

AAAI 2026 — NeusymBridge Workshop · January 2026 RL under Uncertainty

DM Pedrozo, TWL Soares, BLM de Oliveira

A formal information-theoretic analysis on multi-turn LLM games evaluating whether reasoning models ask better questions than standard models.

Learning Without Critics? Revisiting GRPO in Classical Reinforcement Learning Environments

NeurIPS 2025 — LatinX in AI Workshop · November 2025 RL Fundamentals

BLM de Oliveira*, FV Frujeri*, MPCM Queiroz*, LGB Martins, TWL Soares, et al.

Theoretical and empirical study revisiting GRPO in classical RL environments, comparing it against PPO and analyzing the role of the critic.

Personalizing Fairness: Adaptive RL with User Diversity Preference for Recommender Systems

RLC 2025 — Workshop on Practical Insights into RL for Real-World Systems · August 2025 Applied RL

LGB Martins, BLM de Oliveira, B Brandão, TWL Soares, et al.

We propose an adaptive RL approach for recommender systems that personalizes fairness based on user diversity preferences.

Reinforcement Learning for Debt Pricing: A Case Study in Financial Services

RLC 2025 — Workshop on Practical Insights into RL for Real-World Systems · June 2025 Applied RL

B Brandão*, LGB Martins*, BLM de Oliveira*, LC Melo, ML da Luz, E Garcia, et al.

Offline RL with LTV-based rewards and bandit orchestration at a large financial institution improved collection values.

Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning

ICML 2025 (Proceedings of Machine Learning Research) · May 2025 RL Fundamentals

BLM de Oliveira, LGB Martins, B Brandão, ML da Luz, TWL Soares, LC Melo

SPGym extends the 8-tile puzzle to evaluate RL agents by scaling representation learning complexity while keeping environment dynamics fixed, revealing opportunities for advancing representation learning for decision-making research.

InfoQuest: Evaluating Multi-Turn Dialogue Agents for Open-Ended Conversations with Hidden Context (Oral)

RLC 2025 — RLBrew Workshop · March 2025 RL under Uncertainty

BLM de Oliveira, LGB Martins, B Brandão, LC Melo

A benchmark for evaluating how LLMs handle ambiguous open-ended requests through dialogue, revealing that current models struggle to ask effective clarifying questions.

PulseRL: Enabling Offline Reinforcement Learning for Digital Marketing Systems via Conservative Q-Learning

NeurIPS 2021 — 2nd Offline RL Workshop (Oral) · October 2021 Applied RL

LC Melo*, LGB Martins*, BLM de Oliveira*, B Brandão*, DW Soares, et al.

PulseRL is an offline reinforcement learning system for optimizing communication channels in Digital Marketing Systems (DMS) using Conservative Q-Learning (CQL). It learns from historical data, avoiding costly online interactions, and reduces bias from out-of-distribution actions. In real-world DMS experiments, PulseRL outperformed RL baselines, demonstrating its effectiveness at scale.

These are a few of my most relevant works. Other projects can be found on GitHub and LinkedIn.