Building On-Device Browser Agents: Fine-Tuning Small LLMs with gRPO and BrowserGym
Writer
Quiz available
Take a quick quiz for this article.
The open-source AI community is experiencing a massive shift. We are moving away from models that simply overfit on abstract text benchmarks and moving toward agentic systems capable of solving real-world, multimodal tasks. One of the most promising frontiers is automated browser control—training models to navigate complex UIs, complete forms, and execute workflows entirely on their own.
However, most solutions currently on the market rely on massive, closed-source models running on remote servers. This introduces significant privacy concerns (do you want to send your bank account HTML to a remote API?).
In this post, we will explore a more private, efficient approach: fine-tuning a small, fast model (Liquid AI’s LFM-2 350M) to execute browser control tasks locally. We will cover the infrastructure, the algorithms, and the exact tech stack required to make this happen using Reinforcement Learning (gRPO).
1. The “Why”: Browser Automation and the RL Advantage
Browser control models ingest an environment state and an instruction, and output precise system actions.
- Text-based Input: HTML State + Instruction -> Action (e.g., “click element 13”).
- Multimodal/VLM Input: Screenshot + Instruction -> Action (the future trend of agentic AI).
While the positive use cases are immense—such as accessibility tools that automate healthcare appointment managers for the elderly or personal agents that securely execute automated personal bill payments—the technology also presents severe security risks. A prime example is scalable bot armies capable of bypassing standard UI bot-protections to post fake reviews or create fake accounts. Understanding how to build these systems is crucial to defending against them.
Why SFT Fails and Reinforcement Learning Shines
If you want to train a model to navigate a website, your first instinct might be Supervised Fine-Tuning (SFT).
The Data Problem: SFT requires a massive dataset of perfect 'happy path' input/output examples. Collecting comprehensive, diverse examples of these paths is practically impossible given the sheer number of UI ramifications.
As Andrej Karpathy recently joked, “Reinforcement learning is terrible. It just happens that anything we had before is even worse.”
We use RL because browser control tasks have a major Verification Advantage: they are hard to provide exact step-by-step examples for, but exceptionally easy to verify. Did the model successfully book the ticket? If yes, Reward = 1. If no, Reward = 0.
Furthermore, RL teaches a model Self-Correction. During RL rollouts, the model is forced to make mistakes and learn to recover. It will inevitably click the wrong button, realize its mistake, and navigate back. SFT models, having only seen perfect paths, panic and freeze when they encounter non-happy-path scenarios in production.

2. The Three Building Blocks of the RL Stack
To train our agent, we need three distinct components interacting in a continuous loop:
- The Environment (BrowserGym): A sandbox holding the UI state and providing verifiable rewards (1 for success, 0 for failure). We use BrowserGym hosted on a Hugging Face Space.
- The Policy (LFM-2 350M): The language model acting as the actor making decisions. We chose a 350-million parameter model because it is small enough to train quickly and deploy on edge devices, enabling complete data privacy.
- The Optimizer/Algorithm (gRPO): We use Group Relative Policy Optimization (gRPO), an algorithm popularized by DeepSeek.

gRPO vs. PPO
Unlike older algorithms like PPO (Proximal Policy Optimization) which require holding a second language model in memory to compute a “value function”, gRPO relies on group-level relative rewards. This removes the need for the second model, effectively saving massive amounts of VRAM and making it viable for smaller compute budgets.
3. The Training Loop Mechanics
The orchestration happens via Hugging Face’s TRL library. Here is how a single step works:
- Rollouts: The model is prompted to generate trajectories. For example, 4 rollouts, capped at a max of 10 steps each to prevent infinite loops.
- Sparse Rewards: Initially, the model behaves randomly. Most trajectories fail, returning sparse rewards (0s). Occasionally, it succeeds, returning a 1.
- Loss & Update: The gRPO algorithm calculates the loss from these sparse rewards, performs backward propagation, adjusts parameters, and updates the active model for the next loop.
4. Infrastructure & Implementation: Tips from the Trenches
Executing RL is notoriously fragile. Here is the highly optimized tech stack and the “gotchas” to watch out for.
Lightning-Fast Tooling: uv and vLLM
- uv by Astral: We manage our environment entirely with
uv, arguably the lightning-fast Python package manager available today, ensuring rapid dependency resolution. - vLLM for Inference: During the RL loop, the model must generate fast rollouts. Standard Hugging Face Transformers are far too slow for this. You must wrap your inference generation in vLLM for high-throughput batching.
Serverless Compute with Modal
Instead of wrestling with AWS/GCP instances, we use Modal (Serverless GPUs) to rent compute (like A100s or L40s) per second, completely orchestrated via Python code.

- Collocation: To save money, we collocate the gRPO trainer (backward-pass logic) and the vLLM inference engine on the same GPU using Modal’s serverless functions. This keeps the CPU-bound environment separate.
- Volume Caching: Because Modal spins instances up and down, redownloading the base model from Hugging Face every time is inefficient. We attach a persistent storage volume to cache the downloaded HF weights.
- Logging with Secrets: Track everything via Weights & Biases (W&B) using securely injected environment variable secrets.
Hard-Won Engineering Tips
Decouple Configuration: Use YAML files for hyperparameters instead of hardcoding them in your Python scripts. This is the first step toward mature LLMOps.
Dependency Pinning: Actively developing libraries (like BrowserGym) must be pinned to exact Git commit hashes to avoid breaking changes next week.
Log Everything: RL loss curves are notoriously unhelpful. Track your Average Reward per Step. For a simple task like mini_world_of_bits, you should see average reward jump from ~0.75 to 1.0 within a few steps. Complex tasks will be much slower.
5. Conclusion and Evaluation
Evaluating RL models often breaks traditional ML rules: testing on the same environment used for training (a quirk of RL). While this feels like “cheating,” it is a necessary sanity check to ensure the model has established a solid behavioral baseline before specialized downstream finetuning.
The end goal isn’t necessarily to deploy the model directly into a novel production environment purely off this initial run. Instead, by fine-tuning on a benchmark like BrowserGym with RL, you “warm up” the model. You establish a foundational understanding of navigating DOM trees and rectifying mistakes. From there, taking this base model and fine-tuning it further on your specific domain requires vastly less data and yields incredibly robust, private, and localized AI agents.
Related Articles
More articles coming soon...