Posted by Matthew McCullough, VP of Product Management, Android Developer
We want to make it faster and easier for you to build high-quality Android apps, and one way we’re helping you be more productive is by putting AI at your fingertips. We know you want AI that truly understands the nuances of the Android platform, which is why we’ve been measuring how LLMs perform Android development tasks. Today we released the first version of Android Bench, our official leaderboard of LLMs for Android development.
Our goal is to provide model creators with a benchmark to evaluate LLM capabilities for Android development. By establishing a clear, reliable baseline for what high-quality Android development looks like, we're helping model creators identify gaps and accelerate improvements. That gives developers a wider range of helpful models to choose from for AI assistance, and ultimately leads to higher-quality apps across the Android ecosystem.
Designed with real-world Android development tasks
We created the benchmark by curating a task set that spans a range of common Android development areas. It is composed of real challenges of varying difficulty, sourced from public GitHub Android repositories. Scenarios include resolving breaking changes across Android releases, domain-specific tasks like networking on wearables, and migrating to the latest version of Jetpack Compose, to name a few.
In each evaluation, an LLM attempts to fix the issue reported in the task, and we verify the result using unit or instrumentation tests. This model-agnostic approach allows us to measure a model's ability to navigate complex codebases, understand dependencies, and solve the kinds of problems you encounter every day.
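The verification loop described above can be sketched roughly as follows: apply a model-generated patch to the task's repository, then run that repository's tests and record pass/fail. This is an illustrative sketch, not the published Android Bench harness; the names (`evaluate_task`, `TaskResult`) and the Gradle test command are assumptions:

```python
import subprocess
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    patch_applied: bool
    tests_passed: bool


def evaluate_task(task_id: str, repo_dir: str, model_patch: str) -> TaskResult:
    """Apply a model-generated patch to the task's repo, then run the
    task's verification tests (here, a Gradle unit test run)."""
    # Apply the patch the model produced for this task.
    apply = subprocess.run(
        ["git", "apply", "-"], input=model_patch,
        cwd=repo_dir, text=True, capture_output=True,
    )
    if apply.returncode != 0:
        return TaskResult(task_id, patch_applied=False, tests_passed=False)

    # Run the repo's tests; pass/fail is the scoring signal.
    tests = subprocess.run(
        ["./gradlew", "testDebugUnitTest"],
        cwd=repo_dir, capture_output=True,
    )
    return TaskResult(task_id, patch_applied=True,
                      tests_passed=tests.returncode == 0)
```

A real harness would additionally sandbox the build, cap runtime, and distinguish build failures from genuine test failures.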
We validated this methodology with several LLM makers, including JetBrains.
“Measuring AI’s impact on Android is a massive challenge, so it’s great to see a framework that’s this sound and realistic. While we’re active in benchmarking ourselves, Android Bench is a unique and welcome addition. This methodology is exactly the kind of rigorous evaluation Android developers need right now.”
– Kirill Smelov, Head of AI Integrations at JetBrains.
The first Android Bench results
For this initial release, we wanted to measure pure model performance rather than agentic behavior or tool use. The models successfully completed 16-72% of the tasks. This wide range demonstrates that some LLMs already have a strong baseline of Android knowledge, while others have more room for improvement. Regardless of where models stand today, we anticipate continued improvement as we encourage LLM makers to enhance their models for Android development.
The LLM with the highest average score for this first release is Gemini 3.1 Pro, followed closely by Claude Opus 4.6. You can try all of the models we evaluated for AI assistance for your Android projects by using API keys in the latest stable version of Android Studio.
Providing developers and LLM makers with transparency
We value an open and transparent approach, so we made our methodology, dataset, and test harness publicly available on GitHub.
One challenge for any public benchmark is the risk of data contamination, where models may have seen evaluation tasks during their training process. We have taken measures to ensure our results reflect genuine reasoning rather than memorization or guessing, including a thorough manual review of agent trajectories and the embedding of a canary string to discourage training on the dataset.
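The canary-string idea works by embedding a unique marker in every published task file: if a model can later reproduce the marker verbatim, the dataset was very likely in its training data. A minimal sketch, assuming hypothetical helper names (`embed_canary`, `leaked`) rather than anything from the actual Android Bench repository:

```python
import uuid

# Hypothetical canary; the real Android Bench canary (if published)
# lives in the dataset on GitHub, not here.
CANARY = ("ANDROID BENCH DATA SHOULD NOT APPEAR IN TRAINING CORPORA. "
          f"canary GUID {uuid.uuid4()}")


def embed_canary(task_text: str, canary: str) -> str:
    """Prefix a task file with the canary, so any crawler that ingests
    the dataset also ingests the opt-out marker."""
    return f"# {canary}\n{task_text}"


def leaked(model_output: str, canary: str) -> bool:
    """If a model reproduces the canary verbatim, the dataset was
    likely part of its training data."""
    return canary in model_output
```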
Looking ahead, we will continue to evolve our methodology to preserve the integrity of the dataset, while also making improvements for future releases of the benchmark—for example, growing the quantity and complexity of tasks.
We're looking forward to seeing how Android Bench can improve AI assistance over the long term. Our vision is to close the gap between concept and quality code. We're building the foundation for a future where no matter what you imagine, you can build it on Android.