Google Supercharges Gemini 3 Flash with Agentic Vision



Sergio De Simone


Google has added agentic vision to Gemini 3 Flash, combining visual reasoning with code execution to "ground answers in visual evidence". According to Google, this not only improves accuracy but, more importantly, unlocks entirely new AI-driven behaviors.

Briefly, rather than analyzing an image in a single pass, Gemini 3 Flash now approaches vision as an agent‑like investigation: planning steps, manipulating the image, and using code to verify details before answering.

This results in a "think -> act -> observe" loop: the model first analyzes the prompt and the image to plan a multi-step approach; it then generates and executes Python code to manipulate the image and extract additional information from it, for example by cropping, zooming, annotating, or calculating; finally, it appends the transformed image to its context before producing a new answer.
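For illustration, the kind of code the model generates and runs during the "act" step might resemble the crop-and-zoom snippet below. This is a hypothetical reconstruction using Pillow; Google has not published the exact code the model emits, and the file name and coordinates are placeholders.

```python
# Hypothetical reconstruction of an "act" step: crop a region of
# interest and upscale it so small details (e.g., tiny text) become
# legible before the enlarged view is appended to the model's context.
from PIL import Image

image = Image.open("receipt.jpg")       # image provided with the prompt (placeholder name)
region = (420, 910, 780, 1010)          # (left, top, right, bottom) box around the fine print
crop = image.crop(region)

# Upscale the crop 4x with a high-quality resampling filter.
zoomed = crop.resize((crop.width * 4, crop.height * 4),
                     Image.Resampling.LANCZOS)
zoomed.save("zoomed_region.png")        # this view feeds the next "observe" step
```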

According to Google, this approach yields a 5-10% improvement in accuracy on vision tasks across most benchmarks, driven by two major factors.

First, code execution enables fine-grained inspection of image details by zooming into smaller visual elements, such as tiny text, rather than relying on guesswork. Gemini can also annotate images by drawing bounding boxes and labels to strengthen its visual reasoning, for example, by correctly counting objects. Using such annotations, Google claims to have solved the notoriously "hard problem" of counting the digits on a hand.
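A plausible sketch of that "annotate, then count" technique, again using Pillow, is shown below; the box coordinates are invented placeholders standing in for the model's own detections.

```python
# Hypothetical "annotate then count" step: draw numbered bounding
# boxes on each detected digit so the final count is grounded in
# visible evidence rather than a single-pass guess.
from PIL import Image, ImageDraw

image = Image.open("hand.jpg")                   # placeholder file name
draw = ImageDraw.Draw(image)

digit_boxes = [                                  # placeholder detections, one box per digit
    (102, 40, 150, 180),
    (160, 25, 205, 170),
    (215, 30, 260, 175),
    (270, 50, 315, 190),
    (330, 120, 400, 210),
]

for i, box in enumerate(digit_boxes, start=1):
    draw.rectangle(box, outline="red", width=3)
    draw.text((box[0], box[1] - 14), str(i), fill="red")

image.save("annotated_hand.png")
print(f"Counted {len(digit_boxes)} digits")      # Counted 5 digits
```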

Second, visual arithmetic and data visualization can be offloaded to deterministic Python code using Matplotlib, reducing hallucinations in complex, image‑based math.
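As a sketch of what that offloading could look like, the snippet below computes a total deterministically and re-plots it; the labels and values are assumed, standing in for numbers the model has read off a chart.

```python
# Hypothetical offload of visual arithmetic: rather than estimating a
# total from a chart by sight, compute it deterministically from the
# extracted values, then re-plot for verification.
import matplotlib
matplotlib.use("Agg")                         # headless backend, as in a code sandbox
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]           # labels read from the source image (assumed)
revenue = [4.2, 5.1, 3.8, 6.4]                # values read from the source image (assumed)

total = sum(revenue)                          # exact arithmetic instead of a visual guess
print(f"Total revenue: {total:.1f}M")         # Total revenue: 19.5M

fig, ax = plt.subplots()
ax.bar(quarters, revenue)
ax.set_title(f"Quarterly revenue (total {total:.1f}M)")
fig.savefig("verified_chart.png")             # rendered chart can be checked against the original
```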

The announcement drew enthusiastic reactions on Reddit. One commenter remarked:

"Reading this makes earlier vision tools feel incomplete in hindsight. So many edge cases existed simply because models couldn’t intervene or verify visually. Agentic Vision feels like the direction everyone will eventually adopt."

Another saw consequences for robotics:

"The implications of this are massive. Essentially they've unlocked visual reasoning for AI to be implemented in actual physical robots. Robots will have tons more context awareness and agentic capabilities."

Other redditors noted that ChatGPT has employed a similar approach for quite some time via Code Interpreter; nevertheless, it still appears unable to reliably count the digits on a hand.

Future Impact

Google's roadmap for agentic vision includes more implicit behavior, such as automatically triggering zooming, rotation, and other actions without explicit prompts; adding new tools such as web and reverse image search to enhance the evidence available to the model; and extending support to other models in the Gemini family beyond Flash.

Agentic Vision is accessible through the Gemini API in Google AI Studio and Vertex AI, and is starting to roll out in the Gemini app in Thinking mode.
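A minimal sketch of calling the model through the Gemini API with the google-genai Python SDK follows. It assumes agentic vision rides on the SDK's existing code-execution tool; the model identifier "gemini-3-flash", the file name, and the question are placeholders, and the exact id or configuration may differ in Google's documentation.

```python
# Minimal sketch (assumptions noted above): send an image plus a
# question to Gemini 3 Flash with code execution enabled, so the
# model can crop, zoom, and annotate its way to a grounded answer.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the GEMINI_API_KEY env var

with open("diagram.png", "rb") as f:   # placeholder image file
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",            # assumed model id; check Google's docs
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "What does the smallest label in this diagram say?",
    ],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)
print(response.text)
```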
