Close your eyes for a moment and think about how you test an application manually. Not the process. Not the checklist. The actual physical experience of it. You open the app. You look at the screen. You see a login form. You do not inspect the DOM to confirm it is there. You do not check the CSS class name of the input field. You just see it. A box for your email. A box for your password. A button that says “Sign In.” You type. You click. You look at what happens next.
Now think about how your automated tests do the same thing. They do not see anything. They parse a document object model. They search for elements by class names, IDs, data attributes, or XPath expressions. They interact with an abstraction of the page, not the page itself. They have no idea what the screen looks like. They could not tell you if the button is red or blue, if the layout is broken, or if half the page is hidden behind an overlay. They operate in the dark, navigating by coordinates they were given, hoping nothing has moved.
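To make the contrast concrete, here is a minimal sketch in Python using only the standard library's `html.parser` (the HTML snippets and class names are invented for illustration) of how a selector-driven test "finds" a button, and why a harmless rename breaks it:

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Collects tags whose class attribute contains a target class name."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.target_class in classes:
            self.matches.append(tag)

def find_by_class(html, class_name):
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.matches

# The same visible button, before and after a refactor that only renamed a class.
page_v1 = '<button class="btn-signin">Sign In</button>'
page_v2 = '<button class="auth-submit">Sign In</button>'

print(find_by_class(page_v1, "btn-signin"))  # ['button'] -- test passes
print(find_by_class(page_v2, "btn-signin"))  # [] -- test breaks, button unchanged
```

The button still renders identically in both versions; only the lookup key the test depends on changed.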
That gap between how humans test and how machines test is the source of almost every frustration in automation. And for the first time, we have the technology to close it.
The Blindness We Accepted as Normal
Think about this honestly. We built an entire industry around tools that cannot see. And we accepted the consequences of that blindness as the normal cost of automation.
- Tests break when a class name changes. We call this “maintenance.” But really it is the consequence of a tool that cannot see the button is still right there, in the same place, doing the same thing. Only its name in the code changed. A human tester would not even notice.
- Tests pass while the page is visually broken. The element exists in the DOM, so the test finds it. But it is behind a modal, or pushed off screen, or rendered at zero opacity. The test says “pass.” The user sees a broken page. A human tester would catch it instantly.
- Separate tools for web and mobile exist because the underlying document structures are completely different. But visually? A login screen looks like a login screen. On desktop, on iPhone, on Android. A human tester would test them the same way without thinking about it.
Every one of these problems traces back to the same root cause. The tool cannot see. It operates on a structural representation of the page, not the visual reality. And we have spent two decades building increasingly elaborate workarounds for that fundamental limitation instead of asking: what if the tool could just look at the screen?
A Child Can Do What Selenium Cannot
There is a thought experiment I keep coming back to. Hand a phone to a seven-year-old and ask them to log into an app. They have never seen the app before. They do not know what framework built it. They cannot read HTML. They have no concept of a DOM.
And yet they will figure it out. They will see the text fields. They will see the button. They will tap the right things in the right order. Not because they understand the code, but because they understand what they are looking at. Their eyes and their intuition are enough.
Now give that same task to the most sophisticated test automation framework on the market. Without selectors, without element IDs, without any structural metadata, it is helpless. It cannot find the login button because it does not know what a login button looks like. It only knows what a login button is called in the code.
A seven-year-old can outperform our best automation tools. That should bother us more than it does. Not because our tools are bad, but because the approach itself has a ceiling. And we hit that ceiling a long time ago.
Teaching Machines to Look
The breakthrough is not complicated to understand. It is the same breakthrough that lets you upload a photo to an AI and ask “what is in this picture?” A vision model can now look at a screenshot of an application and understand what it is seeing. Not in the way OCR reads text. In the way a person reads a screen.
It sees structure.
“There is a navigation bar at the top. Below it, a form with two input fields and a button. The first input is labeled Email. The second is labeled Password. The button says Sign In.”
It sees context.
“This appears to be a login page. The form is centered on the screen. There is a link below the form that says Forgot Password and another that says Create Account.”
It sees problems.
“The Sign In button appears to be partially obscured by an overlapping element. The password field is present but has no visible label, which may be an accessibility issue.”
This is not science fiction. This is what multimodal AI models can do right now, today. The technology that lets you ask an AI to describe a photo is the same technology that can power a test agent that actually sees your application. The pieces are here. They just need to be assembled with care and purpose.
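As a rough sketch of what this looks like in practice: multimodal chat APIs in the widely used OpenAI-style format accept a screenshot as a base64-encoded image alongside a text prompt. The code below only constructs such a request (the model name, prompt wording, and screenshot bytes are illustrative assumptions; no call is made):

```python
import base64

def build_describe_request(screenshot_bytes: bytes, model: str = "gpt-4o") -> dict:
    """Build a multimodal chat request asking a vision model to describe a screen.

    Uses the OpenAI-style message format; the default model name is an
    assumption, and actually sending the request is left out on purpose.
    """
    encoded = base64.b64encode(screenshot_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this application screen: list the visible "
                             "elements, their labels, and anything that looks broken."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{encoded}"}},
                ],
            }
        ],
    }

request = build_describe_request(b"\x89PNG fake screenshot bytes")
print(request["model"])
```

The response to a request like this is exactly the kind of structured description quoted above: elements, labels, layout, problems.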
Two Kinds of Seeing
Here is where things get genuinely interesting. Because once your test tool can see, it can see in different ways depending on what you need.
Think about how you look at a page differently depending on what you are testing. When you test whether the login flow works, you look loosely. You are scanning for the form, the button, the result. You do not care about exact pixel positions or font sizes. Your brain automatically filters out visual details that are irrelevant to the task. You are testing behavior.
But when your designer asks you to verify the new layout matches the Figma mockup, you look completely differently. Suddenly you care about every pixel. The spacing, the alignment, the exact shade of blue, the border radius. You switch into a different mode of seeing. Same eyes, different focus.
A vision based test agent can do exactly the same thing.
- Functional mode: The agent sees loosely. It identifies elements by what they are and what they do, not by their exact appearance or position. A login button is a login button whether it is rounded or square, blue or green, left aligned or centered. The test passes as long as the behavior is correct.
- Visual mode: The agent sees precisely. It compares the current screen against a baseline screenshot with pixel level attention. Spacing changes, color shifts, alignment drifts, missing elements. Everything gets flagged. The design stays honest.
Two modes of seeing. One tool. The same architectural capability that makes functional testing resilient also makes visual testing precise. You do not need separate products. You need one product that knows how to look at things differently depending on what you are asking.
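A toy illustration of the visual mode, assuming screenshots have already been decoded into grids of RGB tuples (real tools work on full screenshots with anti-aliasing tolerances and region grouping; this sketch shows only the core pixel-comparison idea):

```python
def diff_regions(baseline, current, tolerance=10):
    """Compare two same-sized pixel grids and return coordinates that drifted.

    baseline/current: 2D lists of (r, g, b) tuples. A pixel is flagged when
    any channel differs by more than `tolerance`, which absorbs minor
    rendering noise while still catching real color and layout shifts.
    """
    flagged = []
    for y, (row_a, row_b) in enumerate(zip(baseline, current)):
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            if any(abs(c1 - c2) > tolerance for c1, c2 in zip(a, b)):
                flagged.append((x, y))
    return flagged

blue, drifted_blue = (0, 0, 200), (0, 0, 190)   # within tolerance: ignored
white, red = (255, 255, 255), (200, 0, 0)       # a real change: flagged

baseline = [[white, blue], [white, white]]
current = [[white, drifted_blue], [red, white]]

print(diff_regions(baseline, current))  # [(0, 1)] -- only the real change
```

The functional mode is the opposite tuning of the same capability: instead of flagging every pixel delta, the model is asked only whether the elements it recognizes still exist and still behave.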
The Obvious Question About Platforms
If the test agent sees the screen, what does it matter which platform rendered it?
This is the realization that changes everything, and it is almost embarrassingly simple once you think about it. Traditional tools need separate implementations for web, iOS, and Android because each platform has a different structural model. The DOM, UIKit view hierarchy, Android view tree. They are completely different systems with completely different APIs for finding and interacting with elements.
But screens are screens. A login form on Chrome looks like a login form on Safari which looks like a login form on the iOS app which looks like a login form on the Android app. Different rendering engines, same visual output. Different code underneath, same experience on top.
A vision based agent operates at the visual layer. It does not need to know what platform it is looking at. It does not need platform specific drivers or adapters. It looks at the screen, understands what is there, and interacts with it.
One test. Every platform. Not because someone built a clever abstraction layer, but because the tool operates at a level where the platform differences simply do not exist. That is not incremental progress. That is a category shift.
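In outline, such an agent is a loop that needs only two primitives from any platform: capture a screenshot and send an input event at a coordinate. Everything in this sketch (the function names, the action format) is hypothetical, with the vision model stubbed out:

```python
def run_step(goal, screenshot, ask_model, send_tap):
    """One iteration of a vision-driven agent loop (hypothetical interface).

    ask_model: callable(goal, screenshot) -> action dict, from a vision model.
    send_tap:  callable(x, y) -> None, the only platform-specific primitive.
    """
    action = ask_model(goal, screenshot)
    if action["type"] == "tap":
        send_tap(action["x"], action["y"])
    return action

# Stub model: pretends it "saw" a Sign In button centered at (160, 420).
def stub_model(goal, screenshot):
    return {"type": "tap", "x": 160, "y": 420, "target": "Sign In button"}

taps = []
run_step("log in", b"fake-screenshot", stub_model, lambda x, y: taps.append((x, y)))
print(taps)  # [(160, 420)]
```

Swap in a real screenshot source and a real tap primitive for each platform, and the loop itself never changes, because the reasoning happens on the image, not on a platform API.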
What Changes When Tests Can See
Imagine this is your life. You write a test in plain English: “Log in. Go to settings. Change the profile photo. Verify it appears on the dashboard.” The agent runs it on your website. Then on your iOS app. Then on your Android app. Same test. All three pass.
Next week, your design team ships a major redesign. New colors. New layout. Restructured navigation. Your functional tests still pass because the agent can see that the settings page is still the settings page, even though everything looks different. It found the profile photo upload by looking at the screen, not by looking for a CSS class that no longer exists.
But your visual tests flag forty-two differences. New spacing on the dashboard header. A color that does not match the design spec on the settings page. A button that overlaps its neighbor on smaller screens. You review them, approve the intentional changes, and file tickets for the three that were accidents. Your design system stays clean.
You did not maintain any selectors. You did not rewrite any tests. You did not fight with platform specific locators. You spent your morning reviewing meaningful results instead of debugging meaningless failures. That is what changes when tests can see.
Opening Our Eyes
For twenty years, we asked our test tools to navigate the world blind. We gave them coordinates and hoped the landmarks had not moved. When they stumbled, we blamed the tools, or we blamed ourselves, or we blamed the developers who changed the code. But we never questioned the fundamental assumption: that tests should interact with the structure of a page instead of the reality of what the page looks like.
That assumption made sense when machines could not see. It does not make sense anymore. The technology to build vision based testing agents exists today. The question is no longer “is this possible?” The question is “who is going to build it right?”
At Yalitest, that is exactly the question we wake up to every morning. We are building a testing tool that sees your application the way you do. One tool for web and mobile. Functional tests that understand intent, not selectors. Visual tests that catch what your designers care about. Because the moment your tests can see, everything about the testing experience changes. Not incrementally. Fundamentally. And if you have ever felt the frustration of a test that broke because it could not see what was right in front of it, you already know why this matters.