Introducing the Hybrid Browser Toolkit: Faster, Smarter Web Automation for MCP

All you need to know about the Hybrid Browser Toolkit

October 2, 2025|32 min read

If you've been working with our original Camel BrowserToolkit, you might have noticed a pattern: it got the job done but had some limitations. It operated mainly by taking screenshots and injecting custom IDs into pages to find elements. It was a single-mode, monolithic Python setup that worked for basic tasks, but we knew we could do better. The screenshot-based approach meant you were essentially teaching an AI to click on pictures, which worked but felt a bit like using a smartphone with thick gloves on. Plus, the quality of those visual snapshots wasn't always great, and you couldn't easily access the underlying page structure when you needed it.

Enter the Hybrid Browser Toolkit

That's where our new Hybrid Browser Toolkit comes in. We've rebuilt everything from the ground up using a TypeScript-Python architecture that gives you the best of both worlds. Why TypeScript? It's not just about Playwright's native AI-friendly snapshot features – TypeScript is fundamentally better suited for efficient browser operations with its event-driven nature, native async/await support, and direct access to browser APIs without the overhead of language bridges. The TypeScript server handles all the heavy lifting of browser control while Python provides the familiar interface you love. But we didn't stop there. We've added support for CDP (Chrome DevTools Protocol) mode to connect to existing browser instances, and MCP (Model Context Protocol) integration for enhanced AI agent capabilities. You're not just limited to visual operations anymore – now you can seamlessly switch between visual and text-based interactions, access detailed DOM information, and enjoy snapshots that are crisp, accurate, and actually make sense. We've obsessed over the details too, from better element detection to smarter action handling, making the whole experience feel more natural and reliable. It's like upgrading from that smartphone-with-gloves setup to having direct, precise control over everything you need to do in a browser.

This blog is organized into four main chapters:

Architecture Overview

This chapter provides a comprehensive comparison between the legacy BrowserToolkit and the new HybridBrowserToolkit, highlighting the architectural improvements, new features, and enhanced capabilities.

Architecture Evolution
Core Architectural Improvements
1. Multi-Mode Operation System
2. TypeScript Framework Integration
3. Enhanced Element Identification
4. _snapshotForAI and ARIA Mapping Mechanism
5. Enhanced Stealth Mechanism
6. Tool Registration and Screenshot Handling

Architecture Evolution

The name "Hybrid" hints at the key change: we combined Python and TypeScript to get the best of both worlds. Instead of one heavy Python process doing everything, we now have a layered architecture where Python and a Node.js (TypeScript) server work together.

In the hybrid_browser_toolkit architecture, when you issue a command, it goes through a WebSocket to a TypeScript server that is tightly integrated with Playwright's Node.js API. This server manages a pool of browser instances and routes commands asynchronously. Python remains the interface (so you still write Python code to use the toolkit), but the heavy lifting happens in Node.js. Why is this good news? Because direct Node.js calls to Playwright are faster, and Playwright's latest features (like new selectors or the _snapshotForAI function) are fully available to us.

This layered design also makes the system modular. We have:

A Python layer for the API you call and configuration management.
A WebSocket bridge connecting the Python and TypeScript layers.
A TypeScript server layer that acts as the controller (routing commands, managing sessions).
A Browser control layer with controllers for different modes (text, visual, hybrid) and connection types.
The Playwright integration layer where actual browser actions happen with Playwright's Node.js capabilities (including things like _snapshotForAI and ARIA selectors).

In simpler terms, Python is the brain giving high-level instructions, and TypeScript is the brawn executing them efficiently. By splitting responsibilities this way, the toolkit can do more in parallel and handle complicated tasks without getting stuck.
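To make the request/response flow over the bridge concrete, here is a minimal sketch of how a Python layer might frame commands and pair replies with the requests that caused them. The field names (`id`, `action`, `params`) and the `CommandEnvelope` and `match_response` helpers are assumptions made for illustration; this post does not describe the toolkit's actual wire format.

```python
import itertools
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

# Hypothetical message shape: the toolkit's real wire format is not shown in this post.
_ids = itertools.count(1)


@dataclass
class CommandEnvelope:
    """One command sent from the Python layer to the TypeScript server."""
    action: str
    params: Dict[str, Any] = field(default_factory=dict)
    id: int = field(default_factory=lambda: next(_ids))

    def to_json(self) -> str:
        # Serialize to the text frame that would travel over the WebSocket.
        return json.dumps({"id": self.id, "action": self.action, "params": self.params})


def match_response(
    pending: Dict[int, CommandEnvelope], raw: str
) -> Tuple[CommandEnvelope, Any]:
    """Pair a raw JSON reply with the pending command that carries the same id."""
    reply = json.loads(raw)
    command = pending.pop(reply["id"])  # KeyError signals an unexpected reply
    return command, reply.get("result")
```

Tagging every command with an id is what would let a server answer out of order while the Python side still resolves each reply to the right caller.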

HybridBrowserToolkit Architecture

The HybridBrowserToolkit introduces a modular, multi-layer architecture, following the five layers listed above.

Core Architectural Improvements

1. Multi-Mode Operation System

The HybridBrowserToolkit supports three distinct operating modes:

Text Mode: Pure textual snapshot from _snapshotForAI
Visual Mode: Text snapshot filtered and visualized as SoM screenshot
Hybrid Mode: Intelligent switching between text and visual outputs

2. TypeScript Framework Integration

Aspect	Legacy Python Approach	TypeScript Framework	Benefits
Browser API Integration	Python → JS bridge with overhead	Direct native Playwright API calls	Lower latency; better performance; access to latest features
Asynchronous Operations	Limited async support	Native async/await throughout	Non-blocking operations; better concurrency; efficient resource usage
Element Interaction	Custom JavaScript injection	Native Playwright methods	More reliable; better error handling; cleaner code
Real-time Events	Polling-based updates	WebSocket event streaming	Instant updates; lower resource usage; better responsiveness
Type Safety	Runtime type checking only	Compile-time type checking	Catch errors early; better IDE support; safer refactoring
Performance	Multiple language contexts	Single runtime environment	Low-latency calls; lower CPU usage
Browser Features	Limited to Python bindings	Full Playwright API access	Playwright _snapshotForAI; advanced debugging
Error Handling	Cross-language error propagation	Native error boundaries	Clearer stack traces; better error recovery; easier debugging

3. Enhanced Element Identification

Legacy System:

# Custom ID injection
page.evaluate("__elementId = '123'")
target = page.locator("[__elementId='123']")

New ARIA Mapping System:

// Native Playwright ARIA selectors
await page.locator('[aria-label="Submit"]').click();
await page.getByRole('button', { name: 'Submit' }).click();

// _snapshotForAI integration
const snapshot = await page._snapshotForAI();
// Returns structured element data with ref mappings

4. _snapshotForAI and ARIA Mapping Mechanism

Pipeline:

_snapshotForAI analyzes the DOM and extracts ARIA properties
Elements are classified by their semantic roles
A unified ref ID system maps to ARIA selectors
The same foundation serves both text and visual modes
Visual mode is built on top of the text snapshot by filtering and adding markers
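The ref-mapping steps of this pipeline can be sketched in a few lines of Python. The example below assumes snapshot lines shaped like `- button "Sign In" [ref=2]`, which is the format shown in the snapshot examples in this post; `build_ref_map`, `Element`, and `to_role_selector` are hypothetical names, not part of the toolkit.

```python
import re
from typing import Dict, NamedTuple

# Matches lines shaped like: - button "Sign In" [ref=2]
_LINE = re.compile(
    r'^\s*-\s*(?P<role>[\w-]+)\s+"(?P<name>[^"]*)"\s+\[ref=(?P<ref>[^\]]+)\]'
)


class Element(NamedTuple):
    role: str
    name: str


def build_ref_map(snapshot: str) -> Dict[str, Element]:
    """Map each ref ID in a text snapshot to its ARIA role and accessible name."""
    refs: Dict[str, Element] = {}
    for line in snapshot.splitlines():
        match = _LINE.match(line)
        if match:  # Skip lines that do not describe a ref-bearing element
            refs[match["ref"]] = Element(match["role"], match["name"])
    return refs


def to_role_selector(element: Element) -> str:
    """Render an element as a Playwright role selector string."""
    return f'role={element.role}[name="{element.name}"]'
```

Because both text and visual modes would read from the same map, a ref seen on a screenshot label and a ref seen in the text snapshot resolve to the same element.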

5. Enhanced Stealth Mechanism

Key Stealth Enhancements:

Legacy Approach:
- Single flag
- Hardcoded user agent string
- Applied only during browser launch
- No flexibility for different contexts
HybridBrowserToolkit Approach:
- Comprehensive Flag Set: Multiple anti-detection browser arguments
- Configurable System: StealthConfig object allows customization
- Context Adaptation: Different behavior for CDP vs standard launch
- Dynamic Headers: Can set custom HTTP headers and user agents
- Persistent Context Support: Maintains stealth across sessions
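As a rough illustration of what a configurable, context-aware stealth setup could look like, here is a small sketch. `StealthSettings` and both helper functions are hypothetical stand-ins, not the toolkit's `StealthConfig`; the single launch flag is a commonly used Chromium switch chosen as an example, and the split between launch-time and context-time options is an assumption about how CDP and standard launch might differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StealthSettings:
    """Hypothetical stand-in for a stealth config object (not the toolkit's StealthConfig)."""
    enabled: bool = True
    launch_args: List[str] = field(
        default_factory=lambda: ["--disable-blink-features=AutomationControlled"]
    )
    user_agent: Optional[str] = None
    extra_headers: Dict[str, str] = field(default_factory=dict)


def launch_args_for(settings: StealthSettings, connect_over_cdp: bool) -> List[str]:
    """Launch flags only apply when we start the browser ourselves."""
    if not settings.enabled or connect_over_cdp:
        # A browser reached over CDP was launched elsewhere, so flags cannot be added.
        return []
    return list(settings.launch_args)


def context_options_for(settings: StealthSettings) -> Dict[str, object]:
    """Collect per-context options (assumed here to apply in both connection modes)."""
    options: Dict[str, object] = {}
    if settings.enabled and settings.user_agent:
        options["user_agent"] = settings.user_agent
    if settings.enabled and settings.extra_headers:
        options["extra_http_headers"] = dict(settings.extra_headers)
    return options
```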

6. Tool Registration and Screenshot Handling

Key Differences from Legacy:

Legacy: Screenshot stored in memory, passed as object
Hybrid: Screenshot saved to disk, agent accesses via file path
Memory Efficiency: Only file path in memory, not entire image
Agent Integration: Uses registered agent pattern for clean separation
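The disk-based handoff is simple to picture in code. The sketch below is an assumption about the general pattern, not the toolkit's implementation: `save_screenshot` and `describe_screenshot` are hypothetical helpers, and the file naming is only illustrative.

```python
import tempfile
import time
from pathlib import Path


def save_screenshot(image_bytes: bytes, directory: Path, prefix: str = "page") -> Path:
    """Write image bytes to disk and return only the path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{prefix}_{time.time_ns()}_som.png"
    path.write_bytes(image_bytes)
    return path


def describe_screenshot(path: Path, element_count: int) -> str:
    """Build the short message an agent would receive instead of raw pixels."""
    return (
        f"Screenshot captured with {element_count} interactive elements marked "
        f"(saved to: {path})"
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_screenshot(b"\x89PNG placeholder bytes", Path(tmp) / "screenshots")
        print(describe_screenshot(saved, 42))
```

Only the returned path and the short message need to stay in the agent's context; the image itself is read from disk when an analysis step actually needs it.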

7. Form Filling Optimization

New Features:

Multi-input support in single command
Intelligent dropdown detection
Diff snapshot for dynamic content
Error recovery mechanisms
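The diff snapshot idea can be approximated with a line-level comparison of two text snapshots. This is a minimal sketch under the assumption that each element occupies one snapshot line; `diff_snapshot` is a hypothetical helper and the toolkit's own diffing may work differently.

```python
from typing import List


def diff_snapshot(before: str, after: str) -> List[str]:
    """Return lines present in `after` but not in `before`, preserving their order."""
    seen = set(line.strip() for line in before.splitlines())
    return [
        line.strip()
        for line in after.splitlines()
        if line.strip() and line.strip() not in seen
    ]


if __name__ == "__main__":
    before = '- textbox "Search" [ref=3]\n- link "Home" [ref=1]'
    after = before + '\n- option "Laptop Bags" [ref=46]'
    # Only the newly appeared option line is reported
    print(diff_snapshot(before, after))
```

Returning only the new lines keeps the agent's context small when typing into a field causes a handful of options to appear on an otherwise unchanged page.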

Tools Reference

This chapter provides a comprehensive reference for all tools available in the HybridBrowserToolkit. Each tool is designed for specific browser automation tasks, from basic navigation to complex interactions.

Browser Session Management

browser_open

Opens a new browser session. This must be the first browser action before any other operations.

Parameters:

None

Returns:

result (str): Confirmation message
snapshot (str): Initial page snapshot (unless in full_visual_mode)
tabs (List[Dict]): Information about all open tabs
current_tab (int): Index of the active tab
total_tabs (int): Total number of open tabs

Example:

# Basic browser opening
toolkit = HybridBrowserToolkit(headless=False)
result = await toolkit.browser_open()

print(f"Browser opened: {result['result']}")
print(f"Initial page snapshot: {result['snapshot']}")
print(f"Total tabs: {result['total_tabs']}")

# With default URL configuration
toolkit = HybridBrowserToolkit(
    default_start_url="https://www.google.com"
)
result = await toolkit.browser_open()
# Browser opens directly to Google

browser_close

Closes the browser session and releases all resources. Should be called at the end of automation tasks.

Parameters:

None

Returns:

(str): Confirmation message

Example:

# Always close the browser when done
try:
    await toolkit.browser_open()
    # ... perform automation tasks ...
finally:
    result = await toolkit.browser_close()
    print(result)  # "Browser session closed."
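If the open/close pairing is needed in many places, the same try/finally pattern can be wrapped in an async context manager. The sketch below is one possible approach, not something the toolkit ships: `browser_session` is a hypothetical helper, and `FakeToolkit` is a stand-in with the same two method names so the example runs without a browser.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeToolkit:
    """Stand-in exposing browser_open/browser_close so the sketch runs offline."""

    def __init__(self) -> None:
        self.is_open = False

    async def browser_open(self) -> dict:
        self.is_open = True
        return {"result": "Browser session started."}

    async def browser_close(self) -> str:
        self.is_open = False
        return "Browser session closed."


@asynccontextmanager
async def browser_session(toolkit):
    """Open the browser on entry and always close it on exit, even after an error."""
    await toolkit.browser_open()
    try:
        yield toolkit
    finally:
        await toolkit.browser_close()


async def demo() -> bool:
    toolkit = FakeToolkit()
    try:
        async with browser_session(toolkit):
            raise RuntimeError("simulated task failure")
    except RuntimeError:
        pass
    return toolkit.is_open  # The session was closed despite the error


if __name__ == "__main__":
    print("still open:", asyncio.run(demo()))
```

With the real toolkit, a `HybridBrowserToolkit` instance would take the place of `FakeToolkit()`.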

browser_visit_page

Opens a URL in a new browser tab and switches to it. Creates a new tab each time it’s called.

Parameters:

url (str): The web address to load

Returns:

result (str): Confirmation message
snapshot (str): Page snapshot after navigation
tabs (List[Dict]): Updated tab information
current_tab (int): Index of the new active tab
total_tabs (int): Updated total number of tabs

Example:

# Visit a single page
result = await toolkit.browser_visit_page("https://example.com")
print(f"Navigated to: {result['result']}")
print(f"Page elements: {result['snapshot']}")

# Visit multiple pages (creates multiple tabs)
sites = ["https://github.com", "https://google.com", "https://stackoverflow.com"]
for site in sites:
    result = await toolkit.browser_visit_page(site)
    print(f"Tab {result['current_tab']}: {site}")
print(f"Total tabs open: {result['total_tabs']}")

browser_back

Navigates back to the previous page in browser history for the current tab.

Parameters:

None

Returns:

result (str): Confirmation message
snapshot (str): Snapshot of the previous page
tabs (List[Dict]): Current tab information
current_tab (int): Index of active tab
total_tabs (int): Total number of tabs

Example:

# Navigate through history
await toolkit.browser_visit_page("https://example.com")
await toolkit.browser_visit_page("https://example.com/about")

# Go back
result = await toolkit.browser_back()
print(f"Navigated back to: {result['result']}")

browser_forward

Navigates forward to the next page in browser history for the current tab.

Parameters:

None

Returns:

Same as browser_back

Example:

# Navigate forward after going back
await toolkit.browser_visit_page("https://example.com")
await toolkit.browser_back()  # Back to homepage

# Go forward again
result = await toolkit.browser_forward()
print(f"Navigated forward to: {result['result']}")

Information Retrieval Tools

browser_get_page_snapshot

Note: This is a passive tool that must be explicitly called to retrieve page information. It does not trigger any page actions.

Gets a textual snapshot of all interactive elements on the current page. Each element is assigned a unique ref ID for interaction.

Parameters:

None (uses viewport_limit setting from toolkit initialization)

Returns:

(str): Formatted string listing all interactive elements with their ref IDs

Example:

# Get full page snapshot
snapshot = await toolkit.browser_get_page_snapshot()
print(snapshot)
# Output:
# - link "Home" [ref=1]
# - button "Sign In" [ref=2]
# - textbox "Search" [ref=3]
# - link "Products" [ref=4]

# With viewport limiting
toolkit_limited = HybridBrowserToolkit(viewport_limit=True)
visible_snapshot = await toolkit_limited.browser_get_page_snapshot()
# Only returns elements currently visible in viewport

browser_get_som_screenshot

Captures a screenshot with interactive elements highlighted and marked with ref IDs (Set of Marks). This tool uses an advanced injection-based approach with browser-side optimizations for accurate element detection.

Technical Features:

Injection-based Implementation: The SoM (Set of Marks) functionality is injected directly into the browser context, ensuring accurate element detection and positioning
Efficient Occlusion Detection: Browser-side algorithms detect when elements are hidden behind other elements, preventing false positives
Parent-Child Element Fusion: Intelligently merges parent and child elements when they represent the same interactive component (e.g., a button containing an icon and text)
Smart Label Positioning: Automatically finds optimal positions for ref ID labels to avoid overlapping with page content

Parameters:

read_image (bool, optional): If True, uses AI to analyze the screenshot. Default: True
instruction (str, optional): Specific guidance for AI analysis

Returns:

(str): Confirmation message with file path and optional AI analysis

Example:

# Basic screenshot capture
result = await toolkit.browser_get_som_screenshot(read_image=False)
print(result)
# "Screenshot captured with 42 interactive elements marked (saved to: ./assets/screenshots/page_123456_som.png)"

# With AI analysis
result = await toolkit.browser_get_som_screenshot(
    read_image=True,
    instruction="Find all form input fields"
)
# "Screenshot captured... Agent analysis: Found 5 form fields: username [ref=3], password [ref=4], email [ref=5], phone [ref=6], submit button [ref=7]"

# For visual verification
result = await toolkit.browser_get_som_screenshot(
    read_image=True,
    instruction="Verify the login button is visible and properly styled"
)

# Complex UI with overlapping elements
result = await toolkit.browser_get_som_screenshot(read_image=False)
# The tool automatically handles:
# - Dropdown menus that overlay other content
# - Modal dialogs
# - Nested interactive elements
# - Elements with transparency

# Parent-child fusion example
# A button containing an icon and text will be marked as one element, not three
# <button [ref=5]>
#   <i class="icon"></i>
#   <span>Submit</span>
# </button>
# Will appear as single "button Submit [ref=5]" instead of separate elements
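The parent-child fusion described above can be approximated with a simple tree rule: keep an interactive element only if none of its ancestors is interactive. The sketch below is a simplified assumption about the idea, not the browser-side algorithm the toolkit injects; `Node`, `INTERACTIVE_ROLES`, and `fuse_children` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Illustrative subset of ARIA roles treated as interactive in this sketch
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


@dataclass
class Node:
    ref: str
    role: str
    parent: Optional[str] = None  # ref of the parent node, if any


def fuse_children(nodes: List[Node]) -> List[Node]:
    """Keep interactive nodes that have no interactive ancestor."""
    by_ref: Dict[str, Node] = {node.ref: node for node in nodes}

    def has_interactive_ancestor(node: Node) -> bool:
        current = by_ref.get(node.parent) if node.parent else None
        while current is not None:
            if current.role in INTERACTIVE_ROLES:
                return True
            current = by_ref.get(current.parent) if current.parent else None
        return False

    return [
        node
        for node in nodes
        if node.role in INTERACTIVE_ROLES and not has_interactive_ancestor(node)
    ]
```

Under this rule a button wrapping an icon and a nested link yields a single mark for the button, which matches the one-label-per-component behavior the example comments describe.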

browser_get_tab_info

Note: This is a passive information retrieval tool that provides current tab state without modifying anything.

Gets information about all open browser tabs including titles, URLs, and which tab is active.

Parameters:

None

Returns:

tabs (List[Dict]): List of tab information, each containing:
- id (str): Unique tab identifier
- title (str): Page title
- url (str): Current URL
- is_current (bool): Whether this is the active tab
current_tab (int): Index of the active tab
total_tabs (int): Total number of open tabs

Example:

# Check all open tabs
tab_info = await toolkit.browser_get_tab_info()

print(f"Total tabs: {tab_info['total_tabs']}")
print(f"Active tab index: {tab_info['current_tab']}")

for i, tab in enumerate(tab_info['tabs']):
    status = "ACTIVE" if tab['is_current'] else ""
    print(f"Tab {i}: {tab['title']} - {tab['url']} {status}")

# Find a specific tab
github_tab = next(
    (tab for tab in tab_info['tabs'] if 'github.com' in tab['url']),
    None
)
if github_tab:
    await toolkit.browser_switch_tab(tab_id=github_tab['id'])

Interaction Tools

browser_click

Performs a click action on an element identified by its ref ID.

Parameters:

ref (str): The ref ID of the element to click

Returns:

result (str): Confirmation of the action
snapshot (str): Updated page snapshot after click
tabs (List[Dict]): Current tab information
current_tab (int): Index of active tab
total_tabs (int): Total number of tabs
newTabId (str, optional): ID of newly opened tab if click opened a new tab

Example:

# Simple click
result = await toolkit.browser_click(ref="2")
print(f"Clicked: {result['result']}")

# Click that opens new tab
result = await toolkit.browser_click(ref="external-link")
if 'newTabId' in result:
    print(f"New tab opened with ID: {result['newTabId']}")
    # Switch to the new tab
    await toolkit.browser_switch_tab(tab_id=result['newTabId'])

# Click with error handling
try:
    result = await toolkit.browser_click(ref="submit-button")
except Exception as e:
    print(f"Click failed: {e}")

browser_type

Types text into input elements. Supports both single and multiple inputs with intelligent dropdown detection and automatic child element discovery.

Special Features:

Intelligent Dropdown Detection:
- When typing into elements that might trigger dropdown options (such as combobox, search fields, or autocomplete inputs), the tool automatically:
  - Detects if new options appear after typing
  - Returns only the newly appeared options via diffSnapshot instead of the full page snapshot
  - This optimization reduces noise and makes it easier to interact with dynamic dropdowns
Automatic Child Element Discovery:
- If the specified ref ID points to a container element that cannot accept text input directly, the tool automatically:
  - Searches through child elements to find an input field
  - Attempts to type into the first suitable child input element found
  - This is particularly useful for complex UI components where the visible element is a wrapper around the actual input

Parameters (Single Input):

ref (str): The ref ID of the input element (or container with input child)
text (str): The text to type

Parameters (Multiple Inputs):

inputs (List[Dict[str, str]]): List of dictionaries with ‘ref’ and ‘text’ keys

Returns:

result (str): Confirmation message
snapshot (str): Updated page snapshot (full snapshot for regular inputs)
diffSnapshot (str, optional): For dropdowns, shows only newly appeared options
details (Dict, optional): For multiple inputs, success/error status for each
Tab information fields (tabs, current_tab, total_tabs), as in browser_click

Example:

# Single input
result = await toolkit.browser_type(ref="3", text="john.doe@example.com")

# Handle dropdown/autocomplete with intelligent detection
result = await toolkit.browser_type(ref="search", text="laptop")
if 'diffSnapshot' in result:
    print("Dropdown options appeared:")
    print(result['diffSnapshot'])
    # Example output:
    # - option "Laptop Computers" [ref=45]
    # - option "Laptop Bags" [ref=46]
    # - option "Laptop Accessories" [ref=47]

    # Click on one of the options
    await toolkit.browser_click(ref="45")
else:
    # No dropdown appeared, continue with regular snapshot
    print("Page snapshot:", result['snapshot'])

# Autocomplete example with diff detection
result = await toolkit.browser_type(ref="city-input", text="San")
if 'diffSnapshot' in result:
    # Only shows newly appeared suggestions
    print("City suggestions:")
    print(result['diffSnapshot'])
    # - option "San Francisco" [ref=23]
    # - option "San Diego" [ref=24]
    # - option "San Antonio" [ref=25]

# Multiple inputs at once
inputs = [
    {'ref': '3', 'text': 'username123'},
    {'ref': '4', 'text': 'SecurePass123!'},
    {'ref': '5', 'text': 'john.doe@example.com'}
]
result = await toolkit.browser_type(inputs=inputs)
print(result['details'])  # Success/failure for each input

# Clear and type
await toolkit.browser_click(ref="3")  # Focus
await toolkit.browser_press_key(keys=["Control+a"])  # Select all
await toolkit.browser_type(ref="3", text="new_value")  # Replaces content

# Working with combobox elements
async def handle_searchable_dropdown():
    # Type to search/filter options
    result = await toolkit.browser_type(ref="country-select", text="United")

    if 'diffSnapshot' in result:
        # Shows only countries containing "United"
        print("Filtered countries:", result['diffSnapshot'])
        # - option "United States" [ref=87]
        # - option "United Kingdom" [ref=88]
        # - option "United Arab Emirates" [ref=89]

        # Select one of the filtered options
        await toolkit.browser_click(ref="87")

# Automatic child element discovery
# When the ref points to a container, browser_type finds the input child
result = await toolkit.browser_type(ref="search-container", text="product name")
# Even though ref="search-container" might be a <div>, the tool will find
# and type into the actual <input> element inside it

# Complex UI component example
# The visible element might be a styled wrapper
result = await toolkit.browser_type(ref="styled-date-picker", text="2024-03-15")
# Tool automatically finds the actual input field within the date picker component
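The automatic child element discovery used in the last two examples amounts to a depth-first search for the first descendant that can accept text. The sketch below illustrates that search on a toy element tree; `UiNode`, `TYPEABLE_ROLES`, and `resolve_typeable` are hypothetical, and the toolkit's real lookup runs against the live page rather than a Python structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# ARIA roles assumed to accept typed text in this sketch
TYPEABLE_ROLES = {"textbox", "searchbox", "combobox"}


@dataclass
class UiNode:
    ref: str
    role: str
    children: List["UiNode"] = field(default_factory=list)


def resolve_typeable(node: UiNode) -> Optional[UiNode]:
    """Return `node` if it accepts text, else the first typeable descendant (depth-first)."""
    if node.role in TYPEABLE_ROLES:
        return node
    for child in node.children:
        found = resolve_typeable(child)
        if found is not None:
            return found
    return None


if __name__ == "__main__":
    container = UiNode(
        "search-container",
        "group",
        [UiNode("icon", "img"), UiNode("wrap", "generic", [UiNode("q", "searchbox")])],
    )
    target = resolve_typeable(container)
    print(target.ref if target else "no input found")
```

Returning `None` when nothing typeable is found gives the caller a clear point to report an error instead of typing into the wrong element.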

browser_select

Selects an option in a dropdown (

Introducing the Hybrid Browser Toolkit: Faster, Smarter Web Automation for MCP

All you need to know about the Hybrid browser toolkit

October 2, 2025|32 min read

Enter the Hybrid Browser Toolkit

This blog organized into four main chapters:

Architecture Overview

Architecture Evolution
Core Architectural Improvements
- 1. Multi-Mode Operation System
  2. TypeScript Framework Integration
1. Enhanced Element Identification
1. _snapshotForAI and ARIA Mapping Mechanism
1. Enhanced Stealth Mechanism
1. Tool Registration and Screenshot Handling

Architecture Evolution

This layered design also makes the system modular. We have:

A Python layer for the API you call and configuration management.
A WebSocket bridge connecting the Python and TypeScript layers.
A TypeScript server layer that acts as the controller (routing commands, managing sessions).
A Browser control layer with controllers for different modes (text, visual, hybrid) and connection types.
The Playwright integration layer where actual browser actions happen with Playwright's Node.js capabilities (including things like _snapshotForAI and ARIA selectors).

HybridBrowserToolkit Architecture

The HybridBrowserToolkit introduces a modular, multi-layer architecture:

Core Architectural Improvements

1. Multi-Mode Operation System

The HybridBrowserToolkit supports three distinct operating modes:

Text Mode: Pure textual snapshot from _snapshotForAI
Visual Mode: Text snapshot filtered and visualized as SoM screenshot
Hybrid Mode: Intelligent switching between text and visual outputs

2. TypeScript Framework Integration

Advantage	Legacy Python Approach	TypeScript Framework	Benefits
Browser API Integration	Python → JS bridge with overhead	Direct native Playwright API calls	- Lower latency - Better performance - Access to latest features
Asynchronous Operations	Limited async support	Native async/await throughout	- Non-blocking operations - Better concurrency - Efficient resource usage
Element Interaction	Custom JavaScript injection	Native Playwright methods	- More reliable - Better error handling - Cleaner code
Real-time Events	Polling-based updates	WebSocket event streaming	- Instant updates - Lower resource usage - Better responsiveness
Type Safety	Runtime type checking only	Compile-time type checking	- Catch errors early - Better IDE support - Safer refactoring
Performance	Multiple language contexts	Single runtime environment	- Low-latency calls - Lower CPU usage
Browser Features	Limited to Python bindings	Full Playwright API access	- Playwright SnapshotForAI - Advanced debugging
Error Handling	Cross-language error propagation	Native error boundaries	- Clearer stack traces - Better error recovery - Easier debugging

3. Enhanced Element Identification

Legacy System:

# Custom ID injection
page.evaluate("__elementId = '123'")
target = page.locator("[__elementId='123']")

‍

New ARIA Mapping System

// Native Playwright ARIA selectors
await page.locator('[aria-label="Submit"]').click()
await page.getByRole('button', { name: 'Submit' }).click()//
_snapshotForAI integration
const snapshot = await page._snapshotForAI();// Returns structured element data with ref mappings

‍

4. _snapshotForAI and ARIA Mapping Mechanism

‍

Pipeline:

_snapshotForAI analyzes the DOM and extracts ARIA properties
Elements are classified by their semantic roles
A unified ref ID system maps to ARIA selectors
The same foundation serves both text and visual modes
Visual mode is built on top of the text snapshot by filtering and adding markers

5. Enhanced Stealth Mechanism

Key Stealth Enhancements:

Legacy Approach:
- Single flag
- Hardcoded user agent string
- Applied only during browser launch
- No flexibility for different contexts
HybridBrowserToolkit Approach:
- Comprehensive Flag Set: Multiple anti-detection browser arguments
- Configurable System: StealthConfig object allows customization
- Context Adaptation: Different behavior for CDP vs standard launch
- Dynamic Headers: Can set custom HTTP headers and user agents
- Persistent Context Support: Maintains stealth across sessions

6. Tool Registration and Screenshot Handling

Key Differences from Legacy:

Legacy: Screenshot stored in memory, passed as object
Hybrid: Screenshot saved to disk, agent accesses via file path
Memory Efficiency: Only file path in memory, not entire image
Agent Integration: Uses registered agent pattern for clean separation

7. Form Filling Optimization

New Features:

Multi-input support in single command
Intelligent dropdown detection
Diff snapshot for dynamic content
Error recovery mechanisms

‍

Tools Reference

Browser Session Management

browser_open

Opens a new browser session. This must be the first browser action before any other operations.

Parameters:

None

Returns:

result (str): Confirmation message
snapshot (str): Initial page snapshot (unless in full_visual_mode)
tabs (List[Dict]): Information about all open tabs
current_tab (int): Index of the active tab
total_tabs (int): Total number of open tabs

Example:

# Basic browser opening
toolkit = HybridBrowserToolkit(headless=False)
result = await toolkit.browser_open()

print(f"Browser opened: {result['result']}")
print(f"Initial page snapshot: {result['snapshot']}")
print(f"Total tabs: {result['total_tabs']}")

# With default URL configuration
toolkit = HybridBrowserToolkit(
    default_start_url="https://www.google.com"
)
result = await toolkit.browser_open()
# Browser opens directly to Google

‍

browser_close

Closes the browser session and releases all resources. Should be called at the end of automation tasks.

Parameters:

None

Returns:

(str): Confirmation message

Example:

# Always close the browser when done
try:
    await toolkit.browser_open()
    # ... perform automation tasks ...
finally:
    result = await toolkit.browser_close()
    print(result)  # "Browser session closed."

‍

browser_visit_page

Opens a URL in a new browser tab and switches to it. Creates a new tab each time it’s called.

Parameters:

url (str): The web address to load

Returns:

result (str): Confirmation message
snapshot (str): Page snapshot after navigation
tabs (List[Dict]): Updated tab information
current_tab (int): Index of the new active tab
total_tabs (int): Updated total number of tabs

Example:

# Visit a single page
result = await toolkit.browser_visit_page("https://example.com")
print(f"Navigated to: {result['result']}")
print(f"Page elements: {result['snapshot']}")

# Visit multiple pages (creates multiple tabs)
sites = ["https://github.com", "https://google.com", "https://stackoverflow.com"]
for site in sites:
    result = await toolkit.browser_visit_page(site)
    print(f"Tab {result['current_tab']}: {site}")
print(f"Total tabs open: {result['total_tabs']}")

‍

browser_back

Navigates back to the previous page in browser history for the current tab.

Parameters:

None

Returns:

result (str): Confirmation message
snapshot (str): Snapshot of the previous page
tabs (List[Dict]): Current tab information
current_tab (int): Index of active tab
total_tabs (int): Total number of tabs

Example:

# Navigate through history
await toolkit.browser_visit_page("https://example.com")
await toolkit.browser_visit_page("https://example.com/about")

# Go back
result = await toolkit.browser_back()
print(f"Navigated back to: {result['result']}")

‍

browser_forward

Navigates forward to the next page in browser history for the current tab.

Parameters:

None

Returns:

Same as browser_back

Example:

# Navigate forward after going back
await toolkit.browser_visit_page("https://example.com")
await toolkit.browser_back()  # Back to homepage

# Go forward again
result = await toolkit.browser_forward()
print(f"Navigated forward to: {result['result']}")

‍

Information Retrieval Tools

browser_get_page_snapshot

Note: This is a passive tool that must be explicitly called to retrieve page information. It does not trigger any page actions.

Gets a textual snapshot of all interactive elements on the current page. Each element is assigned a unique ref ID for interaction.

Parameters:

None (uses viewport_limit setting from toolkit initialization)

Returns:

(str): Formatted string listing all interactive elements with their ref IDs

Example:

# Get full page snapshot
snapshot = await toolkit.browser_get_page_snapshot()
print(snapshot)
# Output:
# - link "Home" [ref=1]
# - button "Sign In" [ref=2]
# - textbox "Search" [ref=3]
# - link "Products" [ref=4]

# With viewport limiting
toolkit_limited = HybridBrowserToolkit(viewport_limit=True)
visible_snapshot = await toolkit_limited.browser_get_page_snapshot()
# Only returns elements currently visible in viewport

‍

browser_get_som_screenshot

Technical Features:

‍1. Injection-based Implementation: The SoM (Set of Marks) functionality is injected directly into the browser context, ensuring accurate element detection and positioning

Efficient Occlusion Detection: Browser-side algorithms detect when elements are hidden behind other elements, preventing false positives
Parent-Child Element Fusion: Intelligently merges parent and child elements when they represent the same interactive component (e.g., a button containing an icon and text)
Smart Label Positioning: Automatically finds optimal positions for ref ID labels to avoid overlapping with page content

Parameters:

read_image (bool, optional): If True, uses AI to analyze the screenshot. Default: True
instruction (str, optional): Specific guidance for AI analysis

Returns:

(str): Confirmation message with file path and optional AI analysis

Example:

# Basic screenshot capture
result = await toolkit.browser_get_som_screenshot(read_image=False)
print(result)
# "Screenshot captured with 42 interactive elements marked (saved to: ./assets/screenshots/page_123456_som.png)"

# With AI analysis
result = await toolkit.browser_get_som_screenshot(
    read_image=True,
    instruction="Find all form input fields"
)
# "Screenshot captured... Agent analysis: Found 5 form fields: username [ref=3], password [ref=4], email [ref=5], phone [ref=6], submit button [ref=7]"

# For visual verification
result = await toolkit.browser_get_som_screenshot(
    read_image=True,
    instruction="Verify the login button is visible and properly styled"
)

# Complex UI with overlapping elements
result = await toolkit.browser_get_som_screenshot(read_image=False)
# The tool automatically handles:
# - Dropdown menus that overlay other content
# - Modal dialogs
# - Nested interactive elements
# - Elements with transparency

# Parent-child fusion example
# A button containing an icon and text will be marked as one element, not three
# <button [ref=5]>
#   <i class="icon"></i>
#   <span>Submit</span>
# </button>
# Will appear as single "button Submit [ref=5]" instead of separate elements

‍

browser_get_tab_info

Note: This is a passive information retrieval tool that provides current tab state without modifying anything.

Gets information about all open browser tabs including titles, URLs, and which tab is active.

Parameters:

None

Returns:

tabs (List[Dict]): List of tab information, each containing:
  - id (str): Unique tab identifier
  - title (str): Page title
  - url (str): Current URL
  - is_current (bool): Whether this is the active tab
current_tab (int): Index of the active tab
total_tabs (int): Total number of open tabs

Example:

# Check all open tabs
tab_info = await toolkit.browser_get_tab_info()

print(f"Total tabs: {tab_info['total_tabs']}")
print(f"Active tab index: {tab_info['current_tab']}")

for i, tab in enumerate(tab_info['tabs']):
    status = "ACTIVE" if tab['is_current'] else ""
    print(f"Tab {i}: {tab['title']} - {tab['url']} {status}")

# Find a specific tab
github_tab = next(
    (tab for tab in tab_info['tabs'] if 'github.com' in tab['url']),
    None
)
if github_tab:
    await toolkit.browser_switch_tab(tab_id=github_tab['id'])
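
If you look up tabs often, the lookup can be wrapped in small pure helpers that work on the returned dictionary. This is a sketch under the assumption that the result has the `tabs`, `url`, and `is_current` fields listed above; `find_tab_by_url` and `active_tab` are hypothetical helper names, not part of the toolkit.

```python
from typing import Any, Dict, Optional

TabInfo = Dict[str, Any]


def find_tab_by_url(tab_info: TabInfo, fragment: str) -> Optional[Dict[str, Any]]:
    """Return the first tab whose URL contains `fragment`, or None."""
    for tab in tab_info.get("tabs", []):
        if fragment in tab.get("url", ""):
            return tab
    return None


def active_tab(tab_info: TabInfo) -> Optional[Dict[str, Any]]:
    """Return the tab flagged as current, or None if no tab is flagged."""
    return next(
        (tab for tab in tab_info.get("tabs", []) if tab.get("is_current")),
        None,
    )


# Sample data shaped like the documented return value.
sample = {
    "tabs": [
        {"id": "tab-1", "title": "Docs", "url": "https://example.com/docs", "is_current": False},
        {"id": "tab-2", "title": "GitHub", "url": "https://github.com/", "is_current": True},
    ],
    "current_tab": 1,
    "total_tabs": 2,
}
github = find_tab_by_url(sample, "github.com")
print(github["id"])              # tab-2
print(active_tab(sample)["id"])  # tab-2
```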


Interaction Tools

browser_click

Performs a click action on an element identified by its ref ID.

Parameters:

ref (str): The ref ID of the element to click

Returns:

result (str): Confirmation of the action
snapshot (str): Updated page snapshot after click
tabs (List[Dict]): Current tab information
current_tab (int): Index of active tab
total_tabs (int): Total number of tabs
newTabId (str, optional): ID of newly opened tab if click opened a new tab

Example:

# Simple click
result = await toolkit.browser_click(ref="2")
print(f"Clicked: {result['result']}")

# Click that opens new tab
result = await toolkit.browser_click(ref="external-link")
if 'newTabId' in result:
    print(f"New tab opened with ID: {result['newTabId']}")
    # Switch to the new tab
    await toolkit.browser_switch_tab(tab_id=result['newTabId'])

# Click with error handling
try:
    result = await toolkit.browser_click(ref="submit-button")
except Exception as e:
    print(f"Click failed: {e}")
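
The "click, then follow the new tab" pattern above can be folded into one helper. The sketch below assumes the click result is a dictionary with an optional `newTabId` key, as documented above. `click_and_follow` is a hypothetical helper, and `FakeToolkit` is a stand-in that exists only so the example runs without a real browser.

```python
import asyncio
from typing import Any, Dict, Optional


async def click_and_follow(toolkit: Any, ref: str) -> Dict[str, Any]:
    """Click `ref`; if the click opened a new tab, switch to it."""
    result = await toolkit.browser_click(ref=ref)
    new_tab_id = result.get("newTabId")
    if new_tab_id:
        await toolkit.browser_switch_tab(tab_id=new_tab_id)
    return result


class FakeToolkit:
    """Stand-in used only to exercise the helper without a browser."""

    def __init__(self) -> None:
        self.switched_to: Optional[str] = None

    async def browser_click(self, ref: str) -> Dict[str, Any]:
        # Pretend every click opens a new tab with a fixed ID.
        return {"result": f"Clicked {ref}", "newTabId": "tab-2"}

    async def browser_switch_tab(self, tab_id: str) -> Dict[str, Any]:
        self.switched_to = tab_id
        return {"result": f"Switched to {tab_id}"}


fake = FakeToolkit()
click_result = asyncio.run(click_and_follow(fake, "2"))
print(click_result["result"])  # Clicked 2
print(fake.switched_to)        # tab-2
```

With the real toolkit you would pass your toolkit instance in place of `fake` and await the helper from your own async code.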


browser_type

Types text into input elements. Supports both single and multiple inputs with intelligent dropdown detection and automatic child element discovery.

Special Features:

Intelligent Dropdown Detection:
- When typing into elements that might trigger dropdown options (such as combobox, search fields, or autocomplete inputs), the tool automatically:
  - Detects whether new options appear after typing
  - Returns only the newly appeared options via diffSnapshot instead of the full page snapshot
- This optimization reduces noise and makes it easier to interact with dynamic dropdowns
Automatic Child Element Discovery:
- If the specified ref ID points to a container element that cannot accept text input directly, the tool automatically:
  - Searches through child elements to find an input field
  - Attempts to type into the first suitable child input element it finds
- This is particularly useful for complex UI components where the visible element is a wrapper around the actual input
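
Both behaviors are easier to picture with a little code. The sketch below is an illustration, not the toolkit's implementation: `diff_snapshot` treats a snapshot as lines of text and keeps only the lines that were not there before typing, and `find_input` walks a nested dictionary depth-first to find the first node that accepts text. The function names, the `TEXT_ROLES` set, and the dictionary layout are all assumptions made for this example.

```python
from typing import Any, Dict, Optional

# Hypothetical roles that accept typed text in this sketch.
TEXT_ROLES = {"textbox", "searchbox", "combobox"}


def diff_snapshot(before: str, after: str) -> str:
    """Return the lines of `after` that do not appear in `before`."""
    seen = {line.strip() for line in before.splitlines()}
    new_lines = [
        line
        for line in after.splitlines()
        if line.strip() and line.strip() not in seen
    ]
    return "\n".join(new_lines)


def find_input(node: Dict[str, Any]) -> Optional[str]:
    """Depth-first search for the ref of the first text-accepting node."""
    if node.get("role") in TEXT_ROLES:
        return node["ref"]
    for child in node.get("children", []):
        found = find_input(child)
        if found is not None:
            return found
    return None


before = '- combobox "Search" [ref=12]'
after = (
    '- combobox "Search" [ref=12]\n'
    '- option "Laptop Computers" [ref=45]\n'
    '- option "Laptop Bags" [ref=46]'
)
new_options = diff_snapshot(before, after)
print(new_options)

container = {
    "ref": "10",
    "role": "generic",
    "children": [
        {"ref": "11", "role": "img"},
        {"ref": "12", "role": "textbox"},
    ],
}
input_ref = find_input(container)
print(input_ref)  # 12
```

A line-based diff is the simplest possible choice here. A real implementation would more likely compare structured accessibility nodes than raw text.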

Parameters (Single Input):

ref (str): The ref ID of the input element (or container with input child)
text (str): The text to type

Parameters (Multiple Inputs):

inputs (List[Dict[str, str]]): List of dictionaries with 'ref' and 'text' keys

Returns:

result (str): Confirmation message
snapshot (str): Updated page snapshot (full snapshot for regular inputs)
diffSnapshot (str, optional): For dropdowns, shows only newly appeared options
details (Dict, optional): For multiple inputs, success/error status for each
tabs, current_tab, total_tabs: Tab information fields, as described under browser_click

Example:

# Single input
result = await toolkit.browser_type(ref="3", text="john.doe@example.com")

# Handle dropdown/autocomplete with intelligent detection
result = await toolkit.browser_type(ref="search", text="laptop")
if 'diffSnapshot' in result:
    print("Dropdown options appeared:")
    print(result['diffSnapshot'])
    # Example output:
    # - option "Laptop Computers" [ref=45]
    # - option "Laptop Bags" [ref=46]
    # - option "Laptop Accessories" [ref=47]

    # Click on one of the options
    await toolkit.browser_click(ref="45")
else:
    # No dropdown appeared, continue with regular snapshot
    print("Page snapshot:", result['snapshot'])

# Autocomplete example with diff detection
result = await toolkit.browser_type(ref="city-input", text="San")
if 'diffSnapshot' in result:
    # Only shows newly appeared suggestions
    print("City suggestions:")
    print(result['diffSnapshot'])
    # - option "San Francisco" [ref=23]
    # - option "San Diego" [ref=24]
    # - option "San Antonio" [ref=25]

# Multiple inputs at once
inputs = [
    {'ref': '3', 'text': 'username123'},
    {'ref': '4', 'text': 'SecurePass123!'},
    {'ref': '5', 'text': 'john.doe@example.com'}
]
result = await toolkit.browser_type(inputs=inputs)
print(result['details'])  # Success/failure for each input

# Clear and type
await toolkit.browser_click(ref="3")  # Focus
await toolkit.browser_press_key(keys=["Control+a"])  # Select all (on macOS, use "Meta+a")
await toolkit.browser_type(ref="3", text="new_value")  # Replaces content

# Working with combobox elements
async def handle_searchable_dropdown():
    # Type to search/filter options
    result = await toolkit.browser_type(ref="country-select", text="United")

    if 'diffSnapshot' in result:
        # Shows only countries containing "United"
        print("Filtered countries:", result['diffSnapshot'])
        # - option "United States" [ref=87]
        # - option "United Kingdom" [ref=88]
        # - option "United Arab Emirates" [ref=89]

        # Select one of the filtered options
        await toolkit.browser_click(ref="87")

# Automatic child element discovery
# When the ref points to a container, browser_type finds the input child
result = await toolkit.browser_type(ref="search-container", text="product name")
# Even though ref="search-container" might be a <div>, the tool will find
# and type into the actual <input> element inside it

# Complex UI component example
# The visible element might be a styled wrapper
result = await toolkit.browser_type(ref="styled-date-picker", text="2024-03-15")
# Tool automatically finds the actual input field within the date picker component
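
Rather than hard-coding a ref like "45" after reading the diff by eye, you can parse the options out of diffSnapshot and pick one by label. The sketch below assumes each option line looks like the example output above (`- option "Label" [ref=N]`); that line format is inferred from this post, and `parse_options` and `pick_option` are hypothetical helpers.

```python
import re
from typing import List, Optional, Tuple

# Assumed line shape: - option "Label" [ref=45]
_OPTION = re.compile(r'-\s*option\s+"(?P<label>[^"]*)"\s*\[ref=(?P<ref>[^\]]+)\]')


def parse_options(diff: str) -> List[Tuple[str, str]]:
    """Return (label, ref) pairs for every option line in the diff."""
    return [(m.group("label"), m.group("ref")) for m in _OPTION.finditer(diff)]


def pick_option(diff: str, wanted: str) -> Optional[str]:
    """Return the ref whose label matches `wanted`, ignoring case."""
    for label, ref in parse_options(diff):
        if label.lower() == wanted.lower():
            return ref
    return None


diff = (
    '- option "San Francisco" [ref=23]\n'
    '- option "San Diego" [ref=24]\n'
    '- option "San Antonio" [ref=25]'
)
options = parse_options(diff)
print(options)
chosen = pick_option(diff, "san diego")
print(chosen)  # 24
```

The returned ref can then be passed to browser_click, for example `await toolkit.browser_click(ref=chosen)`.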


browser_select

Selects an option in a dropdown (
