I'd be really interested in feedback on the security model of client-side agents with extension-bridge access, and happy to take questions on the implementation!
jadbox 1 day ago [-]
I tried setting the LLM to "http://0.0.0.0:8080" and the extension crashed and now continues to crash at startup.
I see. The visual effect requires the browser to support WebGL2.
The core functionality shouldn't crash just because the visual effect did. Not good practice on my part; I'll fix that ASAP.
Thanks for noticing. Btw the video should work now.
koakuma-chan 5 hours ago [-]
Yes it works, nice effects.
selimenes1 16 hours ago [-]
The "inside-out" framing resonates with me. I have been building embeddable scripts that get dropped into third-party sites via a script tag, and the architectural decisions you are making here mirror a lot of the same trade-offs I have encountered.
The biggest challenge with any in-page tool is the tension between needing deep DOM access and maintaining isolation. For the agent UI itself, you almost certainly want iframe isolation -- CSS conflicts with the host page are a constant headache otherwise. But for the actual DOM interaction (reading page state, simulating events), you need to be in the host page context. This dual architecture (iframe for your UI, direct access for page interaction) adds complexity but is worth it for reliability across diverse sites.
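A minimal sketch of the glue for that dual architecture, assuming a postMessage bridge between the iframe UI and the host-page context (all names here are illustrative, not page-agent's actual API):

```javascript
// Sketch of an iframe-UI <-> page-context bridge. The iframe hosts the
// agent's chat UI in isolation; the page-context side simulates events.
// AGENT_PROTOCOL, makeBridgeMessage, acceptBridgeMessage are made-up names.

const AGENT_PROTOCOL = 'agent-bridge/v1';

// Wrap an action request so the other side can recognize and validate it.
function makeBridgeMessage(type, payload) {
  return { protocol: AGENT_PROTOCOL, type, payload };
}

// Only accept messages that carry our protocol tag AND come from the
// expected origin -- any other script on the page can also call postMessage.
function acceptBridgeMessage(event, expectedOrigin) {
  if (event.origin !== expectedOrigin) return null;
  const msg = event.data;
  if (!msg || msg.protocol !== AGENT_PROTOCOL) return null;
  return msg;
}

// In the browser, the page-context side would be wired up roughly as:
//   window.addEventListener('message', (e) => {
//     const msg = acceptBridgeMessage(e, 'https://agent-ui.example');
//     if (msg && msg.type === 'click') { /* simulate the click */ }
//   });
// and the iframe UI would send:
//   parent.postMessage(makeBridgeMessage('click', { index: 3 }), pageOrigin);
```

The origin check is what keeps the bridge from becoming a generic "any script can drive the agent" channel.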
One thing I would flag as a real production concern: Content Security Policy. A significant number of enterprise and SaaS sites set strict CSP headers that will block inline scripts, eval, and sometimes even dynamically created script elements. If your target audience includes embedding this in production apps, you will hit CSP issues quickly. The bookmarklet approach cleverly sidesteps this for demos, but for a proper integration the host app needs to explicitly whitelist your script origin.
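For reference, the kind of CSP allowance a host app would need looks roughly like this (the origins are placeholders); note that `connect-src` matters too, since the agent has to reach an LLM endpoint:

```html
<!-- Hypothetical policy: allow the agent bundle from a known CDN origin
     and its LLM proxy calls, while keeping everything else locked down. -->
<meta http-equiv="Content-Security-Policy"
      content="default-src 'self'; script-src 'self' https://cdn.agent-host.example; connect-src 'self' https://llm-proxy.example">
```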
The HTML dehydration approach you described in the comments (parsing live HTML, stripping to semantic essentials, indexing interactive elements) is smart. In my experience, the fidelity of that serialization step is where most of the edge cases live. Shadow DOM, canvas elements, dynamically loaded content, iframes-within-iframes -- each one needs special handling and you end up building a progressively more complex serializer over time. Keeping that layer thin and well-tested is probably the highest-leverage investment for long-term maintainability.
simon_luv_pho 11 hours ago [-]
Really appreciate the in-depth feedback.
Iframes and CSP are big problems. For the in-page version, I chose to leave out Shadow DOM, canvas, and iframes, although I know one developer forked a version to control same-origin iframes. I don't think it's practical to hack around browser (and website) security; that's why I built the browser extension. I'm hoping the bridge that lets a page call the extension can cover most use cases.
My original HTML dehydration script was ported from `browser-use`. You're absolutely right that it's getting heavier over time, and it's the key factor influencing the overall task success rate. I'm looking to refactor that part and add an extension system for developers to patch their own sites. Hope it turns out well.
Thank you for the feedback. I'll be extra cautious to keep the dehydration code maintainable.
mentalgear 1 day ago [-]
> Data processed via servers in Mainland China
Appreciate the transparency, but maybe you could add some European (preferably) alternatives?
simon_luv_pho 1 day ago [-]
Please use your own LLM API instead!
The free testing LLM is Qwen hosted by Aliyun. Qwen and DeepSeek are the only ones I can afford to offer for free. It's just there to lower the try-out barrier; please DO NOT rely on it.
The library itself does NOT include any backend service. Your data only goes to the LLM API you configured.
That looks great! I also thought about calling the Gemini Nano model embedded in Chrome (only extensions can do that). But after some testing on smaller models, I found that anything smaller than 9B can't really handle the complex tool-call schema I use.
Qwen3.5 4B is quite good but still gives messy JSON quite often. But it’s very promising!
Maybe after one more model iteration or some fine-tuning we can go fully embedded?
simon_luv_pho 1 day ago [-]
I'm looking into a European testing endpoint. The legal and compliance requirements are quite a hassle, and persuading my company to pay for that infrastructure is gonna be a tough sell.
hrmtst93837 17 hours ago [-]
Ask the project to offer an EU-hosted endpoint or a self-hosted Docker image, and to publish a clear dataflow diagram showing which inputs, inference steps, logs and backups are stored or processed in Mainland China.
Practically, that can be done by provisioning EU clusters with Terraform on AWS eu-west-1 or a European host like Hetzner, and using geolocation DNS or Cloudflare load balancing to steer users and pin accounts to a region, while accepting higher costs, more complex CI/CD, and subtle GDPR issues around backups and telemetry.
The library does NOT include backend services. This is an open source project. I’m not selling any service here…
ed_mercer 3 hours ago [-]
Cool, but likely to become obsolete with the rise of agents that ship with the browser.
dworks 1 day ago [-]
Very interesting. Is this related to CoPaw and AgentScope? I think the AG-UI integration for dynamic UI would be useful here, are you using that?
I'm building a web UI workspace right now where I have been planning to integrate the agent as an app or component instead of having it be the entire UI. I may fork PageAgent for that, let's see.
simon_luv_pho 19 hours ago [-]
Currently the only dependency is zod for schema parsing.
I'm intentionally building on a lightweight, in-page JavaScript foundation to carve out some differentiation from the Python-heavy agent ecosystem.
The "protocol" layer of AG-UI does look interesting. I'll look into it to see if I can reuse something, although it seems to be evolving more toward an integration framework rather than an open protocol.
Really glad this resonates with your use case. Lightweight embedding is exactly my priority scenario. Would love to hear how the work goes!
catapart 16 hours ago [-]
This looks really useful! I'm having a hard time understanding how it might be used by each specific user, using their own LLM instance, though. Is that because it does not support that type of bring-your-own-LLM scheme, or am I just not putting two and two together with some kind of chain of user authentication, then token exchange?
simon_luv_pho 14 hours ago [-]
This library does not include an LLM service. The one on the homepage is only for demonstration and testing. The npm package and extension require your own LLM API config. Docs here: https://alibaba.github.io/page-agent/docs/features/models
pscanf 1 day ago [-]
Very cool!
I'm particularly impressed by the bookmark "trick" to install it on a page. Despite having spent 15 years developing for the browser, I had somehow missed that feature of the bookmarks bar. But awesome UX for people to try out the tool. Congrats!
simon_luv_pho 1 day ago [-]
Thanks!
Bookmarklets are such an underrated feature. It's super convenient to inject and test scripts on any page. Seemed like the perfect low-friction entry point for people to try it out.
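The generic shape of such a script-injecting bookmarklet looks roughly like this (`buildBookmarklet` is an illustrative helper, not the project's code):

```javascript
// A bookmarklet is just a javascript: URL saved as a bookmark. This one
// appends a <script> tag pointing at a hosted bundle -- which is exactly
// the kind of injection a strict-CSP page will refuse to load.
function buildBookmarklet(scriptUrl) {
  const body =
    "var s=document.createElement('script');" +
    `s.src='${scriptUrl}';` +
    'document.body.appendChild(s);';
  return `javascript:(function(){${body}})();`;
}
```

Dragging the resulting URL to the bookmarks bar gives a one-click injector for any page.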
Spent some time on that UX because the concept is a bit hard to explain. Glad it worked!
swaminarayan 1 day ago [-]
If an AI agent runs inside the page and can see the DOM and the user’s session, how do you keep it safe without limiting what it can actually do?
claud_ia 18 hours ago [-]
The tension is real, but I think it's the same trust model problem that browser extensions solved years ago — just re-emerging with sharper stakes. The key insight is that 'inside the page' doesn't mean 'unlimited': you can constrain the agent to a declared action space (a list of semantic intents your app exposes) rather than letting it operate on arbitrary DOM mutations. Essentially the app becomes the API surface, and the agent calls into it rather than scripting the UI directly. The session inheritance is then a feature, not a risk, because the agent operates exactly at the permission level of the authenticated user — it can't escalate beyond what a human clicking around could do. The harder unsolved problem is prompt injection: if the page content itself can influence the agent's instructions (e.g., a user-generated comment telling the agent to 'click delete account'), you need the same kind of sandboxing logic that email clients use to strip active content.
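The "declared action space" idea can be sketched as a small registry the host app controls (all names here are hypothetical, not any real library's API):

```javascript
// The host app registers the only intents the agent may invoke;
// arbitrary DOM scripting is off the table.
function createActionSpace() {
  const actions = new Map();
  return {
    // The app declares each intent it is willing to expose.
    register(name, description, handler) {
      actions.set(name, { description, handler });
    },
    // What gets serialized into the LLM prompt as the available tools.
    list() {
      return [...actions].map(([name, a]) => ({ name, description: a.description }));
    },
    // The agent can only call through here; undeclared intents are rejected,
    // so a prompt-injected "delete account" has no handler to reach.
    invoke(name, args) {
      const action = actions.get(name);
      if (!action) throw new Error(`Undeclared action: ${name}`);
      return action.handler(args);
    },
  };
}
```

The registry doubles as documentation: `list()` is exactly the action menu the model sees, and `invoke()` is the only door back in.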
simon_luv_pho 17 hours ago [-]
This is the problem every agent has to face.
PageAgent’s differentiator is that site developers can embed it directly into their own pages. In that scenario, with proper system instructions plus a built-in whitelist/blacklist API for interactive elements, the risk is pretty manageable.
For the general-agent case, operating on pages you don’t control, the risk is definitely higher. I’m currently working on the human-in-the-loop feature so the user can intervene before sensitive actions.
Would love to hear other approaches if anyone has ideas.
westurner 23 hours ago [-]
Advantages and disadvantages of sandboxing agents with OS DAC/MAC, VM, container, user-space, WASM runtime, browser extension permissions, and IDK IFrames and Origins?
How are AI agents built into browsers sandboxed by comparison?
Similar principles, just embed a script tag and you get an agent that can type/click/select to onboard/demo/checkout users.
I tried on your website and it was reeaaaally slow. Quick question:
- you are injecting numbering onto the UI. Are you taking screenshots? But I don't see any screenshots in the request being sent, so what is the point of the numbering?
I don't think building on browser-use is the way to go, it was the worst performing harness of all we tested [https://www.rtrvr.ai/blog/web-bench-results]. We built out our own logic to build custom Action Trees that don't require any ARIA or accessibility setup from websites.
Would love to meet and trade notes, if possible (rtrvr.ai/request-demo)!
carl_dr 23 hours ago [-]
Am I right in thinking you’re asking me to put an API key in frontend code?
simon_luv_pho 22 hours ago [-]
No and please don’t do that.
If you only use it as a personal assistant, you can connect to your LLM service directly.
If you plan to integrate it into your web app, it’s better to have a proxy API for the LLM and auth the request with a cookie or something.
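That proxy pattern, sketched (the endpoint, cookie name, and helper functions are all illustrative, not part of the library):

```javascript
// The browser never holds the LLM key: it calls your backend with its
// normal session cookie, and the backend forwards to the real LLM API.

const LLM_UPSTREAM = 'https://api.example.com/v1/chat/completions';

// Decide whether a proxied request is allowed, based on the app session.
function authorize(cookies, sessions) {
  const sid = cookies.sessionId;
  return Boolean(sid && sessions.has(sid));
}

// Build the upstream request: same OpenAI-style body, but the API key is
// attached server-side and the user's cookies are NOT forwarded.
function buildUpstreamRequest(body, apiKey) {
  return {
    url: LLM_UPSTREAM,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(body),
  };
}

// With Express this would be wired up roughly as:
//   app.post('/api/llm', (req, res) => {
//     if (!authorize(req.cookies, sessions)) return res.sendStatus(401);
//     const up = buildUpstreamRequest(req.body, process.env.LLM_API_KEY);
//     fetch(up.url, up).then((r) => r.json()).then((j) => res.json(j));
//   });
```

The proxy is also the natural place for per-user rate limiting and prompt/response logging.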
general_reveal 1 day ago [-]
I’ve been thinking about something like this. If it’s just a one-line script import, how the heck are you trusting natural language to translate to commands for an arbitrary UI?
The only thing I can think of is you had the AI rewrite and embed selectors on the entire build file and work with that?
simon_luv_pho 1 day ago [-]
Everything happens at runtime, on the HTML level.
It uses a similar process to `browser-use`, but all in the web page. A script parses the live HTML, strips it down to its semantic essentials (HTML dehydration), and indexes every interactive element. That snapshot goes to the LLM, which returns actions referencing elements by index. The agent then simulates mouse/keyboard events on those elements via JS.
This works best on pages with proper semantic HTML and accessibility markup. You can test it right now on any page using the bookmarklet on the homepage (unless that page's CSP blocks script injection, of course).
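A toy version of that dehydration pass, assuming a simplified node shape rather than a real DOM (the real serializers in page-agent / browser-use handle far more edge cases):

```javascript
// Walk the tree, keep only semantically meaningful content, and give every
// interactive element an index the LLM can refer back to in its actions.

const INTERACTIVE = new Set(['a', 'button', 'input', 'select', 'textarea']);
const SKIP = new Set(['script', 'style', 'svg']);

// `node` is a plain object { tag, text, children } standing in for a DOM node.
function dehydrate(node, registry = []) {
  if (SKIP.has(node.tag)) return { snapshot: '', registry };
  const children = (node.children || [])
    .map((c) => dehydrate(c, registry).snapshot)
    .filter(Boolean)
    .join(' ');
  const text = [node.text, children].filter(Boolean).join(' ');
  if (INTERACTIVE.has(node.tag)) {
    registry.push(node); // the agent later dispatches events to registry[i]
    const index = registry.length - 1;
    return { snapshot: `[${index}]<${node.tag}>${text}</${node.tag}>`, registry };
  }
  // Non-interactive wrappers contribute only their text content.
  return { snapshot: text, registry };
}
```

The snapshot string is what goes into the prompt; the registry stays behind so the returned index can be resolved back to a live element.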
WebMCP doesn’t seem to be available for use inside webpages or extensions.
dzink 1 day ago [-]
Is this affiliated with the Chinese company Alibaba? Any chance data goes there too?
simon_luv_pho 1 day ago [-]
Full transparency: I work at Alibaba and published this under Alibaba's open-source org. I sometimes maintain it during work hours, so yes, Alibaba technically pays me for it. That said, this is my project — it's MIT-licensed, includes no backend service, and is open for anyone to audit.
The free testing LLM endpoint is hosted on Alibaba Cloud because I happen to have some company quota to spend, but it's not part of the library. Bring your own LLM and there is zero data transmission to Alibaba or anywhere else you haven't configured yourself.
I highly recommend using it with a local Ollama setup.
Zetaphor 24 hours ago [-]
Thank you for sharing this!
redindian75 22 hours ago [-]
I tested the Chrome extension and it worked great. I asked it to change the light/dark mode of a website; it navigated to settings, clicked a few tabs, scrolled, and found the toggle.
Thanks for sharing!
simon_luv_pho 19 hours ago [-]
Glad it worked well! The Chrome extension is my focus right now. It handles simple tasks pretty reliably and fast, but still has a long way to go for more complex workflows. Lots to improve.
Mnexium 1 day ago [-]
Curious - how does it perform with captchas and other "are you human" stuff on the web?
simon_luv_pho 1 day ago [-]
I added in the system prompt that it should skip CAPTCHAs and hand control back to the user. Currently working on a proper human-in-the-loop feature. That's actually one of the key advantages of running the agent inside your own browser.
Mnexium 1 day ago [-]
Makes sense.
For curiosity's sake, have you had it try to attempt captchas?
If so, what were the results?
simon_luv_pho 1 day ago [-]
I haven’t. I don’t think it will work well.
I use a text-based approach. Captchas like “crossroad” usually need a screenshot, a visual model and coordinate-based mouse events.
CloakHQ 1 day ago [-]
[flagged]
bsenftner 15 hours ago [-]
How is this secure? Seems like this PageAgent could be the user pretty easily and cause all kinds of problems.
simon_luv_pho 13 hours ago [-]
Could you elaborate on what kind of security problems you’re referring to? Like hallucination?
bsenftner 13 hours ago [-]
The PageAgent has access to the security tokens of the currently logged-in user. It can do anything the user can on the site, including becoming them. What is to prevent PageAgent from being exploited and sending those security tokens elsewhere? It would be trivial for some other package to look for PageAgent and override key functions, and then it's all over.
simon_luv_pho 11 hours ago [-]
PageAgent operates at the HTML/DOM level with the same privileges as any other JavaScript running on the page and nothing more. The security token concern you're describing applies equally to every third-party script, npm package, or browser extension that runs in-page. It's not unique to PageAgent.
The browser extension can be more risky because it's more privileged. I've designed a simple authorization mechanism so that only pages explicitly approved by the user can call the extension.
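Such an approval gate might reduce to an exact-origin allowlist check (a sketch under assumed names, not the extension's actual code):

```javascript
// The extension only honors bridge calls from pages the user has
// explicitly approved; everything else is denied by default.

function isApprovedOrigin(origin, approved) {
  // Exact match on origin (scheme + host + port), no wildcard matching,
  // so a lookalike domain never slips through.
  return approved.has(origin);
}

// In a real extension the background script would gate messages with it:
//   chrome.runtime.onMessageExternal.addListener((msg, sender, reply) => {
//     const origin = new URL(sender.url).origin;
//     if (!isApprovedOrigin(origin, approvedOrigins)) return reply({ error: 'denied' });
//     handleAgentCall(msg).then(reply);
//     return true; // keep the reply channel open for the async response
//   });
```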
That said, I'd welcome more eyes on this. If anyone wants to review the security model, the code is fully open source.
coreylane 1 day ago [-]
Looks cool! Are you open to adding AWS Bedrock or LiteLLM support?
simon_luv_pho 1 day ago [-]
Thanks!
It supports any OpenAI-compatible API out of the box, so AWS Bedrock, LiteLLM, Ollama, etc. should all work. The free testing LLM is just there for a quick demo. Please bring your own LLM for long-term usage.
jadbox 1 day ago [-]
Firefox support?
simon_luv_pho 22 hours ago [-]
It's in the plan. Should be easy since I use wxt as the extension framework.
MeteorMarc 1 day ago [-]
Confusing name because of the existence of Pageant, the PuTTY agent.
simon_luv_pho 1 day ago [-]
Darn. Pageant would've been a nice name though. Maybe `page-agent.js` is more relevant in the web dev community.
graypegg 1 day ago [-]
I think every successful Show HN post ends up with a "thought this was about X" or "didn't look up the name first?" comment. Consider it a win! I don't think anyone will mistake a tool for PuTTY with your tool, but you might share a Google search page with it.
mmarian 1 day ago [-]
I think Page Agent is good. I've never heard of PuTTY's Pageant. And I think it's better to distinguish it from the general meaning of pageant (as in beauty pageant).
simon_luv_pho 1 day ago [-]
Thanks!
kirth_gersen 1 day ago [-]
Came here to say missed opportunity to call it "PAgent". Rolls off the tongue better than Page Agent.
simon_luv_pho 1 day ago [-]
I'm 2 years too late for that one...
popalchemist 1 day ago [-]
Does it support long-click / click-and-drag?
simon_luv_pho 1 day ago [-]
Not yet. Currently focused on the more common interaction patterns. PRs welcome though!
popalchemist 1 day ago [-]
Gotcha. Still very cool! Congrats on the release.
simon_luv_pho 1 day ago [-]
Thanks!
jauntywundrkind 1 day ago [-]
Not exactly the same, but I'd also point to Paul Kinlan's FolioLM as a very interesting project in this space. A very nice browser extension:
> Collect and query content from tabs, bookmarks, and history - your AI research companion. FolioLM helps you collect sources from tabs, bookmarks, and history, then query and transform that content using AI.
- GitHub: https://github.com/alibaba/page-agent
- Live Demo (No sign-up): https://alibaba.github.io/page-agent/ (you can drag the bookmarklet from here to try it on other sites)
- Browser Extension: https://chromewebstore.google.com/detail/page-agent-ext/akld...
Even if it’s not, it’s not supposed to crash on startup. Can you post some screenshots and details on GitHub issues? I’m looking into this.
I mean, not even the readme video?
That gives me a 404.
I see the homepage but no chat or anything else that could be an agent.
Uncaught (in promise) Error: WebGL2 is required but not available.
  setupGL     https://alibaba.github.io/page-agent/assets/SimulatorMask-B8...
  K           https://alibaba.github.io/page-agent/assets/SimulatorMask-B8...
  <anonymous> https://alibaba.github.io/page-agent/assets/SimulatorMask-B8...
  nt          https://alibaba.github.io/page-agent/assets/SimulatorMask-B8...
  maskReady   https://alibaba.github.io/page-agent/assets/PageAgent-oX13Jj...
Because I have WebGL disabled.
I tested it on local Ollama models and it works fine.
Recent work in sandboxing agents: https://news.ycombinator.com/item?id=47223974
We just launched Rover (https://rover.rtrvr.ai/) as the first Embeddable Web Agent.
https://github.com/PaulKinlan/NotebookLM-Chrome
https://chromewebstore.google.com/detail/foliolm/eeejhgacmlh...