Generate HTML from an image using prompts
Locate GUI elements using instructions
Locate where to click in a UI image based on your instruction
Predict UI click coordinates from a screenshot and instruction