How far are we from Claude's "computer use" running locally?
Claude has a "computer use" demo that can interact with a desktop PC and click things.
The code looks like it's just sending screenshots to their API and getting cursor positions back.
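Not an answer, but for anyone wanting to poke at this locally, here's a rough sketch of that screenshot-in, coordinates-out loop against Ollama's REST API. The endpoint and response fields are Ollama's real ones; the `llava` model tag and the prompt are just my guesses at what you'd try first, and (per the point below) the coordinates LLaVA returns tend to be unreliable:

```python
# Minimal sketch: screenshot -> local vision model -> (hopefully) coordinates.
import base64
import io

import requests
from PIL import ImageGrab  # pip install pillow

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def screenshot_b64() -> str:
    """Capture the screen and return it as a base64-encoded PNG."""
    img = ImageGrab.grab()
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()


def ask_for_click(target: str) -> str:
    """Ask a local vision model where to click on the current screen."""
    payload = {
        "model": "llava",  # assumed: any vision model you've pulled into Ollama
        "prompt": f'Return the (x, y) pixel coordinates of "{target}" as JSON.',
        "images": [screenshot_b64()],
        "stream": False,
    }
    resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]


print(ask_for_click("the Firefox icon"))
```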
I can't imagine that's doable with a visual classification model like LLaVA etc., since those don't actually know exact pixel positions within an image. There must be something else going on before or after the image is fed into the visual model. Maybe each element is isolated using filters and then classified?
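If that hypothesis is right, the preprocessing step might look something like this: classical CV finds candidate UI elements, so the model only has to label crops whose pixel positions are already known. This is pure speculation on my part, not how Anthropic actually does it; the thresholds are arbitrary:

```python
# Sketch of the "isolate with filters, then classify" idea using OpenCV.
import cv2  # pip install opencv-python


def candidate_elements(screenshot_path: str) -> list[tuple[int, int, int, int]]:
    """Return bounding boxes (x, y, w, h) of likely UI elements."""
    img = cv2.imread(screenshot_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)  # edge map highlights element borders
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    # keep plausibly button/icon-sized regions, drop tiny noise
    return [(x, y, w, h) for (x, y, w, h) in boxes if w * h > 400]


# Each crop img[y:y+h, x:x+w] could then be sent to the vision model to
# classify ("this is the Save button"), and the box center gives the exact
# click coordinates that the model itself can't produce.
```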
Does anyone know how this stuff actually works, or know of an existing open-source project that's trying to build this on top of the Ollama vision API?