April 6, 2025 · Wiki · multimodal

WikiBot chats log

The most interesting thing is that WikiBot is trained only on texts from our Wiki, where there is a partial description of our App. However, when receiving an image as input, it is able to determine what screen it is, what it is for, and which screen comes next.

That means we can drop a screenshot into a chat with the bot and still get a meaningful answer:

which part of the product this screen belongs to,
what the user is likely trying to do there,
which article from the Wiki explains this flow.

Even without image embeddings, this already helps support, analytics and product teams navigate real‑world screenshots much faster than manually searching through documentation.

Screenshots

Here are a couple of example screenshots from internal chats with WikiBot:

WikiBot explains which part of the product a screen belongs to — WikiBot explains which part of the product this screen belongs to and what the user is trying to do.

WikiBot suggests the next screen in the flow based on the screenshot — WikiBot suggests which screen comes next in the flow and links the relevant Wiki article.

Additional WikiBot conversation example — Another conversation where WikiBot reasons about the interface based only on screenshots and Wiki texts.

The next step is to add vector search over real screenshots and link it with Wiki text. Then WikiBot will be able to find «similar» states of the interface even for screens it has never seen before.