Multimodal models are trained to understand encoded images, and it really does feel like magic. Base64 is a binary-to-text encoding scheme that represents image bytes as a printable ASCII string, usually wrapped in a data URI of the form data:[<mediatype>][;base64],<data>. We think of LLMs as only good at text, but any structured data is predictable: as long as the input can be turned into an N-dimensional vector that represents a complex idea in the model's hidden state, the model treats it essentially like text. With sufficient training data, it makes sense of what looks to us like noise.
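For the curious, here's a minimal sketch of building that data URI in Python. The payload shape at the end is an assumption following the common OpenAI-style vision format, and "cat.png" is just a placeholder file:

```python
import base64
import mimetypes

def image_to_data_uri(path: str) -> str:
    """Encode an image file as a data:[<mediatype>];base64,<data> URI."""
    media_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{media_type};base64,{encoded}"

# Hypothetical message payload (shape follows the common OpenAI-style
# vision API; adjust for whichever provider you actually use):
payload = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": image_to_data_uri("cat.png")}},
    ],
}
```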
This is super interesting - LLMs do not distinguish between pictures/screenshots and text: all of it is vectorized, and the model processes everything together as part of the same thinking process. It is magic, and a breakthrough. My guess is that this was not by design but a nice after-effect of the core attention mechanism. A lot of papers have been written on it - you will find it a very interesting read.
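To make the "all are vectorized" point concrete, here's a toy numpy sketch (every shape, table, and projection here is made up for illustration, not any real model's architecture): image patches and text tokens are both mapped into the same d-dimensional embedding space and concatenated into one sequence, so the attention layers see no difference between the two modalities.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # shared embedding width (toy value)

# Text path: token ids index rows of an embedding table.
vocab = rng.normal(size=(1000, d_model))
text_ids = np.array([17, 402, 99])   # pretend-tokenized prompt
text_emb = vocab[text_ids]           # (3, d_model)

# Image path: split a 32x32 RGB image into four 16x16 patches,
# flatten each patch, then project it into the same d_model space.
image = rng.normal(size=(32, 32, 3))
patches = image.reshape(2, 16, 2, 16, 3).transpose(0, 2, 1, 3, 4).reshape(4, -1)
proj = rng.normal(size=(patches.shape[1], d_model))
image_emb = patches @ proj           # (4, d_model)

# One sequence: from here on, attention treats every row identically,
# whether it started life as a word or a block of pixels.
sequence = np.concatenate([image_emb, text_emb])  # (7, d_model)
print(sequence.shape)
```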