DeepSeek-OCR: How Images Help Chatbots Conduct Long Conversations

Chinese AI researchers aim to keep chatbots fast and inexpensive in long conversations by storing context as images. Optical context compression could make AI assistants significantly more efficient.

(Image: Mobile phone screen with DeepSeek logo. Runrun2 / Shutterstock.com)


Chinese AI researchers want to use images to keep chatbots fast and inexpensive even in conversations with a long history. With the help of optical context compression, AI assistants could become significantly better, the developers of DeepSeek-OCR are convinced. The model is currently experimental, but despite tenfold compression the team has already demonstrated an accuracy of 97 percent.

The problem with today's AI chatbots is that they have to reprocess the entire conversation history with every response. With optical compression, the history is instead stored as an image, which requires far fewer tokens to process: roughly 100 tokens instead of 1000. That would enable up to tenfold faster response times and would also help when processing long PDF documents.
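The token arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope illustration: the names and the page-size constant are invented, and the 10:1 ratio is simply the compression factor reported by DeepSeek.

```python
# Illustrative sketch of optical context compression savings.
# All names are hypothetical; the ~10x ratio is the reported figure.

TEXT_TOKENS_PER_PAGE = 1000  # cost of one page of history as plain text
COMPRESSION_RATIO = 10       # reported optical compression factor

def vision_tokens(text_tokens: int, ratio: int = COMPRESSION_RATIO) -> int:
    """Tokens needed when the same history is rendered as an image."""
    return max(1, text_tokens // ratio)

history = 50 * TEXT_TOKENS_PER_PAGE  # a long chat, ~50 page-equivalents
print(history, "text tokens ->", vision_tokens(history), "vision tokens")
# prints: 50000 text tokens -> 5000 vision tokens
```

Since transformer inference cost grows with the number of tokens processed per response, cutting the history to a tenth is where the claimed speedup would come from.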

Via OCR (Optical Character Recognition), the AI converts the images back into text when needed. DeepSeek's approach, however, goes far beyond classic OCR: the system can not only recognize text but also convert diagrams into Excel-compatible tables, turn chemical formulas into the machine-readable SMILES format, and analyze geometric figures. In addition, it handles almost 100 languages in a single model.

The DeepSeek developers have also experimented with different resolutions and came up with the idea of imitating human memory through varying degrees of sharpness: recent context could be stored at higher resolution and would thus remain sharper in the AI's memory, while older memories would gradually fade as their resolution decreases.
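The fading-memory idea could be modeled as a simple resolution schedule. The decay curve below is entirely made up for illustration; the paper only describes the general concept of storing older context at lower resolution.

```python
# Hypothetical "fading memory" schedule: older context -> lower resolution.
# The halving decay and the specific numbers are invented for illustration.

BASE_RESOLUTION = 1024  # pixels per side for the most recent context

def resolution_for_age(age: int, decay: float = 0.5, floor: int = 64) -> int:
    """Halve the stored resolution for each step back in the history."""
    return max(floor, int(BASE_RESOLUTION * decay ** age))

for age in range(5):
    print(f"context {age} turns old -> {resolution_for_age(age)} px")
# prints 1024, 512, 256, 128, 64 px for ages 0..4
```

Lower-resolution images need fewer vision tokens, so under such a schedule the oldest parts of a conversation would cost almost nothing to keep around, at the price of becoming blurrier to the model.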


For practical application, DeepSeek has compiled extensive training data: 30 million PDF pages in around 100 languages, 20 million images of natural scenes, and millions of synthetic samples for diagrams, chemical formulas, and geometric figures. In production, the system can already process over 200,000 pages per day – with just one older Nvidia A100 accelerator. This makes it interesting for mass data processing, for example in insurance companies, authorities, or publishing houses.
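The throughput figure translates into a concrete per-second rate, which this small calculation (using only numbers from the article) makes explicit.

```python
# Rough throughput arithmetic for the reported 200,000+ pages per day
# on a single Nvidia A100 accelerator.

PAGES_PER_DAY = 200_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

pages_per_second = PAGES_PER_DAY / SECONDS_PER_DAY
print(f"{pages_per_second:.2f} pages/s")  # prints: 2.31 pages/s
```

Sustaining more than two pages every second around the clock on one older GPU is what makes the system plausible for the bulk-processing scenarios mentioned above.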

In their paper, the researchers themselves describe DeepSeek-OCR as a "preliminary exploration" and list open questions, for example how the system performs in "needle-in-a-haystack" tests, where a specific piece of information must be retrieved from a very long context.

DeepSeek is thus testing a different architectural approach for AI. The Chinese AI lab has for some time sought to build a counterweight to US AI companies such as OpenAI, Google, and Anthropic, which primarily focus on scaling. The code of DeepSeek-OCR, along with the model weights, is available for download on GitHub for anyone to try out.

(mki)


This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.