DeepSeek-OCR: Images Simplify Text for Large Language Models

DeepSeek is experimenting with an OCR model and shows that compressed images are easier on GPU memory than large numbers of text tokens.

By Dr. Christian Winkler

Many company documents are available as PDFs, but often only as scans. Although it sounds simple, converting such documents to text frequently takes considerable effort, especially when the documents have a more complex structure that needs to be preserved. Images, tables, and graphics are also common sources of errors. As a result, recent months have brought a veritable flood of OCR software built on large language models (LLMs).

The Chinese AI developer DeepSeek is now entering this field as well: following its reasoning model R1, it is releasing an experimental OCR model under the MIT license. At first glance this may be surprising, since OCR has not been DeepSeek's core competence so far. And indeed, the new model is primarily a technology demo for a new approach to document processing with large language models.

Prof. Dr. Christian Winkler

Prof. Dr. Christian Winkler specializes in the automated analysis of natural-language text (NLP). As a professor at TH Nürnberg, his research focuses on optimizing the user experience.

DeepSeek attempts to compress long text contexts into images, since this allows a higher information density with fewer tokens. DeepSeek sets high expectations and reports that the model achieves an accuracy of 97 percent even at a high compression rate (factor 10). Accuracy drops with even stronger compression, but remains relatively high. All of this is said to work faster than other OCR models, with a reported throughput of up to 200,000 pages per day on an Nvidia A100 GPU.

Large language models run into memory problems when the context of a prompt becomes very large, for example when the model has to process long texts or several documents. One reason is the key-value cache, which is needed for efficient inference and grows linearly with the context size; on top of that, the cost of the attention computation grows quadratically with the context length. Since GPU prices rise sharply with memory capacity, processing long texts is very expensive. Training such models is demanding as well, although there the bottleneck is less the memory and more the quadratically growing amount of computation. LLM providers are therefore researching intensively how to represent this context more efficiently.
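To get a feel for the numbers, a rough back-of-the-envelope calculation helps. The following sketch uses made-up but typical model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit values), not DeepSeek's actual configuration:

```python
# Rough KV-cache size estimate for a decoder-only transformer.
# All dimensions are illustrative assumptions, not a specific model.

def kv_cache_bytes(context_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Keys and values (factor 2) are cached per layer, per KV head, per token.
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_value

for ctx in (8_000, 128_000):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 1e9:.1f} GB per sequence")
# 8,000 tokens need about 1 GB, 128,000 tokens about 17 GB - per sequence.
```

If a page of text can be replaced by a tenth as many vision tokens, this memory requirement shrinks accordingly.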

This is where DeepSeek's idea of representing the context as an image comes in: images have a high information density, and vision tokens for optical compression could represent a long text with far fewer tokens. With DeepSeek-OCR, the developers have put this basic idea to the test; the model should therefore be understood as an experiment to show how well optical compression works.

The accompanying preprint consists of three parts: a quantitative analysis of how well optical compression works, a new encoder model, and the actual OCR model. The analysis shows that small language models can learn to convert compressed visual representations back into text.

To this end, the researchers developed a model called DeepEncoder, which delivers good results even on high-resolution images while keeping the number of activations low. The encoder combines window attention and global attention with a convolutional compressor. The faster window attention looks only at individual parts of the document and prepares the data, while the slower global attention considers the entire context but operates only on the compressed data. The convolutions reduce the resolution of the vision tokens and thus the memory requirements.
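The pattern can be sketched in a few lines of PyTorch. This is a toy illustration of the structure described above (local attention on the full token grid, convolutional downsampling, global attention only on the compressed tokens); the class names, dimensions, and the simplified window attention are our assumptions, not DeepSeek's actual code:

```python
import torch
import torch.nn as nn

class ConvCompressor(nn.Module):
    """Reduces the number of vision tokens by downsampling the 2D token grid."""
    def __init__(self, dim: int, factor: int = 4):
        super().__init__()
        # Strided convolution: a factor of 4 per axis cuts the token count by 16.
        self.conv = nn.Conv2d(dim, dim, kernel_size=factor, stride=factor)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # (B, H, W, C)
        x = self.conv(x.permute(0, 3, 1, 2))               # (B, C, H/f, W/f)
        return x.flatten(2).transpose(1, 2)                # (B, H*W/f^2, C)

class ToyDeepEncoder(nn.Module):
    """Window attention on the full grid, global attention after compression."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.window_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.compress = ConvCompressor(dim)
        self.global_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:  # (B, H, W, C)
        b, h, w, c = patches.shape
        # In the real model, window attention runs per local window; here it is
        # simply applied to the flattened grid to keep the sketch short.
        x = self.window_attn(patches.reshape(b, h * w, c)).reshape(b, h, w, c)
        x = self.compress(x)          # far fewer tokens from here on
        return self.global_attn(x)    # global attention sees only compressed tokens

tokens = ToyDeepEncoder()(torch.randn(1, 64, 64, 256))
print(tokens.shape)  # (1, 256, 256): 4096 patches -> 256 vision tokens
```

The point of this ordering is that the expensive global attention never sees the full grid of image patches, only the compressed vision tokens.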

DeepSeek-OCR combines the DeepEncoder with DeepSeek-3B-MoE. This LLM activates six of its 64 experts plus two shared experts at a time, adding up to 570 million active parameters. Unlike many other OCR models such as MinerU, Docling, Nanonets, and PaddleOCR, DeepSeek-OCR can also convert charts into data and recognize chemical formulas and geometric shapes. It handles mathematical formulas as well, although some of the other models can do that too, at least in part.
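The routing principle behind such a mixture-of-experts layer, where only a handful of the 64 experts plus the shared experts run per token, looks roughly like this; the layer sizes are illustrative and the loops are deliberately naive for readability, so this is not the real DeepSeek-3B-MoE code:

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Sparse MoE layer: a router picks the top-k routed experts per token,
    while the shared experts always run. Dimensions are made up."""
    def __init__(self, dim=64, n_experts=64, top_k=6, n_shared=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.shared = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_shared))
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, dim)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = sum(s(x) for s in self.shared)     # shared experts: always active
        for t in range(x.shape[0]):              # routed experts: only top-k per token
            for k in range(self.top_k):
                out[t] += weights[t, k] * self.experts[idx[t, k]](x[t])
        return out

y = ToyMoELayer()(torch.randn(4, 64))
print(y.shape)  # (4, 64); only 6 routed plus 2 shared experts ran per token
```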

However, the DeepSeek developers emphasize that these are preliminary analyses and preliminary results. It will be exciting to see how the technology develops and where it can be used. In any case, DeepSeek-OCR differs considerably from all other OCR models. To find out how well and how fast it actually works, you have to try the model yourself.

The test object is a page from an iX issue in JPEG format. DeepSeek-OCR can run in different configurations: Gundam, Large, and Tiny. In Gundam mode, the image is resized automatically. This is currently still somewhat unstable; if you get the parameters wrong, you trigger CUDA kernel errors and have to start over.

For the test, a news page from iX 6/2025 is used. It is set in three columns and contains several main headings, one subheading, and a screenshot of a diagram.

If you want to extract text from documents, you need to prompt the model appropriately. DeepSeek recommends the prompt <image><|grounding|>Convert the document to markdown. The result is Markdown output plus the extracted images in a separate folder, along with a visualization showing which fragments were recognized. In Gundam mode, this works well for the iX page:

In Gundam mode, DeepSeek-OCR recognizes all text and relevant elements and can also reconstruct the text flow of the magazine.
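Reproducing such a run takes only a few lines of Python. The sketch below follows the usual Hugging Face pattern with trust_remote_code; the infer method and its parameters come from DeepSeek's model repository and may change between versions, the mapping of Gundam mode to base_size=1024, image_size=640, and crop_mode=True is taken from the documentation there, and the file names are placeholders:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Model and prompt as published by DeepSeek; infer() is custom code shipped
# with the model, so its exact signature may differ between versions.
model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True,
                                  use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

prompt = "<image>\n<|grounding|>Convert the document to markdown."
result = model.infer(tokenizer, prompt=prompt,
                     image_file="ix_news_page.jpg",    # placeholder: the scanned page
                     output_path="./ocr_output",       # Markdown, image crops, overlay
                     base_size=1024, image_size=640,   # assumed Gundam settings
                     crop_mode=True, save_results=True)
```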

The model recognized the text practically flawlessly and took about 40 seconds on an RTX 4090. That is still far from the advertised 200,000 pages per day, but Gundam also uses a compression factor of only two: 791 image tokens correspond to 1,580 text tokens. At least the model correctly reconstructs the text flow of the article, which is a common stumbling block for other models.

At about 50 seconds, the Large variant takes only slightly longer than Gundam, but the results are much worse, possibly because of the higher compression factor: 299 image tokens correspond to 2,068 text tokens. The visualization shows this in the less accurately drawn boxes around the text, so there is still room for improvement. Moreover, the model does not recognize the text cleanly; sometimes only unreadable characters such as "¡ ¢" appear, which could indicate encoding errors or may actually be Chinese characters.

The Large mode compresses images more than Gundam, leading to less accurate recognition. The text boxes are less clearly defined, and unreadable characters appear, indicating faulty encoding.

The Tiny variant produces no unreadable characters. It runs a bit faster again at around 40 seconds and uses a compression factor of 25.8: 64 image tokens correspond to 1,652 text tokens. Because of the high compression, however, the model hallucinates heavily and generates text such as "Erweist, bei der Formulierung der Ab- fragen kann ein KI-Assistent helfen. Bis Start gilt es auf Caffès offiziell die Gewicht 50 Prozent der Früh-, der Prüfung und 50 Prozent für den Arzt- und NEUT und in Kürze folgen. (Spezielle)". This has nothing to do with the actual content, so you cannot rely on this model variant.

The Tiny variant has the highest compression factor for images and hallucinates heavily in text output. Therefore, one should not rely on the results.
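For reference, the compression factors quoted for the three runs are simply the ratio of text tokens to vision tokens; a quick calculation reproduces the reported values:

```python
# Token counts as measured in the test above: (text tokens, vision tokens).
modes = {"Gundam": (1580, 791), "Large": (2068, 299), "Tiny": (1652, 64)}
for name, (text_tokens, vision_tokens) in modes.items():
    print(f"{name:>6}: factor {text_tokens / vision_tokens:.1f}")
# Gundam: factor 2.0, Large: factor 6.9, Tiny: factor 25.8
```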

In addition to Markdown conversion, DeepSeek-OCR also offers Free OCR, which ignores the layout. In this mode the model works much faster and still produces good results, even in the Large version with high compression. However, this variant only makes sense if you know the document consists of continuous text without a complicated layout.
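If the infer call from above is reused, Free OCR amounts to nothing more than a different prompt; the exact prompt string below follows DeepSeek's published examples and may differ slightly:

```python
# Layout-free recognition: same call as before, only the prompt changes.
prompt = "<image>\nFree OCR."
result = model.infer(tokenizer, prompt=prompt,
                     image_file="ix_news_page.jpg", output_path="./ocr_output",
                     base_size=1280, image_size=1280,  # assumed Large settings
                     crop_mode=False, save_results=True)
```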

During parsing, DeepSeek-OCR recognized the images contained in the article and saved them separately. The model saves the diagram in a poorly readable resolution.

The diagram extracted with Gundam is blurry and difficult to decipher with the naked eye.

Now it gets interesting, because DeepSeek-OCR is supposed to be able to extract data from this diagram, which can be done with the prompt <image>Parse the figure. As a result, the model produces the following table:

| Statement | 2024 | 2023 | 2022 |
| --- | --- | --- | --- |
| I have a good understanding of what artificial intelligence is | 67% | 67% | 64% |
| I know which types of products and services use artificial intelligence | 52% | 51% | 50% |
| Products and services using artificial intelligence have profoundly changed my daily life in the past 3-5 years | 50% | 50% | 49% |
| Products and services using artificial intelligence will profoundly change my daily life in the next 3-5 years | 66% | 66% | 60% |
| Products and services using artificial intelligence have more benefits than drawbacks | 55% | 54% | 52% |
| I trust people not to discriminate or show bias toward any group of people | 45% | 45% | 44% |
| I trust artificial intelligence to not discriminate or show bias toward any group of people | 54% | 54% | 50% |
| I trust that companies that use artificial intelligence will protect my personal data | 47% | 50% | 50% |
| Products and services using artificial intelligence make me nervous | 39% | 39% | 39% |

A few errors have crept into the table, but at least the model has correctly deciphered the blurred text. This shows the strength of the encoder, although the English labels also make the task easier for the model. Most of the percentage values are correct, as is the structure of the data. Using a higher resolution, however, improves the results only marginally.

In addition to diagrams, DeepSeek-OCR can also recognize mathematical formulas and convert them into LaTeX syntax. It also has chemical structure formulas in its repertoire and converts them into SMILES format.

DeepSeek has once again come up with an exciting technical approach and demonstrated it convincingly with DeepSeek-OCR. Text recognition works well, especially in Gundam mode, and the parsing of diagrams is also convincing. However, other models such as MinerU, Nanonets, and PaddleOCR-VL are also very good at pure text recognition and sometimes even deliver better results, for example by rejoining words that were split. The brand-new PaddleOCR-VL is particularly noteworthy: it reliably extracts data from diagrams and even performed better than DeepSeek-OCR in our own tests. A real race has broken out in OCR.

With this model, however, DeepSeek does not seem to be aiming at OCR alone; it also wants to show that vision tokens are a particularly compact way of representing context in large language models. At low compression this already works well, but at higher compression the results suffer noticeably. The approach is still at a very early stage.

DeepSeek-OCR is comparatively fast in all configurations: experiments with MinerU, Nanonets, and PaddleOCR-VL were all at least 50 percent slower. Nanonets did produce a table from the diagram, albeit without the years, while it recognized the continuous text noticeably better. The brand-new PaddleOCR-VL even recognized the diagram better than DeepSeek-OCR, but it is not trained for chemical structure formulas and similar content.

DeepSeek-OCR is – as clearly stated by the developers – a technology demonstration that already works extremely well. It remains to be seen how the technology can be integrated into classic LLMs and used there for more efficient processing of longer contexts.

Further information can be found on GitHub, Hugging Face, and in the arXiv preprint.

(mack)

This article was originally published in German. It was translated with technical assistance and editorially reviewed before publication.