DeepSeek-OCR: Images Simplify Text for Large Language Models
DeepSeek is experimenting with an OCR model and shows that compressed images are more memory-friendly for GPU computation than long sequences of text tokens.
- Dr. Christian Winkler
Many company documents are available as PDFs, but often only as scans. Although it sounds simple, converting such documents to text frequently takes considerable effort, especially when the documents have a complex structure that must be preserved. Images, tables, and graphics are further common sources of errors. In recent months, this has led to a veritable flood of OCR software built on large language models (LLMs).
Chinese AI developer DeepSeek is now also entering this field and, following its reasoning model R1, has released an experimental OCR model under the MIT license. At first glance this is surprising, since OCR has not been DeepSeek's core competence so far. And indeed, the new model is primarily a technology demo for a new approach to document processing with large language models.
DeepSeek attempts to compress long text contexts into images, because this allows a higher information density with fewer tokens. DeepSeek sets high expectations and reports that the model achieves an accuracy of 97 percent even at a high compression rate (a factor of 10). Accuracy decreases with even stronger compression but remains relatively high. All of this is said to work faster than other OCR models and to process up to 200,000 pages per day on an Nvidia A100 GPU.
The Context Problem
Large language models run into memory problems when the prompt context becomes very large, for example when a model has to process long texts or several documents at once. The reason is the key-value cache, which is essential for efficient inference and grows linearly with the context length; GPU cost in turn rises steeply with memory, so long contexts are expensive to process. Training such models is demanding as well, less because of memory and more because of the quadratically growing cost of the attention computation. LLM providers are therefore researching intensively how to represent this context more efficiently.
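How quickly this becomes a problem is shown by a back-of-the-envelope calculation; the model dimensions in the following sketch are illustrative assumptions, not those of a specific DeepSeek model:

```python
# Rough estimate of KV-cache memory for a decoder-only transformer.
# All model dimensions below are illustrative assumptions.
def kv_cache_bytes(context_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:  # 2 bytes for bf16/fp16
    # Per token and layer, the cache stores one key and one value vector
    # for every KV head, hence the factor of 2.
    return context_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_value

for ctx in (8_000, 128_000, 1_000_000):
    print(f"{ctx:>9} tokens -> {kv_cache_bytes(ctx) / 2**30:6.1f} GiB KV cache")
```

The cache grows in step with the context, so every token saved by a more compact representation directly reduces GPU memory.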
This is where DeepSeek introduces the idea of representing the context as an image: images have a high information density, and vision tokens obtained through optical compression could represent a long text with far fewer tokens. With DeepSeek-OCR, the developers have put this basic idea to the test - the model should therefore be understood as an experiment to show how well optical compression works.
The Model Architecture
The accompanying preprint consists of three parts: a quantitative analysis of how well optical compression works, a new encoder model, and the actual OCR model. The analysis shows that even small language models can learn to convert compressed visual representations back into text.
To this end, the researchers developed an encoder called DeepEncoder, which achieves good results on high-resolution images while keeping the number of activations low. It combines window and global attention with a convolutional compressor. The fast window attention looks only at individual parts of the document and prepares the data, while the slower global attention considers the entire context but operates only on the compressed data. The convolutions reduce the resolution of the vision-token grid and with it the memory requirements.
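The principle can be sketched in a few lines of PyTorch. This is a conceptual illustration only, not the actual DeepEncoder: the layer sizes, the 16x reduction factor, and the use of a plain transformer layer in place of real window attention are all assumptions.

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Conceptual sketch of the DeepEncoder idea: cheap local mixing on many
    patch tokens, convolutional downsampling, then global attention on the
    much smaller set of vision tokens. Dimensions are illustrative."""

    def __init__(self, dim: int = 768, grid: int = 64):
        super().__init__()
        self.grid = grid  # e.g. a 64x64 grid of patch tokens from a high-resolution page
        # stand-in for the window-attention stage: local, relatively cheap mixing
        self.local = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # convolutional compressor: halves the grid twice, i.e. 16x fewer tokens
        self.compress = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )
        # global attention only ever sees the compressed tokens
        self.global_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, grid*grid, dim)
        x = self.local(patch_tokens)
        b, n, d = x.shape
        x = x.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        x = self.compress(x)              # (batch, dim, grid/4, grid/4)
        x = x.flatten(2).transpose(1, 2)  # (batch, grid*grid/16, dim)
        return self.global_attn(x)

tokens = torch.randn(1, 64 * 64, 768)     # 4096 patch tokens in
print(EncoderSketch()(tokens).shape)      # torch.Size([1, 256, 768]): 256 vision tokens out
```

The expensive global attention thus only ever runs on the small, compressed token set, which is where the memory savings come from.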
DeepSeek-OCR combines the DeepEncoder with DeepSeek-3B-MoE. This LLM activates six of its 64 experts plus two shared experts at a time, which adds up to around 570 million active parameters. Unlike many other OCR models such as MinerU, Docling, Nanonets, or PaddleOCR, DeepSeek-OCR can also convert charts into data and recognize chemical formulas and geometric shapes. It handles mathematical formulas as well, although some of the other models can do that too.
However, the DeepSeek developers emphasize that these are preliminary analyses and results. It will be exciting to see how the technology develops and where it can be used. In any case, DeepSeek-OCR differs considerably from all other OCR models. To find out how well and how fast it actually works, you have to try it yourself.
DeepSeek-OCR Tried Out
The test object is a page from an iX issue in JPEG format. DeepSeek-OCR can operate in different configurations, including Gundam, Large, and Tiny. In Gundam mode, the image is resized automatically. This is currently still somewhat unstable: if you get the parameters wrong, you produce CUDA kernel errors and have to start over.
To extract text from documents, you need to prompt the model appropriately. DeepSeek recommends the command <image><|grounding|>Convert the document to markdown.. The result is Markdown output plus extracted images in a folder, as well as a visual overview of which fragments were recognized. In Gundam mode, this works well for the iX page.
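For reference, a minimal sketch of such a run, following the example on the Hugging Face model card; the file paths are placeholders, and the infer() helper and the Gundam-mode values used here come from the model's remote code and may change between revisions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Minimal sketch following the Hugging Face model card; file paths are
# placeholders, details may differ between model revisions.
model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True,
                                  use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

prompt = "<image>\n<|grounding|>Convert the document to markdown. "
model.infer(tokenizer,
            prompt=prompt,
            image_file="ix_page.jpg",        # the scanned magazine page (placeholder name)
            output_path="./ocr_output",      # Markdown, image crops and layout boxes land here
            base_size=1024, image_size=640,  # assumed Gundam-mode settings
            crop_mode=True,
            save_results=True)
```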
The model recognized the text practically flawlessly and took about 40 seconds on an RTX 4090. That is still far from the advertised 200,000 pages per day, but Gundam also uses a compression factor of only two: 791 image tokens correspond to 1,580 text tokens. At least the model correctly recognizes the text flow of the article, a common problem with other models.
At about 50 seconds, the Large variant takes only slightly longer than Gundam, but the results are much worse, possibly because of the higher compression factor: 299 image tokens correspond to 2,068 text tokens. The image shows this in the less accurately recognized boxes around the text; there is still room for improvement here. Furthermore, the model does not recognize the text cleanly; sometimes only unreadable characters such as "¡ ¢" appear, which could be encoding errors or, in fact, Chinese characters.
The Tiny model produces no such unreadable characters. At around 40 seconds it is again a bit faster and uses a compression factor of 25.8: 64 image tokens correspond to 1,652 text tokens. Due to the high compression, however, the model hallucinates heavily and generates text like "Erweist, bei der Formulierung der Ab- fragen kann ein KI-Assistent helfen. Bis Start gilt es auf Caffès offiziell die Gewicht 50 Prozent der Früh-, der Prüfung und 50 Prozent für den Arzt- und NEUT und in Kürze folgen. (Spezielle)". This has nothing to do with the content of the page, so this model variant cannot be relied on.
In addition to Markdown conversion, DeepSeek-OCR also offers Free OCR, which ignores the layout. In this mode the model works much faster and still produces good results, even in the Large configuration with its higher compression. However, this variant only makes sense if you know the input is continuous text without a difficult layout.
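The mode of operation is controlled entirely by the prompt. Sticking with the placeholder names from the sketch above, the layout-free variant and the figure parsing used below look roughly like this; the Large-mode resolution values are assumptions:

```python
# Layout-free recognition ("Free OCR") and figure parsing differ only in the prompt.
free_ocr_prompt = "<image>\nFree OCR. "
figure_prompt = "<image>\nParse the figure. "  # applied later to the extracted diagram

model.infer(tokenizer, prompt=free_ocr_prompt,
            image_file="ix_page.jpg", output_path="./ocr_output_free",
            base_size=1280, image_size=1280, crop_mode=False,  # assumed Large settings
            save_results=True)
```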
Extracting Information from Graphics
During parsing, DeepSeek-OCR recognized the images contained in the article and saved them separately. However, it stores the diagram only at a barely readable resolution.
Now it gets exciting, because DeepSeek-OCR is supposed to be able to extract data from this diagram, which can be done with the prompt <image>Parse the figure.. As a result, the model creates the following table:
| | 2024 | 2023 | 2022 |
| --- | --- | --- | --- |
| I have a good understanding of what artificial intelligence is | 67% | 67% | 64% |
| I know which types of products and services use artificial intelligence | 52% | 51% | 50% |
| Products and services using artificial intelligence have profoundly changed my daily life in the past 3-5 years | 50% | 50% | 49% |
| Products and services using artificial intelligence will profoundly change my daily life in the next 3-5 years | 66% | 66% | 60% |
| Products and services using artificial intelligence have more benefits than drawbacks | 55% | 54% | 52% |
| I trust people not to discriminate or show bias toward any group of people | 45% | 45% | 44% |
| I trust artificial intelligence to not discriminate or show bias toward any group of people | 54% | 54% | 50% |
| I trust that companies that use artificial intelligence will protect my personal data | 47% | 50% | 50% |
| Products and services using artificial intelligence make me nervous | 39% | 39% | 39% |
Some errors have crept into the table, but the model has at least correctly recognized the blurred text. This demonstrates the strength of the encoder, although the English labels also make the task easier for the model. Most of the percentage values are correct, as is the structure of the data. Using a higher resolution, however, only improves the results marginally.
In addition to diagrams, DeepSeek-OCR can also recognize mathematical formulas and convert them into LaTeX syntax. It also has chemical structural formulas in its repertoire, converting them into the SMILES format.
Conclusion
DeepSeek has once again come up with an exciting technical approach and demonstrated it convincingly with DeepSeek-OCR. Text recognition works well, especially in Gundam mode, and the parsing of diagrams is also convincing. However, other models such as MinerU, Nanonets, and PaddleOCR-VL are also very good at pure text recognition and sometimes deliver even better results, for example by merging words split across lines. The brand-new PaddleOCR-VL is particularly noteworthy: it reliably extracts data from diagrams and even performed better than DeepSeek-OCR in our own tests. A real race has broken out in OCR.
With this model, however, DeepSeek seems to be aiming not only at OCR but also wants to show that vision tokens are a good way to store context in large language models in a particularly compact form. At low compression this already works well, but at higher compression the results suffer noticeably. The approach is still at a very early stage.
DeepSeek-OCR is comparatively fast in all configurations; experiments with MinerU, Nanonets, and PaddleOCR-VL were all at least 50 percent slower. Nanonets did produce a table from the diagram, but without the years, although it recognized the continuous text much better. The brand-new PaddleOCR-VL even recognized the diagram better than DeepSeek-OCR, but it is not trained for chemical structural formulas and similar content.
DeepSeek-OCR is – as clearly stated by the developers – a technology demonstration that already works extremely well. It remains to be seen how the technology can be integrated into classic LLMs and used there for more efficient processing of longer contexts.
Further information can be found on GitHub, Hugging Face, and in the arXiv preprint.
(mack)