Wikidata, the world's largest structured knowledge database, offers MCP access
Wikimedia Deutschland, the operator of Wikidata, is making its structured data available as vector embeddings that LLMs can use via MCP.
Wikidata, Wikimedia Deutschland's database for structured knowledge, will offer a freely accessible interface for LLMs in future. The project has vectorised its data and stores the resulting embeddings in a vector database, which developers can connect to LLMs via Retrieval Augmented Generation (RAG) and the Model Context Protocol (MCP).
According to the operators, Wikidata is the world's largest open knowledge graph, containing around 119 million entries; around 24,000 volunteers worldwide maintain it every month. The database holds structured data from Wikimedia projects such as Wikipedia, Wikivoyage and Wikisource in graph form.
Open access to this data is intended to improve the quality of LLMs by providing them with access to structured, up-to-date and verified knowledge via RAG. This can reduce incorrect answers and hallucinations. Wikimedia sees potential applications such as fact checks or tools to combat vandalism.
Combination of graph and vector searches
The operators recommend using the semantic vector search to identify the correct data records and then querying the graph database to use that knowledge in a structured way (GraphRAG). In addition to the vector search, there is a keyword search function and descriptive queries for the precise identification of terms. The system combines these approaches, which should make queries more convenient and more likely to succeed.
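The recommended two-step pattern can be illustrated with a minimal, self-contained sketch. It does not use the actual Wikidata vector database or its MCP interface: the three-dimensional embeddings, the in-memory graph and the function names `vector_search` and `graph_lookup` are all hypothetical stand-ins chosen for illustration. Only the item and property identifiers follow Wikidata's naming scheme.

```python
import math

# Hypothetical toy embeddings keyed by Wikidata item ID.
# Real embeddings have hundreds of dimensions; these values are made up.
EMBEDDINGS = {
    "Q42": [0.9, 0.1, 0.0],   # Douglas Adams
    "Q64": [0.1, 0.9, 0.1],   # Berlin
    "Q183": [0.2, 0.8, 0.3],  # Germany
}

# Hypothetical in-memory excerpt of the knowledge graph:
# item ID -> {property ID -> list of target item IDs}.
GRAPH = {
    "Q42": {"P31": ["Q5"]},      # instance of: human
    "Q64": {"P17": ["Q183"]},    # country: Germany
    "Q183": {"P36": ["Q64"]},    # capital: Berlin
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_search(query_vector, k=2):
    """Step 1: return the IDs of the k items most similar to the query."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda qid: cosine(query_vector, EMBEDDINGS[qid]),
        reverse=True,
    )
    return ranked[:k]


def graph_lookup(qid):
    """Step 2: return the structured statements of an item as triples."""
    return [
        (qid, prop, target)
        for prop, targets in GRAPH.get(qid, {}).items()
        for target in targets
    ]


if __name__ == "__main__":
    # Assumed query embedding; in practice it would come from the same
    # embedding model that was used to vectorise the data.
    query = [0.1, 0.9, 0.1]
    for qid in vector_search(query, k=2):
        print(qid, graph_lookup(qid))
```

The vector step only has to find plausible candidates; the precise facts handed to the LLM then come from the graph statements rather than from the similarity score.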
Wikidata can also be identified as the source so that users can see where the search results come from. The vector database currently supports search queries in English, French, and Arabic. The operator plans to add Spanish and Mandarin by the end of the year. Other languages are to follow.
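How source attribution might look in a RAG application can be sketched as follows. The `Hit` record and the `build_context` helper are hypothetical and not part of the Wikidata project's code; the sketch only assumes that each search result carries a Wikidata item ID, which maps to the public item page at `https://www.wikidata.org/wiki/<ID>`.

```python
from dataclasses import dataclass

# Public URL pattern for Wikidata item pages.
WIKIDATA_ENTITY_URL = "https://www.wikidata.org/wiki/{qid}"


@dataclass
class Hit:
    """Hypothetical search result: item ID plus label and description."""
    qid: str
    label: str
    description: str


def build_context(hits):
    """Format search results as numbered context lines with their source."""
    lines = []
    for i, hit in enumerate(hits, start=1):
        url = WIKIDATA_ENTITY_URL.format(qid=hit.qid)
        lines.append(
            f"[{i}] {hit.label}: {hit.description} (source: Wikidata, {url})"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    hits = [Hit("Q64", "Berlin", "capital of Germany")]
    # The resulting text would be placed in the LLM prompt so that the
    # answer can cite the numbered sources.
    print(build_context(hits))
```

Keeping the item URL next to each fact lets the application show users where a statement comes from.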
The embedding project has been in development since September 2024 with two partners: Jina AI transforms the Wikidata data into vectors, which are then stored in the Astra DB vector database from DataStax. The application's source code is available under the open MIT licence.
Response to the big tech companies
Wikimedia also emphasises a social aspect: the new technology is intended to give developers worldwide the means to make LLMs more transparent, reliable and fair, and thus to create a counterbalance to the offerings of large tech companies. Thanks to the work of a large international community of volunteers, Wikidata can also cover underrepresented topics and perspectives and thus provide a more diverse data foundation for generative AI development.
Interested parties can pick up practical tips and application examples in a free webinar on 9 October.
(who)