Company News

Announcing Documentalist

AI for Knowledge Workers

Imagine someone on your team who has memorised all company files and can retrieve any information from them, in most languages. Someone who can help you with tasks involving large quantities of unstructured information, and who has no workload limit.

Meet Documentalist: a knowledge engine with the following components, leveraging the latest advances in AI.

Knowledge Base: an AI-enriched index of your files, turning them into a memory.
Knowledge Search: a next generation search engine, to explore this memory.
Knowledge Work: a platform to automate knowledge-intensive tasks.

Privacy: we manage a private enclave for each customer, which can even be deployed on-prem.

Activate Corporate Memory

Companies have no memory

In most companies, employees produce knowledge content every day that's useful to them, but lost to others. Corporate directories are like giant libraries full of books nobody will ever read again, despite their value… It's as if the Library of Alexandria were burning every day.

And knowledge management is a struggle that costs billions, in the form of hours spent searching for information (reportedly two hours a day for knowledge workers). Current knowledge management solutions either fail for lack of perseverance, or succeed at great cost (for reference, BCG has over 1,000 people in its Knowledge team).

Poor knowledge management is not only about wasting time searching for internal documentation and re-inventing the wheel. It's also about missed opportunities to extract new value from that information, in the form of repurposed content, new insights, improved customer relationships, or better products.

AI can change that

With the latest advances in Natural Language Processing and Generative AI though, it is clear that we can rethink the whole approach to organising, retrieving and generating information. Technically, and economically.

AI models can understand concepts within text, image, videos, and audio files, in most languages. We can use them to make knowledge management more affordable (removing most of the need for manual classification, tagging, etc) and to build new capabilities (automating increasingly sophisticated knowledge tasks).

The only way to access documents at scale while maintaining accuracy is through a well-maintained index. That's where we shall start.

Knowledge Base

Documentalist synchronises with your files: text files (PDFs, Word, Docs, Slides, …), images, audio, and video. It then processes and indexes them in different databases, using AI models to capture their content and the relations between their topics and entities.

Under the hood:

  • Dense Embedding: a compact representation learned by an AI model that captures the meaning of a piece of text, or of an image, in the form of a vector.
  • Sparse Embedding: a sparse representation learned by an AI model that combines keyword-level matching with semantic abilities. Pioneered at La Sorbonne [1].
  • Embedding Indexing: efficient indexing of vector representations (dense or sparse) that makes them searchable in milliseconds (see the sketch after this list).
  • Named Entity Recognition: an AI model that automatically labels Organizations, People and Locations.
  • Metadata: the metadata associated with the file (name, date, collaborators, …), plus 3rd-party metadata, associated with the file but stored in 3rd-party systems (e.g. Salesforce).
  • Knowledge Graph: uncovers relationships between entities across files, using LLMs to automate graph generation [2].
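
To make this concrete, here is a minimal sketch of a single indexing step, combining a dense embedding model, named entity recognition, and a vector index. The libraries and model names (sentence-transformers, spaCy, FAISS) are public stand-ins chosen for illustration, not the components Documentalist actually ships, and the documents are invented.

    # Minimal indexing sketch: dense embeddings + named entities, stored in a vector index.
    # Requires: pip install sentence-transformers spacy faiss-cpu
    #           python -m spacy download en_core_web_sm
    import faiss
    import numpy as np
    import spacy
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative dense embedding model
    ner = spacy.load("en_core_web_sm")                   # illustrative NER model

    documents = [
        "Acme Corp signed a partnership with Globex in Paris in March 2023.",
        "The quarterly report highlights rising cloud costs across the EU region.",
    ]

    # 1. Dense embeddings: one vector per document, normalised for cosine similarity.
    vectors = embedder.encode(documents, normalize_embeddings=True).astype("float32")

    # 2. Named entities: organisations, people and locations extracted per document.
    entities = [[(ent.text, ent.label_) for ent in ner(doc).ents] for doc in documents]

    # 3. Vector index: inner product on normalised vectors equals cosine similarity.
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)

    # A query is embedded the same way and matched against the index in milliseconds.
    query = embedder.encode(["Who did Acme partner with?"], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(query, 1)
    print(documents[ids[0][0]], entities[ids[0][0]], float(scores[0][0]))

The same pattern extends to sparse embeddings, metadata, and graph edges, with each file's representations stored alongside one another.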

You can explore our blog (or contact us) for more details.

With that, Documentalist maintains an AI-enriched index of your files: your corporate memory.

The next step is to design applications that tap into this memory. The first should obviously be a search engine: it is already immensely useful on its own, and it will be the basis for the "knowledge work" use cases.

Knowledge Search

Both coming from Google, we have been deeply influenced by its mission to make information accessible and useful. How come we have such a great tool for searching the web, yet such poor corporate search engines? We can definitely do better, leveraging the following capabilities:

  • Keyword search, of course: quickly locate specific information.
  • Dense-vector search: semantic matching of queries to relevant passages.
  • Sparse-vector search: matching individual keywords while keeping semantic relevance.
  • Graph search: improving relevance by leveraging relationships between data points (people, places, items, concepts) across files, e.g. to find subject matter experts within the company.
  • Re-ranking models: proprietary cross-encoders trained on domain-specific datasets.
  • Query intent models: map the user's intent to the relevant search method (keyword, dense, sparse, or graph). A sketch of how these signals combine follows below.
January 2023: we are shipping v0.1 of our search engine to one of our design partners (a fiscal consulting firm). It does not have graph search capabilities yet; we are focusing on our sparse embedding and re-ranking models.
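
To give an idea of how these signals combine, here is a minimal hybrid-retrieval sketch: BM25 keyword scores and dense-vector scores produce candidates, and a cross-encoder re-ranks them. The public models and the passages below are illustrative stand-ins, not our proprietary components.

    # Hybrid retrieval sketch: keyword + dense candidates, re-ranked by a cross-encoder.
    # Requires: pip install rank-bm25 sentence-transformers
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import CrossEncoder, SentenceTransformer

    passages = [
        "VAT exemptions for non-profit organisations were updated in 2022.",
        "The client requested a transfer-pricing review for its German subsidiary.",
        "Our template for responding to tax-authority audits is on the shared drive.",
    ]
    query = "how do we handle a tax audit request"

    # Keyword side: BM25 over whitespace-tokenised passages.
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    kw_scores = bm25.get_scores(query.lower().split())

    # Dense side: cosine similarity between normalised embeddings.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = embedder.encode(passages + [query], normalize_embeddings=True)
    dense_scores = emb[:-1] @ emb[-1]

    # Union of the top candidates from both retrievers...
    candidates = sorted(set(np.argsort(kw_scores)[-2:]) | set(np.argsort(dense_scores)[-2:]))

    # ...then a cross-encoder produces the final ranking.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, passages[i]) for i in candidates])
    for score, i in sorted(zip(scores, candidates), reverse=True):
        print(f"{score:.3f}  {passages[i]}")

In the real system, a query intent model decides which retrievers to call in the first place, and sparse embeddings complement plain keyword matching.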

Knowledge Work

Once you can effectively retrieve relevant information from your files, you can pass it as input to different AI models to perform actions or produce new outputs.

The use cases are countless. To list a few:

  • Say you are working on an RFP: you can ask Documentalist to read every document in a given corpus, take notes of relevant passages in a scratchpad [3], jump from one document to another using fine-tuned multi-hop reasoning models [4], and finally prepare a draft answer with transparency on its thought process and the sources used.
  • Or you work in CRM: you ask Documentalist to read transcripts of recorded customer calls and tickets, apply your very own scoring system [5] to sort them all, and extract the most relevant topics (see the sketch after this list).
  • Or you provide Documentalist with a few lines of text for your next customer meeting, and ask it to identify relevant slides you can re-use, using a vision-instruction model like "LLaVA" tuned on corporate data.
  • Or check which contracts have clauses conflicting with some new regulation, reusing multi-hop reasoning components with direct document comparison in the prompt (leading to O(N²) calls).
  • Or… many other use cases our first customers are getting creative with. What's yours?
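
As an illustration of the scoring idea, here is a minimal LLM-as-a-judge sketch for call transcripts. It assumes an OpenAI-compatible endpoint serving an open-weight model inside the enclave; the endpoint URL, model name, rubric, and transcripts are hypothetical.

    # LLM-as-a-judge sketch: score transcripts against a custom rubric, then sort them.
    import json
    from openai import OpenAI

    # Hypothetical OpenAI-compatible endpoint served inside the customer enclave.
    client = OpenAI(base_url="http://llm.internal:8000/v1", api_key="unused")

    RUBRIC = """Score the transcript from 1 (poor) to 5 (excellent) on:
    - churn_risk: likelihood the customer leaves
    - upsell_opportunity: openness to additional products
    Return JSON only: {"churn_risk": int, "upsell_opportunity": int, "key_topics": [str]}"""

    def score_transcript(transcript: str) -> dict:
        # temperature=0 keeps the judge deterministic; the (hypothetical) judge model
        # is assumed to be fine-tuned to return strict JSON matching the rubric.
        response = client.chat.completions.create(
            model="open-weight-judge",
            temperature=0,
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": transcript},
            ],
        )
        return json.loads(response.choices[0].message.content)

    calls = [
        "Customer asked twice about cancelling before the renewal date...",
        "Customer praised support and asked about the premium tier...",
    ]
    scores = [score_transcript(call) for call in calls]

    # Surface the highest churn risk first.
    for call, s in sorted(zip(calls, scores), key=lambda x: -x[1]["churn_risk"]):
        print(s["churn_risk"], s["key_topics"], call[:60])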

No doubt it will soon be possible to delegate such complex knowledge tasks to machines. Every employee or team will be able to program such tasks with natural language instructions. It will be a bit like using Excel today, and will certainly reduce the need for ad-hoc software.

To unlock such advanced reasoning use cases, companies will increasingly use methods like chain-of-thought prompting, custom reward models, or even neuro-symbolic engines. There won't be a one-size-fits-all model, but multiple task-specialised models that have to be orchestrated.

But for enterprises to embrace such capabilities, conditions must be met:
  • Privacy guardrails, with a special European flavour.
  • Transparency: model weights have to be accessible.
  • Compute: costs must be kept under control.

Privacy

Wait, what!? Indexing all corporate knowledge? Where?

Enterprise SaaS solutions usually rely on multi-tenant infrastructure to provide their service at a lower cost, but that poses security and privacy risks. Today, we can ensure better privacy guardrails while still mutualising the stateless parts of our solution. In other words, each customer has its own database container, its own persistent disks, its own API container, and so forth. All of these containers run inside your enclave, with strict networking rules allowing only your containers to talk to each other.
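
As a sketch of what such networking rules can look like, here is a Kubernetes NetworkPolicy written with the official Python client: it restricts every pod in a customer namespace to talking only to pods in that same namespace. The namespace and policy names are illustrative, not our production manifests, and a real deployment would also need to allow DNS and other system traffic.

    # Sketch: lock a customer namespace down to itself (ingress and egress).
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when run inside the cluster

    # An empty label selector matches every pod in the namespace.
    all_pods = client.V1LabelSelector()

    policy = client.V1NetworkPolicy(
        metadata=client.V1ObjectMeta(name="enclave-isolation", namespace="customer-acme"),
        spec=client.V1NetworkPolicySpec(
            pod_selector=all_pods,
            policy_types=["Ingress", "Egress"],
            # Only allow traffic to and from pods in the same namespace.
            ingress=[client.V1NetworkPolicyIngressRule(
                _from=[client.V1NetworkPolicyPeer(pod_selector=client.V1LabelSelector())],
            )],
            egress=[client.V1NetworkPolicyEgressRule(
                to=[client.V1NetworkPolicyPeer(pod_selector=client.V1LabelSelector())],
            )],
        ),
    )

    client.NetworkingV1Api().create_namespaced_network_policy(
        namespace="customer-acme", body=policy
    )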

  • Our solution is built on CNCF graduated projects, among them: Kubernetes, Helm, Cilium, Longhorn and Open Policy Agent.
  • We offer managed vGPUs with the same level of privacy as our enclave, by leveraging the Nvidia MIG feature [6].
  • Our datacenter provider is Scaleway FR; we use machines in the Paris datacenter.
  • Run your enclave where you want: opt for our cost-effective managed Kubernetes cluster, or choose on-premise deployment.

Open-Source models = Transparency + Performance - Costs

Transparency + Performance: Open-weight models let users evaluate them, which makes them essential for transparency, and their performance is catching up with the most famous closed-source models (e.g. GPT-4, Claude, Gemini). They can now be considered strong foundation models, ready to be specialised (fine-tuned) for compute-intensive custom applications.

Costs: The amount of compute necessary to perform a given task is falling at astonishing rates. Moreover, running open-weight models on our own servers allows a tradeoff between quality and cost. Adapting the model to the task is also cost-effective: for many tasks it makes no sense to use large language models (neither economically nor ecologically).

The "cost of intelligence" is falling faster than the famous Moore's law, suggesting more and more use cases will be explorable in the years ahead, leveraging open source technologies for complete transparency and cost control.

Become our design partner!

If you read this far, then you might be interested in working with us!

We are looking for companies willing to embark on an R&D journey with us: we build your Knowledge Base, provide you with our Knowledge Search engine, and together we design a Knowledge Work use case, which we then integrate into our product and maintain over time.

We are mostly interested in design partners from industries where employees handle large amounts of unstructured information, such as consulting, legal, creative, finance & insurance. But our Enterprise Search solution can benefit any company that has been struggling with knowledge management.

Let's chat!

Footnotes

  1. Sparse embeddings have been getting attention since SPLADE, by Formal et al. (2021) at La Sorbonne. We have started contributing to the open-source pg_sparse extension for PostgreSQL, a promising avenue for the industry.
  2. Inspired by advances in generating knowledge graphs from scratch with LLMs, e.g. Exploring Large Language Models for Knowledge Graph Completion by L. Yao, 2023.
  3. The use of a scratchpad can be seen as a form of adaptive compute; the study that inspires us most was published by Google researchers: Scratchpad for Intermediate Computation by Nye et al., 2021.
  4. Complex questions require multiple pieces of information from multiple documents in a very large search space. This has been studied under the term multi-hop reasoning; the paper that inspires us most was published by Stanford researchers: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval by Khattab et al., 2021.
  5. Our idea is to leverage strong LLMs as a scalable way to approximate human preferences in complex scoring/evaluation scenarios, as studied in LLM-as-a-Judge by L. Zheng, 2023. Last week, Meta researchers reused this technique in Self-Rewarding Language Models by W. Yuan, 2024, showing that a model's ability to judge outputs can exceed its ability to produce them, leading to a virtuous circle where the model improves itself significantly.
  6. MIG stands for Multi-Instance GPU; it allows us to dedicate a slice of an H100 to your enclave, with fully isolated compute and privacy.