What is a Corpus?

A corpus is a collection of documents used to inform your AI agent, allowing it to produce personalized responses and actions.

The problem

AI models like ChatGPT are trained on vast amounts of data spanning everything from cat pictures to Reddit threads about baking cookies. Often this data is not directly helpful for organizational needs like technical writing or answering questions about a specific set of documentation.

Additionally, a foundation model's training data is frozen at a certain date, so the model is unaware of anything published about a topic after that date. As a result, answers are often incomplete or misleading.

Solution: Extending AI context

A Corpus is built on retrieval-augmented generation (RAG), a method for extending AI context to include relevant, targeted information about a specific topic. Katara allows organizations to use data loader agents to pull information from live sources such as Discord, Slack, websites, and GitHub. This data is collected and categorized in your Corpus.
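To make the idea of extending AI context concrete, here is a minimal sketch of the retrieval step in Python. It is not Katara's implementation: the in-memory `CORPUS`, the `retrieve` and `build_prompt` helpers, and the sample documents are all hypothetical, and simple keyword overlap stands in for the embedding-based similarity search a real RAG system would more likely use.

```python
import re
from collections import Counter

# Hypothetical documents, standing in for data pulled by loader agents.
CORPUS = [
    {"source": "github", "text": "Release 2.1 adds a retry flag to the deploy command."},
    {"source": "slack", "text": "The deploy command now requires an API token."},
    {"source": "website", "text": "Our office is closed on public holidays."},
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, text):
    """Naive relevance: number of word occurrences shared by query and text."""
    q, d = Counter(tokenize(query)), Counter(tokenize(text))
    return sum(min(q[w], d[w]) for w in q)

def retrieve(query, corpus, k=2):
    """Return up to k documents with a non-zero score, best match first."""
    ranked = sorted(corpus, key=lambda doc: score(query, doc["text"]), reverse=True)
    return [doc for doc in ranked[:k] if score(query, doc["text"]) > 0]

def build_prompt(query, corpus, k=2):
    """Prepend the retrieved documents to the question as extra context."""
    context = "\n".join(f"[{d['source']}] {d['text']}" for d in retrieve(query, corpus, k))
    return f"Use only the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How do I use the deploy command?", CORPUS)
```

The resulting `prompt` is what would be sent to the model: the question plus the most relevant corpus entries, so the answer can draw on information the model was never trained on.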

Key Concepts

To effectively manage your corpus, you should understand how Katara handles document access and organization:

  • Collections: Organize your documents into logical groups to scope AI context.

  • Document Ownership: Understand who is responsible for each document and how ownership is assigned.

  • Sharing: Learn how to grant access to specific users within your organization.

  • Visibility: Discover the rules that determine who can see and search for your documents.

  • Sensitivity Classification: Protect your most confidential data with clearance-based access.

Generative agents use the corpus to inform their answers and content, which is critical to producing meaningful and accurate results. You can then periodically or automatically refresh the links to pull in the latest data about the topic.
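As a rough illustration of how the access concepts listed above might combine, the sketch below filters documents by collection, ownership or sharing, and clearance. Everything in it is an assumption made for illustration: the `Document` and `User` classes, the numeric sensitivity scale, and the rule in `can_view` are hypothetical, and Katara's actual visibility rules may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    title: str
    collection: str
    owner: str
    sensitivity: int = 0  # hypothetical scale: 0 = public, higher = more restricted
    shared_with: set = field(default_factory=set)

@dataclass
class User:
    name: str
    clearance: int = 0  # hypothetical: must be >= a document's sensitivity

def can_view(user, doc):
    """Assumed rule: enough clearance AND (owner OR explicitly shared)."""
    if user.clearance < doc.sensitivity:
        return False
    return doc.owner == user.name or user.name in doc.shared_with

def scoped_context(user, docs, collection):
    """Titles in one collection that this user is allowed to see."""
    return [d.title for d in docs if d.collection == collection and can_view(user, d)]

DOCS = [
    Document("Public API guide", "api-docs", owner="bob", shared_with={"alice"}),
    Document("Incident postmortem", "api-docs", owner="bob", sensitivity=3, shared_with={"alice"}),
    Document("Onboarding notes", "hr", owner="alice", sensitivity=1),
]

alice = User("alice", clearance=1)
visible = scoped_context(alice, DOCS, "api-docs")
```

In this sketch, sharing alone is not enough: the postmortem is shared with `alice`, but her clearance is below its sensitivity level, so it is left out of the context an agent could use on her behalf.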
