
Mastering Data Ingestion for AI Agents: A Deep Dive into Foundry IQ Knowledge Sources

Learn how Microsoft's Foundry IQ shifts the paradigm by treating data ingestion as a first-class capability through Knowledge Sources, connecting modern protocols like MCP, and handling enterprise governance.

Building a highly capable AI agent is only half the battle; an agent is fundamentally constrained by the quality and accessibility of its underlying data. In real-world enterprise environments, knowledge doesn’t live in a single silo. It is fragmented across SharePoint sites, data lakes, blob storage, search indexes, and third-party systems.

Historically, connecting these diverse sources required managing fragile glue code, building custom ingestion pipelines, and handling complex orchestration logic. Microsoft’s Foundry IQ shifts this paradigm by treating data ingestion as a first-class capability through Knowledge Sources.

Here is a technical walkthrough of how Foundry IQ normalizes data pipelines, connects modern protocols like MCP, and handles enterprise governance—freeing your application code to focus entirely on user intent.


Architectural Paradigm: Separation of Concerns

At its core, a Knowledge Source in Foundry IQ is a managed connection wrapped inside a Knowledge Base. This design enforces a strict separation of concerns:

  • The Knowledge Base: Handles the messy realities of data connections, authentication, chunking, and retrieval logic.
  • The Agent: Interacts with the Knowledge Base as a single, unified endpoint, focusing exclusively on planning, user intent, and action execution.

By abstracting the orchestration burden away from your application code, Foundry IQ allows agents to fluidly reason across structured lists, unstructured PDFs, and public web content without custom routing logic.

Separation of Concerns: The Agent interacts cleanly with a Knowledge Base
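To make the boundary concrete, here is a minimal, framework-free Python sketch of this separation. All class and method names here are illustrative, not Foundry IQ APIs: the point is that the agent only ever calls a single retrieve() endpoint, while the knowledge base hides all per-source routing.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str

class KnowledgeBase:
    """Hides connection, chunking, and retrieval details for every source."""
    def __init__(self, sources: dict):
        self._sources = sources  # e.g. {"blob": [...], "sharepoint": [...]}

    def retrieve(self, query: str) -> list[Chunk]:
        # Fan out to every source; the caller never sees this routing.
        hits = []
        for name, docs in self._sources.items():
            hits += [Chunk(d, name) for d in docs if query.lower() in d.lower()]
        return hits

class Agent:
    """Focuses on user intent; delegates all data access to one endpoint."""
    def __init__(self, kb: KnowledgeBase):
        self.kb = kb

    def answer(self, question: str) -> str:
        evidence = self.kb.retrieve(question)
        if not evidence:
            return "I could not find relevant information."
        cites = ", ".join(sorted({c.source for c in evidence}))
        return f"Based on {cites}: {evidence[0].text}"

kb = KnowledgeBase({
    "blob": ["PTO policy: 25 days per year."],
    "sharepoint": ["PTO carry-over is capped at 5 days."],
})
agent = Agent(kb)
print(agent.answer("PTO"))
```

Swapping a source in or out changes only the KnowledgeBase construction; the agent's code is untouched, which is exactly the property Foundry IQ provides at the platform level.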


The Two Pillars of Knowledge Sources: Indexed vs. Remote

Foundry IQ categorizes sources into two primary ingestion patterns, depending on where the data lives and how it needs to be queried.

1. Indexed Sources (The Managed Pipeline)

For data residing in environments like Azure Blob Storage, OneLake (which can create shortcuts to AWS S3, GCP buckets, or custom lakehouses), or existing Azure AI Search indexes, Foundry IQ fully automates the ingestion pipeline.

  • Automated Processing: Content is automatically chunked, vectorized, and enriched.
  • Advanced Retrieval: It configures the underlying query engine for hybrid search, leveraging keyword, vector, and semantic ranking simultaneously.
  • Content Understanding Service: By enabling “Standard Mode” during setup, Foundry IQ applies layout-aware extraction. It intelligently parses complex structures like tables, figures, and headings, ensuring high-quality grounding without writing custom parsing scripts.
  • Automated Freshness: Indexers are automatically scheduled (e.g., hourly by default) to keep the vector database synchronized with the source files.

Automated Indexing Pipeline: Documents are processed and indexed seamlessly
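To build intuition for what the managed pipeline automates, here is a deliberately toy version of the chunk-then-vectorize step. The fixed-size chunker and hash-based "embedding" are stand-ins; the real service uses layout-aware extraction and your deployed embedding model.

```python
import hashlib
import math

def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Fixed-size chunking with overlap (the managed service is layout-aware)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic stand-in for a real embedding model: unit-norm vector."""
    raw = hashlib.sha256(text.encode()).digest()
    vec = [b / 255 for b in raw[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

# "Ingestion": every chunk gets a vector, ready for hybrid (keyword + vector) search.
doc = "Employees accrue 25 days of PTO per year. Unused days expire in March."
index = [{"chunk": c, "vector": embed(c)} for c in chunk(doc)]
print(f"Indexed {len(index)} chunks of {len(doc)} characters")
```

The scheduled indexer Foundry IQ provisions simply re-runs this kind of pipeline against the source (hourly by default) so the vectors never drift far from the files.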

2. Remote Sources (Query on Demand)

Sometimes, data is too dynamic, or governance models dictate that data cannot be duplicated into a secondary index. Remote sources query the target system live, merging and reranking the results alongside your indexed content.

  • Model Context Protocol (MCP): Currently in private preview, Foundry IQ supports MCP servers as native knowledge sources. This allows you to plug into any tool-backed system exposing an MCP server, treating external application states as queryable knowledge.
  • The Web: Public grounding via a Bing endpoint.
  • Remote SharePoint: This utilizes the Microsoft 365 Retrieval API to query SharePoint directly. Crucially, this method respects user permissions and sensitivity labels dynamically at query time. (Note: Requires an M365 Copilot license).
  • Indexed SharePoint (Alternative): If you require granular control over SharePoint data preparation, you can opt for Indexed SharePoint, which extracts the data into an Azure AI Search index via an Entra App registration.
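The key mechanic of remote sources is that live results are merged and reranked alongside indexed content. Here is a hedged sketch of that merge step; the term-overlap scoring function is purely illustrative of the semantic reranking the service actually performs.

```python
def score(query: str, text: str) -> float:
    """Illustrative relevance score: fraction of query terms present."""
    terms = query.lower().split()
    return sum(t in text.lower() for t in terms) / len(terms)

def merge_and_rerank(query: str, indexed_hits: list, remote_hits: list, top_k: int = 3):
    """Remote results are fetched live at query time, then reranked
    together with indexed results into one candidate pool."""
    pool = [(h, "indexed") for h in indexed_hits] + [(h, "remote") for h in remote_hits]
    ranked = sorted(pool, key=lambda p: score(query, p[0]), reverse=True)
    return ranked[:top_k]

indexed = ["PTO policy: 25 days per year."]
remote = ["Live ticket: PTO request #881 pending approval.",
          "Cafeteria menu for Monday."]
results = merge_and_rerank("PTO policy", indexed, remote)
print(results[0])
```

Because the remote system is queried on demand, nothing is copied into a secondary index, which is precisely what permission-sensitive sources like Remote SharePoint require.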

The Agentic Retrieval Engine in Action

You don’t have to write complex routing logic to figure out which source to query. Foundry IQ utilizes an Agentic Retrieval Engine that takes the user’s prompt and formulates a plan.

It executes parallel subqueries across your selected sources. It then evaluates the returned evidence. If the context is sufficient, it exits early to save latency; if not, it iterates and refines the subqueries to improve coverage.

⚙️ Developer Control: You retain control over the Retrieval Reasoning Effort. You can dial this from Minimal to Medium, deciding whether to utilize an LLM to formulate complex subqueries and synthesize answers, allowing you to balance speed, cost, and depth based on your specific use case.
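The plan-query-evaluate-refine loop described above can be sketched in a few lines. This is a conceptual model only (the matching and "sufficiency" checks are toys), but it shows how the effort setting bounds the number of refinement rounds and why early exit saves latency.

```python
def agentic_retrieve(question: str, sources: dict, effort: str = "minimal"):
    """Conceptual loop: subqueries -> gather evidence -> exit early or refine."""
    max_rounds = {"minimal": 1, "medium": 3}[effort]
    subqueries = [question]                      # round 1: the raw prompt
    evidence = []
    for round_no in range(1, max_rounds + 1):
        for q in subqueries:                     # conceptually executed in parallel
            for docs in sources.values():
                evidence += [d for d in docs
                             if q.lower() in d.lower() and d not in evidence]
        if evidence:                             # "sufficient context" check
            return evidence, round_no            # early exit saves latency
        # Refine: split the question into terms to broaden coverage next round.
        subqueries = question.split()
    return evidence, max_rounds

sources = {"blob": ["Remote work stipend is $500."],
           "web": ["Stipend taxation varies."]}
hits, rounds = agentic_retrieve("stipend policy", sources, effort="medium")
print(hits, rounds)
```

At Minimal effort the loop above would have stopped after one empty round; Medium effort pays for the extra iteration and finds both documents.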


Implementation Guide: Tips and Best Practices

Setting up these sources is remarkably streamlined. Whether you are working in the Foundry UI, the Azure Portal (under Azure AI Search -> Agentic Retrieval), or directly in VS Code using the Python SDK, the process provisions the underlying Azure AI Search indexes, indexers, and data sources automatically.

Step-by-Step UI Setup (Blob Storage Example)

Prerequisite: Ensure you have an Agent created first.

  1. Navigate to Build -> Knowledge -> Knowledge Bases.
  2. Click Create Knowledge Source and select your source type (e.g., Azure Blob Storage).
💡 Pro-Tip: The Description Field Is Critical. Do not treat the description field as an afterthought: the agent's planner relies heavily on this text. If you are connecting a blob container with HR documents, describe it explicitly (e.g., "This source contains internal company policies regarding PTO and benefits"). Accurate descriptions prevent the agent from wasting time querying irrelevant sources.

  • Resource Reusability: Knowledge Sources are decoupled from specific Knowledge Bases. You can create a connection to your enterprise data lake once, and seamlessly attach it to dozens of different agents across your organization.
  • Model Selection: When configuring an indexed source, you will need to map your specific Azure OpenAI deployments: an embedding model for vectorizing the text, and a chat completions model to generate descriptions for any images found in your files.
  • Programmatic Access: For developers preferring infrastructure-as-code or CI/CD pipelines, everything demonstrated in the Foundry UI is fully accessible via the Python SDK.

Python Implementation: Creating an Indexed Blob Knowledge Source

Here is a practical example of how you can automate the creation of a Knowledge Source using the Azure AI / Foundry SDK.

This script directly mirrors the Blob Storage scenario discussed above. It programmatically sets up the connection, configures the content understanding processing, and assigns the necessary AI models for vectorization and image analysis.

Prerequisites

You would typically need the Azure Identity and Azure AI Projects (Foundry) libraries installed:

Code
pip install azure-identity azure-ai-projects

The Code

Code
import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    BlobStorageConfiguration,
    IndexProcessingConfiguration,
    ModelConfiguration
)

def create_blob_knowledge_source():
    # 1. Authenticate using DefaultAzureCredential (supports Managed Identity, Azure CLI, etc.)
    credential = DefaultAzureCredential()

    # 2. Initialize the Foundry/Project Client
    # The connection string can be found in your Microsoft Foundry Project settings
    project_client = AIProjectClient.from_connection_string(
        credential=credential,
        conn_str=os.environ.get("FOUNDRY_PROJECT_CONNECTION_STRING")
    )

    # 3. Define the underlying Storage details
    blob_config = BlobStorageConfiguration(
        storage_account_name="zavastorageacc",
        container_name="zava-files",
        # Utilizing system-assigned identity for secure, keyless access
        authentication_type="SystemAssignedIdentity" 
    )

    # 4. Configure how the data is processed into the index
    # Setting mode to "Standard" invokes the Content Understanding service 
    # for layout-aware extraction (tables, figures, etc.)
    processing_config = IndexProcessingConfiguration(
        extraction_mode="Standard",
        # Map to your specific model deployments in Foundry
        embedding_model=ModelConfiguration(
            deployment_name="text-embedding-ada-002"
        ),
        # Used to automatically describe any images found in the PDFs/files
        vision_model=ModelConfiguration(
            deployment_name="gpt-4o" 
        )
    )

    # 5. Assemble and Create the Knowledge Source
    print("Creating Knowledge Source...")
    
    knowledge_source = project_client.knowledge.create_knowledge_source(
        name="zava-public-policies",
        # CRITICAL: The description routes the Agent's planning engine.
        description=(
            "This is a source for Zava policies. "
            "Users can access this publicly to find structured HR information, "
            "standard operating procedures, and general company guidelines."
        ),
        source_type="AzureBlobStorage",
        storage_configuration=blob_config,
        processing_configuration=processing_config
    )

    print(f"✅ Successfully created Knowledge Source: {knowledge_source.name}")
    print(f"Source ID: {knowledge_source.id}")
    print(f"Status: {knowledge_source.status}") # e.g., 'Indexing' or 'Ready'

if __name__ == "__main__":
    create_blob_knowledge_source()

Key Technical Takeaways from the Code

  • Keyless Authentication: By using DefaultAzureCredential and specifying SystemAssignedIdentity in the blob config, you avoid hardcoding SAS tokens or storage account keys, keeping your enterprise setup secure.
  • Routing via Description: Notice the detailed string passed to the description parameter. You are writing this description for the AI, not for a human. The Agentic Retrieval Engine reads this to decide if it should query this blob container.
  • The “Standard” Processing Mode: Passing "Standard" to the extraction_mode is what triggers the automated chunking, vectorization, and layout-aware extraction behind the scenes, effectively building the data pipeline for you.

Python Implementation: Equipping the Agent with Knowledge

Here is the final piece of the puzzle. Now that the Knowledge Source is actively indexing our blob container, we need to wire it up to an Agent so it can actually reason over that data.

In the Microsoft Foundry/Azure AI Projects SDK, agents are treated as autonomous entities. You equip them with capabilities by passing “Tools.” In this case, we will pass our newly created Knowledge Source as a retrieval tool, create a conversation thread, and ask it a question.

Code
import os
import time
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import KnowledgeTool, ToolResources

def run_knowledge_agent(knowledge_source_id: str):
    # 1. Authenticate and Initialize Client
    credential = DefaultAzureCredential()
    project_client = AIProjectClient.from_connection_string(
        credential=credential,
        conn_str=os.environ.get("FOUNDRY_PROJECT_CONNECTION_STRING")
    )

    # 2. Define the Knowledge Tool
    # We reference the ID of the Knowledge Source created in the previous step.
    # The agent will use the source's description to decide when to trigger this tool.
    knowledge_tool = KnowledgeTool(knowledge_source_ids=[knowledge_source_id])
    
    tool_resources = ToolResources(knowledge_tool=knowledge_tool)

    print("Deploying Agent with Knowledge Source...")

    # 3. Create the Agent
    agent = project_client.agents.create_agent(
        model="gpt-4o", # The reasoning engine
        name="Zava-HR-Specialist",
        instructions=(
            "You are an HR specialist for Zava. Use your configured knowledge "
            "sources to answer employee questions. Always cite your sources."
        ),
        tools=[knowledge_tool],
        tool_resources=tool_resources
    )
    
    print(f"✅ Agent created: {agent.id}")

    # 4. Create a Conversation Thread
    thread = project_client.agents.create_thread()
    
    # 5. Add a User Message to the Thread
    user_query = "What is the standard PTO policy for remote employees?"
    print(f"\nUser: {user_query}")
    
    project_client.agents.create_message(
        thread_id=thread.id,
        role="user",
        content=user_query
    )

    # 6. Run the Agent
    # This is where the Agentic Retrieval Engine kicks in, plans the subqueries, 
    # retrieves from Blob storage, and synthesizes the final answer.
    run = project_client.agents.create_run(
        thread_id=thread.id,
        assistant_id=agent.id
    )

    print("Agent is thinking and retrieving context...")

    # 7. Poll for Completion
    while run.status in ["queued", "in_progress"]:
        time.sleep(1)
        run = project_client.agents.get_run(thread_id=thread.id, run_id=run.id)

    # 8. Retrieve and Print the Final Response
    if run.status == "completed":
        messages = project_client.agents.list_messages(thread_id=thread.id)
        # The latest message is the agent's response
        agent_response = messages.data[0].content[0].text.value
        print(f"\nAgent: {agent_response}")
    else:
        print(f"\n❌ Run failed or requires action. Status: {run.status}")

if __name__ == "__main__":
    # Replace with the actual ID returned from the previous script
    SOURCE_ID = "ks_123abc..." 
    run_knowledge_agent(SOURCE_ID)

What’s happening under the hood?

  • Zero Orchestration: Notice that we didn’t write any code to calculate cosine similarity, query the vector database, or format the retrieved text into the prompt window. The SDK handles the entire RAG (Retrieval-Augmented Generation) pipeline automatically.
  • The Run Loop: Once create_run is triggered, the agent evaluates the user’s question, reads the tool descriptions, realizes it needs to search the Zava Blob Storage, fetches the chunks, and generates the response.
  • Citations: Because Foundry IQ inherently tracks data lineage, the final agent_response will automatically include citation markers pointing directly back to the specific PDF or file in the blob container.
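On the consuming side, those citation markers are easy to post-process, for example to render them as footnotes in a UI. The sketch below assumes a simple [docN] marker format for illustration; the actual annotation shape returned by the service may differ.

```python
import re

def extract_citations(response: str) -> tuple[str, list[str]]:
    """Split an agent response into clean prose and its citation markers.
    The [docN] marker format here is an assumption for illustration."""
    markers = re.findall(r"\[(doc\d+)\]", response)
    clean = re.sub(r"\s*\[doc\d+\]", "", response)
    return clean, markers

answer = "Remote employees receive 25 PTO days [doc1] with a 5-day carry-over cap [doc2]."
text, cites = extract_citations(answer)
print(text)
print(cites)
```

Each marker can then be resolved back to the specific file in the blob container, giving end users a verifiable trail from answer to source.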

Conclusion

Foundry IQ Knowledge Sources eliminate the tedious middleware historically required to ground LLMs in enterprise data. By offering a hybrid approach of managed indexes and remote on-demand querying (including robust support for M365 and modern MCP architectures), it provides a secure, scalable foundation for building trustworthy AI agents.

(For more documentation and code samples, check out the official resources at aka.ms/iq-series).
