
Using Dataverse as an API Cache in Copilot Studio

Handling rows and columns in LLMs is difficult. Learn how to build resilient AI agents by using Dataverse as a cache, optimizing views, and implementing a 'Find vs. Get Details' methodology.

When working with Large Language Models (LLMs), handling structured data—rows and columns—is notoriously difficult. A standard Retrieval-Augmented Generation (RAG) approach, where you dump all your rows into the context window, will quickly result in blown prompt limits. Similarly, allowing an LLM to dynamically run raw queries or hit MCP servers often results in massive payloads that the bot cannot process.

The problem is two-fold:

  1. Prompt Token Limits: Traditional RAG fails because structured data is dense.
  2. Dataset Size: Direct dynamic queries can return datasets that are simply too large for a single turn of conversation.

To build resilient, highly accurate AI agents, we must strategically segment our data, orchestrate specialized agents, and tightly control the metadata returned. This guide breaks down the complete methodology for mastering structured data using Dataverse and Microsoft Copilot Studio.


The Architecture of a Data Cache

A common anti-pattern in AI development is connecting an agent directly to an API where data lives at rest. While “live” data sounds appealing, it frequently fails in production due to poor indexing, complex join requirements, and API rate limits.

By using Dataverse as a data cache, you ingest and structure the necessary data into tables optimized for high-performance retrieval. This creates a resilient middle layer that protects your source systems while providing the LLM with a stable, well-indexed dataset.


Part 1: The Backend Foundation – Syncing your API Data

Before the agent can search, the data must be present and fresh. Pointing your AI agent directly at a live API leads to latency and context window exhaustion. Using Dataverse as a staged “cache” is the solution—but you must first establish a reliable synchronization pipeline.

Depending on your data volume, technical expertise, and refresh frequency requirements, there are three primary design patterns for moving data from a REST API into Dataverse: Power Platform Dataflows, Power Automate, and Azure Data Factory (ADF).

| Pattern | Best For | Complexity |
| --- | --- | --- |
| Power Platform Dataflows | Scheduled batch syncs, low-code | Low |
| Power Automate | Real-time, event-driven, webhooks | ℹ️ Medium |
| Azure Data Factory | Massive datasets, big data, enterprise ELT | ⚠️ High |

1. Power Platform Dataflows (The Low-Code Scheduled Approach)

Best for: Citizen developers, data analysts, and scheduled batch syncs of small-to-medium datasets.

Dataflows use the familiar Power Query engine (the same one found in Excel and Power BI) to extract, transform, and load (ETL) data. It is the easiest way to map complex nested JSON from an API into clean Dataverse tables without writing code.

The Design:

  • Source: A Web API connector.
  • Transformation: Power Query Editor.
  • Sink (Destination): A Dataverse table.

How to Build It:

  1. In the Power Apps maker portal, navigate to Dataflows and select New dataflow.
  2. Choose Web API as your data source. Provide the API endpoint URL and configure your authentication (e.g., Basic, API Key, or Organizational Account).
  3. Once connected, the API’s JSON response will load into the Power Query Editor. You will typically need to click “Convert to Table” and expand the records into distinct columns.
  4. Clean the Data: Use the UI to change data types (e.g., converting a string to a True/False boolean) and filter out columns your Copilot agent won’t need.
  5. Map to Dataverse: Choose to load the data into an existing Dataverse table. Map your Power Query columns to your Dataverse Logical Names.
  6. The Upsert Key: To ensure your Dataflow updates existing records instead of creating duplicates on the next run, you must select a unique column as the Alternate Key.
  7. Schedule: Set the Dataflow to refresh on a schedule (e.g., daily or every 30 minutes).

2. Power Automate (The Event-Driven Approach)

Best for: Real-time syncs, webhook-triggered updates, or very small and static datasets.

While Power Automate isn’t an enterprise ETL tool, it is brilliant for event-driven synchronization. If your source system supports webhooks (e.g., “send an alert when a new account is created”), Power Automate can catch that payload and push it instantly to Dataverse.

The Design:

  • Trigger: Recurrence (Scheduled) or “When an HTTP request is received” (Webhook).
  • Action 1: HTTP action to call the external REST API (if polling).
  • Action 2: Parse JSON to structure the response.
  • Action 3: Dataverse connector actions.

How to Build It:

  1. Create a new Cloud Flow. If polling an API, use a Recurrence trigger.
  2. Add the HTTP action to perform a GET request to your source API. Include any necessary authentication headers (like Bearer tokens).
  3. Add a Parse JSON action and provide a sample payload from your API so Power Automate can generate dynamic content tags.
  4. Add an Apply to each loop to iterate through the array of items returned by the API.
  5. Inside the loop, use the Dataverse Update a row or Add a new row action.
💡 Pro-Tip: To avoid querying Dataverse first to see if the record exists, configure Dataverse with an Alternate Key on your table. You can then use the Dataverse connector’s built-in Upsert functionality. Pass the API’s unique ID to the alternate key field; Dataverse will automatically decide whether to insert or update.

Note: Avoid Power Automate for large datasets (e.g., tens of thousands of rows). Looping through massive arrays will cause API limits to throttle your flow and result in long execution times.
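The upsert pattern described above can also be expressed as a direct Dataverse Web API call: a PATCH addressed by an alternate key inserts the row if the key is new and updates it otherwise. The sketch below is a minimal, hedged Python illustration; the table name `new_accountcaches`, the key column `new_externalid`, and the org URL are hypothetical placeholders for your own schema.

```python
import json

def build_upsert_request(org_url: str, table: str, key_column: str,
                         key_value: str, record: dict) -> dict:
    """Build a Dataverse Web API upsert request.

    PATCHing a row addressed by its alternate key inserts the row when the
    key is unknown and updates it otherwise -- no read-before-write needed.
    """
    return {
        "method": "PATCH",
        # Address the row by alternate key instead of the primary-key GUID.
        "url": f"{org_url}/api/data/v9.2/{table}({key_column}='{key_value}')",
        "headers": {"Content-Type": "application/json",
                    "OData-Version": "4.0"},
        "body": json.dumps(record),
    }

# Hypothetical payload caught from a webhook or polled from the source API.
req = build_upsert_request(
    "https://contoso.crm.dynamics.com", "new_accountcaches",
    "new_externalid", "ACME-001",
    {"new_name": "Acme Corp", "new_city": "Austin"},
)
print(req["url"])
```

This is the same request the Dataverse connector issues under the hood when you pass the source system’s ID into the alternate key field.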


3. Azure Data Factory / Synapse (The Enterprise Big-Data Approach)

Best for: Data engineers, massive datasets, complex orchestrations, and high-performance ELT/ETL pipelines.

When dealing with millions of rows, pagination, and enterprise-grade security, Azure Data Factory (ADF) is the gold standard. ADF’s mapping data flows and copy activities translate your logic into backend execution designed for high-throughput cloud scale.

The Design:

  • Source Linked Service: REST Connector.
  • Sink Linked Service: Dynamics 365 / Dataverse Connector.
  • Pipeline Activity: Copy Data Activity or Mapping Data Flow.

How to Build It:

  1. In Azure Data Factory, navigate to Linked Services and create a new REST linked service. Configure your Base URL and authentication (ADF supports Service Principals, Managed Identities, OAuth2, etc.).
  2. Create a second Linked Service for Dynamics 365 (Dataverse).
  3. Create a Pipeline and drag a Copy Data activity onto the canvas.
  4. Configure the Source: Point it to your REST dataset. If your API paginates its results (e.g., returning 1000 records at a time with a next_page token), ADF’s REST connector natively supports pagination rules. Configure this so ADF automatically traverses the entire API.
  5. Configure the Sink: Point it to your Dataverse dataset.
  6. Set to Upsert: In the Sink settings, change the “Write behavior” from Insert to Upsert. You will be prompted to provide the Alternate key name configured in your Dataverse table.
  7. Mapping: Under the Mapping tab, align your JSON paths to your Dataverse schema. If you are writing to Lookup fields (linking a Contact to an Account), you must map your source ID to a virtual column using the special syntax {lookup_field_name}@{alternate_key_column_name}.
  8. Performance Tuning: For massive datasets, you can adjust the “Write batch size” in the Sink settings to optimize how many API calls ADF makes to Dataverse per second, preventing throttling errors.
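As a mental model for the pagination rules in step 4 (this is not ADF code), the connector effectively runs the following loop: fetch a page, collect its records, and follow the next-page token until the API stops returning one. The `items` and `next_page` field names below are assumptions about the source API.

```python
def fetch_all_pages(fetch_page, first_url: str) -> list:
    """Traverse a paginated REST API by following next-page tokens.

    `fetch_page` is any callable that takes a URL and returns the decoded
    JSON page -- ADF's REST connector does the equivalent when a
    pagination rule points at the token field.
    """
    records, url = [], first_url
    while url:
        page = fetch_page(url)
        records.extend(page.get("items", []))
        # Stop when the API omits the next-page token.
        url = page.get("next_page")
    return records

# Demo with an in-memory fake API: two pages of results.
pages = {
    "/v1/accounts": {"items": [1, 2], "next_page": "/v1/accounts?page=2"},
    "/v1/accounts?page=2": {"items": [3]},
}
print(fetch_all_pages(pages.__getitem__, "/v1/accounts"))  # [1, 2, 3]
```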

Summary: Which should you choose?

  • Want it done fast, on a schedule, with zero code? Use Dataflows.
  • Want to update records in Dataverse the exact second they change in the source system via webhooks? Use Power Automate.
  • Need to ingest 5 million records nightly with complex pagination and high throughput? Use Azure Data Factory.

Part 2: The Implementation Methodology

Building a data-driven agent isn’t just about connecting a database; it’s about orchestration. To ensure your bot remains fast and accurate, we follow a strict implementation path.

Step 1: Enable Dataverse Search (Environment Setup)

Before building your tools, you must enable the global search feature; otherwise, your queries will silently fail.

  1. Go to the Power Platform Admin Center.
  2. Select Environments -> Choose your environment.
  3. Go to Settings -> Product -> Features.
  4. Toggle Dataverse Search to ON.

Step 2: The “Find” vs. “Get Details” Design Pattern

This is the core strategy for handling large structured datasets. Never create a single tool that searches for a record and returns every piece of metadata simultaneously. This “fat payload” approach will break your context window.

Figure 1: Separating light-weight discovery from heavy-weight retrieval.

  1. The Find Tool: Filters data into logical groups. It returns a high-level list (e.g., Names and IDs only).
  2. The Get Details Tool: Takes a specific ID and returns the deep metadata for that single record.

Never blow your token budget on data the user hasn’t asked for yet. Segmenting your tools is the difference between an agent that works and one that crashes.

— Engineering Best Practice

This “two-step” approach allows the LLM to retrieve a list of 10 accounts with minimal token usage. If the user asks for details on the second account, the LLM fires the “Details” tool for just that specific ID.
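The two tools can be sketched in code terms as follows. This is an illustrative Python sketch, not Copilot Studio configuration: the column names and the in-memory `cache` list are hypothetical stand-ins for your Dataverse search and retrieve calls.

```python
def find_accounts(search_text: str, rows: list) -> list:
    """Find tool: return only Name + ID -- a light payload for the LLM."""
    return [{"id": r["accountid"], "name": r["name"]}
            for r in rows if search_text.lower() in r["name"].lower()]

def get_account_details(account_id: str, rows: list) -> dict:
    """Get Details tool: full metadata, but only for one record."""
    return next(r for r in rows if r["accountid"] == account_id)

# Hypothetical cached rows -- in production these come from Dataverse.
cache = [
    {"accountid": "a1", "name": "Acme Corp", "city": "Austin",
     "status": "Active", "revenue": 1_000_000},
    {"accountid": "a2", "name": "Contoso", "city": "Dallas",
     "status": "Active", "revenue": 5_000_000},
]
print(find_accounts("acme", cache))       # light list: id + name only
print(get_account_details("a1", cache))   # full record for one ID
```

Note how the Find result carries just two fields per row; the heavy columns only enter the conversation when the user asks about a specific ID.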


Step 3: Optimize Dataverse Views for LLM Consumption

Now that the strategy is set, we must prep the data for the LLM. If you point an LLM at an Account table without filtering, it might receive 140+ columns. You must explicitly limit the returned properties via Dataverse Views.

  1. Select Your Columns: Navigate to your Dataverse environment and open your target tables.
  2. Edit Quick Find: Go to Views and edit the Quick Find Active view. This is the view Copilot Studio uses by default for search operations.
  3. Limit Columns: Add only the columns the LLM absolutely needs (e.g., Name, ID, City, Status).
  4. Configure Search: Under the “Find By” section, add the columns you want the LLM to query against.
  5. Verify Searchability: Ensure each selected column is explicitly marked as “Searchable” in its column settings.
💡 Crucial UI Tip: After making changes, click Save and Publish. Do not navigate away from this screen until the process completely finishes. If you leave prematurely, the indexing will fail, and your agent will return stale or zero results.

💡 Note on Logical Names: When referencing columns in Copilot Studio, always use the Logical Name (found under Advanced Options) rather than the Schema Name or Display Name.


Step 4: Multi-Agent Architecture—Parent, Child, and Separation of Concerns

Do not bloat your main agent’s system prompt with massive data structures or complex tool logic. Instead, use an orchestrator pattern with specialized Child Agents. This separation ensures each agent has a focused context window.

Child Agents vs. Parent Agents

  • Parent Agent (The Orchestrator): Handles global rules and routing.
  • Child Agents (The Domain Experts): Create an Account Agent and a Contact Agent. This prevents the parent’s system prompt from becoming a “wall of text” that degrades performance.

| Agent Type | Role | Key Responsibility |
| --- | --- | --- |
| Orchestrator | Global router | Handling global instructions and routing to child agents |
| Account Agent | Domain expert | Searching and fetching details for organizational data |
| Contact Agent | Domain expert | Managing individual personnel records and relationships |

Global Instructions & Guardrails

Set rules at the parent level that apply to the entire system:

  • Sample Data Handling: If your data contains “(sample)”, instruct the agent to ignore or filter that text.
  • The “No Exports” Rule: Explicitly instruct the parent agent “never to offer to export or create files.” This prevents the LLM from hallucinating file download capabilities or generating broken links when users ask for large datasets.


💡 Nudging the Orchestrator: You can use the descriptions within child agents to guide the orchestrator on tool chaining. For example: “Always use the Find tool first to locate the record, then follow up with the Details tool.”


Step 5: Building Tools with the “SearchQuery” Action

To build these highly controllable tools, leverage the Dataverse “Perform an unbound action” connector. Specifically, use the SearchQuery action. This action allows for the targeted retrieval we defined in the “Find vs. Get Details” methodology.
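For orientation, a SearchQuery request body looks roughly like the sketch below. Treat this as an assumption-laden sketch, not a definitive contract: the `entities` parameter is passed as a JSON string, and the exact field casing (`Name`, `SelectColumns`, `SearchColumns`) should be verified against the Dataverse search documentation for your environment.

```python
import json

def build_searchquery(search: str, table: str,
                      select_cols: list, search_cols: list) -> dict:
    """Build the body for the Dataverse `searchquery` unbound action.

    NOTE: `entities` is a JSON *string*, not a nested object; the field
    casing below is an assumption -- check the Dataverse search docs.
    """
    return {
        "search": search,
        "entities": json.dumps([{
            "Name": table,                 # logical name of the table
            "SelectColumns": select_cols,  # keep this list tight (Find tool)
            "SearchColumns": search_cols,  # columns Quick Find indexes
        }]),
    }

body = build_searchquery("TX", "account",
                         ["name", "address1_city"], ["address1_city"])
print(body["search"])
```

Keeping `SelectColumns` minimal is what makes this action suitable for the “Find” tool: the payload returned to the LLM stays small by construction.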

CI/CD Best Practice: The “(Current)” Environment

When configuring the Dataverse connector, set the environment input to the custom value (Current).

  • Avoid Hardcoding: Do not select a specific environment name from the dropdown.
  • Portability: This ensures that when you export your Copilot via pipelines to Test or Production, it automatically targets the local environment rather than breaking by looking for a Dev instance.

Step 6: Advanced Prompting & Entity Extraction

To make your tools bulletproof, utilize tool inputs and descriptions for inline formatting and schema mapping. This bridge between natural language and your database schema is critical.

Entity Extraction & Formatting

When defining the search input parameter, give it strict formatting rules:

  • “Format the state in a two-digit state code in all caps (e.g., TX).”
  • “Ensure zip codes are five digits.”

If a user asks, “What are my accounts in Texas?”, the LLM will automatically convert “Texas” to “TX” before the query ever hits Dataverse.

The “Also Known As” (AKA) Trick

LLMs don’t inherently know your internal schema. If your “Account Name” is stored in a column named parentcustomerid, define this translation directly in the tool description:

“Account Name is also known as parentcustomerid.”

🧠 The AKA Trick: This creates a semantic bridge between natural language and your structured database columns. It’s often more effective than renaming the columns themselves, as it preserves the original schema while providing the LLM the context it needs to map user intent correctly.


The Result: Dynamic Chaining & Pivoting

The ultimate goal of this architecture is to enable Dynamic Chaining. When implemented correctly, you can prove the bot’s capability by testing a complex pivot:

  1. Search: “Show me accounts in California.”
  2. Retrieve: The bot uses the Find tool to get a list.
  3. Inspect: “Tell me more about the third one.” (The bot uses the Get Details tool.)
  4. Join: “Who are the contacts for that account?” (Bot pivots from the Account Agent to the Contact Agent using the retrieved ID).

By implementing Dataverse as a cache, strictly limiting returns, and orchestrating through specialized child agents, you can build Copilot Studio agents capable of handling massive structured datasets with speed, precision, and enterprise-grade reliability.
