This post is a follow-up to Creating a domain-specific NL-to-SQL MCP server, which introduced our MCP-based NL-to-SQL system. Here, we explore how we overcame the initial implementation's limitations through a Multi-Agent architecture.
Introduction — Why multi-agent?
While building our initial NL-to-SQL system on MCP Server, we validated a lot of possibilities. Along the way, we also discovered that a more stable and scalable architecture was clearly needed.
Limitation 1 — Automatic generation, separate execution
Our initial MCP-based NL-to-SQL system had a unidirectional connection between the LLM that generates SQL and the server that actually executes it.
The flow looked like: User question → SQL generation → Execution on a separate server → Results. While the SQL itself was generated automatically, actual execution still depended on a separate tool.
Limitation 2 — Single-point failure in a monolithic engine
The existing MCP processed everything in a single flow: Natural language interpretation → Intent generation → SQL generation → Result explanation.
This led to:
- Difficulty identifying which step caused an incorrect interpretation
- A single error cascading through the entire result
- Users struggling to understand what went wrong
Limitation 3 — Challenges in maintenance and scalability
Real-world data analysis involves various conditions and edge cases that can't be solved with simple aggregations alone. However, all rules must be packed into a single prompt in a monolithic MCP structure.
This results in:
- Prompts becoming bloated with increasing rules, raising conflict potential
- Hesitation to add new rules due to risk of affecting existing behavior
- Difficulty tracking which rules were or weren't applied
Ultimately, maintenance becomes increasingly painful as rules grow more complex.
So we built an A2A (agent-to-agent) pipeline
"From an AI that generates SQL" → "To an AI team that performs the analysis itself"
Instead of concentrating all responsibility in a single engine, we redesigned the system around specialized Agents. Each Agent collaborates step by step based on its role. We call this approach A2A (Agent-to-Agent) and implemented it using our internal AI Agent Portal (v1.4.2.4).
With the A2A architecture:
- Natural language interpretation, query generation, execution, and result explanation are connected as a seamless pipeline.
- Each step is isolated, enabling immediate identification of error sources.
- Rules are distributed across Agents, so adding new features doesn't impact existing behavior.
In other words, A2A establishes an analytics foundation with accuracy, consistency, and operational stability that was difficult to achieve with a monolithic MCP structure.
Why UTS data — Complex, but necessary
What we want to know isn't just whether a metric increased or decreased. We want to understand what actions users actually took and what flows preceded and followed those actions.
To understand this, summarized metrics alone aren't enough. We need raw event logs rich with behavioral context. This is exactly what UTS (User Tracking System) provides.
UTS is a user behavior log collection system used across various internal services:
- It records activities like feature usage, interactions, and screen navigation in a consistent event format
- Its JSON-based denormalized structure flexibly accommodates various attributes for service-specific needs
- It provides granular, multi-dimensional user behavior data including session/action-level information
Since all services share the same schema rules:
- Analysis isn't limited to a specific service
- The same data structure and pipeline can be reused when expanding analysis scope to other services
In this post, we'll use the AI Talk Suggestion service data as an example to explain our NL-to-SQL analytics system.
AI Talk Suggestion is a feature where AI suggests appropriate responses based on conversation context. Users can select suggested content at the right moment or compose their own response to send.
See the official LINE guide for details.
We'll use modified example data based on actual logs for illustration purposes.
UTS-like event log example
Each row in UTS represents a single event performed by a user.
| dt | hour | user_id | session_id | view_name | event_type | event_group | event_details | client_ts | server_ts |
|---|---|---|---|---|---|---|---|---|---|
| 20251125 | 14 | usr_001 | SESS-A1 | suggestion_main | view | open | {"origin":"chat_list","mode":"default"} | 1736916000100 | 1736916000700 |
| 20251125 | 14 | usr_001 | SESS-A1 | suggestion_main | signal | suggest | {"action":"shown","item_id":"sug_101"} | 1736916000300 | 1736916000850 |
| 20251125 | 14 | usr_001 | SESS-A1 | suggestion_main | action | suggest | {"action":"selected","item_id":"sug_101"} | 1736916000450 | 1736916000920 |
| 20251125 | 14 | usr_001 | SESS-A1 | suggestion_main | action | message | {"action":"sent","content_type":"text"} | 1736916000700 | 1736916001100 |
| 20251125 | 15 | usr_002 | SESS-B2 | suggestion_main | signal | suggest | {"action":"shown","item_id":"sug_205"} | 1736916001200 | 1736916001350 |
| 20251125 | 15 | usr_002 | SESS-B2 | suggestion_main | action | suggest | {"action":"dismissed","item_id":"sug_205"} | 1736916001400 | 1736916001550 |
| 20251125 | 16 | usr_003 | SESS-C3 | settings_panel | view | settings | {"section":"ai_suggest"} | 1736916002000 | 1736916002600 |
| 20251125 | 16 | usr_003 | SESS-C3 | settings_panel | action | toggle | {"feature":"ai_suggest","value":"off"} | 1736916002450 | 1736916003000 |
For example, looking at usr_001's logs:
- Who: usr_001
- When: 2025-11-25, 14:00
- Where: suggestion_main screen
- What:
  - Opened the suggestion screen from the chat list (view/open)
  - AI recommendation was displayed (signal/suggest)
  - Selected the recommended item (action/suggest)
  - Sent a message (action/message)
Since each row captures who, when, what, and in what context, UTS serves as the core data source for granular user behavior analysis.
Foundation — Designing an NL-to-SQL data mart
While the A2A architecture improves analysis accuracy and quality, directly querying UTS data has practical limitations:
- Tables are massive (JSON-based denormalization)
- High join costs due to event-level storage
- Inevitable performance degradation if the NL-to-SQL pipeline queries raw data every time
Therefore, we built a dedicated data mart that normalizes and summarizes raw UTS for analytical purposes. By having NL-to-SQL operate on this data mart, we enabled lighter and more stable analytics execution.
LLM-friendly restructuring of raw logs
UTS mixes event information with JSON parameters, which makes it unreliable for direct LLM interpretation. So we first extract behavioral units using rule-based logic, then pre-process them into semantic metrics that the LLM can reference directly.
Step 1 — Transform raw events into 'action' units
Here's a simplified example of event logs for structural explanation:
| dt | hour | user_id | event_type | event_group | event_details |
|---|---|---|---|---|---|
| 20251125 | 14 | usr_001 | signal | suggest | {"action":"shown"} |
| 20251125 | 14 | usr_001 | action | suggest | {"action":"selected"} |
| 20251125 | 14 | usr_001 | action | message | {"action":"sent","length":"24"} |
| 20251125 | 16 | usr_003 | view | settings | {"section":"ai_suggest","mode":"v2"} |
These raw events have diverse structures. This makes it difficult to determine meaningful actions directly from event type/attribute combinations.
So in the ETL stage, we transform raw events into consistent action units based on human-defined rules.
-- Step 1: Example of transforming raw events into action flags
SELECT
dt,
hour,
user_id,
/* Suggest shown flag */
(event_type = 'signal'
AND event_group = 'suggest'
AND COALESCE(event_details['action'], '') = 'shown') AS is_suggest_shown,
/* Suggest selected flag */
(event_type = 'action'
AND event_group = 'suggest'
AND COALESCE(event_details['action'], '') IN ('selected','applied')) AS is_suggest_selected,
/* Message sent flag */
(event_type = 'action'
AND event_group = 'message'
AND event_details['action'] = 'sent') AS is_message_sent,
...
FROM raw_events
Step 1 Result Example:
| dt | hour | user_id | is_suggest_shown | is_suggest_selected | is_message_sent | is_settings_view |
|---|---|---|---|---|---|---|
| 20251125 | 14 | usr_001 | true | false | false | false |
| 20251125 | 14 | usr_001 | false | true | false | false |
| 20251125 | 14 | usr_001 | false | false | true | false |
| 20251125 | 16 | usr_003 | false | false | false | true |
This process removes the complexity of JSON parsing and conditional branching: each event is reduced to clear action flags indicating which action it corresponds to.
Step 2 — Aggregate by analysis grain and pre-compute semantic metrics
Users ask questions like:
- "Users who saw AI recommendations?"
- "Users who selected recommendations?"
- "Active users who sent messages?"
To answer these, you need to combine Step 1's action flags to define concepts like "active user". However, if the LLM has to work out this combination on every request, the generated SQL varies from question to question.
So in the ETL stage, we pre-aggregate by user-day (dt, user_id) and pre-compute frequently used semantic judgments as derived columns.
-- Step 2: Example of user-day aggregation and derived metric generation
SELECT
dt,
user_id,
-- COUNT_IF counts the rows where each boolean action flag is true
COUNT_IF(is_suggest_shown) AS suggest_shown_cnt,
COUNT_IF(is_suggest_selected) AS suggest_selected_cnt,
COUNT_IF(is_message_sent) AS message_sent_cnt,
COUNT_IF(is_settings_view) AS settings_view_cnt,
-- Example AU definition: message sent OR recommendation selected
CASE WHEN COUNT_IF(is_message_sent) > 0 OR COUNT_IF(is_suggest_selected) > 0
THEN true ELSE false END AS is_active_user
FROM step1_events
GROUP BY dt, user_id
Step 2 Result Example:
| dt | user_id | suggest_shown_cnt | suggest_selected_cnt | message_sent_cnt | settings_view_cnt | is_active_user |
|---|---|---|---|---|---|---|
| 20251125 | usr_001 | 1 | 1 | 1 | 0 | true |
| 20251125 | usr_002 | 1 | 0 | 0 | 0 | false |
| 20251125 | usr_003 | 0 | 0 | 0 | 1 | false |
These columns directly correspond to natural language questions:
| Natural language question | Column condition for LLM |
|---|---|
| Users who saw AI recommendations? | suggest_shown_cnt > 0 |
| Users who selected recommendations? | suggest_selected_cnt > 0 |
| Active users who sent messages? | is_active_user = true |
By pre-computing aggregations and semantic judgments in ETL and exposing clear columns to the LLM, we enable stable and consistent SQL generation.
Domain-specific analytics data marts
User behavior within a service is recorded across multiple UTS tables spanning recommendation usage, screen navigation, settings, and more.
Based on this organized behavioral unit data, we configured domain-specific data marts. These marts quickly answer frequently asked questions in each area.
Each mart contains only the tables and columns needed for specific analytics domains. This enables the LLM to quickly find the right table and right column for each question type.
The example below uses simplified names and columns unrelated to the actual structure.

| Question type | Reference mart | Key columns |
|---|---|---|
| AU, recommendation selection rate | Suggestion Domain mart | is_active_user, suggest_selected_cnt |
| Screen dwell time | View Domain mart | avg_stay_time, view_entries |
| Settings change status | Settings Domain mart | ai_suggest_enabled, toggle_change_cnt |
Real-world example
User Question:
"How many AU yesterday?"
The LLM recognizes the question is about active users (AU) and finds the is_active_user column from the Suggestion Domain mart to generate SQL:
SELECT COUNT(DISTINCT user_id) AS au_count
FROM suggestion_domain_mart
WHERE dt = '20251125'
AND is_active_user = true
If we queried Raw UTS directly:
-- Complex query difficult for LLM to generate
SELECT COUNT(DISTINCT user_id)
FROM uts_raw_events
WHERE dt = '20251125'
AND (
(event_type = 'action' AND event_group = 'message'
AND JSON_EXTRACT_SCALAR(event_details, '$.action') = 'sent')
OR
(event_type = 'action' AND event_group = 'suggest'
AND JSON_EXTRACT_SCALAR(event_details, '$.action') = 'selected')
)
Thanks to domain-specific marts, the LLM can quickly determine which table to look at and which column to use, generating accurate SQL.
A2A architecture — 5-agent collaborative pipeline
The data marts designed earlier provide an LLM-friendly structure, but the process of converting natural language questions into accurate SQL and executing them remains complex.
- Various user expressions must be interpreted consistently
- Invalid queries must be validated before execution
- Results must be explained in a user-understandable format
A2A is designed as a structure where 5 specialized Agents with clear roles collaborate step by step. By separating each stage (natural language interpretation, Intent generation, SQL generation/validation, execution, and result explanation), responsibilities become clear. Error tracing and maintenance also become easier.
Pipeline overview

The diagram above shows how the question "How many AU yesterday?" is processed.
Core Flow:
- Dialogue: Determine question type → Analytics question
- Interpret: "yesterday" → date, "AU" → normalized to KPI-AU
- Precheck: Intent → SQL conversion and validation
- Execute: Execute on Presto → Return results
- Present: Generate user-friendly response
Each Agent has clear inputs and outputs, operating as a pipeline where the previous stage's output becomes the next stage's input.
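Conceptually, the pipeline can be pictured as a chain of functions in which each Agent consumes the previous Agent's output. The sketch below is a minimal illustration under assumed, hypothetical agent objects; it is not the actual AI Agent Portal implementation.

# Minimal sketch of the 5-agent pipeline (hypothetical agent objects, not the Portal API)
from dataclasses import dataclass, field

@dataclass
class Context:
    question: str                                   # raw user input
    intent: dict = field(default_factory=dict)      # Interpret output (Intent JSON)
    sql: str = ""                                   # Precheck output (validated SELECT)
    result: dict = field(default_factory=dict)      # Execute output (EXECUTE_RESULT JSON)
    answer: str = ""                                # Present output (final user-facing text)

def run_pipeline(ctx: Context, dialogue, interpret, precheck, execute, present) -> Context:
    """Each stage reads only what the previous stage produced and writes its own field."""
    if dialogue.classify(ctx.question) != "analytics":
        ctx.answer = dialogue.respond(ctx.question)          # general chat / follow-up
        return ctx
    ctx.intent = interpret.to_intent(ctx.question)           # NL -> Intent JSON
    ctx.sql = precheck.to_sql(ctx.intent)                    # Intent -> validated Presto SELECT
    ctx.result = execute.run(ctx.sql)                        # SELECT -> EXECUTE_RESULT JSON
    ctx.answer = present.summarize(ctx.intent, ctx.result)   # JSON -> user-friendly answer
    return ctx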
Role of each agent
1) Dialogue agent — Conversation entry point and router
Role
The entry point for all user input and the traffic controller that determines routing based on question type.
if analytics question → route to Interpret # "How many AU by Segment on 2025-11-30?"
elif follow-up question → respond directly # "Convert results to percentage"
else → respond directly # "Hello", "What can you do?"
Output
- General conversation / follow-up → Conversational response shown directly to user
- Analytics request → Refined question + context for Interpret to use
2) Interpret agent — Natural language to structured intent
Role
Normalizes various natural language expressions into structured Intent JSON that the NL-to-SQL pipeline can directly understand. The key point is that this isn't simple keyword matching but Semantic Matching.
NL → Metric normalization
For example, these questions can all mean the same thing:
- "How many AU?"
- "Tell me the active user count"
- "Number of active users"
Interpret normalizes all of these to the same Metric Key (KPI-AU).
Internally, it references a Metric Ontology (example):
KPI-AU:
  synonyms: ["active users", "AU", "active user count"]
  definition: "Users who sent at least one message or selected an AI recommendation"
  table: suggestion_domain_mart
  logic: "is_active_user = true"
  dt_required: true
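To make the normalization step more concrete, here is a minimal sketch of how an ontology entry like the one above could be used to resolve a question to a Metric Key. It uses whole-word synonym lookup purely for illustration; the actual Interpret Agent relies on LLM-based semantic matching rather than string matching, and the registry below is just the simplified example entry.

# Simplified sketch: resolve a question to a Metric Key via ontology synonyms.
# The real Interpret Agent uses LLM-based semantic matching, not string matching.
import re

METRIC_ONTOLOGY = {
    "KPI-AU": {
        "synonyms": ["active users", "au", "active user count"],
        "table": "suggestion_domain_mart",
        "logic": "is_active_user = true",
        "dt_required": True,
    },
}

def resolve_metric_key(question: str) -> str | None:
    q = question.lower()
    for metric_key, entry in METRIC_ONTOLOGY.items():
        # Whole-word match so "au" doesn't fire on unrelated words
        if any(re.search(rf"\b{re.escape(syn)}\b", q) for syn in entry["synonyms"]):
            return metric_key
    return None  # falls back to "unknown" handling in the real pipeline

# resolve_metric_key("How many AU yesterday?") -> "KPI-AU"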
Output
It combines the normalized Metric, date, and additional conditions into an Intent JSON that the next stage (Precheck) can use directly. Example:
{
  "metric_key": "KPI-AU",
  "dt": "20251125",
  "table": "suggestion_domain_mart",
  "aux_tables": [],
  "notes": []
}
3) Precheck agent — Intent → Safe SQL generation
Role
Based on the Intent JSON from Interpret:
- Performs basic safety and validity checks using predefined Guardrail rules
- Generates a single Presto SELECT statement using the template registry
Template selection
Selects the appropriate SQL template based on Intent's metric_key:
- KPI-AU → AU aggregation template
- KPI-RETENTION-D1 → D+1 retention template
- SEG-RETENTION-D1 → Segmented D+1 retention template
It then incorporates date range (dt_from, dt_to), target table, and other conditions to complete the final SQL.
Guardrail validation
Validates basic safety before SQL generation:
- Date format verification
- Only allowed tables are used
- Query is SELECT-only, etc.
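As an illustration, template selection and the Guardrail checks can be thought of as a small deterministic layer in front of SQL generation. The sketch below is a simplified assumption of how such a registry and validation might look; the template text, rule set, and function names are illustrative, not the actual Precheck implementation.

import re

# Illustrative template registry keyed by metric_key (contents are hypothetical)
SQL_TEMPLATES = {
    "KPI-AU": (
        "SELECT COUNT(DISTINCT CASE WHEN is_active_user THEN user_id END) AS au_count\n"
        "FROM {table}\n"
        "WHERE dt = '{dt}'"
    ),
}
ALLOWED_TABLES = {"suggestion_domain_mart"}

def precheck(intent: dict) -> str:
    """Validate the Intent against Guardrail rules, then fill the matching template."""
    if not re.fullmatch(r"\d{8}", intent.get("dt", "")):      # date format check
        raise ValueError("PRECHECK_STOP: invalid dt format")
    if intent.get("table") not in ALLOWED_TABLES:              # allow-listed tables only
        raise ValueError("PRECHECK_STOP: table not allowed")
    template = SQL_TEMPLATES.get(intent.get("metric_key"))
    if template is None:
        raise ValueError("PRECHECK_STOP: no template for metric_key")
    sql = template.format(table=intent["table"], dt=intent["dt"])
    if not sql.lstrip().upper().startswith("SELECT"):          # SELECT-only guard
        raise ValueError("PRECHECK_STOP: only SELECT statements are allowed")
    return sql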
Output
If validation passes, it populates the template with date range, table name, etc. to complete the final SQL:
-- Precheck output example
-- Summary: AU for 2025-11-25
SELECT
COUNT(DISTINCT CASE WHEN is_active_user THEN user_id END) AS au_count
FROM suggestion_domain_mart
WHERE dt = '20251125'
4) Execute agent — SQL execution
Role
Directly passes the single SELECT statement from Precheck to the internal Presto-based execution environment.
Processing
- Query execution
- Timeout management, etc.
Output
Normalizes execution results to JSON format for Present:
{
  "type": "EXECUTE_RESULT",
  "data": [
    {"au_count": 99999}
  ],
  "metadata": {
    "row_count": 1,
    "execution_time_ms": 30
  }
}
5) Present agent — Result interpretation + user-friendly summary
Role
Receives Execute results (JSON) and creates the final text/table response to show users.
Processing
Present composes the final response:
- Metric Definition — Explains the meaning of the requested metric
- Result Summary — Organizes data in an easy-to-read format
Output
Example final user response:
📊 AU (Active Users): Users who sent at least one message or selected an AI recommendation per day
As of 2025-11-25, AU is 99,999.
This 5-stage pipeline (Dialogue → Interpret → Precheck → Execute → Present) operates sequentially within a single session while maintaining context. Therefore, the same question always generates the same SQL and the same results, guaranteeing consistent NL-to-SQL analytics.
Agent collaboration: Round robin vs swarm
Since NL-to-SQL requires different processing steps for each user question, the Agent execution method significantly impacts overall performance and user experience.
To find the optimal orchestration strategy, we evaluated two multi-agent architectures provided by our internal AI Agent Portal (v1.4.2.4): Round Robin and Swarm. We selected these two structures as practical candidates for our analytics Agents. Round Robin executes Agents in a fixed cycle, while Swarm uses dynamic handoff-based collaboration between Agents. These patterns are widely used in multi-agent orchestration, but their behavior may differ across implementations.
Early evaluations suggested that Swarm was a better fit for conversational NL-to-SQL. The sections below outline what we observed in our analytics scenario.
Overall flow comparison
The differences between the two architectures became clearer when we compared how they handled common types of user questions. The diagram below shows how each question type moves through the pipeline.

| User question | Round robin | Swarm |
|---|---|---|
| "Hello" | Executes all 5 Agents | Executes only 1 Agent (Dialogue) → Immediate response |
| "AU yesterday?" | Executes all 5 Agents | Executes all 5 Agents |
| "Why did it increase?" (follow-up) | Executes all 5 Agents | Executes only 1 Agent (Dialogue) → Context-based immediate response |
In Round Robin, every question triggers the full pipeline. In Swarm, only the Agents needed for a given question join the execution path.
Characteristics of each approach
Based on AI Agent Portal v1.4.2.4 implementation
Round robin — Fixed order full execution
| Aspect | Description |
|---|---|
| Execution Strategy | Always sequential D → I → P → E → Pr (Dialogue → Interpret → Precheck → Execute → Present) |
| Advantages | Simple implementation, predictable behavior |
| Disadvantages | Later Agents still run even when not needed |
Round Robin suits pipelines where every step should always run and conversational overhead is less critical.
Swarm — Rule-based selective execution
| Aspect | Description |
|---|---|
| Execution Strategy | Dialogue determines question type, handoff only when needed |
| Advantages | Fewer Agent calls possible, shorter response time |
| Disadvantages | Additional handoff rule design required |
Swarm is better suited to interactive scenarios that involve both conversational and analytical requests.
Swarm's routing rules:
In our Swarm setup, each question type follows a dedicated route through the Agents:
| Question type | Handling | Executed agents |
|---|---|---|
| General Chat | Immediate response | Dialogue |
| Analysis Query | Full pipeline | D → I → P → E → Pr |
| Follow-up | Context-based response | Dialogue |
Only analysis queries traverse the full five-Agent pipeline. Short conversational turns stay within the Dialogue Agent.
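A minimal way to picture the difference is as a routing function in front of the pipeline: Round Robin always runs the full chain, while Swarm hands off only when the Dialogue Agent classifies the turn as an analytics request. The sketch below is our own simplification of that behavior, not the Portal's orchestration code.

# Simplified sketch of the two orchestration strategies (not the Portal's actual code)
FULL_PIPELINE = ["Dialogue", "Interpret", "Precheck", "Execute", "Present"]

def round_robin_route(question_type: str) -> list[str]:
    # Fixed cycle: every question runs all five Agents
    return FULL_PIPELINE

def swarm_route(question_type: str) -> list[str]:
    # Handoff only when the Dialogue Agent classifies the turn as an analytics request
    if question_type == "analysis_query":
        return FULL_PIPELINE
    return ["Dialogue"]  # general chat and follow-ups stay with Dialogue

# swarm_route("general_chat")   -> ["Dialogue"]
# swarm_route("analysis_query") -> ["Dialogue", "Interpret", "Precheck", "Execute", "Present"]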
Summary
Round Robin and Swarm each offer different trade-offs for multi-agent orchestration. Round Robin processes every step in a fixed sequence. Swarm activates only the Agents needed for each question. In our tests, this selective execution made Swarm more effective for conversational NL-to-SQL.
Key innovations
The core of the A2A architecture described above can be summarized in three points.
Innovation 1 — Precise context injection by role
Each Agent in A2A receives only context optimized for its role.
| Agent | Specialized context injected |
|---|---|
| Dialogue | Conversation context, follow-up rules |
| Interpret | Metric Ontology, date rules, template routing |
| Precheck | SQL templates, Guardrail policies, schema info |
| Execute | Presto execution |
| Present | Execute results, result interpretation |
Why this matters
MCP required all rules in a single prompt:
- Natural language interpretation rules
- Schema information
- SQL templates
- Guardrail policies
- Result format rules
→ Hundreds of lines of prompts with high rule conflict potential
A2A provides only the specialized knowledge needed to each Agent. This results in:
- Maximized LLM performance — Focus on relevant information only
- Rule conflict prevention — Each Agent is independent
- Easier maintenance — Modifying one Agent doesn't affect others
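One way to realize this is to keep a per-Agent context registry and assemble each Agent's prompt only from its own entries. The sketch below is an assumption about how such role-scoped injection could be wired; the file names and registry shape are illustrative, not the Portal's actual mechanism.

# Hypothetical role-scoped context registry: each Agent sees only its own material
AGENT_CONTEXT = {
    "Dialogue":  ["conversation_rules.md", "followup_rules.md"],
    "Interpret": ["metric_ontology.yaml", "date_rules.md", "template_routing.md"],
    "Precheck":  ["sql_templates.sql", "guardrail_policies.md", "mart_schema.md"],
    "Present":   ["result_interpretation_guide.md"],
}

def build_prompt(agent_name: str, task_input: str, load) -> str:
    """Compose an Agent's prompt from its own context documents plus the task input."""
    docs = [load(path) for path in AGENT_CONTEXT.get(agent_name, [])]
    return "\n\n".join(docs + [task_input])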
Innovation 2 — Structured agent interface
A2A's core innovation is that Agents exchange structured data.
Structured JSON exchange
[Interpret → Precheck]
{
  "metric_key": "KPI-AU",
  "dt_from": "20251125",
  "table": "suggestion_domain_mart"
}

[Precheck → Execute]
{
  "sql": "SELECT COUNT(...)"
}
This always guarantees the same structure and ensures type safety.
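In practice, this contract can be captured with explicit types so that each stage validates the shape of its input. The sketch below uses Python TypedDicts as an illustration of the Intent and Precheck payloads shown above; the field names follow those examples, but the real implementation may define its interfaces differently.

from typing import TypedDict

class Intent(TypedDict):
    # Interpret -> Precheck payload, matching the example above
    metric_key: str   # e.g. "KPI-AU"
    dt_from: str      # e.g. "20251125"
    table: str        # e.g. "suggestion_domain_mart"

class PrecheckOutput(TypedDict):
    # Precheck -> Execute payload: a single validated SELECT statement
    sql: str

def validate_intent(payload: dict) -> Intent:
    """Reject payloads that don't satisfy the agreed interface before handoff."""
    missing = {"metric_key", "dt_from", "table"} - payload.keys()
    if missing:
        raise ValueError(f"Intent is missing required fields: {missing}")
    return Intent(metric_key=payload["metric_key"],
                  dt_from=payload["dt_from"],
                  table=payload["table"])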
Practical effects
Thanks to structured interfaces, A2A guarantees:
- Consistency — The same question always generates the same Intent. "How many AU yesterday?" always converts to {metric_key: "KPI-AU", dt: "20251125"} regardless of when asked, which in turn always generates the same SQL.
- Stability — Each Agent can trust the data format sent by the previous Agent. Precheck is confident that Interpret will always provide metric_key, and Execute is confident that Precheck only sends validated SQL.
Innovation 3 — Failure isolation by stage
A2A executes and logs each stage independently, enabling immediate identification of what went wrong and where.
Stage-by-stage failure handling
| Stage | Behavior on failure | Meaning |
|---|---|---|
| Interpret | metric_key: unknown | Natural language intent interpretation failed |
| Precheck | PRECHECK_STOP | Policy/schema violation |
| Execute | empty result | Execution error |
| Present | Empty result handling | Upstream error propagation |
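Because each stage runs and logs independently, a failure can be tagged with the stage that produced it before it propagates downstream. A minimal sketch of this idea, assuming the stage outcomes listed above and a hypothetical stage-wrapper function:

# Minimal sketch: wrap each stage so failures are tagged with their origin
def run_stage(stage_name: str, fn, payload, logger=print):
    try:
        return {"stage": stage_name, "ok": True, "output": fn(payload)}
    except Exception as exc:
        logger(f"[{stage_name}] failed: {exc}")   # stage-scoped log entry
        return {"stage": stage_name, "ok": False, "error": str(exc)}

# Example: a Guardrail violation surfaces as PRECHECK_STOP at the Precheck stage,
# instead of showing up later as an unexplained empty result.
# run_stage("Precheck", precheck, {"metric_key": "KPI-AU", "dt": "2025-13-99", "table": "x"})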
Practical effects
Stage-by-stage error isolation and tracing is now possible, allowing A2A to evolve into an analytics platform that operates robustly even in real production environments.
Specialize, standardize, isolate. These three principles shaped a pipeline that's both debuggable and extensible.
Conclusion
From MCP to A2A — What we learned
Here are the lessons from this journey:
| Stage | What we learned |
|---|---|
| MCP (Monolithic) | Works well for simple schemas or small data volumes, where a monolithic prompt-based architecture can be sufficient. |
| A2A (Multi-Agent) | Provides clearer separation of responsibilities through specialized Agents and makes debugging and rule management more predictable. |
Key Takeaway: Simpler workloads fit monolithic designs, while evolving analytical demands require collaboration across specialized Agents.
This structure enables clearer responsibilities, more predictable execution, and safer growth as NL-to-SQL workflows become more complex.
The next chapter: Evolving A2A in real use
A2A at this point is still in its early form, with some constraints:
- Designed around predefined question sets — free-form queries with completely new patterns aren't yet supported.
- Static prompt management — User patterns or failure cases aren't automatically reflected.
These constraints aren't A2A's limitations, but starting points for the next stage.
NL-to-SQL isn't simply about converting questions to SQL. It's about designing an entire experience: one that accurately understands user intent, computes metrics based on shared business definitions, and delivers trustworthy results in a consistent way.
A2A is one approach to building that experience—and it's still evolving. What comes next will be shaped by real-world usage.
🚀 Stay tuned for Part 3: A2A in Production