Building a multi-agent pipeline for NL-to-SQL analytics

This post is a follow-up to Creating a domain-specific NL-to-SQL MCP server, which introduced our MCP-based NL-to-SQL system. Here, we explore how we overcame the initial implementation's limitations through a Multi-Agent architecture.

Introduction — Why multi-agent?

While building our initial NL-to-SQL system on the MCP server, we validated many possibilities. Along the way, we also discovered that we clearly needed a more stable and scalable architecture.

Limitation 1 — Automatic generation, separate execution

Our initial MCP-based NL-to-SQL system had a unidirectional connection between the LLM that generates SQL and the server that actually executes it.
The flow looked like: User question → SQL generation → Execution on a separate server → Results. While the SQL itself was generated automatically, actual execution still depended on a separate tool.

Limitation 2 — Single-point failure in a monolithic engine

The existing MCP processed everything in a single flow: Natural language interpretation → Intent generation → SQL generation → Result explanation.
This led to:

  • Difficulty identifying which step caused an incorrect interpretation
  • A single error cascading through the entire result
  • Users struggling to understand what went wrong

Limitation 3 — Challenges in maintenance and scalability

Real-world data analysis involves various conditions and edge cases that can't be solved with simple aggregations alone. In a monolithic MCP structure, however, all of these rules must be packed into a single prompt.
This results in:

  • Prompts becoming bloated with increasing rules, raising conflict potential
  • Hesitation to add new rules due to risk of affecting existing behavior
  • Difficulty tracking which rules were or weren't applied

Ultimately, maintenance becomes increasingly painful as rules grow more complex.

So we built an A2A (agent-to-agent) pipeline

"From an AI that generates SQL" → "To an AI team that performs the analysis itself"
Instead of concentrating all responsibility in a single engine, we redesigned the system around specialized Agents. Each Agent collaborates step by step based on its role. We call this approach A2A (Agent-to-Agent) and implemented it using our internal AI Agent Portal (v1.4.2.4).

With the A2A architecture:

  • Natural language interpretation, query generation, execution, and result explanation are connected as a seamless pipeline.
  • Each step is isolated, enabling immediate identification of error sources.
  • Rules are distributed across Agents, so adding new features doesn't impact existing behavior.

In other words, A2A establishes an analytics foundation with accuracy, consistency, and operational stability that was difficult to achieve with a monolithic MCP structure.


Why UTS data — Complex, but necessary

What we want to know isn't just whether a metric increased or decreased. We want to understand what actions users actually took and what flows preceded and followed those actions.
To understand this, summarized metrics alone aren't enough. We need raw event logs rich with behavioral context. This is exactly what UTS (User Tracking System) provides.

UTS is a user behavior log collection system used across various internal services:

  • It records activities like feature usage, interactions, and screen navigation in a consistent event format
  • Its JSON-based denormalized structure flexibly accommodates various attributes for service-specific needs
  • It provides granular, multi-dimensional user behavior data including session/action-level information

Since all services share the same schema rules:

  • Analysis isn't limited to a specific service
  • The same data structure and pipeline can be reused when expanding analysis scope to other services

In this post, we'll use the AI Talk Suggestion service data as an example to explain our NL-to-SQL analytics system. 

AI Talk Suggestion is a feature where AI suggests appropriate responses based on conversation context. Users can select suggested content at the right moment or compose their own response to send.
See the official LINE guide for details.

We'll use modified example data based on actual logs for illustration purposes.

UTS-like event log example

Each row in UTS represents a single event performed by a user.

| dt | hour | user_id | session_id | view_name | event_type | event_group | event_details | client_ts | server_ts |
|---|---|---|---|---|---|---|---|---|---|
| 20251125 | 14 | usr_001 | SESS-A1 | suggestion_main | view | open | {"origin":"chat_list","mode":"default"} | 1736916000100 | 1736916000700 |
| 20251125 | 14 | usr_001 | SESS-A1 | suggestion_main | signal | suggest | {"action":"shown","item_id":"sug_101"} | 1736916000300 | 1736916000850 |
| 20251125 | 14 | usr_001 | SESS-A1 | suggestion_main | action | suggest | {"action":"selected","item_id":"sug_101"} | 1736916000450 | 1736916000920 |
| 20251125 | 14 | usr_001 | SESS-A1 | suggestion_main | action | message | {"action":"sent","content_type":"text"} | 1736916000700 | 1736916001100 |
| 20251125 | 15 | usr_002 | SESS-B2 | suggestion_main | signal | suggest | {"action":"shown","item_id":"sug_205"} | 1736916001200 | 1736916001350 |
| 20251125 | 15 | usr_002 | SESS-B2 | suggestion_main | action | suggest | {"action":"dismissed","item_id":"sug_205"} | 1736916001400 | 1736916001550 |
| 20251125 | 16 | usr_003 | SESS-C3 | settings_panel | view | settings | {"section":"ai_suggest"} | 1736916002000 | 1736916002600 |
| 20251125 | 16 | usr_003 | SESS-C3 | settings_panel | action | toggle | {"feature":"ai_suggest","value":"off"} | 1736916002450 | 1736916003000 |

For example, looking at usr_001's logs:

  • Who: usr_001
  • When: 2025-11-25, 14:00
  • Where: suggestion_main screen
  • What:
    1. Opened the suggestion screen from the chat list (view/open)
    2. AI recommendation was displayed (signal/suggest)
    3. Selected the recommended item (action/suggest)
    4. Sent a message (action/message)

Since each row captures who, when, what, and in what context, UTS serves as the core data source for granular user behavior analysis.


Foundation — Designing an NL-to-SQL data mart

While the A2A architecture improves analysis accuracy and quality, directly querying UTS data has practical limitations:

  • Tables are massive (JSON-based denormalization)
  • High join costs due to event-level storage
  • Inevitable performance degradation if the NL-to-SQL pipeline queries raw data every time

Therefore, we built a dedicated data mart that normalizes and summarizes raw UTS for analytical purposes. By having NL-to-SQL operate on this data mart, we enabled lighter and more stable analytics execution.

LLM-friendly restructuring of raw logs

UTS mixes event information with JSON parameters, which makes it unreliable for direct LLM interpretation. So we first extract behavioral units with rule-based logic, then pre-process them into semantic metrics that LLMs can reference directly.

Step 1 — Transform raw events into 'action' units

Here's a simplified example of event logs for structural explanation:

| dt | hour | user_id | event_type | event_group | event_details |
|---|---|---|---|---|---|
| 20251125 | 14 | usr_001 | signal | suggest | {"action":"shown"} |
| 20251125 | 14 | usr_001 | action | suggest | {"action":"selected"} |
| 20251125 | 14 | usr_001 | action | message | {"action":"sent","length":"24"} |
| 20251125 | 16 | usr_003 | view | settings | {"section":"ai_suggest","mode":"v2"} |

These raw events have diverse structures, which makes it difficult to determine meaningful actions directly from event type/attribute combinations.
So in the ETL stage, we transform raw events into consistent action units based on human-defined rules.

-- Step 1: Example of transforming raw events into action flags
SELECT
  dt,
  hour,
  user_id,
 
  /* Suggest shown flag */
  (event_type = 'signal'
   AND event_group = 'suggest'
   AND COALESCE(event_details['action'], '') = 'shown') AS is_suggest_shown,
 
  /* Suggest selected flag */
  (event_type = 'action'
   AND event_group = 'suggest'
   AND COALESCE(event_details['action'], '') IN ('selected','applied')) AS is_suggest_selected,
 
  /* Message sent flag */
  (event_type = 'action'
   AND event_group = 'message'
   AND event_details['action'] = 'sent') AS is_message_sent,
 
  ...
FROM raw_events

Step 1 Result Example:

| dt | hour | user_id | is_suggest_shown | is_suggest_selected | is_message_sent | is_settings_view |
|---|---|---|---|---|---|---|
| 20251125 | 14 | usr_001 | true | false | false | false |
| 20251125 | 14 | usr_001 | false | true | false | false |
| 20251125 | 14 | usr_001 | false | false | true | false |
| 20251125 | 16 | usr_003 | false | false | false | true |

This process removes the complexity of JSON parsing and conditional branching: each event becomes a clear flag indicating which action it corresponds to.

Step 2 — Aggregate by analysis grain and pre-compute semantic metrics

Users ask questions like:

  • "Users who saw AI recommendations?"
  • "Users who selected recommendations?"
  • "Active users who sent messages?"

To answer these, you need to combine Step 1's action flags into concepts like "active user". However, if the LLM has to work out this combination for every question, the generated SQL varies from question to question.
So in the ETL stage, we pre-aggregate by user-day (dt, user_id) and pre-compute frequently used semantic judgments as derived columns.

-- Step 2: Example of user-day aggregation and derived metric generation
SELECT
  dt,
  user_id,
  COUNT_IF(is_suggest_shown)    AS suggest_shown_cnt,
  COUNT_IF(is_suggest_selected) AS suggest_selected_cnt,
  COUNT_IF(is_message_sent)     AS message_sent_cnt,
  COUNT_IF(is_settings_view)    AS settings_view_cnt,
  -- Example AU definition: message sent OR recommendation selected
  CASE WHEN COUNT_IF(is_message_sent) > 0 OR COUNT_IF(is_suggest_selected) > 0
       THEN true ELSE false END AS is_active_user
FROM step1_events
GROUP BY dt, user_id

Step 2 Result Example:

| dt | user_id | suggest_shown_cnt | suggest_selected_cnt | message_sent_cnt | settings_view_cnt | is_active_user |
|---|---|---|---|---|---|---|
| 20251125 | usr_001 | 1 | 1 | 1 | 0 | true |
| 20251125 | usr_002 | 1 | 0 | 0 | 0 | false |
| 20251125 | usr_003 | 0 | 0 | 0 | 1 | false |

These columns directly correspond to natural language questions:

| Natural language question | Column condition for LLM |
|---|---|
| Users who saw AI recommendations? | suggest_shown_cnt > 0 |
| Users who selected recommendations? | suggest_selected_cnt > 0 |
| Active users who sent messages? | is_active_user = true |

By pre-computing aggregations and semantic logic in the ETL stage and exposing clear columns to the LLM, we enable stable and consistent SQL generation.

Domain-specific analytics data marts

User behavior within a service is recorded across multiple UTS tables spanning recommendation usage, screen navigation, settings, and more.
Based on this organized behavioral-unit data, we built domain-specific data marts that quickly answer frequently asked questions in each area.
Each mart contains only the tables and columns needed for its analytics domain, so the LLM can quickly find the right table and the right column for each question type.

The example below uses simplified names and columns unrelated to the actual structure.

[Image: domain-specific data marts]

| Question type | Reference mart | Key columns |
|---|---|---|
| AU, recommendation selection rate | Suggestion Domain mart | is_active_user, suggest_selected_cnt |
| Screen dwell time | View Domain mart | avg_stay_time, view_entries |
| Settings change status | Settings Domain mart | ai_suggest_enabled, toggle_change_cnt |

Real-world example

User Question:
"How many AU yesterday?"

The LLM recognizes the question is about active users (AU) and finds the is_active_user column from the Suggestion Domain mart to generate SQL:

SELECT COUNT(DISTINCT user_id) AS au_count
FROM suggestion_domain_mart
WHERE dt = '20251125'
  AND is_active_user = true

If we queried Raw UTS directly:

-- Complex query difficult for LLM to generate
SELECT COUNT(DISTINCT user_id)
FROM uts_raw_events
WHERE dt = '20251125'
  AND (
    (event_type = 'action' AND event_group = 'message'
     AND JSON_EXTRACT_SCALAR(event_details, '$.action') = 'sent')
    OR
    (event_type = 'action' AND event_group = 'suggest'
     AND JSON_EXTRACT_SCALAR(event_details, '$.action') = 'selected')
  )

Thanks to domain-specific marts, the LLM can quickly determine which table to look at and which column to use, generating accurate SQL.


A2A architecture — 5-agent collaborative pipeline

The data marts designed earlier provide an LLM-friendly structure, but the process of converting natural language questions into accurate SQL and executing them remains complex.

  • Various user expressions must be interpreted consistently
  • Invalid queries must be validated before execution
  • Results must be explained in a user-understandable format

A2A is designed as a structure where 5 specialized Agents with clear roles collaborate step by step. By separating each stage (natural language interpretation, Intent generation, SQL generation/validation, execution, and result explanation), responsibilities become clear. Error tracing and maintenance also become easier.

Pipeline overview

[Image: A2A pipeline overview]

The diagram above shows how the question "How many AU yesterday?" is processed.

Core Flow:

  1. Dialogue: Determine question type → Analytics question
  2. Interpret: "yesterday" → date, "AU" → normalized to KPI-AU
  3. Precheck: Intent → SQL conversion and validation
  4. Execute: Execute on Presto → Return results
  5. Present: Generate user-friendly response

Each Agent has clear inputs and outputs, operating as a pipeline where the previous stage's output becomes the next stage's input.
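To make the dataflow concrete, here is a minimal Python sketch of the five-stage flow. It assumes each Agent is a plain callable that takes and returns a dictionary; the names (run_pipeline, "ANALYTICS", and so on) are illustrative and not the actual AI Agent Portal API.

# Minimal sketch of the 5-stage flow. Each agent is assumed to be a callable
# that accepts and returns a dict; names are illustrative only.
from typing import Any, Callable, Dict

Agent = Callable[[Dict[str, Any]], Dict[str, Any]]

def run_pipeline(question: str, dialogue: Agent, interpret: Agent,
                 precheck: Agent, execute: Agent, present: Agent) -> Dict[str, Any]:
    routed = dialogue({"question": question})   # question type + refined question
    if routed.get("type") != "ANALYTICS":       # chit-chat / follow-up answered directly
        return routed
    intent = interpret(routed)                  # Intent JSON (metric_key, dt, table, ...)
    sql = precheck(intent)                      # validated single SELECT statement
    result = execute(sql)                       # EXECUTE_RESULT JSON
    return present(result)                      # user-facing summary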

Role of each agent

1) Dialogue agent — Conversation entry point and router

Role

The entry point for all user input and the traffic controller that determines routing based on question type.

if   analytics question     → route to Interpret     # "How many AU by Segment on 2025-11-30?"
elif follow-up question     → respond directly       # "Convert results to percentage"
else                        → respond directly       # "Hello", "What can you do?"

Output

  • General conversation / follow-up → Conversational response shown directly to user
  • Analytics request → Refined question + context for Interpret to use

2) Interpret agent — Natural language to structured intent

Role

Normalizes varied natural language expressions into a structured Intent JSON that the NL-to-SQL pipeline can consume directly. The key point is that this isn't simple keyword matching but Semantic Matching.

NL → Metric normalization

For example, these questions can all mean the same thing:

  • "How many AU?"
  • "Tell me the active user count"
  • "Number of active users"

Interpret normalizes all of these to the same Metric Key (KPI-AU).
Internally, it references a Metric Ontology (example):

KPI-AU:
  synonyms: ["active users", "AU", "active user count"]
  definition: "Users who sent at least one message or selected an AI recommendation"
  table: suggestion_domain_mart
  logic: "is_active_user = true"
  dt_required: true

Output

It combines the normalized Metric, date, and additional conditions into an Intent JSON that the next stage (Precheck) can use directly. Example:

{
  "metric_key": "KPI-AU",
  "dt": "20251125",
  "table": "suggestion_domain_mart",
  "aux_tables": [],
  "notes": []
}
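As a rough illustration of how the ontology feeds this output, the sketch below assembles an Intent JSON once a question has been matched to a metric. The matching is reduced to a synonym lookup here for brevity; the actual Interpret agent relies on LLM-based semantic matching, and the build_intent function and its date handling are assumptions for the example.

from datetime import date, timedelta

# Ontology entry mirroring the KPI-AU example above (illustrative subset)
METRIC_ONTOLOGY = {
    "KPI-AU": {
        "synonyms": ["active users", "au", "active user count"],
        "table": "suggestion_domain_mart",
        "dt_required": True,
    },
}

def build_intent(question: str, today: date) -> dict:
    q = question.lower()
    metric_key = next(
        (key for key, meta in METRIC_ONTOLOGY.items()
         if any(s in q for s in meta["synonyms"])),
        "unknown",  # reported when no metric matches
    )
    dt = (today - timedelta(days=1)).strftime("%Y%m%d") if "yesterday" in q else None
    meta = METRIC_ONTOLOGY.get(metric_key, {})
    return {"metric_key": metric_key, "dt": dt, "table": meta.get("table"),
            "aux_tables": [], "notes": []}

# build_intent("How many AU yesterday?", date(2025, 11, 26))
# -> {"metric_key": "KPI-AU", "dt": "20251125", "table": "suggestion_domain_mart", ...}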

3) Precheck agent — Intent → Safe SQL generation

Role

Based on the Intent JSON from Interpret:

  1. Performs basic safety and validity checks using predefined Guardrail rules
  2. Generates a single Presto SELECT statement using the template registry

Template selection

Selects the appropriate SQL template based on Intent's metric_key:

  • KPI-AU → AU aggregation template
  • KPI-RETENTION-D1 → D+1 retention template
  • SEG-RETENTION-D1 → Segmented D+1 retention template

It then incorporates date range (dt_from, dt_to), target table, and other conditions to complete the final SQL.

Guardrail validation

Validates basic safety before SQL generation:

  • Date format verification
  • Only allowed tables are used
  • Query is SELECT-only, etc.

Output

If validation passes, it populates the template with date range, table name, etc. to complete the final SQL:

-- Precheck output example
-- Summary: AU for 2025-11-25

SELECT
  COUNT(DISTINCT CASE WHEN is_active_user THEN user_id END) AS au_count
FROM suggestion_domain_mart
WHERE dt = '20251125'
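For illustration, a simplified version of the Guardrail check and template rendering could look like the sketch below. The allow-list, regular expression, and template body are assumptions for the example, not the production rules.

import re

ALLOWED_TABLES = {"suggestion_domain_mart"}  # illustrative allow-list

def guardrail_check(intent: dict) -> list:
    """Return a list of violations; an empty list means the intent may proceed."""
    violations = []
    if not re.fullmatch(r"\d{8}", intent.get("dt") or ""):
        violations.append("dt must be in YYYYMMDD format")
    if intent.get("table") not in ALLOWED_TABLES:
        violations.append(f"table not allowed: {intent.get('table')}")
    return violations

def render_sql(intent: dict) -> str:
    # The template registry contains only single SELECT statements,
    # so the generated SQL is SELECT-only by construction.
    if intent["metric_key"] == "KPI-AU":
        return (
            "SELECT COUNT(DISTINCT CASE WHEN is_active_user THEN user_id END) AS au_count\n"
            f"FROM {intent['table']}\n"
            f"WHERE dt = '{intent['dt']}'"
        )
    raise ValueError("PRECHECK_STOP: no template for metric_key")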

4) Execute agent — SQL execution

Role

Directly passes the single SELECT statement from Precheck to the internal Presto-based execution environment.

Processing

  • Query execution
  • Timeout management, etc.

Output

Normalizes execution results to JSON format for Present:

{
  "type": "EXECUTE_RESULT",
  "data": [
    {"au": 99999}
  ],
  "metadata": {
    "row_count": 1,
    "execution_time_ms": 30
  }
}
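Below is a minimal sketch of this normalization step, assuming a DB-API style cursor from a Presto client is passed in; timeout handling is assumed to be configured on the client or session rather than shown here.

import time

def execute_sql(cursor, sql: str) -> dict:
    """Run a validated SELECT and normalize the result into the shape above."""
    started = time.monotonic()
    cursor.execute(sql)
    columns = [col[0] for col in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return {
        "type": "EXECUTE_RESULT",
        "data": rows,
        "metadata": {
            "row_count": len(rows),
            "execution_time_ms": int((time.monotonic() - started) * 1000),
        },
    }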

5) Present agent — Result interpretation + user-friendly summary

Role

Receives Execute results (JSON) and creates the final text/table response to show users.

Processing

Present composes the final response:

  1. Metric Definition — Explains the meaning of the requested metric
  2. Result Summary — Organizes data in an easy-to-read format

Output

Example final user response:

📊 AU (Active Users): Users who sent at least one message or selected an AI recommendation per day

As of 2025-11-25, AU is 99,999.
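A simplified sketch of how Present could compose this response from the Intent and the Execute result; the metric-definition dictionary and formatting are assumptions for the example.

METRIC_DEFINITIONS = {
    "KPI-AU": ("AU (Active Users)",
               "Users who sent at least one message or selected an AI recommendation per day"),
}

def present(intent: dict, result: dict) -> str:
    name, definition = METRIC_DEFINITIONS[intent["metric_key"]]
    if not result["data"]:
        return f"No data was returned for {name} on the requested date."
    value = next(iter(result["data"][0].values()))
    dt = intent["dt"]
    pretty_dt = f"{dt[:4]}-{dt[4:6]}-{dt[6:]}"
    return (f"📊 {name}: {definition}\n\n"
            f"As of {pretty_dt}, {name.split(' (')[0]} is {value:,}.")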

This 5-stage pipeline (Dialogue → Interpret → Precheck → Execute → Present) operates sequentially within a single session while maintaining context. Therefore, the same question always generates the same SQL and the same results, guaranteeing consistent NL-to-SQL analytics.

Agent collaboration: Round robin vs swarm

Since NL-to-SQL requires different processing steps for each user question, the Agent execution method significantly impacts overall performance and user experience.

To find the optimal orchestration strategy, we evaluated two multi-agent architectures provided by our internal AI Agent Portal (v1.4.2.4): Round Robin and Swarm. We selected these two structures as practical candidates for our analytics Agents. Round Robin executes Agents in a fixed cycle, while Swarm uses dynamic handoff-based collaboration between Agents. These patterns are widely used in multi-agent orchestration, but their behavior may differ across implementations.

Early evaluations suggested that Swarm was a better fit for conversational NL-to-SQL. The sections below outline what we observed in our analytics scenario.

Overall flow comparison

The differences between the two architectures became clearer when we compared how they handled common types of user questions. The diagram below shows how each question type moves through the pipeline.

[Image: Round Robin vs Swarm execution flow]

| User question | Round robin | Swarm |
|---|---|---|
| "Hello" | Executes all 5 Agents | Executes only 1 Agent (Dialogue) → Immediate response |
| "AU yesterday?" | Executes all 5 Agents | Executes all 5 Agents |
| "Why did it increase?" (follow-up) | Executes all 5 Agents | Executes only 1 Agent (Dialogue) → Context-based immediate response |

In Round Robin, every question triggers the full pipeline. In Swarm, only the Agents needed for a given question join the execution path.

Characteristics of each approach

Based on AI Agent Portal v1.4.2.4 implementation

Round robin — Fixed order full execution
| Aspect | Description |
|---|---|
| Execution Strategy | Always sequential D → I → P → E → Pr |
| Advantages | Simple implementation, predictable behavior |
| Disadvantages | Later Agents still run even when not needed |

Round Robin suits pipelines where every step should always run and conversational overhead is less critical.

Swarm — Rule-based selective execution
| Aspect | Description |
|---|---|
| Execution Strategy | Dialogue determines question type, handoff only when needed |
| Advantages | Fewer Agent calls possible, shorter response time |
| Disadvantages | Additional handoff rule design required |

Swarm is better suited to interactive scenarios that involve both conversational and analytical requests.

Swarm's routing rules:

In our Swarm setup, each question type follows a dedicated route through the Agents:

| Question type | Handling | Executed agents |
|---|---|---|
| General Chat | Immediate response | Dialogue |
| Analysis Query | Full pipeline | D → I → P → E → Pr |
| Follow-up | Context-based response | Dialogue |

Only analysis queries traverse the full five-Agent pipeline. Short conversational turns stay within the Dialogue Agent.
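In code, the difference between the two strategies can be reduced to a routing function like the sketch below; the question-type labels and agent names are illustrative and not the Portal's actual handoff mechanism.

FULL_PIPELINE = ["Dialogue", "Interpret", "Precheck", "Execute", "Present"]

def round_robin_route(question_type: str) -> list:
    # Round Robin always runs the full fixed sequence, regardless of question type
    return FULL_PIPELINE

def swarm_route(question_type: str) -> list:
    # Swarm hands off beyond Dialogue only for analytics questions
    if question_type == "ANALYSIS_QUERY":
        return FULL_PIPELINE
    return ["Dialogue"]  # general chat and follow-ups stay in the Dialogue agent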

Summary

Round Robin and Swarm each offer different trade-offs for multi-agent orchestration. Round Robin processes every step in a fixed sequence. Swarm activates only the Agents needed for each question. In our tests, this selective execution made Swarm more effective for conversational NL-to-SQL.

Key innovations

The core of the A2A architecture described above can be summarized in three points.

Innovation 1 — Precise context injection by role

Each Agent in A2A receives only context optimized for its role.

| Agent | Specialized context injected |
|---|---|
| Dialogue | Conversation context, follow-up rules |
| Interpret | Metric Ontology, date rules, template routing |
| Precheck | SQL templates, Guardrail policies, schema info |
| Execute | Presto execution |
| Present | Execute results, result interpretation |

Why this matters

MCP required all rules in a single prompt:

- Natural language interpretation rules
- Schema information
- SQL templates
- Guardrail policies
- Result format rules

→ Hundreds of lines of prompts with high rule conflict potential

A2A provides only the specialized knowledge needed to each Agent. This results in:

  • Maximized LLM performance — Focus on relevant information only
  • Rule conflict prevention — Each Agent is independent
  • Easier maintenance — Modifying one Agent doesn't affect others

Innovation 2 — Structured agent interface

A2A's core innovation is that Agents exchange structured data.

Structured JSON exchange

[Interpret → Precheck]
{
  "metric_key": "KPI-AU",
  "dt_from": "20251125",
  "table": "suggestion_domain_mart"
}

[Precheck → Execute]
{
  "sql": "SELECT COUNT(...)"
}

This always guarantees the same structure and ensures type safety.

Practical effects

Thanks to structured interfaces, A2A guarantees:

  • Consistency — The same question always generates the same Intent. "How many AU yesterday?" always converts to {metric_key: "KPI-AU", dt: "20251125"} regardless of when asked, which always generates the same SQL.
  • Stability — Each Agent can trust the data format sent by the preceding Agent. Precheck is confident Interpret will always provide metric_key, and Execute is confident Precheck only sends validated SQL.
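To make this contract explicit, the hand-off payloads can be modeled as small typed structures. The sketch below uses Python dataclasses as a stand-in for the JSON messages shown above; the class and field names follow the examples but are otherwise illustrative.

from dataclasses import dataclass, field

@dataclass
class Intent:              # Interpret -> Precheck
    metric_key: str
    dt_from: str
    table: str
    aux_tables: list = field(default_factory=list)

@dataclass
class PrecheckOutput:      # Precheck -> Execute
    sql: str
    summary: str = ""

# Each hop carries one well-defined object, so a downstream Agent can rely on
# the fields existing instead of re-parsing free-form text.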

Innovation 3 — Failure isolation by stage

A2A executes and logs each stage independently, enabling immediate identification of what went wrong and where.

Stage-by-stage failure handling

| Stage | Behavior on failure | Meaning |
|---|---|---|
| Interpret | metric_key: unknown | Natural language intent interpretation failed |
| Precheck | PRECHECK_STOP | Policy/schema violation |
| Execute | Empty result | Execution error |
| Present | Empty result handling | Upstream error propagation |

Practical effects

Stage-by-stage error isolation and tracing are now possible, which lets A2A evolve into an analytics platform that operates robustly in real production environments.
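As an illustration of stage-level isolation, each stage can be wrapped so that failures are logged with the stage name and converted into a tagged result instead of silently propagating; the wrapper below is a sketch under that assumption.

import logging

logger = logging.getLogger("a2a.pipeline")

def run_stage(stage: str, fn, payload: dict) -> dict:
    """Run one pipeline stage and tag any failure with the stage name."""
    try:
        return fn(payload)
    except Exception:
        logger.exception("stage=%s failed; payload keys=%s", stage, sorted(payload))
        # Present can turn this tagged, empty result into a user-facing explanation
        return {"type": "STAGE_ERROR", "stage": stage, "data": []}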

Specialize, standardize, isolate. These three principles shaped a pipeline that's both debuggable and extensible.


Conclusion

From MCP to A2A — What we learned

Here are the lessons from this journey:

| Stage | What we learned |
|---|---|
| MCP (Monolithic) | Works well for simple schemas or small data volumes, where a monolithic prompt-based architecture can be sufficient. It is especially effective when the data model is straightforward and relationships are simple. |
| A2A (Multi-Agent) | Provides clearer separation of responsibilities through specialized Agents and makes debugging and rule management more predictable. As analytical requirements grow, this structure supports scalable and consistent workflows. |

Key Takeaway: Simpler workloads fit monolithic designs, while evolving analytical demands require collaboration across specialized Agents.
This structure enables clearer responsibilities, more predictable execution, and safer growth as NL-to-SQL workflows become more complex.

The next chapter: Evolving A2A in real use

A2A at this point is still in its early form, with some constraints:

  • Designed around defined question sets — Free-form queries with completely new patterns aren't yet supported.
  • Static prompt management — User patterns and failure cases aren't automatically reflected in the prompts.

These constraints aren't A2A's limitations, but starting points for the next stage.
NL-to-SQL isn't simply about converting questions to SQL. It's about designing an entire experience: one that accurately understands user intent, computes metrics based on shared business definitions, and delivers trustworthy results in a consistent way.
A2A is one approach to building that experience—and it's still evolving. What comes next will be shaped by real-world usage.

🚀 Stay tuned for Part 3: A2A in Production