Introduction
Enterprise data analysis faces a fundamental challenge: the gap between business questions and executable SQL queries. While data-driven decision making is critical, the reality is that most teams don't have enough data analysts to serve all business needs. Analysts spend significant time writing SQL for repetitive queries, and business teams wait days for answers to simple questions.
We all know this problem — then how should we solve it? An easy answer emerged with the introduction of generative AI models, or more specifically, large language models (LLMs). "Why not connect an LLM directly to our database? Let AI handle the SQL generation".
After Anthropic introduced the model context protocol (MCP) to enable easy connection between LLMs and external tools, we were one of the first teams in LY Corporation to create an in-house database MCP server. We connected Claude to our AI Assistant service database and expected it to democratize data analysis across the organization.
What our analysis workflow looked like
What we imagined our analysis workflow would look like after LLM implementation
It didn't work.
Why generic LLMs failed for enterprise analytics
Our initial experiment showed a possibility. LLMs were returning SQLs that seemingly looked good. Could this be the next innovation in data analysis?
Then, the more we actually looked into the generated SQLs and the more we tried asking complex and business-related questions, problems started showing up. It was revealed how little the LLM understood our business.
It didn’t understand our data.
When someone asked, “What’s this month’s DAU?”, it rummaged through dozens of tables containing “user” and confidently picked the wrong one. We realized syntax wasn’t our problem; semantics were.
It didn’t understand our business.
The model knew what “daily active users” meant in general, but not in our context where “active” only includes certain API endpoints. The LLM was fluent, not wise.
And then came the enterprise nightmares:
-
It tried to drop tables because it thought schema cleanup was “optimization”.
-
It failed with cryptic messages like “syntax error near line 42”.
-
It raised security flags the moment someone mentioned production data.
These limitations revealed a fundamental truth: generic LLMs lack the domain-specific knowledge, security policies, and organizational nuance.
The MCP server solution
Rather than abandoning AI-assisted SQL generation, we recognized that we needed a domain-specific system that understood our data and business context. That meant building a system that understood not only SQL syntax but our data, our business logic, and our guardrails. We set out to design a domain-specific NL-to-SQL MCP server to solve these challenges. The server worked as a bridge between natural language and the real-world constraints of enterprise data.
In the sections that follow, we’ll walk through how we built it:
-
Architecture: How a 7-step workflow and 4-layer design makes every query consistent and secure
-
Core components: How self-training, validation, and configuration keeps it improving over time
Our goal wasn’t just to make the model smarter, but rather to make it understandable, reliable, and teachable.
Understanding the system architecture
The NL-to-SQL MCP server transforms natural language queries into production-ready SQL through a structured pipeline. It acts like a translation system between business intent and data reality. To achieve this, we ended up with two complementary blueprints.
-
A 7-step workflow that defines how a query becomes SQL
-
A 4-layer architecture that defines where each step happens
Together, they turned what used to be a black box into a predictable, teachable process. We'll examine each perspective in detail, then show how key technical implementations enable this architecture to work in production.
The logical flow — how a question becomes a query
The 7-step workflow
Every query passes through seven mandatory checkpoints. It starts when a user asks a question in natural language, and ends when the system delivers validated, production-ready SQL or learns from the interaction to improve the next one. This structure is enforced by the system prompt and executed through MCP tool calls. The workflow ensures consistency, quality, and continuous improvement.

Step 0: User request
The user will start the process by asking a question in natural language - "What was last month’s retention?", "What was yesterday's DAU?". The system first detects the language, normalizes the phrasing, and translates it into a structured intent. This step ensures the AI truly hears the user before it starts thinking.
Step 1: Business context discovery
Before touching a database, the system looks up current business definitions and available metrics. Business context can be updated - KPIs can change, terms can be removed. This step makes sure the model never works with stale assumptions.
Step 2: Query analysis & SQL generation
This is where the core natural language processing happens. The system searches its knowledge base for similar queries. If it finds one, it adapts the existing SQL. If not, it synthesizes a new one from templates and learned patterns.
Step 3: SQL enhancement & validation
This stage acts as the quality gate: checking syntax, flagging slow or dangerous operations, and blocking any attempt to modify data. Rather than blindly trusting the generated SQL, it goes through a full validation report. When issues are detected in this step, the system goes back to Step 2 and tries to generate a validated SQL based on the issue report.
Step 4: SQL formatting & optimization
The final SQL may run, but it can be difficult to read to the human eye. Here, the query is formatted, indexed, and rewritten for clarity and performance. It ensures that the user can actually read and understand the SQL that is used.
Step 5: SQL delivery and guidance
At this point, the user sees their query and result. They get the retrieved business context, execution guide, validation report, and how to interpret the data.
Something else happens in the background: the system evaluates whether this interaction is worth remembering. Was it novel? High-confidence? Correct? Every response is a learning opportunity waiting to be confirmed.
Step 6: Interactive learning dialog
If the query meets the criteria, the system opens a dialogue: "This looks like a valuable example!" It explains why this example will help improve the system, and suggests adding it to the dataset. When the user agrees, that moment of success becomes a new golden example.
Step 7: Auto-expansion & training update
Finally, the system takes that one golden example and expands it, generating new variations across phrasing, parameters, and complexity. One query becomes many. It’s how the AI evolves from rule-following to intuition.
Benefits of enforcing a workflow
This structured approach provides three critical advantages:
- Consistency: Same query always follows same process, producing reliable results
- Quality gates: Multiple validation steps catch errors before SQL reaches users
- Self-improvement: Every successful query is an opportunity to enhance the system
The enforcement mechanism is simple but effective: the system prompt explicitly instructs LLMs to execute these steps in order, and each step is implemented as a dedicated MCP tool that must be called.
The physical structure - under the hood
Separating concerns
While the 7-step workflow describes the sequence of operations, the 4-layer architecture defines where each operation happens and how components interact. This separation enables independent development, testing, and maintenance of each layer.
The 4-layer architecture

Layer 1: Client integration
This is where the behavior is defined and the 7-step workflow is enforced. It ensures that every request enters through the same disciplined 7-step process, resulting in consistent, high-quality processing for every query.
Layer 2: MCP server implementation
This layer provides a structured interface between the LLM and semantic engine. Whether it’s SQL generation, validation, or self-training, each operation runs through standardized APIs with error handling baked in. It enforces structure without stifling flexibility.
Layer 3: Semantic processing engine
This is where the core intelligence resides. It’s responsible for understanding intent, finding similar patterns, and generating SQL using both learned examples and fallback templates. It is implemented through the semantic orchestrator, the 'brain' of the system.
Layer 4: Knowledge & configuration base
All of the system’s institutional knowledge lives here in YAML and CSV files that describe metrics, queries, and business rules. Because it’s human-readable, domain experts can edit it directly. This layer keeps the system current, transparent, and explainable.
Benefits of this architecture
Together, these four layers create a modular foundation. We can improve one layer without breaking the others, replace the knowledge base to adapt to new business domains, or even connect multiple MCP servers to form larger, cross-domain analytics networks. It also enables the 7-step workflow by providing clean separation between "what to do" (workflow), "how to communicate" (MCP tools), "how to process" (semantic engine), and "what to know" (knowledge base).
Key components and implementations
Having understood the architecture, let's examine what makes this system significantly more effective than generic LLM-based solutions for enterprise data analysis. We identified six core innovations that set our MCP server apart.
| Issue | Generic LLMs | Our NL-to-SQL MCP Server |
| Domain knowledge | One-size-fits-all approach | Business knowledge externalized in human-editable YAML/CSV files |
| Learning capability | Static, never improves | Self-training with automatic variation generation from user interactions |
| Consistency | Unpredictable behavior | Mandatory 7-step workflow enforced by system prompt |
| Language support | English-only | Multi-language semantic understanding (Korean/English/Japanese) with business term mapping |
| Security & validation | Basic SQL generation | Three-layer validation pipeline (syntax, security, performance) |
| SQL generation | Black box process | Confidence-aware generation with similarity scoring and intelligent fallback patterns |
All six work together, but two form the beating heart of the system - Domain-specific configurability (teaching the AI what to know) and self-training architecture (teaching it how to grow). This post focuses on these two innovations in detail. If you're interested in the detailed implementation of other supporting innovations, feel free to reach out.
Innovation 1: Domain-specific configurability
Generic LLMs don't know what "DAU" means in our service. They don't know which tables are authoritative. They can't distinguish between test data and production metrics. Domain configurability solves this by externalizing business knowledge into human-readable configuration files.
How business knowledge becomes system intelligence
Our system separates business logic from technical implementation by moving all the business logic out of code and into human-readable files. We now define metrics, SQL templates, and segmentation rules directly in YAML.
- business_terms.yaml: What metrics mean in your business
- queries.yaml: How to calculate those metrics
- user_segmentation.yaml: Business rules and logic
We can edit these files directly without developer involvement, and changes take effect immediately without code deployment.
Example: Defining business terms
When someone asks "What's the DAU?", the system needs to know your specific definition. Here's how you configure it:
business_terms:
DAU:
term: "DAU"
full_name: "Daily Active Users"
definition: "Users with event logs during measurement period, excluding blocked and unsubscribed"
calculation_logic: "COUNT(DISTINCT user_id) WHERE has_events AND NOT blocked"
data_sources: ["user_event_history", "user_status_history"]
related_terms: ["MAU", "engagement", "retention"]
That small decision changes everything. It ensures that every query about DAU uses the same business logic, regardless of who asks or how they phrase the question.
Innovation 2: Self-training architecture
We didn’t want a static system. We wanted something that gets better with use. Every time a user asks a question and confirms the answer, the system adds it to its training set and automatically generates eight new variations (different phrasing, parameters, or levels of complexity).

The learning loop: from query to knowledge
Here's how the system learns.

-
The system evaluates whether the query is worth learning from, based on confidence, novelty, and validation results.
-
If approved, it asks the user for consent: “Should I learn this?”
-
Once confirmed, the query is stored in a CSV-based dataset that any analyst can review or edit. We chose CSV files over vector databases or complex ML pipelines for transparency and maintainability.
-
From there, the system automatically generates variations through reworded queries, different date ranges, and complexity levels. This effectively multiplies its training set.
Smart learning assessment
The system doesn't learn from every query. It automatically evaluates whether a successful query-SQL pair should be added to training data.
Learning opportunity criteria:
- High confidence score
- SQL passed all validation checks
- Novel pattern (not too similar to existing examples)
- Business-critical category (KPIs, revenue, user metrics)
- User confirmed accuracy
This ensures the training dataset contains only proven, high-quality examples.