Stata-MCP Overview

What is Stata-MCP and Stata?

Stata-MCP is a Model Context Protocol (MCP) server that bridges Large Language Models (LLMs) with Stata, enabling autonomous econometric analysis and statistical computation. Built on the FastMCP framework, Stata-MCP exposes Stata's comprehensive analytical capabilities as structured tools that LLMs can invoke programmatically, transforming natural language queries into reproducible Stata workflows.

Why Stata-MCP?

Stata remains the dominant analytical engine in empirical social science research. In China's economics discipline alone, over 80% of published articles are empirical studies, with more than 98.4% utilizing Stata for analysis. This prevalence stems from Stata's mature ecosystem, methodological completeness, and reliability in reproducing published research.

Stata-MCP addresses a critical gap in AI-assisted research: while modern LLMs excel at code generation and statistical reasoning, they lack native execution environments for domain-specific tools like Stata. By implementing the MCP protocol, Stata-MCP enables:

Deterministic Execution: LLM-generated Stata code executes in a controlled, reproducible environment
Methodological Rigor: Access to Stata's validated econometric implementations ensures analytical integrity
Workflow Orchestration: Complex multi-step analyses (data cleaning → estimation → visualization) become automated pipelines
Cross-Platform Compatibility: Unified abstraction layer across macOS, Windows, and Linux environments

Architecture Overview

Stata-MCP operates through four architectural layers:

1. Protocol Layer (MCP Server)

The FastMCP-based server (src/stata_mcp/__init__.py) implements the Model Context Protocol, exposing Stata operations as structured tools. Each tool defines: - Input parameter schemas with type validation - Output serialization for LLM consumption - Error handling and logging infrastructure - Resource registration for stateful operations

2. Execution Layer (Stata Integration)

Platform-specific Stata controllers manage command execution: - StataFinder: Locates Stata executables across operating systems (macOS: /Applications/Stata/, Windows: Program Files, Linux: system PATH) - StataController: Manages Stata process lifecycle, command invocation, and exit code monitoring - StataDo: Handles do-file execution with log capture and error reporting

3. Security & Monitoring Layer

Advanced safety features for production deployments: - Security Guard: Validates dofiles against dangerous commands (shell execution, file deletion, etc.) - Monitoring System: Real-time RAM monitoring with automatic process termination - Blacklist-based validation: Blocks dangerous operations before execution - Resource limits: Prevents memory exhaustion and system instability

4. Configuration Layer

Unified configuration management with hierarchical priority: - Configuration System: TOML-based config file at ~/.statamcp/config.toml - Environment variables: Override settings for specific sessions - Priority: Environment variables > config file > defaults - Sections: DEBUG, SECURITY, PROJECT, MONITOR

5. Application Layer (Modes & Tools)

Two primary operational modes:

MCP Server Mode (Default)

Operates as a stdio/HTTP/SSE server, responding to tool invocation requests from MCP-compliant clients. Tools include:

Tool	Purpose
`stata_do`	Execute do-files with log retrieval
`write_dofile`	Create timestamped do-files
`append_dofile`	Extend existing do-files immutably
`get_data_info`	Analyze CSV/DTA files with statistical summaries
`help`	Retrieve Stata command documentation (cached)
`ssc_install`	Install packages from SSC/GitHub/net sources
`load_figure`	Load Stata-generated graphics for display
`read_file`	Generic file reading with encoding support
`mk_dir`	Secure directory creation with validation

Agent Mode (`--agent` flag)

Interactive REPL agent for conversational analysis: - Read-Eval-Print Loop (REPL) interface for multi-turn sessions - SQLite-based session management for conversation history - Custom working directory support via --agent <path> - Environment variables for model configuration (STATA_MCP_MODEL, STATA_MCP_API_KEY) - Supports any OpenAI-compatible API endpoint

Data Processing Pipeline

Stata-MCP implements a polymorphic data analysis system supporting multiple formats:

DataInfo Architecture

Abstract base class DataInfoBase with format-specific implementations: - DtaDataInfo: Native Stata .dta format with metadata extraction - CsvDataInfo: CSV files with encoding detection and type inference - ExcelDataInfo: Excel workbooks with sheet selection

Statistical Metrics

Configurable metric computation (via ~/.statamcp/config.toml or environment variables): - Default: observations, mean, standard error, minimum, maximum - Extended: Q1, Q3, skewness, kurtosis, unique value sampling

Caching Strategy

Content-addressable cache using MD5 hashing:

~/.statamcp/.cache/data_info__<name>_<ext>__hash_<suffix>.json

Cache invalidation occurs automatically on content change detection.

Project Structure Convention

Stata-MCP enforces a standardized directory layout for reproducible research:

~/Documents/stata-mcp-folder/
├── stata-mcp-log/      # Stata execution logs (timestamped)
├── stata-mcp-dofile/   # Generated do-files (ISO 8601 timestamps)
├── stata-mcp-result/   # Command outputs (outreg2, esttab exports)
└── stata-mcp-tmp/      # Temporary artifacts (data info cache)

For AI-assisted research projects, the recommended template (stata-mcp --init) creates:

<project_name>/
├── .claude/
│   ├── skills/              # Custom Claude Code skills
│   └── settings.local.json  # MCP server registration
├── source/
│   ├── data/
│   │   ├── raw/             # Immutable source data
│   │   ├── processing/      # Intermediate datasets
│   │   └── final/           # Analysis-ready data
│   ├── figs/                # Publication figures
│   └── tabs/                # Publication tables
├── stata-mcp-folder/        # Stata-MCP working directory
└── CLAUDE.md                # Project-specific instructions

Integration Patterns

In AI Clients

MCP-compliant clients (Claude Code, Cline, Continue) register Stata-MCP as a server in their configuration:

{
  "mcpServers": {
    "stata-mcp": {
      "command": "uvx",
      "args": ["stata-mcp"]
    }
  }
}

In Python Agents

Stata-MCP agents can be embedded as tools within other agent workflows:

from stata_mcp.agent_as import StataAgent
from agents import Agent, Runner

# Initialize Stata agent and convert to tool
stata_agent = StataAgent()
stata_tool = stata_agent.as_tool

# Embed in a larger agent workflow
research_assistant = Agent(
    name="Research Assistant",
    instructions="You help with economic research using Stata",
    tools=[stata_tool]
)

# Run the agent
result = await Runner.run(
    research_assistant,
    "Analyze the relationship between education and income"
)

Terminal REPL

Interactive analysis sessions:

from stata_mcp.agent_as import REPLAgent

agent = REPLAgent(work_dir="~/analysis")
agent.run()  # Starts interactive REPL

Cross-Platform Support

Platform	Stata Detection	Package Installation	Help System
macOS	`/Applications/Stata/StataMP`	Native CLI	✅ Cached
Windows	`Program Files` registry	Do-file delegation	❌ Not supported
Linux	`stata-mp` from PATH	Native CLI	✅ Cached

Design Philosophy

Immutability: Source files remain unmodified; all operations create timestamped artifacts
Fail-Safety: Graceful degradation (e.g., append_dofile creates new files if source missing)
Reproducibility: Deterministic paths, automatic logging, and cache invalidation
Extensibility: Plugin architecture for custom tools and data format handlers
Security First:
Security Guard: Blocks dangerous commands before execution
Path Validation: Restricts file operations to working directory
Resource Monitoring: Prevents memory exhaustion via RAM monitoring
Sandboxed Execution: Isolated execution environments for safety

Advanced Features

Security Guard ✅

Automatically validates all dofile code against dangerous commands: - Blocks shell execution (!, shell, xshell, etc.) - Prevents file deletion operations (erase, rm) - Stops untrusted code execution (run, do, include) - Configurable via Security settings

RAM Monitoring ✅

Real-time monitoring of Stata process memory usage: - Tracks RAM usage during execution - Automatically terminates processes exceeding limits - Configurable RAM limits per project - Minimal overhead with daemon thread architecture - See Monitoring documentation for details

Unified Configuration ✅

Hierarchical configuration system: - TOML-based configuration file (~/.statamcp/config.toml) - Environment variable overrides - Sections: DEBUG, SECURITY, PROJECT, MONITOR - See Configuration documentation for details

Sandbox System (not support now)

Alternative execution backend using Jupyter kernels for environments without Stata licenses or for testing purposes.

Multi-Language Support (not support now)

Configurable language settings for localized error messages and documentation.

Citation and Acknowledgments

Stata-MCP is developed by the empirical research community to bridge AI assistance with domain-specific analytical tools. Contributions, bug reports, and feature requests are welcome via the GitHub repository.