MCP-for-Stata Overview

What is MCP-for-Stata?

MCP-for-Stata is a Model Context Protocol (MCP) server that bridges Large Language Models (LLMs) with Stata, enabling autonomous econometric analysis and statistical computation. Built on the FastMCP framework, MCP-for-Stata exposes Stata's comprehensive analytical capabilities as structured tools that LLMs can invoke programmatically, transforming natural language queries into reproducible Stata workflows.

Why MCP-for-Stata?

Stata remains the dominant analytical engine in empirical social science research. In China's economics discipline alone, over 80% of published articles are empirical studies, with more than 98.4% utilizing Stata for analysis. This prevalence stems from Stata's mature ecosystem, methodological completeness, and reliability in reproducing published research.

MCP-for-Stata addresses a critical gap in AI-assisted research: while modern LLMs excel at code generation and statistical reasoning, they lack native execution environments for domain-specific tools like Stata. By implementing the MCP protocol, MCP-for-Stata enables:

- Deterministic Execution: LLM-generated Stata code executes in a controlled, reproducible environment
- Methodological Rigor: Access to Stata's validated econometric implementations ensures analytical integrity
- Workflow Orchestration: Complex multi-step analyses (data cleaning → estimation → visualization) become automated pipelines
- Cross-Platform Compatibility: Unified abstraction layer across macOS, Windows, and Linux environments

Architecture Overview

MCP-for-Stata operates through five architectural layers:

1. Protocol Layer (MCP Server)

The FastMCP-based server (src/stata_mcp/__init__.py) implements the Model Context Protocol, exposing Stata operations as structured tools. Each tool defines:

- Input parameter schemas with type validation
- Output serialization for LLM consumption
- Error handling and logging infrastructure
- Resource registration for stateful operations
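To make the idea of a typed tool concrete, here is a minimal standard-library sketch. It is not FastMCP's implementation and not the project's actual code: the `TOOLS` registry, the `tool` decorator, `call_tool`, and the `stata_do` parameters are all hypothetical names chosen for illustration. It only shows how type hints can drive input validation and how results and errors can be serialized for a client.

```python
import inspect
import json
from typing import Any, Callable

TOOLS: dict[str, Callable[..., Any]] = {}  # hypothetical in-memory tool registry


def tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as a callable tool under its own name."""
    TOOLS[func.__name__] = func
    return func


def input_schema(func: Callable[..., Any]) -> dict[str, str]:
    """Describe each parameter by the name of its annotated type."""
    params = inspect.signature(func).parameters
    return {name: p.annotation.__name__ for name, p in params.items()}


def call_tool(name: str, arguments: dict[str, Any]) -> str:
    """Validate arguments against type hints, run the tool, return JSON text."""
    func = TOOLS.get(name)
    if func is None:
        return json.dumps({"ok": False, "error": f"unknown tool: {name}"})
    for pname, param in inspect.signature(func).parameters.items():
        if pname not in arguments:
            return json.dumps({"ok": False, "error": f"missing argument: {pname}"})
        if not isinstance(arguments[pname], param.annotation):
            return json.dumps({"ok": False, "error": f"bad type for: {pname}"})
    try:
        return json.dumps({"ok": True, "result": func(**arguments)})
    except Exception as exc:  # report failures to the client instead of crashing
        return json.dumps({"ok": False, "error": str(exc)})


@tool
def stata_do(dofile_path: str, read_log: bool) -> dict:
    """Placeholder body: a real tool would hand the do-file to Stata."""
    return {"dofile": dofile_path, "log_requested": read_log}
```

A client-side call such as `call_tool("stata_do", {"dofile_path": "a.do", "read_log": True})` returns a JSON string, so both successes and failures reach the LLM in the same shape.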

2. Execution Layer (Stata Integration)

Platform-specific Stata controllers manage command execution:

- StataFinder: Locates Stata executables across operating systems (macOS: /Applications/Stata/, Windows: Program Files, Linux: system PATH)
- StataController: Manages Stata process lifecycle, command invocation, and exit code monitoring
- StataDo: Handles do-file execution with log capture and error reporting
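The lookup performed by a component like StataFinder might be sketched as follows. This is an illustrative assumption, not the project's code: `find_stata` and `build_batch_command` are hypothetical helpers, the macOS and Linux candidates are taken from the platform table in this document, and the Windows executable name is a guess. The `-b do` arguments are Stata's batch-mode invocation on Unix-like systems.

```python
import os
import platform
import shutil
from typing import Callable, Optional


def find_stata(
    system: Optional[str] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
    exists: Callable[[str], bool] = os.path.exists,
) -> Optional[str]:
    """Return a path to a Stata executable, or None if nothing is found."""
    system = system or platform.system()
    if system == "Linux":
        return which("stata-mp")  # resolved from PATH
    if system == "Darwin":
        candidate = "/Applications/Stata/StataMP"  # location named in this document
        return candidate if exists(candidate) else which("stata-mp")
    if system == "Windows":
        return which("StataMP-64.exe")  # assumption: executable is on PATH
    return None


def build_batch_command(executable: str, dofile: str) -> list[str]:
    """Build the argument list for running a do-file in Unix batch mode."""
    return [executable, "-b", "do", dofile]
```

The `which` and `exists` parameters are injectable so the lookup can be exercised on machines without Stata installed.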

3. Security & Monitoring Layer

Advanced safety features for production deployments:

- Security Guard: Validates do-files against dangerous commands (shell execution, file deletion, etc.)
- Monitoring System: Real-time RAM monitoring with automatic process termination
- Blacklist-based validation: Blocks dangerous operations before execution
- Resource limits: Prevents memory exhaustion and system instability

4. Configuration Layer

Unified configuration management with hierarchical priority:

- Configuration System: TOML-based config file at ~/.statamcp/config.toml
- Environment variables: Override settings for specific sessions
- Priority: Environment variables > config file > defaults
- Sections: DEBUG, SECURITY, PROJECT, MONITOR, BETA, HELP, STATA, data_info
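The priority order above (environment variables, then config file, then defaults) can be sketched as a three-step lookup. Everything specific here is an assumption made for illustration: the `STATA_MCP__SECTION__KEY` naming scheme, the `max_ram_mb` and `guard_enabled` keys, and the default values are hypothetical and are not taken from the project.

```python
import os
from typing import Any, Mapping, Optional

# Hypothetical defaults; the real keys and values may differ.
DEFAULTS: dict[str, dict[str, Any]] = {
    "MONITOR": {"max_ram_mb": 4096},
    "SECURITY": {"guard_enabled": True},
}


def resolve(
    section: str,
    key: str,
    file_config: Mapping[str, Mapping[str, Any]],
    env: Optional[Mapping[str, str]] = None,
) -> Any:
    """Look a setting up in the environment, then the config file, then defaults."""
    env = os.environ if env is None else env
    env_name = f"STATA_MCP__{section}__{key}".upper()  # hypothetical naming scheme
    if env_name in env:
        return env[env_name]  # environment values arrive as strings
    if key in file_config.get(section, {}):
        return file_config[section][key]
    return DEFAULTS[section][key]
```

In practice `file_config` would be the dictionary produced by parsing `~/.statamcp/config.toml`, for example with `tomllib.load` on Python 3.11 or later. Values read from the environment are strings, so a caller would still need to convert them to the expected type.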

5. Application Layer (Modes & Tools)

Two primary operational modes:

MCP Server Mode (Default)

Operates as a stdio/HTTP/SSE server, responding to tool invocation requests from MCP-compliant clients. Tools include:

Tool	Purpose
`stata_do`	Execute do-files with log retrieval
`get_data_info`	Analyze CSV/TSV/PSV, DTA, XLSX/XLS, and SPSS SAV/ZSAV files with statistical summaries
`help`	Retrieve Stata command documentation (cached; Unix only)
`ado_package_install`	Install approved packages; GitHub requires an allowlist (unsafe profile only)
`read_log`	Read Stata log files (text and SMCL formats)

Data Processing Pipeline

MCP-for-Stata implements a polymorphic data analysis system supporting multiple formats:

DataInfo Architecture

Abstract base class DataInfoBase with format-specific implementations:

- DtaDataInfo: Native Stata .dta format with metadata extraction
- CsvDataInfo: CSV/TSV/PSV files with encoding detection and type inference
- ExcelDataInfo: Excel workbooks (.xlsx, .xls) with sheet selection
- SpssDataInfo: SPSS data files (.sav, .zsav), new in v1.14.0
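The polymorphic design can be illustrated with a small sketch. The class names `DataInfoBase` and `CsvDataInfo` come from this document, but the methods (`read_columns`, `info`), the constructor signature, and the choice to accept raw text instead of a file path are assumptions made to keep the example self-contained. Encoding detection and type inference are omitted.

```python
import csv
import io
from abc import ABC, abstractmethod


class DataInfoBase(ABC):
    """Shared summary logic; subclasses only know how to read their format."""

    def __init__(self, source: str) -> None:
        self.source = source

    @abstractmethod
    def read_columns(self) -> dict[str, list[str]]:
        """Return a mapping of variable name to its raw values."""

    def info(self) -> dict:
        columns = self.read_columns()
        n_obs = len(next(iter(columns.values()), []))
        return {"variables": list(columns), "observations": n_obs}


class CsvDataInfo(DataInfoBase):
    """Delimited text; pass delimiter="\t" or "|" for TSV or PSV content."""

    def __init__(self, source: str, delimiter: str = ",") -> None:
        super().__init__(source)
        self.delimiter = delimiter

    def read_columns(self) -> dict[str, list[str]]:
        reader = csv.DictReader(io.StringIO(self.source), delimiter=self.delimiter)
        columns: dict[str, list[str]] = {name: [] for name in reader.fieldnames or []}
        for row in reader:
            for name in columns:
                columns[name].append(row[name])
        return columns
```

Adding another format under this design means writing one subclass with its own `read_columns`, while `info` stays unchanged.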

Statistical Metrics

Configurable metric computation (via ~/.statamcp/config.toml or environment variables):

- Default: observations, mean, standard error, minimum, maximum
- Extended: Q1, Q3, skewness, kurtosis, unique value sampling
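As a rough illustration of the default and extended metric sets, the sketch below computes them for a single numeric variable with the standard library. The `summarize` function and its key names are hypothetical. It also assumes that "standard error" means the standard error of the mean and uses population moments for skewness and kurtosis; the project may define these differently. Unique value sampling is left out.

```python
import math
import statistics


def summarize(values: list[float], extended: bool = False) -> dict[str, float]:
    """Summarize one numeric variable; needs at least two non-constant values."""
    n = len(values)
    mean = statistics.fmean(values)
    result = {
        "obs": n,
        "mean": mean,
        "se": statistics.stdev(values) / math.sqrt(n),  # assumed: SE of the mean
        "min": min(values),
        "max": max(values),
    }
    if extended:
        q1, _median, q3 = statistics.quantiles(values, n=4)
        # Central moments (population form) for skewness and kurtosis.
        m2 = sum((v - mean) ** 2 for v in values) / n
        m3 = sum((v - mean) ** 3 for v in values) / n
        m4 = sum((v - mean) ** 4 for v in values) / n
        result.update(q1=q1, q3=q3, skewness=m3 / m2 ** 1.5, kurtosis=m4 / m2 ** 2)
    return result
```

Keeping the extended metrics behind a flag mirrors the configurable behaviour described above: the cheaper defaults are always computed, and the heavier statistics only on request.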

Caching Strategy

Content-addressable cache using MD5 hashing:

~/.statamcp/.cache/data_info__<name>_<ext>__hash_<suffix>.json

Cache invalidation occurs automatically on content change detection.
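A content-addressed cache path following the pattern above might be derived like this. It is a sketch under stated assumptions: `cache_path` is a hypothetical helper, and treating `<suffix>` as the last eight hexadecimal characters of the MD5 digest is a guess, since the document does not say how the suffix is formed.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".statamcp" / ".cache"


def cache_path(data_file: Path, content: bytes, suffix_len: int = 8) -> Path:
    """Derive the cache file for a dataset from its name and its content hash."""
    digest = hashlib.md5(content).hexdigest()
    name = data_file.stem
    ext = data_file.suffix.lstrip(".")
    # Assumption: the "<suffix>" in the pattern is the tail of the hex digest.
    return CACHE_DIR / f"data_info__{name}_{ext}__hash_{digest[-suffix_len:]}.json"
```

Because the hash is part of the file name, editing the dataset yields a different path, so a stale entry is simply never looked up again. That is one way the automatic invalidation described above could work.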

Project Structure Convention

MCP-for-Stata enforces a standardized directory layout for reproducible research:

~/.statamcp/
├── stata-mcp-log/      # Stata execution logs (timestamped)
├── stata-mcp-dofile/   # Generated do-files (ISO 8601 timestamps)
└── stata-mcp-tmp/      # Temporary artifacts (data info cache)

For AI-assisted research projects, a recommended layout is:

<project_name>/
├── .claude/
│   ├── skills/              # Custom Claude Code skills
│   └── settings.local.json  # MCP server registration
├── source/
│   ├── data/
│   │   ├── raw/             # Immutable source data
│   │   ├── processing/      # Intermediate datasets
│   │   └── final/           # Analysis-ready data
│   ├── figs/                # Publication figures
│   └── tabs/                # Publication tables
├── .statamcp/               # MCP-for-Stata working directory
└── CLAUDE.md                # Project-specific instructions
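The recommended layout can be created with a short script. This is a convenience sketch, not a tool shipped with the project; `scaffold` and `PROJECT_DIRS` are hypothetical names, and the directory list simply mirrors the tree above.

```python
from pathlib import Path

# Directories from the recommended layout, relative to the project root.
PROJECT_DIRS = [
    ".claude/skills",
    "source/data/raw",
    "source/data/processing",
    "source/data/final",
    "source/figs",
    "source/tabs",
    ".statamcp",
]


def scaffold(root: Path) -> list[Path]:
    """Create the project tree under root; safe to run more than once."""
    created = []
    for rel in PROJECT_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    (root / "CLAUDE.md").touch(exist_ok=True)  # empty placeholder for instructions
    return created
```

Running `scaffold(Path("my_project"))` a second time leaves existing files untouched, which suits the convention that raw data stays immutable.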

Integration Patterns

In AI Clients

MCP-compliant clients (Claude Code, Cline, Continue) register MCP-for-Stata as a server in their configuration:

{
  "mcpServers": {
    "stata-mcp": {
      "command": "uvx",
      "args": ["stata-mcp"]
    }
  }
}
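If a client configuration file already exists, the entry above can be merged into it without discarding other settings. The helper below is a hypothetical sketch; the correct file location depends on the client, and `.claude/settings.local.json` from the project layout in this document is only one possibility.

```python
import json
from pathlib import Path


def register_server(settings_path: Path) -> dict:
    """Add the stata-mcp entry to a JSON settings file, keeping other keys."""
    if settings_path.exists():
        settings = json.loads(settings_path.read_text())
    else:
        settings = {}
    servers = settings.setdefault("mcpServers", {})
    servers["stata-mcp"] = {"command": "uvx", "args": ["stata-mcp"]}
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2))
    return settings
```

Using `setdefault` means other registered servers under `mcpServers` are preserved, and only the `stata-mcp` entry is added or overwritten.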

Cross-Platform Support

Platform	Stata Detection	Package Installation	Help System
macOS	`/Applications/Stata/StataMP`	Native CLI	✅ Cached
Windows	`Program Files` registry	Do-file delegation	❌ Not supported
Linux	`stata-mp` from PATH	Native CLI	✅ Cached

Design Philosophy

- Immutability: Source files remain unmodified; all operations create timestamped artifacts
- Fail-Safety: Graceful degradation (e.g., append_dofile creates a new file if the source is missing)
- Reproducibility: Deterministic paths, automatic logging, and cache invalidation
- Extensibility: Plugin architecture for custom tools and data format handlers
- Security First:
  - Security Guard: Blocks dangerous commands before execution
  - Path Validation: Restricts file operations to the working directory
  - Resource Monitoring: Prevents memory exhaustion via RAM monitoring
  - Sandboxed Execution: Isolated execution environments for safety

Advanced Features

Security Guard ✅

Automatically validates all do-file code against dangerous commands:

- Blocks shell execution (!, shell, xshell, etc.)
- Prevents file deletion operations (erase, rm)
- Stops untrusted code execution (run, do, include)
- Configurable via Security settings
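A blacklist check of this kind can be sketched in a few lines. The code below is an illustration only, not the project's Security Guard: `find_violations`, `BLOCKED`, and `PREFIXES` are hypothetical names, the blocked set is copied from the list above, and the parsing is deliberately naive. It does not handle command abbreviations, `/* */` comments, or line continuations.

```python
# Commands named in this document as blocked; the real list may be longer.
BLOCKED = {"shell", "xshell", "erase", "rm", "run", "do", "include"}
# Stata prefix commands that can precede the command actually being run.
PREFIXES = {"quietly", "capture", "noisily"}


def find_violations(dofile_text: str) -> list[tuple[int, str]]:
    """Return (line number, offending token) for each blocked command found."""
    violations = []
    for lineno, raw in enumerate(dofile_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith(("*", "//")):
            continue  # skip blank lines and whole-line comments
        tokens = line.split()
        while tokens and tokens[0].rstrip(":") in PREFIXES:
            tokens = tokens[1:]  # look past prefixes such as "capture"
        if not tokens:
            continue
        head = tokens[0]
        if head.startswith("!") or head in BLOCKED:
            violations.append((lineno, head))
    return violations
```

A caller would refuse to execute the do-file whenever the returned list is non-empty and report the line numbers back to the client.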

RAM Monitoring ✅

Real-time monitoring of Stata process memory usage:

- Tracks RAM usage during execution
- Automatically terminates processes exceeding limits
- Configurable RAM limits per project
- Minimal overhead with daemon thread architecture

See the Monitoring documentation for details.
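The daemon-thread approach can be sketched as a polling loop. This is a hedged illustration, not the project's monitor: `RamMonitor` and its parameters are hypothetical, and the memory reading is injected as a callable so the example runs without Stata. A real deployment would need an actual per-process reading, for instance from the third-party `psutil` package.

```python
import threading
from typing import Callable


class RamMonitor(threading.Thread):
    """Poll a memory reading and call terminate() once it exceeds the limit."""

    def __init__(
        self,
        read_mb: Callable[[], float],
        terminate: Callable[[], None],
        limit_mb: float,
        interval: float = 0.5,
    ) -> None:
        super().__init__(daemon=True)  # daemon: never blocks interpreter exit
        self.read_mb = read_mb
        self.terminate = terminate
        self.limit_mb = limit_mb
        self.interval = interval
        self.tripped = threading.Event()  # set once the limit was exceeded
        self._halt = threading.Event()    # set to stop monitoring early

    def run(self) -> None:
        # Event.wait doubles as an interruptible sleep between polls.
        while not self._halt.wait(self.interval):
            if self.read_mb() > self.limit_mb:
                self.terminate()
                self.tripped.set()
                return

    def stop(self) -> None:
        self._halt.set()
```

Typical use would be to start the monitor just before launching Stata, pass the process's kill method as `terminate`, and call `stop()` when the run finishes normally.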

Unified Configuration ✅

Hierarchical configuration system:

- TOML-based configuration file (~/.statamcp/config.toml)
- Environment variable overrides
- Sections: DEBUG, SECURITY, PROJECT, MONITOR

See the Configuration documentation for details.

Sandbox System (not yet supported)

Alternative execution backend using Jupyter kernels for environments without Stata licenses or for testing purposes.

Multi-Language Support (not yet supported)

Configurable language settings for localized error messages and documentation.

Citation and Acknowledgments

MCP-for-Stata is developed by the empirical research community to bridge AI assistance with domain-specific analytical tools. Contributions, bug reports, and feature requests are welcome via the GitHub repository.