MCP-for-Stata Overview
What is MCP-for-Stata?
MCP-for-Stata is a Model Context Protocol (MCP) server that bridges Large Language Models (LLMs) with Stata, enabling autonomous econometric analysis and statistical computation. Built on the FastMCP framework, MCP-for-Stata exposes Stata's comprehensive analytical capabilities as structured tools that LLMs can invoke programmatically, transforming natural language queries into reproducible Stata workflows.
Why MCP-for-Stata?
Stata remains the dominant analytical engine in empirical social science research. In China's economics discipline alone, over 80% of published articles are empirical studies, with more than 98.4% utilizing Stata for analysis. This prevalence stems from Stata's mature ecosystem, methodological completeness, and reliability in reproducing published research.
MCP-for-Stata addresses a critical gap in AI-assisted research: while modern LLMs excel at code generation and statistical reasoning, they lack native execution environments for domain-specific tools like Stata. By implementing MCP, MCP-for-Stata enables:
- Deterministic Execution: LLM-generated Stata code executes in a controlled, reproducible environment
- Methodological Rigor: Access to Stata's validated econometric implementations ensures analytical integrity
- Workflow Orchestration: Complex multi-step analyses (data cleaning → estimation → visualization) become automated pipelines
- Cross-Platform Compatibility: Unified abstraction layer across macOS, Windows, and Linux environments
Architecture Overview
MCP-for-Stata operates through five architectural layers:
1. Protocol Layer (MCP Server)
The FastMCP-based server (src/stata_mcp/__init__.py) implements the Model Context Protocol, exposing Stata operations as structured tools. Each tool defines:
- Input parameter schemas with type validation
- Output serialization for LLM consumption
- Error handling and logging infrastructure
- Resource registration for stateful operations
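The per-tool responsibilities listed above (schema validation, serialization, error handling) can be illustrated with a small standard-library stand-in. This is not the FastMCP API, which handles registration through its own decorators; the `tool` decorator, the `call_tool` dispatcher, and the parameters of `stata_do` shown here are hypothetical and exist only to make the idea concrete.

```python
import inspect
import json
from typing import Any, Callable, Dict

# Hypothetical registry standing in for the framework's tool table.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as a callable tool under its own name."""
    TOOLS[func.__name__] = func
    return func


@tool
def stata_do(dofile_path: str, read_log: bool = True) -> dict:
    # Placeholder body: a real tool would run the do-file and collect its log.
    return {"dofile": dofile_path, "log_returned": read_log}


def call_tool(name: str, arguments: Dict[str, Any]) -> str:
    """Validate arguments against type hints, run the tool, return JSON."""
    try:
        func = TOOLS[name]
        sig = inspect.signature(func)
        bound = sig.bind(**arguments)  # rejects missing or unexpected parameters
        for pname, value in bound.arguments.items():
            expected = sig.parameters[pname].annotation
            if expected is not inspect.Parameter.empty and not isinstance(value, expected):
                raise TypeError(f"{pname} must be {expected.__name__}")
        return json.dumps({"ok": True, "result": func(*bound.args, **bound.kwargs)})
    except (KeyError, TypeError) as exc:
        # Errors are serialized rather than raised so the caller always gets JSON.
        return json.dumps({"ok": False, "error": str(exc)})
```

Returning errors as structured JSON instead of raising keeps the response shape uniform, which is convenient for an LLM client that has to decide what to do next.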
2. Execution Layer (Stata Integration)
Platform-specific Stata controllers manage command execution:
- StataFinder: Locates Stata executables across operating systems (macOS: /Applications/Stata/, Windows: Program Files, Linux: system PATH)
- StataController: Manages Stata process lifecycle, command invocation, and exit code monitoring
- StataDo: Handles do-file execution with log capture and error reporting
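The lookup strategy described for StataFinder (known install locations first, then the system PATH) might be sketched as follows. The function name `find_stata`, its parameters, and the list of executable names are assumptions for illustration and are not taken from the project's source.

```python
import os
import shutil
from typing import Callable, Iterable, Optional


def find_stata(
    candidates: Iterable[str] = (),
    path_names: Iterable[str] = ("stata-mp", "stata-se", "stata"),
    which: Callable[[str], Optional[str]] = shutil.which,
    exists: Callable[[str], bool] = os.path.isfile,
) -> Optional[str]:
    """Return the first Stata executable found, or None if nothing matches."""
    # Explicit install locations are checked first (the macOS/Windows style).
    for candidate in candidates:
        if exists(candidate):
            return candidate
    # Otherwise fall back to a PATH lookup (the Linux strategy).
    for name in path_names:
        found = which(name)
        if found:
            return found
    return None
```

Passing `which` and `exists` as parameters is a design choice made here so the lookup can be exercised without a Stata installation.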
3. Security & Monitoring Layer
Advanced safety features for production deployments:
- Security Guard: Validates dofiles against dangerous commands (shell execution, file deletion, etc.)
- Monitoring System: Real-time RAM monitoring with automatic process termination
- Blacklist-based validation: Blocks dangerous operations before execution
- Resource limits: Prevents memory exhaustion and system instability
4. Configuration Layer
Unified configuration management with hierarchical priority:
- Configuration System: TOML-based config file at ~/.statamcp/config.toml
- Environment variables: Override settings for specific sessions
- Priority: Environment variables > config file > defaults
- Sections: DEBUG, SECURITY, PROJECT, MONITOR, BETA, HELP, STATA, data_info
5. Application Layer (Modes & Tools)
Two primary operational modes:
MCP Server Mode (Default)
Operates as a stdio/HTTP/SSE server, responding to tool invocation requests from MCP-compliant clients. Tools include:
| Tool | Purpose |
|---|---|
| stata_do | Execute do-files with log retrieval |
| write_dofile | Create timestamped do-files (deprecated, controlled by BETA config) |
| get_data_info | Analyze CSV/DTA files with statistical summaries |
| help | Retrieve Stata command documentation (cached, Unix only) |
| ado_package_install | Install packages from SSC/GitHub/net sources |
| read_log | Read Stata log files (text and SMCL formats) |
Data Processing Pipeline
MCP-for-Stata implements a polymorphic data analysis system supporting multiple formats:
DataInfo Architecture
Abstract base class DataInfoBase with format-specific implementations:
- DtaDataInfo: Native Stata .dta format with metadata extraction
- CsvDataInfo: CSV/TSV/PSV files with encoding detection and type inference
- ExcelDataInfo: Excel workbooks (.xlsx, .xls) with sheet selection
- SpssDataInfo: SPSS data files (.sav, .zsav) - New in v1.14.0
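The abstract-base-class pattern described above could look roughly like this. The class names `DataInfoBase` and `CsvDataInfo` come from the list, but the `columns` method, the `REGISTRY` mapping, and the `data_info_for` helper are hypothetical, and only a CSV handler is shown.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List, Type, Union


class DataInfoBase(ABC):
    """Common interface; each file format supplies its own reader."""

    def __init__(self, path: Union[str, Path]) -> None:
        self.path = Path(path)

    @abstractmethod
    def columns(self) -> List[str]:
        """Return the variable names found in the file."""


class CsvDataInfo(DataInfoBase):
    def columns(self) -> List[str]:
        # Assumes UTF-8 and a header row; real encoding detection is omitted.
        with self.path.open(newline="", encoding="utf-8") as fh:
            return next(csv.reader(fh), [])


# Hypothetical extension-to-class registry used for dispatch.
REGISTRY: Dict[str, Type[DataInfoBase]] = {".csv": CsvDataInfo}


def data_info_for(path: Union[str, Path]) -> DataInfoBase:
    """Pick the handler class from the file extension."""
    cls = REGISTRY.get(Path(path).suffix.lower())
    if cls is None:
        raise ValueError(f"unsupported data format: {path}")
    return cls(path)
```

Supporting a further format in this sketch would mean adding one subclass and one registry entry, leaving callers unchanged.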
Statistical Metrics
Configurable metric computation (via ~/.statamcp/config.toml or environment variables):
- Default: observations, mean, standard error, minimum, maximum
- Extended: Q1, Q3, skewness, kurtosis, unique value sampling
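As a rough illustration of the default metric set, the sketch below computes the five values for one numeric variable. It assumes that "standard error" means the standard error of the mean (sample standard deviation divided by the square root of n); the project may define it differently, and the function name and dictionary keys are hypothetical.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Default metrics for a non-empty numeric variable."""
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean; undefined for a single observation.
    se = statistics.stdev(values) / math.sqrt(n) if n > 1 else float("nan")
    return {
        "obs": n,
        "mean": mean,
        "se": se,
        "min": min(values),
        "max": max(values),
    }
```

The extended metrics could be added in the same shape, for example with `statistics.quantiles(values, n=4)` for the quartiles.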
Caching Strategy
Content-addressable cache using MD5 hashing:
~/.statamcp/.cache/data_info__<name>_<ext>__hash_<suffix>.json
Cache invalidation occurs automatically on content change detection.
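A content-addressed cache file name following the pattern above might be derived like this. How long the hash suffix is, and which part of the digest it takes, are not stated in the text; this sketch assumes the last eight hexadecimal characters, and the function name is hypothetical.

```python
import hashlib
from pathlib import Path


def cache_path(data_file: Path, cache_dir: Path, suffix_len: int = 8) -> Path:
    """Build a cache file path whose name changes whenever the content changes."""
    # MD5 is used here as a content fingerprint, not for security.
    digest = hashlib.md5(data_file.read_bytes(), usedforsecurity=False).hexdigest()
    ext = data_file.suffix.lstrip(".")
    name = f"data_info__{data_file.stem}_{ext}__hash_{digest[-suffix_len:]}.json"
    return cache_dir / name
```

Because the hash is part of the file name, an edited data file simply maps to a different cache entry, so a stale summary is never read back.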
Project Structure Convention
MCP-for-Stata enforces a standardized directory layout for reproducible research:
~/.statamcp/
├── stata-mcp-log/ # Stata execution logs (timestamped)
├── stata-mcp-dofile/ # Generated do-files (ISO 8601 timestamps)
├── stata-mcp-result/ # Command outputs (outreg2, esttab exports)
└── stata-mcp-tmp/ # Temporary artifacts (data info cache)
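Writing a do-file into the timestamped layout above could be sketched as follows. The exact file naming is an assumption (ISO 8601 basic format, one file per second), and `write_timestamped_dofile` is a hypothetical helper rather than the project's function.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def write_timestamped_dofile(
    content: str, base_dir: Path, now: Optional[datetime] = None
) -> Path:
    """Write content to <base_dir>/stata-mcp-dofile/<timestamp>.do."""
    stamp = (now or datetime.now()).strftime("%Y%m%dT%H%M%S")  # ISO 8601 basic
    target_dir = base_dir / "stata-mcp-dofile"
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{stamp}.do"
    # Mode "x" refuses to overwrite, so existing artifacts stay immutable.
    with path.open("x", encoding="utf-8") as fh:
        fh.write(content)
    return path
```

Opening with mode `"x"` raises `FileExistsError` on a name collision, which matches the idea that earlier artifacts are never modified.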
For AI-assisted research projects, the recommended template (stata-mcp --init) creates:
<project_name>/
├── .claude/
│ ├── skills/ # Custom Claude Code skills
│ └── settings.local.json # MCP server registration
├── source/
│ ├── data/
│ │ ├── raw/ # Immutable source data
│ │ ├── processing/ # Intermediate datasets
│ │ └── final/ # Analysis-ready data
│ ├── figs/ # Publication figures
│ └── tabs/ # Publication tables
├── .statamcp/ # MCP-for-Stata working directory
└── CLAUDE.md # Project-specific instructions
Integration Patterns
In AI Clients
MCP-compliant clients (Claude Code, Cline, Continue) register MCP-for-Stata as a server in their configuration:
{
  "mcpServers": {
    "stata-mcp": {
      "command": "uvx",
      "args": ["stata-mcp"]
    }
  }
}
Cross-Platform Support
| Platform | Stata Detection | Package Installation | Help System |
|---|---|---|---|
| macOS | /Applications/Stata/StataMP | Native CLI | ✅ Cached |
| Windows | Program Files registry | Do-file delegation | ❌ Not supported |
| Linux | stata-mp from PATH | Native CLI | ✅ Cached |
Design Philosophy
- Immutability: Source files remain unmodified; all operations create timestamped artifacts
- Fail-Safety: Graceful degradation (e.g., append_dofile creates new files if source missing)
- Reproducibility: Deterministic paths, automatic logging, and cache invalidation
- Extensibility: Plugin architecture for custom tools and data format handlers
- Security First:
  - Security Guard: Blocks dangerous commands before execution
  - Path Validation: Restricts file operations to working directory
  - Resource Monitoring: Prevents memory exhaustion via RAM monitoring
  - Sandboxed Execution: Isolated execution environments for safety
Advanced Features
Security Guard ✅
Automatically validates all dofile code against dangerous commands:
- Blocks shell execution (!, shell, xshell, etc.)
- Prevents file deletion operations (erase, rm)
- Stops untrusted code execution (run, do, include)
- Configurable via Security settings
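A blacklist check of the kind described above can be sketched in a few lines. The blocked set below is drawn only from the commands named in this list; the project's real blacklist, its handling of abbreviations, and the function name `scan_dofile` are assumptions. The sketch inspects only the first command word of each line and ignores line continuations and colon prefixes.

```python
import re
from typing import List, Tuple

# Illustrative blacklist taken from the commands named above.
BLOCKED_COMMANDS = {"shell", "xshell", "erase", "rm", "run", "do", "include"}

# Common prefixes that may precede the actual command on a line.
_PREFIX = re.compile(
    r"^\s*(?:(?:quietly|noisily|capture|qui|cap|noi)\s+)*", re.IGNORECASE
)


def scan_dofile(text: str) -> List[Tuple[int, str]]:
    """Return (line number, blocked command) pairs found in do-file text."""
    violations = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("*") or line.startswith("//"):
            continue  # skip blank lines and whole-line comments
        line = _PREFIX.sub("", line, count=1)
        if line.startswith("!"):
            violations.append((lineno, "!"))  # shell escape shorthand
            continue
        first = re.match(r"[A-Za-z_]+", line)
        if first and first.group(0).lower() in BLOCKED_COMMANDS:
            violations.append((lineno, first.group(0).lower()))
    return violations
```

A caller would refuse to execute the do-file whenever the returned list is non-empty and report the offending line numbers back to the client.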
RAM Monitoring ✅
Real-time monitoring of Stata process memory usage:
- Tracks RAM usage during execution
- Automatically terminates processes exceeding limits
- Configurable RAM limits per project
- Minimal overhead with daemon thread architecture
- See Monitoring documentation for details
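The daemon-thread approach to RAM monitoring might be sketched as below. The `RamMonitor` class and its parameters are hypothetical; memory is read through an injected callable so the example stays within the standard library. In practice that callable would wrap something like a process-memory query from a library such as psutil.

```python
import threading
from typing import Callable


class RamMonitor:
    """Poll a memory reading and terminate the process when it exceeds a limit."""

    def __init__(
        self,
        read_mb: Callable[[], float],
        terminate: Callable[[], None],
        limit_mb: float,
        interval: float = 0.5,
    ) -> None:
        self._read_mb = read_mb
        self._terminate = terminate
        self._limit_mb = limit_mb
        self._interval = interval
        self._stop = threading.Event()
        self.tripped = threading.Event()  # set once the limit has been exceeded
        # A daemon thread never keeps the interpreter alive on its own.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            if self._read_mb() > self._limit_mb:
                self._terminate()
                self.tripped.set()
                return
            # Event.wait doubles as an interruptible sleep between polls.
            self._stop.wait(self._interval)
```

Typical use would be to call `start()` just before launching Stata and `stop()` once the run finishes, checking `tripped` to decide whether to report a memory-limit error.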
Unified Configuration ✅
Hierarchical configuration system:
- TOML-based configuration file (~/.statamcp/config.toml)
- Environment variable overrides
- Sections: DEBUG, SECURITY, PROJECT, MONITOR
- See Configuration documentation for details
Sandbox System (not yet supported)
Alternative execution backend using Jupyter kernels for environments without Stata licenses or for testing purposes.
Multi-Language Support (not yet supported)
Configurable language settings for localized error messages and documentation.
Citation and Acknowledgments
MCP-for-Stata is developed by the empirical research community to bridge AI assistance with domain-specific analytical tools. Contributions, bug reports, and feature requests are welcome via the GitHub repository.