datamap
Data ManagementPrivacy-safe dataset documentation and Markdown data dictionaries
Version 1.0.0 | 2026-04-08
datamap documents Stata datasets without exporting row-level data. It produces two kinds of output:
datamapwrites a structured plain-text file designed to be pasted into an LLM prompt, attached to an internal data handoff, or fed into an automated pipeline. It includes privacy controls (exclude(),datesafe), automatic structure detection (panel, survival, survey), data quality flags, and missing-data summaries.datadictwrites a Markdown data dictionary suitable for GitHub, documentation sites, or conversion to PDF/Word/HTML via Pandoc. It includes document metadata (title(),author(),version()), optional missing-value and statistics columns, and a table of contents when documenting multiple datasets.
Both commands preserve the dataset in memory, accept the same input modes (data in memory, a single .dta file, a directory scan, or a named file list), and handle the .dta extension automatically.
Requirements
- Stata 16 or later
- Pandoc (optional) — only needed if you want to convert
datadictMarkdown output to PDF, HTML, or Word
Installation
capture ado uninstall datamap
net install datamap, from("https://raw.githubusercontent.com/tpcopeland/Stata-Tools/main/datamap") replace
Commands
| Command | Output format | Purpose |
|---|---|---|
datamap |
Plain text | LLM context, internal documentation, QA, privacy-controlled sharing |
datadict |
Markdown | GitHub repos, report appendices, Pandoc conversion, IRB submissions |
How It Works
- Choose the input source. Load data into memory (the default), or point at a file with
single(), a folder withdirectory(), or a list of names withfilelist(). - Pick the output command. Use
datamapfor plain text ordatadictfor Markdown. - Layer on options. Add privacy controls (
exclude(),datesafe), detection features (autodetect,detect(panel survival)), quality checks (quality), or missing-data analysis (missing(detail)) todatamap. Add document metadata (title(),author()) and descriptive statistics (stats,missing) todatadict. - Write the output. One combined file by default, or separate files per dataset with
separate.
Worked Examples
1. Quick text map from data in memory
The fastest starting point. datamap inspects the current dataset and writes a text summary to datamap.txt.
sysuse auto, clear
datamap
2. Quick Markdown dictionary from data in memory
sysuse auto, clear
datadict
This writes data_dictionary.md in the current directory.
3. Privacy controls with detection and quality checks
For sensitive data, exclude identifiers, suppress exact dates, and enable diagnostics:
sysuse auto, clear
datamap, exclude(make) quality missing(detail) autodetect output(auto_map.txt)
4. Markdown dictionary with statistics and metadata
datadict is the presentation-oriented companion. Add document metadata and optional columns:
sysuse auto, clear
datadict, output(auto_dictionary.md) ///
title("Auto Data Dictionary") ///
author("Timothy P Copeland, Karolinska Institutet") ///
missing stats
5. Document a saved dataset by filename
Both commands work on .dta files without loading them first:
sysuse auto, clear
save auto_example.dta, replace
datamap, single(auto_example) output(auto_example_map.txt)
datadict, single(auto_example) output(auto_example_dict.md) title("Saved auto")
6. Scale up to a directory
Document every .dta file in a folder — combined or one file per dataset:
datamap, directory("analysis_data") recursive output(project_map.txt)
datadict, directory("analysis_data") recursive separate
Feature Reference
datamap options
| Category | Options |
|---|---|
| Input | single(), directory(), filelist(), recursive |
| Output | output(), format(), separate, append |
| Content | nostats, nofreq, nolabels, maxfreq(), maxcat() |
| Privacy | exclude(), datesafe, dateformat() |
| Detection | detect(), autodetect, panelid(), survivalvars() |
| Quality | quality, quality2(strict), missing(detail|pattern) |
| Sample data | samples() |
datadict options
| Category | Options |
|---|---|
| Input | single(), directory(), filelist(), recursive |
| Output | output(), separate |
| Metadata | title(), subtitle(), version(), author(), date() |
| Content | notes(), changelog(), missing, stats, maxcat(), maxfreq(), dateformat() |
Variable classification (both commands)
| Priority | Condition | Class |
|---|---|---|
| 1 | Listed in exclude() |
Excluded |
| 2 | String type (str#, strL) |
String |
| 3 | Date format (%t*, %d*) |
Date |
| 4 | Value labels or ≤ maxcat() unique values |
Categorical |
| 5 | Everything else | Continuous |
Choosing Between the Commands
datamapwhen you need a technical inventory: LLM context windows, internal handoffs, automated pipelines, or privacy-controlled documentation.datadictwhen you need a publication-quality Markdown document: GitHub repositories, report appendices, IRB submissions, or Pandoc conversion.- Use
separatewith either command when each dataset should get its own output file. - Start with a single dataset; switch to
directory()+recursiveonce the output looks right.
Author
Timothy P Copeland, Karolinska Institutet