
data-janitor
data-janitor MCP server executes data cleaning operations including deduplication, missing value imputation, and format normalization via protocol calls. Data analysts, data scientists, and ML engineers use it to transform raw datasets into clean inputs for analysis and model training. An LLM on its own struggles when handed 50,000 lines of raw data, but with data-janitor doing the heavy lifting it shines.
🧹 Data Janitor MCP
Overview
The Data Janitor MCP is your automated data prep assistant. It enables programmatic, intelligent cleaning of datasets directly within your AI workflows (Claude Desktop, Cursor, etc.). Drop your messy HubSpot, Salesforce, or Pipedrive CSV/Excel export into your AI IDE, and Data Janitor instantly diagnoses, fuzzy-deduplicates, standardizes, and analyzes it natively using DuckDB, all without cramming 50 MB of data into the LLM context window.
To use it, you must first choose a plan.
Key Capabilities
- Data Health Diagnostics: Get an instant "Dirty Laundry" health score (0-100) that pinpoints exact columns with mixed types, nulls, or anomalies.
- Intelligent Deduplication: Catch fuzzy matches and subtle variations (e.g., "Jon Doe" vs "John Doe") using the Dice coefficient, merging them safely.
- Format Normalization: Instantly fix messy dates, cast data types, standardize phone numbers, and map 60+ country variants to standard ISO codes.
- Smart Imputation: Fill missing values using context-aware strategies (e.g., fill a missing Salary using the median of that specific Job Title).
- Embedded DuckDB Analytics: Run complex, multi-file SQL joins, aggregations, and pivot tables on massive datasets instantly.
- Time Travel (Undo): Every mutation automatically saves a full file snapshot to a local .janitor_history/ folder. Revert changes instantly if something goes wrong.
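To make the fuzzy-matching capability concrete, here is a minimal sketch of bigram-based Dice-coefficient similarity, the measure named above. This is an illustrative reimplementation for intuition, not the server's actual code, and the merge threshold is an assumption.

```typescript
// Count character bigrams in a string (case-insensitive).
function bigrams(s: string): Map<string, number> {
  const counts = new Map<string, number>();
  const t = s.toLowerCase();
  for (let i = 0; i < t.length - 1; i++) {
    const g = t.slice(i, i + 2);
    counts.set(g, (counts.get(g) ?? 0) + 1);
  }
  return counts;
}

// Dice coefficient: 2 * |shared bigrams| / (|bigrams A| + |bigrams B|).
// Returns 1.0 for identical strings, 0.0 for no overlap.
function diceCoefficient(a: string, b: string): number {
  const ga = bigrams(a);
  const gb = bigrams(b);
  let overlap = 0;
  let sizeA = 0;
  let sizeB = 0;
  ga.forEach((n, g) => {
    sizeA += n;
    overlap += Math.min(n, gb.get(g) ?? 0);
  });
  gb.forEach((n) => { sizeB += n; });
  return sizeA + sizeB === 0 ? 0 : (2 * overlap) / (sizeA + sizeB);
}
```

"Jon Doe" vs "John Doe" scores roughly 0.77, which clears a typical merge threshold (often around 0.7) while clearly different names score much lower.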
Use Cases
- Weekly CRM Export Cleanup: Standardize messy lead and contact lists (fixing formatting, merging duplicates) before importing them into marketing automation tools.
- Dataset Preprocessing for ML: Load a raw CSV, apply cleaning rules to remove outliers and fill gaps, and export a pristine file ready for model training.
- Automated Reporting: Use the DuckDB analytics engine to generate materialized pivot tables and aggregated metrics directly from raw exports.
- Ad-Hoc Data Formatting: Quickly pivot wide matrices or extract specific regex patterns (like email domains) from log files.
Why Use It?
- Saves Hours of Manual Work: Eliminates the 60+ minutes analysts spend manually filtering, deduping, and standardizing spreadsheets every week.
- Bypasses LLM Context Limits: By pushing the heavy data lifting to local TypeScript and DuckDB, the LLM never has to read 10,000 rows of data, saving tokens and preventing hallucinations.
- Zero-Risk Editing: With automatic local snapshots, you can let the AI aggressively clean your files without the fear of permanently destroying your data.
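The snapshot safety net behind zero-risk editing can be sketched in a few lines. This is a hypothetical helper mirroring the .janitor_history/ behavior described above; the function name and timestamp format are illustrative, not the server's actual API.

```typescript
import * as fs from "fs";
import * as path from "path";

// Copy a file into a sibling .janitor_history/ folder before mutating it,
// so any change can be reverted by copying the snapshot back.
function snapshot(filePath: string, historyDir = ".janitor_history"): string {
  const dir = path.join(path.dirname(filePath), historyDir);
  fs.mkdirSync(dir, { recursive: true });
  const stamp = new Date().toISOString().replace(/[:.]/g, "-");
  const dest = path.join(dir, `${stamp}_${path.basename(filePath)}`);
  fs.copyFileSync(filePath, dest);
  return dest; // restore later with fs.copyFileSync(dest, filePath)
}
```

Because each mutation snapshots the whole file rather than a diff, an undo is a single copy back, regardless of how complex the cleaning operation was.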
Who Will Use It?
- Marketing & Sales Ops: Processing weekly CRM exports and fixing contact data formatting.
- Data Analysts: Standardizing ad-hoc datasets for business intelligence dashboards.
- Machine Learning Practitioners: Ensuring feature data reliability and hygiene.
- Software Engineers: Parsing messy server logs or seeding databases with clean dummy data.
Copy-Paste Prompts for Day 1
Try pasting these exactly into your Cursor or Claude chat:
The Full Wash:
"Janitor, run a health scan on C:/exports/hubspot_contacts.csv. Tell me the Dirty Laundry score, find any fuzzy duplicate names, standardize the countries, and export a clean version."
The Context-Aware Imputation:
"Fill the missing Salary values in employees.xlsx by using the median salary grouped by Job Title."
The DuckDB Analytical Report (Advanced):
"Use the query_dataset tool to analyze leads.csv. Group them by Country, sum up the Revenue, and export the result to a new file called revenue_by_country_report.csv."
The Panic Button:
"Wait, that last standardization messed up the custom IDs. Please undo the last change."