by @microsoft autogen team
MarkItDown, from the Microsoft AutoGen team, is a lightweight Python utility that converts a variety of file types to Markdown for LLM consumption and text-analysis pipelines. It preserves important document structure (headings, lists, tables, links) as Markdown, a format that mainstream LLMs such as GPT-4o understand natively and that is highly token-efficient. It is designed for text analysis tools rather than for high-fidelity conversions meant for human readers.

Supported inputs include PDF, PowerPoint (PPTX), Word (DOCX), Excel (XLSX/XLS), images (EXIF metadata + OCR), audio (EXIF metadata + speech transcription), HTML, text formats (CSV, JSON, XML), ZIP files (iterates over contents), YouTube URLs (transcription), EPUBs, Outlook messages, and more.

MarkItDown is available through a command-line interface (markitdown path-to-file.pdf), a Python API, and Docker. An MCP server (markitdown-mcp) integrates it with Claude Desktop and other LLM applications. Further features are optional dependencies grouped by format ([pdf], [docx], [pptx], [xlsx], [xls], [outlook], [audio-transcription], [youtube-transcription]), a 3rd-party plugin system (search GitHub for #markitdown-plugin), Azure Document Intelligence integration for enhanced PDF/document conversion, LLM-enhanced image descriptions (via an OpenAI client), and stream-based conversion that creates no temporary files.

Requires Python 3.10+. Recommended installation: pip install 'markitdown[all]'. Breaking changes in v0.1.0: convert_stream() now requires a binary file-like object, and the DocumentConverter interface reads from streams instead of file paths.

Released under the MIT License; the project has adopted the Microsoft Open Source Code of Conduct. Available on PyPI (85,934+ downloads) and the MCP Registry.
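The stream-based conversion and the v0.1.0 requirement for binary file-like objects can be illustrated with a minimal sketch. It assumes the package exposes a MarkItDown class whose convert_stream() returns an object with a text_content attribute, as the project's README describes. The helper names bytes_to_markdown and make_converter are hypothetical and exist only for this example.

```python
import io


def bytes_to_markdown(converter, data: bytes) -> str:
    """Convert in-memory bytes to Markdown without touching the filesystem.

    `converter` is assumed to be a markitdown.MarkItDown instance, or any
    object with a compatible convert_stream() method.
    """
    # Since v0.1.0, convert_stream() expects a binary file-like object,
    # so the raw bytes are wrapped in io.BytesIO (not io.StringIO).
    stream = io.BytesIO(data)
    result = converter.convert_stream(stream)
    return result.text_content


def make_converter():
    """Build a default converter (hypothetical helper for this sketch)."""
    # Imported lazily so bytes_to_markdown() can be exercised with a stub
    # when the markitdown package is not installed.
    from markitdown import MarkItDown

    return MarkItDown()
```

With the package installed, a call such as bytes_to_markdown(make_converter(), html_bytes) should return the Markdown text. Because no file name is passed, the converter has to infer the format from the stream content alone.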
This server provides the following tools for AI assistants:
Convert any supported file format to Markdown (PDF, DOCX, PPTX, XLSX, XLS, images, audio, HTML, CSV, JSON, XML, ZIP, YouTube URLs, EPubs, Outlook messages)
Convert a binary file-like stream (e.g., io.BytesIO) to Markdown without creating temporary files
Convert PDF files to Markdown, preserving structure like headings, lists, tables, and links
Convert Microsoft Word (DOCX) files to Markdown, preserving document structure
Convert PowerPoint (PPTX) presentations to Markdown, with optional LLM-enhanced image descriptions
Convert Excel (XLSX/XLS) spreadsheets to Markdown tables
Convert images to Markdown with EXIF metadata extraction and OCR, plus optional LLM-enhanced descriptions
Convert audio files (WAV, MP3) to Markdown with EXIF metadata and speech transcription
Convert HTML files to Markdown, preserving structure and links
Convert web pages and YouTube URLs to Markdown (YouTube: fetch transcription)
Convert EPUB ebook files to Markdown
Convert ZIP archives to Markdown by iterating over and converting each file inside
Convert images and PPTX files to Markdown with LLM-enhanced descriptions (requires an OpenAI client and a model such as gpt-4o)
Convert files using Microsoft Azure Document Intelligence for enhanced PDF and document conversion
List installed 3rd-party plugins (to find plugins, search GitHub for #markitdown-plugin)
Enable 3rd-party plugins for conversion (disabled by default)
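The ZIP tool listed above is described as iterating over an archive and converting each file inside. The sketch below illustrates that idea using only the Python standard library. It is not MarkItDown's actual implementation: the zip_to_markdown function, the convert_member callback, and the "## File:" heading layout are all assumptions made for illustration.

```python
import io
import zipfile
from typing import Callable


def zip_to_markdown(data: bytes, convert_member: Callable[[str, bytes], str]) -> str:
    """Convert every file in a ZIP archive and join the results.

    `convert_member` is a hypothetical callback that receives a member's
    name and raw bytes and returns Markdown for that member.
    """
    sections = []
    # Open the archive from memory so no temporary file is created.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            body = convert_member(info.filename, archive.read(info))
            # One Markdown section per archive member.
            sections.append(f"## File: {info.filename}\n\n{body.strip()}")
    return "\n\n".join(sections)
```

In practice the callback could dispatch on the file extension and hand each member to a format-specific converter. Here it is left to the caller so that the iteration logic stays separate from the conversion.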