Skip to content

Extract2MD is a powerful and versatile AI-enabled client-side JavaScript library for extracting text from PDF files and converting it into Markdown.

License

Notifications You must be signed in to change notification settings

hashangit/Extract2MD

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

32 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Extract2MD - Enhanced PDF to Markdown Converter

NPM Version License Downloads

Sponsor on Patreon

A powerful client-side JavaScript library for converting PDFs to Markdown with multiple extraction methods and optional LLM enhancement. Now with scenario-specific methods for different use cases.

Extract2MD

πŸš€ Quick Start

Extract2MD now offers 5 distinct scenarios for different conversion needs:

import Extract2MDConverter from 'extract2md';

// Scenario 1: Quick conversion only
const markdown1 = await Extract2MDConverter.quickConvertOnly(pdfFile);

// Scenario 2: High accuracy OCR conversion only  
const markdown2 = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile);

// Scenario 3: Quick conversion + LLM enhancement
const markdown3 = await Extract2MDConverter.quickConvertWithLLM(pdfFile);

// Scenario 4: High accuracy conversion + LLM enhancement
const markdown4 = await Extract2MDConverter.highAccuracyConvertWithLLM(pdfFile);

// Scenario 5: Combined extraction + LLM enhancement (most comprehensive)
const markdown5 = await Extract2MDConverter.combinedConvertWithLLM(pdfFile);

πŸ“‹ Scenarios Explained

Scenario 1: Quick Convert Only

  • Use case: Fast conversion when PDF has selectable text
  • Method: quickConvertOnly(pdfFile, config?)
  • Tech: PDF.js text extraction only
  • Output: Basic markdown formatting

Scenario 2: High Accuracy Convert Only

  • Use case: PDFs with images, scanned documents, complex layouts
  • Method: highAccuracyConvertOnly(pdfFile, config?)
  • Tech: Tesseract.js OCR
  • Output: Markdown from OCR extraction

Scenario 3: Quick Convert + LLM

  • Use case: Fast extraction with AI enhancement for better formatting
  • Method: quickConvertWithLLM(pdfFile, config?)
  • Tech: PDF.js + WebLLM
  • Output: AI-enhanced markdown with improved structure and clarity

Scenario 4: High Accuracy + LLM

  • Use case: OCR extraction with AI enhancement
  • Method: highAccuracyConvertWithLLM(pdfFile, config?)
  • Tech: Tesseract.js OCR + WebLLM
  • Output: AI-enhanced markdown from OCR

Scenario 5: Combined + LLM (Recommended)

  • Use case: Most comprehensive conversion using both extraction methods
  • Method: combinedConvertWithLLM(pdfFile, config?)
  • Tech: PDF.js + Tesseract.js + WebLLM with specialized prompts
  • Output: Best possible markdown leveraging strengths of both extraction methods

βš™οΈ Configuration

Create a configuration object or JSON file to customize behavior:

const config = {
  // PDF.js Worker
  pdfJsWorkerSrc: "../pdf.worker.min.mjs",
  
  // Tesseract OCR Settings
  tesseract: {
    workerPath: "./tesseract-worker.min.js",
    corePath: "./tesseract-core.wasm.js", 
    langPath: "./lang-data/",
    language: "eng",
    options: {}
  },
  
  // LLM Configuration
  webllm: {
    model: "Qwen3-0.6B-q4f16_1-MLC",
    // Optional: Custom model
    customModel: {
      model: "https://huggingface.co/mlc-ai/your-model/resolve/main/",
      model_id: "YourModel-ID",
      model_lib: "https://example.com/your-model.wasm",
      required_features: ["shader-f16"],
      overrides: { conv_template: "qwen" }
    },
    options: {
      temperature: 0.7,
      maxTokens: 4096
    }
  },
  
  // System Prompt Customizations
  systemPrompts: {
    singleExtraction: "Focus on preserving code examples exactly.",
    combinedExtraction: "Pay attention to tables and diagrams from OCR."
  },
  
  // Processing Options
  processing: {
    splitPascalCase: false,
    pdfRenderScale: 2.5,
    postProcessRules: [
      { find: /\bAPI\b/g, replace: "API" }
    ]
  },
  
  // Progress Tracking
  progressCallback: (progress) => {
    console.log(`${progress.stage}: ${progress.message}`);
    if (progress.currentPage) {
      console.log(`Page ${progress.currentPage}/${progress.totalPages}`);
    }
  }
};

// Use configuration
const markdown = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);

πŸ”§ Advanced Usage

Using Individual Components

import { 
  WebLLMEngine, 
  OutputParser, 
  SystemPrompts,
  ConfigValidator 
} from 'extract2md';

// Validate configuration
const validatedConfig = ConfigValidator.validate(userConfig);

// Initialize WebLLM engine
const engine = new WebLLMEngine(validatedConfig);
await engine.initialize();

// Generate text
const result = await engine.generate("Your prompt here");

// Parse output
const parser = new OutputParser();
const cleanMarkdown = parser.parse(result);

Custom System Prompts

The library uses different system prompts for different scenarios:

// For scenarios 3 & 4 (single extraction)
const singlePrompt = SystemPrompts.getSingleExtractionPrompt(
  "Additional instruction: Preserve all technical terms."
);

// For scenario 5 (combined extraction) 
const combinedPrompt = SystemPrompts.getCombinedExtractionPrompt(
  "Focus on creating comprehensive documentation."
);

Configuration from JSON

import { ConfigValidator } from 'extract2md';

// Load from JSON string
const config = ConfigValidator.fromJSON(configJsonString);

// Use with any scenario
const result = await Extract2MDConverter.quickConvertWithLLM(pdfFile, config);

🎯 Error Handling & Progress Tracking

const config = {
  progressCallback: (progress) => {
    switch (progress.stage) {
      case 'scenario_5_start':
        console.log('Starting combined conversion...');
        break;
      case 'webllm_load_progress':
        console.log(`Loading model: ${progress.progress}%`);
        break;
      case 'ocr_page_process':
        console.log(`OCR: ${progress.currentPage}/${progress.totalPages}`);
        break;
      case 'webllm_generate_start':
        console.log('AI enhancement in progress...');
        break;
      case 'scenario_5_complete':
        console.log('Conversion completed!');
        break;
      default:
        console.log(`${progress.stage}: ${progress.message}`);
    }
    
    if (progress.error) {
      console.error('Error:', progress.error);
    }
  }
};

try {
  const result = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);
  console.log('Success:', result);
} catch (error) {
  console.error('Conversion failed:', error.message);
}

πŸ”„ Migration from Legacy API

If you're using the old API, you can still access it:

import { LegacyExtract2MDConverter } from 'extract2md';

// Old way
const converter = new LegacyExtract2MDConverter(options);
const quick = await converter.quickConvert(pdfFile);
const ocr = await converter.highAccuracyConvert(pdfFile);
const enhanced = await converter.llmRewrite(text);

// New way (recommended)
const quick = await Extract2MDConverter.quickConvertOnly(pdfFile, config);
const ocr = await Extract2MDConverter.highAccuracyConvertOnly(pdfFile, config);
const enhanced = await Extract2MDConverter.quickConvertWithLLM(pdfFile, config);

🌟 Features

  • 5 Scenario-Specific Methods: Choose the right approach for your use case
  • WebLLM Integration: Client-side AI enhancement with Qwen models
  • Custom Model Support: Use your own trained models
  • Advanced Output Parsing: Automatic removal of thinking tags and formatting
  • Comprehensive Configuration: Fine-tune every aspect of the conversion
  • Progress Tracking: Real-time updates for UI integration
  • TypeScript Support: Full type definitions included
  • Backwards Compatible: Legacy API still available

πŸ“š TypeScript Support

Full TypeScript definitions are included:

import Extract2MDConverter, { 
  Extract2MDConfig, 
  ProgressReport,
  CustomModelConfig 
} from 'extract2md';

const config: Extract2MDConfig = {
  webllm: {
    model: "Qwen3-0.6B-q4f16_1-MLC",
    options: {
      temperature: 0.7,
      maxTokens: 4096
    }
  },
  progressCallback: (progress: ProgressReport) => {
    console.log(progress.stage, progress.message);
  }
};

const result: string = await Extract2MDConverter.combinedConvertWithLLM(pdfFile, config);

πŸ—οΈ Installation & Deployment

NPM Installation

npm install extract2md

CDN Usage

<script src="https://unpkg.com/extract2md@2.0.0/dist/assets/extract2md.umd.js"></script>
<script>
    // Available as global Extract2MD
    const result = await Extract2MD.Extract2MDConverter.quickConvertOnly(pdfFile);
</script>

Worker Files Configuration

The package requires worker files for PDF.js and Tesseract.js. These are automatically copied during build:

// Default worker paths (adjust for your deployment)
const config = {
    pdfJsWorkerSrc: "/pdf.worker.min.mjs",
    tesseract: {
        workerPath: "/tesseract-worker.min.js",
        corePath: "/tesseract-core.wasm.js"
    }
};

Bundle Size Considerations

  • Total Size: ~11 MB (includes OCR and PDF processing)
  • PDF.js: ~950 KB
  • Tesseract.js: ~4.5 MB
  • WebLLM: Variable (model-dependent)

Use lazy loading and code splitting for production deployments.

πŸ“š Documentation

πŸ“„ License

MIT License - see LICENSE file for details.

🀝 Contributing

Contributions welcome! Please read the contributing guidelines before submitting PRs.

πŸ› Issues

Report issues on the GitHub Issues page.

About

Extract2MD is a powerful and versatile AI-enabled client-side JavaScript library for extracting text from PDF files and converting it into Markdown.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy