patrickiel/PDF-Image-Extractor

PDF Image Extractor

A Python tool to extract images from PDF files.

Features

  • 📄 Extract images from PDF files
  • 📁 Organized output structure
  • 🖼️ Preserve original image formats
  • 🔍 Filter out small images and duplicates
  • 🛠️ Simple command-line interface

Prerequisites

  • Python 3.6 or higher
  • pip (Python package installer)

Installation

  1. Clone or download this repository
  2. Navigate to the project directory
  3. Create and activate a virtual environment (recommended):
    python -m venv .venv
    .venv\Scripts\activate  # On Windows
    source .venv/bin/activate  # On Unix/macOS (use instead of the Windows command)
  4. Install required packages:
    pip install -r requirements.txt

Usage

Basic Usage

  1. Place your PDF files in the pdfs directory
  2. Run the script:
    python pdf-image-extractor.py
  3. Extracted images will be saved in the extracted_images directory

Advanced Options

python pdf-image-extractor.py [INPUT_DIR] [--output_dir OUTPUT_DIR] [--min_size MIN_SIZE]

Arguments:

  • INPUT_DIR: Directory containing PDF files (optional, default: ./pdfs)
  • --output_dir: Directory to save extracted images (default: ./extracted_images)
  • --min_size: Minimum image dimension in pixels; smaller images are skipped (default: 100)
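
The arguments listed above could be declared with Python's standard argparse module roughly as follows. This is a minimal sketch, not the script's actual source: the use of argparse, the function name build_parser, and the help strings are assumptions; only the argument names and defaults come from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="pdf-image-extractor.py",
        description="Extract images from PDF files.",
    )
    # Optional positional argument: nargs="?" lets it fall back to the default.
    parser.add_argument(
        "input_dir",
        nargs="?",
        default="./pdfs",
        help="Directory containing PDF files",
    )
    parser.add_argument(
        "--output_dir",
        default="./extracted_images",
        help="Directory to save extracted images",
    )
    parser.add_argument(
        "--min_size",
        type=int,
        default=100,
        help="Minimum pixel dimension for images",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(["my_pdfs", "--min_size", "200"])
print(args.input_dir, args.output_dir, args.min_size)
```

Passing an explicit list to parse_args keeps the example independent of how it is launched; a real script would normally call parse_args() with no arguments.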

Examples:

# Use default pdfs directory
python pdf-image-extractor.py

# Specify custom input directory
python pdf-image-extractor.py my_pdfs

# Extract images with custom minimum size
python pdf-image-extractor.py --min_size 200

# Specify custom input and output directories
python pdf-image-extractor.py my_pdfs --output_dir my_images

Directory Structure

.
├── pdfs/                       # Place your PDF files here
├── extracted_images/           # Contains extracted images
│   └── pdf_name/               # Subdirectory for each PDF
│       └── pageX_imgY_WxH.ext  # Extracted images
├── pdf-image-extractor.py
├── requirements.txt
└── README.md
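
As an illustration of the layout above, the per-PDF subdirectory can be derived from the PDF's file name with pathlib. This is a sketch under assumptions: the helper name output_dir_for is hypothetical, and it assumes the subdirectory is named after the PDF file without its .pdf extension, which the tree suggests but does not state.

```python
from pathlib import Path


def output_dir_for(pdf_path, output_root="./extracted_images"):
    """Return the subdirectory that would hold one PDF's images.

    Assumption: the subdirectory is the PDF file name without its extension.
    """
    return Path(output_root) / Path(pdf_path).stem


target = output_dir_for("pdfs/annual_report.pdf")
print(target)
# A real run would create the directory before writing images:
# target.mkdir(parents=True, exist_ok=True)
```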

Output Format

Extracted images are named using the following format:

page{page_number}_img{image_index}_{width}x{height}.{extension}

Example: page1_img0_800x600.jpg
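
The naming format can be expressed as a small pair of helpers, one to build a file name and one to read the fields back out of it. The function names here are hypothetical and chosen for illustration; only the format string itself comes from this README.

```python
import re


def build_image_name(page_number, image_index, width, height, extension):
    """Format a file name as page{page}_img{index}_{width}x{height}.{ext}."""
    return f"page{page_number}_img{image_index}_{width}x{height}.{extension}"


# Matches names produced by build_image_name and captures each field.
_NAME_RE = re.compile(r"^page(\d+)_img(\d+)_(\d+)x(\d+)\.(\w+)$")


def parse_image_name(name):
    """Split a file name back into its fields, or raise ValueError."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"Unexpected image name: {name!r}")
    page, index, width, height, extension = match.groups()
    return {
        "page_number": int(page),
        "image_index": int(index),
        "width": int(width),
        "height": int(height),
        "extension": extension,
    }


print(build_image_name(1, 0, 800, 600, "jpg"))
print(parse_image_name("page1_img0_800x600.jpg"))
```

Parsing the name is handy when post-processing the output, for example to sort images by page or to select only images above a certain size.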

Notes

  • Each PDF's images are extracted to a separate subdirectory
  • Small images and duplicates are automatically filtered
  • Original image formats are preserved
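
The filtering described in these notes might look like the sketch below. This README does not say how the script detects duplicates or how it applies the minimum size, so two assumptions are made here: an image is dropped if either its width or its height is below min_size, and duplicates are detected by hashing the raw image bytes. The function name should_keep is hypothetical.

```python
import hashlib


def should_keep(image_bytes, width, height, seen_hashes, min_size=100):
    """Decide whether an extracted image should be saved.

    Assumptions (not confirmed by the README):
    - an image is "small" if either dimension is below min_size;
    - a duplicate is an image whose raw bytes were already seen.
    """
    if width < min_size or height < min_size:
        return False
    digest = hashlib.sha256(image_bytes).hexdigest()
    if digest in seen_hashes:
        return False
    # Remember this image so later identical copies are skipped.
    seen_hashes.add(digest)
    return True


seen = set()
print(should_keep(b"logo-bytes", 800, 600, seen))  # first copy of a large image
print(should_keep(b"logo-bytes", 800, 600, seen))  # identical bytes again
print(should_keep(b"icon-bytes", 32, 32, seen))    # below the minimum size
```

Byte-level hashing only catches exact copies; the same picture stored at a different resolution or compression level would count as a distinct image under this approach.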

Troubleshooting

  1. No PDFs found: Ensure your PDF files are in the specified input directory (./pdfs by default)
  2. Permission errors: Check that you have write permission for the output directory
  3. Corrupted PDFs: The script will skip problematic pages and continue processing
  4. Memory issues: Process large PDFs one at a time, for example by pointing INPUT_DIR at a directory that contains a single PDF
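
The first two items can be checked before running the extractor. The helper below is a hypothetical stand-alone check, not part of the script; it assumes PDFs are matched by the lowercase .pdf extension and reports problems as a list of messages.

```python
import os
from pathlib import Path


def preflight(input_dir, output_dir):
    """Return a list of human-readable problems; empty means none were found."""
    problems = []

    in_path = Path(input_dir)
    if not in_path.is_dir():
        problems.append(f"Input directory not found: {in_path}")
    elif not any(in_path.glob("*.pdf")):
        # The glob is case-sensitive on most Unix systems.
        problems.append(f"No PDFs found in: {in_path}")

    # The output directory may not exist yet, so test the nearest
    # existing ancestor for write permission instead.
    probe = Path(output_dir)
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    if not os.access(probe, os.W_OK):
        problems.append(f"No write permission for: {probe}")

    return problems


for message in preflight("./pdfs", "./extracted_images"):
    print(message)
```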

License

This project is licensed under the MIT License - see the LICENSE file for details.
