Extracting Details AWS Textract

Uploaded by

padmanabh.p

Extracting Details from PDF and Images using AWS Textract and Regular Expressions

1. Extracting Details from PDF and Images using AWS Textract and Regular Expressions (KYC Documents: Aadhaar, PAN, etc.)
This code extracts identification details (Aadhaar, PAN, and Voter ID numbers, along with name,
date of birth, and address) from a set of PDF files, including PDFs that contain scanned images.
It uses PyMuPDF (fitz) to read PDF content and AWS Textract to perform OCR on images.
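
As a rough illustration of that split between direct text extraction and OCR, the sketch below reads each page with PyMuPDF and falls back to Textract only when a page has almost no embedded text. The function names, the 20-character threshold, the 200 dpi rendering, and the choice of Textract's detect_document_text call are all assumptions made for this example; the original code is not shown in this document.

```python
def needs_ocr(page_text, min_chars=20):
    """Treat a page with almost no embedded text as scanned (assumed heuristic)."""
    return len(page_text.strip()) < min_chars


def extract_pdf_text(pdf_path, min_chars=20):
    """Return the text of each page, using Textract OCR for scanned pages."""
    # Third-party imports are kept local so needs_ocr stays dependency-free.
    import boto3
    import fitz  # PyMuPDF

    textract = boto3.client("textract")
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text, min_chars):
                # Render the page to PNG bytes and send them to Textract.
                png_bytes = page.get_pixmap(dpi=200).tobytes("png")
                response = textract.detect_document_text(
                    Document={"Bytes": png_bytes}
                )
                text = "\n".join(
                    block["Text"]
                    for block in response["Blocks"]
                    if block["BlockType"] == "LINE"
                )
            pages.append(text)
    return pages
```

Calling extract_pdf_text requires AWS credentials with Textract access; needs_ocr can be tried on its own without them.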

Steps Involved:
1. PDF Processing: The code opens and reads PDF files using PyMuPDF (fitz). It first tries to
extract text directly from each page. If a page is an image or a scanned document,
AWS Textract performs OCR to extract the text.

2. Regular Expressions for Detail Extraction: After the text is extracted, regular expressions
detect Aadhaar numbers, PAN numbers, Voter IDs, names, dates of birth, and addresses.

3. Data Collection: The extracted details (ID type, ID number, name, DOB, and address) are
stored in a list for further processing.

4. Saving Data to Excel: The results are saved in an Excel file using pandas. Each row
represents an ID with associated details.

5. Batch Processing Multiple PDFs: The function process_pdfs_in_directory processes all
PDF files in a specified directory and saves the combined results in an Excel file.
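
The detail-extraction and data-collection steps above can be sketched as follows. The patterns are assumptions based on the commonly published ID formats (a 12-digit Aadhaar number often printed in groups of four, a 10-character PAN, a Voter ID of three letters and seven digits); the original regular expressions are not shown in this document, and extract_details is a hypothetical name. Name and address extraction is left out because it depends heavily on each document's layout.

```python
import re

# Assumed patterns; not taken from the original code.
ID_PATTERNS = {
    "Aadhaar": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    "Voter ID": re.compile(r"\b[A-Z]{3}\d{7}\b"),
}
# Assumed date format: DD/MM/YYYY or DD-MM-YYYY.
DOB_PATTERN = re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b")


def extract_details(text):
    """Return one record per ID found in the text."""
    dob_match = DOB_PATTERN.search(text)
    dob = dob_match.group(0) if dob_match else None
    records = []
    for id_type, pattern in ID_PATTERNS.items():
        for match in pattern.finditer(text):
            records.append({
                "id_type": id_type,
                # Remove any spaces inside the number before storing it.
                "id_number": re.sub(r"\s", "", match.group(0)),
                "dob": dob,
            })
    return records


sample = "Name: A Kumar\nDOB: 01/02/1990\nPAN ABCDE1234F\nAadhaar 1234 5678 9012"
records = extract_details(sample)
```

A list of records like this can be passed to pandas.DataFrame(records) and written with to_excel, which matches the one-row-per-ID layout described in step 4.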

2. Uploading Images to S3 and Extracting Text using AWS Textract (SOA PDF File)
This code converts a PDF into images, uploads them to an Amazon S3 bucket, and extracts
text using AWS Textract. It also saves the extracted data into an Excel file.

Steps Involved:
1. PDF to Image Conversion: The convert_pdf_to_images function converts each PDF page
into an image using PyMuPDF (fitz) and saves the images locally.

2. Uploading Images to S3: Each converted image is uploaded to an Amazon S3 bucket using
the boto3 S3 client.

3. Text Extraction from Images: AWS Textract is used to extract text from the uploaded
images using the analyze_document API.

4. Excel Export: The extracted text is saved into an Excel file using pandas
for further storage or analysis.
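
A minimal sketch of the steps above, under stated assumptions: only convert_pdf_to_images and analyze_document are named in this document, so lines_from_response, extract_text_via_s3, the bucket and prefix parameters, the PNG file naming, and the FORMS feature type are hypothetical choices made for illustration.

```python
from pathlib import Path


def convert_pdf_to_images(pdf_path, out_dir, dpi=200):
    """Render each PDF page to a PNG file and return the file paths."""
    import fitz  # PyMuPDF; imported here so the pure helper below needs no extras

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            image_path = out / f"page_{page.number + 1}.png"
            page.get_pixmap(dpi=dpi).save(str(image_path))
            paths.append(image_path)
    return paths


def lines_from_response(response):
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def extract_text_via_s3(image_paths, bucket, prefix="textract-input/"):
    """Upload each image to S3, run analyze_document on it, and return rows."""
    import boto3

    s3 = boto3.client("s3")
    textract = boto3.client("textract")
    rows = []
    for page_no, image_path in enumerate(image_paths, start=1):
        key = f"{prefix}{Path(image_path).name}"
        s3.upload_file(str(image_path), bucket, key)
        response = textract.analyze_document(
            Document={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["FORMS"],  # analyze_document requires at least one feature type
        )
        for line in lines_from_response(response):
            rows.append({"page": page_no, "text": line})
    return rows
```

The returned rows can be written to Excel with pandas.DataFrame(rows).to_excel("output.xlsx", index=False), where the output file name is a placeholder. Running extract_text_via_s3 needs AWS credentials and an existing bucket.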
