Extracting Details AWS Textract

Uploaded by

padmanabh.p
Extracting Details from PDF and Images using AWS Textract and Regular Expressions

1. Extracting Details from PDF and Images using AWS Textract and Regular
Expressions (KYC Documents: Aadhaar, PAN, etc.)
This code extracts identification details (Aadhaar, PAN, Voter ID, name, date of birth,
and address) from a set of PDF files, including PDFs whose pages are scanned images.
It uses PyMuPDF (fitz) to read PDF content and AWS Textract to perform OCR on
image-only pages.
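
The fallback from direct text extraction to OCR can be sketched roughly as follows. This is a minimal illustration, not the original code: the helper names needs_ocr and extract_pdf_text, the 20-character threshold, and the 200 dpi render setting are all assumptions. It also assumes AWS credentials and a default region are already configured for boto3.

```python
# Sketch only: names, threshold, and dpi are illustrative assumptions.
MIN_CHARS = 20  # assumed cutoff below which a page is treated as scanned


def needs_ocr(page_text, min_chars=MIN_CHARS):
    """Return True when a page has too little embedded text to be useful."""
    return len(page_text.strip()) < min_chars


def extract_pdf_text(pdf_path):
    """Return one text string per page, using Textract OCR where needed."""
    # Imported here so needs_ocr can be used without these packages installed.
    import boto3
    import fitz  # PyMuPDF

    textract = boto3.client("textract")
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # Render the page to PNG bytes and send them to Textract.
                png_bytes = page.get_pixmap(dpi=200).tobytes("png")
                response = textract.detect_document_text(
                    Document={"Bytes": png_bytes}
                )
                text = "\n".join(
                    block["Text"]
                    for block in response["Blocks"]
                    if block["BlockType"] == "LINE"
                )
            pages.append(text)
    return pages
```

Checking the length of the embedded text is only one possible heuristic; a page that mixes real text with a scanned ID card would need a different test, such as checking whether the page contains images.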

Steps Involved:
1. PDF Processing: The code opens and reads PDF files using PyMuPDF (fitz). It tries to
extract text directly from each page. If the page contains an image or scanned document,
AWS Textract is used to extract the text by performing OCR.

2. Regular Expressions for Detail Extraction: After the text is extracted, regular expressions
detect Aadhaar numbers, PAN numbers, Voter IDs, names, DOBs, and addresses.

3. Data Collection: The extracted details (ID type, ID number, name, DOB, and address) are
stored in a list for further processing.

4. Saving Data to Excel: The results are saved in an Excel file using pandas. Each row
represents an ID with associated details.

5. Batch Processing Multiple PDFs: The function process_pdfs_in_directory processes all
PDF files in a specified directory and saves the combined results in an Excel file.
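
The regular-expression and data-collection steps above might look something like the sketch below. The patterns are assumptions based on the usual ID formats (Aadhaar: 12 digits, often grouped 4-4-4; PAN: five letters, four digits, one letter; Voter ID: three letters followed by seven digits) and are not taken from the original code. The function name extract_details, the "Name:" label convention, and the row keys are likewise hypothetical. Address extraction is omitted because it depends heavily on document layout.

```python
import re

# Illustrative patterns; assumed formats, not the original author's regexes.
ID_PATTERNS = {
    "Aadhaar": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    "Voter ID": re.compile(r"\b[A-Z]{3}\d{7}\b"),
}
DOB_PATTERN = re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b")
# Assumes the name appears on a line labelled "Name:" or "Name -".
NAME_PATTERN = re.compile(r"Name\s*[:\-]\s*([A-Za-z .]+)")


def extract_details(text):
    """Return one row (dict) per ID number found in the text."""
    dob = DOB_PATTERN.search(text)
    name = NAME_PATTERN.search(text)
    rows = []
    for id_type, pattern in ID_PATTERNS.items():
        for match in pattern.finditer(text):
            rows.append({
                "id_type": id_type,
                # Strip spaces so "1234 5678 9012" and "123456789012" agree.
                "id_number": re.sub(r"\s", "", match.group()),
                "name": name.group(1).strip() if name else None,
                "dob": dob.group() if dob else None,
            })
    return rows


sample = "Name: Ravi Kumar\nDOB: 01/02/1990\nPAN: ABCDE1234F"
print(extract_details(sample))
```

A list of rows in this shape can be written to Excel with pandas, for example pandas.DataFrame(rows).to_excel("output.xlsx", index=False), which requires an Excel writer such as openpyxl to be installed. Format-only patterns like these will also match look-alike numbers, so a real pipeline would likely add validation.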

2. Uploading Images to S3 and Extracting Text using AWS Textract (SOA PDF File)
This code converts a PDF into images, uploads them to an Amazon S3 bucket, and extracts
text using AWS Textract. It also saves the extracted data into an Excel file.

Steps Involved:
1. PDF to Image Conversion: The convert_pdf_to_images function converts each PDF page
into an image using PyMuPDF (fitz) and saves the images locally.

2. Uploading Images to S3: Each converted image is uploaded to an Amazon S3 bucket using
the boto3 S3 client.

3. Text Extraction from Images: AWS Textract is used to extract text from the uploaded
images using the analyze_document API.

4. Combining with Excel Export: The extracted text is saved into an Excel file using pandas
for further storage or analysis.
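
A rough sketch of this pipeline is shown below. Only convert_pdf_to_images is named in the description above; its signature here, the helpers lines_from_response and extract_text_via_s3, the "textract-input/" key prefix, the 200 dpi setting, and the choice of the TABLES feature type are all assumptions. The code also assumes AWS credentials and a default region are configured for boto3 and that the bucket already exists.

```python
from pathlib import Path


def convert_pdf_to_images(pdf_path, out_dir, dpi=200):
    """Render each PDF page to a PNG file and return the file paths."""
    import fitz  # PyMuPDF, imported lazily so the parser below has no deps

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    image_paths = []
    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            image_path = out / f"page_{number}.png"
            page.get_pixmap(dpi=dpi).save(str(image_path))
            image_paths.append(image_path)
    return image_paths


def lines_from_response(response):
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def extract_text_via_s3(image_paths, bucket, prefix="textract-input/"):
    """Upload each image to S3, run analyze_document, and return rows."""
    import boto3

    s3 = boto3.client("s3")
    textract = boto3.client("textract")
    rows = []
    for image_path in image_paths:
        key = f"{prefix}{image_path.name}"
        s3.upload_file(str(image_path), bucket, key)
        response = textract.analyze_document(
            Document={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["TABLES"],  # analyze_document requires FeatureTypes
        )
        for line in lines_from_response(response):
            rows.append({"image": image_path.name, "text": line})
    return rows
```

The returned rows can then be exported with pandas, for example pandas.DataFrame(rows).to_excel("output.xlsx", index=False). Keeping the response parsing in its own function makes it possible to check that step against a saved or hand-written response without calling AWS.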
