Extracting Details Using AWS Textract
Steps Involved:
1. PDF Processing: The code opens each PDF file with PyMuPDF (fitz) and first tries to
extract text directly from each page. If a page is an image or scanned document with no
extractable text, AWS Textract performs OCR to extract the text.
2. Regular Expressions for Detail Extraction: After the text is extracted, regular expressions
detect Aadhaar numbers, PAN numbers, Voter IDs, names, dates of birth (DOBs), and addresses.
3. Data Collection: The extracted details (ID type, ID number, name, DOB, and address) are
stored in a list for further processing.
4. Saving Data to Excel: The results are saved in an Excel file using pandas. Each row
represents an ID with associated details.
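
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the original code: the function names (page_text_or_ocr, extract_details, process_pdf), the column names, and all four regular expressions are assumptions, since the actual patterns are not shown here. The ID patterns assume the common formats (12-digit Aadhaar, optionally grouped 4-4-4; PAN as five letters, four digits, one letter; Voter ID as three letters and seven digits). Name and address extraction is left out because it depends heavily on the card layout. The OCR fallback uses the Textract detect_document_text call on the rendered page bytes, and writing .xlsx files with pandas requires an Excel engine such as openpyxl.

```python
import re

# Assumed patterns for illustration; the original expressions are not shown.
AADHAAR_RE = re.compile(r"(?<![\d/-])\d{4} ?\d{4} ?\d{4}(?!\d)")
PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")
VOTER_ID_RE = re.compile(r"\b[A-Z]{3}\d{7}\b")
DOB_RE = re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b")
ID_PATTERNS = (("Aadhaar", AADHAAR_RE), ("PAN", PAN_RE), ("Voter ID", VOTER_ID_RE))


def page_text_or_ocr(page, textract):
    """Return the page's embedded text, falling back to Textract OCR if empty."""
    text = page.get_text().strip()
    if text:
        return text
    # No text layer: render the page to PNG bytes and send them to Textract.
    png_bytes = page.get_pixmap(dpi=200).tobytes("png")
    response = textract.detect_document_text(Document={"Bytes": png_bytes})
    return "\n".join(
        block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"
    )


def extract_details(text):
    """Return one row (dict) per ID number found in the text."""
    dob_match = DOB_RE.search(text)
    dob = dob_match.group(0) if dob_match else None
    rows = []
    for id_type, pattern in ID_PATTERNS:
        for match in pattern.finditer(text):
            id_number = match.group(0).replace(" ", "")  # drop Aadhaar grouping spaces
            rows.append({"ID Type": id_type, "ID Number": id_number, "DOB": dob})
    return rows


def process_pdf(pdf_path, excel_path):
    """Extract ID details from every page of a PDF and save them to Excel."""
    # Third-party imports are kept local so the helpers above run without them.
    import boto3
    import fitz  # PyMuPDF
    import pandas as pd

    textract = boto3.client("textract")
    rows = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            rows.extend(extract_details(page_text_or_ocr(page, textract)))
    pd.DataFrame(rows).to_excel(excel_path, index=False)
    return rows
```

A call such as process_pdf("ids.pdf", "ids.xlsx") (both file names hypothetical) would then produce one spreadsheet row per detected ID. The lookbehind in the Aadhaar pattern is there so that the year of a date such as 01/01/1990 is not mistaken for the start of an ID number.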
Steps Involved (image-based pipeline via S3):
1. PDF to Image Conversion: The convert_pdf_to_images function converts each PDF page
into an image using PyMuPDF (fitz) and saves the images locally.
2. Uploading Images to S3: Each converted image is uploaded to an Amazon S3 bucket using
the boto3 S3 client.
3. Text Extraction from Images: AWS Textract is used to extract text from the uploaded
images using the analyze_document API.
4. Exporting to Excel: The extracted text is saved to an Excel file using pandas for
storage or further analysis.
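
The image-based steps above might look like the sketch below. Only convert_pdf_to_images is a name taken from the text; the other function names, the output file naming, the 200 dpi setting, the column names, and the choice of FeatureTypes=["FORMS"] are assumptions made for illustration. The analyze_document call requires a FeatureTypes argument, and the bucket is assumed to exist already and to be readable by Textract.

```python
import os


def convert_pdf_to_images(pdf_path, out_dir):
    """Render each PDF page to a PNG file and return the saved file paths."""
    import fitz  # PyMuPDF; imported locally so the pure helper below runs without it

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc, start=1):
            path = os.path.join(out_dir, f"page_{index}.png")
            page.get_pixmap(dpi=200).save(path)
            paths.append(path)
    return paths


def lines_from_blocks(response):
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


def extract_text_via_s3(image_paths, bucket):
    """Upload each image to S3 and run Textract analyze_document on it."""
    import boto3

    s3 = boto3.client("s3")
    textract = boto3.client("textract")
    rows = []
    for path in image_paths:
        key = os.path.basename(path)
        s3.upload_file(path, bucket, key)
        response = textract.analyze_document(
            Document={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["FORMS"],  # assumed; "TABLES" is another valid option
        )
        rows.append({"Image": key, "Text": "\n".join(lines_from_blocks(response))})
    return rows


def save_to_excel(rows, excel_path):
    """Write one spreadsheet row per image."""
    import pandas as pd

    pd.DataFrame(rows).to_excel(excel_path, index=False)
```

Chaining the three functions, for example save_to_excel(extract_text_via_s3(convert_pdf_to_images("ids.pdf", "pages"), "my-bucket"), "text.xlsx") with hypothetical file and bucket names, covers all four steps. Keeping lines_from_blocks separate makes the response parsing easy to check against a saved Textract response without calling AWS.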