fix: enables detection of pdf in URLs, and parsing of pdf content via URL #30

lzamparo · 2025-05-21T04:29:30Z

Closes #25

Modifies ingest.py to correctly dispatch to PDFParser based on the header of the retrieved content, and modifies PDFParser to retrieve the content again and parse it from a tempfile. This isn't super clean, but it should be a good starting point for caching and parsing PDFs, as the issue identified.
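For readers skimming the thread, here is a rough sketch of the general idea, not the code in this PR. It assumes the header in question is `Content-Type` and uses only the standard library. The names `looks_like_pdf`, `save_to_tempfile`, and `fetch_pdf_to_tempfile` are hypothetical and do not come from ingest.py or PDFParser.

```python
import os
import tempfile
import urllib.request
from typing import Optional


def looks_like_pdf(content_type: str, head: bytes = b"") -> bool:
    """Return True if the Content-Type value or the leading bytes indicate a PDF."""
    # Drop parameters such as "; charset=..." before comparing the media type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Fall back to the PDF magic bytes for servers that send a generic type.
    return media_type == "application/pdf" or head.startswith(b"%PDF-")


def save_to_tempfile(data: bytes, suffix: str = ".pdf") -> str:
    """Write the bytes to a temporary file and return its path (caller deletes it)."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    return path


def fetch_pdf_to_tempfile(url: str, timeout: float = 30.0) -> Optional[str]:
    """Download the URL; return a tempfile path if it is a PDF, otherwise None."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        content_type = resp.headers.get("Content-Type", "")
        data = resp.read()
    if not looks_like_pdf(content_type, data[:5]):
        return None
    return save_to_tempfile(data)
```

A dispatcher could call `fetch_pdf_to_tempfile(url)` and hand the returned path to a PDF parser, falling back to an HTML parser on `None`. Returning the path also leaves room for the caching mentioned above, since the file could be kept rather than deleted after parsing.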

lzamparo · 2025-06-03T15:01:19Z

@init27 any chance you could have a quick look at this? Don't want it to get too stale.

enables detection of pdf in URLs, and parsing of pdf content via URL

dbd709b

facebook-github-bot added the CLA Signed label May 21, 2025

init27 merged commit 1903098 into meta-llama:main Jul 4, 2025
1 check passed
