PDF Document Loader
PDF (Portable Document Format) is a file format developed by Adobe for presenting documents consistently across software platforms. This module loads and processes PDF files using pdf.js.
The loader can:
- Load single or multiple PDF files
- Split documents by page or file
- Support base64 encoded files
- Handle file storage integration
- Process content with text splitters
- Support legacy PDF versions
- Customize metadata extraction
Inputs
Required Parameters
- PDF File: The PDF file(s) to process (.pdf extension)
- Usage: Choose between:
  - One document per page
  - One document per file
Optional Parameters
- Text Splitter: A text splitter to process the extracted content
- Use Legacy Build: Whether to use the legacy pdf.js build
- Additional Metadata: JSON object with additional metadata
- Omit Metadata Keys: Comma-separated list of metadata keys to omit
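To make the two metadata options concrete, here is a minimal sketch of how they might combine. The helper name `applyMetadataOptions` is hypothetical, and the merge order (additional metadata overriding extracted keys, with omission applied last) is an assumption rather than the loader's confirmed behavior.

```typescript
// Hypothetical helper for illustration; not the loader's actual implementation.
type Metadata = Record<string, unknown>;

function applyMetadataOptions(
  base: Metadata,        // metadata extracted from the PDF
  additional: Metadata,  // parsed "Additional Metadata" JSON object
  omitKeys: string       // "Omit Metadata Keys", e.g. "pdf, page"
): Metadata {
  // Parse the comma-separated list, trimming whitespace and dropping empties.
  const omit = omitKeys
    .split(",")
    .map((key) => key.trim())
    .filter((key) => key.length > 0);

  // Assumption: additional metadata wins when a key exists in both objects.
  const merged: Metadata = { ...base, ...additional };

  // Copy over every key that is not in the omit list.
  const kept: Metadata = {};
  for (const key of Object.keys(merged)) {
    if (omit.indexOf(key) === -1) {
      kept[key] = merged[key];
    }
  }
  return kept;
}

const exampleMetadata = applyMetadataOptions(
  { source: "report.pdf", page: 1, pdf: { version: "1.7" } },
  { department: "finance" },
  "pdf, page"
);
console.log(exampleMetadata); // { source: 'report.pdf', department: 'finance' }
```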
Outputs
- Document: Array of document objects containing metadata and pageContent
- Text: Concatenated string from pageContent of documents
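The relationship between the two outputs can be sketched as follows. The `TextSourceDoc` type and `toText` function are illustrative names, and the newline separator is an assumption; the loader may join page contents differently.

```typescript
// Illustrative shape of one entry in the Document output.
interface TextSourceDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Derive the Text output by concatenating pageContent in document order.
// Assumption: a newline separates consecutive documents.
function toText(docs: TextSourceDoc[], separator: string = "\n"): string {
  return docs.map((doc) => doc.pageContent).join(separator);
}

const textOutput = toText([
  { pageContent: "First page.", metadata: { page: 1 } },
  { pageContent: "Second page.", metadata: { page: 2 } },
]);
console.log(textOutput);
```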
Features
- Multiple file support
- Page-level splitting
- Legacy version support
- Text extraction
- Metadata handling
- Error handling
- Memory-efficient processing
Processing Modes
Per Page Mode
- Each page becomes a document
- Preserves page numbers
- Individual page metadata
- Granular content access
Per File Mode
- Entire PDF as one document
- Combined content
- Single metadata set
- Memory efficient
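The difference between the two modes can be illustrated with a small sketch that starts from already-extracted page texts. The names `PageDoc`, `UsageMode`, and `buildDocuments` are hypothetical, and both the 1-based page numbering and the blank-line separator in per-file mode are assumptions.

```typescript
interface PageDoc {
  pageContent: string;
  metadata: { source: string; page?: number };
}

// Illustrative mode names; the loader's internal option values may differ.
type UsageMode = "perPage" | "perFile";

function buildDocuments(source: string, pageTexts: string[], usage: UsageMode): PageDoc[] {
  if (usage === "perPage") {
    // One document per page, each carrying its own page number (assumed 1-based).
    return pageTexts.map((text, index) => ({
      pageContent: text,
      metadata: { source, page: index + 1 },
    }));
  }
  // One document for the whole file: combined content, single metadata set.
  return [{ pageContent: pageTexts.join("\n\n"), metadata: { source } }];
}

const pages = ["Intro text", "Body text", "Appendix text"];
console.log(buildDocuments("manual.pdf", pages, "perPage").length); // 3
console.log(buildDocuments("manual.pdf", pages, "perFile").length); // 1
```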
Document Structure
Each document contains:
- pageContent: Extracted text content
- metadata:
  - source: Original file path
  - pdf: PDF-specific metadata
  - page: Page number (in per-page mode)
  - Additional custom metadata
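The structure above could be written as a TypeScript type along these lines. The interface name `PdfLoaderDocument` and the sample values are hypothetical; the contents of the `pdf` field are left open because they are not specified here.

```typescript
// Sketch of the document shape described above; field types are assumptions.
interface PdfLoaderDocument {
  pageContent: string;              // extracted text content
  metadata: {
    source: string;                 // original file path
    pdf?: Record<string, unknown>;  // PDF-specific metadata (contents unspecified)
    page?: number;                  // present in per-page mode only
    [customKey: string]: unknown;   // additional custom metadata
  };
}

// Made-up sample values, for illustration only.
const sampleDocument: PdfLoaderDocument = {
  pageContent: "Quarterly results improved.",
  metadata: {
    source: "reports/q3.pdf",
    pdf: {},
    page: 3,
    department: "finance",
  },
};
console.log(sampleDocument.metadata.page); // 3
```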
File Handling
Local Files
- Direct file loading
- Base64 encoded content
- Multiple file support
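As an illustration of handling base64 encoded content, the sketch below decodes a base64 string into bytes and checks for the `%PDF-` header that PDF files begin with. The function names are hypothetical, and the optional `data:` URI prefix is an assumed input format, not necessarily the one this loader receives. It relies on the global `atob`/`btoa` functions available in browsers and recent Node.js versions.

```typescript
// Hypothetical helpers; the loader's own decoding logic may differ.
function decodeBase64Pdf(input: string): Uint8Array {
  // Strip an optional data-URI prefix such as "data:application/pdf;base64,".
  const commaIndex = input.indexOf(",");
  const payload =
    input.slice(0, 5) === "data:" && commaIndex !== -1
      ? input.slice(commaIndex + 1)
      : input;

  // Decode base64 into a binary string, then copy it byte by byte.
  const binary = atob(payload);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Cheap validity check: PDF files start with the bytes "%PDF-".
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = "%PDF-";
  if (bytes.length < magic.length) return false;
  for (let i = 0; i < magic.length; i++) {
    if (bytes[i] !== magic.charCodeAt(i)) return false;
  }
  return true;
}

const encodedSample = "data:application/pdf;base64," + btoa("%PDF-1.7\n");
console.log(looksLikePdf(decodeBase64Pdf(encodedSample))); // true
```

A check like this allows an invalid upload to be rejected before the bytes are handed to the PDF parser.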
Storage Integration
- File storage system support
- Organization-based storage
- Chatflow-based storage
Notes
- Uses pdf.js for extraction
- Legacy version support
- Memory-efficient processing
- Error handling for invalid files
- Support for large PDFs
- Flexible output formats
- Metadata customization
- Text encoding handling