2.0
Features
Advanced Configurations: Introduced a new configuration layer that allows fine-grained control over sanitization rules, enabling users to customize thresholds, logging levels, and processing options without modifying code.
Enhanced Detailed Results: Expanded the result payload to include in-depth information about each sanitization step (file size before and after, the exact modifications applied, and warnings for potential issues) so that teams can better audit and debug pipeline behavior.
Comprehensive File-Type Validation: Integrated a detailed file-type checker that now explicitly detects and rejects legacy Office formats (e.g., Excel 5/95, Word 6/95) which Apache POI does not support, preventing undefined behavior or silent failures during processing.
Improved PDF Image Handling: Bundled the JAI-ImageIO libraries alongside PDFBox to improve support for less common image formats (such as TIFF and JPEG2000) within PDF documents, making image extraction and recompression more reliable.
Excel Comment Removal: Added logic to strip all embedded comments from Excel workbooks, eliminating hidden annotations or notes that might carry sensitive information.
Fixes
Codebase Refactor & Quality Improvements: Cleaned up legacy code paths, removed redundant functions, and standardized naming conventions to increase maintainability and readability across the sanitization modules.
Optimized XML Traversal: Replaced manual XML for-loop iterations with lxml XPath queries for faster attribute lookups and more accurate node selection, reducing CPU overhead during large-scale document processing.
OpenDocument Sanitization Corrections: Fixed incorrect regex patterns that previously led to improper sanitization of ODS/ODT files; updated the sanitization logic to reliably remove unwanted tags, metadata, and embedded objects.
Hyperlink Relationship Preservation: Resolved a bug where hyperlinks in Office documents were dropped during sanitization. Valid relationships now remain intact, and only unsafe or broken links are removed.
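As a rough illustration of the XPath-based traversal and the hyperlink fix described above, the sketch below selects hyperlink relationships in an OOXML `.rels` part with a single path query and removes only those whose target looks unsafe or broken. It is not the project's actual code: the function name `filter_hyperlinks`, the `SAFE_SCHEMES` allow-list, and the sample data are assumptions made for this example, and it uses the standard library's `xml.etree.ElementTree` (which supports a subset of XPath) as a stand-in for lxml so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace and relationship type defined by the Open Packaging Conventions / OOXML.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)

# Assumption for this sketch: what counts as a "safe" link is a simple scheme allow-list.
SAFE_SCHEMES = {"http", "https", "mailto"}


def filter_hyperlinks(rels_xml: str) -> tuple[str, list[str]]:
    """Remove unsafe or broken hyperlink relationships; keep everything else.

    Returns the cleaned XML and the list of removed relationship Ids.
    """
    root = ET.fromstring(rels_xml)
    removed: list[str] = []

    # One path query replaces a manual loop that inspects every child element.
    query = f"{{{REL_NS}}}Relationship[@Type='{HYPERLINK_TYPE}']"
    for rel in root.findall(query):
        target = rel.get("Target", "")
        # An empty or unrecognized scheme is treated as broken or unsafe.
        if urlparse(target).scheme.lower() not in SAFE_SCHEMES:
            root.remove(rel)
            removed.append(rel.get("Id", ""))

    return ET.tostring(root, encoding="unicode"), removed


# Hypothetical sample part: one non-hyperlink relationship, one safe link, one unsafe link.
SAMPLE_RELS = f"""<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
  <Relationship Id="rId2" Type="{HYPERLINK_TYPE}" Target="https://example.com" TargetMode="External"/>
  <Relationship Id="rId3" Type="{HYPERLINK_TYPE}" Target="javascript:alert(1)" TargetMode="External"/>
</Relationships>"""

if __name__ == "__main__":
    cleaned, removed = filter_hyperlinks(SAMPLE_RELS)
    print("removed:", removed)
```

Non-hyperlink relationships (such as the styles part) never match the query, so they are left untouched. With lxml, the same selection could be written as an `xpath()` call with a namespace map; the filtering logic would stay the same.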
Changes
Dependency Upgrade: Bumped the icalendar library from version 4.0.9 to 6.0.0 to leverage improved recurrence rule parsing, timezone handling, and overall performance optimizations in calendar file processing.