2.0

Features

  • Advanced Configurations: Introduced a new configuration layer that allows fine-grained control over sanitization rules, enabling users to customize thresholds, logging levels, and processing options without modifying code.

  • Detailed Results: Expanded the result payload to report on each sanitization step, including file size before and after, the exact modifications applied, and warnings for potential issues, so that teams can better audit and debug pipeline behavior.

  • Comprehensive File-Type Validation: Integrated a detailed file-type checker that explicitly detects and rejects legacy Office formats (e.g., Excel 5/95, Word 6/95), which Apache POI does not support. This prevents undefined behavior or silent failures during processing.

  • Improved PDF Image Handling: Bundled the JAI-ImageIO libraries alongside PDFBox to improve support for less common image formats (such as TIFF and JPEG2000) within PDF documents, making image extraction and recompression more reliable.

  • Excel Comment Removal: Added logic to strip out all embedded comments from Excel workbooks, eliminating hidden annotations or notes that might carry sensitive information or macros.
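
To illustrate the kind of check that file-type validation starts from, here is a minimal sketch that classifies a file by its leading signature bytes. It is not the product's checker: the names `sniff_container` and `sniff_file` are hypothetical. It also stops at the container level. Telling Excel 5/95 or Word 6/95 apart from later binary Office formats requires inspecting the streams inside the OLE2 container, which this sketch does not attempt.

```python
# Hypothetical sketch: classify a document by its leading signature bytes.
# The function names are illustrative and not part of the product.

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (binary Office formats)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP container (OOXML, OpenDocument)
PDF_MAGIC = b"%PDF-"                            # PDF header


def sniff_container(header: bytes) -> str:
    """Return a coarse container type for the first bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(PDF_MAGIC):
        return "pdf"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of the file and classify them."""
    with open(path, "rb") as fh:
        return sniff_container(fh.read(8))
```

A result of "ole2" only says that the file is a binary compound document. A real validator would still need to open the container to decide whether the specific format version is supported.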

Fixes

  • Codebase Refactor & Quality Improvements: Cleaned up legacy code paths, removed redundant functions, and standardized naming conventions to increase maintainability and readability across the sanitization modules.

  • Optimized XML Traversal: Replaced manual XML for-loop iterations with lxml XPath queries for faster attribute lookups and more accurate node selection, reducing CPU overhead during large-scale document processing.

  • OpenDocument Sanitization Corrections: Fixed incorrect regex patterns that previously led to improper sanitization of ODS/ODT files; updated the sanitization logic to reliably remove unwanted tags, metadata, and embedded objects.

  • Hyperlink Relationship Preservation: Resolved a bug where hyperlinks in Office documents were unset after sanitization. Valid relationships now remain intact, and only unsafe or broken links are removed.
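
The XML traversal change can be pictured with a small comparison between a manual loop and a path query. This is a sketch under assumptions, not the product's code: it uses the standard library's `xml.etree.ElementTree` and its limited XPath subset so that it runs without extra dependencies, whereas the release itself uses lxml. The sample document and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = """<doc>
  <p><a href="http://example.com">x</a></p>
  <section><a name="anchor">y</a><a href="mailto:a@example.com">z</a></section>
</doc>"""


def hrefs_manual(root):
    """Walk every element and test the tag and attribute by hand."""
    found = []
    for elem in root.iter():
        if elem.tag == "a" and "href" in elem.attrib:
            found.append(elem.attrib["href"])
    return found


def hrefs_xpath(root):
    """Let a path expression select <a> elements that carry an href."""
    return [a.get("href") for a in root.findall(".//a[@href]")]


root = ET.fromstring(SAMPLE)
print(hrefs_manual(root))
print(hrefs_xpath(root))
```

Both functions return the same values in document order. The path version states the selection in a single expression. With lxml, the `xpath()` method accepts fuller XPath 1.0 expressions, for example selecting the attribute values directly with `//a/@href`.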

Changes

  • Dependency Upgrade: Bumped the icalendar library from version 4.0.9 to 6.0.0 to leverage improved recurrence rule parsing, timezone handling, and overall performance optimizations in calendar file processing.
