DocWire SDK
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing possible for security and confidentiality
docwire::content_type Namespace Reference

Provides a multi-stage pipeline for content type detection.

Namespaces

 by_file_extension
 
 by_signature
 Provides content type detection based on file signatures (magic bytes).
 

Classes

class  detector
 Content type detection chain element.
 

Functions

DOCWIRE_CONTENT_TYPE_EXPORT void detect (data_source &data, const by_signature::database &signatures_db_to_use=by_signature::database{})
 

Detailed Description

Provides a multi-stage pipeline for content type detection.

Architecture

The content type detection in DocWire uses a multi-stage pipeline to ensure accuracy and performance:

  1. By File Extension: Fast lookup using a comprehensive dictionary. Note that we intentionally keep multiple MIME type aliases for a single extension (e.g., .xml maps to both text/xml and application/xml) so users can query any valid historical variant.
  2. By Signature (libmagic): Reads magic bytes.
  3. Heuristic Fallbacks: Custom detectors (e.g., HTML, OOXML, Images) that correct limitations in signature detection.
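
The three stages can be pictured with a minimal, self-contained sketch. Everything in the `sketch` namespace below (the table contents, the function names, and the way candidates are merged) is a hypothetical stand-in chosen for illustration, not DocWire's actual implementation or API. It only shows the order of the stages and how a single extension can carry several MIME aliases.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

namespace sketch {  // hypothetical names, not part of DocWire

using mime_list = std::vector<std::string>;

inline std::string to_lower(std::string_view text)
{
    std::string out(text);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}

// Stage 1: dictionary lookup; one extension may map to several MIME aliases.
inline mime_list by_extension(std::string_view file_name)
{
    static const std::unordered_map<std::string, mime_list> table = {
        {".xml", {"text/xml", "application/xml"}},
        {".zip", {"application/zip"}},
        {".html", {"text/html"}},
    };
    const auto dot = file_name.rfind('.');
    if (dot == std::string_view::npos)
        return {};
    const auto it = table.find(to_lower(file_name.substr(dot)));
    return it == table.end() ? mime_list{} : it->second;
}

// Stage 2: compare the leading magic bytes against known signatures.
inline mime_list by_signature(std::string_view head)
{
    if (head.substr(0, 4) == std::string_view("PK\x03\x04", 4))
        return {"application/zip"};
    if (head.substr(0, 5) == "%PDF-")
        return {"application/pdf"};
    return {};
}

// Stage 3: a heuristic for content that has no fixed magic bytes.
inline mime_list by_html_heuristic(std::string_view head)
{
    const std::string lower = to_lower(head.substr(0, 256));
    if (lower.find("<!doctype html") != std::string::npos ||
        lower.find("<html") != std::string::npos)
        return {"text/html"};
    return {};
}

// Run the stages in order and keep each distinct candidate once.
inline mime_list detect(std::string_view file_name, std::string_view head)
{
    mime_list result;
    for (const auto& stage : {by_extension(file_name), by_signature(head),
                              by_html_heuristic(head)})
        for (const auto& type : stage)
            if (std::find(result.begin(), result.end(), type) == result.end())
                result.push_back(type);
    return result;
}

}  // namespace sketch
```

In this sketch, `sketch::by_extension("report.XML")` yields both `text/xml` and `application/xml`, which mirrors the alias policy described in stage 1. How DocWire itself weighs or merges the results of the individual stages is not shown here.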

Stream Processing & Performance

When processing non-seekable streams (such as network sockets), signature detection cannot seek to the end of the file, where a ZIP archive keeps its central directory. As a result, ZIP-based formats (DOCX, ODT) are detected only generically as application/zip. Heuristic detectors correct this.

Rule: Heuristic detectors must prioritize performance. They read only a small initial buffer (e.g., 4 KB) and check it for local file headers before falling back to deep inspection (such as ZIP parsing), so that massive files are not downloaded into memory.

Function Documentation

◆ detect()

DOCWIRE_CONTENT_TYPE_EXPORT void docwire::content_type::detect ( data_source & data,
const by_signature::database & signatures_db_to_use = by_signature::database{}
)

Detects and assigns content types to the provided data source using various detection strategies.

This function attempts to identify the content type of the data by using the following detection methods:

  • By file extension
  • By file signature
  • Image content detection
  • ODF and OOXML format detection
  • ASP content detection
  • HTML content detection
  • iWork content detection
  • ODF Flat format detection
  • Outlook format detection
  • XLSB format detection
Parameters
data: The data source to be analyzed for content type detection.
signatures_db_to_use: The loaded database of signatures used for signature-based content detection. It will be created (and loaded) if not provided.
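
One consequence of the defaulted parameter is plain C++ behaviour: a default argument is evaluated on every call that omits it. The sketch below uses hypothetical stand-in types (`sketch::database`, `sketch::source`, `sketch::detect`) that only mimic the shape of the documented signature. They are not DocWire classes, and the sketch assumes loading is not cached anywhere else.

```cpp
#include <string>

namespace sketch {  // hypothetical stand-ins, not DocWire types

// Stand-in for a signature database whose construction loads data.
struct database
{
    static inline int loads = 0;  // counts how many databases were loaded
    database() { ++loads; }
};

// Stand-in for a data source that receives a detected type.
struct source
{
    std::string name;
    std::string type;
};

// Same shape as the documented signature: omitting the second argument
// constructs (and loads) a temporary database for that call only.
inline void detect(source& data, const database& db = database{})
{
    (void)db;                                // a real detector would query it
    data.type = "application/octet-stream";  // placeholder result
}

}  // namespace sketch
```

Under these assumptions, calling `sketch::detect(a)` and then `sketch::detect(b)` loads two databases, whereas constructing one `sketch::database` up front and passing it to both calls loads only one. When many sources are processed in a loop, passing a preloaded database is therefore the cheaper pattern.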
See also
performing file type detection example
content_type::detector
content_type::by_signature::database
content_type::by_file_extension::detect
content_type::by_signature::detect
content_type::image::detect
content_type::odf_ooxml::detect
content_type::asp::detect
content_type::html::detect
content_type::iwork::detect
content_type::odf_flat::detect
content_type::outlook::detect
content_type::xlsb::detect
Examples
file_type_determination.cpp.