DocWire SDK
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, and document analysis. Offline processing is possible for security and confidentiality.
Provides a multi-stage pipeline for content type detection.
Namespaces | |
| by_file_extension | |
| by_signature | |
| Provides content type detection based on file signatures (magic bytes). | |
Classes | |
| class | detector |
| Content type detection chain element. | |
Functions | |
| DOCWIRE_CONTENT_TYPE_EXPORT void | detect (data_source &data, const by_signature::database &signatures_db_to_use=by_signature::database{}) |
The content type detection in DocWire uses a multi-stage pipeline to ensure accuracy and performance:
.xml maps to both text/xml and application/xml) so users can query any valid historical variant.
When processing non-seekable streams (such as network sockets), signature detection cannot seek to the end of the file, where the ZIP central directory is stored. As a result, ZIP-based formats (DOCX, ODT) are detected only generically as application/zip. Heuristic detectors are used to fix this.
Rule: Heuristic detectors must prioritize performance by reading only a small initial buffer (e.g., 4 KB) to check for local file headers before falling back to deep inspection (such as ZIP parsing). This prevents massive files from being downloaded into memory.
DOCWIRE_CONTENT_TYPE_EXPORT void docwire::content_type::detect(data_source & data, const by_signature::database & signatures_db_to_use = by_signature::database{})
Detects and assigns content types to the provided data source using various detection strategies.
This function attempts to identify the content type of the data by using the following detection methods:
| data | The data source to be analyzed for content type detection. |
| signatures_db_to_use | The loaded database of signatures used for signature-based content detection. It will be created (and loaded) if not provided. |
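The default argument has a practical consequence: in C++, a default argument is evaluated on every call that omits it. The sketch below illustrates this with stand-in types that only mirror the shape of the signature above. The `sketch` namespace, the load counter, and both helper functions are assumptions for illustration and are not DocWire code; whether constructing the real `by_signature::database` is costly is not stated here beyond the note that it is created and loaded when not provided.

```cpp
#include <string>
#include <vector>

namespace sketch {

// Stand-in for by_signature::database; counts how often it is constructed.
struct database {
    static inline int loads = 0;
    database() { ++loads; } // imagine signature loading happening here
};

// Stand-in for data_source.
struct data_source {
    std::string content_type;
};

// Same parameter shape as the documented detect(); the body is a placeholder.
inline void detect(data_source& data, const database& signatures_db_to_use = database{})
{
    (void)signatures_db_to_use;
    data.content_type = "application/octet-stream";
}

// Omitting the argument constructs a temporary database for every call.
inline int detect_all_with_default(std::vector<data_source>& sources)
{
    const int before = database::loads;
    for (auto& source : sources)
        detect(source);
    return database::loads - before;
}

// Passing one pre-built database reuses it across all calls.
inline int detect_all_with_shared(std::vector<data_source>& sources)
{
    const int before = database::loads;
    const database shared_db;
    for (auto& source : sources)
        detect(source, shared_db);
    return database::loads - before;
}

} // namespace sketch
```

With three sources, the first helper constructs three databases and the second constructs one, which is why passing a pre-loaded database may be worthwhile when many data sources are processed in a loop.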