DocWire SDK
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing possible for security and confidentiality
docwire::data_source Class Reference

#include <data_source.h>

Public Member Functions

template<data_source_compatible_type T>
 data_source (const T &source)
 Constructs a data_source from a compatible type.
 
template<data_source_compatible_type T>
 data_source (T &&source)
 Constructs a data_source by moving from a compatible type.
 
template<data_source_compatible_type T>
 data_source (const T &source, file_extension file_extension)
 Constructs a data_source with an explicit file extension.
 
template<data_source_compatible_type T>
 data_source (T &&source, file_extension file_extension)
 Constructs a data_source by moving, with an explicit file extension.
 
template<data_source_compatible_type T>
 data_source (const T &source, mime_type mime_type, confidence mime_type_confidence)
 Constructs a data_source with an initial MIME type and confidence.
 
template<data_source_compatible_type T>
 data_source (T &&source, mime_type mime_type, confidence mime_type_confidence)
 Constructs a data_source by moving, with an initial MIME type and confidence.
 
std::span< const std::byte > span (std::optional< length_limit > limit=std::nullopt) const
 Returns the content as a span of bytes.
 
std::string string (std::optional< length_limit > limit=std::nullopt) const
 Returns the content as a string.
 
std::string_view string_view (std::optional< length_limit > limit=std::nullopt) const
 Returns the content as a string_view.
 
std::shared_ptr< std::istream > istream () const
 Returns an input stream for reading the data.
 
std::optional< std::filesystem::path > path () const
 Returns the file path if the source is a file, otherwise std::nullopt.
 
std::optional< docwire::file_extension > file_extension () const
 Returns the file extension if available.
 
unique_identifier id () const
 Returns the unique identifier for this data source.
 
std::optional< std::pair< mime_type, confidence > > highest_confidence_mime_type_info () const
 Returns the MIME type with the highest confidence and its confidence level.
 
std::optional< mime_type > highest_confidence_mime_type () const
 Returns the MIME type with the highest confidence.
 
confidence highest_mime_type_confidence () const
 Returns the highest confidence level found among detected MIME types.
 
bool has_highest_confidence_mime_type_in (const std::vector< mime_type > &mts) const
 Checks whether the highest-confidence MIME type is present in the given list.
 
void assert_not_encrypted () const
 Asserts that the data source is not encrypted.
 
confidence mime_type_confidence (mime_type mt) const
 Returns the confidence level for a specific MIME type.
 
void add_mime_type (mime_type mt, confidence c)
 Adds a MIME type with a confidence level.
 

Public Attributes

std::unordered_map< mime_type, confidence > mime_types
 Map of detected MIME types and their confidence levels.
 

Detailed Description

This class represents a binary data source for data processing. It can be initialized with a file path, a memory buffer, an input stream, or another data source; all popular C++ data sources are supported. Document parsers and third-party libraries need to access the data in the form their implementation requires (a memory buffer, a file path, a stream, or a range), and that requirement cannot be changed. Sometimes one access method is faster than another, so a parser needs to know the state of the data source in order to choose. Converting data from one storage form to another should be possible in every combination, but it is performed lazily (only when required) and cached inside the class; for example, a file should be read into memory only once. Performance is very important: for example, a memory buffer passed to the class should not be duplicated.
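The lazy, cached conversion described above can be illustrated with a minimal standard-library sketch. It is not DocWire's implementation: `lazy_source`, `view()` and `file_reads()` are hypothetical names, and only two storage forms (a file path and an in-memory string) are modeled.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <variant>

// Hypothetical illustration only; this is not docwire::data_source.
class lazy_source
{
public:
    explicit lazy_source(std::filesystem::path p) : m_origin(std::move(p)) {}
    explicit lazy_source(std::string bytes) : m_origin(std::move(bytes)) {}

    // Returns the path only when the source was created from a file.
    std::optional<std::filesystem::path> path() const
    {
        if (const auto* p = std::get_if<std::filesystem::path>(&m_origin))
            return *p;
        return std::nullopt;
    }

    // Returns a view of the content. An in-memory source is viewed in place
    // without copying; a file is read on first use and cached afterwards.
    std::string_view view() const
    {
        if (const auto* s = std::get_if<std::string>(&m_origin))
            return *s;
        if (!m_cache)
        {
            std::ifstream in(std::get<std::filesystem::path>(m_origin), std::ios::binary);
            m_cache.emplace(std::istreambuf_iterator<char>(in),
                            std::istreambuf_iterator<char>());
            ++m_file_reads;
        }
        return *m_cache;
    }

    // Number of times the file was actually read (for demonstration).
    int file_reads() const { return m_file_reads; }

private:
    std::variant<std::filesystem::path, std::string> m_origin;
    mutable std::optional<std::string> m_cache;
    mutable int m_file_reads = 0;
};
```

Calling `view()` twice on a file-backed `lazy_source` reads the file once, and a string-backed instance never touches the file system. The real class supports more storage forms and conversions than this sketch shows.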

Examples
local_embedding_similarity.cpp.

Definition at line 127 of file data_source.h.

Constructor & Destructor Documentation

◆ data_source() [1/6]

template<data_source_compatible_type T>
docwire::data_source::data_source ( const T &  source)
inline explicit

Constructs a data_source from a compatible type.

Parameters
source  The data source (e.g., path, string, vector<byte>).

Definition at line 135 of file data_source.h.

◆ data_source() [2/6]

template<data_source_compatible_type T>
docwire::data_source::data_source ( T &&  source)
inline explicit

Constructs a data_source by moving from a compatible type.

Parameters
source  The data source to move from.

Definition at line 144 of file data_source.h.

◆ data_source() [3/6]

template<data_source_compatible_type T>
docwire::data_source::data_source ( const T &  source,
file_extension  file_extension 
)
inline explicit

Constructs a data_source with an explicit file extension.

Parameters
source  The data source.
file_extension  The file extension to associate with the data.

Definition at line 154 of file data_source.h.

◆ data_source() [4/6]

template<data_source_compatible_type T>
docwire::data_source::data_source ( T &&  source,
file_extension  file_extension 
)
inline explicit

Constructs a data_source by moving, with an explicit file extension.

Parameters
source  The data source to move from.
file_extension  The file extension to associate with the data.

Definition at line 164 of file data_source.h.

◆ data_source() [5/6]

template<data_source_compatible_type T>
docwire::data_source::data_source ( const T &  source,
mime_type  mime_type,
confidence  mime_type_confidence 
)
inline explicit

Constructs a data_source with an initial MIME type and confidence.

Parameters
source  The data source.
mime_type  The initial MIME type.
mime_type_confidence  The confidence level for the initial MIME type.

Definition at line 175 of file data_source.h.

◆ data_source() [6/6]

template<data_source_compatible_type T>
docwire::data_source::data_source ( T &&  source,
mime_type  mime_type,
confidence  mime_type_confidence 
)
inline explicit

Constructs a data_source by moving, with an initial MIME type and confidence.

Parameters
source  The data source to move from.
mime_type  The initial MIME type.
mime_type_confidence  The confidence level for the initial MIME type.

Definition at line 188 of file data_source.h.

Member Function Documentation

◆ add_mime_type()

void docwire::data_source::add_mime_type ( mime_type  mt,
confidence  c 
)
inline

Adds a MIME type with a confidence level.

Parameters
mt  The MIME type to add.
c  The confidence level.

Definition at line 300 of file data_source.h.

◆ has_highest_confidence_mime_type_in()

bool docwire::data_source::has_highest_confidence_mime_type_in ( const std::vector< mime_type > &  mts) const

Checks whether the highest-confidence MIME type is present in the given list.

Parameters
mts  The list of MIME types to check against.

◆ highest_confidence_mime_type_info()

std::optional<std::pair<mime_type, confidence> > docwire::data_source::highest_confidence_mime_type_info ( ) const
inline

Returns the MIME type with the highest confidence and its confidence level.

Note
Because a file extension can map to multiple valid MIME type aliases with the same confidence level, this method uses a deterministic alphabetical tie-breaker (e.g., application/xml wins over text/xml) to guarantee consistent cross-platform behavior.
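The tie-breaking rule in the note can be illustrated with a small standard-library sketch. It is not the library's code: `highest_confidence` is a hypothetical free function, MIME types are modeled as plain `std::string` values, and confidence levels as `int`. The point it shows is that iteration order of an unordered map is unspecified, so equal confidences need an explicit ordering to give the same answer on every platform.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins: std::string for mime_type, int for confidence.
std::optional<std::pair<std::string, int>>
highest_confidence(const std::unordered_map<std::string, int>& mime_types)
{
    std::optional<std::pair<std::string, int>> best;
    for (const auto& [type, conf] : mime_types)
    {
        // A candidate wins on strictly higher confidence, or on equal
        // confidence when its name sorts alphabetically first.
        const bool better = !best
            || conf > best->second
            || (conf == best->second && type < best->first);
        if (better)
            best = std::pair{type, conf};
    }
    return best;
}
```

Given `text/xml` and `application/xml` at the same confidence, this sketch returns `application/xml` regardless of the map's internal ordering, and it returns `std::nullopt` for an empty map.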

Definition at line 240 of file data_source.h.

◆ span()

std::span<const std::byte> docwire::data_source::span ( std::optional< length_limit >  limit = std::nullopt) const

Returns the content as a span of bytes.

Parameters
limit  Optional limit on the number of bytes to return.
Returns
A span over the data.

◆ string()

std::string docwire::data_source::string ( std::optional< length_limit >  limit = std::nullopt) const

Returns the content as a string.

Parameters
limit  Optional limit on the number of characters to return.
Returns
A string containing the data.

◆ string_view()

std::string_view docwire::data_source::string_view ( std::optional< length_limit >  limit = std::nullopt) const

Returns the content as a string_view.

This method avoids memory allocation if the underlying source is already in memory (e.g. string, vector<byte>). If the source is a stream or file, it may load data.

Parameters
limit  Optional limit on the number of characters to return.
