|
DocWire SDK
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing possible for security and confidentiality
|
The main namespace for the DocWire SDK. More...
Namespaces | |
| attributes | |
| Contains definitions for document attributes and metadata. | |
| chaining | |
| Provides functionality for chaining function calls and value transformations. | |
| content_type | |
| Provides a multi-stage pipeline for content type detection. | |
| convert | |
| Namespace for type conversion utilities. | |
| errors | |
| Provides features for reporting and handling errors with context data using nested exceptions. | |
| invocation_concepts | |
| Provides concepts for working with invocable objects and pushable containers. | |
| invocation_traits | |
| Provides traits classes for working with callable objects. | |
| log | |
| Provides a modern, high-performance, and structured logging framework. | |
| named | |
| Utilities for named parameters. | |
| serialization | |
| Provides a generic, concept-based serialization framework. | |
| tuple_utils | |
| Provides transformations for tuples. | |
| xml | |
| XML processing utilities. | |
Classes | |
| class | archives_parser |
| class | chain_element |
| class | charset_converter |
| class | checked |
| A generic wrapper for dereferenceable types (like pointers and optionals) that provides checked access based on a safety_policy. More... | |
| class | common_xml_document_parser |
| Base class for XML-based document parsers (ODF, OOXML, etc.). More... | |
| struct | is_variant_alternative_trait |
| struct | is_variant_alternative_trait< T, std::variant< Us... > > |
| class | csv_exporter |
| Exports data to CSV format. More... | |
| class | csv_writer |
| struct | seekable_stream_ptr |
| Wrapper for a shared pointer to a seekable input stream. More... | |
| struct | unseekable_stream_ptr |
| Wrapper for a shared pointer to an unseekable input stream. More... | |
| struct | length_limit |
| Wrapper for a length limit value. More... | |
| struct | mime_type |
| Wrapper for a MIME type string. More... | |
| struct | overloaded |
| A helper for creating a visitor from a set of lambdas, used for visiting std::variant. More... | |
| class | data_source |
| class | data_stream |
| class | file_stream |
| class | buffer_stream |
| class | doc_parser |
| class | eml_parser |
| class | ensure |
| A utility for creating expressive, exception-throwing assertions in a fluent style. More... | |
| class | file_extension |
| A class representing a file extension. More... | |
| class | html_exporter |
| Exports data to HTML format. More... | |
| class | html_parser |
| class | html_writer |
| class | input_chain_element |
| class | iwork_parser |
| class | lru_memory_cache |
| Least Recently Used (LRU) cache with fixed memory size. More... | |
| class | mail_parser |
| class | imemorystreambuf |
| imemorystreambuf is a stream buffer that wraps std::span<const std::byte> and provides a compatible interface with std::istream. More... | |
| class | imemorystream |
| imemorystream is a stream that wraps a std::span of const std::byte and provides a compatible interface with std::istream. More... | |
| struct | message |
| struct | message_base |
| struct | message_callbacks |
| struct | message_counters |
| class | metadata_exporter |
| Exports only metadata to plain text format. More... | |
| class | metadata_writer |
| struct | guaranteed_t |
| A tag to indicate that a pointer is guaranteed to be non-null, bypassing the runtime check in not_null's constructor. More... | |
| class | not_null |
| A wrapper for pointer-like types that enforces a non-null invariant. More... | |
| struct | ocr_confidence_threshold |
| struct | ocr_data_path |
| struct | ocr_timeout |
| class | ocr_parser |
| class | odf_ooxml_parser |
| A parser for ODF and OOXML document formats. More... | |
| class | odfxml_parser |
| A parser for flat ODF XML documents. More... | |
| class | office_formats_parser |
| A composite parser handling various office document formats. More... | |
| class | output_chain_element |
| The output_chain_element class saves data from a parsing chain to an output stream. More... | |
| class | parsing_chain |
| class | pdf_parser |
| struct | pimpl_impl |
| class | with_pimpl_base |
| struct | pimpl_impl_base |
| class | with_pimpl_owner |
| class | with_pimpl |
| struct | eol_sequence |
| struct | link_formatter |
| class | plain_text_exporter |
| Exports data to plain text format. More... | |
| class | plain_text_writer |
| struct | default_file_name |
| class | ppt_parser |
| class | pst_parser |
| struct | unlimited_t |
| A marker type to signify an unlimited bound in a ranged type. More... | |
| class | ranged |
| A wrapper for numeric types that enforces a range [Min, Max]. More... | |
| class | ref_or_owned |
| A utility class that simplifies declaring function attributes that need to be stored without requiring the user to create a shared pointer. More... | |
| class | rtf_parser |
| struct | sentinel |
| A sentinel type used to define the end of a range or view. More... | |
| struct | basic_source_location |
| A fallback implementation of source_location for compilers that do not support std::source_location. More... | |
| class | standard_filter |
| A set of standard filters for use in parsers. More... | |
| struct | stringifier |
| Primary template for the stringifier. More... | |
| struct | stringifier< T > |
| Specialization for types with a string() method. More... | |
| struct | stringifier< const char * > |
| struct | stringifier< std::exception_ptr > |
| Specialization for std::exception_ptr. More... | |
| struct | stringifier< std::pair< T1, T2 > > |
| struct | stringifier< named::value< T > > |
| Specialization for docwire::named::value, providing a "name: value" string representation. More... | |
| struct | stringifier< std::string > |
| struct | stringifier< serialization::object > |
| struct | stringifier< serialization::value > |
| class | thread_safe_ole_storage |
| class | thread_safe_ole_stream_reader |
| class | transformer_func |
| Wraps a single function (tag_transform_func) in a chain_element object. More... | |
| struct | parse_paragraphs |
| struct | parse_lines |
| class | txt_parser |
| class | unique_identifier |
| Represents an identifier of an object that is unique within a single program run. More... | |
| class | text_element |
| class | writer |
| The writer class writes data from callbacks to an output stream. More... | |
| class | xls_parser |
| class | xlsb_parser |
| class | xml_fixer |
| class | xml_parser |
| A parser for generic XML documents. More... | |
| class | zip_reader |
Typedefs | |
| using | message_sequence_streamer = std::function< continuation(const message_callbacks &)> |
| using | message_ptr = std::shared_ptr< message_base > |
| typedef std::vector< std::string > | svector |
| template<auto Min, typename T , safety_policy safety_level = default_safety_level> | |
| using | at_least = ranged< Min, unlimited, T, safety_level > |
| template<auto Max, typename T , safety_policy safety_level = default_safety_level> | |
| using | at_most = ranged< unlimited, Max, T, safety_level > |
| template<auto Value, typename T , safety_policy safety_level = default_safety_level> | |
| using | exactly = ranged< Value, Value, T, safety_level > |
| template<typename T , safety_policy safety_level = default_safety_level> | |
| using | non_negative = at_least< 0, T, safety_level > |
| using | source_location = basic_source_location |
| A type that describes a location in source code. Uses a custom fallback implementation. | |
| using | message_transform_func = std::function< continuation(message_ptr, const message_callbacks &emit_message)> |
Enumerations | |
| enum | XmlParseMode { PARSE_XML , FIX_XML , STRIP_XML } |
| enum class | confidence { none , low , medium , high , very_high , highest } |
| Represents the confidence level of a detected MIME type. | |
| enum class | Language { afr , amh , ara , asm_ , aze , aze_cyrl , bel , ben , bod , bos , bre , bul , cat , ceb , ces , chi_sim , chi_tra , chr , cos , cym , dan , deu , div , dzo , ell , eng , enm , epo , equ , est , eus , fao , fas , fil , fin , fra , frk , frm , fry , gla , gle , glg , grc , guj , hat , heb , hin , hrv , hun , hye , iku , ind , isl , ita , ita_old , jav , jpn , kan , kat , kat_old , kaz , khm , kir , kmr , kor , lao , lat , lav , lit , ltz , mal , mar , mkd , mlt , mon , mri , msa , mya , nep , nld , nor , oci , ori , pan , pol , por , pus , que , ron , rus , san , sin , slk , slv , snd , spa , spa_old , sqi , srp , srp_latn , sun , swa , swe , syr , tam , tat , tel , tgk , tha , tir , ton , tur , uig , ukr , urd , uzb , uzb_cyrl , vie , yid , yor } |
| enum class | continuation { proceed , skip , stop } |
| enum class | safety_policy { strict , relaxed } |
| Defines the safety policy for operations. More... | |
Functions | |
| DOCWIRE_CORE_EXPORT double | cosine_similarity (const std::vector< double > &a, const std::vector< double > &b) |
| Calculates the cosine similarity between two vectors. More... | |
| template<class... Ts> | |
| overloaded (Ts...) -> overloaded< Ts... > | |
| template<typename... Context> | |
| void | debug_assert (detail::with_source_location< bool > condition, Context &&... context) |
| Asserts a condition in debug builds. More... | |
| template<safety_policy safety_level = default_safety_level, typename... Context> | |
| void | enforce (detail::with_source_location< bool > condition, Context &&... context) |
| Enforces a condition based on the safety policy. More... | |
| template<typename T > | |
| ensure (const T &, const docwire::source_location &) -> ensure< T > | |
| Deduction guide for the ensure class template. More... | |
| DOCWIRE_CORE_EXPORT size_t | decode_html_entities_utf8 (char *dest, const char *src) |
| parsing_chain | operator| (ref_or_owned< data_source > data, ref_or_owned< chain_element > chain_element) |
| parsing_chain | operator| (ref_or_owned< std::istream > stream, ref_or_owned< chain_element > chain_element) |
| template<data_source_compatible_type_ref_qualified T> | |
| parsing_chain | operator| (T &&v, ref_or_owned< chain_element > chain_element) |
| bool | is_framing_message (const message_base &msg) |
| message_callbacks | make_counted_message_callbacks (const message_callbacks &original, message_counters &counters) |
| DOCWIRE_CORE_EXPORT std::string | formatTable (std::vector< svector > &mcols) |
| DOCWIRE_CORE_EXPORT std::string | formatUrl (const std::string &mlink_url, const std::string &mlink_text) |
| DOCWIRE_CORE_EXPORT std::string | formatList (std::vector< std::string > &mlist) |
| DOCWIRE_CORE_EXPORT std::string | formatNumberedList (std::vector< std::string > &mlist) |
| DOCWIRE_CORE_EXPORT std::string | ustring_to_string (const UString &s) |
| DOCWIRE_CORE_EXPORT UString | utf8_to_ustring (const std::string &src) |
| DOCWIRE_CORE_EXPORT std::string | unichar_to_utf8 (unsigned int unichar) |
| bool | utf16_unichar_has_4_bytes (unsigned int ch) |
| DOCWIRE_CORE_EXPORT bool | is_encrypted_with_ms_offcrypto (const data_source &data) |
| DOCWIRE_CORE_EXPORT tm * | thread_safe_gmtime (const time_t *timer, struct tm &time_buffer) |
| template<typename Ptr > | |
| not_null< std::remove_cvref_t< Ptr > > | assume_not_null (Ptr &&ptr) |
| Wraps a pointer-like object in a not_null, bypassing the runtime check. More... | |
| void | parse_oshared_summary_info (thread_safe_ole_storage &storage, attributes::metadata &meta, const std::function< void(std::exception_ptr)> &non_fatal_error_handler) |
| void | parse_oshared_document_summary_info (thread_safe_ole_storage &storage, int &slide_count) |
| std::string | get_codepage_from_document_summary_info (thread_safe_ole_storage &storage) |
| parsing_chain | operator| (ref_or_owned< chain_element > element, ref_or_owned< std::ostream > stream) |
| parsing_chain & | operator|= (parsing_chain &chain, ref_or_owned< std::ostream > stream) |
| parsing_chain | operator| (ref_or_owned< chain_element > element, ref_or_owned< std::vector< message_ptr >> vector) |
| parsing_chain & | operator|= (parsing_chain &chain, ref_or_owned< std::vector< message_ptr >> vector) |
| DOCWIRE_CORE_EXPORT parsing_chain | operator| (ref_or_owned< chain_element > lhs, ref_or_owned< chain_element > rhs) |
| parsing_chain & | operator|= (parsing_chain &lhs, ref_or_owned< chain_element > rhs) |
| template<typename T > | |
| std::string | stringify (const T &value) |
| template<streamable T> | |
| requires (!string_method_equipped< T >) struct stringifier< T > | |
| Specialization for types that are streamable to std::ostream. | |
| template<typename T > | |
| requires (!string_method_equipped< T > &&!streamable< T > &&!strong_type_alias< T >) struct stringifier< T > | |
| Default stringifier for types not covered by more specific specializations. More... | |
| template<strong_type_alias T> | |
| requires (!std::is_same_v< T, serialization::object > &&!std::is_same_v< T, serialization::array >) struct stringifier< T > | |
| template<typename T > | |
| requires (std::is_convertible_v< T, message_transform_func > &&!std::is_base_of_v< chain_element, std::remove_cvref_t< T >>) parsing_chain operator|(ref_or_owned< chain_element > element | |
Variables | |
| template<typename T > | |
| concept | container |
| Concept to detect if a type is a container (iterable and not self-recursive). More... | |
| template<typename T > | |
| concept | strong_type_alias = requires(T value) { value.v; } |
| Concept for strong type aliases that wrap a single public member v. | |
| template<typename T > | |
| concept | dereferenceable = requires(const T& t) { *t; !t; } |
| Concept to detect if a type is dereferenceable like a pointer. | |
| template<typename T > | |
| concept | empty = std::is_empty_v<T> |
| Concept for empty structs. | |
| template<typename T > | |
| concept | streamable = requires (std::ostream& os, const T& value) { { os << value } -> std::convertible_to<std::ostream&>; } |
| Concept for types that are streamable to std::ostream. | |
| template<typename T > | |
| concept | string_method_equipped = requires(const T& t) { { t.string() } -> std::convertible_to<std::string_view>; } |
| Concept for types that have a string() member method. | |
| template<typename T > | |
| concept | string_like = std::is_convertible_v<T, std::string_view> |
| Concept for string-like types that can be converted to a string view. | |
| template<typename T , typename Variant > | |
| concept | variant_alternative = is_variant_alternative_trait<T, Variant>::value |
| template<typename T > | |
| concept | data_source_compatible_type |
| Concept matching types that can be used to initialize a data_source. More... | |
| template<typename T > | |
| concept | data_source_compatible_type_ref_qualified = data_source_compatible_type<std::remove_reference_t<T>> |
| Concept matching reference-qualified types compatible with data_source. | |
| template<typename T > | |
| concept | context_tag = std::is_empty_v<T> && requires { { T::string() } -> std::convertible_to<std::string_view>; } |
| template<class T > | |
| concept | IStreamDerived = std::derived_from<T, std::istream> |
| template<typename T > | |
| concept | istream_derived_ref_qualified = IStreamDerived<std::remove_reference_t<T>> |
| constexpr guaranteed_t | guaranteed |
| A constant to use with the unchecked not_null constructor, e.g., not_null(ptr, guaranteed). | |
| template<class T > | |
| concept | OStreamDerived = std::derived_from<T, std::ostream> |
| template<typename T > | |
| concept | ostream_derived_ref_qualified = OStreamDerived<std::remove_reference_t<T>> |
| constexpr unlimited_t | unlimited |
| A convenient instance of the unlimited_t marker. | |
| template<typename U , typename T > | |
| concept | ref_or_owned_compatible = std::is_convertible_v<std::shared_ptr<std::remove_reference_t<U>>, std::shared_ptr<T>> |
| constexpr safety_policy | default_safety_level = strict |
| T | func |
The main namespace for the DocWire SDK.
| enum class docwire::Language | strong |
Definition at line 18 of file language.h.
| enum class docwire::safety_policy | strong |
Defines the safety policy for operations.
| Enumerator | |
|---|---|
| strict | Perform runtime checks and throw exceptions on violations. |
| relaxed | Skip runtime checks for performance; undefined behavior on violations. |
Definition at line 21 of file safety_policy.h.
| not_null<std::remove_cvref_t<Ptr> > docwire::assume_not_null | ( | Ptr && | ptr | ) |
Wraps a pointer-like object in a not_null, bypassing the runtime check.
This should only be used when the pointer is guaranteed to be non-null, for example, when it's the result of a factory like std::make_shared which throws on failure instead of returning null.
Definition at line 102 of file not_null.h.
| DOCWIRE_CORE_EXPORT double docwire::cosine_similarity | ( | const std::vector< double > & | a, |
| const std::vector< double > & | b | ||
| ) |
Calculates the cosine similarity between two vectors.
This function computes the cosine similarity between two double-precision floating-point vectors. The vectors must have the same size.
| a | The first vector. |
| b | The second vector. |
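For reference, cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. The following is a minimal standalone sketch of that formula, not the DocWire implementation; the function name `cosine_similarity_sketch` is hypothetical, and the handling of mismatched sizes and zero vectors is an assumption made for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustrative stand-in, not the DocWire implementation.
// Computes dot(a, b) / (|a| * |b|) for two equally sized vectors.
double cosine_similarity_sketch(const std::vector<double>& a,
                                const std::vector<double>& b)
{
    // Assumption: a size mismatch is reported by throwing.
    if (a.size() != b.size())
        throw std::invalid_argument("vectors must have the same size");

    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }

    // Assumption: a zero vector yields 0.0 instead of dividing by zero.
    if (norm_a == 0.0 || norm_b == 0.0)
        return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Parallel vectors such as {1, 2} and {2, 4} give a value close to 1, and orthogonal vectors such as {1, 0} and {0, 1} give 0.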
| void docwire::debug_assert | ( | detail::with_source_location< bool > | condition, |
| Context &&... | context | ||
| ) |
Asserts a condition in debug builds.
In debug builds (NDEBUG not defined), if the condition is false, the program terminates with a panic message containing the provided context. In release builds, this function does nothing.
| condition | The condition to check. |
| context | Additional context information to log if the assertion fails. |
Definition at line 47 of file debug_assert.h.
| void docwire::enforce | ( | detail::with_source_location< bool > | condition, |
| Context &&... | context | ||
| ) |
Enforces a condition based on the safety policy.
If the condition is false:
- strict mode: Throws a docwire::error exception with the docwire::errors::program_logic tag attached and the provided context.
- relaxed mode: Triggers a debug_assert, which terminates in debug builds but does nothing in release builds.
| safety_level | The safety policy to apply (default is strict). |
| condition | The condition to check. |
| context | Additional context information to include in the error/assertion. |
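The policy-based dispatch described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `safety_policy_sketch` and `enforce_sketch` are hypothetical names, `std::logic_error` stands in for docwire::error, and the standard `assert` macro stands in for debug_assert.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins mirroring the names above; not the DocWire implementation.
enum class safety_policy_sketch { strict, relaxed };

template <safety_policy_sketch level = safety_policy_sketch::strict>
void enforce_sketch(bool condition, const char* context = "")
{
    if (condition)
        return;

    if constexpr (level == safety_policy_sketch::strict)
    {
        // strict: report the violation with an exception.
        // Assumption: std::logic_error stands in for docwire::error.
        throw std::logic_error(context);
    }
    else
    {
        // relaxed: defer to a debug-only assertion.
        // With NDEBUG defined, assert expands to nothing.
        assert(condition && "enforce_sketch: condition violated");
        (void)context;
    }
}
```

Because the policy is a template parameter, the branch is selected at compile time, so the relaxed variant carries no exception-throwing code in release builds.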
| docwire::ensure | ( | const T & | , |
| const docwire::source_location & | |||
| ) | -> ensure< T > |
Deduction guide for the ensure class template.
This allows the compiler to deduce the template argument T from the constructor call, enabling the clean ensure(value) syntax without needing a factory function.
| docwire::requires | ( | !string_method_equipped< T > &&!streamable< T > &&!strong_type_alias< T > | ) |
Default stringifier for types not covered by more specific specializations.
This fallback uses the generic docwire::serialization mechanism:
- The object is first serialized into a docwire::serialization::value.
- The resulting docwire::serialization::value is then converted to a human-readable string.
This ensures that any type that can be serialized can also be stringified in a default way, providing a consistent fallback for complex types.
| T | The type to be stringified. |
Definition at line 77 of file stringification.h.
| bool docwire::utf16_unichar_has_4_bytes | ( | unsigned int | ch | ) | inline |
A UTF-16 character occupies either 2 or 4 bytes. Code points from 0x0000 to 0xFFFF (the Basic Multilingual Plane, BMP, which holds the most common characters) need two bytes. Rarely used characters with code points from 0x10000 to 0x10FFFF need four bytes, encoded as follows:
0x10000 is subtracted from the code point, leaving a 20 bit number in the range 0..0xFFFFF. The top ten bits (a number in the range 0..0x3FF) are added to 0xD800 to give the first code unit or lead surrogate, which will be in the range 0xD800..0xDBFF (previous versions of the Unicode Standard referred to these as high surrogates). The low ten bits (also in the range 0..0x3FF) are added to 0xDC00 to give the second code unit or trail surrogate, which will be in the range 0xDC00..0xDFFF (previous versions of the Unicode Standard referred to these as low surrogates).
Within the BMP, no 16-bit character has 110110 or 110111 as its top six bits; those values are reserved for surrogates.
This function checks whether the first two bytes must be followed by another two bytes.
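The surrogate arithmetic described above can be written out as a short sketch. The helper names `is_lead_surrogate` and `to_surrogate_pair` are hypothetical and exist only for this example; how utf16_unichar_has_4_bytes is actually implemented is not shown in this reference.

```cpp
#include <cstdint>
#include <utility>

// True when a 16-bit code unit is a lead surrogate (top six bits 110110),
// meaning a trail surrogate must follow to complete the character.
constexpr bool is_lead_surrogate(std::uint16_t unit)
{
    return (unit & 0xFC00) == 0xD800;
}

// Splits a supplementary code point (0x10000..0x10FFFF) into a surrogate pair.
constexpr std::pair<std::uint16_t, std::uint16_t> to_surrogate_pair(std::uint32_t code_point)
{
    const std::uint32_t offset = code_point - 0x10000;                          // 20-bit value
    const auto lead = static_cast<std::uint16_t>(0xD800 + (offset >> 10));      // top ten bits
    const auto trail = static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF));   // low ten bits
    return {lead, trail};
}
```

For example, U+1F600 becomes the pair 0xD83D, 0xDE00: the offset is 0xF600, its top ten bits are 0x3D, and its low ten bits are 0x200.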
| concept docwire::container |
Concept to detect if a type is a container (iterable and not self-recursive).
Definition at line 25 of file concepts_container.h.
| concept docwire::data_source_compatible_type |
Concept matching types that can be used to initialize a data_source.
Definition at line 92 of file data_source.h.
| T docwire::func |
Definition at line 59 of file transformer_func.h.