DocWire SDK
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing possible for security and confidentiality
docwire::common_xml_document_parser< safety_level > Class Template Reference

Base class for XML-based document parsers (ODF, OOXML, etc.).

#include <common_xml_document_parser.h>

Inheritance diagram for docwire::common_xml_document_parser< safety_level >:
Classes shown: docwire::chain_element, docwire::with_pimpl< common_xml_document_parser< default_safety_level > >, docwire::with_pimpl< chain_element >, docwire::with_pimpl_base

Classes

struct  comment
 Represents a comment with author, time, and text.
 
struct  relationship
 Represents a relationship, typically for hyperlinks or embedded objects.
 
class  scoped_context_stack_push
 Helper class to manage the context stack scope. Pushes a new context on construction and pops it on destruction.
 
struct  shared_string
 Represents a shared string, a common optimization in OOXML formats.
 
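The push-on-construction, pop-on-destruction behavior described for scoped_context_stack_push is the usual RAII guard idiom. The following is a minimal sketch of that idiom only, not the SDK's implementation: the `context` struct, the `scoped_push` class and the `depth_inside_nested_scope` helper are hypothetical names introduced for illustration, and the real context type is not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the parser's per-scope state.
struct context
{
    std::string tag;
};

// RAII guard: pushes a context on construction and pops it on destruction,
// so the stack is restored even on early return or exception.
class scoped_push
{
public:
    scoped_push(std::vector<context>& stack, context ctx) : m_stack(stack)
    {
        m_stack.push_back(std::move(ctx));
    }
    ~scoped_push() { m_stack.pop_back(); }

    // Copying a guard would pop twice, so it is disabled.
    scoped_push(const scoped_push&) = delete;
    scoped_push& operator=(const scoped_push&) = delete;

private:
    std::vector<context>& m_stack;
};

// Returns the stack depth observed inside two nested scopes.
// Both guards are destroyed before the caller sees the stack again.
std::size_t depth_inside_nested_scope(std::vector<context>& stack)
{
    scoped_push outer(stack, context{"text:list"});
    {
        scoped_push inner(stack, context{"text:list-item"});
        return stack.size();
    }
}
```

Because the pop happens in the destructor, a handler that returns early or throws cannot leave a stale context behind.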

Public Types

enum  ODFOOXMLListStyle { number , bullet }
 Enum for list styles (e.g., numbered or bulleted).
 
typedef std::vector< ODFOOXMLListStyle > ListStyleVector
 Type alias for a vector of list styles.
 
using ListStyleMap = std::map< std::string, common_xml_document_parser< safety_level >::ListStyleVector >
 Type alias for a map of list style names to their definitions.
 
using CommentMap = std::map< int, common_xml_document_parser< safety_level >::comment >
 Type alias for a map of comment IDs to comment objects.
 
using RelationshipMap = std::map< std::string, common_xml_document_parser< safety_level >::relationship >
 Type alias for a map of relationship IDs to relationship objects.
 
using SharedStringVector = std::vector< shared_string >
 Type alias for a vector of shared strings.
 
typedef std::function< void(xml::node_ref< safety_level > &xml_node, XmlParseMode mode, zip_reader *zipfile, std::string &text, bool &children_processed, std::string &level_suffix, bool first_on_level)> CommandHandler
 Defines the function signature for an XML tag command handler.
 

Public Member Functions

void registerODFOOXMLCommandHandler (const std::string &xml_tag, const CommandHandler &handler)
 Registers a handler for a specific XML tag.
 
std::string parseXmlData (xml::children_view< safety_level > xml_nodes, XmlParseMode mode, zip_reader *zipfile)
 Parses XML data from a view of nodes.
 
std::string parseXmlChildren (xml::node_ref< safety_level > &xml_node, XmlParseMode mode, zip_reader *zipfile)
 Parses the children of a given XML node.
 
void extractText (std::string_view xml_contents, XmlParseMode mode, zip_reader *zipfile, std::string &text)
 Extracts text from raw XML content.
 
void parseODFMetadata (std::string_view xml_content, attributes::metadata &metadata) const
 Parses ODF metadata from XML content.
 
const std::string formatComment (const std::string &author, const std::string &time, const std::string &text)
 Formats a comment for output.
 
size_t & getListDepth ()
 Returns the current nesting depth of lists.
 
ListStyleMap & getListStyles ()
 Gets the map of list styles.
 
CommentMap & getComments ()
 Gets the map of comments.
 
RelationshipMap & getRelationships ()
 Gets the map of relationships.
 
SharedStringVector & getSharedStrings ()
 Gets the vector of shared strings.
 
bool disabledText () const
 Checks if text extraction is currently disabled.
 
xml::reader_blanks blanks () const
 Gets the current blank node handling policy.
 
void disableText (bool disable)
 Enables or disables text extraction.
 
void set_blanks (xml::reader_blanks blanks)
 Sets the blank node handling policy for the XML reader.
 
void activeEmittingSignals (bool flag)
 Controls whether signal emission (callbacks) is active.
 
 common_xml_document_parser ()
 Default constructor.
 
- Public Member Functions inherited from docwire::chain_element
 chain_element (chain_element &&)=default
 
chain_element & operator= (chain_element &&)=default
 
virtual continuation operator() (message_ptr msg, const message_callbacks &emit_message)=0
 
virtual bool is_leaf () const =0
 Checks whether the chain element is a leaf (the last element, which does not produce any messages). Currently only exporters are leaves.
 
virtual bool is_generator () const
 

Additional Inherited Members

- Protected Types inherited from docwire::with_pimpl< chain_element >
using impl_type = pimpl_impl< chain_element >
 
- Protected Types inherited from docwire::with_pimpl< common_xml_document_parser< default_safety_level > >
using impl_type = pimpl_impl< common_xml_document_parser< default_safety_level > >
 
- Protected Member Functions inherited from docwire::with_pimpl< chain_element >
impl_type create_impl (Args &&... args)
 
 with_pimpl (Args &&... args)
 
 with_pimpl (with_pimpl< chain_element > &&other) noexcept
 
 with_pimpl (std::nullptr_t)
 
with_pimpl & operator= (with_pimpl &&other) noexcept
 
impl_type impl ()
 
const impl_type impl () const
 
- Protected Member Functions inherited from docwire::with_pimpl< common_xml_document_parser< default_safety_level > >
impl_type create_impl (Args &&... args)
 
 with_pimpl (Args &&... args)
 
 with_pimpl (with_pimpl< common_xml_document_parser< default_safety_level > > &&other) noexcept
 
 with_pimpl (std::nullptr_t)
 
with_pimpl & operator= (with_pimpl &&other) noexcept
 
impl_type impl ()
 
const impl_type impl () const
 

Detailed Description

template<safety_policy safety_level = default_safety_level>
class docwire::common_xml_document_parser< safety_level >

Base class for XML-based document parsers (ODF, OOXML, etc.).

This class is inherited by specific parsers (e.g., odf_ooxml_parser, odfxml_parser). It allows registering handlers for specific XML tags.

Template Parameters
safety_level  The safety policy used for XML parsing operations.
See also
xml::reader
XML parsing example

Definition at line 41 of file common_xml_document_parser.h.

Member Function Documentation

◆ extractText()

template<safety_policy safety_level = default_safety_level>
void docwire::common_xml_document_parser< safety_level >::extractText ( std::string_view  xml_contents,
XmlParseMode  mode,
zip_reader *  zipfile,
std::string &  text 
)

Extracts text from raw XML content.

This is a high-level function that initializes the XML reader and calls parseXmlData.

Parameters
xml_contents  The raw XML string.
mode  The parsing mode.
zipfile  Pointer to the zip_reader, if applicable.
text  Output parameter to which the extracted text is appended.

◆ formatComment()

template<safety_policy safety_level = default_safety_level>
const std::string docwire::common_xml_document_parser< safety_level >::formatComment ( const std::string &  author,
const std::string &  time,
const std::string &  text 
)

Formats a comment for output.

Parameters
author  The author of the comment.
time  The timestamp of the comment.
text  The content of the comment.
Returns
The formatted comment string.

◆ parseODFMetadata()

template<safety_policy safety_level = default_safety_level>
void docwire::common_xml_document_parser< safety_level >::parseODFMetadata ( std::string_view  xml_content,
attributes::metadata &  metadata 
) const

Parses ODF metadata from XML content.

Parameters
xml_content  The raw XML content of the metadata file.
metadata  The structure to populate with parsed metadata.

◆ parseXmlChildren()

template<safety_policy safety_level = default_safety_level>
std::string docwire::common_xml_document_parser< safety_level >::parseXmlChildren ( xml::node_ref< safety_level > &  xml_node,
XmlParseMode  mode,
zip_reader *  zipfile 
)

Parses the children of a given XML node.

Parameters
xml_node  The parent node whose children will be parsed.
mode  The parsing mode.
zipfile  Pointer to the zip_reader, if applicable.
Returns
The extracted text content from the children.

◆ parseXmlData()

template<safety_policy safety_level = default_safety_level>
std::string docwire::common_xml_document_parser< safety_level >::parseXmlData ( xml::children_view< safety_level >  xml_nodes,
XmlParseMode  mode,
zip_reader *  zipfile 
)

Parses XML data from a view of nodes.

Iterates through the provided XML nodes and executes registered command handlers.

Parameters
xml_nodes  The view of XML nodes to parse.
mode  The parsing mode (e.g., PARSE_XML, STRIP_XML).
zipfile  Pointer to the zip_reader if the XML is part of a zipped archive (e.g., DOCX, ODT).
Returns
The extracted text content.
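
The iterate-and-dispatch behavior described above can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the SDK's code: `node`, `handler`, `mini_parser` and `make_demo_parser` are hypothetical names, the handler signature is reduced to three parameters, and the rule that children are visited unless a handler sets `children_processed` is an assumption about how that flag is meant to work.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified node type; the SDK works on xml::node_ref
// and xml::children_view instead.
struct node
{
    std::string name;
    std::string content;
    std::vector<node> children;
};

// Reduced handler signature: appends to text and reports whether the
// handler already dealt with the node's children.
using handler = std::function<void(const node&, std::string& text, bool& children_processed)>;

class mini_parser
{
public:
    void register_handler(const std::string& tag, handler h)
    {
        m_handlers[tag] = std::move(h);
    }

    // Visits the nodes in order, runs the handler registered for each tag,
    // and descends into children unless the handler marked them as processed.
    std::string parse(const std::vector<node>& nodes) const
    {
        std::string text;
        for (const node& n : nodes)
        {
            bool children_processed = false;
            auto it = m_handlers.find(n.name);
            if (it != m_handlers.end())
                it->second(n, text, children_processed);
            if (!children_processed)
                text += parse(n.children);
        }
        return text;
    }

private:
    std::map<std::string, handler> m_handlers;
};

// Builds a parser with three illustrative handlers.
mini_parser make_demo_parser()
{
    mini_parser p;
    p.register_handler("#text", [](const node& n, std::string& text, bool&) { text += n.content; });
    p.register_handler("br", [](const node&, std::string& text, bool&) { text += "\n"; });
    // Suppresses the whole subtree by claiming its children are handled.
    p.register_handler("annotation", [](const node&, std::string&, bool& done) { done = true; });
    return p;
}
```

In this model a tag without a registered handler contributes no text of its own but is still traversed, so text nested inside unknown elements is not lost.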

◆ registerODFOOXMLCommandHandler()

template<safety_policy safety_level = default_safety_level>
void docwire::common_xml_document_parser< safety_level >::registerODFOOXMLCommandHandler ( const std::string &  xml_tag,
const CommandHandler &  handler 
)

Registers a handler for a specific XML tag.

Derived classes can use this to add or override behavior for specific XML tags.

Parameters
xml_tag  The XML tag name to handle.
handler  The function to execute when the tag is encountered.
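
The add-or-override pattern for derived classes might look like the sketch below. It is a standalone model and does not use the SDK headers: `tag_handler`, `base_parser`, `derived_parser`, `register_handler` and `handle` are hypothetical names, the tag names are made up, and the handler takes only the output string, whereas the real CommandHandler also receives the node, the parse mode, the zip_reader and several flags.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical, reduced handler signature used only for this sketch.
using tag_handler = std::function<void(std::string& text)>;

class base_parser
{
public:
    base_parser()
    {
        // Defaults shared by every derived parser.
        register_handler("tab", [](std::string& text) { text += "\t"; });
        register_handler("line-break", [](std::string& text) { text += "\n"; });
    }

    // Registering a tag that already has a handler replaces the previous one.
    void register_handler(const std::string& tag, tag_handler h)
    {
        m_handlers[tag] = std::move(h);
    }

    // Runs the handler for the tag, if any, and returns the text it produced.
    std::string handle(const std::string& tag) const
    {
        std::string text;
        if (auto it = m_handlers.find(tag); it != m_handlers.end())
            it->second(text);
        return text;
    }

private:
    std::map<std::string, tag_handler> m_handlers;
};

// A derived parser overrides one default and adds a format-specific tag.
class derived_parser : public base_parser
{
public:
    derived_parser()
    {
        register_handler("tab", [](std::string& text) { text += "    "; });
        register_handler("page-break", [](std::string& text) { text += "\f"; });
    }
};
```

Keeping the handlers in a map keyed by tag name means a derived class changes behavior by registration alone, without virtual functions per tag.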

The documentation for this class was generated from the following file: