TruncateHTMLParser

This class provides a specialized HTML parser designed to truncate HTML content to a specific length while maintaining valid tag structures. It tracks open elements using a stack to ensure all tags are properly closed when the character limit is reached. The parser handles void elements and character references to produce a clean, truncated string with an optional replacement suffix.

Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| tags | collections.deque | A stack of currently open HTML tags, used to close every open element after truncation. |
| output | list | The processed HTML fragments, which are joined to form the final truncated output. |
| remaining | int | The number of characters left in the truncation quota before the content is cut off. |
| replacement | str | The string appended to the output when truncation occurs, such as an ellipsis. |
| void_elements | set | The HTML tags that are defined as void elements and do not require closing tags. |

Constructor

Signature

def __init__(
    self,
    length: int,
    replacement: str,
    convert_charrefs: bool = True
) -> None

Parameters

| Name | Type | Description |
| --- | --- | --- |
| length | int | The maximum number of characters allowed before truncation occurs. |
| replacement | str | The string to append to the output when truncation happens. |
| convert_charrefs | bool, default True | Whether the parser should convert character references automatically. |

Methods


void_elements()

@classmethod
def void_elements(cls) -> set

Retrieves the set of HTML void elements that do not require a closing tag, such as <br> or <img>.

Returns

| Type | Description |
| --- | --- |
| set | A collection of strings representing HTML tags that are defined as void elements. |
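
The exact contents of the returned set are not listed on this page. As a rough sketch, assuming the class follows the list of void elements in the HTML Living Standard, a membership check could look like the following. `VOID_ELEMENTS` and `needs_closing_tag` are illustrative names and are not part of the class.

```python
# Void elements as listed in the HTML Living Standard.
# Assumption: the real class may use a different or shorter list.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


def needs_closing_tag(tag: str) -> bool:
    """Return True when `tag` must be closed explicitly (hypothetical helper)."""
    return tag.lower() not in VOID_ELEMENTS


print(needs_closing_tag("div"))
print(needs_closing_tag("br"))
```

A frozenset is used here only because the list never changes at runtime; a plain set behaves the same for lookups.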

handle_startendtag()

def handle_startendtag(
    self,
    tag: str,
    attrs: list
) -> None

Processes a self-closing HTML tag by handling it as a start tag and then, unless the tag is a void element, as an end tag as well.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| tag | str | The name of the HTML tag being processed. |
| attrs | list | A list of (name, value) pairs containing the attributes found inside the tag's brackets. |

Returns

| Type | Description |
| --- | --- |
| None | This method does not return a value. |
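
A minimal sketch of the dispatch described for this method, built on the standard library's `html.parser.HTMLParser`. `SelfClosingDemo`, its `events` list, and the abbreviated `VOID` set are hypothetical names used only for illustration; the sketch records events instead of building output.

```python
from html.parser import HTMLParser

# Abbreviated, illustrative set; not the list used by the real class.
VOID = frozenset({"br", "img", "hr"})


class SelfClosingDemo(HTMLParser):
    """Records which handlers fire for self-closing tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Always treat the tag as a start tag first.
        self.handle_starttag(tag, attrs)
        # Only non-void elements also get the end-tag treatment.
        if tag not in VOID:
            self.handle_endtag(tag)


demo = SelfClosingDemo()
demo.feed("<br/><span/>")
demo.close()
print(demo.events)
```

In this sketch `<br/>` produces only a start event, while `<span/>` produces a start event followed by an end event.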

handle_starttag()

def handle_starttag(
    self,
    tag: str,
    attrs: list
) -> None

Appends the raw start tag text to the output and tracks non-void elements in a stack to ensure proper closing later.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| tag | str | The name of the HTML tag being opened. |
| attrs | list | A list of (name, value) pairs containing the attributes found inside the tag's brackets. |

Returns

| Type | Description |
| --- | --- |
| None | This method does not return a value. |
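
How the raw start tag text is obtained is not stated on this page. One plausible approach, assuming the class builds on `html.parser.HTMLParser`, is the parser's `get_starttag_text()` method, which returns the source text of the most recently opened start tag. The sketch below uses hypothetical names (`StartTagDemo`, `VOID`) and assumes new tags are pushed onto the right end of the deque.

```python
from collections import deque
from html.parser import HTMLParser


class StartTagDemo(HTMLParser):
    # Abbreviated, illustrative set; not the list used by the real class.
    VOID = frozenset({"br", "img", "hr"})

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = deque()   # stack of open, non-void tags
        self.output = []      # raw fragments in document order

    def handle_starttag(self, tag, attrs):
        # Keep the tag exactly as written, attributes included.
        self.output.append(self.get_starttag_text())
        # Void elements never need a closing tag, so they are not tracked.
        if tag not in self.VOID:
            self.tags.append(tag)


demo = StartTagDemo()
demo.feed('<p class="lead"><img src="a.png"><em>')
demo.close()
print(demo.output)
print(list(demo.tags))
```

Reusing the raw text avoids rebuilding the tag from `attrs`, which would lose the original quoting and attribute order.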

handle_endtag()

def handle_endtag(
    self,
    tag: str
) -> None

Appends a closing HTML tag to the output and removes the corresponding tag from the tracking stack if it matches the most recent start tag.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| tag | str | The name of the HTML tag being closed. |

Returns

| Type | Description |
| --- | --- |
| None | This method does not return a value. |
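
The stack handling described for this method can be sketched as a standalone function. `close_tag` is a hypothetical helper, and treating the right end of the deque as the top of the stack is an assumption.

```python
from collections import deque


def close_tag(tag: str, open_tags: deque, output: list) -> None:
    """Emit a closing tag and pop the stack only on a match (hypothetical helper)."""
    output.append(f"</{tag}>")
    # A mismatched end tag is still emitted, but the stack is left untouched.
    if open_tags and open_tags[-1] == tag:
        open_tags.pop()


open_tags = deque(["div", "p"])
output = []
close_tag("p", open_tags, output)     # matches the top of the stack
close_tag("span", open_tags, output)  # does not match, stack unchanged
print(list(open_tags), output)
```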

handle_data()

def handle_data(
    self,
    data: str
) -> None

Processes text content and tracks the remaining character budget, raising TruncationCompleted if the length limit is exceeded.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| data | str | The raw text content found between HTML tags to be processed and potentially truncated. |

Returns

| Type | Description |
| --- | --- |
| None | This method does not return a value. |
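
The budget tracking can be sketched as follows. Only the exception name `TruncationCompleted` comes from this page; its definition, the `consume_text` helper, and the choice to keep the part of the text that still fits are assumptions.

```python
class TruncationCompleted(Exception):
    """Signals that the character budget has been used up (assumed definition)."""


def consume_text(data: str, remaining: int, output: list) -> int:
    """Append `data` to `output` and return the budget that is left.

    Raises TruncationCompleted when `data` does not fit in `remaining`.
    """
    if len(data) > remaining:
        # Keep only the characters that still fit, then stop parsing.
        output.append(data[:remaining])
        raise TruncationCompleted
    output.append(data)
    return remaining - len(data)


output = []
left = consume_text("Hello", 8, output)
try:
    consume_text(" world", left, output)
except TruncationCompleted:
    print("truncated after:", "".join(output))
```

Raising an exception is a convenient way to abort the parser from inside a callback, since the caller that started the parse can catch it in one place.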

feed()

def feed(
    self,
    data: str
) -> None

Parses the provided HTML data and ensures all open tags are gracefully closed if truncation occurs during processing.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| data | str | The HTML formatted string to be parsed and truncated. |

Returns

| Type | Description |
| --- | --- |
| None | This method does not return a value. |
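
To show how the pieces could fit together, here is a compact, self-contained sketch of a parser with the documented attributes and handlers. It is not the actual implementation: `SketchTruncator` and `truncate_html` are hypothetical names, and several details are assumptions, including the placement of the replacement text before the closing tags, the re-escaping of text, and the call to `close()` inside `feed()`.

```python
from collections import deque
from html import escape
from html.parser import HTMLParser

VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class TruncationCompleted(Exception):
    """Raised by handle_data once the character budget is exhausted."""


class SketchTruncator(HTMLParser):
    def __init__(self, length, replacement, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.tags = deque()
        self.output = []
        self.remaining = length
        self.replacement = replacement

    def handle_starttag(self, tag, attrs):
        self.output.append(self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        if tag not in VOID_ELEMENTS:
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        self.output.append(f"</{tag}>")
        if self.tags and self.tags[-1] == tag:
            self.tags.pop()

    def handle_data(self, data):
        # Assumption: text arrives unescaped, so it is re-escaped on output
        # while the budget counts the unescaped characters.
        if len(data) > self.remaining:
            self.output.append(escape(data[: self.remaining], quote=False))
            raise TruncationCompleted
        self.output.append(escape(data, quote=False))
        self.remaining -= len(data)

    def feed(self, data):
        try:
            super().feed(data)
            self.close()  # flush trailing text; makes this sketch single-shot
        except TruncationCompleted:
            self.output.append(self.replacement)
            # Close whatever is still open, innermost element first.
            while self.tags:
                self.output.append(f"</{self.tags.pop()}>")


def truncate_html(html, length, replacement="..."):
    parser = SketchTruncator(length, replacement)
    parser.feed(html)
    return "".join(parser.output)


print(truncate_html("<p>Hello <b>world</b> again</p>", 8))
```

Input that fits within the budget passes through without the replacement text, because `TruncationCompleted` is never raised in that case.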