TruncateHTMLParser
This class provides a specialized HTML parser designed to truncate HTML content to a specific length while maintaining valid tag structures. It tracks open elements using a stack to ensure all tags are properly closed when the character limit is reached. The parser handles void elements and character references to produce a clean, truncated string with an optional replacement suffix.
Attributes
| Attribute | Type | Description |
|---|---|---|
| tags | collections.deque | A stack of currently open HTML tags used to ensure proper closing of elements after truncation. |
| output | list | A list of strings representing the processed HTML fragments that will be joined to form the final truncated output. |
| remaining | int | The number of characters remaining in the truncation quota before the content is cut off. |
| replacement | str | The string appended to the output when truncation occurs, such as an ellipsis. |
| void_elements | set | A set of HTML tags that are defined as void elements and do not require closing tags. |
Constructor
Signature
def TruncateHTMLParser(
length: int,
replacement: str,
convert_charrefs: bool = True
)
Parameters
| Name | Type | Description |
|---|---|---|
| length | int | The maximum number of characters allowed before truncation occurs. |
| replacement | str | The string to append to the output when truncation happens. |
| convert_charrefs | bool = True | Whether the parser should convert character references automatically. |
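As a rough illustration of how the constructor parameters map onto the attributes listed above, the following sketch defines a hypothetical stand-in class, `ConstructorSketch`. It is not the real implementation; it only assumes that `length` seeds `remaining` and that the stack and output start empty.

```python
from collections import deque
from html.parser import HTMLParser


class ConstructorSketch(HTMLParser):
    """Hypothetical stand-in that mirrors only the documented constructor."""

    def __init__(self, length, replacement, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.tags = deque()             # stack of currently open tags
        self.output = []                # processed fragments, joined at the end
        self.remaining = length         # character budget still available
        self.replacement = replacement  # suffix appended when truncation happens


parser = ConstructorSketch(length=10, replacement="...")
```

Right after construction, `remaining` equals the requested length and no fragments or tags have been recorded.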
Methods
void_elements()
@classmethod
def void_elements() -> set
Retrieves the set of HTML void elements that do not require a closing tag, such as <br> or <img>.
Returns
| Type | Description |
|---|---|
| set | A collection of strings representing HTML tags that are defined as void elements. |
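To make the idea of a void-element lookup concrete, here is a small sketch. The set below follows the void elements named in the HTML Living Standard; the set actually returned by this class is not shown in this reference and may contain additional names. The helper `needs_closing_tag` is hypothetical.

```python
# Void elements per the HTML Living Standard. The set used by
# TruncateHTMLParser itself is an assumption here and may differ.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


def needs_closing_tag(tag: str) -> bool:
    """Return True when the tag must be tracked and closed later."""
    return tag.lower() not in VOID_ELEMENTS


checks = {tag: needs_closing_tag(tag) for tag in ("br", "img", "div", "p")}
```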
handle_startendtag()
def handle_startendtag(
tag: str,
attrs: list
) -> None
Processes a self-closing HTML tag by handling it as a start tag and then, unless it is a void element, as an end tag as well.
Parameters
| Name | Type | Description |
|---|---|---|
| tag | str | The name of the HTML tag being processed. |
| attrs | list | A list of (name, value) pairs containing the attributes found inside the tag's brackets. |
Returns
| Type | Description |
|---|---|
| None | This method does not return a value. |
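The delegation described for self-closing tags could look roughly like the sketch below. `StartEndSketch`, its `events` list, and the abbreviated `VOID` set are illustrative assumptions; the handlers here only record which callbacks fire.

```python
from html.parser import HTMLParser

VOID = frozenset({"br", "hr", "img"})  # abbreviated, illustrative set


class StartEndSketch(HTMLParser):
    """Hypothetical parser that records which handlers were invoked."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_startendtag(self, tag, attrs):
        # Always treat the tag as opened...
        self.handle_starttag(tag, attrs)
        # ...and close it immediately unless it is a void element.
        if tag not in VOID:
            self.handle_endtag(tag)

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = StartEndSketch()
parser.feed("<br/><span/>")
events = parser.events
```

In this sketch `<br/>` produces only a start event, while `<span/>` produces a start event followed by an end event.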
handle_starttag()
def handle_starttag(
tag: str,
attrs: list
) -> None
Appends the raw start tag text to the output and tracks non-void elements in a stack to ensure proper closing later.
Parameters
| Name | Type | Description |
|---|---|---|
| tag | str | The name of the HTML tag being opened. |
| attrs | list | A list of (name, value) pairs containing the attributes found inside the tag's brackets. |
Returns
| Type | Description |
|---|---|
| None | This method does not return a value. |
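A minimal sketch of the start-tag bookkeeping follows. It assumes the raw tag text is obtained with `HTMLParser.get_starttag_text()` and that the most recently opened tag is kept at the left end of the deque. Both are assumptions about the implementation, and `StartTagSketch` and `VOID` are hypothetical names.

```python
from collections import deque
from html.parser import HTMLParser

VOID = frozenset({"br", "hr", "img"})  # abbreviated, illustrative set


class StartTagSketch(HTMLParser):
    """Hypothetical parser showing only the start-tag handling."""

    def __init__(self):
        super().__init__()
        self.tags = deque()
        self.output = []

    def handle_starttag(self, tag, attrs):
        # Keep the tag exactly as written, attributes included.
        self.output.append(self.get_starttag_text())
        # Only non-void elements need a closing tag later on.
        if tag not in VOID:
            self.tags.appendleft(tag)


parser = StartTagSketch()
parser.feed('<div class="a"><img src="x.png"><p>')
```

After this call the output holds all three start tags verbatim, while the stack holds only `p` and `div`, innermost first.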
handle_endtag()
def handle_endtag(
tag: str
) -> None
Appends a closing HTML tag to the output and removes the corresponding tag from the tracking stack if it matches the most recent start tag.
Parameters
| Name | Type | Description |
|---|---|---|
| tag | str | The name of the HTML tag being closed. |
Returns
| Type | Description |
|---|---|
| None | This method does not return a value. |
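The end-tag behaviour described above might be sketched as follows. The check against the left end of the deque follows the description literally ("matches the most recent start tag"); the real class may handle mismatched or void tags differently. `EndTagSketch` is a hypothetical name.

```python
from collections import deque
from html.parser import HTMLParser


class EndTagSketch(HTMLParser):
    """Hypothetical parser showing only the end-tag handling."""

    def __init__(self):
        super().__init__()
        self.tags = deque()
        self.output = []

    def handle_starttag(self, tag, attrs):
        self.tags.appendleft(tag)

    def handle_endtag(self, tag):
        # The closing tag is always written to the output.
        self.output.append(f"</{tag}>")
        # Pop only when it closes the most recently opened tag.
        if self.tags and self.tags[0] == tag:
            self.tags.popleft()


parser = EndTagSketch()
parser.feed("<div><p></p></span>")
```

Here `</p>` pops `p` from the stack, whereas the stray `</span>` is written out but leaves the still-open `div` in place.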
handle_data()
def handle_data(
data: str
) -> None
Processes text content and tracks the remaining character budget, raising TruncationCompleted if the length limit is exceeded.
Parameters
| Name | Type | Description |
|---|---|---|
| data | str | The raw text content found between HTML tags to be processed and potentially truncated. |
Returns
| Type | Description |
|---|---|
| None | This method does not return a value. |
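The character budget can be illustrated without a parser at all. In the sketch below, `DataBudgetSketch` is a hypothetical class and the cut is made by plain character slicing, which is an assumption; the real class may choose the cut point differently.

```python
class TruncationCompleted(Exception):
    """Signals that the character budget has been used up."""


class DataBudgetSketch:
    """Hypothetical holder for the budget logic only."""

    def __init__(self, length, replacement):
        self.remaining = length
        self.replacement = replacement
        self.output = []

    def handle_data(self, data):
        if len(data) > self.remaining:
            # Keep what still fits, add the suffix, and stop parsing.
            self.output.append(data[: self.remaining] + self.replacement)
            self.remaining = 0
            raise TruncationCompleted
        self.remaining -= len(data)
        self.output.append(data)


sketch = DataBudgetSketch(length=8, replacement="...")
sketch.handle_data("Hello ")       # fits: 6 of 8 characters used
try:
    sketch.handle_data("world")    # needs 5 characters, only 2 are left
except TruncationCompleted:
    pass
result = "".join(sketch.output)    # "Hello wo..."
```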
feed()
def feed(
data: str
) -> None
Parses the provided HTML data and ensures all open tags are gracefully closed if truncation occurs during processing.
Parameters
| Name | Type | Description |
|---|---|---|
| data | str | The HTML formatted string to be parsed and truncated. |
Returns
| Type | Description |
|---|---|
| None | This method does not return a value. |
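Putting the pieces together, the sketch below shows one way `feed()` could close open tags once truncation is signalled. It is an end-to-end approximation under stated assumptions, not the actual implementation: `FeedSketch`, the abbreviated `VOID` set, the module-level `TruncationCompleted` exception, and the character-slicing cut are all illustrative.

```python
from collections import deque
from html.parser import HTMLParser

VOID = frozenset({"br", "hr", "img", "input"})  # abbreviated, illustrative set


class TruncationCompleted(Exception):
    """Raised by handle_data once the character budget is exhausted."""


class FeedSketch(HTMLParser):
    """Hypothetical end-to-end approximation of the documented behaviour."""

    def __init__(self, length, replacement):
        super().__init__(convert_charrefs=True)
        self.tags = deque()
        self.output = []
        self.remaining = length
        self.replacement = replacement

    def handle_starttag(self, tag, attrs):
        self.output.append(self.get_starttag_text())
        if tag not in VOID:
            self.tags.appendleft(tag)

    def handle_endtag(self, tag):
        self.output.append(f"</{tag}>")
        if self.tags and self.tags[0] == tag:
            self.tags.popleft()

    def handle_data(self, data):
        if len(data) > self.remaining:
            self.output.append(data[: self.remaining] + self.replacement)
            self.remaining = 0
            raise TruncationCompleted
        self.remaining -= len(data)
        self.output.append(data)

    def feed(self, data):
        try:
            super().feed(data)
        except TruncationCompleted:
            # Close whatever is still open, innermost element first.
            self.output.extend(f"</{tag}>" for tag in self.tags)
            self.tags.clear()
            self.reset()


parser = FeedSketch(length=8, replacement="...")
parser.feed("<p>Hello <b>world</b></p>")
html = "".join(parser.output)  # "<p>Hello <b>wo...</b></p>"
```

Tags do not count against the budget in this sketch; only text passed to `handle_data` does, which is why the markup survives intact around the shortened text.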