MLStripper

This class strips markup from HTML content using Python's HTMLParser interface. As the parser runs, it records raw text, entity references, and character references while ignoring tags, so the recorded fragments form a plain-text version of the input. Call get_data to retrieve the accumulated text.
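
A minimal sketch of how such a class might look and be used, assuming it subclasses html.parser.HTMLParser from the Python 3 standard library. The constructor body, the convert_charrefs=False argument (with the Python 3 default of True, references are decoded before the reference handlers are ever called), the exact reference formatting, and the strip_tags helper are assumptions made for illustration and are not taken from this reference.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Sketch: collect text and references, ignore tags."""

    def __init__(self):
        # Assumption: convert_charrefs=False so that the two reference
        # handlers below are actually invoked on Python 3.
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, d):
        # Text found between tags.
        self.fed.append(d)

    def handle_entityref(self, name):
        # Assumed format: rebuild the reference as written, e.g. "&amp;".
        self.fed.append(f"&{name};")

    def handle_charref(self, name):
        # Assumed format: rebuild the numeric reference, e.g. "&#62;".
        self.fed.append(f"&#{name};")

    def get_data(self):
        return "".join(self.fed)


def strip_tags(html):
    """Hypothetical helper: strip tags from one HTML string."""
    stripper = MLStripper()
    stripper.feed(html)
    stripper.close()
    return stripper.get_data()


print(strip_tags("<p>Hello &amp; <b>world</b>&#62;</p>"))
# Hello &amp; world&#62;
```

In this sketch the tags disappear while the references stay in their source form, which matches the handler descriptions below.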

Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| fed | list = [] | A list of strings used to accumulate text data, entity references, and character references encountered during HTML parsing. |

Constructor

Signature

def MLStripper() -> None

Methods


handle_data()

def handle_data(
    self,
    d: str
) -> None

Appends raw text content encountered between HTML tags to the internal buffer.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| d | str | The text content found within an HTML element to be preserved. |

Returns

| Type | Description |
| --- | --- |
| None | Nothing is returned; the internal state is updated. |

handle_entityref()

def handle_entityref(
    self,
    name: str
) -> None

Preserves HTML entity references by appending them in their original format to the internal buffer.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| name | str | The name of the entity reference, such as 'amp' or 'quot'. |

Returns

| Type | Description |
| --- | --- |
| None | Nothing is returned; the internal state is updated. |

handle_charref()

def handle_charref(
    self,
    name: str
) -> None

Preserves HTML character references by appending them in their original numeric format to the internal buffer.

Parameters

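| Name | Type | Description |
| --- | --- | --- |
| name | str | The numeric code of the character reference, either decimal (such as '62') or hexadecimal with a leading 'x' (such as 'x3E'). |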

Returns

| Type | Description |
| --- | --- |
| None | Nothing is returned; the internal state is updated. |
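
Because both reference handlers keep references in their source form, the collected text can still contain sequences such as &amp;amp; or &amp;#62;. The sketch below shows the reconstruction formats this behavior implies and, as one possible follow-up step, decoding them with html.unescape from the Python standard library. The helper names rebuild_entityref and rebuild_charref are hypothetical, and the exact format strings are an assumption.

```python
import html


def rebuild_entityref(name):
    # "amp" -> "&amp;"
    return f"&{name};"


def rebuild_charref(name):
    # "62" -> "&#62;", "x3E" -> "&#x3E;"
    return f"&#{name};"


preserved = "Hello " + rebuild_entityref("amp") + " world" + rebuild_charref("62")
print(preserved)                 # Hello &amp; world&#62;
print(html.unescape(preserved))  # Hello & world>
```

Decoding is a separate choice from stripping: the preserved form is safe to re-embed in HTML, while the decoded form is what a reader would see.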

get_data()

def get_data(self) -> str

Concatenates and returns all collected text fragments into a single string.

Returns

| Type | Description |
| --- | --- |
| str | The accumulated text content with HTML tags removed. |
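
The accumulate-then-join pattern behind get_data can be illustrated without a parser. The FragmentBuffer class below is a hypothetical stand-in that mirrors only the fed list and get_data. It assumes get_data does not clear the buffer, which this reference does not state either way.

```python
class FragmentBuffer:
    """Hypothetical stand-in for the fed list and get_data."""

    def __init__(self):
        self.fed = []

    def add(self, fragment):
        self.fed.append(fragment)

    def get_data(self):
        # Join once at the end instead of repeated string concatenation.
        return "".join(self.fed)


buf = FragmentBuffer()
for fragment in ["Hello ", "&amp;", " world"]:
    buf.add(fragment)

first = buf.get_data()
buf.add("!")
second = buf.get_data()
print(first)   # Hello &amp; world
print(second)  # Hello &amp; world!
```

Under that assumption, using a fresh instance per document avoids mixing text from separate inputs.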