MLStripper
This class strips markup from HTML content using the HTMLParser interface. During parsing it collects raw text, entity references, and character references, keeping the references in their original escaped form, and discards the tags. Call get_data to retrieve the accumulated text.
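The source code of the class is not shown in this reference, so the following is only a minimal sketch reconstructed from the method descriptions. It assumes Python 3 and `html.parser.HTMLParser`, writes the handlers as ordinary instance methods (which is how HTMLParser invokes them), and assumes `convert_charrefs=False`, because with the Python 3.5+ default of `True` the parser decodes references itself and never calls `handle_entityref` or `handle_charref`.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Sketch of a markup stripper; not necessarily the original implementation."""

    def __init__(self):
        # Assumption: disable automatic reference decoding so that the
        # entity and character reference handlers below are invoked.
        super().__init__(convert_charrefs=False)
        self.fed = []  # accumulated text fragments

    def handle_data(self, d):
        # Text found between tags.
        self.fed.append(d)

    def handle_entityref(self, name):
        # Re-emit a named reference such as &amp; unchanged.
        self.fed.append("&%s;" % name)

    def handle_charref(self, name):
        # Re-emit a numeric reference such as &#169; unchanged.
        self.fed.append("&#%s;" % name)

    def get_data(self):
        return "".join(self.fed)


stripper = MLStripper()
stripper.feed("<p>Fish &amp; <b>chips</b> &#169; 2020</p>")
stripper.close()
text = stripper.get_data()
print(text)  # Fish &amp; chips &#169; 2020
```

Tags are dropped because the sketch does not override the tag handlers, whose default implementations do nothing.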
Attributes
| Attribute | Type | Description |
|---|---|---|
| fed | list = [] | A list of strings used to accumulate text data, entity references, and character references encountered during HTML parsing. |
Constructor
Signature
def MLStripper() -> null
Methods
handle_data()
def handle_data(
    d: string
) -> null
Appends raw text content encountered between HTML tags to the internal buffer.
Parameters
| Name | Type | Description |
|---|---|---|
| d | string | The text content found within an HTML element to be preserved. |
Returns
| Type | Description |
|---|---|
| null | Nothing is returned; the internal state is updated. |
handle_entityref()
def handle_entityref(
    name: string
) -> null
Preserves HTML entity references by appending them in their original format to the internal buffer.
Parameters
| Name | Type | Description |
|---|---|---|
| name | string | The name of the entity reference, such as 'amp' or 'quot'. |
Returns
| Type | Description |
|---|---|
| null | Nothing is returned; the internal state is updated. |
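Whether this handler runs at all depends on how the underlying parser is configured. The sketch below assumes Python 3's `html.parser.HTMLParser` and uses a hypothetical `EntityLogger` class and `log_calls` helper (neither is part of MLStripper) to record which handler receives each fragment with `convert_charrefs` off and on.

```python
from html.parser import HTMLParser


class EntityLogger(HTMLParser):
    """Hypothetical helper that records which handler sees each fragment."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.calls = []

    def handle_data(self, d):
        self.calls.append(("data", d))

    def handle_entityref(self, name):
        self.calls.append(("entityref", name))


def log_calls(html, convert_charrefs):
    parser = EntityLogger(convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.calls


raw = log_calls("a &amp; b", convert_charrefs=False)
# [('data', 'a '), ('entityref', 'amp'), ('data', ' b')]

converted = log_calls("a &amp; b", convert_charrefs=True)
# [('data', 'a & b')]
```

With `convert_charrefs=True` the reference is decoded and delivered through `handle_data`, so a stripper that wants to preserve `&amp;` verbatim needs the `False` setting.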
handle_charref()
def handle_charref(
    name: string
) -> null
Preserves HTML character references by appending them in their original numeric format to the internal buffer.
Parameters
| Name | Type | Description |
|---|---|---|
| name | string | The numeric code of the character reference without the leading '&#' and trailing ';', such as '169' (decimal) or 'x41' (hexadecimal). |
Returns
| Type | Description |
|---|---|
| null | Nothing is returned; the internal state is updated. |
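To illustrate what the `name` argument looks like for decimal and hexadecimal references, here is a small sketch. It assumes Python 3's `html.parser.HTMLParser` with `convert_charrefs=False`, and `CharrefEcho` is a hypothetical stand-in class that keeps only the character references, not the documented class itself.

```python
from html.parser import HTMLParser


class CharrefEcho(HTMLParser):
    """Hypothetical parser that keeps only numeric character references."""

    def __init__(self):
        # Assumption: references must not be auto-decoded, otherwise
        # handle_charref is never called.
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_charref(self, name):
        # name is "169" for &#169; and "x41" for &#x41;
        self.fed.append("&#%s;" % name)


parser = CharrefEcho()
parser.feed("&#169; and &#x41;")
parser.close()
charrefs = parser.fed
print(charrefs)  # ['&#169;', '&#x41;']
```

Because the hexadecimal form arrives with its `x` prefix intact, re-wrapping `name` in `&#` and `;` reproduces either form exactly.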
get_data()
def get_data() -> string
Concatenates and returns all collected text fragments into a single string.
Returns
| Type | Description |
|---|---|
| string | The complete string of stripped text content with HTML tags removed. |
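The relationship between the fragment list and the returned string can be seen in a reduced sketch. `TextOnly` is a hypothetical stand-in that implements only `handle_data` and `get_data`, assuming Python 3's `html.parser.HTMLParser`; the input is fed in two chunks to show that fragments accumulate across calls to `feed`.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical reduced stripper: text fragments only."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        # Join the fragments in the order they were encountered.
        return "".join(self.fed)


parser = TextOnly()
parser.feed("<p>Hello, ")          # first chunk
parser.feed("<b>world</b>!</p>")   # second chunk
parser.close()

fragments = list(parser.fed)
joined = parser.get_data()
print(fragments)  # ['Hello, ', 'world', '!']
print(joined)     # Hello, world!
```

Joining with an empty separator means no whitespace is inserted where tags were removed, so adjacent block elements can run together in the output.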