# cleaner

Class `Cleaner` allows for configurable cleaning of text using spaCy.
## Cleaner ¶

Cleans a sequence of texts.
Parameters:

Name | Type | Description | Default
---|---|---|---
`model` | `Language` | A spaCy model. | *required*
`*processors` | `Callable[[Token], Union[str, Token]]` | Callable token processors. | `()`
Example:

```python
import spacy
from spacy_cleaner import Cleaner, processing

model = spacy.blank("en")
model.add_pipe("lemmatizer", config={"mode": "lookup"})
model.initialize()

texts = ["Hello, my name is Cellan! I love to swim!"]

cleaner = Cleaner(
    model,
    processing.remove_stopword_token,
    processing.replace_punctuation_token,
    processing.mutate_lemma_token,
)
cleaner.clean(texts)
```

```
['hello _IS_PUNCT_ Cellan _IS_PUNCT_ love swim _IS_PUNCT_']
```
Source code in spacy_cleaner/cleaners.py
### clean(texts, *, as_tuples=False, batch_size=None, disable=util.SimpleFrozenList(), component_cfg=None, n_process=1) ¶
Clean a stream of texts.
Parameters:

Name | Type | Description | Default
---|---|---|---
`texts` | `Union[Iterable[Union[str, Doc]], Iterable[Tuple[Union[str, Doc], _AnyContext]]]` | A sequence of texts or docs to process. | *required*
`as_tuples` | `bool` | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. | `False`
`batch_size` | `Optional[int]` | The number of texts to buffer. | `None`
`disable` | `Iterable[str]` | The pipeline components to disable. | `SimpleFrozenList()`
`component_cfg` | `Optional[Dict[str, Dict[str, Any]]]` | An optional dictionary with extra keyword arguments for specific components. | `None`
`n_process` | `int` | Number of processors to process texts. If `-1`, `multiprocessing.cpu_count()` is used. | `1`
Returns:

Type | Description
---|---
`List[str]` | A list of cleaned strings in the order of the original texts.