helpers

Processing helper functions.

clean_doc(doc, *processors)

Cleans a spaCy document and returns a cleaned string.

Parameters:

    doc (Doc): spaCy document to be cleaned. Required.
    *processors (Callable[[Token], Union[str, Token]]): Callable token processors. Default: ().

Returns:

    str: A string of the cleaned text.

Source code in spacy_cleaner/processing/helpers.py
# Module-level imports used by the excerpts on this page.
import re
from typing import Callable, Union

from spacy import tokens


def clean_doc(
    doc: tokens.Doc,
    *processors: Callable[[tokens.Token], Union[str, tokens.Token]],
) -> str:
    """Cleans a spaCy document and returns a cleaned string.

    Args:
        doc: spaCy document to be cleaned.
        *processors: Callable token processors.

    Returns:
        A string of the cleaned text.
    """
    s = " ".join([token_pipe(tok, *processors) for tok in doc])
    return replace_multi_whitespace(s)
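
A short usage sketch, assuming the en_core_web_sm model is installed and that spacy_cleaner.processing exposes remove_stopword_token and replace_punctuation_token (as in the package README); the exact output depends on the model.

import spacy

from spacy_cleaner import processing
from spacy_cleaner.processing.helpers import clean_doc

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox, jumps over the lazy dog!")

cleaned = clean_doc(
    doc,
    processing.remove_stopword_token,
    processing.replace_punctuation_token,
)
print(cleaned)  # e.g. "quick brown fox _IS_PUNCT_ jumps lazy dog _IS_PUNCT_"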

replace_multi_whitespace(s, replace=' ')

Replace runs of two or more whitespace characters with a single replacement string (a space by default).

Parameters:

    s (str): The string to process. Required.
    replace (str): The replacement string. Default: ' '.

Returns:

    str: A string with consecutive whitespace replaced by the replacement string, stripped of leading and trailing whitespace.

Source code in spacy_cleaner/processing/helpers.py
def replace_multi_whitespace(s: str, replace: str = " ") -> str:
    """Replace runs of whitespace with the replacement string.

    Args:
      s: The string to process.
      replace: The replacement string.

    Returns:
      A string with consecutive whitespace replaced by the replacement
        string, stripped of leading and trailing whitespace.
    """
    # `\s\s+` matches two or more consecutive whitespace characters, so
    # single spaces are left untouched; `strip()` trims the ends.
    return re.sub(r"\s\s+", replace, s, flags=re.UNICODE).strip()
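
A quick illustration of the behaviour (note that `\s\s+` only matches runs of two or more characters, so single spaces are left alone):

from spacy_cleaner.processing.helpers import replace_multi_whitespace

replace_multi_whitespace("a  b\t\tc\n\nd")      # "a b c d"
replace_multi_whitespace("  padded   text  ")   # "padded text"
replace_multi_whitespace("a b")                 # "a b" (unchanged)
replace_multi_whitespace("a  b", replace="_")   # "a_b"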

token_pipe(tok, *processors)

Applies a series of processors to a token until it becomes a string.

Each processor is applied in turn; as soon as one returns a string, that string is returned. If no processor returns a string, the token is converted to its text.

Parameters:

    tok (Token): The token to be transformed. Required.
    *processors (Callable[[Token], Union[str, Token]]): Callable token processors. Default: ().

Returns:

    str: A string of the token after being processed.

Source code in spacy_cleaner/processing/helpers.py
def token_pipe(
    tok: tokens.Token,
    *processors: Callable[[tokens.Token], Union[str, tokens.Token]],
) -> str:
    """Applies a series of processors to a token until it becomes a string.

    Each processor is applied in turn; as soon as one returns a string,
        that string is returned.

    Args:
        tok: The token to be transformed.
        *processors: Callable token processors.

    Returns:
        A string of the token after being processed.
    """
    for processor in processors:
        tok = processor(tok)  # type: ignore[assignment]
        if isinstance(tok, str):
            # A processor produced a string: stop early and return it.
            return str(tok)
    # No processor returned a string: fall back to the token's text.
    return str(tok)
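
A toy sketch of the early-exit behaviour. The two processors below are hypothetical, written only for this example; `is_stop` works even on a blank English pipeline because stop words are part of the language data.

import spacy

from spacy_cleaner.processing.helpers import token_pipe

nlp = spacy.blank("en")
doc = nlp("Hello the world")

def blank_stopword(token):
    # Returning a string ends the pipe for this token.
    return "" if token.is_stop else token

def to_text(token):
    return token.text

print([token_pipe(tok, blank_stopword, to_text) for tok in doc])
# ['Hello', '', 'world']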