lector.types.strings
Helpers to convert to types that logically remain strings (e.g. categoricals).
Classes
Category | Anything could be text, but we can enforce text-likeness and uniqueness.
Sex | Generic enumeration.
SexMapper | Infer values encoding a person's sex in a column and map to configurable labels.
Text | Anything could be text, but we can enforce text-likeness and uniqueness.
Url | Anything could be text, but we can enforce text-likeness and uniqueness.
Functions
is_text | Check for natural language-like texts using criteria like lengths, number of spaces.
maybe_cast_category | Cast to categorical depending on cardinality and whether strings are text-like.
maybe_sex | Check if the two most common values are sex-like and return them.
proportion_text | Calculate proportion of natural language-like texts given criteria.
proportion_url | Use regex to find proportion of strings that are (web) URL-like.
sufficient_texts | Check for natural language-like texts using criteria like lengths, number of spaces.
Attributes
MAX_CARDINALITY | Maximum cardinality for categoricals (arrow's default is 50 in ConvertOptions).
TEXT_MIN_LENGTH | Strings need to be this long to be considered text.
TEXT_MIN_SPACES | Strings need to have this many spaces to be considered text.
| Infer text type if a proportion of values greater than this is text-like.
TEXT_REJECT_LISTS | Whether to count list-like strings as texts.
- class lector.types.strings.Category
Bases: lector.types.abc.Converter
Anything could be text, but we can enforce text-likeness and uniqueness.
- convert(array)
To be implemented in subclasses.
- Parameters:
array (pyarrow.Array) –
- Return type:
lector.types.abc.Conversion | None
- class lector.types.strings.Sex
Bases: enum.Enum
Generic enumeration.
Derive from this class to define new enumerations.
- class lector.types.strings.SexMapper(values, labels=None)
Infer values encoding a person’s sex in a column and map to configurable labels.
- Parameters:
values (tuple[str, str]) –
labels (dict[Sex, str] | None) –
- class lector.types.strings.Text
Bases: lector.types.abc.Converter
Anything could be text, but we can enforce text-likeness and uniqueness.
- convert(array)
To be implemented in subclasses.
- Parameters:
array (pyarrow.Array) –
- Return type:
lector.types.abc.Conversion | None
- class lector.types.strings.Url
Bases: lector.types.abc.Converter
Anything could be text, but we can enforce text-likeness and uniqueness.
- convert(array)
To be implemented in subclasses.
- Parameters:
array (pyarrow.Array) –
- Return type:
lector.types.abc.Conversion | None
- lector.types.strings.is_text(arr, min_spaces=TEXT_MIN_SPACES, min_length=TEXT_MIN_LENGTH, reject_lists=TEXT_REJECT_LISTS)
Check for natural language-like texts using criteria like lengths, number of spaces.
- Parameters:
arr (pyarrow.Array) –
min_spaces (int) –
min_length (int) –
reject_lists (bool) –
- Return type:
bool
- lector.types.strings.maybe_cast_category(arr, max_cardinality=MAX_CARDINALITY)
Cast to categorical depending on cardinality and whether strings are text-like.
- Parameters:
arr (pyarrow.Array) –
max_cardinality (lector.utils.Number | None) –
- Return type:
pyarrow.Array | None
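The cardinality part of this decision can be illustrated with a minimal pure-Python sketch. It is not lector's implementation: it works on a plain list instead of a pyarrow.Array, ignores the text-likeness check, and the helper name `within_cardinality` is hypothetical. It also assumes that a limit between 0 and 1 (such as the documented default of 0.1) is a proportion of distinct values, that larger limits are absolute counts, and that `None` disables the limit.

```python
def within_cardinality(values, max_cardinality=0.1):
    """Return True if the number of distinct non-null values is small enough."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    if max_cardinality is None:
        # Assumption: no limit means any cardinality is acceptable.
        return True
    n_unique = len(set(non_null))
    if 0 < max_cardinality <= 1:
        # Assumption: fractional limits are relative to the number of non-null values.
        return n_unique / len(non_null) <= max_cardinality
    # Assumption: limits above 1 are absolute counts of distinct values.
    return n_unique <= max_cardinality


colors = ["red", "green", "blue"] * 20   # 3 distinct values in 60 rows
ids = [f"id-{i}" for i in range(60)]     # 60 distinct values in 60 rows
print(within_cardinality(colors), within_cardinality(ids))
```

Under these assumptions a column of repeated labels qualifies as categorical, while a column of unique identifiers does not.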
- lector.types.strings.maybe_sex(arr)
Check if the two most common values are sex-like and return them.
- Parameters:
arr (pyarrow.Array) –
- Return type:
tuple[str, str] | None
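As a rough illustration of the idea, the sketch below picks the two most common values from a plain Python list and checks them against a small set of known pairs. The pair list, the case-insensitive comparison, and the name `maybe_sex_sketch` are all assumptions made for this example; which values lector itself recognizes is not stated here.

```python
from collections import Counter

# Assumption: a small, illustrative set of value pairs that commonly encode sex.
SEX_LIKE_PAIRS = [
    {"m", "f"},
    {"male", "female"},
    {"man", "woman"},
]


def maybe_sex_sketch(values):
    """Return the two most common values if, lowercased, they form a known pair."""
    counts = Counter(v for v in values if v is not None)
    if len(counts) < 2:
        return None
    (first, _), (second, _) = counts.most_common(2)
    if {first.strip().lower(), second.strip().lower()} in SEX_LIKE_PAIRS:
        return first, second
    return None


column = ["Male", "Female", "Male", "Female", "Female", None, "unknown"]
print(maybe_sex_sketch(column))
```

The returned pair has the same shape as the `values` argument of SexMapper, a tuple of two strings.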
- lector.types.strings.proportion_text(arr, min_spaces=TEXT_MIN_SPACES, min_length=TEXT_MIN_LENGTH, reject_lists=TEXT_REJECT_LISTS)
Calculate proportion of natural language-like texts given criteria.
- Parameters:
arr (pyarrow.Array) –
min_spaces (int) –
min_length (int) –
reject_lists (bool) –
- Return type:
float
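The criteria named in the signature can be illustrated with a minimal pure-Python sketch. It operates on a plain list of strings rather than a pyarrow.Array and is not lector's implementation: the helper names `looks_like_text` and `proportion_text_sketch`, the rule for detecting list-like strings (wrapped in square brackets), and the result for empty input are assumptions. The defaults mirror the documented values of TEXT_MIN_SPACES (2) and TEXT_MIN_LENGTH (15).

```python
def looks_like_text(s, min_spaces=2, min_length=15, reject_lists=True):
    """Hypothetical per-string check: long enough, enough spaces, not list-like."""
    if len(s) < min_length:
        return False
    if s.count(" ") < min_spaces:
        return False
    # Assumption: a "list-like" string is one wrapped in brackets, e.g. "[a, b, c]".
    stripped = s.strip()
    if reject_lists and stripped.startswith("[") and stripped.endswith("]"):
        return False
    return True


def proportion_text_sketch(values, min_spaces=2, min_length=15, reject_lists=True):
    """Fraction of non-null strings that pass all text criteria (0.0 if none)."""
    strings = [v for v in values if v is not None]
    if not strings:
        return 0.0
    n_text = sum(
        looks_like_text(s, min_spaces, min_length, reject_lists) for s in strings
    )
    return n_text / len(strings)


sample = [
    "The quick brown fox jumps over the lazy dog",
    "short",
    "[red, green, blue, yellow]",
    None,
]
print(proportion_text_sketch(sample))
```

In this sample only the first string passes all three criteria; with `reject_lists=False` the bracketed string would count as text as well.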
- lector.types.strings.proportion_url(arr)
Use regex to find proportion of strings that are (web) URL-like.
- Parameters:
arr (pyarrow.Array) –
- Return type:
float
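A regex-based proportion of this kind might look like the following sketch. The pattern is deliberately simple and is an assumption made for illustration, as is the name `proportion_url_sketch`; the actual expression used by lector is not shown here and may accept or reject different strings.

```python
import re

# Assumption: a deliberately simple pattern for http(s) or "www." URLs.
URL_PATTERN = re.compile(r"^(https?://|www\.)[^\s/$.?#][^\s]*$", re.IGNORECASE)


def proportion_url_sketch(values):
    """Fraction of non-null strings matching the URL pattern (0.0 if none)."""
    strings = [v for v in values if v is not None]
    if not strings:
        return 0.0
    n_urls = sum(bool(URL_PATTERN.match(s.strip())) for s in strings)
    return n_urls / len(strings)


links = [
    "https://example.com/path?q=1",
    "www.example.org",
    "not a url",
    "ftp://example.com",
    None,
]
print(proportion_url_sketch(links))
```

Nulls are excluded from the denominator in this sketch, which is also an assumption.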
- lector.types.strings.sufficient_texts(arr, min_spaces=TEXT_MIN_SPACES, min_length=TEXT_MIN_LENGTH, reject_lists=TEXT_REJECT_LISTS, threshold=1.0)
Check for natural language-like texts using criteria like lengths, number of spaces.
This differs from the function above in that each text condition can trigger an early exit if it is not met, without evaluating the remaining conditions, so it should be faster.
- Parameters:
arr (pyarrow.Array) –
min_spaces (int) –
min_length (int) –
reject_lists (bool) –
threshold (float) –
- Return type:
bool
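The early-exit idea described for this function can be sketched in plain Python as follows. This is an illustration under assumptions, not lector's code: it uses a list instead of a pyarrow.Array, the name `sufficient_texts_sketch` and the bracket-based list check are hypothetical, and the order in which conditions are evaluated is chosen arbitrarily.

```python
def sufficient_texts_sketch(
    values, min_spaces=2, min_length=15, reject_lists=True, threshold=1.0
):
    """Return True if at least `threshold` of the non-null strings look like text."""
    strings = [v for v in values if v is not None]
    if not strings:
        return False
    total = len(strings)

    # Each condition is a cheap per-string predicate, applied in sequence.
    conditions = [
        lambda s: len(s) >= min_length,
        lambda s: s.count(" ") >= min_spaces,
    ]
    if reject_lists:
        # Assumption: "list-like" means wrapped in square brackets.
        conditions.append(
            lambda s: not (s.strip().startswith("[") and s.strip().endswith("]"))
        )

    candidates = strings
    for condition in conditions:
        candidates = [s for s in candidates if condition(s)]
        # Early out: too few candidates remain to ever reach the threshold.
        if len(candidates) / total < threshold:
            return False
    return True


texts = [
    "The quick brown fox jumps over the lazy dog",
    "Pack my box with five dozen liquor jugs",
]
print(sufficient_texts_sketch(texts), sufficient_texts_sketch(texts + ["short"]))
```

Because the set of candidates can only shrink, the loop can stop as soon as the surviving proportion drops below `threshold`, and later conditions are never evaluated for that input.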
- lector.types.strings.MAX_CARDINALITY: lector.utils.Number = 0.1
Maximum cardinality for categoricals (arrow’s default is 50 in ConvertOptions).
- lector.types.strings.TEXT_MIN_LENGTH: lector.utils.Number = 15
Strings need to be this long to be considered text.
- lector.types.strings.TEXT_MIN_SPACES: lector.utils.Number = 2
Strings need to have this many spaces to be considered text.