`lector.types.strings`

Helpers to convert to types that logically remain strings (e.g. categoricals).

Classes

`Category`	Anything could be text, but we can enforce text-likeness and uniqueness.
`Sex`	Generic enumeration.
`SexMapper`	Infer values encoding a person's sex in a column and map to configurable labels.
`Text`	Anything could be text, but we can enforce text-likeness and uniqueness.
`Url`	Anything could be text, but we can enforce text-likeness and uniqueness.

Functions

`is_text`(arr[, min_spaces, min_length, reject_lists])	Check for natural language-like texts using criteria like lengths, number of spaces.
`maybe_cast_category`(arr[, max_cardinality])	Cast to categorical depending on cardinality and whether strings are text-like.
`maybe_sex`(arr)	Check if the two most common values are sex-like and return them.
`proportion_text`(arr[, min_spaces, min_length, ...])	Calculate proportion of natural language-like texts given criteria.
`proportion_url`(arr)	Use regex to find proportion of strings that are (web) URL-like.
`sufficient_texts`(arr[, min_spaces, min_length, ...])	Check for natural language-like texts using criteria like lengths, number of spaces.

Attributes

`MAX_CARDINALITY`	Maximum cardinality for categoricals (arrow's default is 50 in ConvertOptions).
`TEXT_MIN_LENGTH`	Strings need to be this long to be considered text.
`TEXT_MIN_SPACES`	Strings need to have this many spaces to be considered text.
`TEXT_PROPORTION_THRESHOLD`	Infer text type if a proportion of values greater than this is text-like.
`TEXT_REJECT_LISTS`	Whether to count list-like strings as texts.

class lector.types.strings.Category

Bases: lector.types.abc.Converter

Anything could be text, but we can enforce text-likeness and uniqueness.

max_cardinality: lector.utils.Number | None

convert(array)

To be implemented in subclasses.

Parameters: array (pyarrow.Array)
Return type: lector.types.abc.Conversion | None

class lector.types.strings.Sex

Bases: enum.Enum

Generic enumeration.

Derive from this class to define new enumerations.

Female = 0

Male = 1

class lector.types.strings.SexMapper(values, labels=None)

Infer values encoding a person’s sex in a column and map to configurable labels.

Parameters:

values (tuple[str, str])
labels (dict[Sex, str] | None)

DEFAULT_VALUES

infer_values(values)

Infer which values encode female/male categories.

Parameters: values (tuple[str, str])
Return type: dict

make_mapping()

Create a mapping from inferred values to desired labels.

Return type: dict[str, str]
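The two-step idea behind `infer_values` and `make_mapping` can be illustrated with a small standalone sketch. It does not use lector: the local `Sex` enum mirrors the one documented above, while `DEFAULT_LABELS`, `infer_values_sketch`, `make_mapping_sketch` and the first-letter heuristic are assumptions made for illustration only.

```python
from enum import Enum


class Sex(Enum):
    """Local stand-in mirroring the documented enum members."""
    Female = 0
    Male = 1


# Hypothetical default labels; lector's own defaults may differ.
DEFAULT_LABELS = {Sex.Female: "Female", Sex.Male: "Male"}


def infer_values_sketch(values):
    """Guess which of two values encodes which category.

    Assumption: a value starting with "m" encodes male; the other
    value is then taken to encode female.
    """
    first, second = values
    if first.strip().lower().startswith("m"):
        first, second = second, first
    return {Sex.Female: first, Sex.Male: second}


def make_mapping_sketch(values, labels=None):
    """Map each inferred raw value to its desired label."""
    labels = labels or DEFAULT_LABELS
    inferred = infer_values_sketch(values)
    return {inferred[sex]: labels[sex] for sex in Sex}


print(make_mapping_sketch(("m", "f")))
print(make_mapping_sketch(("F", "M"), {Sex.Female: "woman", Sex.Male: "man"}))
```

The resulting dictionary maps raw column values to labels, which matches the documented `dict[str, str]` return type of `make_mapping`.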

class lector.types.strings.Text

Bases: lector.types.abc.Converter

Anything could be text, but we can enforce text-likeness and uniqueness.

min_unique: float = 0.1

convert(array)

To be implemented in subclasses.

Parameters: array (pyarrow.Array)
Return type: lector.types.abc.Conversion | None

class lector.types.strings.Url

Bases: lector.types.abc.Converter

Anything could be text, but we can enforce text-likeness and uniqueness.

convert(array)

To be implemented in subclasses.

Parameters: array (pyarrow.Array)
Return type: lector.types.abc.Conversion | None

lector.types.strings.is_text(arr, min_spaces=TEXT_MIN_SPACES, min_length=TEXT_MIN_LENGTH, reject_lists=TEXT_REJECT_LISTS)

Check for natural language-like texts using criteria like lengths, number of spaces.

Parameters:

arr (pyarrow.Array)
min_spaces (int)
min_length (int)
reject_lists (bool)

Return type:

bool

lector.types.strings.maybe_cast_category(arr, max_cardinality=MAX_CARDINALITY)

Cast to categorical depending on cardinality and whether strings are text-like.

Parameters:

arr (pyarrow.Array)
max_cardinality (lector.utils.Number | None)

Return type:

pyarrow.Array | None
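As a rough illustration of a cardinality-based decision, the following standalone sketch works on plain Python lists instead of `pyarrow` arrays and ignores the text-likeness part. The helper `should_cast_category` is hypothetical, and the interpretation of `max_cardinality` (values up to 1 as a ratio of unique to non-null values, larger values as an absolute count, `None` as no limit) is an assumption, not something this page confirms.

```python
def should_cast_category(values, max_cardinality=0.1):
    """Decide whether a column has few enough distinct values.

    Assumptions (not taken from lector):
    - None means "no limit".
    - A number greater than 1 is an absolute count of distinct values.
    - A number up to 1 is the ratio of distinct to non-null values.
    """
    if max_cardinality is None:
        return True

    valid = [v for v in values if v is not None]
    if not valid:
        # Nothing to base a decision on; an arbitrary choice in this sketch.
        return False

    n_unique = len(set(valid))
    if max_cardinality > 1:
        return n_unique <= max_cardinality
    return n_unique / len(valid) <= max_cardinality


colors = ["red", "green", "blue"] * 20   # 3 distinct values out of 60
ids = [str(i) for i in range(30)]        # every value is distinct

print(should_cast_category(colors))
print(should_cast_category(ids))
print(should_cast_category(ids, max_cardinality=50))
```

A ratio-based limit scales with column length, while an absolute count behaves like the arrow `ConvertOptions` default mentioned for `MAX_CARDINALITY`.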

lector.types.strings.maybe_sex(arr)

Check if the two most common values are sex-like and return them.

Parameters: arr (pyarrow.Array)
Return type: tuple[str, str] | None
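The "two most common values" check can be sketched in plain Python as follows. This is not lector's implementation: `maybe_sex_sketch` and the `SEX_LIKE_PAIRS` list of recognised encodings are hypothetical, and the case-insensitive comparison is an assumption.

```python
from collections import Counter

# Hypothetical encodings considered "sex-like"; lector's own list may differ.
SEX_LIKE_PAIRS = [{"f", "m"}, {"female", "male"}, {"w", "m"}]


def maybe_sex_sketch(values):
    """Return the two most common values if they look like a sex encoding."""
    counts = Counter(v for v in values if v is not None)
    top_two = [value for value, _ in counts.most_common(2)]
    if len(top_two) < 2:
        return None

    # Compare case-insensitively, but return the values as they appear.
    normalised = {v.strip().lower() for v in top_two}
    if normalised in SEX_LIKE_PAIRS:
        return (top_two[0], top_two[1])
    return None


print(maybe_sex_sketch(["F", "M", "F", "M", "F", "unknown"]))
print(maybe_sex_sketch(["a", "b", "a"]))
```

Looking only at the two most frequent values tolerates a small number of stray entries such as "unknown" in an otherwise binary column.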

lector.types.strings.proportion_text(arr, min_spaces=TEXT_MIN_SPACES, min_length=TEXT_MIN_LENGTH, reject_lists=TEXT_REJECT_LISTS)

Calculate proportion of natural language-like texts given criteria.

Parameters:

arr (pyarrow.Array)
min_spaces (int)
min_length (int)
reject_lists (bool)

Return type:

float
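To make the criteria concrete, here is a minimal pure-Python sketch of how such a proportion could be computed. The helpers `looks_like_text` and `proportion_text_sketch` are hypothetical. The sketch assumes that "this long" and "this many" mean at least that many characters or spaces, that a list-like string is one wrapped in square brackets, and that nulls are excluded from the denominator. The defaults mirror the documented values of `TEXT_MIN_SPACES`, `TEXT_MIN_LENGTH` and `TEXT_REJECT_LISTS`.

```python
def looks_like_text(value, min_spaces=2, min_length=15, reject_lists=True):
    """Hypothetical per-string check; not lector's actual implementation."""
    if len(value) < min_length:
        return False
    if value.count(" ") < min_spaces:
        return False
    stripped = value.strip()
    # Assumption: "list-like" means the string is wrapped in square brackets.
    if reject_lists and stripped.startswith("[") and stripped.endswith("]"):
        return False
    return True


def proportion_text_sketch(values, min_spaces=2, min_length=15, reject_lists=True):
    """Share of non-null strings that pass all text-likeness criteria."""
    valid = [v for v in values if v is not None]
    if not valid:
        return 0.0
    hits = sum(looks_like_text(v, min_spaces, min_length, reject_lists) for v in valid)
    return hits / len(valid)


values = [
    "The quick brown fox jumps.",     # long enough, several spaces
    "short",                          # too short
    "[1, 2, 3, 4, 5, 6, 7]",          # long with spaces, but list-like
    "Another sentence with spaces.",  # long enough, several spaces
]

print(proportion_text_sketch(values))
print(proportion_text_sketch(values, reject_lists=False))
```

Comparing such a proportion against a cut-off like `TEXT_PROPORTION_THRESHOLD` is one way to turn it into a yes/no decision for a whole column.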

lector.types.strings.proportion_url(arr)

Use regex to find proportion of strings that are (web) URL-like.

Parameters: arr (pyarrow.Array)
Return type: float
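A regex-based proportion of this kind might look like the sketch below. The pattern is a deliberately simple assumption that accepts `http://`, `https://` and `www.` prefixes; it is not the expression lector uses, and `proportion_url_sketch` is a hypothetical helper operating on a plain list.

```python
import re

# Assumption: a simple pattern for web URLs, chosen for illustration only.
URL_PATTERN = re.compile(r"^(https?://|www\.)[^\s/$.?#][^\s]*$", re.IGNORECASE)


def proportion_url_sketch(values):
    """Share of non-null strings that match the URL pattern."""
    valid = [v for v in values if v is not None]
    if not valid:
        return 0.0
    hits = sum(bool(URL_PATTERN.match(v)) for v in valid)
    return hits / len(valid)


values = ["https://example.com/a?b=1", "www.example.org", "not a url", None]
print(proportion_url_sketch(values))
```

In this sketch nulls are left out of the denominator, so a sparse column of valid URLs still scores highly.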

lector.types.strings.sufficient_texts(arr, min_spaces=TEXT_MIN_SPACES, min_length=TEXT_MIN_LENGTH, reject_lists=TEXT_REJECT_LISTS, threshold=1.0)

Check for natural language-like texts using criteria like lengths, number of spaces.

Unlike `is_text`, this function checks the text conditions one at a time and returns early as soon as one is not met, without evaluating the remaining conditions, so it should be faster.

Parameters:

arr (pyarrow.Array)
min_spaces (int)
min_length (int)
reject_lists (bool)
threshold (float)

Return type:

bool
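The early-out idea can be sketched in plain Python: evaluate one condition over the whole column, and stop as soon as its share of passing strings falls below `threshold`. The helper `sufficient_texts_sketch` is hypothetical, and the definitions of the individual conditions (at least `min_length` characters, at least `min_spaces` spaces, not wrapped in square brackets) are assumptions rather than lector's actual rules.

```python
def sufficient_texts_sketch(values, min_spaces=2, min_length=15,
                            reject_lists=True, threshold=1.0):
    """Check conditions one by one and stop at the first that fails."""
    valid = [v for v in values if v is not None]
    if not valid:
        return False

    # Each condition is evaluated only if all earlier ones passed.
    conditions = [
        lambda v: len(v) >= min_length,
        lambda v: v.count(" ") >= min_spaces,
    ]
    if reject_lists:
        # Assumption: "list-like" means wrapped in square brackets.
        conditions.append(lambda v: not (v.startswith("[") and v.endswith("]")))

    for condition in conditions:
        share = sum(condition(v) for v in valid) / len(valid)
        if share < threshold:
            return False  # early out: remaining conditions are skipped
    return True


texts = ["A full sentence with words.", "Another longer sentence here."]
mixed = ["short", "A full sentence with words."]

print(sufficient_texts_sketch(texts))
print(sufficient_texts_sketch(mixed))
print(sufficient_texts_sketch(mixed, threshold=0.5))
```

Returning `False` early is always safe, because a single condition falling below the threshold means the share of strings meeting all conditions is below it as well. In this sketch a `True` result is exact only for `threshold=1.0`; for lower thresholds each condition can pass on its own while fewer strings meet all of them at once.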

lector.types.strings.MAX_CARDINALITY: lector.utils.Number = 0.1

Maximum cardinality for categoricals (arrow’s default is 50 in ConvertOptions).

lector.types.strings.TEXT_MIN_LENGTH: lector.utils.Number = 15

Strings need to be this long to be considered text.

lector.types.strings.TEXT_MIN_SPACES: lector.utils.Number = 2

Strings need to have this many spaces to be considered text.

lector.types.strings.TEXT_PROPORTION_THRESHOLD: float = 0.8

Infer text type if a proportion of values greater than this is text-like.

lector.types.strings.TEXT_REJECT_LISTS: bool = True

Whether to count list-like strings as texts.
