lector.types.strings

Helpers to convert to types that logically remain strings (e.g. categoricals).

Classes

Category

Anything could be text, but we can enforce text-likeness and uniqueness.

Sex

Generic enumeration.

SexMapper

Infer values encoding a person's sex in a column and map to configurable labels.

Text

Anything could be text, but we can enforce text-likeness and uniqueness.

Url

Anything could be text, but we can enforce text-likeness and uniqueness.

Functions

is_text(arr[, min_spaces, min_length, reject_lists])

Check for natural language-like texts using criteria like lengths, number of spaces.

maybe_cast_category(arr[, max_cardinality])

Cast to categorical depending on cardinality and whether strings are text-like.

maybe_sex(arr)

Check if the two most common values are sex-like and return them.

proportion_text(arr[, min_spaces, min_length, ...])

Calculate proportion of natural language-like texts given criteria.

proportion_url(arr)

Use regex to find proportion of strings that are (web) URL-like.

sufficient_texts(arr[, min_spaces, min_length, ...])

Check for natural language-like texts using criteria like lengths, number of spaces.

Attributes

MAX_CARDINALITY

Maximum cardinality for categoricals (arrow's default is 50 in ConvertOptions).

TEXT_MIN_LENGTH

Strings need to be this long to be considered text.

TEXT_MIN_SPACES

Strings need to have this many spaces to be considered text.

TEXT_PROPORTION_THRESHOLD

Infer text type if a proportion of values greater than this is text-like.

TEXT_REJECT_LISTS

Whether to count list-like strings as texts.

class lector.types.strings.Category

Bases: lector.types.abc.Converter

Anything could be text, but we can enforce text-likeness and uniqueness.

max_cardinality: lector.utils.Number | None

convert(array)

To be implemented in subclasses.

Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None

class lector.types.strings.Sex

Bases: enum.Enum

Generic enumeration.

Derive from this class to define new enumerations.

Female = 0

Male = 1

class lector.types.strings.SexMapper(values, labels=None)

Infer values encoding a person’s sex in a column and map to configurable labels.

Parameters:
  • values (tuple[str, str]) –

  • labels (dict[Sex, str] | None) –

DEFAULT_VALUES

infer_values(values)

Infer which values encode female/male categories.

Parameters:

values (tuple[str, str]) –

Return type:

dict

make_mapping()

Create a mapping from inferred values to desired labels.

Return type:

dict[str, str]
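The two-step flow described for SexMapper (infer which raw value encodes which sex, then map raw values to labels) might look roughly like the sketch below. It is a dependency-free illustration and not lector's implementation: the `Sex` stand-in enum, the `FEMALE_HINTS` and `MALE_HINTS` sets, the default labels, and the helper names `infer_values_sketch` and `make_mapping_sketch` are all assumptions introduced for this example.

```python
from enum import Enum


class Sex(Enum):
    # Stand-in mirroring the documented enum members.
    Female = 0
    Male = 1


# Hypothetical hint sets; the values lector actually recognises may differ.
FEMALE_HINTS = {"f", "female", "w", "woman"}
MALE_HINTS = {"m", "male", "man"}


def infer_values_sketch(values):
    """Guess which of two raw values encodes Female and which encodes Male."""
    inferred = {}
    for value in values:
        key = value.strip().lower()
        if key in FEMALE_HINTS:
            inferred[Sex.Female] = value
        elif key in MALE_HINTS:
            inferred[Sex.Male] = value
    if len(inferred) != 2:
        raise ValueError(f"Cannot infer sex encoding from {values!r}")
    return inferred


def make_mapping_sketch(values, labels=None):
    """Map each raw value to its desired label (default labels are assumed)."""
    labels = labels or {Sex.Female: "Female", Sex.Male: "Male"}
    inferred = infer_values_sketch(values)
    return {raw: labels[sex] for sex, raw in inferred.items()}


mapping = make_mapping_sketch(("M", "f"))
print(mapping)
```

Passing a custom `labels` dict keyed by `Sex` members would change only the values of the resulting mapping, which is the "configurable labels" part of the class description.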

class lector.types.strings.Text

Bases: lector.types.abc.Converter

Anything could be text, but we can enforce text-likeness and uniqueness.

min_unique: float = 0.1

convert(array)

To be implemented in subclasses.

Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None

class lector.types.strings.Url

Bases: lector.types.abc.Converter

Anything could be text, but we can enforce text-likeness and uniqueness.

convert(array)

To be implemented in subclasses.

Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None

lector.types.strings.is_text(arr, min_spaces=TEXT_MIN_SPACES, min_length=TEXT_MIN_LENGTH, reject_lists=TEXT_REJECT_LISTS)

Check for natural language-like texts using criteria like lengths, number of spaces.

Parameters:
  • arr (pyarrow.Array) –

  • min_spaces (int) –

  • min_length (int) –

  • reject_lists (bool) –

Return type:

bool

lector.types.strings.maybe_cast_category(arr, max_cardinality=MAX_CARDINALITY)

Cast to categorical depending on cardinality and whether strings are text-like.

Parameters:
  • arr (pyarrow.Array) –

  • max_cardinality (lector.utils.Number | None) –

Return type:

pyarrow.Array | None
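The cardinality half of this decision can be illustrated with plain Python lists standing in for pyarrow.Array. The sketch below is not lector's code, and it makes an explicit assumption about how a `Number` threshold is read: a float below 1 is taken as a maximum proportion of unique values, any other number as a maximum count, and `None` as "no limit". The helper name `cardinality_ok` is hypothetical, and the text-likeness check is left out.

```python
def cardinality_ok(values, max_cardinality=0.1):
    """Decide whether a column has few enough unique values to be categorical.

    Assumption for this sketch: a float below 1 is a maximum proportion of
    unique values, any other number is a maximum count of unique values, and
    None disables the check.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    if max_cardinality is None:
        return True
    n_unique = len(set(non_null))
    if isinstance(max_cardinality, float) and max_cardinality < 1:
        return n_unique / len(non_null) <= max_cardinality
    return n_unique <= max_cardinality


colours = ["red", "green", "blue"] * 20  # 3 unique values in 60 rows
ids = [f"id-{i}" for i in range(60)]     # 60 unique values in 60 rows

print(cardinality_ok(colours), cardinality_ok(ids))
```

Under these assumptions a repetitive column such as `colours` qualifies at the default threshold, while a column of unique identifiers does not.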

lector.types.strings.maybe_sex(arr)

Check if the two most common values are sex-like and return them.

Parameters:

arr (pyarrow.Array) –

Return type:

tuple[str, str] | None
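A minimal sketch of the "two most common values" check, using a plain list in place of pyarrow.Array. The `SEX_LIKE` token set and the function name `maybe_sex_sketch` are assumptions made for illustration; the values lector treats as sex-like may differ.

```python
from collections import Counter

# Hypothetical set of sex-like tokens; lector's own list may differ.
SEX_LIKE = {"f", "m", "female", "male", "w", "woman", "man"}


def maybe_sex_sketch(values):
    """Return the two most common values if both look sex-like, else None."""
    counts = Counter(v for v in values if v is not None)
    top_two = [value for value, _ in counts.most_common(2)]
    if len(top_two) < 2:
        return None
    if all(v.strip().lower() in SEX_LIKE for v in top_two):
        return tuple(top_two)
    return None


column = ["F", "M", "F", "F", "M", None, "unknown"]
print(maybe_sex_sketch(column))
```

Looking only at the two most frequent values lets rare outliers such as "unknown" or nulls coexist with an otherwise binary encoding.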

lector.types.strings.proportion_text(arr, min_spaces=TEXT_MIN_SPACES, min_length=TEXT_MIN_LENGTH, reject_lists=TEXT_REJECT_LISTS)

Calculate proportion of natural language-like texts given criteria.

Parameters:
  • arr (pyarrow.Array) –

  • min_spaces (int) –

  • min_length (int) –

  • reject_lists (bool) –

Return type:

float
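To make the criteria concrete, here is a small pure-Python sketch that applies a length, space-count and list-likeness test to each string and reports the passing fraction. The defaults mirror the documented module constants (2 spaces, 15 characters, reject lists), but the helper names and the reading of "list-like" as "wrapped in square brackets" are assumptions, and lector's actual criteria may be stricter.

```python
def looks_like_text(s, min_spaces=2, min_length=15, reject_lists=True):
    """Apply the length, space-count and list-likeness criteria to one string."""
    if len(s) < min_length:
        return False
    if s.count(" ") < min_spaces:
        return False
    # Assumed meaning of "list-like": the string is wrapped in square brackets.
    if reject_lists and s.startswith("[") and s.endswith("]"):
        return False
    return True


def proportion_text_sketch(values, **criteria):
    """Fraction of non-null strings that satisfy all text criteria."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return sum(looks_like_text(v, **criteria) for v in non_null) / len(non_null)


reviews = [
    "The delivery was quick and painless.",
    "Would not buy this again, sorry.",
    "ok",
    "[first item, second item, third item]",
]
print(proportion_text_sketch(reviews))
```

Comparing the resulting fraction against a threshold such as TEXT_PROPORTION_THRESHOLD is one plausible way to decide whether a whole column counts as text.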

lector.types.strings.proportion_url(arr)

Use regex to find proportion of strings that are (web) URL-like.

Parameters:

arr (pyarrow.Array) –

Return type:

float
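The regex-based idea can be sketched with the standard library alone. The pattern below is deliberately simple and is an assumption of this example, not the expression lector uses: it accepts strings that start with `http://`, `https://` or `www.` and contain a dotted host, optionally followed by a path.

```python
import re

# Simplified, hypothetical URL pattern: scheme or "www.", dotted host, optional path.
URL_PATTERN = re.compile(r"^(?:https?://|www\.)[^\s/]+\.[^\s/]+(?:/\S*)?$", re.IGNORECASE)


def proportion_url_sketch(values):
    """Fraction of non-null strings that match the URL-like pattern."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return sum(bool(URL_PATTERN.match(v)) for v in non_null) / len(non_null)


links = [
    "https://example.com/path?q=1",
    "www.example.org",
    "not a url",
    "hello@example.com",
]
print(proportion_url_sketch(links))
```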

lector.types.strings.sufficient_texts(arr, min_spaces=TEXT_MIN_SPACES, min_length=TEXT_MIN_LENGTH, reject_lists=TEXT_REJECT_LISTS, threshold=1.0)

Check for natural language-like texts using criteria like lengths, number of spaces.

This differs from the other text checks in that the conditions are evaluated one at a time: as soon as one condition is not met, the function returns early without evaluating the remaining conditions, so it should be faster.

Parameters:
  • arr (pyarrow.Array) –

  • min_spaces (int) –

  • min_length (int) –

  • reject_lists (bool) –

  • threshold (float) –

Return type:

bool
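The early-out behaviour described here can be sketched as follows, again with a plain list standing in for pyarrow.Array and null-free input assumed. Each condition is applied only to the strings that passed the previous ones, and the function stops as soon as the surviving share drops below `threshold`. The function name, the order of conditions, and the reading of "list-like" as "wrapped in square brackets" are assumptions of this sketch.

```python
def sufficient_texts_sketch(values, min_spaces=2, min_length=15,
                            reject_lists=True, threshold=1.0):
    """Return True if at least `threshold` of the strings pass every condition.

    Conditions run one at a time over the strings that survived the previous
    ones; as soon as too few survive, the remaining conditions are skipped.
    """
    total = len(values)
    if total == 0:
        return False
    conditions = [
        lambda s: len(s) >= min_length,
        lambda s: s.count(" ") >= min_spaces,
    ]
    if reject_lists:
        # Assumed meaning of "list-like": wrapped in square brackets.
        conditions.append(lambda s: not (s.startswith("[") and s.endswith("]")))

    survivors = list(values)
    for condition in conditions:
        survivors = [s for s in survivors if condition(s)]
        if len(survivors) / total < threshold:
            return False  # early out: later conditions are never evaluated
    return True


texts = [
    "a fairly long sentence with spaces",
    "another sentence that is long enough",
    "short",
]
print(sufficient_texts_sketch(texts, threshold=0.6))
```

With the default `threshold=1.0` a single failing string is enough to return early after the first condition, which is where the speed advantage over computing a full proportion would come from.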

lector.types.strings.MAX_CARDINALITY: lector.utils.Number = 0.1

Maximum cardinality for categoricals (arrow’s default is 50, set via auto_dict_max_cardinality in ConvertOptions).

lector.types.strings.TEXT_MIN_LENGTH: lector.utils.Number = 15

Strings need to be this long to be considered text.

lector.types.strings.TEXT_MIN_SPACES: lector.utils.Number = 2

Strings need to have this many spaces to be considered text.

lector.types.strings.TEXT_PROPORTION_THRESHOLD: float = 0.8

Infer text type if a proportion of values greater than this is text-like.

lector.types.strings.TEXT_REJECT_LISTS: bool = True

Whether to count list-like strings as texts.