lector.utils

Common helpers to work with pyarrow objects.

Classes

Timer

Functions

categories(array)

Returns an array containing categories in input array of dictionary type.

decode_metadata(d)

Decode Arrow metadata to dict.

dtype_name(arr)

Return a pandas-compatible type name including extension types where possible.

empty_to_null(arr)

Convert empty strings to null values.

encode_metadata(d)

JSON-encode a dict to bytes, as Arrow expects for its metadata.

is_stringy(type)

Check if type is stringy (string or dictionary of strings).

map_values(arr, map[, unknown])

Slow value mapping in pure Python while Arrow doesn't have a native compute function.

min_max(arr[, skip_nulls])

Wrapper to get minimum and maximum in arrow array as python tuple.

proportion_equal(arr1, arr2[, ignore_nulls])

Proportion of equal values, optionally ignoring nulls (which otherwise compare falsish).

proportion_trueish(arr)

proportion_unique(arr)

Proportion of non-null values that are unique in array.

proportion_valid(arr)

Proportion of non-null values in array.

reset_buffer(buffer)

Caches and resets buffer position.

schema_diff(s1, s2)

Check differences in schema's column types.

smallest_int_type(vmin, vmax)

Find the smallest int type able to hold vmin and vmax.

sorted_value_counts(arr[, order, top_n])

Value counts sorted by count, since Arrow's built-in value count doesn't allow sorting.

to_pandas(array)

Proper conversion allowing pandas extension types.

uniquify(items)

Add suffixes to input strings where necessary to ensure each item is unique.

with_flatten(arr, func)

Apply a compute function to all elements of flattened (and restored) lists.

Attributes

INT_LIMITS

Minimum and maximum for each integer subtype.

Limit

MISSING_STRINGS

Extension of pandas and arrow default missing values.

Number

PANDAS_INSTALLED

class lector.utils.Timer
__enter__()
__exit__(type, value, traceback)
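
Only `__enter__` and `__exit__` are listed for `Timer`, which suggests it is used as a context manager. The following is a minimal sketch of such a timer, not lector's implementation: the class name `SketchTimer` and the attributes `start` and `elapsed` are hypothetical and chosen only for illustration.

```python
import time


class SketchTimer:
    """Hypothetical stand-in for a context-manager timer (not lector's code)."""

    def __enter__(self):
        # Record the start time when the with-block is entered.
        self.start = time.perf_counter()
        return self

    def __exit__(self, type, value, traceback):
        # Store elapsed seconds; returning False lets exceptions propagate.
        self.elapsed = time.perf_counter() - self.start
        return False


with SketchTimer() as timer:
    total = sum(range(1000))

print(f"summing took {timer.elapsed:.6f} s")
```
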
lector.utils.categories(array)

Returns an array containing categories in input array of dictionary type.

Parameters:

array (pyarrow.Array | pyarrow.ChunkedArray) –

Return type:

pyarrow.Array

lector.utils.decode_metadata(d)

Decode Arrow metadata to dict.

Parameters:

d (dict) –

lector.utils.dtype_name(arr)

Return a pandas-compatible type name including extension types where possible.

Parameters:

arr (pyarrow.Array) –

lector.utils.empty_to_null(arr)

Convert empty strings to null values.

Parameters:

arr (pyarrow.Array) –

Return type:

pyarrow.Array

lector.utils.encode_metadata(d)

JSON-encode a dict to bytes, as Arrow expects for its metadata.

Parameters:

d (dict) –
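
Arrow stores schema and field metadata as byte keys mapped to byte values, which is why an encode/decode pair is needed. The sketch below shows one plausible round trip using only the standard library. It is an assumption about the encoding (UTF-8 keys, JSON-encoded values), and the names `encode_metadata_sketch` and `decode_metadata_sketch` are hypothetical; lector's own functions may differ in detail.

```python
import json


def encode_metadata_sketch(d: dict) -> dict:
    # Assumption: keys become UTF-8 bytes, values become JSON-encoded bytes.
    return {str(k).encode("utf-8"): json.dumps(v).encode("utf-8") for k, v in d.items()}


def decode_metadata_sketch(d: dict) -> dict:
    # Inverse of the sketch above: bytes back to str keys and Python values.
    return {k.decode("utf-8"): json.loads(v.decode("utf-8")) for k, v in d.items()}


meta = {"source": "survey.csv", "columns": 3, "tags": ["a", "b"]}
encoded = encode_metadata_sketch(meta)
decoded = decode_metadata_sketch(encoded)
```

Encoding values as JSON rather than plain strings keeps numbers and lists intact across the round trip.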

lector.utils.is_stringy(type)

Check if type is stringy (string or dictionary of strings).

Parameters:

type (pyarrow.DataType) –

Return type:

bool

lector.utils.map_values(arr, map, unknown='keep')

Slow value mapping in pure Python while Arrow doesn’t have a native compute function.

For now assumes type can be left unchanged.

Parameters:
  • arr (pyarrow.Array) –

  • map (dict) –

  • unknown (str) –

Return type:

pyarrow.Array
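
To illustrate the idea of a pure-Python value mapping, here is a sketch that uses a plain list (with `None` standing in for null) instead of an Arrow array. The function name `map_values_sketch` is hypothetical. Only `unknown='keep'` appears in the signature above; replacing unmapped values with null for any other setting is an assumption made for this example.

```python
def map_values_sketch(values, mapping, unknown="keep"):
    """Map values through a dict; a plain list stands in for an Arrow array."""
    result = []
    for value in values:
        if value is None:
            result.append(None)            # nulls stay null
        elif value in mapping:
            result.append(mapping[value])  # known value: replace
        elif unknown == "keep":
            result.append(value)           # unknown value: keep unchanged
        else:
            result.append(None)            # assumption: otherwise map to null
    return result


kept = map_values_sketch(["y", "n", None, "maybe"], {"y": "yes", "n": "no"})
nulled = map_values_sketch(["y", "maybe"], {"y": "yes"}, unknown="null")
```
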

lector.utils.min_max(arr, skip_nulls=True)

Wrapper to get minimum and maximum in arrow array as python tuple.

Parameters:
  • arr (pyarrow.Array) –

  • skip_nulls (bool) –

Return type:

tuple[Number, Number]

lector.utils.proportion_equal(arr1, arr2, ignore_nulls=True)

Proportion of equal values, optionally ignoring nulls (which otherwise compare falsish).

Parameters:
  • arr1 (pyarrow.Array) –

  • arr2 (pyarrow.Array) –

Return type:

float
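
The effect of ignoring nulls can be sketched with plain lists, where `None` stands in for an Arrow null. The name `proportion_equal_sketch` is hypothetical, and the exact denominator is an assumption: here a pair is dropped when either side is null if `ignore_nulls` is true, and otherwise counts as unequal.

```python
def proportion_equal_sketch(a, b, ignore_nulls=True):
    """Share of positions where a and b hold equal, non-null values."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(a, b))
    if ignore_nulls:
        # Assumption: drop a pair when either side is null.
        pairs = [(x, y) for x, y in pairs if x is not None and y is not None]
    if not pairs:
        return 0.0
    # A pair involving a null never counts as equal.
    equal = sum(1 for x, y in pairs if x is not None and y is not None and x == y)
    return equal / len(pairs)


left = [1, 2, None, 4]
right = [1, 3, None, None]
ignoring = proportion_equal_sketch(left, right)                      # 1 of 2 pairs
counting = proportion_equal_sketch(left, right, ignore_nulls=False)  # 1 of 4 pairs
```
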

lector.utils.proportion_trueish(arr)

Parameters:

arr (pyarrow.Array) –

Return type:

float

lector.utils.proportion_unique(arr)

Proportion of non-null values that are unique in array.

Parameters:

arr (pyarrow.Array) –

Return type:

float

lector.utils.proportion_valid(arr)

Proportion of non-null values in array.

Parameters:

arr (pyarrow.Array) –

Return type:

float
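
The two proportions above can be illustrated on a plain list, with `None` standing in for null. This is a sketch under stated assumptions, not lector's code: the function names are hypothetical, "unique" is read as the number of distinct non-null values divided by the number of non-null values, and an empty input yields `0.0`.

```python
def proportion_valid_sketch(values):
    """Share of entries that are not null."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def proportion_unique_sketch(values):
    """Distinct non-null values relative to all non-null values (assumed reading)."""
    valid = [v for v in values if v is not None]
    if not valid:
        return 0.0
    return len(set(valid)) / len(valid)


values = ["a", "b", "a", None]
valid_share = proportion_valid_sketch(values)    # 3 of 4 entries are non-null
unique_share = proportion_unique_sketch(values)  # 2 distinct among 3 non-null
```
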

lector.utils.reset_buffer(buffer)

Caches and resets buffer position.
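
The entry does not say how `reset_buffer` is used. One common way to cache and restore a stream position is a context manager, sketched below with the standard library. The name `reset_buffer_sketch` and the context-manager form are assumptions; lector's function may instead be a decorator or a plain helper.

```python
import io
from contextlib import contextmanager


@contextmanager
def reset_buffer_sketch(buffer):
    # Cache the current position before the caller reads from the buffer.
    position = buffer.tell()
    try:
        yield buffer
    finally:
        # Restore the cached position, even if the caller raised.
        buffer.seek(position)


buf = io.BytesIO(b"a,b\n1,2\n")
with reset_buffer_sketch(buf) as stream:
    header = stream.readline()  # peek at the first line
position_after = buf.tell()     # back at the start
```

This pattern lets code inspect the beginning of a file-like object (for example, to guess its format) without consuming it for later readers.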

lector.utils.schema_diff(s1, s2)

Check differences in schema’s column types.

Parameters:
  • s1 (pyarrow.Schema) –

  • s2 (pyarrow.Schema) –

Return type:

dict[str, tuple[pyarrow.DataType, pyarrow.DataType]]
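
The shape of the return value (column name mapped to a pair of types) can be sketched without pyarrow by letting plain dicts of column name to type name stand in for `pyarrow.Schema` objects. The name `schema_diff_sketch` is hypothetical, and comparing only columns present in both schemas is an assumption.

```python
def schema_diff_sketch(s1: dict, s2: dict) -> dict:
    """Return {column: (type in s1, type in s2)} for columns whose types differ."""
    return {
        name: (type1, s2[name])
        for name, type1 in s1.items()
        # Assumption: columns missing from s2 are skipped rather than reported.
        if name in s2 and s2[name] != type1
    }


before = {"id": "int64", "price": "string", "name": "string"}
after = {"id": "int64", "price": "double", "name": "string", "extra": "bool"}
diff = schema_diff_sketch(before, after)
```
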

lector.utils.smallest_int_type(vmin, vmax)

Find the smallest int type able to hold vmin and vmax.

Parameters:
  • vmin –

  • vmax –

Return type:

str | None
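
A search for the smallest fitting integer type can be sketched as a lookup over per-type limits. Everything named here is an assumption for illustration: the table `INT_LIMITS_SKETCH`, the type names such as `"int8"`, and the preference order (narrowest width first, signed before unsigned) are not taken from lector.

```python
def build_int_limits():
    """Build {type name: (min, max)} ordered from narrowest to widest."""
    limits = {}
    for bits in (8, 16, 32, 64):
        limits[f"int{bits}"] = (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        limits[f"uint{bits}"] = (0, 2 ** bits - 1)
    return limits


INT_LIMITS_SKETCH = build_int_limits()


def smallest_int_type_sketch(vmin, vmax):
    # Dicts preserve insertion order, so the first match is the narrowest type.
    for name, (low, high) in INT_LIMITS_SKETCH.items():
        if low <= vmin and vmax <= high:
            return name
    return None  # no integer type can hold the range


small = smallest_int_type_sketch(0, 100)         # fits in int8
unsigned = smallest_int_type_sketch(0, 200)      # needs uint8
signed = smallest_int_type_sketch(-1, 200)       # needs int16
too_big = smallest_int_type_sketch(0, 2 ** 64)   # exceeds uint64
```
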

lector.utils.sorted_value_counts(arr, order='descending', top_n=None)

Value counts sorted by count, since Arrow’s built-in value count doesn’t allow sorting.

Parameters:
  • arr (pyarrow.Array) –

  • order (str) –

  • top_n (int | None) –

Return type:

pyarrow.Array
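
The behaviour suggested by the `order` and `top_n` parameters can be sketched with `collections.Counter` on a plain list. This is not lector's implementation: the name `sorted_value_counts_sketch` is hypothetical, the result is a list of `(value, count)` tuples rather than an Arrow array, and treating any `order` other than `'descending'` as ascending is an assumption.

```python
from collections import Counter


def sorted_value_counts_sketch(values, order="descending", top_n=None):
    """Count values and sort the (value, count) pairs by count."""
    counts = Counter(values)
    # sorted() is stable, so ties keep their first-seen order.
    pairs = sorted(counts.items(), key=lambda pair: pair[1], reverse=(order == "descending"))
    return pairs if top_n is None else pairs[:top_n]


letters = ["a", "b", "a", "c", "a", "b"]
descending = sorted_value_counts_sketch(letters)
top_two = sorted_value_counts_sketch(letters, top_n=2)
ascending = sorted_value_counts_sketch(letters, order="ascending")
```
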

lector.utils.to_pandas(array)

Proper conversion allowing pandas extension types.

Parameters:

array (pyarrow.Array) –

lector.utils.uniquify(items)

Add suffixes to input strings where necessary to ensure each item is unique.

Parameters:

items (collections.abc.Sequence[str]) –

Return type:

collections.abc.Iterator[str]
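
A typical use for this kind of helper is deduplicating column names. The generator below is a minimal sketch matching the documented signature (a sequence of strings in, an iterator of strings out). The name `uniquify_sketch` and the suffix format `_1`, `_2`, ... are assumptions; lector may use a different scheme.

```python
from collections.abc import Iterator, Sequence


def uniquify_sketch(items: Sequence[str]) -> Iterator[str]:
    """Yield each item, adding a numeric suffix when the name was already used."""
    seen: set[str] = set()
    for item in items:
        candidate, counter = item, 0
        # Keep incrementing the suffix until the candidate is unused.
        while candidate in seen:
            counter += 1
            candidate = f"{item}_{counter}"
        seen.add(candidate)
        yield candidate


names = list(uniquify_sketch(["a", "b", "a", "a", "a_1"]))
```

Checking candidates against everything already yielded avoids a collision when an input such as `"a_1"` matches a suffix generated earlier.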

lector.utils.with_flatten(arr, func)

Apply a compute function to all elements of flattened (and restored) lists.

Parameters:
  • arr (pyarrow.Array) –

  • func (Callable) –
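
The flatten, apply, restore pattern can be sketched with nested Python lists in place of an Arrow list array. The name `with_flatten_sketch` is hypothetical, and the sketch assumes `func` returns one output per input element and that no inner list is null.

```python
def with_flatten_sketch(lists, func):
    """Apply func to the flattened elements, then rebuild the original nesting."""
    lengths = [len(inner) for inner in lists]            # remember list boundaries
    flat = [value for inner in lists for value in inner]
    mapped = func(flat)                                   # one call on all elements
    restored, start = [], 0
    for length in lengths:
        restored.append(mapped[start:start + length])
        start += length
    return restored


nested = [["a", "b"], [], ["c"]]
upper = with_flatten_sketch(nested, lambda xs: [x.upper() for x in xs])
```

Calling the function once on the flattened values rather than once per inner list is what makes the pattern attractive for vectorized compute functions.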

lector.utils.INT_LIMITS: dict[str, Limit]

Minimum and maximum for each integer subtype.

lector.utils.Limit

lector.utils.MISSING_STRINGS: set[str]

Extension of pandas and arrow default missing values.

lector.utils.Number

lector.utils.PANDAS_INSTALLED = True