`lector.utils`#

Common helpers for working with pyarrow objects.

Classes#

Timer

Functions#

`categories`(array)	Returns an array containing categories in input array of dictionary type.
`decode_metadata`(d)	Decode Arrow metadata to dict.
`dtype_name`(arr)	Return a pandas-compatible type name including extension types where possible.
`empty_to_null`(arr)	Convert empty strings to null values.
`encode_metadata`(d)	Json-byte-encode a dict, like Arrow expects its metadata.
`is_stringy`(type)	Check if type is stringy (string or dictionary of strings).
`map_values`(arr, map[, unknown])	Slow value mapping in pure Python while Arrow doesn't have a native compute function.
`min_max`(arr[, skip_nulls])	Wrapper to get minimum and maximum in arrow array as python tuple.
`proportion_equal`(arr1, arr2[, ignore_nulls])	Proportion of equal values, optionally ignoring nulls (which otherwise compare falsish).
`proportion_trueish`(arr)
`proportion_unique`(arr)	Proportion of non-null values that are unique in array.
`proportion_valid`(arr)	Proportion of non-null values in array.
`reset_buffer`(buffer)	Caches and resets buffer position.
`schema_diff`(s1, s2)	Check differences between the column types of two schemas.
`smallest_int_type`(vmin, vmax)	Find the smallest int type able to hold vmin and vmax.
`sorted_value_counts`(arr[, order, top_n])	Arrow's built-in value count doesn't allow sorting.
`to_pandas`(array)	Proper conversion allowing pandas extension types.
`uniquify`(items)	Add suffixes to input strings if necessary to ensure each item is unique.
`with_flatten`(arr, func)	Apply a compute function to all elements of flattened (and restored) lists.

Attributes#

`INT_LIMITS`	Minimum and maximum for each integer subtype.
`Limit`
`MISSING_STRINGS`	Extension of pandas and arrow default missing values.
`Number`
`PANDAS_INSTALLED`

class lector.utils.Timer[source]#

__enter__()[source]#

__exit__(type, value, traceback)[source]#
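The reference lists only `__enter__` and `__exit__`, so `Timer` is presumably meant to be used as a context manager. The following is a minimal sketch of that pattern, not the library's implementation: the class name `SketchTimer` and the attributes `start` and `elapsed` are assumptions made for illustration.

```python
import time


class SketchTimer:
    """Hypothetical stand-in for a timing context manager."""

    def __enter__(self):
        # Record the start time when the block is entered.
        self.start = time.perf_counter()
        return self

    def __exit__(self, type, value, traceback):
        # Store the elapsed seconds; returning False lets exceptions propagate.
        self.elapsed = time.perf_counter() - self.start
        return False


with SketchTimer() as t:
    total = sum(range(100_000))
```

After the block exits, `t.elapsed` holds the wall-clock duration in seconds under these assumptions.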

lector.utils.categories(array)[source]#

Returns an array containing categories in input array of dictionary type.

Parameters:: array (pyarrow.Array | pyarrow.ChunkedArray) –
Return type:: pyarrow.Array

lector.utils.decode_metadata(d)[source]#

Decode Arrow metadata to dict.

Parameters:: d (dict) –

lector.utils.dtype_name(arr)[source]#

Return a pandas-compatible type name including extension types where possible.

Parameters:: arr (pyarrow.Array) –

lector.utils.empty_to_null(arr)[source]#

Convert empty strings to null values.

Parameters:: arr (pyarrow.Array) –
Return type:: pyarrow.Array

lector.utils.encode_metadata(d)[source]#

Json-byte-encode a dict, like Arrow expects its metadata.

Parameters:: d (dict) –
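Arrow stores schema and field metadata as byte-string keys and values, which is why a plain dict has to be encoded first. The sketch below shows one plausible round trip using only the standard library. The function names `encode_sketch` and `decode_sketch` are hypothetical, and JSON-encoding every value is an assumption about what "Json-byte-encode" means here.

```python
import json


def encode_sketch(d):
    # Assumption: keys become UTF-8 bytes, values become JSON-encoded bytes.
    return {
        str(k).encode("utf-8"): json.dumps(v).encode("utf-8")
        for k, v in d.items()
    }


def decode_sketch(d):
    # Inverse of encode_sketch: bytes back to str keys and Python values.
    return {
        k.decode("utf-8"): json.loads(v.decode("utf-8"))
        for k, v in d.items()
    }


meta = {"source": "survey.csv", "columns": 3}
encoded = encode_sketch(meta)
decoded = decode_sketch(encoded)
```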

lector.utils.is_stringy(type)[source]#

Check if type is stringy (string or dictionary of strings).

Parameters:: type (pyarrow.DataType) –
Return type:: bool

lector.utils.map_values(arr, map, unknown='keep')[source]#

Slow value mapping in pure Python while Arrow doesn’t have a native compute function.

For now assumes type can be left unchanged.

Parameters:

arr (pyarrow.Array) –
map (dict) –
unknown (str) –

Return type:

pyarrow.Array

lector.utils.min_max(arr, skip_nulls=True)[source]#

Wrapper to get minimum and maximum in arrow array as python tuple.

Parameters:

arr (pyarrow.Array) –
skip_nulls (bool) –

Return type:

tuple[Number, Number]

lector.utils.proportion_equal(arr1, arr2, ignore_nulls=True)[source]#

Proportion of equal values, optionally ignoring nulls (which otherwise compare falsish).

Parameters:

arr1 (pyarrow.Array) –
arr2 (pyarrow.Array) –
ignore_nulls (bool) –

Return type:

float

lector.utils.proportion_trueish(arr)[source]#

Parameters:: arr (pyarrow.Array) –
Return type:: float

lector.utils.proportion_unique(arr)[source]#

Proportion of non-null values that are unique in array.

Parameters:: arr (pyarrow.Array) –
Return type:: float

lector.utils.proportion_valid(arr)[source]#

Proportion of non-null values in array.

Parameters:: arr (pyarrow.Array) –
Return type:: float

lector.utils.reset_buffer(buffer)[source]#: Caches and resets buffer position.
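The one-line description leaves the exact behaviour open. One plausible reading is a context manager that remembers the current position, rewinds to the start, and restores the position afterwards. The sketch below illustrates that reading only; `reset_buffer_sketch` is a hypothetical name and the rewind-then-restore order is an assumption.

```python
import io
from contextlib import contextmanager


@contextmanager
def reset_buffer_sketch(buffer):
    cached = buffer.tell()   # remember where the caller was
    buffer.seek(0)           # rewind so the block reads from the start
    try:
        yield buffer
    finally:
        buffer.seek(cached)  # restore the caller's position


buf = io.BytesIO(b"a,b\n1,2\n")
buf.read(4)                  # caller has consumed the header line
with reset_buffer_sketch(buf) as b:
    head = b.read(3)         # peek at the beginning again
pos = buf.tell()
```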

lector.utils.schema_diff(s1, s2)[source]#

Check differences between the column types of two schemas.

Parameters:

s1 (pyarrow.Schema) –
s2 (pyarrow.Schema) –

Return type:

dict[str, tuple[pyarrow.DataType, pyarrow.DataType]]

lector.utils.smallest_int_type(vmin, vmax)[source]#

Find the smallest int type able to hold vmin and vmax.

Parameters:

vmin (Number) –
vmax (Number) –

Return type:

str | None
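The search can be illustrated with a small table of integer limits, in the spirit of `INT_LIMITS`. This is a sketch under assumptions: the table `LIMITS`, the function name `smallest_int_type_sketch`, and the preference for unsigned over signed types of the same width are all illustrative choices, not confirmed behaviour.

```python
# Limits per integer type, ordered by width; unsigned is tried first (assumption).
LIMITS = {}
for bits in (8, 16, 32, 64):
    LIMITS[f"uint{bits}"] = (0, 2**bits - 1)
    LIMITS[f"int{bits}"] = (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)


def smallest_int_type_sketch(vmin, vmax):
    # Return the first type whose range covers [vmin, vmax], else None.
    for name, (lo, hi) in LIMITS.items():
        if lo <= vmin and vmax <= hi:
            return name
    return None


small = smallest_int_type_sketch(0, 200)
signed = smallest_int_type_sketch(-1, 200)
too_big = smallest_int_type_sketch(0, 2**64)
```

Returning `None` when no type fits matches the documented `str | None` return type.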

lector.utils.sorted_value_counts(arr, order='descending', top_n=None)[source]#

Arrow’s built-in value count doesn’t allow sorting.

Parameters:

arr (pyarrow.Array) –
order (str) –
top_n (int | None) –

Return type:

pyarrow.Array

lector.utils.to_pandas(array)[source]#

Proper conversion allowing pandas extension types.

Parameters:: array (pyarrow.Array) –

lector.utils.uniquify(items)[source]#

Add suffixes to input strings if necessary to ensure each item is unique.

Parameters:: items (collections.abc.Sequence[str]) –
Return type:: collections.abc.Iterator[str]
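The documented types (a sequence of strings in, an iterator of strings out) suggest a generator. A minimal sketch of the idea follows. The name `uniquify_sketch` and the `_1`, `_2` suffix format are assumptions, and this version does not guard against a generated name colliding with a literal input such as `a_1`.

```python
from collections import Counter


def uniquify_sketch(items):
    seen = Counter()
    for item in items:
        n = seen[item]      # how many times this item has appeared before
        seen[item] += 1
        # First occurrence is kept as-is; repeats get a numeric suffix.
        yield item if n == 0 else f"{item}_{n}"


names = list(uniquify_sketch(["a", "b", "a", "a"]))
```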

lector.utils.with_flatten(arr, func)[source]#

Apply a compute function to all elements of flattened (and restored) lists.

Parameters:

arr (pyarrow.Array) –
func (Callable) –

lector.utils.INT_LIMITS: dict[str, Limit][source]#: Minimum and maximum for each integer subtype.

lector.utils.Limit[source]#

lector.utils.MISSING_STRINGS: set[str][source]#: Extension of pandas and arrow default missing values.

lector.utils.Number[source]#

lector.utils.PANDAS_INSTALLED = True[source]#
