lector.utils
Common helpers to work with pyarrow objects.
Classes
Functions
categories(array) | Returns an array containing the categories of an input array of dictionary type. |
Decode Arrow metadata to dict. |
dtype_name(arr) | Return a pandas-compatible type name, including extension types where possible. |
empty_to_null(arr) | Convert empty strings to null values. |
encode_metadata(d) | JSON-byte-encode a dict, as Arrow expects for its metadata. |
is_stringy(type) | Check if a type is stringy (string or dictionary of strings). |
map_values(arr, map, unknown='keep') | Slow value mapping in pure Python while Arrow doesn't have a native compute function. |
min_max(arr, skip_nulls=True) | Wrapper to get the minimum and maximum of an Arrow array as a Python tuple. |
proportion_equal(arr1, arr2, ignore_nulls=True) | Proportion of equal values, optionally ignoring nulls (which otherwise compare falsish). |
proportion_unique(arr) | Proportion of non-null values that are unique in array. |
proportion_valid(arr) | Proportion of non-null values in array. |
Caches and resets buffer position. |
schema_diff(s1, s2) | Check differences in schema's column types. |
smallest_int_type(vmin, vmax) | Find the smallest int type able to hold vmin and vmax. |
sorted_value_counts(arr, order='descending', top_n=None) | Sorted value counts, since Arrow's built-in value count doesn't allow sorting. |
to_pandas(array) | Proper conversion allowing pandas extension types. |
uniquify(items) | Add suffixes to input strings if necessary to ensure each item is unique. |
with_flatten(arr, func) | Apply a compute function to all elements of flattened (and restored) lists. |
Attributes
Minimum and maximum for each integer subtype. |
Extension of pandas and arrow default missing values. |
- lector.utils.categories(array)
Returns an array containing the categories of an input array of dictionary type.
- Parameters:
array (pyarrow.Array | pyarrow.ChunkedArray) –
- Return type:
pyarrow.Array
- lector.utils.dtype_name(arr)
Return a pandas-compatible type name, including extension types where possible.
- Parameters:
arr (pyarrow.Array) –
- lector.utils.empty_to_null(arr)
Convert empty strings to null values.
- Parameters:
arr (pyarrow.Array) –
- Return type:
pyarrow.Array
- lector.utils.encode_metadata(d)
JSON-byte-encode a dict, as Arrow expects for its metadata.
- Parameters:
d (dict) –
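As an illustration of JSON-byte-encoding, here is a minimal standard-library sketch. Arrow stores schema metadata as byte-string keys and values; the names `encode_meta_sketch` and `decode_meta_sketch` are hypothetical, and the exact layout lector uses (for example, whether keys are also JSON-encoded) is an assumption.

```python
import json


def encode_meta_sketch(d: dict) -> dict:
    """Encode keys as UTF-8 bytes and values as JSON bytes (assumed layout)."""
    return {str(k).encode("utf-8"): json.dumps(v).encode("utf-8") for k, v in d.items()}


def decode_meta_sketch(meta: dict) -> dict:
    """Inverse of encode_meta_sketch: bytes back to plain Python objects."""
    return {k.decode("utf-8"): json.loads(v.decode("utf-8")) for k, v in meta.items()}


encoded = encode_meta_sketch({"semantic": "date", "levels": [1, 2, 3]})
decoded = decode_meta_sketch(encoded)  # round-trips to the original dict
```

JSON-encoding the values lets nested structures such as lists survive the round trip, which plain `str()` conversion would not.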
- lector.utils.is_stringy(type)
Check if a type is stringy (string or dictionary of strings).
- Parameters:
type (pyarrow.DataType) –
- Return type:
bool
- lector.utils.map_values(arr, map, unknown='keep')
Slow value mapping in pure Python while Arrow doesn’t have a native compute function.
For now, assumes the array's type can be left unchanged.
- Parameters:
arr (pyarrow.Array) –
map (dict) –
unknown (str) –
- Return type:
pyarrow.Array
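The mapping behaviour with `unknown='keep'` might be sketched in pure Python as follows. A plain list with `None` for nulls stands in for the Arrow array, `map_values_sketch` is a hypothetical name, and only the `'keep'` mode is shown because other modes are not described here.

```python
def map_values_sketch(values, mapping, unknown="keep"):
    """Map each value through `mapping`; plain lists stand in for Arrow arrays."""
    if unknown != "keep":
        # Other modes are not described in this section, so the sketch refuses them.
        raise ValueError("only unknown='keep' is sketched here")
    # Nulls (None) and values missing from the mapping pass through unchanged.
    return [mapping.get(v, v) if v is not None else None for v in values]


mapped = map_values_sketch(["a", "b", None, "c"], {"a": "x", "b": "y"})
```

Because unmapped values are kept as they are, the result stays in the input's type as long as the mapping's targets share that type.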
- lector.utils.min_max(arr, skip_nulls=True)
Wrapper to get the minimum and maximum of an Arrow array as a Python tuple.
- lector.utils.proportion_equal(arr1, arr2, ignore_nulls=True)
Proportion of equal values, optionally ignoring nulls (which otherwise compare falsish).
- Parameters:
arr1 (pyarrow.Array) –
arr2 (pyarrow.Array) –
- Return type:
float
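A pure-Python sketch of the two null-handling modes follows. It assumes that "ignoring nulls" means dropping every position where either side is null, and that an empty comparison yields `0.0`; both are assumptions, and `proportion_equal_sketch` is a hypothetical name.

```python
def proportion_equal_sketch(left, right, ignore_nulls=True):
    """Share of positions where two equal-length sequences hold the same value."""
    if len(left) != len(right):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(left, right))
    if ignore_nulls:
        # Assumption: a position is dropped when either side is null.
        pairs = [(a, b) for a, b in pairs if a is not None and b is not None]
    if not pairs:
        return 0.0  # assumed convention for an empty comparison
    # A null on either side never counts as equal ("falsish" comparison).
    equal = sum(1 for a, b in pairs if a is not None and b is not None and a == b)
    return equal / len(pairs)


a = [1, 2, None, 4]
b = [1, 3, None, 4]
ignoring = proportion_equal_sketch(a, b)                      # null position dropped
counting = proportion_equal_sketch(a, b, ignore_nulls=False)  # null position counts as unequal
```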
- lector.utils.proportion_unique(arr)
Proportion of non-null values that are unique in array.
- Parameters:
arr (pyarrow.Array) –
- Return type:
float
- lector.utils.proportion_valid(arr)
Proportion of non-null values in array.
- Parameters:
arr (pyarrow.Array) –
- Return type:
float
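The two proportions above differ only in their denominators, which a small pure-Python sketch can make concrete. It assumes "unique" means the number of distinct non-null values divided by the number of non-null values; the function names are hypothetical, and lists with `None` stand in for Arrow arrays.

```python
def proportion_valid_sketch(values):
    """Non-null values divided by all values."""
    if not values:
        return 0.0  # assumed convention for empty input
    return sum(v is not None for v in values) / len(values)


def proportion_unique_sketch(values):
    """Distinct non-null values divided by the number of non-null values."""
    valid = [v for v in values if v is not None]
    if not valid:
        return 0.0  # assumed convention when everything is null
    return len(set(valid)) / len(valid)


col = ["a", "b", "a", None]
valid_share = proportion_valid_sketch(col)    # 3 of 4 values are non-null
unique_share = proportion_unique_sketch(col)  # 2 distinct among 3 non-null
```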
- lector.utils.schema_diff(s1, s2)
Check differences in schema’s column types.
- Parameters:
s1 (pyarrow.Schema) –
s2 (pyarrow.Schema) –
- Return type:
dict[str, tuple[pyarrow.DataType, pyarrow.DataType]]
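The shape of the returned mapping can be sketched with plain dicts standing in for schemas (column name to type name). This assumes only columns present in both schemas are compared; `schema_diff_sketch` is a hypothetical name, not the library's implementation.

```python
def schema_diff_sketch(s1, s2):
    """Columns present in both schemas whose types differ, as {name: (type1, type2)}."""
    return {
        name: (s1[name], s2[name])
        for name in s1
        if name in s2 and s1[name] != s2[name]
    }


before = {"id": "int64", "price": "string", "city": "string"}
after = {"id": "int64", "price": "double", "city": "string"}
diff = schema_diff_sketch(before, after)
```

An empty result then means the shared columns agree on their types.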
- lector.utils.smallest_int_type(vmin, vmax)
Find the smallest int type able to hold vmin and vmax.
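One way to find such a type is to scan a table of integer ranges from narrowest to widest, as in this pure-Python sketch. The table `INT_RANGES_SKETCH`, the type-name strings, and the choice to try the signed type before the unsigned one at each width are all assumptions for illustration.

```python
# Range table built from bit widths; signed is tried before unsigned (assumed order).
INT_RANGES_SKETCH = {}
for bits in (8, 16, 32, 64):
    INT_RANGES_SKETCH[f"int{bits}"] = (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    INT_RANGES_SKETCH[f"uint{bits}"] = (0, 2**bits - 1)


def smallest_int_type_sketch(vmin, vmax):
    """Name of the first type whose range covers [vmin, vmax], or None if none fits."""
    for name, (lo, hi) in INT_RANGES_SKETCH.items():
        if lo <= vmin and vmax <= hi:
            return name
    return None


small = smallest_int_type_sketch(0, 100)     # fits the signed 8-bit range
unsigned = smallest_int_type_sketch(0, 200)  # too large for int8, fits uint8
mixed = smallest_int_type_sketch(-1, 200)    # negative bound rules out uint8
```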
- lector.utils.sorted_value_counts(arr, order='descending', top_n=None)
Sorted value counts, since Arrow’s built-in value count doesn’t allow sorting.
- Parameters:
arr (pyarrow.Array) –
order (str) –
top_n (int | None) –
- Return type:
pyarrow.Array
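The effect of `order` and `top_n` can be sketched with the standard library. This version returns a list of `(value, count)` tuples rather than an Arrow array, and `sorted_value_counts_sketch` is a hypothetical name.

```python
from collections import Counter


def sorted_value_counts_sketch(values, order="descending", top_n=None):
    """Count values, sort by count, and optionally keep only the first top_n entries."""
    counts = Counter(values)
    # sorted() is stable, so values with equal counts keep first-seen order.
    pairs = sorted(counts.items(), key=lambda kv: kv[1], reverse=(order == "descending"))
    return pairs if top_n is None else pairs[:top_n]


data = ["b", "a", "b", "c", "a", "b"]
top_two = sorted_value_counts_sketch(data, top_n=2)
ascending = sorted_value_counts_sketch(data, order="ascending")
```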
- lector.utils.to_pandas(array)
Proper conversion allowing pandas extension types.
- Parameters:
array (pyarrow.Array) –
- lector.utils.uniquify(items)
Add suffixes to input strings if necessary to ensure each item is unique.
- Parameters:
items (collections.abc.Sequence[str]) –
- Return type:
collections.abc.Iterator[str]
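A generator with this signature could work as sketched below. The `_1`, `_2`, ... suffix format is an assumption (the format actually used is not stated here), and `uniquify_sketch` is a hypothetical name.

```python
from collections.abc import Iterator, Sequence


def uniquify_sketch(items: Sequence[str]) -> Iterator[str]:
    """Yield each item, appending a numeric suffix when the name was already produced."""
    seen = set()
    for item in items:
        candidate, n = item, 0
        while candidate in seen:
            n += 1
            candidate = f"{item}_{n}"  # assumed suffix format
        seen.add(candidate)
        yield candidate


names = list(uniquify_sketch(["a", "b", "a", "a", "a_1"]))
```

Checking each candidate against everything already yielded also covers inputs that collide with a generated suffix, such as the literal `"a_1"` above.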
- lector.utils.with_flatten(arr, func)
Apply a compute function to all elements of flattened (and restored) lists.
- Parameters:
arr (pyarrow.Array) –
func (Callable) –
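The flatten, apply, restore pattern can be sketched in pure Python. Nested lists with `None` for null entries stand in for an Arrow list array, `with_flatten_sketch` is a hypothetical name, and `func` is assumed to return exactly one output per flattened input element.

```python
def with_flatten_sketch(lists, func):
    """Apply func to the flattened values, then rebuild the original list shape."""
    flat = [v for sub in lists if sub is not None for v in sub]
    result = func(flat)  # func sees one flat sequence, like a compute function would
    out, pos = [], 0
    for sub in lists:
        if sub is None:
            out.append(None)  # null lists stay null
            continue
        out.append(list(result[pos:pos + len(sub)]))
        pos += len(sub)
    return out


upper = with_flatten_sketch(
    [["a", "b"], None, [], ["c"]],
    lambda xs: [x.upper() for x in xs],
)
```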