lector.types.numbers#
Helpers for parsing and downcasting numeric data.
Note: Arrow uses Google’s RE2 to implement regex functionality: https://github.com/google/re2/wiki/Syntax
Classes#
DecimalMode | str(object='') -> str
Downcast | Attempts truncation of floats to ints and then downcasting of ints.
Number | Attempts to parse strings into floats or ints followed by downcasting.
Functions#
clean_float_pattern | Removes characters in number strings that Arrow cannot parse.
clean_float_strings | Prepare an array of strings so that Arrow can cast the result to floats.
decimal_delimiter | Infer decimal delimiter from string representation s of an input number.
infer_decimal_delimiter | Get most frequent decimal delimiter in array.
maybe_downcast_ints | Convert to smallest applicable int type.
maybe_parse_floats | Parse valid string representations of floating point numbers.
maybe_parse_ints | Use regex to extract castable ints.
 | Float to int conversion if sufficient values are kept unchanged.
Attributes#
- class lector.types.numbers.DecimalMode[source]#
Bases:
str, enum.Enum
str(object='') -> str str(bytes_or_buffer[, encoding[, errors]]) -> str
Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.
- class lector.types.numbers.Downcast[source]#
Bases:
lector.types.abc.Converter
Attempts truncation of floats to ints and then downcasting of ints.
- convert(array)[source]#
To be implemented in subclasses.
- Parameters:
array (pyarrow.Array) –
- Return type:
lector.types.abc.Conversion | None
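The truncation step ("float to int conversion if sufficient values are kept unchanged") can be illustrated with a minimal pure-Python sketch. The function name `truncate_floats_sketch`, its `threshold` parameter, and the use of plain lists instead of `pyarrow.Array` are assumptions made for illustration, not part of lector's API.

```python
import math

def truncate_floats_sketch(values, threshold=1.0):
    """Hypothetical helper: truncate floats to ints if enough values survive unchanged."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    # A value is "kept unchanged" if truncating it loses no fractional part.
    unchanged = sum(1 for v in present if math.isfinite(v) and v == int(v))
    if unchanged / len(present) < threshold:
        return None  # too many values would be altered, so keep the floats
    # Nulls and non-finite values become None, everything else is truncated.
    return [int(v) if v is not None and math.isfinite(v) else None for v in values]
```

With the default threshold of 1.0, a single value with a fractional part is enough to reject the conversion.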
- class lector.types.numbers.Number[source]#
Bases:
lector.types.abc.Converter
Attempts to parse strings into floats or ints followed by downcasting.
- decimal: str | DecimalMode[source]#
- convert(array)[source]#
To be implemented in subclasses.
- Parameters:
array (pyarrow.Array) –
- Return type:
lector.types.abc.Conversion | None
- lector.types.numbers.clean_float_pattern(thousands=',')[source]#
Removes characters in number strings that Arrow cannot parse.
- Parameters:
thousands (str) –
- Return type:
str
- lector.types.numbers.clean_float_strings(arr, decimal)[source]#
Prepare an array of strings so that Arrow can cast the result to floats.
Arrow allows exponential syntax and omission of 0s before and after the decimal point, i.e. the following are all valid string representations of floating point numbers: “-1e10”, “1e10”, “1e-2”, “1.2e3”, “-1.2e3”, “1.”, “.12”, “-1.”, “-.1”.
Arrow allows neither a leading positive sign nor a thousands separator, i.e. the following are not(!) valid: “+1e10”, “+1.”, “+.1”, “123,456.0”.
We hence remove occurrences of both the thousands character and the positive sign before extracting the floating point part of strings using regex.
Also see the following for more regex parsing options: https://stackoverflow.com/questions/12643009/regular-expression-for-floating-point-numbers
Note that we don’t parse as float if no value has decimals. In that case the values are really integers, and if they haven’t already been parsed as ints, that’s because they didn’t fit into Arrow’s largest integer type. Parsing them as floats would then be unsafe, yet Arrow would otherwise do so silently(!).
- Parameters:
arr (pyarrow.Array) –
decimal (str) –
- Return type:
tuple[pyarrow.Array, pyarrow.Array, float]
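The cleaning steps described above (drop the thousands character and the positive sign, then extract the float part with a regex) can be sketched in plain Python. This is only an illustration: `clean_float_strings_sketch`, `FLOAT_RE`, and the `thousands` argument are hypothetical names, the sketch works on a list of strings rather than a `pyarrow.Array`, and it returns only the cleaned strings rather than the tuple documented for the real function.

```python
import re

# Hypothetical pattern for the float syntax described above: optional minus,
# digits optional on either side of the point, optional exponent.
FLOAT_RE = re.compile(r"-?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")

def clean_float_strings_sketch(values, decimal=".", thousands=","):
    """Return cleaned strings, or None where no float syntax remains."""
    cleaned = []
    for value in values:
        # Remove thousands separators first, then normalise the decimal character.
        s = value.strip().replace(thousands, "")
        if decimal != ".":
            s = s.replace(decimal, ".")
        # Drop the leading positive sign.
        if s.startswith("+"):
            s = s[1:]
        match = FLOAT_RE.fullmatch(s)
        cleaned.append(match.group(0) if match else None)
    return cleaned
```

For example, `clean_float_strings_sketch(["+1e10", "123,456.0"])` yields strings with the sign and separator removed, which is the form the text above describes as castable.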
- lector.types.numbers.decimal_delimiter(s, n_chars_max=20)[source]#
Infer decimal delimiter from string representation s of an input number.
Returns None if not unambiguously inferrable.
- Parameters:
s (str) –
n_chars_max (int) –
- Return type:
str | None
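To make "not unambiguously inferrable" concrete, here is a minimal sketch of one possible heuristic. The rules below are assumptions for illustration and may differ from lector's actual logic. In particular, `n_chars_max` is treated here as a simple length cut-off, which is only a guess at the real parameter's meaning.

```python
def decimal_delimiter_sketch(s, n_chars_max=20):
    """Hypothetical heuristic: return '.', ',' or None if ambiguous."""
    s = s.strip()
    if len(s) > n_chars_max:
        return None
    n_dot, n_comma = s.count("."), s.count(",")
    if n_dot and n_comma:
        # Both present: the one appearing last is the decimal delimiter,
        # provided it occurs exactly once.
        last = "." if s.rfind(".") > s.rfind(",") else ","
        return last if s.count(last) == 1 else None
    for delim, n in ((".", n_dot), (",", n_comma)):
        if n == 1:
            head, tail = s.split(delim)
            # "1,234" could be a thousands group just as well as a decimal.
            if len(tail) == 3 and tail.isdigit() and head.lstrip("+-").isdigit():
                return None
            return delim
    # No delimiter, or a repeated one (thousands groups only).
    return None
```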
- lector.types.numbers.infer_decimal_delimiter(arr)[source]#
Get most frequent decimal delimiter in array.
Returns None if the most frequent delimiter doesn’t occur in a sufficient proportion of values (support), or doesn’t occur significantly more often than other delimiters (confidence).
- Parameters:
arr (pyarrow.Array) –
- Return type:
str | None
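The support and confidence checks can be sketched as follows. The function name, the threshold values `min_support` and `min_confidence`, and the exact definitions (support as a share of all values, confidence as a ratio to the runner-up) are assumptions chosen for illustration, not lector's actual settings.

```python
from collections import Counter

def most_frequent_delimiter_sketch(delims, min_support=0.2, min_confidence=1.5):
    """delims: per-value delimiter ('.', ',') or None where none was inferred."""
    counts = Counter(d for d in delims if d is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    best, n_best = ranked[0]
    # Support: the winner must occur in a sufficient share of all values.
    if n_best / len(delims) < min_support:
        return None
    # Confidence: the winner must clearly beat the runner-up.
    if len(ranked) > 1 and n_best < min_confidence * ranked[1][1]:
        return None
    return best
```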
- lector.types.numbers.maybe_downcast_ints(arr)[source]#
Convert to smallest applicable int type.
- Parameters:
arr (pyarrow.Array) –
- Return type:
pyarrow.Array | None
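Choosing the "smallest applicable int type" amounts to comparing the value range against the limits of each integer width. The sketch below uses plain Python ints and type names as strings. Preferring unsigned types for non-negative data is an assumption made here, and `smallest_int_type_sketch` is a hypothetical name.

```python
# (name, min, max) for each candidate width, from narrowest to widest.
SIGNED = [(f"int{b}", -2 ** (b - 1), 2 ** (b - 1) - 1) for b in (8, 16, 32, 64)]
UNSIGNED = [(f"uint{b}", 0, 2 ** b - 1) for b in (8, 16, 32, 64)]

def smallest_int_type_sketch(values):
    """Return the name of the narrowest type that holds all non-null values."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    lo, hi = min(present), max(present)
    candidates = UNSIGNED if lo >= 0 else SIGNED
    for name, type_min, type_max in candidates:
        if type_min <= lo and hi <= type_max:
            return name
    return None  # does not fit any candidate type
```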
- lector.types.numbers.maybe_parse_floats(arr, threshold=0.5, decimal=DecimalMode.INFER)[source]#
Parse valid string representations of floating point numbers.
- Parameters:
arr (pyarrow.Array) –
threshold (float) –
decimal (str | DecimalMode) –
- Return type:
pyarrow.Array | None
- lector.types.numbers.maybe_parse_ints(arr, threshold=1.0, allow_unsigned=False)[source]#
Use regex to extract castable ints.
Arrow’s internal casting from string to int doesn’t allow for an initial positive sign character, so we have to handle that separately.
- Parameters:
arr (pyarrow.Array) –
threshold (float) –
allow_unsigned (bool) –
- Return type:
pyarrow.Array | None
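As a rough illustration of the regex step, the sketch below strips a leading positive sign and accepts the result only if a sufficient proportion of values is valid. Interpreting `threshold` as the minimum fraction of parseable values is an assumption. The names `INT_RE` and `parse_ints_sketch` are hypothetical, and a plain list stands in for the `pyarrow.Array`.

```python
import re

# Optional sign followed by digits, with surrounding whitespace tolerated.
INT_RE = re.compile(r"\s*([+-]?\d+)\s*")

def parse_ints_sketch(values, threshold=1.0):
    """Parse strings to ints, or return None if too few values are valid."""
    if not values:
        return None
    parsed = []
    for value in values:
        match = INT_RE.fullmatch(value)
        # Remove the positive sign explicitly, mirroring the separate handling
        # described above.
        parsed.append(int(match.group(1).lstrip("+")) if match else None)
    n_valid = sum(p is not None for p in parsed)
    if n_valid / len(values) < threshold:
        return None
    return parsed
```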