`lector.types.timestamps`#

Helpers to convert timestamp strings or time-like columns to timestamps.

Arrow seems to use this parser under the hood in its compute.strptime function: https://pubs.opengroup.org/onlinepubs/009695399/functions/strptime.html

That function does not support timezone offsets via the %z or %Z directives, though Arrow does support timezones when importing CSVs or casting…

For arrow internals relating to timestamps also see:

Timezone internals: https://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow13TimestampTypeE
CSV parsing: https://arrow.apache.org/docs/cpp/csv.html#timestamp-inference-parsing
Timestamp umbrella issue: https://github.com/apache/arrow/issues/31324

TODO:

- Fractional seconds are handled manually, also see https://github.com/apache/arrow/issues/20146. They are first removed via regex, converted to a pyarrow duration type and later added to parsed timestamps.
- Timezones are only supported in format “+0100”, but not e.g. “+01:00”
- What to do with mixed timezones: https://stackoverflow.com/questions/75656639/computing-date-features-using-pyarrow-on-mixed-timezone-data
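The manual handling of fractional seconds can be sketched in plain Python. This is only an analogue of the idea (strip the fraction with a regex, parse the remainder, add the fraction back as a duration), not lector's pyarrow implementation. The names FRACTION, split_fraction and parse_with_fraction are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches a fractional part such as ".250" directly after the seconds field.
FRACTION = re.compile(r"(?<=:\d{2})\.(\d{1,9})")


def split_fraction(ts: str) -> tuple[str, timedelta]:
    """Return the timestamp without its fraction, plus the fraction as a duration."""
    match = FRACTION.search(ts)
    if match is None:
        return ts, timedelta(0)
    digits = match.group(1)
    # Right-pad to microseconds; timedelta cannot hold nanoseconds,
    # so digits beyond the sixth are dropped in this sketch.
    micros = int(digits[:6].ljust(6, "0"))
    return FRACTION.sub("", ts, count=1), timedelta(microseconds=micros)


def parse_with_fraction(ts: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> datetime:
    """Parse the fraction-free string, then add the fraction back."""
    stripped, fraction = split_fraction(ts)
    return datetime.strptime(stripped, fmt) + fraction


print(parse_with_fraction("2013-07-17 05:00:01.250"))
```

Separating the fraction keeps the format string free of any fractional-seconds directive, which is the reason the TODO item gives for the workaround.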

Classes#

Timestamp

Convert string or time/date-like arrays to timestamp type.

Functions#

`extract_timezone`(timestamps)	Extract timezone from a list of string timestamps.
`find_format`(ts)	Try to find the first format that can parse given date.
`fraction_as_duration`(arr)	Convert an array (of strings) representing fractional seconds to duration type.
`maybe_parse_known_timestamps`(arr, format[, unit, ...])	Helper for parsing with known format and no fractional seconds.
`maybe_parse_timestamps`(arr[, format, unit, threshold, ...])	Parse lists of strings as dates with format inference.
`proportion_fractional_seconds`(arr)	Proportion of non-null dates in arr having fractional seconds.
`timestamp_formats`([tz])

Attributes#

`ALL_FORMATS`	All formats tried by default if None is explicitly provided when converting.
`DATE_FORMATS`
`ISO_FORMAT`	String Arrow recognizes as meaning the ISO format.
`TIMESTAMP_FORMATS`
`UNIT`	Note that pandas internal unit is fixed to nanoseconds, and with that resolution it can represent a much smaller period of dates only.

class lector.types.timestamps.Timestamp[source]#

Bases: lector.types.abc.Converter

Convert string or time/date-like arrays to timestamp type.

Note: Arrow will always _parse_ either into UTC or timezone-naive timestamps, but never into specific timezones other than UTC by default. Also, internally all timestamps are represented as UTC. The timezone metadata is then used by other functions to correctly extract for example the local day of the week, time etc.

Non-UTC timestamps can only be created by specifying the TimestampType explicitly, or using the assume_timezone function.

When converting to pandas, the timezone is handled correctly.

When input strings have no explicit timezone information, the tz parameter is used to interpret them as local to that timezone. If tz=None, they are kept as timezone-naive timestamps. If input strings do have explicit timezone information, they are represented internally as UTC (as always), and the tz metadata is simply set so that component extraction etc. uses correctly localized moments in time.

TZ-naive timestamps [“2013-07-17 05:00”, “2013-07-17 02:00”]:

- assume_timezone(NY): interprets input timestamps as local to tz, converts and stores them as UTC, and keeps tz metadata for correct localization when printing/extracting components. I.e., will convert to [2013-07-17 09:00:00, 2013-07-17 06:00:00] UTC, but when needed, will localize on demand to [2013-07-17 05:00:00-04:00, 2013-07-17 02:00:00-04:00].
- cast with timezone(NY): interprets input timestamps as UTC, and stores the tz as metadata for on-demand localization. I.e., timestamps will be [2013-07-17 05:00:00, 2013-07-17 02:00:00] UTC, and when needed will localize on demand to [2013-07-17 01:00:00-04:00, 2013-07-16 22:00:00-04:00].

TZ-aware timestamps [“2013-07-17 05:00”, “2013-07-17 02:00”] UTC:

- cast with timezone(NY): since input timestamps internally are already always in UTC, keeps them as UTC [“2013-07-17 05:00”, “2013-07-17 02:00”], but localizes to cast tz on demand, i.e. to [2013-07-17 01:00:00-04:00, 2013-07-16 22:00:00-04:00].

DEFAULT_TZ: ClassVar[str] = 'UTC'[source]#

convert_temporal: bool = True[source]#: Whether time/date-only arrays should be converted to timestamps.

format: str | None[source]#: When None, default formats are tried in order.

tz: str | None[source]#: The desired timezone of the timestamps.

unit: str[source]#: Resolution the timestamps are stored with internally.

convert(array)[source]#

To be implemented in subclasses.

Parameters:: array (pyarrow.Array) –
Return type:: lector.types.abc.Conversion | None

convert_date_time(array)[source]#

Parameters:: array (pyarrow.Array) –
Return type:: lector.types.abc.Conversion | None

convert_strings(array)[source]#

Parameters:: array (pyarrow.Array) –
Return type:: lector.types.abc.Conversion | None

convert_timestamp(array)[source]#

Parameters:: array (pyarrow.Array) –
Return type:: lector.types.abc.Conversion | None

static meta(dt)[source]#

Parameters:: dt (pyarrow.TimestampType) –
Return type:: dict[str, str]

static to_timezone(array, tz)[source]#

Parameters:

array (pyarrow.TimestampArray) –
tz (str | None) –

Return type:

pyarrow.TimestampArray

lector.types.timestamps.extract_timezone(timestamps)[source]#

Extract timezone from a list of string timestamps.

Currently, the only supported format is +/-HH[:]MM, e.g. +0100.

Returns None if, after some basic cleaning, the timestamps contain more than one distinct offset. E.g. Z and +0000 are considered the same.

Parameters:: timestamps (pyarrow.Array) –
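The behaviour described for extract_timezone can be sketched in plain Python as follows. This is a hypothetical re-implementation for illustration only: the function name common_offset, the regular expression, and the choice to return None when a value has no offset at all are assumptions, not lector's actual code.

```python
import re
from typing import Iterable, Optional

# A trailing "Z" or +/-HH[:]MM offset that directly follows a time of day.
OFFSET = re.compile(r"\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?\s?(Z|[+-]\d{2}:?\d{2})$")


def common_offset(timestamps: Iterable[Optional[str]]) -> Optional[str]:
    """Return the single offset shared by all timestamps, else None."""
    offsets = set()
    for ts in timestamps:
        if ts is None:
            continue  # nulls carry no offset information
        match = OFFSET.search(ts.strip())
        if match is None:
            return None  # assumption: a value without offset makes the result ambiguous
        offset = match.group(1)
        # Basic cleaning: "Z" means UTC, and the colon is optional.
        offsets.add("+0000" if offset == "Z" else offset.replace(":", ""))
    return offsets.pop() if len(offsets) == 1 else None


print(common_offset(["2013-07-17T05:00:00Z", "2013-07-17 02:00:00+0000"]))  # +0000
print(common_offset(["2013-07-17 05:00+0100", "2013-07-17 02:00+0200"]))    # None
```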

lector.types.timestamps.find_format(ts)[source]#

Try to find the first format that can parse given date.

Parameters:: ts (pyarrow.TimestampScalar) –
Return type:: str | None
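The "first format that parses" idea can be illustrated with the standard library. The sketch below is not lector's implementation: first_matching_format is a hypothetical name, and CANDIDATES contains only the first entries visible in the truncated DATE_FORMATS listing on this page.

```python
from datetime import datetime
from typing import Optional, Sequence

# The first entries of DATE_FORMATS as shown on this page (the full list is truncated there).
CANDIDATES = [
    "%d-%m-%y", "%d/%m/%y", "%Y-%m-%d", "%d-%m-%Y",
    "%Y/%m/%d", "%d/%m/%Y", "%m/%d/%Y",
]


def first_matching_format(value: str, formats: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first format that parses the whole string, else None."""
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
        return fmt
    return None


print(first_matching_format("2013-07-17"))  # %Y-%m-%d
print(first_matching_format("01/02/2013"))  # %d/%m/%Y
```

Because the first success wins, the order of the list decides how ambiguous dates such as 01/02/2013 are read: with this ordering, day-first is tried before month-first.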

lector.types.timestamps.fraction_as_duration(arr)[source]#

Convert an array (of strings) representing fractional seconds to duration type.

Parameters:: arr (pyarrow.Array) –
Return type:: pyarrow.Array

lector.types.timestamps.maybe_parse_known_timestamps(arr, format, unit=UNIT, threshold=1.0)[source]#

Helper for parsing with known format and no fractional seconds.

Parameters:

arr (pyarrow.Array) –
format (str) –
unit (str) –
threshold (float) –

Return type:

pyarrow.Array | None

lector.types.timestamps.maybe_parse_timestamps(arr, format=None, unit=UNIT, threshold=1.0, return_format=False)[source]#

Parse lists of strings as dates with format inference.

Parameters:

arr (pyarrow.Array) –
format (str | None) –
unit (str) –
threshold (float) –
return_format (bool) –

Return type:

pyarrow.Array | None

lector.types.timestamps.proportion_fractional_seconds(arr)[source]#

Proportion of non-null dates in arr having fractional seconds.

Parameters:: arr (pyarrow.Array) –
Return type:: float
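A plain-Python sketch of this measure, assuming that "fractional seconds" means a decimal part directly after the seconds field. The name proportion_with_fraction and the regular expression are hypothetical; the real function operates on a pyarrow.Array.

```python
import re
from typing import Optional, Sequence

# Seconds field followed by a decimal fraction, e.g. ":01.250".
HAS_FRACTION = re.compile(r":\d{2}\.\d+")


def proportion_with_fraction(values: Sequence[Optional[str]]) -> float:
    """Share of non-null strings that contain fractional seconds."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0  # assumption: an all-null input counts as having no fractions
    hits = sum(1 for v in non_null if HAS_FRACTION.search(v))
    return hits / len(non_null)


sample = ["2013-07-17 05:00:00.5", "2013-07-17 05:00:00", None, "2013-07-17 05:00:01.250"]
print(proportion_with_fraction(sample))  # two of the three non-null values
```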

lector.types.timestamps.timestamp_formats(tz=True)[source]#

Parameters:: tz (bool) –
Return type:: list[str]

lector.types.timestamps.ALL_FORMATS: list[str][source]#: All formats tried by default if None is explicitly provided when converting.

lector.types.timestamps.DATE_FORMATS: list[str] = ['%d-%m-%y', '%d/%m/%y', '%Y-%m-%d', '%d-%m-%Y', '%Y/%m/%d', '%d/%m/%Y', '%m/%d/%Y', '%a %d %b...[source]#

lector.types.timestamps.ISO_FORMAT: str = 'ISO8601()'[source]#: String Arrow recognizes as meaning the ISO format.

lector.types.timestamps.TIMESTAMP_FORMATS: list[str] = ['%Y-%m-%dT%H:%M:%S', '%Y-%m-%dT%H:%M', '%Y-%m-%dT%I:%M:%S %p', '%Y-%m-%dT%I:%M %p',...[source]#

lector.types.timestamps.UNIT = 'ns'[source]#: Note that pandas' internal unit is fixed to nanoseconds, and at that resolution it can only represent a much smaller range of dates.
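The effect of the unit on the representable range follows from the 64-bit signed integer used to count ticks since 1970-01-01. The arithmetic below is a worked illustration. The helper reach_in_years is hypothetical, and the year length is the mean Gregorian year.

```python
from datetime import datetime, timedelta

INT64_MAX = 2**63 - 1
TICKS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60  # mean Gregorian year


def reach_in_years(unit: str) -> float:
    """Approximate number of years representable on either side of 1970-01-01."""
    return INT64_MAX / TICKS_PER_SECOND[unit] / SECONDS_PER_YEAR


# Latest whole second representable with nanosecond resolution.
latest_ns = datetime(1970, 1, 1) + timedelta(seconds=INT64_MAX // 10**9)

print(round(reach_in_years("ns")))  # 292
print(latest_ns)                    # 2262-04-11 23:47:16
```

With nanoseconds the range is roughly 292 years on either side of 1970, whereas each coarser unit extends it by a factor of 1000.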
