lector.types#

Subpackage for inferring column types in CSV files.

This can be used instead of, or on top of, Arrow’s built-in inference, which currently doesn’t detect list columns, timestamps in non-ISO formats, or semantic types such as URLs and natural-language text.

Classes#

Autocast

Simple cast trying each registered type in order.

Boolean

Converts stringy booleans ("true" / "False") and ints (0/1) to the boolean type.

Cast

Tries a specific cast for each column.

Category

Anything could be text, but we can enforce text-likeness and uniqueness.

Converter

Simple base class for dependency injection of new custom data types.

List

Simple base class for dependency injection of new custom data types.

Number

Attempts to parse strings into floats or ints followed by downcasting.

Text

Anything could be text, but we can enforce text-likeness and uniqueness.

Timestamp

Convert string or time/date-like arrays to timestamp type.

Url

Anything could be text, but we can enforce text-likeness and uniqueness.

Attributes#

Registry

'Singleton' conversion registry.

class lector.types.Autocast[source]#

Bases: CastStrategy

Simple cast trying each registered type in order.

As a performance optimization with a large effect on execution time, each type is first tested on a sample of the values, so that non-matching types are rejected quickly.

fallback: lector.types.abc.Converter | None#
n_samples: int = 100#
cast_array(array, name=None)[source]#

Subclasses only need to override this method.

Parameters:
  • array (pyarrow.Array | pyarrow.ChunkedArray) –

  • name (str | None) –

Return type:

lector.types.abc.Conversion
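The sample-first strategy described for Autocast can be sketched in plain Python, independently of lector's actual implementation. The helper `autocast`, the converter functions and the `"string"` fallback are hypothetical names used for illustration; the assumption is that `n_samples` controls how many values are tested before the full column is converted.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical converter signature: returns converted values, or None on failure.
ConverterFn = Callable[[Sequence[str]], Optional[list]]


def to_int(values: Sequence[str]) -> Optional[list]:
    try:
        return [int(v) for v in values]
    except ValueError:
        return None


def to_float(values: Sequence[str]) -> Optional[list]:
    try:
        return [float(v) for v in values]
    except ValueError:
        return None


def autocast(values, converters, n_samples=100, seed=0):
    """Try each converter in order, rejecting quickly on a small sample."""
    rng = random.Random(seed)
    if len(values) <= n_samples:
        sample = values
    else:
        sample = rng.sample(list(values), n_samples)
    for name, convert in converters:
        if convert(sample) is None:
            continue  # cheap rejection: the full column is never touched
        result = convert(values)  # sample passed, now convert everything
        if result is not None:
            return name, result
    return "string", list(values)  # fallback: leave the values as strings


converters = [("int", to_int), ("float", to_float)]
print(autocast(["1", "2", "3"], converters))
print(autocast(["1.5", "2", "x"], converters))
```

Because a sample can pass while the full column fails, the full conversion result is still checked before a type is accepted.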

class lector.types.Boolean[source]#

Bases: lector.types.abc.Converter

Converts stringy booleans (“true” / “False”) and ints (0/1) to the boolean type.

convert(array)[source]#

To be implemented in subclasses.

Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None
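As a rough illustration of what converting "stringy" booleans involves, the following plain-Python sketch maps case-insensitive "true"/"false" strings and the integers 0/1 to booleans. The function `to_boolean` and the accepted spellings are assumptions for this example; the set of values lector actually recognizes may differ.

```python
from typing import Optional

# Assumed spellings; the set lector actually accepts may differ.
TRUE_VALUES = {"true", "1"}
FALSE_VALUES = {"false", "0"}


def to_boolean(values) -> Optional[list]:
    """Return booleans (None for nulls), or None if any value isn't boolean-like."""
    out = []
    for v in values:
        if v is None:
            out.append(None)  # nulls are passed through unchanged
            continue
        key = str(v).strip().lower()
        if key in TRUE_VALUES:
            out.append(True)
        elif key in FALSE_VALUES:
            out.append(False)
        else:
            return None  # not a boolean column
    return out


print(to_boolean(["true", "False", 0, 1, None]))
print(to_boolean(["yes", "no"]))
```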

class lector.types.Cast[source]#

Tries a specific cast for each column.

converters: dict[str, lector.types.abc.Converter]#
log: bool = False#
cast(table)[source]#
Parameters:

table (pyarrow.Table) –

Return type:

pyarrow.Table

class lector.types.Category[source]#

Bases: lector.types.abc.Converter

Anything could be text, but we can enforce text-likeness and uniqueness.

max_cardinality: lector.utils.Number | None#
convert(array)[source]#

To be implemented in subclasses.

Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None

class lector.types.Converter[source]#

Bases: abc.ABC

Simple base class for dependency injection of new custom data types.

If the proportion of values that can be successfully converted is smaller than threshold, the converter should return None.

threshold: float = 1.0#
abstract convert(arr)[source]#

To be implemented in subclasses.

Parameters:

arr (pyarrow.Array) –

Return type:

Conversion | None
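The threshold contract can be illustrated with a self-contained sketch that uses plain lists instead of Arrow arrays. The `Conversion` and `Converter` classes below are simplified stand-ins, not lector's own, and `HexInt` is a hypothetical custom type. Whether nulls count toward the converted proportion is an assumption here (they are ignored).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversion:
    """Hypothetical stand-in for lector.types.abc.Conversion."""
    result: list
    meta: dict


@dataclass
class Converter(ABC):
    """Stand-in base class mirroring the documented threshold attribute."""
    threshold: float = 1.0

    @abstractmethod
    def convert(self, arr: list) -> Optional[Conversion]:
        ...


@dataclass
class HexInt(Converter):
    """Hypothetical custom type: strings like '0x1f' become ints."""

    def convert(self, arr: list) -> Optional[Conversion]:
        converted, n_valid, n_ok = [], 0, 0
        for value in arr:
            if value is None:
                converted.append(None)  # nulls don't count (assumption)
                continue
            n_valid += 1
            try:
                converted.append(int(value, 16))
                n_ok += 1
            except ValueError:
                converted.append(None)  # unconvertible values become null
        # Too few successful conversions: signal "not this type".
        if n_valid == 0 or n_ok / n_valid < self.threshold:
            return None
        return Conversion(converted, {"semantic": "hex"})


print(HexInt().convert(["0x1f", "ff", None]))
print(HexInt().convert(["0x1f", "zz"]))
print(HexInt(threshold=0.5).convert(["0x1f", "zz"]))
```

With the default threshold of 1.0 a single unparseable value rejects the column; lowering the threshold tolerates some failures, which then become nulls in this sketch.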

class lector.types.List[source]#

Bases: lector.types.abc.Converter

Simple base class for dependency injection of new custom data types.

If the proportion of values that can be successfully converted is smaller than threshold, the converter should return None.

delimiter: str = ','#
infer_urls: bool = True#
quote_char: str = '"'#
threshold_urls: float = 1.0#
type: str | pyarrow.DataType | None#
convert(array)[source]#

To be implemented in subclasses.

Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None

class lector.types.Number[source]#

Bases: lector.types.abc.Converter

Attempts to parse strings into floats or ints followed by downcasting.

allow_unsigned_int: bool = True#
decimal: str | DecimalMode#
max_int: int | None#
convert(array)[source]#

To be implemented in subclasses.

Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None
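"Parsing followed by downcasting" can be sketched without Arrow as: parse the strings as integers, pick the narrowest integer type whose range covers the values, and fall back to floats otherwise. The functions `smallest_int_type` and `parse_numbers` are hypothetical, and only the `allow_unsigned_int` option is modelled; this is not necessarily how lector selects types.

```python
from typing import Optional

# (name, min, max), from narrowest to widest.
UNSIGNED = [("uint8", 0, 2**8 - 1), ("uint16", 0, 2**16 - 1),
            ("uint32", 0, 2**32 - 1), ("uint64", 0, 2**64 - 1)]
SIGNED = [("int8", -2**7, 2**7 - 1), ("int16", -2**15, 2**15 - 1),
          ("int32", -2**31, 2**31 - 1), ("int64", -2**63, 2**63 - 1)]


def smallest_int_type(values, allow_unsigned_int=True) -> Optional[str]:
    """Name of the narrowest integer type covering all values, if any."""
    lo, hi = min(values), max(values)
    use_unsigned = allow_unsigned_int and lo >= 0
    candidates = (UNSIGNED if use_unsigned else []) + SIGNED
    for name, tmin, tmax in candidates:
        if tmin <= lo and hi <= tmax:
            return name
    return None


def parse_numbers(strings, allow_unsigned_int=True):
    """Return (type name, values), or None if the strings aren't numeric."""
    if not strings:
        return None
    try:
        ints = [int(s) for s in strings]
    except ValueError:
        ints = None
    if ints is not None:
        dtype = smallest_int_type(ints, allow_unsigned_int)
        if dtype is not None:
            return dtype, ints
    try:
        return "float64", [float(s) for s in strings]
    except ValueError:
        return None


print(parse_numbers(["1", "200"]))
print(parse_numbers(["1", "200"], allow_unsigned_int=False))
print(parse_numbers(["1.5", "2"]))
```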

class lector.types.Text[source]#

Bases: lector.types.abc.Converter

Anything could be text, but we can enforce text-likeness and uniqueness.

min_unique: float = 0.1#
convert(array)[source]#

To be implemented in subclasses.

Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None
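One plausible reading of the uniqueness check is that `min_unique` is the minimum proportion of distinct values a column needs in order to be treated as free text rather than, say, a category. The sketch below encodes that assumption in plain Python; the helper names are hypothetical and the text-likeness part of the check is not modelled.

```python
def unique_proportion(values) -> float:
    """Share of distinct values among the non-null values."""
    valid = [v for v in values if v is not None]
    return len(set(valid)) / len(valid) if valid else 0.0


def looks_like_text(values, min_unique=0.1) -> bool:
    """Assumed rule: enough distinct values for the column to count as text."""
    return unique_proportion(values) >= min_unique


print(looks_like_text(["red"] * 20 + ["blue"]))
print(looks_like_text(["first note", "second note", "third"]))
```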

class lector.types.Timestamp[source]#

Bases: lector.types.abc.Converter

Convert string or time/date-like arrays to timestamp type.

Note: By default, Arrow always parses strings into either UTC or timezone-naive timestamps, never into a specific timezone other than UTC. Internally, all timestamps are represented as UTC. The timezone metadata is then used by other functions to correctly extract, for example, the local day of the week or the local time.

Non-UTC timestamps can only be created by specifying the TimestampType explicitly, or using the assume_timezone function.

When converting to pandas, the timezone is handled correctly.

When input strings have no explicit timezone information, the tz parameter is used to interpret them as local to that timezone. If tz=None, they are kept as timezone-naive timestamps. If input strings do have explicit timezone information, they are represented internally as UTC (as always), and the tz metadata is simply set so that component extraction etc. uses correctly localized moments in time.

TZ-naive timestamps [“2013-07-17 05:00”, “2013-07-17 02:00”]:

  • assume_timezone(NY): interprets input timestamps as local to tz, converts and stores them as UTC, and keeps tz metadata for correct localization when printing/extracting components. I.e., will convert to [2013-07-17 09:00:00, 2013-07-17 06:00:00] UTC, but when needed, will localize on demand to [2013-07-17 05:00:00-04:00, 2013-07-17 02:00:00-04:00].

  • cast with timezone(NY): interprets input timestamps as local to UTC, and stores the tz as metadata for on-demand localization. I.e., timestamps will be [2013-07-17 05:00:00, 2013-07-17 02:00:00] UTC, and when needed will localize on demand to [2013-07-17 01:00:00-04:00, 2013-07-16 22:00:00-04:00].

TZ-aware timestamps [“2013-07-17 05:00”, “2013-07-17 02:00”] UTC:

  • cast with timezone(NY): since input timestamps internally are already always in UTC, keeps them as UTC [“2013-07-17 05:00”, “2013-07-17 02:00”], but localizes to cast tz on demand, i.e. to [2013-07-17 01:00:00-04:00, 2013-07-16 22:00:00-04:00].
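The difference between the two interpretations of timezone-naive values can be reproduced with the standard library alone. This sketch does not use Arrow; it mimics the semantics with `datetime` and, as a simplifying assumption, models New York as a fixed UTC-04:00 offset (valid for July 2013, daylight saving time).

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-04:00 offset standing in for New York in July 2013 (EDT).
NY = timezone(timedelta(hours=-4), "EDT")
naive = [datetime(2013, 7, 17, 5, 0), datetime(2013, 7, 17, 2, 0)]

# Like assume_timezone(NY): wall-clock values are local to NY,
# the stored instant is the corresponding UTC time.
assumed_utc = [t.replace(tzinfo=NY).astimezone(timezone.utc) for t in naive]

# Like a cast with timezone NY: values are taken to be UTC already,
# NY is only used when localizing for display.
cast_utc = [t.replace(tzinfo=timezone.utc) for t in naive]


def localized(timestamps):
    """Render UTC instants in the NY offset, as on-demand localization would."""
    return [t.astimezone(NY).isoformat(sep=" ") for t in timestamps]


print([t.isoformat(sep=" ") for t in assumed_utc])
# ['2013-07-17 09:00:00+00:00', '2013-07-17 06:00:00+00:00']
print(localized(assumed_utc))
# ['2013-07-17 05:00:00-04:00', '2013-07-17 02:00:00-04:00']
print(localized(cast_utc))
# ['2013-07-17 01:00:00-04:00', '2013-07-16 22:00:00-04:00']
```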

DEFAULT_TZ: ClassVar[str] = 'UTC'#
convert_temporal: bool = True#

Whether time/date-only arrays should be converted to timestamps.

format: str | None#

When None, default formats are tried in order.
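Trying formats in order can be sketched with `datetime.strptime`: the first format that parses every value wins. The candidate list `DEFAULT_FORMATS` and the helper `parse_all` are hypothetical; the formats lector actually tries, and their order, are not listed here.

```python
from datetime import datetime
from typing import Optional

# Hypothetical candidates; lector's actual default formats may differ.
DEFAULT_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]


def parse_all(strings, fmt: Optional[str] = None):
    """Return (format, datetimes) for the first format parsing every string."""
    candidates = [fmt] if fmt is not None else DEFAULT_FORMATS
    for candidate in candidates:
        try:
            return candidate, [datetime.strptime(s, candidate) for s in strings]
        except ValueError:
            continue  # at least one value didn't match, try the next format
    return None


print(parse_all(["17/07/2013", "01/02/2013"]))
print(parse_all(["07/17/2013"]))
print(parse_all(["2013-07-17"], fmt="%d/%m/%Y"))
```

Order matters for ambiguous inputs: a value such as "01/02/2013" matches both day-first and month-first formats, so whichever comes first in the list decides the interpretation.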

tz: str | None#

The desired timezone of the timestamps.

unit: str#

Resolution the timestamps are stored with internally.

convert(array)[source]#

To be implemented in subclasses.

Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None

convert_date_time(array)[source]#
Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None

convert_strings(array)[source]#
Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None

convert_timestamp(array)[source]#
Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None

static meta(dt)[source]#
Parameters:

dt (pyarrow.TimestampType) –

Return type:

dict[str, str]

static to_timezone(array, tz)[source]#
Parameters:
  • array (pyarrow.TimestampArray) –

  • tz (str | None) –

Return type:

pyarrow.TimestampArray

class lector.types.Url[source]#

Bases: lector.types.abc.Converter

Anything could be text, but we can enforce text-likeness and uniqueness.

convert(array)[source]#

To be implemented in subclasses.

Parameters:

array (pyarrow.Array) –

Return type:

lector.types.abc.Conversion | None

lector.types.Registry[source]#

‘Singleton’ conversion registry.