lector

A package for fast parsing of messy CSV files and smart-ish type inference.

Classes

ArrowReader

Use base class detection methods to configure a pyarrow.csv.read_csv() call.

Autocast

Simple cast trying each registered type in order.

Cast

Tries a specific cast for each column.

Converter

Simple base class for dependency injection of new custom data types.

Dialect

A more convenient class for dialects than Python's built-in.

Format

Holds all parameters needed to successfully read a CSV file.

Preambles

Registry to manage preamble detectors.

Functions

schema_view(schema[, title, padding])

Make a rich view of an Arrow schema.

table_view(tbl[, title, n_rows_max, n_columns_max, ...])

Convert a pyarrow table to a rich table.

Attributes

CONSOLE

LOG

Registry

'Singleton' conversion registry.

exception lector.EmptyFileError

Bases: Exception

Raised when a binary file read() returns 0 bytes.

class lector.ArrowReader(fp, encoding=None, dialect=None, preamble=None, log=True)

Bases: lector.csv.abc.Reader

Use base class detection methods to configure a pyarrow.csv.read_csv() call.

configure(format)
Parameters:

format (lector.csv.abc.Format) –

Return type:

dict

parse(types=None, timestamp_formats=None, null_values=None)

Invoke Arrow’s parser with inferred CSV format.

Parameters:
  • types (str | TypeDict | None) –

  • timestamp_formats (str | list[str] | None) –

  • null_values (str | collections.abc.Iterable[str] | None) –

Return type:

pyarrow.Table

skip_invalid_row(row)
Parameters:

row (pyarrow.csv.InvalidRow) –

Return type:

str

class lector.Autocast

Bases: CastStrategy

Simple cast trying each registered type in order.

As a small optimization with a large effect on execution time, each type is first tested on a sample of the values, so that non-matching types are rejected quickly.

fallback: lector.types.abc.Converter | None
n_samples: int = 100
cast_array(array, name=None)

Only this method needs to be overridden.

Parameters:
  • array (pyarrow.Array | pyarrow.ChunkedArray) –

  • name (str | None) –

Return type:

lector.types.abc.Conversion
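
The sample-first rejection described for Autocast can be illustrated with a minimal pure-Python sketch. It does not use lector or pyarrow: the `autocast` function, the `to_int` and `to_float` helpers, and the choice of taking the first `n_samples` values as the sample are all assumptions made for illustration, not the library's implementation.

```python
from typing import Callable, Optional, Sequence

# Hypothetical converter signature: returns the converted values, or None if
# the column does not match the type. This is not lector's Converter interface.
ConverterFn = Callable[[Sequence[str]], Optional[list]]


def to_int(values: Sequence[str]) -> Optional[list]:
    try:
        return [int(v) for v in values]
    except ValueError:
        return None


def to_float(values: Sequence[str]) -> Optional[list]:
    try:
        return [float(v) for v in values]
    except ValueError:
        return None


def autocast(values, converters, n_samples=100):
    """Return (type name, converted values) for the first matching converter."""
    sample = values[:n_samples]  # assumption: the sample is simply the head
    for name, convert in converters:
        # Cheap check first: skip the full conversion if the sample fails.
        if convert(sample) is None:
            continue
        converted = convert(values)
        if converted is not None:
            return name, converted
    # Nothing matched: leave the values as strings.
    return "string", list(values)


# Converters are tried in the order in which they are listed.
converters = [("int", to_int), ("float", to_float)]
print(autocast(["1", "2", "3"], converters))
print(autocast(["1.5", "2", "x"], converters))
```

For a column that fails a type early, only the sample is converted before that type is dropped, which is where the saving comes from on large columns.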

class lector.Cast

Tries a specific cast for each column.

converters: dict[str, lector.types.abc.Converter]
log: bool = False
cast(table)
Parameters:

table (pyarrow.Table) –

Return type:

pyarrow.Table

class lector.Converter

Bases: abc.ABC

Simple base class for dependency injection of new custom data types.

If the proportion of values that can be successfully converted is smaller than threshold, the converter should return None.

threshold: float = 1.0
abstract convert(arr)

To be implemented in subclasses.

Parameters:

arr (pyarrow.Array) –

Return type:

Conversion | None
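
The threshold rule can be sketched in plain Python as follows. `SketchConverter` and `IntConverter` are hypothetical stand-ins written for this example; they work on Python lists rather than pyarrow arrays and return a list rather than a Conversion object, so treat this as an illustration of the rule only.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConverter(ABC):
    """Hypothetical stand-in for a converter base class; not lector's code."""

    threshold: float = 1.0

    @abstractmethod
    def convert(self, values: list) -> Optional[list]:
        """To be implemented in subclasses."""


@dataclass
class IntConverter(SketchConverter):
    def convert(self, values: list) -> Optional[list]:
        converted = []
        for v in values:
            try:
                converted.append(int(v))
            except (TypeError, ValueError):
                converted.append(None)  # mark values that failed to convert
        n_valid = sum(c is not None for c in converted)
        proportion = n_valid / len(values) if values else 0.0
        # Reject the conversion when too few values could be converted.
        return converted if proportion >= self.threshold else None


print(IntConverter().convert(["1", "2"]))
print(IntConverter().convert(["1", "x"]))
print(IntConverter(threshold=0.5).convert(["1", "x"]))
```

With the default threshold of 1.0 a single unconvertible value rejects the whole column, while a lower threshold tolerates a share of failures.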

class lector.Dialect

A more convenient class for dialects than Python’s built-in.

The built-in Dialect is a class with class attributes only, so instead of instances of that class, Python wants you to pass around references to subclasses, which is, uhm, awkward to say the least (see to_builtin() below for an example).

delimiter: str = ','
double_quote: bool = True
escape_char: str | None
line_terminator: str = '\r\n'
quote_char: str = '"'
quoting: int
skip_initial_space: bool = False
classmethod from_builtin(dialect)

Make an instance from a built-in dialect class, configured for reliable reading(!).

Parameters:

dialect (str | PyDialectT) –

Return type:

Dialect

to_builtin()

Make a subclass of built-in Dialect from this instance.

Return type:

PyDialectT
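
The awkwardness of the built-in dialects mentioned above can be seen with the standard library alone. In the sketch below, `make_dialect` is a hypothetical helper invented for this example (it is not part of lector); it shows the kind of dynamic subclassing needed to turn runtime values into something `csv.reader` accepts.

```python
import csv
import io

# Built-in dialects keep their settings in class attributes.
print(csv.excel.delimiter, repr(csv.excel.lineterminator))


def make_dialect(delimiter=",", quote_char='"'):
    """Build a csv.Dialect subclass from runtime values (hypothetical helper)."""
    attrs = {"delimiter": delimiter, "quotechar": quote_char}
    # Subclass csv.excel so the remaining attributes keep their defaults.
    return type("SketchDialect", (csv.excel,), attrs)


dialect_cls = make_dialect(delimiter=";")

# csv.reader is handed the class itself rather than an instance of it.
rows = list(csv.reader(io.StringIO("a;b\r\n1;2\r\n"), dialect=dialect_cls))
print(rows)
```

A class holding these settings as instance attributes avoids creating a new type for every combination of parameters, which is presumably the motivation for the Dialect class documented here.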

class lector.Format

Holds all parameters needed to successfully read a CSV file.

columns: list[str] | None
dialect: lector.csv.dialects.Dialect | None
encoding: str | None = 'utf-8'
preamble: int | None = 0
__rich__()
Return type:

rich.table.Table

class lector.Preambles

Registry to manage preamble detectors.

DETECTORS
classmethod detect(buffer, detectors=None, log=False)

Get the result of the first preamble detector that matches the CSV buffer.

Matching here means detecting more than 0 rows of preamble text; the result is the number of rows to skip.

If no detectors are provided (as an ordered sequence), all registered detector classes are tried in registration order, using their default parameters.

Parameters:
  • buffer (TextIO) –

  • detectors (collections.abc.Iterable[PreambleDetector] | None) –

  • log (bool) –

Return type:

int

classmethod register(registered)
Parameters:

registered (type) –

Return type:

type
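
The register/detect behavior described above can be sketched as a small registry in plain Python. Everything here is hypothetical and written for illustration: `SketchPreambles`, the two detector classes, their `detect(buffer)` method, and the choice to return 0 when no detector matches are assumptions, not lector's actual code.

```python
import io
from typing import Iterable, Optional, TextIO


class SketchPreambles:
    """Hypothetical registry modelled on the description above."""

    DETECTORS: list = []

    @classmethod
    def register(cls, registered: type) -> type:
        cls.DETECTORS.append(registered)
        return registered  # returning the class allows use as a decorator

    @classmethod
    def detect(cls, buffer: TextIO, detectors: Optional[Iterable] = None) -> int:
        if detectors is None:
            # Instantiate all registered classes with default parameters.
            detectors = [det() for det in cls.DETECTORS]
        for detector in detectors:
            buffer.seek(0)  # every detector reads from the start
            n_rows = detector.detect(buffer)
            if n_rows > 0:  # the first detector that finds a preamble wins
                return n_rows
        return 0  # assumption: no match means nothing to skip


@SketchPreambles.register
class LeadingComments:
    """Hypothetical detector: counts leading lines that start with '#'."""

    def detect(self, buffer: TextIO) -> int:
        n = 0
        for line in buffer:
            if not line.startswith("#"):
                break
            n += 1
        return n


@SketchPreambles.register
class LeadingBlankLines:
    """Hypothetical detector: counts leading blank lines."""

    def detect(self, buffer: TextIO) -> int:
        n = 0
        for line in buffer:
            if line.strip():
                break
            n += 1
        return n


text = "# first comment\n# second comment\na,b\n1,2\n"
print(SketchPreambles.detect(io.StringIO(text)))
```

Because detectors are tried in order, the order of registration decides which detector wins when more than one would match.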

lector.schema_view(schema, title='Schema', padding=1)

Make a rich view of an Arrow schema.

Parameters:
  • schema (pyarrow.Schema) –

  • title (str | None) –

  • padding (int) –

Return type:

rich.table.Table

lector.table_view(tbl, title=None, n_rows_max=10, n_columns_max=6, max_column_width=20, padding=1)

Convert a pyarrow table to a rich table.

Parameters:
  • tbl (pyarrow.Table) –

  • title (str | None) –

  • n_rows_max (int) –

  • n_columns_max (int) –

  • max_column_width (int) –

  • padding (int) –

Return type:

rich.table.Table

lector.CONSOLE
lector.LOG
lector.Registry

‘Singleton’ conversion registry.