lector
A package for fast parsing of messy CSV files and smart-ish type inference.
Classes#
ArrowReader | Use base class detection methods to configure a pyarrow.csv.read_csv() call.
Autocast | Simple cast trying each registered type in order.
Cast | Tries a specific cast for each column.
Converter | Simple base class for dependency injection of new custom data types.
Dialect | A more convenient class for dialects than Python's built-in.
Format | Holds all parameters needed to successfully read a CSV file.
Preambles | Registry to manage preamble detectors.
Functions#
schema_view | Make a rich view for arrow schema.
table_view | Pyarrow table to rich table.
Attributes#
'Singleton' conversion registry.
- exception lector.EmptyFileError[source]#
Bases: Exception
Raised when a binary file read() returns 0 bytes.
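A minimal sketch of the condition this exception describes. The local `EmptyFileError` class and the `read_nonempty` helper are stand-ins defined here for illustration only; they are not lector's own code.

```python
import io


class EmptyFileError(Exception):
    """Stand-in for lector.EmptyFileError, defined locally for this sketch."""


def read_nonempty(fp: io.BufferedIOBase) -> bytes:
    """Hypothetical helper: read all bytes and fail loudly on an empty file."""
    data = fp.read()
    if len(data) == 0:
        raise EmptyFileError("read() returned 0 bytes")
    return data


# A file with content is returned unchanged.
payload = read_nonempty(io.BytesIO(b"a,b\n1,2\n"))

# An empty file triggers the exception.
try:
    read_nonempty(io.BytesIO(b""))
    raised = False
except EmptyFileError:
    raised = True
```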
- class lector.ArrowReader(fp, encoding=None, dialect=None, preamble=None, log=True)[source]#
Bases: lector.csv.abc.Reader
Use base class detection methods to configure a pyarrow.csv.read_csv() call.
- Parameters:
fp (FileLike) –
encoding (str | lector.csv.encodings.EncodingDetector | None) –
dialect (dict | lector.csv.dialects.Dialect | lector.csv.dialects.DialectDetector | None) –
preamble (int | PreambleRegistry | None) –
log (bool) –
- configure(format)[source]#
- Parameters:
format (lector.csv.abc.Format) –
- Return type:
dict
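The structure of the dict returned by `configure()` is not documented here. The sketch below shows one plausible way to translate a detected format into keyword arguments for pyarrow's CSV reader. The `Format` and `Dialect` classes are local stand-ins, and the `read_options`/`parse_options` grouping is an assumption, not the library's actual return value. The inner keys are real parameter names of `pyarrow.csv.ReadOptions` and `pyarrow.csv.ParseOptions`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dialect:
    """Local stand-in mirroring a few fields of lector.Dialect."""
    delimiter: str = ","
    quote_char: str = '"'
    escape_char: Optional[str] = None
    double_quote: bool = True


@dataclass
class Format:
    """Local stand-in mirroring the fields of lector.Format."""
    encoding: Optional[str] = "utf-8"
    preamble: Optional[int] = 0
    dialect: Optional[Dialect] = field(default_factory=Dialect)
    columns: Optional[List[str]] = None


def configure(fmt: Format) -> dict:
    """Hypothetical mapping from a Format to pyarrow option keyword arguments."""
    dialect = fmt.dialect or Dialect()
    return {
        # Keyword arguments accepted by pyarrow.csv.ReadOptions
        "read_options": {
            "encoding": fmt.encoding or "utf-8",
            "skip_rows": fmt.preamble or 0,
            "column_names": fmt.columns,
        },
        # Keyword arguments accepted by pyarrow.csv.ParseOptions
        "parse_options": {
            "delimiter": dialect.delimiter,
            "quote_char": dialect.quote_char,
            "double_quote": dialect.double_quote,
            # pyarrow expects False (not None) when no escape character is used
            "escape_char": dialect.escape_char or False,
        },
    }


cfg = configure(Format(preamble=2, dialect=Dialect(delimiter=";")))
```

With pyarrow installed, each inner dict could be unpacked into the corresponding options class, for example `pyarrow.csv.ReadOptions(**cfg["read_options"])`.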
- class lector.Autocast[source]#
Bases: CastStrategy
Simple cast trying each registered type in order.
As a small optimization with a large effect on execution time, each type is first tested on a sample of values so that non-matching types are rejected quickly.
- fallback: lector.types.abc.Converter | None#
- n_samples: int = 100#
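The sample-first idea can be sketched in plain Python. This is an illustration under assumptions, not lector's implementation: it works on lists of strings rather than pyarrow arrays, and the `to_int`, `to_float`, and `autocast` names, as well as the string fallback, are hypothetical.

```python
import random


def to_int(values):
    """Hypothetical converter: all values must parse as int, else None."""
    try:
        return [int(v) for v in values]
    except ValueError:
        return None


def to_float(values):
    """Hypothetical converter: all values must parse as float, else None."""
    try:
        return [float(v) for v in values]
    except ValueError:
        return None


def autocast(values, converters, n_samples=100, seed=0):
    """Try converters in order, testing each on a sample before the full column."""
    rng = random.Random(seed)
    if len(values) <= n_samples:
        sample = list(values)
    else:
        sample = rng.sample(list(values), n_samples)

    for name, convert in converters:
        if convert(sample) is None:
            continue  # fast rejection: the sample already fails this type
        result = convert(values)  # the sample passed, so pay for the full cast
        if result is not None:
            return name, result
    # Assumed fallback: leave the values as strings
    return "string", list(values)


column = ["1.5", "2", "3"] * 1000
name, result = autocast(column, [("int", to_int), ("float", to_float)])
```

The saving comes from the rejected types: a converter that fails on 100 sampled values never touches the remaining rows.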
- class lector.Cast[source]#
Tries a specific cast for each column.
- converters: dict[str, lector.types.abc.Converter]#
- log: bool = False#
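A rough sketch of per-column casting driven by a `converters` mapping. Plain dicts and lists stand in for pyarrow tables, the `cast_columns` and `to_int` names are hypothetical, and keeping the original values when a cast fails is an assumption about the behavior.

```python
def to_int(values):
    """Hypothetical converter: all values must parse as int, else None."""
    try:
        return [int(v) for v in values]
    except ValueError:
        return None


def cast_columns(table, converters):
    """Apply the converter registered for each column name, if any."""
    result = {}
    for name, values in table.items():
        convert = converters.get(name)
        converted = convert(values) if convert is not None else None
        # Assumption: a missing or failed conversion leaves the column untouched
        result[name] = values if converted is None else converted
    return result


table = {"id": ["1", "2"], "name": ["a", "b"]}
out = cast_columns(table, {"id": to_int, "name": to_int})
```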
- class lector.Converter[source]#
Bases: abc.ABC
Simple base class for dependency injection of new custom data types.
If the proportion of values that can be successfully converted is smaller than threshold, the converter should return None.
- threshold: float = 1.0#
- abstract convert(arr)[source]#
To be implemented in subclasses.
- Parameters:
arr (pyarrow.Array) –
- Return type:
Conversion | None
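The threshold rule can be illustrated with a small subclass. This sketch makes several assumptions: the base class is redefined locally, `convert()` takes and returns plain lists instead of a `pyarrow.Array` and a `Conversion`, and `IntConverter` is a hypothetical example type.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Converter(ABC):
    """Local stand-in for the base class, operating on plain lists."""

    threshold: float = 1.0

    @abstractmethod
    def convert(self, values: list) -> Optional[list]:
        """To be implemented in subclasses."""


class IntConverter(Converter):
    """Hypothetical converter: parse integers, mapping failures to None."""

    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold

    def convert(self, values: list) -> Optional[list]:
        converted = []
        for v in values:
            try:
                converted.append(int(v))
            except (TypeError, ValueError):
                converted.append(None)

        n_valid = sum(c is not None for c in converted)
        proportion = n_valid / len(values) if values else 0.0
        if proportion < self.threshold:
            return None  # too few values converted: reject this type
        return converted


strict = IntConverter().convert(["1", "2", "x"])
lenient = IntConverter(threshold=0.6).convert(["1", "2", "x"])
```

With the default threshold of 1.0 a single unparseable value rejects the type. Lowering the threshold accepts the column and leaves the failures as nulls.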
- class lector.Dialect[source]#
A more convenient class for dialects than Python’s built-in.
The built-in Dialect is a class with class attributes only, so instead of passing around instances of that class, Python wants you to pass references to subclasses, which is, uhm, awkward to say the least (see _to_builtin() below for an example).
- delimiter: str = ','#
- double_quote: bool = True#
- escape_char: str | None#
- line_terminator: str = '\r\n'#
- quote_char: str = '"'#
- quoting: int#
- skip_initial_space: bool = False#
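To make the awkwardness concrete, here is a sketch of converting an instance-based dialect into the subclass form the standard `csv` module expects. The dataclass is a local stand-in, `to_builtin` is a hypothetical counterpart to the private method mentioned above, and the `quoting` and `escape_char` defaults are assumptions.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dialect:
    """Local stand-in: dialect settings held as instance attributes."""
    delimiter: str = ","
    quote_char: str = '"'
    escape_char: Optional[str] = None  # assumed default
    double_quote: bool = True
    skip_initial_space: bool = False
    line_terminator: str = "\r\n"
    quoting: int = csv.QUOTE_MINIMAL  # assumed default


def to_builtin(dialect: Dialect) -> type:
    """Build a csv.Dialect subclass carrying the instance's values as class attributes."""

    class _Builtin(csv.Dialect):
        delimiter = dialect.delimiter
        quotechar = dialect.quote_char
        escapechar = dialect.escape_char
        doublequote = dialect.double_quote
        skipinitialspace = dialect.skip_initial_space
        lineterminator = dialect.line_terminator
        quoting = dialect.quoting

    return _Builtin


# The csv module receives a class, not an instance.
semicolon = to_builtin(Dialect(delimiter=";"))
rows = list(csv.reader(io.StringIO("a;b\r\n1;2\r\n"), dialect=semicolon))
```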
- class lector.Format[source]#
Holds all parameters needed to successfully read a CSV file.
- columns: list[str] | None#
- dialect: lector.csv.dialects.Dialect | None#
- encoding: str | None = 'utf-8'#
- preamble: int | None = 0#
- class lector.Preambles[source]#
Registry to manage preamble detectors.
- DETECTORS#
- classmethod detect(buffer, detectors=None, log=False)[source]#
Get result of first preamble detector matching the csv buffer.
Matching here means detecting more than 0 rows of preamble text; the result is the number of rows to skip.
If no detectors are provided (as an ordered sequence), all registered detector classes are tried in registration order with default parameters.
- Parameters:
buffer (TextIO) –
detectors (collections.abc.Iterable[PreambleDetector] | None) –
log (bool) –
- Return type:
int
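The first-match logic described above can be sketched as follows. The detectors here are plain functions rather than registered `PreambleDetector` classes, their names are hypothetical, and returning 0 when nothing matches is an assumption.

```python
import io


def leading_blank_lines(buffer):
    """Hypothetical detector: count empty lines at the start of the buffer."""
    n = 0
    for line in buffer:
        if line.strip():
            break
        n += 1
    return n


def leading_comment_lines(buffer, marker="#"):
    """Hypothetical detector: count lines starting with a comment marker."""
    n = 0
    for line in buffer:
        if not line.startswith(marker):
            break
        n += 1
    return n


def detect(buffer, detectors):
    """Return the result of the first detector that finds more than 0 preamble rows."""
    for detector in detectors:
        buffer.seek(0)  # every detector sees the buffer from the start
        n_rows = detector(buffer)
        if n_rows > 0:
            buffer.seek(0)
            return n_rows
    buffer.seek(0)
    return 0  # assumed result when no detector matches


with_preamble = io.StringIO("# comment one\n# comment two\na,b\n1,2\n")
n_skip = detect(with_preamble, [leading_blank_lines, leading_comment_lines])

no_preamble = detect(io.StringIO("a,b\n1,2\n"), [leading_blank_lines, leading_comment_lines])
```

Because the order of detectors decides which one wins, a caller who passes its own sequence controls the precedence.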
- lector.schema_view(schema, title='Schema', padding=1)[source]#
Make a rich view for arrow schema.
- Parameters:
schema (pyarrow.Schema) –
title (str | None) –
padding (int) –
- Return type:
rich.table.Table
- lector.table_view(tbl, title=None, n_rows_max=10, n_columns_max=6, max_column_width=20, padding=1)[source]#
Pyarrow table to rich table.
- Parameters:
tbl (pyarrow.Table) –
title (str | None) –
n_rows_max (int) –
n_columns_max (int) –
max_column_width (int) –
padding (int) –
- Return type:
rich.table.Table