lector.csv

Subpackage for smart parsing of CSV files.

Helps detect encodings, preambles (initial junk to skip), CSV dialects, etc.

Classes

ArrowReader

Use base class detection methods to configure a pyarrow.csv.read_csv() call.

Chardet

An encoding detector using cchardet if the default utf-8 generates too many errors.

Dialect

A more convenient class for dialects than Python's built-in.

Format

Holds all parameters needed to successfully read a CSV file.

Preambles

Registry to manage preamble detectors.

PySniffer

Use Python's built-in csv sniffer.

Reader

Base class for CSV readers.

exception lector.csv.EmptyFileError

Bases: Exception

Raised when a binary file read() returns 0 bytes.
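
As a rough illustration of when such an error might surface, the sketch below defines a local stand-in exception and a hypothetical helper, `read_head`. Neither is part of lector, and the library may perform this check elsewhere.

```python
import io
from typing import BinaryIO


class EmptyFileError(Exception):
    """Local stand-in for lector.csv.EmptyFileError (sketch only)."""


def read_head(fp: BinaryIO, n_bytes: int = 1024) -> bytes:
    """Hypothetical helper: read up to n_bytes and refuse empty input."""
    head = fp.read(n_bytes)
    if len(head) == 0:
        # Nothing to analyze: no encoding, preamble or dialect can be inferred.
        raise EmptyFileError("read() returned 0 bytes")
    return head


print(read_head(io.BytesIO(b"a,b\n1,2\n")))
```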

class lector.csv.ArrowReader(fp, encoding=None, dialect=None, preamble=None, log=True)

Bases: lector.csv.abc.Reader

Use base class detection methods to configure a pyarrow.csv.read_csv() call.

configure(format)
Parameters:

format (lector.csv.abc.Format) –

Return type:

dict
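
To make the idea of turning a detected format into reader options concrete, here is a minimal sketch. `DialectSketch`, `FormatSketch` and `configure_sketch` are hypothetical names, and the layout of the returned dict is an assumption, not lector's actual output. The keyword names (`encoding`, `skip_rows`, `delimiter`, `quote_char`, `double_quote`, `escape_char`) are those accepted by `pyarrow.csv.ReadOptions` and `pyarrow.csv.ParseOptions`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DialectSketch:
    """Hypothetical subset of the Dialect fields documented on this page."""
    delimiter: str = ","
    quote_char: str = '"'
    double_quote: bool = True
    escape_char: str | None = None


@dataclass
class FormatSketch:
    """Hypothetical stand-in for Format."""
    encoding: str | None = "utf-8"
    preamble: int | None = 0
    dialect: DialectSketch = field(default_factory=DialectSketch)


def configure_sketch(fmt: FormatSketch) -> dict:
    """Build plain keyword dicts that could be unpacked into pyarrow options."""
    d = fmt.dialect
    return {
        "read_options": {
            "encoding": fmt.encoding or "utf-8",
            "skip_rows": fmt.preamble or 0,  # skip the detected preamble lines
        },
        "parse_options": {
            "delimiter": d.delimiter,
            "quote_char": d.quote_char,
            "double_quote": d.double_quote,
            # pyarrow expects False (not None) to disable escaping
            "escape_char": d.escape_char if d.escape_char is not None else False,
        },
    }


opts = configure_sketch(FormatSketch(preamble=3, dialect=DialectSketch(delimiter=";")))
print(opts)
```

With pyarrow installed, these dicts could be passed on as `pyarrow.csv.read_csv(fp, read_options=pyarrow.csv.ReadOptions(**opts["read_options"]), parse_options=pyarrow.csv.ParseOptions(**opts["parse_options"]))`.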

parse(types=None, timestamp_formats=None, null_values=None)

Invoke Arrow’s parser with inferred CSV format.

Parameters:
  • types (str | TypeDict | None) –

  • timestamp_formats (str | list[str] | None) –

  • null_values (str | collections.abc.Iterable[str] | None) –

Return type:

pyarrow.Table

skip_invalid_row(row)
Parameters:

row (pyarrow.csv.InvalidRow) –

Return type:

str

class lector.csv.Chardet

Bases: EncodingDetector

An encoding detector using cchardet if the default utf-8 generates too many errors.

confidence_threshold: float = 0.6

Minimum level of confidence to accept an encoding automatically detected by cchardet.

error_threshold: float = 0.001

A greater proportion of decoding errors than this will be considered a failed encoding.

n_bytes: int

Use this many bytes to detect encoding.

detect(buffer)

Somewhat ‘opinionated’ encoding detection.

Assumes utf-8 as the most common encoding, falls back on cchardet detection, and, if all else fails, uses windows-1250 if the encoding is latin-like.

Parameters:

buffer (BinaryIO) –

Return type:

str
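
The cascade described above might look roughly like the following sketch. It is not lector's implementation: `detect_encoding` and its `guess` callable (a stand-in for a statistical detector such as cchardet, returning an encoding name and a confidence) are hypothetical, and the "latin-like" check before the final fallback is omitted for brevity.

```python
from __future__ import annotations

from typing import Callable, Optional, Tuple

Guess = Tuple[Optional[str], float]  # (encoding name, confidence in [0, 1])


def detect_encoding(
    data: bytes,
    guess: Callable[[bytes], Guess] | None = None,
    error_threshold: float = 0.001,
    confidence_threshold: float = 0.6,
    fallback: str = "windows-1250",
) -> str:
    """Sketch of a utf-8-first detection cascade (assumptions noted above)."""
    # Step 1: try utf-8 and count replacement characters as decoding errors.
    text = data.decode("utf-8", errors="replace")
    n_errors = text.count("\ufffd")
    if not text or n_errors / len(text) <= error_threshold:
        return "utf-8"

    # Step 2: ask the (hypothetical) statistical detector, if one was given.
    if guess is not None:
        encoding, confidence = guess(data)
        if encoding and confidence >= confidence_threshold:
            return encoding

    # Step 3: last resort.
    return fallback


raw = "žluťoučký kůň".encode("windows-1250")
print(detect_encoding("héllo".encode("utf-8")), detect_encoding(raw))
```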

class lector.csv.Dialect

A more convenient class for dialects than Python’s built-in.

The built-in Dialect is a class with class attributes only, so instead of instances of that class, Python wants you to pass references to subclasses around, which is, uhm, awkward to say the least (see to_builtin() below for an example).

delimiter: str = ','
double_quote: bool = True
escape_char: str | None
line_terminator: str = '\r\n'
quote_char: str = '"'
quoting: int
skip_initial_space: bool = False
classmethod from_builtin(dialect)

Make instance from built-in dialect class configured for reliable reading(!).

Parameters:

dialect (str | PyDialectT) –

Return type:

Dialect

to_builtin()

Make a subclass of built-in Dialect from this instance.

Return type:

PyDialectT
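
To illustrate the awkwardness mentioned above, the following sketch converts an instance-based dialect into a dynamically created subclass of the built-in `csv.Dialect`. `DialectSketch` is a hypothetical stand-in using the field names listed on this page; the defaults for `escape_char` and `quoting` are assumptions, and lector's own `to_builtin()` may differ in detail.

```python
from __future__ import annotations

import csv
import io
from dataclasses import dataclass


@dataclass
class DialectSketch:
    delimiter: str = ","
    quote_char: str = '"'
    escape_char: str | None = None      # assumed default
    double_quote: bool = True
    skip_initial_space: bool = False
    line_terminator: str = "\r\n"
    quoting: int = csv.QUOTE_MINIMAL    # assumed default

    def to_builtin(self) -> type[csv.Dialect]:
        """Create a csv.Dialect subclass carrying this instance's settings."""
        attrs = {
            "delimiter": self.delimiter,
            "quotechar": self.quote_char,
            "escapechar": self.escape_char,
            "doublequote": self.double_quote,
            "skipinitialspace": self.skip_initial_space,
            "lineterminator": self.line_terminator,
            "quoting": self.quoting,
        }
        # The csv module reads these as class attributes, hence a new class.
        return type("SketchDialect", (csv.Dialect,), attrs)


semicolon = DialectSketch(delimiter=";").to_builtin()
rows = list(csv.reader(io.StringIO("a;b\n1;2\n"), dialect=semicolon))
print(rows)
```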

class lector.csv.Format

Holds all parameters needed to successfully read a CSV file.

columns: list[str] | None
dialect: lector.csv.dialects.Dialect | None
encoding: str | None = 'utf-8'
preamble: int | None = 0
__rich__()
Return type:

rich.table.Table

class lector.csv.Preambles

Registry to manage preamble detectors.

DETECTORS
classmethod detect(buffer, detectors=None, log=False)

Get result of first preamble detector matching the csv buffer.

Matching here means detecting more than 0 rows of preamble text, and result is the number of rows to skip.

If no detectors are provided (as ordered sequence), all registered detector classes are tried in registered order and using default parameters.

Parameters:
  • buffer (TextIO) –

  • detectors (collections.abc.Iterable[PreambleDetector] | None) –

  • log (bool) –

Return type:

int

classmethod register(registered)
Parameters:

registered (type) –

Return type:

type
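
The registry pattern described for `detect()` and `register()` can be sketched as follows. `PreamblesSketch` and the two detector classes are hypothetical and only mimic the documented behaviour: detector classes are tried in registration order with default parameters, and the first result greater than 0 wins.

```python
from __future__ import annotations

import io
from typing import Iterable, TextIO


class PreamblesSketch:
    DETECTORS: list[type] = []

    @classmethod
    def register(cls, registered: type) -> type:
        """Class decorator: remember the detector class, return it unchanged."""
        cls.DETECTORS.append(registered)
        return registered

    @classmethod
    def detect(cls, buffer: TextIO, detectors: Iterable | None = None) -> int:
        """Return the row count of the first detector reporting a preamble."""
        if detectors is None:
            # Instantiate every registered class with default parameters.
            detectors = (det() for det in cls.DETECTORS)
        for detector in detectors:
            buffer.seek(0)
            n_rows = detector.detect(buffer)
            if n_rows > 0:
                break
        else:
            n_rows = 0  # no detector matched (or none were given)
        buffer.seek(0)
        return n_rows


@PreamblesSketch.register
class HashComments:
    """Hypothetical detector: leading lines starting with '#'."""

    def detect(self, buffer: TextIO) -> int:
        n = 0
        for line in buffer:
            if not line.startswith("#"):
                break
            n += 1
        return n


@PreamblesSketch.register
class BlankLines:
    """Hypothetical detector: leading whitespace-only lines."""

    def detect(self, buffer: TextIO) -> int:
        n = 0
        for line in buffer:
            if line.strip():
                break
            n += 1
        return n


print(PreamblesSketch.detect(io.StringIO("# exported\n# by tool\nx,y\n1,2\n")))
```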

class lector.csv.PySniffer

Bases: DialectDetector

Use Python’s built-in csv sniffer.

delimiters: collections.abc.Iterable[str]
log: bool = False
n_rows: int
detect(buffer)

Detect a dialect we can read(!) a CSV with using the python sniffer.

Note that the sniffer is not reliable for detecting quoting, quotechar etc., but reasonable defaults are almost guaranteed to work with most parsers. E.g. the lineterminator is not even configurable in pyarrow’s csv reader, nor in pandas (python engine).

Parameters:

buffer (TextIO) –

Return type:

Dialect
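
One way to follow the advice above is to trust the sniffer only for the delimiter and to keep conservative defaults for everything else. The sketch below does this with the standard library's `csv.Sniffer`; the function name `sniff_delimiter`, the default candidate delimiters, the default `n_rows` and the returned dict are assumptions for illustration, not lector's actual values.

```python
import csv
import io
from itertools import islice
from typing import TextIO


def sniff_delimiter(buffer: TextIO, delimiters: str = ",;\t|", n_rows: int = 100) -> dict:
    """Sniff only the delimiter from the first n_rows lines of the buffer."""
    sample = "".join(islice(buffer, n_rows))
    buffer.seek(0)  # leave the buffer where we found it
    try:
        delimiter = csv.Sniffer().sniff(sample, delimiters=delimiters).delimiter
    except csv.Error:
        delimiter = ","  # sniffing failed: fall back to the most common case
    # Quoting parameters are deliberately not taken from the sniffer.
    return {"delimiter": delimiter, "quote_char": '"', "double_quote": True}


print(sniff_delimiter(io.StringIO("a;b;c\n1;2;3\n4;5;6\n")))
```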

class lector.csv.Reader(fp, encoding=None, dialect=None, preamble=None, log=True)

Bases: abc.ABC

Base class for CSV readers.

__call__
analyze()

Infer all parameters required for reading a csv file.

decode(fp)

Make sure we have a text buffer.

Parameters:

fp (FileLike) –

Return type:

TextIO

classmethod detect_columns(buffer, dialect)

Extract column names from buffer pointing at header row.

Return type:

list[str]
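
As a minimal sketch of the idea, column names can be read with the standard library's `csv.reader` once the buffer is positioned at the header row. The helper `header_columns` and its keyword parameters are hypothetical; lector passes a dialect object instead.

```python
from __future__ import annotations

import csv
import io
from typing import TextIO


def header_columns(buffer: TextIO, delimiter: str = ",", quote_char: str = '"') -> list[str]:
    """Parse the next row of the buffer and return it as column names."""
    reader = csv.reader(buffer, delimiter=delimiter, quotechar=quote_char)
    # An exhausted buffer yields no row, so default to an empty list.
    return next(reader, [])


print(header_columns(io.StringIO('id;"full name";age\n1;Ann;30\n'), delimiter=";"))
```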

detect_dialect(buffer)

Detect separator, quote character etc.

Parameters:

buffer (TextIO) –

Return type:

dict

detect_preamble(buffer)

Detect the number of junk lines at the start of the file.

Parameters:

buffer (TextIO) –

Return type:

int

abstract parse(*args, **kwds)

Parse the file pointer or text buffer. Args are forwarded to read().

Return type:

Any

read(*args, **kwds)
Return type:

Any
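
The relationship between `analyze()`, `parse()` and `read()` resembles a template-method pattern. The sketch below shows one plausible arrangement, assuming that `read()` first calls `analyze()` and then hands its arguments to `parse()`. That ordering, along with the names `ReaderSketch` and `ListReader` and the placeholder format dict, is an assumption and not taken from lector's source.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Any


class ReaderSketch(ABC):
    """Hypothetical base class: subclasses only implement parse()."""

    def __init__(self, fp):
        self.fp = fp
        self.format = None

    def analyze(self) -> None:
        # Placeholder for encoding, preamble and dialect detection.
        self.format = {"delimiter": ","}

    @abstractmethod
    def parse(self, *args, **kwds) -> Any:
        """Parse self.fp using the inferred self.format."""

    def read(self, *args, **kwds) -> Any:
        # Assumed flow: infer the format first, then delegate to parse().
        self.analyze()
        return self.parse(*args, **kwds)


class ListReader(ReaderSketch):
    """Concrete reader returning rows as lists of strings."""

    def parse(self, skip_header: bool = False) -> list:
        rows = list(csv.reader(self.fp, delimiter=self.format["delimiter"]))
        return rows[1:] if skip_header else rows


print(ListReader(io.StringIO("a,b\n1,2\n")).read(skip_header=True))
```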