lector.csv.arrow

Classes

ArrowReader

Use base class detection methods to configure a pyarrow.csv.read_csv() call.

Functions

clean_column_names(names)

Handle empty and duplicate column names.

transcode(fp[, codec_in, codec_out, errors])

Safely transcode any readable byte stream from decoder to encoder codecs.

Attributes

MAX_MSG_LEN

SKIPPED_MSG_N_MAX

TypeDict

class lector.csv.arrow.ArrowReader(fp, encoding=None, dialect=None, preamble=None, log=True)

Bases: lector.csv.abc.Reader

Use base class detection methods to configure a pyarrow.csv.read_csv() call.

configure(format)
Parameters:

format (lector.csv.abc.Format) –

Return type:

dict
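
The source states only that configure() takes a detected format and returns a dict. As a rough illustration of what such a mapping might look like, the sketch below turns a hypothetical format description into keyword arguments for pyarrow's CSV option classes. `FormatSketch`, its fields, and the layout of the returned dict are assumptions made for this example, not lector's actual structures. Only the option names (`encoding`, `skip_rows`, `delimiter`, `quote_char`, `double_quote`, `escape_char`) come from pyarrow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormatSketch:
    """Hypothetical stand-in for a detected CSV format (not lector's Format)."""

    encoding: str = "utf-8"
    preamble: int = 0  # number of leading lines to skip before the header
    delimiter: str = ","
    quote_char: str = '"'
    escape_char: Optional[str] = None
    double_quote: bool = True


def configure_sketch(fmt):
    """Build keyword arguments for pyarrow.csv.ReadOptions and ParseOptions."""
    return {
        "read_options": {
            "encoding": fmt.encoding,
            "skip_rows": fmt.preamble,
        },
        "parse_options": {
            "delimiter": fmt.delimiter,
            "quote_char": fmt.quote_char,
            "double_quote": fmt.double_quote,
            # pyarrow expects False (not None) to disable escaping
            "escape_char": fmt.escape_char or False,
        },
    }


config = configure_sketch(FormatSketch(encoding="latin-1", preamble=2, delimiter=";"))
```

With pyarrow installed, each inner dict could be unpacked into the matching option class, for example `pyarrow.csv.ReadOptions(**config["read_options"])`.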

parse(types=None, timestamp_formats=None, null_values=None)

Invoke Arrow’s parser with inferred CSV format.

Parameters:
  • types (str | TypeDict | None) –

  • timestamp_formats (str | list[str] | None) –

  • null_values (str | collections.abc.Iterable[str] | None) –

Return type:

pyarrow.Table

skip_invalid_row(row)
Parameters:

row (pyarrow.csv.InvalidRow) –

Return type:

str
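
pyarrow calls an invalid-row handler with an InvalidRow (fields `expected_columns`, `actual_columns`, `number`, `text`) and expects the string "skip" or "error" in return. The sketch below shows one way a handler with this signature might behave. How lector actually uses MAX_MSG_LEN and SKIPPED_MSG_N_MAX is not stated in the source, so the truncation and message cap here are assumptions. `FakeInvalidRow` and `SkipLogger` are hypothetical names that keep the example runnable without pyarrow.

```python
from typing import NamedTuple, Optional

MAX_MSG_LEN = 200  # value from the module attributes; its use here is assumed
SKIPPED_MSG_N_MAX = 20  # value from the module attributes; its use here is assumed


class FakeInvalidRow(NamedTuple):
    """Stand-in with the same fields as pyarrow.csv.InvalidRow."""

    expected_columns: int
    actual_columns: int
    number: Optional[int]
    text: str


class SkipLogger:
    """Collects a bounded number of messages about skipped rows."""

    def __init__(self):
        self.n_skipped = 0
        self.messages = []

    def skip_invalid_row(self, row):
        self.n_skipped += 1
        # Record only the first SKIPPED_MSG_N_MAX rows to keep the log short
        if self.n_skipped <= SKIPPED_MSG_N_MAX:
            text = row.text
            if len(text) > MAX_MSG_LEN:
                text = text[:MAX_MSG_LEN] + "..."
            self.messages.append(
                f"Skipping row {row.number}: expected {row.expected_columns} "
                f"columns, got {row.actual_columns}: {text}"
            )
        # Tell the parser to drop the row and continue
        return "skip"


logger = SkipLogger()
result = logger.skip_invalid_row(FakeInvalidRow(3, 5, 17, "a,b,c,d,e"))
```

A bound method like `logger.skip_invalid_row` could then be passed as `invalid_row_handler` to `pyarrow.csv.ParseOptions`.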

lector.csv.arrow.clean_column_names(names)

Handle empty and duplicate column names.

Parameters:

names (list[str]) –

Return type:

list[str]
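
The source does not say how empty or duplicate names are rewritten. As a minimal sketch of the general idea, the hypothetical helper below gives empty names a positional placeholder and appends a numeric suffix to repeats. The placeholder and suffix formats are assumptions and may differ from what lector produces.

```python
def clean_names_sketch(names):
    """Return names with empty entries filled in and duplicates made unique."""
    cleaned = []
    seen = set()
    for i, name in enumerate(names):
        # Assumed placeholder for empty or whitespace-only names
        base = name.strip() or f"Unnamed_{i}"
        candidate = base
        n = 0
        # Assumed suffix scheme: append _1, _2, ... until the name is unique
        while candidate in seen:
            n += 1
            candidate = f"{base}_{n}"
        seen.add(candidate)
        cleaned.append(candidate)
    return cleaned


print(clean_names_sketch(["a", "", "a", "b", "a"]))
# ['a', 'Unnamed_1', 'a_1', 'b', 'a_2']
```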

lector.csv.arrow.transcode(fp, codec_in='utf-8', codec_out='utf-8', errors='replace')

Safely transcode any readable byte stream from decoder to encoder codecs.

Arrow accepts only byte streams with an optional encoding, and has no option for handling codec errors automatically. It also appears not to work with a Python stream recoder when the encoding is the generic “utf-16” rather than the more specific “utf-16-le” or “utf-16-be”.

Parameters:
  • fp (lector.csv.abc.FileLike) –

  • codec_in (str) –

  • codec_out (str) –

  • errors (str) –

Return type:

codecs.StreamRecoder
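
The recoding described above can be sketched with the standard library alone. The example below is not lector's implementation. It uses `codecs.EncodedFile`, which returns a `codecs.StreamRecoder`, to decode the wrapped stream with one codec and re-encode what is read with another. `transcode_sketch` is a hypothetical name, and the `errors` handling shown is simply the standard codec error mode.

```python
import codecs
import io


def transcode_sketch(fp, codec_in="utf-8", codec_out="utf-8", errors="replace"):
    """Wrap a readable byte stream so that reads yield codec_out bytes.

    Bytes read from fp are decoded with codec_in, then encoded with codec_out.
    Undecodable input is handled according to errors instead of raising.
    """
    return codecs.EncodedFile(
        fp, data_encoding=codec_out, file_encoding=codec_in, errors=errors
    )


# Latin-1 input is re-encoded as UTF-8
latin = io.BytesIO("café".encode("latin-1"))
print(transcode_sketch(latin, codec_in="latin-1").read())
# b'caf\xc3\xa9'

# An invalid UTF-8 byte becomes U+FFFD rather than raising an error
broken = io.BytesIO(b"ab\xffcd")
print(transcode_sketch(broken).read())
# b'ab\xef\xbf\xbdcd'
```

Because the wrapper is itself a readable byte stream, it could in principle be handed to a byte-oriented parser that expects UTF-8 input.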

lector.csv.arrow.MAX_MSG_LEN = 200
lector.csv.arrow.SKIPPED_MSG_N_MAX = 20
lector.csv.arrow.TypeDict