lector.csv#
Subpackage for smart parsing of CSV files.
Helps detect encodings, preambles (initial junk lines to skip), CSV dialects, etc.
Submodules#
Classes#
ArrowReader | Use base class detection methods to configure a pyarrow.csv.read_csv() call.
Chardet | An encoding detector using cchardet if the default utf-8 generates too many errors.
Dialect | A more convenient class for dialects than Python's built-in.
Format | Holds all parameters needed to successfully read a CSV file.
Preambles | Registry to manage preamble detectors.
PySniffer | Use Python's built-in csv sniffer.
Reader | Base class for CSV readers.
- exception lector.csv.EmptyFileError[source]#
Bases: Exception
Raised when a binary file read() returns 0 bytes.
- class lector.csv.ArrowReader(fp, encoding=None, dialect=None, preamble=None, log=True)[source]#
Bases: lector.csv.abc.Reader
Use base class detection methods to configure a pyarrow.csv.read_csv() call.
- Parameters:
fp (FileLike) –
encoding (str | lector.csv.encodings.EncodingDetector | None) –
dialect (dict | lector.csv.dialects.Dialect | lector.csv.dialects.DialectDetector | None) –
preamble (int | PreambleRegistry | None) –
log (bool) –
- configure(format)[source]#
- Parameters:
format (lector.csv.abc.Format) –
- Return type:
dict
- class lector.csv.Chardet[source]#
Bases: EncodingDetector
An encoding detector using cchardet if the default utf-8 generates too many errors.
- confidence_threshold: float = 0.6#
Minimum level of confidence to accept an encoding automatically detected by cchardet.
- error_threshold: float = 0.001#
A greater proportion of decoding errors than this will be considered a failed encoding.
- n_bytes: int#
Use this many bytes to detect encoding.
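The error_threshold check described above can be sketched with the standard library alone. This is an illustration under assumptions, not lector's implementation: the helper names decoding_error_rate and utf8_is_acceptable are hypothetical, and errors are counted as U+FFFD replacement characters, which would overcount if the input legitimately contained that character.

```python
def decoding_error_rate(data: bytes, encoding: str = "utf-8") -> float:
    """Return the proportion of decoded characters that are replacements."""
    if not data:
        return 0.0
    # errors="replace" substitutes U+FFFD for every undecodable byte sequence.
    text = data.decode(encoding, errors="replace")
    return text.count("\ufffd") / len(text)


def utf8_is_acceptable(data: bytes, error_threshold: float = 0.001) -> bool:
    """Accept utf-8 unless the error rate exceeds the threshold (assumed rule)."""
    return decoding_error_rate(data) <= error_threshold
```

Only when such a check fails would a detector like cchardet need to be consulted, and its guess accepted only above confidence_threshold.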
- class lector.csv.Dialect[source]#
A more convenient class for dialects than Python’s built-in.
The built-in Dialect is a class with class attributes only, so instead of passing instances of that class around, Python expects you to pass references to subclasses, which is awkward to say the least (see _to_builtin() below for an example).
- delimiter: str = ','#
- double_quote: bool = True#
- escape_char: str | None#
- line_terminator: str = '\r\n'#
- quote_char: str = '"'#
- quoting: int#
- skip_initial_space: bool = False#
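To illustrate the awkwardness described above, here is a minimal sketch of an instance-based dialect that builds a csv.Dialect subclass on demand. The class name SimpleDialect and the method to_builtin are hypothetical stand-ins, and the defaults for escape_char and quoting are assumptions, since the listing above does not show them.

```python
import csv
from dataclasses import dataclass
from typing import Optional, Type


@dataclass
class SimpleDialect:
    delimiter: str = ","
    double_quote: bool = True
    escape_char: Optional[str] = None  # assumed default
    line_terminator: str = "\r\n"
    quote_char: str = '"'
    quoting: int = csv.QUOTE_MINIMAL  # assumed default
    skip_initial_space: bool = False

    def to_builtin(self) -> Type[csv.Dialect]:
        """Build a csv.Dialect subclass carrying this instance's values."""
        attrs = {
            "delimiter": self.delimiter,
            "doublequote": self.double_quote,
            "escapechar": self.escape_char,
            "lineterminator": self.line_terminator,
            "quotechar": self.quote_char,
            "quoting": self.quoting,
            "skipinitialspace": self.skip_initial_space,
        }
        # The csv module takes dialects as classes, so create one dynamically.
        return type("BuiltinDialect", (csv.Dialect,), attrs)
```

The returned class can be passed straight to the standard library, e.g. csv.reader(buffer, dialect=SimpleDialect(delimiter=";").to_builtin()).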
- class lector.csv.Format[source]#
Holds all parameters needed to successfully read a CSV file.
- columns: list[str] | None#
- dialect: lector.csv.dialects.Dialect | None#
- encoding: str | None = 'utf-8'#
- preamble: int | None = 0#
- class lector.csv.Preambles[source]#
Registry to manage preamble detectors.
- DETECTORS#
- classmethod detect(buffer, detectors=None, log=False)[source]#
Get result of first preamble detector matching the csv buffer.
Matching here means detecting more than 0 rows of preamble text, and result is the number of rows to skip.
If no detectors are provided (as an ordered sequence), all registered detector classes are tried in registration order with default parameters.
- Parameters:
buffer (TextIO) –
detectors (collections.abc.Iterable[PreambleDetector] | None) –
log (bool) –
- Return type:
int
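The "first matching detector wins" behaviour described for detect() can be sketched as follows. This is a simplified illustration, not lector's code: SketchPreambles, register and leading_blank_lines are hypothetical names, detectors are plain callables rather than detector classes, and returning 0 when nothing matches is an assumption.

```python
import io
from typing import Callable, Iterable, List, Optional, TextIO

Detector = Callable[[TextIO], int]


class SketchPreambles:
    DETECTORS: List[Detector] = []

    @classmethod
    def register(cls, detector: Detector) -> Detector:
        """Add a detector to the registry, preserving registration order."""
        cls.DETECTORS.append(detector)
        return detector

    @classmethod
    def detect(cls, buffer: TextIO, detectors: Optional[Iterable[Detector]] = None) -> int:
        """Return the row count from the first detector reporting more than 0 rows."""
        candidates = cls.DETECTORS if detectors is None else detectors
        for detector in candidates:
            buffer.seek(0)  # every detector sees the buffer from the start
            skip = detector(buffer)
            if skip > 0:
                return skip
        buffer.seek(0)
        return 0  # assumption: no match means nothing to skip


@SketchPreambles.register
def leading_blank_lines(buffer: TextIO) -> int:
    """Hypothetical detector: count blank lines before the first non-blank line."""
    n = 0
    for line in buffer:
        if line.strip():
            return n
        n += 1
    return 0  # only blank lines: there is no data row to skip to
```

For example, SketchPreambles.detect(io.StringIO("\n\nid,name\n1,a\n")) reports two rows to skip.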
- class lector.csv.PySniffer[source]#
Bases: DialectDetector
Use Python’s built-in csv sniffer.
- delimiters: collections.abc.Iterable[str]#
- log: bool = False#
- n_rows: int#
- detect(buffer)[source]#
Use the Python sniffer to detect a dialect with which we can read(!) a CSV.
Note that the sniffer is not reliable for detecting quoting, quotechar etc., but reasonable defaults are almost guaranteed to work with most parsers. E.g. the lineterminator is not even configurable in pyarrow’s csv reader, nor in pandas (python engine).
- Parameters:
buffer (TextIO) –
- Return type:
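A minimal sketch of sniffing only the delimiter with the standard library, in the spirit of the note above that the sniffer's other guesses are unreliable. The function name sniff_delimiter, its default values, and the comma fallback are assumptions for illustration; they are not taken from lector.

```python
import csv
import io
from itertools import islice
from typing import Iterable, TextIO


def sniff_delimiter(buffer: TextIO, delimiters: Iterable[str] = ",;\t|", n_rows: int = 100) -> str:
    """Guess the delimiter from the first n_rows lines, leaving the buffer position unchanged."""
    pos = buffer.tell()
    sample = "".join(islice(buffer, n_rows))
    buffer.seek(pos)
    try:
        # csv.Sniffer.sniff expects the candidate delimiters as a single string.
        return csv.Sniffer().sniff(sample, delimiters="".join(delimiters)).delimiter
    except csv.Error:
        return ","  # assumed fallback when the sniffer cannot decide
```

Quoting-related settings are deliberately left at their defaults here, since those are the sniffer's least reliable guesses.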
- class lector.csv.Reader(fp, encoding=None, dialect=None, preamble=None, log=True)[source]#
Bases: abc.ABC
Base class for CSV readers.
- Parameters:
fp (FileLike) –
encoding (str | lector.csv.encodings.EncodingDetector | None) –
dialect (dict | lector.csv.dialects.Dialect | lector.csv.dialects.DialectDetector | None) –
preamble (int | PreambleRegistry | None) –
log (bool) –
- __call__#
- decode(fp)[source]#
Make sure we have a text buffer.
- Parameters:
fp (FileLike) –
- Return type:
TextIO
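What "make sure we have a text buffer" could look like is sketched below. It is a sketch under assumptions: ensure_text is a hypothetical helper, it handles only already-open file objects (not paths), the local EmptyFileError is a stand-in for the exception documented above, and tying that exception to this step is an assumption.

```python
import io
from typing import BinaryIO, TextIO, Union


class EmptyFileError(Exception):
    """Local stand-in for the exception of the same name documented above."""


def ensure_text(fp: Union[BinaryIO, TextIO], encoding: str = "utf-8") -> TextIO:
    """Return fp unchanged if it is already text, else decode its bytes."""
    if isinstance(fp, io.TextIOBase):
        return fp
    data = fp.read()
    if len(data) == 0:
        raise EmptyFileError("binary read() returned 0 bytes")
    # Undecodable bytes become U+FFFD instead of raising.
    return io.StringIO(data.decode(encoding, errors="replace"))
```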
- classmethod detect_columns(buffer, dialect)[source]#
Extract column names from buffer pointing at header row.
- Parameters:
buffer (TextIO) –
dialect (lector.csv.dialects.Dialect) –
- Return type:
list[str]
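Extracting column names from a buffer positioned at the header row can be sketched with csv.reader. The helper read_header is hypothetical and takes the delimiter and quote character directly instead of a Dialect object; restoring the buffer position afterwards is a design choice of this sketch, not documented behaviour.

```python
import csv
import io
from typing import List, TextIO


def read_header(buffer: TextIO, delimiter: str = ",", quotechar: str = '"') -> List[str]:
    """Parse the first row at the current position as column names."""
    pos = buffer.tell()
    reader = csv.reader(buffer, delimiter=delimiter, quotechar=quotechar)
    row = next(reader, [])  # an empty buffer yields no columns
    buffer.seek(pos)  # leave the buffer where we found it
    return row
```

Using the csv parser rather than str.split keeps quoted names that contain the delimiter intact.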
- detect_dialect(buffer)[source]#
Detect separator, quote character etc.
- Parameters:
buffer (TextIO) –
- Return type:
dict
- detect_preamble(buffer)[source]#
Detect the number of junk lines at the start of the file.
- Parameters:
buffer (TextIO) –
- Return type:
int