`lector.csv.encodings`#

Helpers for detecting character encodings in binary buffers.

Classes#

`Chardet`	An encoding detector that falls back on cchardet when the default utf-8 generates too many errors.
`EncodingDetector`	Base class specifying the interface for all encoding detectors.

Functions#

`decoding_errors`(bs, encoding[, prop])	The proportion of characters that couldn't be decoded correctly.
`detect_bom`(bs)	Detect encoding by looking for a BOM at the start of the file.

Attributes#

`BOMS`	Map BOM (Byte-order mark) to encoding.
`CODEC_ERR_CHAR`	Character substituted for bytes that cannot be decoded.
`MAX_INT32`	Cannot read more than this number of bytes at once to detect encoding.

class lector.csv.encodings.Chardet[source]#

Bases: EncodingDetector

An encoding detector that falls back on cchardet when the default utf-8 generates too many errors.

confidence_threshold: float = 0.6[source]#: Minimum level of confidence to accept an encoding automatically detected by cchardet.

error_threshold: float = 0.001[source]#: A greater proportion of decoding errors than this will be considered a failed encoding.

detect(buffer)[source]#

Somewhat ‘opinionated’ encoding detection.

Assumes utf-8 is the most common encoding and tries it first. If that generates too many decoding errors, falls back on cchardet detection; if that also fails, falls back on windows-1250, provided the encoding is latin-like.
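The fallback order described above can be illustrated with a minimal sketch. This is not the library's implementation: `detect_with_fallback`, `error_rate` and the `guess` callable are hypothetical names, the two thresholds simply reuse the documented defaults, and the "latin-like" check before the final fallback is left out because the source does not say how it is performed.

```python
from typing import Callable, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.6  # documented default of Chardet.confidence_threshold
ERROR_THRESHOLD = 0.001     # documented default of Chardet.error_threshold

# Hypothetical signature for a statistical guesser: bytes -> (encoding, confidence)
Guess = Callable[[bytes], Tuple[Optional[str], float]]


def error_rate(buffer: bytes, encoding: str) -> float:
    """Share of characters that came out as U+FFFD when decoding."""
    text = buffer.decode(encoding, errors="replace")
    return text.count("\ufffd") / len(text) if text else 0.0


def detect_with_fallback(buffer: bytes, guess: Guess) -> str:
    # Step 1: accept utf-8 if it decodes with few enough errors.
    if error_rate(buffer, "utf-8") <= ERROR_THRESHOLD:
        return "utf-8"
    # Step 2: ask the statistical guesser and accept a confident answer.
    encoding, confidence = guess(buffer)
    if encoding and confidence >= CONFIDENCE_THRESHOLD:
        return encoding
    # Step 3: last resort (assumption: the "latin-like" check is skipped here).
    return "windows-1250"
```

With cchardet installed, `guess` could be a thin wrapper that calls `cchardet.detect(buffer)` and returns its `"encoding"` and `"confidence"` entries; passing it in as a parameter keeps the sketch runnable without that dependency.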

class lector.csv.encodings.EncodingDetector[source]#

Bases: abc.ABC

Base class specifying the interface for all encoding detectors.

abstract detect(buffer)[source]#

Implement me.
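As a rough illustration of what implementing this interface might look like, the sketch below defines a local stand-in for `EncodingDetector` so that it runs without the library installed. The subclass `AsciiOrLatin1` is hypothetical, and the assumption that `detect` returns an encoding name as a string is not confirmed by this page.

```python
from abc import ABC, abstractmethod


class EncodingDetector(ABC):
    """Local stand-in mirroring the documented interface."""

    @abstractmethod
    def detect(self, buffer: bytes) -> str:
        """Return the name of the detected encoding (assumed return type)."""


class AsciiOrLatin1(EncodingDetector):
    """Hypothetical detector: pure ASCII if possible, otherwise latin-1."""

    def detect(self, buffer: bytes) -> str:
        try:
            buffer.decode("ascii")
        except UnicodeDecodeError:
            return "latin-1"
        return "ascii"
```

Because `detect` is abstract, instantiating the base class directly raises `TypeError`; only subclasses that override it can be created.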

lector.csv.encodings.decoding_errors(bs, encoding, prop=True)[source]#

The proportion of characters that couldn’t be decoded correctly.

Return type: float
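One plausible way to compute such a proportion is to decode with replacement and count the replacement characters. The sketch below is an assumption-laden illustration, not the library's code: `count_decoding_errors` is a hypothetical name, and the reading of `prop` (proportion when true, raw count otherwise) is a guess based on the signature.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, emitted by Python when errors="replace"


def count_decoding_errors(bs: bytes, encoding: str, prop: bool = True) -> float:
    """Decode `bs` and measure how many characters were replaced."""
    text = bs.decode(encoding, errors="replace")
    n_errors = text.count(REPLACEMENT)
    if not prop:
        # Assumed meaning of prop=False: return the raw error count.
        return n_errors
    # Guard against empty input to avoid dividing by zero.
    return n_errors / len(text) if text else 0.0
```

For example, `count_decoding_errors(b"caf\xe9", "utf-8")` gives `0.25`, because one of the four decoded characters is a replacement. A buffer that legitimately contains U+FFFD would be counted as an error by this approach.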

lector.csv.encodings.detect_bom(bs)[source]#

Detect encoding by looking for a BOM at the start of the file.
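BOM sniffing can be sketched with the constants in Python's standard `codecs` module. The names `BOM_TABLE` and `sniff_bom` are hypothetical, and the table layout and returned encoding names are assumptions that do not necessarily match `BOMS` or the return value of `detect_bom`.

```python
import codecs
from typing import Optional

# Longer marks are listed first: the UTF-32 LE mark (FF FE 00 00) begins
# with the UTF-16 LE mark (FF FE), so checking UTF-16 first would misfire.
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(bs: bytes) -> Optional[str]:
    """Return an encoding name if `bs` starts with a known BOM, else None."""
    for bom, encoding in BOM_TABLE:
        if bs.startswith(bom):
            return encoding
    return None
```

Only the first few bytes of the buffer matter here, so the check is cheap enough to run before any statistical detection.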

lector.csv.encodings.BOMS: dict[str, tuple[Literal, Ellipsis]][source]#: Map BOM (Byte-order mark) to encoding.

lector.csv.encodings.CODEC_ERR_CHAR = '�'[source]#: Character substituted for bytes that cannot be decoded.

lector.csv.encodings.MAX_INT32: int = 2147483647[source]#: Cannot read more than this number of bytes at once to detect encoding.