lector.csv.encodings#

Helpers for detecting character encodings in binary buffers.

Classes#

Chardet

An encoding detector that falls back on cchardet if the default UTF-8 generates too many decoding errors.

EncodingDetector

Base class specifying the interface for all encoding detectors.

Functions#

decoding_errors(bs, encoding[, prop])

The proportion of characters that couldn't be decoded correctly.

detect_bom(bs)

Detect encoding by looking for a BOM at the start of the file.

Attributes#

BOMS

Map BOM (Byte-order mark) to encoding.

CODEC_ERR_CHAR

Replacement character standing in for bytes that could not be decoded.

MAX_INT32

Cannot read more than this number of bytes at once to detect encoding.

class lector.csv.encodings.Chardet[source]#

Bases: EncodingDetector

An encoding detector that falls back on cchardet if the default UTF-8 generates too many decoding errors.

confidence_threshold: float = 0.6[source]#

Minimum level of confidence to accept an encoding automatically detected by cchardet.

error_threshold: float = 0.001[source]#

If the proportion of decoding errors exceeds this threshold, the encoding is considered to have failed.

n_bytes: int[source]#

Use this many bytes to detect encoding.

detect(buffer)[source]#

Somewhat ‘opinionated’ encoding detection.

Assumes UTF-8 as the most common encoding and falls back on cchardet detection. If all else fails and the encoding appears Latin-like, falls back on windows-1250.

Parameters:

buffer (BinaryIO) –

Return type:

str
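The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `detect_sketch`, `error_rate`, and the `guess` callable are hypothetical names, `guess` stands in for a cchardet-style detector so the example has no third-party dependency, the default `n_bytes` is arbitrary, and the "Latin-like" check before the final fallback is omitted.

```python
import io
from typing import BinaryIO, Callable, Optional, Tuple

# Hypothetical stand-in for a statistical detector such as cchardet:
# takes raw bytes, returns (encoding name or None, confidence in [0, 1]).
Guesser = Callable[[bytes], Tuple[Optional[str], float]]


def error_rate(bs: bytes, encoding: str) -> float:
    """Share of decoded characters that are U+FFFD replacement characters."""
    text = bs.decode(encoding, errors="replace")
    return text.count("\ufffd") / len(text) if text else 0.0


def detect_sketch(
    buffer: BinaryIO,
    guess: Guesser,
    n_bytes: int = 1024,  # arbitrary default, chosen for this example only
    error_threshold: float = 0.001,
    confidence_threshold: float = 0.6,
) -> str:
    # Only a sample is inspected. Note that a multi-byte character cut off
    # at the sample boundary would count as a decoding error here.
    head = buffer.read(n_bytes)

    # 1. Accept UTF-8 if it decodes with few enough errors.
    if error_rate(head, "utf-8") <= error_threshold:
        return "utf-8"

    # 2. Otherwise ask the detector and accept only confident answers.
    encoding, confidence = guess(head)
    if encoding is not None and confidence >= confidence_threshold:
        return encoding

    # 3. Last resort (simplified: no "Latin-like" check in this sketch).
    return "windows-1250"
```

For example, `detect_sketch(io.BytesIO(b"\xe9" * 10), lambda b: ("iso-8859-1", 0.9))` skips UTF-8 because every byte fails to decode, then accepts the guessed encoding because its confidence is above the threshold.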

class lector.csv.encodings.EncodingDetector[source]#

Bases: abc.ABC

Base class specifying the interface for all encoding detectors.

abstract detect(buffer)[source]#

Subclasses must implement this method and return the name of the detected encoding.

Parameters:

buffer (BinaryIO) –

Return type:

str
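A custom detector might look like the sketch below. To keep the example self-contained it defines a local stand-in, `EncodingDetectorSketch`, that mirrors the documented interface (one abstract `detect` method taking a binary buffer and returning a string) rather than importing the real base class. The `Utf8OrLatin1` subclass and its `n_bytes` argument are hypothetical.

```python
import abc
import io
from typing import BinaryIO


class EncodingDetectorSketch(abc.ABC):
    """Local stand-in mirroring the documented interface."""

    @abc.abstractmethod
    def detect(self, buffer: BinaryIO) -> str:
        ...


class Utf8OrLatin1(EncodingDetectorSketch):
    """Hypothetical detector: UTF-8 if the sample decodes strictly, else latin-1."""

    def __init__(self, n_bytes: int = 4096) -> None:
        self.n_bytes = n_bytes

    def detect(self, buffer: BinaryIO) -> str:
        sample = buffer.read(self.n_bytes)
        try:
            sample.decode("utf-8")
        except UnicodeDecodeError:
            return "latin-1"
        return "utf-8"
```

Because `detect` is abstract, instantiating the base class directly raises `TypeError`; only subclasses that provide `detect` can be created.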

lector.csv.encodings.decoding_errors(bs, encoding, prop=True)[source]#

The proportion of characters that couldn’t be decoded correctly.

Parameters:
  • bs (bytes) –

  • encoding (str) –

  • prop (bool) –

Return type:

float
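One plausible way to compute such a proportion is to decode with Python's `errors="replace"` handler and count the replacement characters it inserts. The function below is a sketch under that assumption. The name `decoding_error_rate` is hypothetical, and treating `prop=False` as "return the raw count" is a guess about the parameter's meaning, not something the documentation above states.

```python
def decoding_error_rate(bs: bytes, encoding: str, prop: bool = True) -> float:
    # Undecodable byte sequences become U+FFFD when errors="replace".
    text = bs.decode(encoding, errors="replace")
    n_errors = text.count("\ufffd")
    if not prop:
        # Assumption: prop=False means "return the raw error count".
        return float(n_errors)
    # Guard against division by zero for empty input.
    return n_errors / len(text) if text else 0.0
```

With this sketch, `decoding_error_rate(b"caf\xe9", "utf-8")` gives `0.25` (one replacement character out of four decoded characters), while the same bytes decoded as `"latin-1"` give `0.0`.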

lector.csv.encodings.detect_bom(bs)[source]#

Detect encoding by looking for a BOM at the start of the file.

Parameters:

bs (bytes) –
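BOM detection of this kind can be sketched with the constants in the standard `codecs` module. The function name `sniff_bom`, the table layout, and the `Optional[str]` return type are assumptions made for this example; the return type of the documented function is not stated above. The UTF-32 marks are checked before the UTF-16 ones because the little-endian UTF-32 BOM begins with the same two bytes as the little-endian UTF-16 BOM.

```python
import codecs
from typing import Optional

# Order matters: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
_BOM_TABLE = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(bs: bytes) -> Optional[str]:
    """Return an encoding name if bs starts with a known BOM, else None."""
    for bom, encoding in _BOM_TABLE:
        if bs.startswith(bom):
            return encoding
    return None
```

The names returned here (`utf-8-sig`, `utf-16`, `utf-32`) are the Python codecs that consume the BOM while decoding, so the mark does not end up in the decoded text.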

lector.csv.encodings.BOMS: dict[str, tuple[Literal, Ellipsis]][source]#

Map BOM (Byte-order mark) to encoding.

lector.csv.encodings.CODEC_ERR_CHAR = '�'[source]#

Replacement character standing in for bytes that could not be decoded.

lector.csv.encodings.MAX_INT32: int = 2147483647[source]#

Cannot read more than this number of bytes at once to detect encoding.