CSV Reader#

The CSV Reader has the simple task of detecting 3 properties of a CSV file:

  1. The text encoding (utf-8, latin-1 etc.)

  2. A potential preamble (initial lines to skip)

  3. The CSV dialect (delimiter etc.)

Lector provides an abstract base class and default implementations for each of the three detectors (see below).

A reader itself then simply receives instances of these detectors (or the results of the detection), and configures the parameters of a CSV parser accordingly. The main CSV parser in lector is pyarrow’s csv.read_csv(), as used in the ArrowReader. As an example for using alternative parsers we also include a PandasReader. Both implement the abstract Reader class.
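
For example (a sketch only; the full reader signature is shown in the Readers section below, and the module paths are those referenced elsewhere on this page), one can either pass configured detector instances or bypass detection by passing already known values:

from lector import ArrowReader
from lector.csv import dialects, encodings

# Let lector detect everything, but with tweaked detector settings ...
tbl = ArrowReader(
    "/path/to/file.csv",
    encoding=encodings.Chardet(confidence_threshold=0.8),
    dialect=dialects.PySniffer(),
).read()

# ... or skip detection where the answers are already known.
tbl = ArrowReader("/path/to/file.csv", encoding="latin-1", preamble=2).read()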

File encodings#

An encoding detector in lector is any class having a detect() method that accepts a binary (bytes) buffer, and returns a string indicating the name of a Python codec, as the abstract base class requires:

@dataclass
class EncodingDetector(ABC):
    """Base class specifying interface for all encoding detetors."""

    @abstractmethod
    def detect(self, buffer: BinaryIO) -> str:
        """Implement me.""

The default implementation uses the cchardet library internally and has the following interface:

@dataclass
class Chardet(EncodingDetector):
    """An encoding detector using cchardet if the default utf-8 generates too many errors."""

    n_bytes: int = int(1e7)  # 10 MB
    """Use this many bytes to detect encoding."""
    error_threshold: float = 0.0
    """A greater proportion of decoding errors than this will be considered a failed encoding."""
    confidence_threshold: float = 0.6
    """Minimum level of confidence to accept an encoding automatically detected by cchardet."""

It reads at most n_bytes bytes from the received buffer and then proceeds in the following order (a simplified sketch in code follows the list):

  • Tries to identify an initial byte-order mark (BOM) indicating the file’s codec

  • Checks whether decoding as utf-8 produces a proportion of errors no greater than error_threshold (and returns this codec if so)

  • Uses cchardet to detect the encoding. If cchardet's confidence exceeds confidence_threshold, returns the detected encoding. Otherwise it falls back on the windows-1250 codec, which among related windows/latin-like codecs comes closest to being a superset of their special characters.
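
A rough sketch of that decision logic (illustrative only, not lector's actual code; the BOM handling and error counting here are simplified assumptions):

import codecs

import cchardet  # the library wrapped by the default detector


def detect_sketch(raw: bytes, error_threshold: float = 0.0, confidence_threshold: float = 0.6) -> str:
    """Illustrative only, not lector's actual implementation."""
    # 1. An explicit byte-order mark wins outright
    for bom, codec in (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ):
        if raw.startswith(bom):
            return codec
    # 2. Trial-decode as utf-8 and measure the proportion of decoding errors
    decoded = raw.decode("utf-8", errors="replace")
    error_rate = decoded.count("\ufffd") / max(len(decoded), 1)
    if error_rate <= error_threshold:
        return "utf-8"
    # 3. Ask cchardet, falling back on windows-1250 when it isn't confident
    guess = cchardet.detect(raw)
    if guess["encoding"] and (guess["confidence"] or 0.0) > confidence_threshold:
        return guess["encoding"]
    return "windows-1250"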

Preambles#

By “preamble” lector understands initial lines in CSV files to be skipped, e.g. metadata that should not be interpreted as part of the tabular data itself.

It is impossible to always detect arbitrary preambles from the CSV data itself. There are, however, common patterns amongst preambles written to CSV by certain sources. E.g. some exporters separate the metadata from the actual data with a line consisting of delimiters only. Others write metadata lines that do not themselves contain the delimiter otherwise used to separate fields in the tabular part.

Since detecting arbitrary preambles is essentially an open-ended exercise, lector was designed to allow easy extension of the patterns to be detected: one simply implements a new subclass of PreambleDetector and uses a decorator to register it with the preamble registry. Like so:

@Preambles.register
@dataclass
class MyPreamble(PreambleDetector):

    def detect(self, buffer: TextIO) -> int:
        ...

In this case the detector will receive an already decoded text buffer, and should return an integer indicating the number of lines to skip.
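
As a concrete (entirely hypothetical) example, a detector that skips leading comment lines might look as follows; the import paths are assumptions:

from dataclasses import dataclass
from typing import TextIO

from lector.csv.preambles import Preambles, PreambleDetector  # import paths assumed


@Preambles.register
@dataclass
class CommentLines(PreambleDetector):
    """Hypothetical detector: skip leading lines starting with a comment character."""

    comment: str = "#"

    def detect(self, buffer: TextIO) -> int:
        skip = 0
        for line in buffer:
            if not line.startswith(self.comment):
                break
            skip += 1
        return skip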

lector.csv.preambles.Brandwatch and lector.csv.preambles.Fieldless are two detectors provided out of the box. The former checks for initial lines followed by a single line containing only commas. The latter checks for N initial lines containing a single field only, followed by at least one line containing multiple fields, and returns N as the number of rows to skip.
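
For illustration, given a made-up file like the one below, Fieldless should report two lines to skip: the first two lines contain no delimiter (and hence a single field only), while the following lines contain several. This assumes Fieldless can be constructed without arguments:

from io import StringIO

from lector.csv.preambles import Fieldless

csv_text = (
    "Exported by SomeTool v1.2\n"
    "Filter: region = EU\n"
    "id,name,value\n"
    "1,foo,3.14\n"
)

print(Fieldless().detect(StringIO(csv_text)))  # expected: 2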

lector.csv.preambles.Preambles.detect() is responsible for trying all registered detectors in the order they were registered and returns the first match (i.e. the first detector reporting N > 0 lines to skip). This may prove too constraining in the long run and may change in the future so that the order is more easily configurable.

Dialects#

The CSV format is not in fact a strict standard, and there are a number of differences in how CSV files can be generated. E.g. while the delimiter is usually the comma, it may also be a semicolon, a tab or any other arbitrary character. To handle the delimiter appearing within fields, one may choose to quote such fields, use a special escape character, etc.

A CSV dialect is a set of parameters describing how to parse a CSV file, i.e. identifying the delimiter, quote character and so on. In Python's csv module it was, unfortunately, decided that to use such dialects one has to pass around subclasses of csv.Dialect rather than instances. Since this is somewhat awkward, lector implements its own lector.csv.dialects.Dialect.

Instances of dialects are used as return values by dialect detectors in lector, the abstract base class of which is simply

@dataclass
class DialectDetector(ABC):
    """Base class for all dialect detectors."""

    @abstractmethod
    def detect(self, buffer: TextIO) -> Dialect:
        ...

Lector provides two default implementations. lector.csv.dialects.PySniffer uses the Python standard library’s CSV Sniffer internally and fixes up the result specifically for more robust parsing of CSVs.

Alternatively, if clevercsv has been installed as an optional dependency, lector wraps it in the lector.csv.dialects.CleverCsv detector class. It can be used to trade off speed for more robust dialect inference.
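
Either detector is used like any other in lector: it receives an already decoded text buffer and returns a Dialect instance. A short sketch, assuming PySniffer needs no constructor arguments:

from lector.csv.dialects import PySniffer

with open("/path/to/file.csv", encoding="utf-8") as fh:
    dialect = PySniffer().detect(fh)  # a lector.csv.dialects.Dialect instance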

Readers#

Finally, a CSV Reader in lector simply receives an encoding (or encoding detector), a preamble (or preamble detector) and a dialect (or a dialect detector). The abstract base class for readers, lector.csv.abc.Reader, is essentially

class Reader(ABC):
    """Base class for CSV readers."""

    def __init__(
        self,
        fp: FileLike,
        encoding: str | EncodingDetector | None = None,
        dialect: dict | DialectDetector | None = None,
        preamble: int | PreambleRegistry | None = None,
        log: bool = True,
    ) -> None:
        self.fp = fp
        self.encoding = encoding or encodings.Chardet()
        self.dialect = dialect or dialects.PySniffer()
        self.preamble = preamble or Preambles
        self.log = log

    def read(self, *args, **kwds) -> Any:
        self.analyze()  # Runs the configured detectors where needed
        result = self.parse(*args, **kwds)
        self.buffer.close()
        return result

    @abstractmethod
    def parse(self, *args, **kwds) -> Any:
        """Parse the file pointer or text buffer. Args passed to read() are forwarded here."""
        ...

The base class uses the provided detectors to infer (if necessary) all the information required to call a CSV parser. It wraps the inferred information in a lector.csv.abc.Format object, which Reader subclasses can then translate to a specific parser's own parameters. E.g., the only thing lector.csv.arrow.ArrowReader does is translate a CSV Format to arrow's own csv.ReadOptions, csv.ParseOptions and csv.ConvertOptions objects.
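
A rough sketch of what such a translation might look like (the attribute names on the Format object below are assumptions for illustration; only the pyarrow option classes and their parameters are real):

from pyarrow import csv


def to_arrow_options(fmt) -> tuple[csv.ReadOptions, csv.ParseOptions, csv.ConvertOptions]:
    """Hypothetical translation of a detected CSV format to pyarrow CSV options."""
    read = csv.ReadOptions(skip_rows=fmt.preamble)             # fmt.preamble: lines to skip (assumed name)
    parse = csv.ParseOptions(delimiter=fmt.dialect.delimiter)  # fmt.dialect.delimiter (assumed name)
    convert = csv.ConvertOptions(strings_can_be_null=True)
    return read, parse, convert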

If no parameters (other than a file pointer) are passed, a reader uses the default implementations of all detectors, which means that if no customization is needed, reading almost any CSV becomes simply:

from lector import ArrowReader

tbl = ArrowReader("/path/to/file.csv").read()