lector.csv.preambles#

Detectors of preambles in CSV files.

This is generally a chicken-and-egg situation. To detect generic preambles robustly and efficiently, it would really help to know the CSV dialect, or at least the delimiter. But to detect the dialect or delimiter correctly, we first need to skip the preamble. Detectors may therefore rely on (sometimes overly) simplistic heuristics that implicitly assume a certain dialect.

Classes#

Brandwatch

Detect CSV files exported from Brandwatch.

Fieldless

Detects initial rows that don't contain any delimited fields.

GoogleAds

In GoogleAds CSVs the garbage lines don't contain the separator (comma or tab).

PreambleDetector

Base class for detecting preambles (initial junk) in a CSV buffer.

Preambles

Registry to manage preamble detectors.

class lector.csv.preambles.Brandwatch[source]#

Bases: PreambleDetector

Detect CSV files exported from Brandwatch.

Brandwatch uses the comma as separator and includes a row consisting only of commas to separate the preamble text from the CSV table itself.

detect(buffer)[source]#

Detect preamble and return number of lines to skip.

Parameters:

buffer (TextIO) –

Return type:

int
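The heuristic described for Brandwatch exports can be illustrated with a minimal standalone sketch. It is not the library's implementation: the function name `skip_until_comma_row`, the `n_rows` limit, and the choice to count the commas-only row itself as part of the preamble are all assumptions made for illustration.

```python
import io
from typing import TextIO


def skip_until_comma_row(buffer: TextIO, n_rows: int = 100) -> int:
    """Hypothetical helper: lines up to and including a commas-only row, else 0."""
    for i in range(n_rows):
        line = buffer.readline()
        if not line:
            break  # end of buffer reached without finding a separator row
        stripped = line.strip()
        # A separator row is non-empty and made up of commas only.
        if stripped and set(stripped) == {","}:
            return i + 1  # assumption: the separator row is skipped as well
    return 0


sample = io.StringIO(
    "Report: Mentions\n"
    "From: 2023-01-01\n"
    ",,\n"
    "date,author,text\n"
    "2023-01-02,alice,hello\n"
)
skipped = skip_until_comma_row(sample)  # 3 for this sample
```

A buffer without a commas-only row yields 0, which corresponds to "no preamble detected".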

class lector.csv.preambles.Fieldless[source]#

Bases: PreambleDetector

Detects initial rows that don’t contain any delimited fields.

Tries parsing buffer using Python’s built-in csv functionality, assuming as delimiter the most frequent character amongst those configured via delimiters. Given this delimiter, the parser detects N initial lines containing a single field only, followed by at least one line containing multiple fields. N is then the number of rows to skip.

delimiters: str | list[str][source]#

detect(buffer)[source]#

Count consecutive initial fieldless rows given the most frequent delimiter.

Parameters:

buffer (TextIO) –

Return type:

int

detect_with_delimiter(buffer, delimiter)[source]#

Count the consecutive initial fieldless rows given a specific delimiter.

Parameters:
  • buffer (TextIO) –

  • delimiter (str) –

Return type:

int
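The approach described for Fieldless (pick the most frequent candidate delimiter, then count leading single-field rows) can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the class's actual code: the function name `count_fieldless_rows`, the default candidate delimiters, and the treatment of blank lines as fieldless are all hypothetical.

```python
import csv
import io
from typing import TextIO


def count_fieldless_rows(buffer: TextIO, delimiters: str = ",;\t", n_rows: int = 100) -> int:
    """Hypothetical helper: count leading rows that have at most one field."""
    # Inspect at most n_rows lines of the buffer.
    lines = [line for _, line in zip(range(n_rows), buffer)]
    text = "".join(lines)
    # Assume the delimiter is the most frequent of the candidate characters.
    delimiter = max(delimiters, key=text.count)
    n = 0
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) > 1:
            return n  # first row with multiple fields ends the preamble
        n += 1  # blank lines parse as zero fields and count as fieldless here
    return 0  # no multi-field row found, so report no preamble


sample = io.StringIO(
    "Exported by tool\n"
    "Date range 2023\n"
    "\n"
    "id;name;score\n"
    "1;a;2\n"
)
n_skip = count_fieldless_rows(sample)  # 3 for this sample
```

Note that a preamble line which happens to contain the chosen delimiter would end the count early, which is one way the implicit dialect assumption mentioned at the top of this page can fail.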

class lector.csv.preambles.GoogleAds[source]#

Bases: Fieldless

In GoogleAds CSVs the garbage lines don’t contain the separator (comma or tab).

The only complications are that 1) GoogleAds has two CSV export formats: ‘Excel’, which uses tabs as separators, and normal ‘CSV’, which uses the comma; 2) a single-column CSV wouldn’t contain the separator either.

GoogleAds also seems to include two “totals” rows at the end, which we exclude here.

detect(buffer)[source]#

Count consecutive initial fieldless rows given the most frequent delimiter.

Parameters:

buffer (TextIO) –

Return type:

int

class lector.csv.preambles.PreambleDetector[source]#

Bases: abc.ABC

Base class for detecting preambles (initial junk) in a CSV buffer.

n_rows: int = 100[source]#

abstract detect(buffer)[source]#

Detect preamble and return number of lines to skip.

Parameters:

buffer (TextIO) –

Return type:

int
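To illustrate the interface, here is a minimal sketch of a custom detector. To stay self-contained it defines a stand-in base class, `PreambleDetectorSketch`, that mirrors the documented `n_rows` attribute and abstract `detect` method rather than importing the real one. The `HashComments` detector and the reading of `n_rows` as "maximum number of rows to inspect" are assumptions for illustration.

```python
import io
from abc import ABC, abstractmethod
from typing import TextIO


class PreambleDetectorSketch(ABC):
    """Stand-in mirroring the documented interface (not the real base class)."""

    n_rows: int = 100  # assumed meaning: maximum number of rows to inspect

    @abstractmethod
    def detect(self, buffer: TextIO) -> int:
        """Return the number of preamble lines to skip (0 if none)."""


class HashComments(PreambleDetectorSketch):
    """Hypothetical detector: treats leading lines starting with '#' as preamble."""

    def detect(self, buffer: TextIO) -> int:
        n = 0
        for _, line in zip(range(self.n_rows), buffer):
            if not line.startswith("#"):
                break
            n += 1
        return n


n_preamble = HashComments().detect(io.StringIO("# a\n# b\nx,y\n1,2\n"))  # 2
```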

class lector.csv.preambles.Preambles[source]#

Registry to manage preamble detectors.

DETECTORS[source]#
classmethod detect(buffer, detectors=None, log=False)[source]#

Get the result of the first preamble detector matching the CSV buffer.

Matching here means detecting more than 0 rows of preamble text, and the result is the number of rows to skip.

If no detectors are provided (as an ordered sequence of instances), all registered detector classes are tried in order of registration, using their default parameters.

Parameters:
  • buffer (TextIO) –

  • detectors (collections.abc.Iterable[PreambleDetector] | None) –

  • log (bool) –

Return type:

int

classmethod register(registered)[source]#

Parameters:

registered (type) –

Return type:

type
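The first-match behaviour of the registry can be sketched as follows. This is a standalone illustration, not the library's code: `RegistrySketch`, the dict-based `DETECTORS` storage, rewinding the buffer between detectors, and the use of `register` as a class decorator (suggested only by its `type` to `type` signature) are all assumptions. The two example detectors are hypothetical as well.

```python
import io
from typing import Iterable, Optional, TextIO


class RegistrySketch:
    """Stand-in registry; the real storage type of DETECTORS is not documented here."""

    DETECTORS: dict = {}

    @classmethod
    def register(cls, registered: type) -> type:
        # Returning the class unchanged lets this work as a decorator.
        cls.DETECTORS[registered.__name__] = registered
        return registered

    @classmethod
    def detect(cls, buffer: TextIO, detectors: Optional[Iterable] = None) -> int:
        if detectors is None:
            # Instantiate all registered classes with default parameters,
            # in registration order (dicts preserve insertion order).
            detectors = (det() for det in cls.DETECTORS.values())
        for detector in detectors:
            buffer.seek(0)  # assumption: every detector sees the buffer from the start
            skip = detector.detect(buffer)
            if skip > 0:
                buffer.seek(0)
                return skip  # first detector reporting a preamble wins
        buffer.seek(0)
        return 0


@RegistrySketch.register
class LeadingBlank:
    """Hypothetical detector: counts leading blank lines."""

    def detect(self, buffer: TextIO) -> int:
        n = 0
        for line in buffer:
            if line.strip():
                break
            n += 1
        return n


@RegistrySketch.register
class HashLines:
    """Hypothetical detector: counts leading lines starting with '#'."""

    def detect(self, buffer: TextIO) -> int:
        n = 0
        for line in buffer:
            if not line.startswith("#"):
                break
            n += 1
        return n


result = RegistrySketch.detect(io.StringIO("# exported\n# v2\na,b\n1,2\n"))  # 2
```

Here `LeadingBlank` is tried first and reports 0, so the result comes from `HashLines`. Because the first match wins, registration order matters when several detectors could match the same buffer.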