Types#
Introduction#
Lector implements its own column type inference. To use it, parse a CSV file with string
types only (preserving the original fields without modification), and then auto-cast all
columns to the most appropriate and efficient data type:
import io
import lector
from lector import ArrowReader, Autocast
from lector.log import schema_view
csv = """id,genre,metric,count,content,website,tags
1234982348728374,a,0.1,1,, http://www.graphext.com,"[a,b,c]"
,b,0.12,,"Natural language text is different from categorical data.", https://www.twitter.com,[d]
18446744073709551615,a,3.14,3,"The Project · Gutenberg » EBook « of Die Fürstin.",http://www.google.com,"['e', 'f']"
""".encode()
# High-level API uses automatic type-inference and casting by default
tbl = lector.read_csv(io.BytesIO(csv))
# Equivalent low-level API
tbl = ArrowReader(io.BytesIO(csv)).read(types="string")
tbl = Autocast().cast(tbl)
schema_view(tbl.schema)
Printing the table schema this way will produce the following output:
Schema
─────────────────────────────────────────────────────────
Column    Type           Meta
─────────────────────────────────────────────────────────
id        uint64         {'semantic': 'number[UInt64]'}
genre     dict<string>   {'semantic': 'category'}
metric    double         {'semantic': 'number[double]'}
count     uint8          {'semantic': 'number[UInt8]'}
content   string         {'semantic': 'text'}
website   dict<string>   {'semantic': 'url'}
tags      list<string>   {'semantic': 'list[category]'}
─────────────────────────────────────────────────────────
For each column in the table, the schema view shows the arrow type it has been cast to, as well as some metadata about the semantic type lector has inferred. By semantic type we mean the kind of content the column contains, which may be different from (more specific than) the “physical” (arrow) type used to store it.
For example, the “website” column has been cast to arrow using a dictionary type with
string values (dict<string>). However, lector has in fact inferred that the column
contains URLs, and simply selected the dictionary type as the best storage type for URLs.
Equally, the “content” column has been inferred to contain natural language text, and in
this case arrow’s string type is used for storage. Also note that lector handles
types that have no equivalent in pandas. The “tags” column contains lists of strings, for
example, which lector has automatically parsed and cast to arrow’s list<string> type.
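To make the list parsing concrete, here is a minimal pure-Python sketch that normalizes the three tag representations from the CSV above. The helper parse_list is hypothetical and is not lector's parser; it is deliberately naive and assumes items never contain commas or brackets themselves.

```python
def parse_list(field: str) -> list:
    """Split a bracketed list literal and strip whitespace and quotes from each item.

    Hypothetical helper for illustration only; not part of lector.
    """
    inner = field.strip()[1:-1]  # drop the surrounding brackets
    if not inner.strip():
        return []
    return [item.strip().strip("'\"") for item in inner.split(",")]


# The three representations used in the "tags" column of the example CSV
raw_tags = ["[a,b,c]", "[d]", "['e', 'f']"]
parsed = [parse_list(tag) for tag in raw_tags]
print(parsed)
```

All three spellings end up as plain lists of strings, which is the shape a list<string> column needs.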
For numeric columns lector has automatically identified the most efficient (least
memory-hungry) data types. The semantic metadata here is used to indicate pandas’
corresponding (potentially nullable extension) dtype.
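The idea of picking the least memory-hungry integer type can be sketched in plain Python. This is an illustration under simplifying assumptions (non-negative integers only, missing values given as None), not lector's actual inference code; the names UNSIGNED_TYPES and smallest_unsigned_type are made up for the example.

```python
# Upper bounds of the unsigned integer types, from narrowest to widest
UNSIGNED_TYPES = [
    ("uint8", 2**8 - 1),
    ("uint16", 2**16 - 1),
    ("uint32", 2**32 - 1),
    ("uint64", 2**64 - 1),
]


def smallest_unsigned_type(values):
    """Return the narrowest unsigned type that holds all non-missing values."""
    present = [v for v in values if v is not None]
    if not present or min(present) < 0:
        return None  # nothing to infer from, or a signed type would be needed
    largest = max(present)
    for name, upper in UNSIGNED_TYPES:
        if largest <= upper:
            return name
    return None  # too large even for uint64


# Values of the "id" and "count" columns from the example CSV
id_type = smallest_unsigned_type([1234982348728374, None, 18446744073709551615])
count_type = smallest_unsigned_type([1, None, 3])
print(id_type, count_type)
```

For the example data this selects uint64 for "id" and uint8 for "count", matching the schema shown above.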
Using lector’s to_pandas() function we can convert the arrow table to a pandas DataFrame,
ensuring that all data is converted correctly, even when there is no corresponding
pandas type:
from lector.utils import to_pandas
df = to_pandas(tbl)
print(df)
print(df.dtypes)
produces
                     id genre  metric  count  \
0      1234982348728374     a    0.10      1
1                  <NA>     b    0.12   <NA>
2  18446744073709551615     a    3.14      3

                                             content                  website  \
0                                               <NA>  http://www.graphext.com
1  Natural language text is different from catego...  https://www.twitter.com
2  The Project · Gutenberg » EBook « of Die Fürstin.    http://www.google.com

        tags
0  [a, b, c]
1        [d]
2     [e, f]

id           UInt64
genre      category
metric      float64
count         UInt8
content      string
website    category
tags         object
dtype: object
Note that arrow’s tbl.to_pandas() would have converted integer columns with
missing data to the float type, which is not safe and may introduce erroneous data
(because floats lack the precision to represent large integers exactly). Lector
uses extension dtypes where necessary. Also note how all tag lists have been parsed
correctly, despite having various different representations in the CSV data (use of
quotes etc.). In pandas, the lists are represented by a column of numpy arrays.
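The precision problem with floats can be checked using only Python's built-in types. The sketch below round-trips the two non-missing "id" values from the example CSV through a 64-bit float; it involves neither arrow nor pandas and only illustrates why a float column is a risky container for large integers.

```python
# The non-missing "id" values from the example CSV
small_id = 1234982348728374        # below 2**53, exactly representable as a float
large_id = 18446744073709551615    # 2**64 - 1, not exactly representable

# Convert to float and back, as a cast to a float column would effectively do
small_round_trip = int(float(small_id))
large_round_trip = int(float(large_id))

print(small_round_trip == small_id)  # the small id survives unchanged
print(large_round_trip - large_id)   # the large id is silently off by one
```

The large id comes back as 18446744073709551616, that is 2**64, which no longer even fits into an unsigned 64-bit integer.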
Array Converters#
Converters in Lector are responsible for inferring the semantic type of a column
(i.e. an arrow Array or ChunkedArray), identifying the corresponding storage type,
and potentially generating some useful metadata.
Lector implements one subclass of lector.types.abc.Converter for each semantic
type. At the moment there are:
lector.types.numbers.Number (number)
lector.types.lists.List (list[number] or list[category])
lector.types.strings.Text (text)
lector.types.strings.Category (category)
lector.types.strings.Url (url)
Their interface is simple. At a minimum, they have to accept a threshold
parameter and implement a convert() method:
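from __future__ import annotations  # allows referencing Conversion before its definition

from abc import ABC, abstractmethod
from dataclasses import dataclass, field

from pyarrow import Array


@dataclass
class Converter(ABC):
    threshold: float = 1.0

    @abstractmethod
    def convert(self, arr: Array) -> Conversion | None:
        ...


@dataclass
class Conversion:
    result: Array
    meta: dict = field(default_factory=dict)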
A specific converter (e.g. Number) returns None if the passed
array data is not compatible with the type (e.g. the values are not numeric).
Otherwise it will return a Conversion object containing the correctly
cast array and potential metadata.
The converters can be used in two ways for casting a table of raw data: using
lector.types.cast.Autocast to infer the best type automatically, or
using lector.types.cast.Cast to specify the (semantic) type for each
column explicitly (see below for more information).
In both cases, exactly how types are used for inference and casting can be configured by:
limiting or extending the list of allowed semantic types/converters
configuring each converter via parameters
The single common parameter for all converters is the threshold. It sets
the proportion of values in an array that have to be valid according to the
given type for the converter to return a cast result. I.e. a converter
(e.g. Number) should return None if the proportion of valid
(e.g. numeric-like) values is less than threshold.
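The threshold rule can be sketched with a toy converter operating on plain Python lists instead of arrow arrays. ToyNumber and is_numeric are hypothetical names for this illustration, not lector classes; the sketch also assumes that missing values (None) are excluded when computing the proportion, which may differ from what lector does.

```python
from dataclasses import dataclass
from typing import List, Optional


def is_numeric(value: str) -> bool:
    """True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


@dataclass
class ToyNumber:
    """Hypothetical stand-in for a number converter, for illustration only."""

    threshold: float = 1.0

    def convert(self, values: List[Optional[str]]) -> Optional[List[Optional[float]]]:
        present = [v for v in values if v is not None]
        if not present:
            return None
        valid = sum(is_numeric(v) for v in present)
        # Refuse the conversion if too few values look numeric
        if valid / len(present) < self.threshold:
            return None
        # Invalid or missing values become None in the cast result
        return [float(v) if v is not None and is_numeric(v) else None for v in values]


column = ["0.1", "0.12", "n/a", "3.14"]
strict = ToyNumber(threshold=1.0).convert(column)
lenient = ToyNumber(threshold=0.7).convert(column)
print(strict)
print(lenient)
```

With the default threshold of 1.0 a single non-numeric value is enough to reject the column, while a threshold of 0.7 accepts it, since three of the four values are numeric.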
Automatic Table Cast#
For each column, lector.types.cast.Autocast simply tries each semantic
type (i.e. Converter) in an ordered list. The first Converter returning a
Conversion (rather than None) is used to cast the column. If no list of
converters is specified explicitly (as in the first example above), a default
order of all implemented converters with default parameters is used
(lector.types.cast.DEFAULT_CONVERTERS).
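The first-match strategy described above can be sketched as follows. This is a simplified model on plain Python lists with made-up converter functions (to_numbers, to_categories) and a made-up autocast helper; it is not lector's implementation, and serves only to show why the order of converters matters.

```python
def to_numbers(values):
    """Toy converter: succeeds only if every value parses as a float."""
    try:
        return [float(v) for v in values]
    except ValueError:
        return None


def to_categories(values, max_unique=2):
    """Toy converter: succeeds only if there are few distinct values."""
    return list(values) if len(set(values)) <= max_unique else None


def autocast(values, converters):
    """Return the name and result of the first converter that does not return None."""
    for name, convert in converters:
        result = convert(values)
        if result is not None:
            return name, result
    return "string", list(values)  # no converter matched, keep the raw strings


column = ["1", "2", "1"]
numbers_first = autocast(column, [("number", to_numbers), ("category", to_categories)])
categories_first = autocast(column, [("category", to_categories), ("number", to_numbers)])
print(numbers_first)
print(categories_first)
```

The same column is cast to numbers or to categories depending only on which converter is tried first, which is why the list passed to an automatic cast is ordered.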
As mentioned above, Autocast can be configured by passing an ordered list
of allowed converters, and by configuring the parameters of individual converters.
If that is not enough, and a more complicated cast strategy is required, one can
also implement a new subclass of lector.types.cast.CastStrategy. The base
class takes care of iteration over columns and some other minor boilerplate so
that subclasses only have to implement the conversion of a single Array.
Explicit Table Cast#
lector.types.cast.Cast is even simpler. Given a mapping of column names to
converters, it applies a specific converter to a specific column. If the conversion
is successful, that column is cast; otherwise it is left as is. For example, given the CSV
data above:
from lector import Cast
from lector.types import Category, Number

strategy = Cast({
    "id": Number(),
    "genre": Category(max_cardinality=None),
    # ...
})

# High-level API
tbl = lector.read_csv(io.BytesIO(csv), strategy=strategy)

# Low-level API
tbl = ArrowReader(io.BytesIO(csv)).read(types="string")
tbl = strategy.cast(tbl)

schema_view(tbl.schema)
produces
Schema
─────────────────────────────────────────────────────────
Column    Type           Meta
─────────────────────────────────────────────────────────
id        uint64         {'semantic': 'number[UInt64]'}
genre     dict<string>   {'semantic': 'category'}
metric    string
count     string
content   string
website   string
tags      string
─────────────────────────────────────────────────────────
I.e., only the two specified columns have been converted using the configured types.
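The "cast if successful, otherwise leave as is" behaviour can be modelled with a small sketch over a dictionary of columns. The functions to_integers and explicit_cast are hypothetical helpers written for this illustration and say nothing about how lector implements Cast internally.

```python
def to_integers(values):
    """Toy converter: succeeds only if every value parses as an integer."""
    try:
        return [int(v) for v in values]
    except ValueError:
        return None


def explicit_cast(table, converters):
    """Apply the converter mapped to each column; keep columns that fail or are unmapped."""
    result = {}
    for name, values in table.items():
        converted = converters[name](values) if name in converters else None
        result[name] = values if converted is None else converted
    return result


# A tiny table of raw strings, keyed by column name
table = {"id": ["1", "2"], "genre": ["a", "b"], "metric": ["0.1", "0.2"]}

# "genre" is deliberately mapped to a converter that cannot handle it
cast = explicit_cast(table, {"id": to_integers, "genre": to_integers})
print(cast)
```

Here "id" is converted, "genre" is left untouched because its converter returns None, and "metric" is left untouched because no converter was specified for it.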