Core functions
- schemashift.transform(path, config, n_rows=None)
Apply a FormatConfig to a file and return the transformed data.
- Parameters:
path (Path) – Path to the source file.
config (FormatConfig) – FormatConfig describing how to map columns.
n_rows (int | None) – If given, collect only the first n_rows rows (useful for previewing or validating a config without reading the whole file).
- Return type:
DataFrame
- Returns:
A polars.DataFrame with the transformed columns.
- Raises:
DSLRuntimeError – When a DSL expression fails to evaluate.
ReaderError – When the file cannot be read.
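The n_rows parameter lends itself to a preview-then-run pattern. The following is a minimal sketch, not part of schemashift: `preview_then_run` and `n_preview` are hypothetical names, and the transform function is passed in as an argument so the helper can be exercised without the library installed.

```python
from pathlib import Path
from typing import Any, Callable


def preview_then_run(
    transform: Callable[..., Any],
    path: Path,
    config: Any,
    n_preview: int = 20,
) -> Any:
    """Run a cheap preview first, then the full transform.

    `transform` is assumed to follow the documented signature
    transform(path, config, n_rows=None). Hypothetical helper.
    """
    # Collect only the first rows; errors that show up in those rows
    # surface here, before the whole file is processed.
    transform(path, config, n_rows=n_preview)
    # The preview succeeded, so run the transform over the full file.
    return transform(path, config)
```

With the library available, the call would presumably look like `preview_then_run(schemashift.transform, Path("data.csv"), config)`. A clean preview does not guarantee that later rows evaluate without error.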
- schemashift.smart_transform(path, registry, target_schema=None, llm=None, review_fn=None, auto_register=False, example_configs=None, max_retries=2, n_sample_rows=15, reader_config=None)
Full detect-or-generate flow:
1. Try to auto-detect the format from the registry.
2. On a miss, if an LLM is available, generate a config.
3. If review_fn is provided, pass the config and a sample to the reviewer.
4. If auto_register is set, save the config to the registry.
5. Apply the config; optionally validate the output against target_schema.
- Parameters:
path (Path) – Source file path.
registry (Registry) – Registry to search and optionally register to.
target_schema (TargetSchema | None) – Required for LLM generation and output validation.
llm (BaseChatModel | None) – LangChain BaseChatModel.
review_fn (Callable[[FormatConfig, DataFrame], FormatConfig | None] | None) – Callback of the form callback(config, sample_df) -> config | None. Returning None rejects the config.
auto_register (bool) – Register the LLM-generated config automatically.
example_configs (list[FormatConfig] | None) – Example configs for the LLM prompt.
max_retries (int) – Maximum number of LLM retries.
n_sample_rows (int) – Number of rows to sample for the LLM.
reader_config (ReaderConfig | None) – Optional reader configuration forwarded to all file reads.
- Return type:
DataFrame
- Returns:
Transformed polars.DataFrame.
- Raises:
FormatDetectionError – No match and no LLM.
ValueError – LLM needed but target_schema not provided.
LLMGenerationError – LLM fails after all retries.
ReviewRejectedError – review_fn returned None.
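The control flow described for smart_transform can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `detect_or_generate`, its callable parameters, and the local `ReviewRejected` class are hypothetical stand-ins, and the sketch assumes that review and registration apply only to freshly generated configs. Retries, the target_schema check, and the final apply step are left out.

```python
from typing import Any, Callable, Optional


class ReviewRejected(Exception):
    """Local stand-in for ReviewRejectedError (hypothetical)."""


def detect_or_generate(
    detect: Callable[[], Optional[Any]],
    generate: Optional[Callable[[], Any]] = None,
    review: Optional[Callable[[Any], Optional[Any]]] = None,
    register: Optional[Callable[[Any], None]] = None,
) -> Any:
    """Return the config that the caller would go on to apply."""
    # Step 1: try the registry first.
    config = detect()
    if config is not None:
        return config

    # Registry miss with no generator corresponds to "no match and no LLM".
    if generate is None:
        raise LookupError("no matching config and no generator available")

    # Step 2: generate a config.
    config = generate()

    # Step 3: optional review; None from the reviewer means rejection.
    if review is not None:
        config = review(config)
        if config is None:
            raise ReviewRejected("reviewer rejected the generated config")

    # Step 4: optional registration (assumed to apply to generated configs only).
    if register is not None:
        register(config)

    # Step 5 (applying the config) is left to the caller.
    return config
```

Passing the steps in as callables keeps the sketch independent of any registry or LLM client, which also makes each branch easy to exercise in isolation.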
- schemashift.validate_config(config)
Validate a FormatConfig by parsing all DSL expressions and checking dtypes.
- Parameters:
config (FormatConfig) – The FormatConfig to validate.
- Return type:
list[str]
- Returns:
A list of error message strings. An empty list means the config is valid.
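Because validation reports problems as a list rather than raising, callers may want a small guard that turns a non-empty list into an exception. A minimal sketch, assuming the validator is passed in as a callable; `ensure_valid` is a hypothetical helper, not part of schemashift.

```python
from typing import Any, Callable, List


def ensure_valid(config: Any, validate: Callable[[Any], List[str]]) -> Any:
    """Raise ValueError if `validate` reports problems; else return the config."""
    errors = validate(config)
    if errors:  # an empty list means the config is valid
        raise ValueError("invalid config:\n" + "\n".join(errors))
    return config
```

With the library available, the second argument would presumably be `schemashift.validate_config`.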
- schemashift.detect_format(file_columns, registry, min_score=0.4)
Detect which registered config matches a file’s columns.
A config matches when the file’s column set is a superset of the config’s referenced source columns (i.e. every column the config needs is present). Matching configs are ranked by specificity score: len(required) / len(file_columns).
- Return type:
FormatConfig | None
- Returns:
The matching FormatConfig when exactly one config matches, or None when no configs match.
- Raises:
AmbiguousFormatError – When two or more configs match.
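The matching rule and specificity score can be worked through in a few lines of plain Python. This is a sketch of the stated rule only: `rank_matches` and the `required_by_config` mapping are hypothetical, and the sketch assumes that min_score is a lower bound on the score, which the text above does not confirm. It does not model AmbiguousFormatError.

```python
from typing import Dict, List, Set, Tuple


def rank_matches(
    file_columns: List[str],
    required_by_config: Dict[str, Set[str]],
    min_score: float = 0.4,
) -> List[Tuple[str, float]]:
    """Return (config name, score) pairs for matching configs, best first."""
    if not file_columns:
        return []
    file_set = set(file_columns)
    ranked = []
    for name, required in required_by_config.items():
        # A config matches only if every column it needs is present.
        if not required <= file_set:
            continue
        # Specificity: share of the file's columns that the config uses.
        score = len(required) / len(file_columns)
        # Assumption: scores below min_score are discarded.
        if score >= min_score:
            ranked.append((name, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


configs = {
    "narrow": {"id", "amount"},
    "wide": {"id", "name", "amount"},
    "other": {"id", "missing"},
}
print(rank_matches(["id", "name", "amount", "date"], configs))
# [('wide', 0.75), ('narrow', 0.5)]
```

Here "other" is skipped because the file has no "missing" column, and "wide" outranks "narrow" because it accounts for three of the four file columns rather than two.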
- schemashift.read_file(path, config=None)
Read a file into a LazyFrame based on its extension.
Supported extensions: .csv, .tsv, .xlsx, .xls, .parquet, .json
- Parameters:
path (Path) – Path to the file.
config (ReaderConfig | None) – Optional reader configuration (skip_rows, separator, encoding, etc.).
- Return type:
LazyFrame
- Returns:
A Polars LazyFrame.
- Raises:
UnsupportedFileError – For unrecognised file extensions.
ReaderError – For any I/O or parsing failures.
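Since unrecognised extensions raise, a caller scanning a directory may prefer to filter paths up front. A minimal sketch using only the extension list given above; `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and case-insensitive matching is an assumption the source does not state.

```python
from pathlib import Path

# Extensions copied from the documented list for read_file.
SUPPORTED_EXTENSIONS = {".csv", ".tsv", ".xlsx", ".xls", ".parquet", ".json"}


def is_supported(path: Path) -> bool:
    """Return True if the path's final extension is in the documented list."""
    # Assumption: matching ignores case; the library may behave differently.
    return path.suffix.lower() in SUPPORTED_EXTENSIONS
```

Only the final suffix is checked, so a name such as `archive.csv.gz` is treated as unsupported in this sketch.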