Core functions

schemashift.transform(path, config, n_rows=None)

Apply a FormatConfig to a file and return the transformed data.

Parameters:
  • path (Path) – Path to the source file.

  • config (FormatConfig) – Configuration describing how to map columns.

  • n_rows (int | None) – If given, collect only the first n_rows rows (useful for previewing or validating a config without reading the whole file).

Return type:

DataFrame

Returns:

A polars.DataFrame with the transformed columns.

Raises:

schemashift.smart_transform(path, registry, target_schema=None, llm=None, review_fn=None, auto_register=False, example_configs=None, max_retries=2, n_sample_rows=15, reader_config=None)

Full detect-or-generate flow:

  1. Try to auto-detect a matching config from the registry.

  2. If no config matches and an LLM is available, generate a config.

  3. If review_fn is provided, pass the config and a sample to the reviewer.

  4. If auto_register is set, save the config to the registry.

  5. Apply the config and, optionally, validate the result against target_schema.
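The steps above can be sketched as plain control flow. This is an illustration only, not the library's implementation: detect, generate, and register are hypothetical stand-ins for the registry lookup, the LLM call, and registry persistence, and the exception types are assumptions, since this section does not list what smart_transform raises. Reviewing a detected (not only a generated) config is likewise an assumption of the sketch.

```python
from typing import Any, Callable, Optional


def detect_or_generate(
    detect: Callable[[], Optional[Any]],
    generate: Optional[Callable[[], Any]] = None,
    review_fn: Optional[Callable[[Any, Any], Optional[Any]]] = None,
    register: Optional[Callable[[Any], None]] = None,
    sample: Any = None,
) -> Any:
    """Illustrative flow only; every callable here is a hypothetical stand-in."""
    # Step 1: look for an already registered config.
    config = detect()
    generated = False

    # Step 2: fall back to generation only when detection missed.
    if config is None:
        if generate is None:
            # Assumed error type; the real function may behave differently.
            raise LookupError("no registered config matched and no generator was supplied")
        config = generate()
        generated = True

    # Step 3: the reviewer may edit the config, or reject it by returning None.
    if review_fn is not None:
        config = review_fn(config, sample)
        if config is None:
            # Assumed error type for a rejected config.
            raise ValueError("config rejected by reviewer")

    # Step 4: persist only configs that were generated during this call.
    if generated and register is not None:
        register(config)

    # Step 5 (applying the config and validating the output) is left to the caller.
    return config
```

Passing register only when auto_register is wanted mirrors the documented flag: a config found in the registry is never saved a second time.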

Parameters:
  • path (Path) – Source file path.

  • registry (Registry) – Registry to search and optionally register to.

  • target_schema (TargetSchema | None) – Required for LLM generation and output validation.

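  • llm (BaseChatModel | None) – LangChain BaseChatModel used to generate a config when auto-detection misses.

  • review_fn (Callable[[FormatConfig, DataFrame], FormatConfig | None] | None) – Callback with the signature callback(config, sample_df) -> config | None. Returning None rejects the config.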

  • auto_register (bool) – Register LLM-generated config automatically.

  • example_configs (list[FormatConfig] | None) – Example configs for LLM prompt.

  • max_retries (int) – Maximum number of LLM retries.

  • n_sample_rows (int) – Number of rows to sample for the LLM.

  • reader_config (ReaderConfig | None) – Optional reader configuration forwarded to all file reads.
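A review callback matching the documented review_fn signature might look like the sketch below. The helper name require_columns is hypothetical, and the sketch assumes only that the sample object exposes its column names through a .columns attribute, as polars.DataFrame does. Whether the sample passed to the reviewer holds raw or transformed columns is not stated here, so adapt the check accordingly.

```python
from typing import Any, Callable, Optional, Sequence


def require_columns(required: Sequence[str]) -> Callable[[Any, Any], Optional[Any]]:
    """Build a reviewer that accepts a config only if the sample has every required column."""

    def review(config: Any, sample_df: Any) -> Optional[Any]:
        # polars.DataFrame exposes its column names as a list via `.columns`.
        missing = [name for name in required if name not in sample_df.columns]
        # Returning None signals rejection; returning the config accepts it unchanged.
        return None if missing else config

    return review
```

The resulting function can be passed as review_fn, for example review_fn=require_columns(["id", "amount"]).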

Return type:

DataFrame

Returns:

Transformed polars.DataFrame.

Raises:

schemashift.validate_config(config)

Validate a FormatConfig by parsing all DSL expressions and checking dtypes.

Parameters:

config (FormatConfig) – The FormatConfig to validate.

Return type:

list[str]

Returns:

A list of error message strings. An empty list means the config is valid.
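Because the function returns messages instead of raising, callers decide how to react. The helper below is a hypothetical convenience, not part of schemashift, that turns a non-empty error list into a single exception; ValueError is an arbitrary choice for this sketch.

```python
from typing import Sequence


def raise_if_invalid(errors: Sequence[str], name: str = "config") -> None:
    """Raise ValueError listing every message; do nothing for an empty list."""
    if not errors:
        # An empty list means the config is valid.
        return
    details = "\n".join(f"  - {message}" for message in errors)
    raise ValueError(f"{name} has {len(errors)} error(s):\n{details}")
```

Typical use would be raise_if_invalid(schemashift.validate_config(config)) before applying the config.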

schemashift.detect_format(file_columns, registry, min_score=0.4)

Detect which registered config matches a file’s columns.

A config matches when the file’s column set is a superset of the config’s referenced source columns (i.e. every column the config needs is present). Matching configs are ranked by a specificity score, len(required) / len(file_columns), where required is the set of source columns the config references.

Parameters:
  • file_columns (list[str]) – Column names found in the file being inspected.

  • registry (Registry) – Registry to search for candidate configs.

  • min_score (float) – Minimum specificity score required for a match.

Return type:

FormatConfig | None

Returns:

The matching FormatConfig when exactly one config matches, or None when no configs match.

Raises:

AmbiguousFormatError – When two or more configs match.
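The matching rule and score described above can be sketched in plain Python. This is a simplified model, not the library's code: candidates are represented as a dict from a config name to its required source columns, AmbiguousMatch stands in for AmbiguousFormatError, and the sketch reads "two or more configs match" literally, raising whenever more than one candidate passes min_score. How the real function uses the ranking to break ties is not stated here.

```python
from typing import Dict, List, Optional, Set


class AmbiguousMatch(Exception):
    """Stand-in for the library's AmbiguousFormatError."""


def specificity(required: Set[str], file_columns: List[str]) -> float:
    """Return len(required) / len(file_columns), or 0.0 if a required column is missing."""
    columns = set(file_columns)
    if not columns or not required <= columns:
        return 0.0
    return len(required) / len(file_columns)


def pick_config(
    file_columns: List[str],
    candidates: Dict[str, Set[str]],
    min_score: float = 0.4,
) -> Optional[str]:
    """Return the single matching candidate name, None if nothing matches."""
    scores = {name: specificity(req, file_columns) for name, req in candidates.items()}
    # A candidate matches when it is fully covered and reaches the threshold.
    matches = [name for name, score in scores.items() if score > 0 and score >= min_score]
    if len(matches) > 1:
        raise AmbiguousMatch(sorted(matches))
    return matches[0] if matches else None
```

With file columns id, name, amount, date, note, a candidate requiring four of them scores 4 / 5 and matches, while a candidate requiring only id scores 1 / 5 and falls below the default threshold.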

schemashift.read_file(path, config=None)

Read a file into a LazyFrame based on its extension.

Supported extensions: .csv, .tsv, .xlsx, .xls, .parquet, .json
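A caller may want to filter paths before handing them to this function. The check below is a hypothetical helper built only from the extension list above. Lower-casing the suffix is an assumption of the sketch, since this section does not say whether the library treats extensions case-insensitively.

```python
from pathlib import Path
from typing import Union

# Extensions copied from the list documented above.
SUPPORTED_EXTENSIONS = frozenset({".csv", ".tsv", ".xlsx", ".xls", ".parquet", ".json"})


def has_supported_extension(path: Union[str, Path]) -> bool:
    """Return True when the final suffix of the path is in the documented list."""
    # Only the last suffix is inspected, so "data.csv.gz" is treated as ".gz".
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, a directory scan could keep only the entries for which has_supported_extension returns True and pass those to read_file.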

Parameters:
  • path (Path) – Path to the file.

  • config (ReaderConfig | None) – Optional reader configuration (skip_rows, separator, encoding, etc.).

Return type:

LazyFrame

Returns:

A Polars LazyFrame.

Raises: