Config format reference

A FormatConfig is a JSON (or YAML) file that describes how to transform one specific source format into your target schema.

Full structure

{
  "name": "camstar_mes",
  "description": "Camstar MES lot movement export",
  "version": 1,
  "target_schema": "lot_movement",
  "reader": {
    "skip_rows": 0,
    "separator": ",",
    "encoding": "utf-8"
  },
  "columns": [
    { "target": "lot_id",        "source": "LOT_ID" },
    { "target": "wafer_count",   "source": "QTY", "dtype": "int32" },
    { "target": "track_in_time", "expr": "col(\"TRACKIN_DT\").str.to_datetime(\"%Y-%m-%d %H:%M:%S\")" },
    { "target": "hold_flag",     "expr": "col(\"HOLD_STATUS\") != \"NONE\"" },
    { "target": "data_source",   "constant": "camstar_mes" }
  ],
  "drop_unmapped": true
}

Top-level fields

name (required): Unique identifier for this config within the registry. Used by registry.get() and for auto-detection.
description: Optional human-readable description. Included in LLM-generated configs.
version: Integer version number. Defaults to 1.
target_schema: Name of the TargetSchema this config produces. Used for validation.
reader: Optional ReaderConfig controlling how the file is read. See Reader options.
columns (required): List of ColumnMapping objects. See Column mappings.
drop_unmapped: If true, source columns that are not mapped by any entry in columns are dropped from the output. Defaults to true.
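
To make drop_unmapped concrete, here is a minimal pure-Python sketch that works on a list of dict rows rather than a Polars frame. The apply_mappings helper is hypothetical, handles only source and constant mappings, and assumes that a renamed source column counts as mapped. It illustrates the described behaviour and is not the library's implementation.

```python
def apply_mappings(rows, columns, drop_unmapped=True):
    """Hypothetical helper: apply `source` and `constant` mappings to dict rows."""
    out = []
    for row in rows:
        new_row = {}
        used = set()  # source columns consumed by a mapping
        for mapping in columns:
            if "source" in mapping:
                new_row[mapping["target"]] = row[mapping["source"]]
                used.add(mapping["source"])
            elif "constant" in mapping:
                new_row[mapping["target"]] = mapping["constant"]
        if not drop_unmapped:
            # Assumption: unmapped source columns pass through under their original names.
            for name, value in row.items():
                if name not in used and name not in new_row:
                    new_row[name] = value
        out.append(new_row)
    return out


rows = [{"LOT_ID": "A123", "QTY": 25, "OPERATOR": "jdoe"}]
columns = [
    {"target": "lot_id", "source": "LOT_ID"},
    {"target": "data_source", "constant": "camstar_mes"},
]

dropped = apply_mappings(rows, columns, drop_unmapped=True)
kept = apply_mappings(rows, columns, drop_unmapped=False)
```

With drop_unmapped=True only lot_id and data_source survive; with drop_unmapped=False the unmapped QTY and OPERATOR columns are carried through as well.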

Column mappings

Each entry in columns must have a target field and exactly one of source, expr, or constant.
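
The "exactly one of" rule can be checked before a config is ever loaded. The sketch below is a standalone illustration; check_mapping is a hypothetical helper and not part of the library, whose own validation may report these problems differently.

```python
MAPPING_KINDS = ("source", "expr", "constant")


def check_mapping(mapping):
    """Return a list of problems for one column mapping (empty if it is valid)."""
    problems = []
    if "target" not in mapping:
        problems.append("missing 'target'")
    present = [kind for kind in MAPPING_KINDS if kind in mapping]
    if len(present) != 1:
        problems.append(f"expected exactly one of {MAPPING_KINDS}, found {present}")
    return problems


ok = check_mapping({"target": "lot_id", "source": "LOT_ID"})
both = check_mapping({"target": "lot_id", "source": "LOT_ID", "constant": "x"})
neither = check_mapping({"target": "lot_id"})
```

Here ok is empty, while both and neither each contain one problem, because they supply two kinds and zero kinds respectively.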

Rename a column

{ "target": "lot_id", "source": "LOT_ID" }

Renames the source column as-is. No type conversion.

Apply a DSL expression

{ "target": "track_in_time", "expr": "col(\"TRACKIN_DT\").str.to_datetime(\"%Y-%m-%d %H:%M:%S\")" }

Evaluates the expression and assigns the result to target. See Expression DSL for the full expression reference.

Set a constant

{ "target": "data_source", "constant": "camstar_mes" }

Broadcasts a literal value to every row.

Common optional fields

dtype: Cast the result to this Polars dtype after the mapping. Accepted values: str, int32, int64, float32, float64, bool, date, datetime, duration.
fillna: Fill nulls in the output column with this value after the mapping is applied.
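
As an illustration, the snippet below builds a mapping that uses both optional fields and serializes it to the JSON form used in a config file. The fill value 0 is an arbitrary example, and the dtype check is a local convenience rather than a library call. The text above does not say whether the fill happens before or after the cast, so this sketch makes no assumption about that order.

```python
import json

# Accepted dtype names, copied from the list above.
ACCEPTED_DTYPES = {
    "str", "int32", "int64", "float32", "float64",
    "bool", "date", "datetime", "duration",
}

# Rename QTY to wafer_count, cast to int32, and replace nulls with 0 (example value).
mapping = {"target": "wafer_count", "source": "QTY", "dtype": "int32", "fillna": 0}

if mapping["dtype"] not in ACCEPTED_DTYPES:
    raise ValueError(f"unsupported dtype: {mapping['dtype']!r}")

# JSON fragment ready to paste into the "columns" list of a config.
fragment = json.dumps(mapping)
print(fragment)
```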

Reader options

The reader block controls low-level file reading.

skip_rows: Number of rows to skip before the header. Default: 0.
sheet_name: For Excel files: sheet name (string) or 1-based sheet index (integer). Default: first sheet.
separator: For CSV/TSV: field delimiter character. Default: ",".
encoding: File encoding. Default: "utf-8".
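
The effect of skip_rows, separator, and encoding can be shown with the standard library alone. The sketch below uses Python's csv module as a stand-in for the real reader; read_rows is a hypothetical helper, and the sample export with a two-line preamble is invented for illustration.

```python
import csv


def read_rows(raw, skip_rows=0, separator=",", encoding="utf-8"):
    """Hypothetical stand-in for the reader: decode, skip preamble rows, parse header and data."""
    text = raw.decode(encoding)
    # Simplification: splitting on line breaks first means quoted multi-line fields are not supported.
    lines = text.splitlines()[skip_rows:]
    # The first remaining line is treated as the header.
    return list(csv.DictReader(lines, delimiter=separator))


# Two preamble lines precede the header; fields are separated by semicolons.
raw = "Exported by MES\nRun 2024-01-05\nLOT_ID;QTY\nA123;25\nB456;24\n".encode("latin-1")

rows = read_rows(raw, skip_rows=2, separator=";", encoding="latin-1")
```

With skip_rows=2 the header line LOT_ID;QTY is the first line parsed, so rows holds two records keyed by LOT_ID and QTY. All values are still strings at this point; type conversion is left to the column mappings.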

Supported file formats

Format	Notes
`.csv`	Lazy scan via Polars
`.tsv`	CSV with `separator="\t"`
`.parquet`	Lazy scan via Polars
`.xlsx` / `.xls`	Via `fastexcel` (calamine engine); read eagerly, then converted to a lazy frame
`.json`	Read eagerly, then converted to a lazy frame
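
The table can be read as a dispatch on file extension. The sketch below encodes it as a lookup. reader_strategy and the STRATEGIES table are hypothetical names, and two details are assumptions: `.tsv` files are taken to be scanned lazily like `.csv`, and extensions are matched case-insensitively.

```python
from pathlib import Path

# extension -> (reader family, scanned lazily from the start?)
STRATEGIES = {
    ".csv": ("csv", True),
    ".tsv": ("csv", True),       # assumption: lazy, like .csv
    ".parquet": ("parquet", True),
    ".xlsx": ("excel", False),   # read eagerly first
    ".xls": ("excel", False),
    ".json": ("json", False),    # read eagerly first
}


def reader_strategy(path, separator=","):
    """Return (family, lazy_scan, options) for a file path, per the table above."""
    suffix = Path(path).suffix.lower()
    if suffix not in STRATEGIES:
        raise ValueError(f"unsupported file format: {suffix!r}")
    family, lazy_scan = STRATEGIES[suffix]
    options = {}
    if family == "csv":
        # .tsv is CSV with a tab separator; otherwise use the configured separator.
        options["separator"] = "\t" if suffix == ".tsv" else separator
    return family, lazy_scan, options


tsv = reader_strategy("export/lots.tsv")
xlsx = reader_strategy("export/lots.xlsx")
```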

Validation

Use validate_config() to check that all DSL expressions parse correctly without running against real data:

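from pathlib import Path

import schemashift as ss

# Load the config file; read_text() closes the file handle for us.
config = ss.FormatConfig.model_validate_json(Path("my_config.json").read_text(encoding="utf-8"))

# An empty result means no problems were found.
errors = ss.validate_config(config)
if errors:
    for err in errors:
        print(err)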

Or from the CLI:

schemashift validate my_config.json