Documentation ¶
Overview ¶
Package fileprep provides preprocessing and validation for the file formats supported by filesql (CSV, TSV, LTSV, JSON, JSONL, Parquet, and Excel, with gzip, bzip2, xz, and zstd compression support). It also re-exports fileparser types for backward compatibility.
fileprep complements filesql by preprocessing data before it is loaded into SQLite. It uses struct tags for validation (the "validate" tag) and preprocessing (the "prep" tag).
Basic Usage ¶
type Record struct {
Name string `prep:"trim" validate:"required"`
Email string `prep:"trim,lowercase" validate:"email"`
Age int `validate:"gte=0,lte=150"`
}
file, err := os.Open("data.csv")
if err != nil {
log.Fatal(err)
}
defer file.Close()
var records []Record
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
reader, result, err := processor.Process(file, &records)
if err != nil {
log.Fatal(err)
}
// reader can be passed directly to filesql
// result.Errors contains validation errors with row/column information
// result.RowCount and result.ValidRowCount provide processing statistics
Streaming Output with ProcessToWriter ¶
For large datasets, ProcessToWriter writes the preprocessed output directly to an io.Writer, avoiding the allocation of an in-memory output buffer:
var buf bytes.Buffer
result, err := processor.ProcessToWriter(file, &records, &buf)
Memory Usage ¶
fileprep loads the entire file into memory for processing. This enables multi-pass operations (preprocessing then validation) but means memory usage scales with file size. For large files, use ProcessToWriter to reduce peak memory by avoiding the output buffer.
Format-specific limitations:
- XLSX: Only the first sheet is processed
- LTSV: Maximum line size is 10MB
- JSON/JSONL: Data has a single "data" column containing raw JSON strings
Supported File Formats ¶
fileprep supports the same formats as filesql:
- CSV (.csv)
- TSV (.tsv)
- LTSV (.ltsv)
- JSON (.json)
- JSONL (.jsonl)
- Parquet (.parquet)
- Excel (.xlsx)
All formats support compression:
- gzip (.gz)
- bzip2 (.bz2)
- xz (.xz)
- zstd (.zst)
- zlib (.z)
- snappy (.snappy)
- s2 (.s2)
- lz4 (.lz4)
Prep Tags ¶
The "prep" tag specifies preprocessing operations applied before validation:
- trim: Remove leading and trailing whitespace
- ltrim: Remove leading whitespace
- rtrim: Remove trailing whitespace
- lowercase: Convert to lowercase
- uppercase: Convert to uppercase
- default=value: Set default value if empty
Validate Tags ¶
The "validate" tag specifies validation rules (compatible with go-playground/validator):
- required: Field must not be empty
- email: Must be a valid email address
- url: Must be a valid URL
- And many more...
See https://pkg.go.dev/github.com/nao1215/fileprep for the complete list of supported validators.
Example ¶
package main
import (
"fmt"
"io"
"strings"
"github.com/nao1215/fileprep"
)
// User represents a user record with preprocessing and validation
type User struct {
Name string `prep:"trim" validate:"required"`
Email string `prep:"trim,lowercase" validate:"required"`
Age string
}
func main() {
// Sample CSV data with whitespace that needs trimming
csvData := `name,email,age
John Doe ,JOHN@EXAMPLE.COM,30
Jane Smith,jane@example.com,25
`
// Create a processor for CSV files
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
// Prepare a slice to hold the parsed records
var users []User
// Process the data
reader, result, err := processor.Process(strings.NewReader(csvData), &users)
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
// Print processing results
fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)
// Print parsed records (with preprocessing applied)
for i, user := range users {
fmt.Printf("User %d: Name=%q, Email=%q\n", i+1, user.Name, user.Email)
}
// The reader can be passed to filesql
// For demonstration, we just read and show the preprocessed output
output, err := io.ReadAll(reader)
if err != nil {
fmt.Printf("Error reading output: %v\n", err)
return
}
fmt.Printf("Output for filesql:\n%s", output)
}
Output:

Processed 2 rows, 2 valid
User 1: Name="John Doe", Email="john@example.com"
User 2: Name="Jane Smith", Email="jane@example.com"
Output for filesql:
name,email,age
John Doe,john@example.com,30
Jane Smith,jane@example.com,25
Example (ComplexPrepAndValidation) ¶
Example_complexPrepAndValidation demonstrates comprehensive preprocessing and validation using multiple prep tags and validation rules for realistic data processing scenarios.
package main
import (
"fmt"
"io"
"strings"
"github.com/nao1215/fileprep"
)
func main() {
// Product represents a product record with comprehensive preprocessing and validation
type Product struct {
// Name: trim whitespace, require non-empty
Name string `prep:"trim" validate:"required"`
// SKU: trim, uppercase, require non-empty and alphanumeric
SKU string `prep:"trim,uppercase" validate:"required,alphanumeric"`
// Price: trim, set default if empty, validate as number (int or decimal)
Price string `prep:"trim,default=0.00" validate:"number"`
// Quantity: trim, coerce to int, validate as integer
Quantity string `prep:"trim,coerce=int" validate:"numeric"`
// Category: trim, lowercase, collapse multiple spaces
Category string `prep:"trim,lowercase,collapse_space"`
// Description: trim, strip HTML, truncate to 200 chars
Description string `prep:"trim,strip_html,truncate=200"`
// URL: trim, fix scheme (add https:// if missing)
URL string `prep:"trim,fix_scheme=https"`
// Tags: trim, replace semicolons with commas
Tags string `prep:"trim,replace=;:,"`
}
// Sample CSV with various data quality issues
csvData := `name,sku,price,quantity,category,description,url,tags
Widget Pro ,ABC123,19.99,100, Electronics ,<p>A <b>great</b> product!</p>,example.com/widget,tag1;tag2;tag3
Gadget Plus ,DEF456,29.99,50, home goods ,Simple description,https://example.com/gadget,electronics;sale
,ABC@#$,not_a_number,5,test,desc,url,tags
`
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var products []Product
reader, result, err := processor.Process(strings.NewReader(csvData), &products)
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
// Show processing summary
fmt.Printf("Processing Summary:\n")
fmt.Printf(" Total rows: %d\n", result.RowCount)
fmt.Printf(" Valid rows: %d\n", result.ValidRowCount)
fmt.Printf(" Invalid rows: %d\n", result.InvalidRowCount())
// Show validation errors
if result.HasErrors() {
fmt.Printf("\nValidation Errors:\n")
for _, ve := range result.ValidationErrors() {
fmt.Printf(" Row %d, Field %q: %s\n", ve.Row, ve.Field, ve.Message)
}
}
// Show preprocessed records
fmt.Printf("\nPreprocessed Records:\n")
for i, p := range products {
if p.Name != "" { // Skip invalid rows for display
fmt.Printf(" [%d] Name=%q, SKU=%q, Price=%q, Category=%q\n",
i+1, p.Name, p.SKU, p.Price, p.Category)
}
}
// Show that URL scheme was fixed
fmt.Printf("\nURL Examples (scheme fixed):\n")
for i, p := range products {
if p.URL != "" && p.Name != "" {
fmt.Printf(" [%d] %s\n", i+1, p.URL)
}
}
// The reader can be used with filesql
_, _ = io.ReadAll(reader) //nolint:errcheck // Example code - ignoring error
}
Output:

Processing Summary:
  Total rows: 3
  Valid rows: 2
  Invalid rows: 1

Validation Errors:
  Row 3, Field "Name": value is required
  Row 3, Field "SKU": value must contain only alphanumeric characters
  Row 3, Field "Price": value must be a valid number

Preprocessed Records:
  [1] Name="Widget Pro", SKU="ABC123", Price="19.99", Category="electronics"
  [2] Name="Gadget Plus", SKU="DEF456", Price="29.99", Category="home goods"

URL Examples (scheme fixed):
  [1] https://example.com/widget
  [2] https://example.com/gadget
Example (CrossFieldValidation) ¶
Example_crossFieldValidation demonstrates validation rules that compare values between different fields, such as password confirmation matching.
package main
import (
"fmt"
"strings"
"github.com/nao1215/fileprep"
)
func main() {
// UserForm represents a user registration form with cross-field validation
type UserForm struct {
Username string `prep:"trim,lowercase" validate:"required"`
Password string `validate:"required"`
ConfirmPassword string `validate:"required,eqfield=Password"`
}
csvData := `username,password,confirm_password
Alice ,secret123,secret123
Bob,password1,wrongpass
`
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var users []UserForm
_, result, err := processor.Process(strings.NewReader(csvData), &users)
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
fmt.Printf("Valid: %d/%d\n", result.ValidRowCount, result.RowCount)
if result.HasErrors() {
fmt.Printf("Errors:\n")
for _, ve := range result.ValidationErrors() {
fmt.Printf(" Row %d, Field %q: %s\n", ve.Row, ve.Field, ve.Message)
}
}
}
Output:

Valid: 1/2
Errors:
  Row 2, Field "ConfirmPassword": value must equal field Password
Example (DetailedErrorReporting) ¶
Example_detailedErrorReporting demonstrates precise error information including row number, column name, and specific validation failure reason.
package main
import (
"fmt"
"strings"
"github.com/nao1215/fileprep"
)
func main() {
// Order represents an order with strict validation rules
type Order struct {
OrderID string `name:"order_id" validate:"required,uuid4"`
CustomerID string `name:"customer_id" validate:"required,numeric"`
Email string `validate:"required,email"`
Amount string `validate:"required,number,gt=0,lte=10000"`
Currency string `validate:"required,len=3,uppercase"`
Country string `validate:"required,alpha,len=2"`
OrderDate string `name:"order_date" validate:"required,datetime=2006-01-02"`
ShipDate string `name:"ship_date" validate:"datetime=2006-01-02,gtfield=OrderDate"`
IPAddress string `name:"ip_address" validate:"required,ip_addr"`
PromoCode string `name:"promo_code" validate:"alphanumeric"`
Quantity string `validate:"required,numeric,gte=1,lte=100"`
UnitPrice string `name:"unit_price" validate:"required,number,gt=0"`
TotalCheck string `name:"total_check" validate:"required,eqfield=Amount"`
}
// CSV with multiple validation errors
csvData := `order_id,customer_id,email,amount,currency,country,order_date,ship_date,ip_address,promo_code,quantity,unit_price,total_check
550e8400-e29b-41d4-a716-446655440000,12345,alice@example.com,500.00,USD,US,2024-01-15,2024-01-20,192.168.1.1,SAVE10,2,250.00,500.00
invalid-uuid,abc,not-an-email,-100,US,USA,2024/01/15,2024-01-10,999.999.999.999,PROMO-CODE-TOO-LONG!!,0,0,999
550e8400-e29b-41d4-a716-446655440001,,bob@test,50000,EURO,J1,not-a-date,,2001:db8::1,VALID20,101,-50,50000
123e4567-e89b-42d3-a456-426614174000,99999,charlie@company.com,1500.50,JPY,JP,2024-02-28,2024-02-25,10.0.0.1,VIP,5,300.10,1500.50
`
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var orders []Order
_, result, err := processor.Process(strings.NewReader(csvData), &orders)
if err != nil {
fmt.Printf("Fatal error: %v\n", err)
return
}
fmt.Printf("=== Validation Report ===\n")
fmt.Printf("Total rows: %d\n", result.RowCount)
fmt.Printf("Valid rows: %d\n", result.ValidRowCount)
fmt.Printf("Invalid rows: %d\n", result.RowCount-result.ValidRowCount)
fmt.Printf("Total errors: %d\n\n", len(result.ValidationErrors()))
if result.HasErrors() {
fmt.Println("=== Error Details ===")
for _, e := range result.ValidationErrors() {
fmt.Printf("Row %d, Column '%s': %s\n", e.Row, e.Column, e.Message)
}
}
}
Output:

=== Validation Report ===
Total rows: 4
Valid rows: 1
Invalid rows: 3
Total errors: 23

=== Error Details ===
Row 2, Column 'order_id': value must be a valid UUID version 4
Row 2, Column 'customer_id': value must be numeric
Row 2, Column 'email': value must be a valid email address
Row 2, Column 'amount': value must be greater than 0
Row 2, Column 'currency': value must have exactly 3 characters
Row 2, Column 'country': value must have exactly 2 characters
Row 2, Column 'order_date': value must be a valid datetime in format: 2006-01-02
Row 2, Column 'ip_address': value must be a valid IP address
Row 2, Column 'promo_code': value must contain only alphanumeric characters
Row 2, Column 'quantity': value must be greater than or equal to 1
Row 2, Column 'unit_price': value must be greater than 0
Row 2, Column 'ship_date': value must be greater than field OrderDate
Row 2, Column 'total_check': value must equal field Amount
Row 3, Column 'customer_id': value is required
Row 3, Column 'email': value must be a valid email address
Row 3, Column 'amount': value must be less than or equal to 10000
Row 3, Column 'currency': value must have exactly 3 characters
Row 3, Column 'country': value must contain only alphabetic characters
Row 3, Column 'order_date': value must be a valid datetime in format: 2006-01-02
Row 3, Column 'quantity': value must be less than or equal to 100
Row 3, Column 'unit_price': value must be greater than 0
Row 3, Column 'ship_date': value must be greater than field OrderDate
Row 4, Column 'ship_date': value must be greater than field OrderDate
Example (DetectFileType) ¶
Example_detectFileType demonstrates automatic file type detection from file paths.
package main
import (
"fmt"
"github.com/nao1215/fileprep"
)
func main() {
files := []string{
"data.csv",
"data.csv.gz",
"report.xlsx",
"logs.tsv.bz2",
"events.parquet",
"access.ltsv.zst",
"config.json",
"config.json.gz",
"events.jsonl",
"events.jsonl.zst",
}
for _, f := range files {
ft := fileprep.DetectFileType(f)
fmt.Printf("%s -> %s (compressed: %v)\n", f, ft, fileprep.IsCompressed(ft))
}
}
Output:

data.csv -> CSV (compressed: false)
data.csv.gz -> CSV (gzip) (compressed: true)
report.xlsx -> XLSX (compressed: false)
logs.tsv.bz2 -> TSV (bzip2) (compressed: true)
events.parquet -> Parquet (compressed: false)
access.ltsv.zst -> LTSV (zstd) (compressed: true)
config.json -> JSON (compressed: false)
config.json.gz -> JSON (gzip) (compressed: true)
events.jsonl -> JSONL (compressed: false)
events.jsonl.zst -> JSONL (zstd) (compressed: true)
Example (EmployeePreprocessing) ¶
Example_employeePreprocessing demonstrates the full power of fileprep: combining multiple preprocessors and validators to clean and validate real-world messy data.
package main
import (
"fmt"
"strings"
"github.com/nao1215/fileprep"
)
func main() {
// Employee represents employee data with comprehensive preprocessing and validation
type Employee struct {
// ID: pad to 6 digits, must be numeric
EmployeeID string `name:"id" prep:"trim,pad_left=6:0" validate:"required,numeric,len=6"`
// Name: clean whitespace, required alphabetic with spaces
FullName string `name:"name" prep:"trim,collapse_space" validate:"required,alphaspace"`
// Email: normalize to lowercase, validate format
Email string `prep:"trim,lowercase" validate:"required,email"`
// Department: normalize case, must be one of allowed values
Department string `prep:"trim,uppercase" validate:"required,oneof=ENGINEERING SALES MARKETING HR"`
// Salary: keep only digits, validate range
Salary string `prep:"trim,keep_digits" validate:"required,numeric,gte=30000,lte=500000"`
// Phone: extract digits, validate E.164 format after adding country code
Phone string `prep:"trim,keep_digits,prefix=+1" validate:"e164"`
// Start date: validate datetime format
StartDate string `name:"start_date" prep:"trim" validate:"required,datetime=2006-01-02"`
// Manager ID: required only if department is not HR
ManagerID string `name:"manager_id" prep:"trim,pad_left=6:0" validate:"required_unless=Department HR"`
// Website: fix missing scheme, validate URL
Website string `prep:"trim,lowercase,fix_scheme=https" validate:"url"`
}
// Messy real-world CSV data
csvData := `id,name,email,department,salary,phone,start_date,manager_id,website
42, John Doe ,JOHN.DOE@COMPANY.COM,engineering,75000,5551234567,2023-01-15,1,company.com/john
7,Jane Smith,jane@COMPANY.com, Sales ,120000,5559876543,2022-06-01,2,WWW.LINKEDIN.COM/in/jane
123,Bob Wilson,bob.wilson@company.com,HR,45000,5551112222,2024-03-20,,
99,Alice Brown,alice@company.com,Marketing,88500,5554443333,2023-09-10,3,https://alice.dev
`
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var employees []Employee
_, result, err := processor.Process(strings.NewReader(csvData), &employees)
if err != nil {
fmt.Printf("Fatal error: %v\n", err)
return
}
fmt.Printf("=== Processing Result ===\n")
fmt.Printf("Total rows: %d, Valid rows: %d\n\n", result.RowCount, result.ValidRowCount)
for i, emp := range employees {
fmt.Printf("Employee %d:\n", i+1)
fmt.Printf(" ID: %s\n", emp.EmployeeID)
fmt.Printf(" Name: %s\n", emp.FullName)
fmt.Printf(" Email: %s\n", emp.Email)
fmt.Printf(" Department: %s\n", emp.Department)
fmt.Printf(" Salary: %s\n", emp.Salary)
fmt.Printf(" Phone: %s\n", emp.Phone)
fmt.Printf(" Start Date: %s\n", emp.StartDate)
fmt.Printf(" Manager ID: %s\n", emp.ManagerID)
if emp.Website != "" {
fmt.Printf(" Website: %s\n\n", emp.Website)
} else {
fmt.Printf(" Website: (none)\n\n")
}
}
}
Output:

=== Processing Result ===
Total rows: 4, Valid rows: 3

Employee 1:
  ID: 000042
  Name: John Doe
  Email: john.doe@company.com
  Department: ENGINEERING
  Salary: 75000
  Phone: +15551234567
  Start Date: 2023-01-15
  Manager ID: 000001
  Website: https://company.com/john

Employee 2:
  ID: 000007
  Name: Jane Smith
  Email: jane@company.com
  Department: SALES
  Salary: 120000
  Phone: +15559876543
  Start Date: 2022-06-01
  Manager ID: 000002
  Website: https://www.linkedin.com/in/jane

Employee 3:
  ID: 000123
  Name: Bob Wilson
  Email: bob.wilson@company.com
  Department: HR
  Salary: 45000
  Phone: +15551112222
  Start Date: 2024-03-20
  Manager ID: 000000
  Website: (none)

Employee 4:
  ID: 000099
  Name: Alice Brown
  Email: alice@company.com
  Department: MARKETING
  Salary: 88500
  Phone: +15554443333
  Start Date: 2023-09-10
  Manager ID: 000003
  Website: https://alice.dev
Example (JsonProcessing) ¶
Example_jsonProcessing demonstrates processing JSON data. JSON arrays are parsed into rows, each containing the raw JSON element in a single "data" column.
package main
import (
"fmt"
"io"
"strings"
"github.com/nao1215/fileprep"
)
func main() {
type Record struct {
Data string `name:"data" validate:"required"`
}
jsonData := `[{"name":"Alice","age":30},{"name":"Bob","age":25}]`
processor := fileprep.NewProcessor(fileprep.FileTypeJSON)
var records []Record
reader, result, err := processor.Process(strings.NewReader(jsonData), &records)
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)
for i, r := range records {
fmt.Printf("Row %d: %s\n", i+1, r.Data)
}
output, _ := io.ReadAll(reader) //nolint:errcheck // Example code
fmt.Printf("JSONL output:\n%s", output)
}
Output:

Processed 2 rows, 2 valid
Row 1: {"name":"Alice","age":30}
Row 2: {"name":"Bob","age":25}
JSONL output:
{"name":"Alice","age":30}
{"name":"Bob","age":25}
Example (ProcessToWriter) ¶
Example_processToWriter demonstrates ProcessToWriter, which writes preprocessed output directly to an io.Writer instead of buffering the entire output in memory.
package main
import (
"bytes"
"fmt"
"strings"
"github.com/nao1215/fileprep"
)
func main() {
type Record struct {
Name string `prep:"trim" validate:"required"`
Email string `prep:"trim,lowercase"`
}
csvData := "name,email\n Alice ,ALICE@EXAMPLE.COM\n Bob ,BOB@EXAMPLE.COM\n"
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var records []Record
var buf bytes.Buffer
result, err := processor.ProcessToWriter(strings.NewReader(csvData), &records, &buf)
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)
fmt.Printf("Output:\n%s", buf.String())
}
Output:

Processed 2 rows, 2 valid
Output:
name,email
Alice,alice@example.com
Bob,bob@example.com
Example (Validation) ¶
package main
import (
"fmt"
"strings"
"github.com/nao1215/fileprep"
)
// User represents a user record with preprocessing and validation
type User struct {
Name string `prep:"trim" validate:"required"`
Email string `prep:"trim,lowercase" validate:"required"`
Age string
}
func main() {
// CSV data with validation error (empty required name)
csvData := `name,email,age
,john@example.com,30
Jane,jane@example.com,25
`
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var users []User
_, result, err := processor.Process(strings.NewReader(csvData), &users)
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
// Check for validation errors
if result.HasErrors() {
fmt.Printf("Found %d validation errors:\n", len(result.Errors))
for _, e := range result.ValidationErrors() {
fmt.Printf(" Row %d, Column %q: %s\n", e.Row, e.Column, e.Message)
}
}
fmt.Printf("Valid rows: %d/%d\n", result.ValidRowCount, result.RowCount)
}
Output:

Found 1 validation errors:
  Row 1, Column "name": value is required
Valid rows: 1/2
Constants ¶
const (
	FileTypeCSV      = fileparser.CSV
	FileTypeTSV      = fileparser.TSV
	FileTypeLTSV     = fileparser.LTSV
	FileTypeParquet  = fileparser.Parquet
	FileTypeXLSX     = fileparser.XLSX
	FileTypeCSVGZ    = fileparser.CSVGZ
	FileTypeCSVBZ2   = fileparser.CSVBZ2
	FileTypeCSVXZ    = fileparser.CSVXZ
	FileTypeCSVZSTD  = fileparser.CSVZSTD
	FileTypeTSVGZ    = fileparser.TSVGZ
	FileTypeTSVBZ2   = fileparser.TSVBZ2
	FileTypeTSVXZ    = fileparser.TSVXZ
	FileTypeTSVZSTD  = fileparser.TSVZSTD
	FileTypeLTSVGZ   = fileparser.LTSVGZ
	FileTypeLTSVBZ2  = fileparser.LTSVBZ2
	FileTypeLTSVXZ   = fileparser.LTSVXZ
	FileTypeLTSVZSTD = fileparser.LTSVZSTD
	FileTypeParquetGZ   = fileparser.ParquetGZ
	FileTypeParquetBZ2  = fileparser.ParquetBZ2
	FileTypeParquetXZ   = fileparser.ParquetXZ
	FileTypeParquetZSTD = fileparser.ParquetZSTD
	FileTypeXLSXGZ   = fileparser.XLSXGZ
	FileTypeXLSXBZ2  = fileparser.XLSXBZ2
	FileTypeXLSXXZ   = fileparser.XLSXXZ
	FileTypeXLSXZSTD = fileparser.XLSXZSTD

	// zlib compression formats (v0.2.0+)
	FileTypeCSVZLIB     = fileparser.CSVZLIB
	FileTypeTSVZLIB     = fileparser.TSVZLIB
	FileTypeLTSVZLIB    = fileparser.LTSVZLIB
	FileTypeParquetZLIB = fileparser.ParquetZLIB
	FileTypeXLSXZLIB    = fileparser.XLSXZLIB

	// snappy compression formats (v0.2.0+)
	FileTypeCSVSNAPPY     = fileparser.CSVSNAPPY
	FileTypeTSVSNAPPY     = fileparser.TSVSNAPPY
	FileTypeLTSVSNAPPY    = fileparser.LTSVSNAPPY
	FileTypeParquetSNAPPY = fileparser.ParquetSNAPPY
	FileTypeXLSXSNAPPY    = fileparser.XLSXSNAPPY

	// s2 compression formats (v0.2.0+)
	FileTypeCSVS2     = fileparser.CSVS2
	FileTypeTSVS2     = fileparser.TSVS2
	FileTypeLTSVS2    = fileparser.LTSVS2
	FileTypeParquetS2 = fileparser.ParquetS2
	FileTypeXLSXS2    = fileparser.XLSXS2

	// lz4 compression formats (v0.2.0+)
	FileTypeCSVLZ4     = fileparser.CSVLZ4
	FileTypeTSVLZ4     = fileparser.TSVLZ4
	FileTypeLTSVLZ4    = fileparser.LTSVLZ4
	FileTypeParquetLZ4 = fileparser.ParquetLZ4
	FileTypeXLSXLZ4    = fileparser.XLSXLZ4

	// JSON/JSONL file types (v0.5.0+)
	FileTypeJSON  = fileparser.JSON
	FileTypeJSONL = fileparser.JSONL

	// JSON compression formats (v0.5.0+)
	FileTypeJSONGZ     = fileparser.JSONGZ
	FileTypeJSONBZ2    = fileparser.JSONBZ2
	FileTypeJSONXZ     = fileparser.JSONXZ
	FileTypeJSONZSTD   = fileparser.JSONZSTD
	FileTypeJSONZLIB   = fileparser.JSONZLIB
	FileTypeJSONSNAPPY = fileparser.JSONSNAPPY
	FileTypeJSONS2     = fileparser.JSONS2
	FileTypeJSONLZ4    = fileparser.JSONLZ4

	// JSONL compression formats (v0.5.0+)
	FileTypeJSONLGZ     = fileparser.JSONLGZ
	FileTypeJSONLBZ2    = fileparser.JSONLBZ2
	FileTypeJSONLXZ     = fileparser.JSONLXZ
	FileTypeJSONLZSTD   = fileparser.JSONLZSTD
	FileTypeJSONLZLIB   = fileparser.JSONLZLIB
	FileTypeJSONLSNAPPY = fileparser.JSONLSNAPPY
	FileTypeJSONLS2     = fileparser.JSONLS2
	FileTypeJSONLLZ4    = fileparser.JSONLLZ4

	FileTypeUnsupported = fileparser.Unsupported
)
File type constants re-exported from fileparser for backward compatibility.
Variables ¶
var (
	// ErrStructSlicePointer is returned when the value is not a pointer to a struct slice
	ErrStructSlicePointer = errors.New("value must be a pointer to a struct slice")

	// ErrUnsupportedFileType is returned when the file type is not supported
	ErrUnsupportedFileType = errors.New("unsupported file type")

	// ErrEmptyFile is returned when the file is empty
	ErrEmptyFile = errors.New("file is empty")

	// ErrInvalidTagFormat is returned when the tag format is invalid
	ErrInvalidTagFormat = errors.New("invalid tag format")

	// ErrInvalidJSONAfterPrep is returned when preprocessing destroys JSON structure
	// in the "data" column of a JSON/JSONL file. This is a hard error because
	// invalid JSON lines in JSONL output cause downstream parsers to fail.
	ErrInvalidJSONAfterPrep = errors.New("preprocessing produced invalid JSON")

	// ErrEmptyJSONOutput is returned when all JSON/JSONL rows are empty or invalid
	// after preprocessing, resulting in no output lines. An empty JSONL output is
	// unparseable by downstream consumers.
	ErrEmptyJSONOutput = errors.New("JSON/JSONL output has no valid rows after preprocessing")

	// ErrNilWriter is returned when a nil io.Writer is passed to ProcessToWriter.
	ErrNilWriter = errors.New("writer must not be nil")

	// ErrNilReader is returned when a nil io.Reader is passed to Process or ProcessToWriter.
	ErrNilReader = errors.New("reader must not be nil")
)
Sentinel errors returned by fileprep.
Functions ¶
func IsCompressed ¶ added in v0.3.0
IsCompressed returns true if the file type is compressed. This is a convenience wrapper around fileparser.IsCompressed.
Types ¶
type CrossFieldValidator ¶
type CrossFieldValidator interface {
// Validate checks if the source value is valid compared to the target value
// Returns empty string if validation passes, error message otherwise
Validate(srcValue, targetValue string) string
// Name returns the name of the validator for error reporting
Name() string
// TargetField returns the name of the field to compare against
TargetField() string
}
CrossFieldValidator defines the interface for validators that compare values across fields
type FileType ¶
type FileType = fileparser.FileType
FileType is an alias for fileparser.FileType for backward compatibility.
func DetectFileType ¶
DetectFileType detects file type from extension. This is a convenience wrapper around fileparser.DetectFileType.
type Option ¶ added in v0.5.0
type Option func(*Processor)
Option configures a Processor.
func WithStrictTagParsing ¶ added in v0.5.0
func WithStrictTagParsing() Option
WithStrictTagParsing enables strict tag parsing mode. When enabled, invalid tag arguments (e.g., "eq=abc" where a number is expected) return an error during Process() instead of being silently ignored.
Example:
processor := fileprep.NewProcessor(fileparser.CSV, fileprep.WithStrictTagParsing())
func WithValidRowsOnly ¶ added in v0.5.0
func WithValidRowsOnly() Option
WithValidRowsOnly configures the Processor to include only valid rows in the output io.Reader and struct slice. Rows that fail validation are excluded from the output but still counted in ProcessResult.RowCount and reported in ProcessResult.Errors.
Example:
processor := fileprep.NewProcessor(fileparser.CSV, fileprep.WithValidRowsOnly())
reader, result, err := processor.Process(input, &records)
// reader contains only rows that passed all validations
// result.RowCount includes all rows, result.ValidRowCount has valid count
type PrepError ¶
type PrepError struct {
Row int // 1-based row number
Column string // Column name
Field string // Struct field name
Tag string // The prep tag that failed
Message string // Human-readable error message
}
PrepError represents a preprocessing error.
Example:
for _, pe := range result.PrepErrors() {
fmt.Printf("Row %d, Column %q: %s (tag=%q)\n",
pe.Row, pe.Column, pe.Message, pe.Tag)
}
type Preprocessor ¶
type Preprocessor interface {
// Process applies preprocessing to the value and returns the result
Process(value string) string
// Name returns the name of the preprocessor for error reporting
Name() string
}
Preprocessor defines the interface for preprocessing values
type ProcessResult ¶
type ProcessResult struct {
// Errors contains all validation and preprocessing errors
Errors []error
// RowCount is the total number of data rows processed (excluding header)
RowCount int
// ValidRowCount is the number of rows that passed all validations
ValidRowCount int
// Columns contains the column names from the header
Columns []string
// OriginalFormat is the file type that was processed
OriginalFormat fileparser.FileType
// contains filtered or unexported fields
}
ProcessResult contains the results of processing a file.
Example:
reader, result, err := processor.Process(input, &records)
if result.HasErrors() {
for _, ve := range result.ValidationErrors() {
fmt.Printf("Row %d: %s\n", ve.Row, ve.Message)
}
}
fmt.Printf("Valid: %d/%d rows\n", result.ValidRowCount, result.RowCount)
func (*ProcessResult) HasErrors ¶
func (r *ProcessResult) HasErrors() bool
HasErrors returns true if there are any errors
func (*ProcessResult) InvalidRowCount ¶
func (r *ProcessResult) InvalidRowCount() int
InvalidRowCount returns the number of rows that failed validation
func (*ProcessResult) PrepErrors ¶
func (r *ProcessResult) PrepErrors() []*PrepError
PrepErrors returns only preprocessing errors
func (*ProcessResult) ValidationErrors ¶
func (r *ProcessResult) ValidationErrors() []*ValidationError
ValidationErrors returns only validation errors
type Processor ¶
type Processor struct {
// contains filtered or unexported fields
}
Processor handles preprocessing and validation of file data
func NewProcessor ¶
func NewProcessor(fileType fileparser.FileType, opts ...Option) *Processor
NewProcessor creates a new Processor for the specified file type. Options can be provided to configure behavior such as strict tag parsing and output filtering.
Example:
processor := fileprep.NewProcessor(fileparser.CSV)
var records []MyRecord
reader, result, err := processor.Process(input, &records)
// With options:
processor := fileprep.NewProcessor(fileparser.CSV,
fileprep.WithStrictTagParsing(),
fileprep.WithValidRowsOnly(),
)
func (*Processor) Process ¶
func (p *Processor) Process(input io.Reader, structSlicePointer any) (io.Reader, *ProcessResult, error)
Process reads from the input reader, applies preprocessing and validation, populates the struct slice, and returns an io.Reader with preprocessed data.
The returned io.Reader preserves the original file format:
- CSV input → CSV output
- TSV input → TSV output (tab-delimited)
- LTSV input → LTSV output (label:value format)
- JSON input → JSONL output (one JSON value per line)
- JSONL input → JSONL output (one JSON value per line)
- XLSX input → CSV output (tabular data)
- Parquet input → CSV output (tabular data)
The returned io.Reader can be passed directly to filesql.AddReader:
reader, result, err := processor.Process(input, &records)
db.AddReader(reader, "table", parser.CSV)
For format information, use ProcessResult.OriginalFormat or cast to Stream:
stream := reader.(fileprep.Stream)
fmt.Println(stream.Format()) // CSV, TSV, etc.
Example:
type User struct {
Name string `prep:"trim" validate:"required"`
Email string `prep:"trim,lowercase" validate:"email"`
Age string `validate:"numeric,min=0,max=150"`
}
csvData := "name,email,age\n John ,JOHN@EXAMPLE.COM,30\n"
processor := fileprep.NewProcessor(parser.CSV)
var users []User
reader, result, err := processor.Process(strings.NewReader(csvData), &users)
if err != nil {
log.Fatal(err)
}
fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)
func (*Processor) ProcessToWriter ¶ added in v0.6.0
func (p *Processor) ProcessToWriter(input io.Reader, structSlicePointer any, w io.Writer) (*ProcessResult, error)
ProcessToWriter works like Process but writes the preprocessed output directly to w instead of buffering it in memory. This is useful for large datasets where holding the full output buffer is undesirable.
Example:
var buf bytes.Buffer
result, err := processor.ProcessToWriter(input, &records, &buf)
type Stream ¶
type Stream interface {
io.Reader
// Format returns the actual output format of the stream data.
// For CSV/TSV/LTSV input, this matches the input format.
// For JSON/JSONL input, this returns JSONL since the output is JSONL-formatted.
// For XLSX/Parquet input, this returns CSV since the output is CSV-formatted.
Format() fileparser.FileType
// OriginalFormat returns the original input file type including compression
OriginalFormat() fileparser.FileType
}
Stream represents a preprocessed data stream with format information. It implements io.Reader and provides metadata about the file format.
type ValidationError ¶
type ValidationError struct {
Row int // 1-based row number (excluding header)
Column string // Column name
Field string // Struct field name
Value string // The value that failed validation
Tag string // The validation tag that failed
Message string // Human-readable error message
}
ValidationError represents a validation error with row and column information.
Example:
for _, ve := range result.ValidationErrors() {
fmt.Printf("Row %d, Column %q: %s (value=%q)\n",
ve.Row, ve.Column, ve.Message, ve.Value)
}
func (*ValidationError) Error ¶
func (e *ValidationError) Error() string
Error implements the error interface
type Validator ¶
type Validator interface {
// Validate checks if the value is valid and returns an error message if not
// Returns empty string if validation passes
Validate(value string) string
// Name returns the name of the validator for error reporting
Name() string
}
Validator defines the interface for validating values
