Published: Mar 22, 2026 License: MIT Imports: 19 Imported by: 0


fileprep


日本語 | Español | Français | 한국어 | Русский | 中文


fileprep is a Go library for cleaning, normalizing, and validating structured data—CSV, TSV, LTSV, JSON, JSONL, Parquet, and Excel—through lightweight struct-tag rules, with seamless support for gzip, bzip2, xz, zstd, zlib, snappy, s2, and lz4 streams.

Why fileprep?

I developed nao1215/filesql, which allows you to execute SQL queries on files like CSV, TSV, LTSV, Parquet, and Excel. I also created nao1215/csv for CSV file validation.

While studying machine learning, I realized: "If I extend nao1215/csv to support the same file formats as nao1215/filesql, I could combine them to perform ETL-like operations." This idea led to the creation of fileprep—a library that bridges data preprocessing/validation with SQL-based file querying.

Features

  • Multiple file format support: CSV, TSV, LTSV, JSON (.json), JSONL (.jsonl), Parquet, Excel (.xlsx)
  • Compression support: gzip (.gz), bzip2 (.bz2), xz (.xz), zstd (.zst), zlib (.z), snappy (.snappy), s2 (.s2), lz4 (.lz4)
  • Name-based column binding: Fields auto-match snake_case column names, customizable via name tag
  • Struct tag-based preprocessing (prep tag): trim, lowercase, uppercase, default values
  • Struct tag-based validation (validate tag): required, omitempty, and more
  • Processor options: WithStrictTagParsing() for catching tag misconfigurations, WithValidRowsOnly() for filtering output
  • Seamless filesql integration: Returns io.Reader for direct use with filesql
  • Detailed error reporting: Row and column information for each error

Installation

go get github.com/nao1215/fileprep

Requirements

  • Go Version: 1.25 or later
  • Operating Systems:
    • Linux
    • macOS
    • Windows

Quick Start

package main

import (
    "fmt"
    "strings"

    "github.com/nao1215/fileprep"
)

// User represents a user record with preprocessing and validation
type User struct {
    Name  string `prep:"trim" validate:"required"`
    Email string `prep:"trim,lowercase"`
    Age   string
}

func main() {
    csvData := `name,email,age
  John Doe  ,JOHN@EXAMPLE.COM,30
Jane Smith,jane@example.com,25
`

    processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
    var users []User

    reader, result, err := processor.Process(strings.NewReader(csvData), &users)
    if err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }

    fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)

    for _, user := range users {
        fmt.Printf("Name: %q, Email: %q\n", user.Name, user.Email)
    }

    // reader can be passed directly to filesql
    _ = reader
}

Output:

Processed 2 rows, 2 valid
Name: "John Doe", Email: "john@example.com"
Name: "Jane Smith", Email: "jane@example.com"

Gotchas

A few things worth knowing before you start.

JSON/JSONL → single "data" column. fileparser flattens each JSON array element or JSONL line into one column called "data". Your struct needs a field that maps to it:

type JSONRecord struct {
    Data string `name:"data" prep:"trim" validate:"required"`
}

Output is always compact JSONL. A prep tag that breaks JSON structure causes ErrInvalidJSONAfterPrep; all-empty output causes ErrEmptyJSONOutput.

Column matching is case-sensitive. Field UserName auto-converts to user_name. Headers spelled differently (User_Name, USERNAME, userName) won't match. Override with the name tag:

type Record struct {
    UserName string                 // matches "user_name" only
    Email    string `name:"EMAIL"`  // matches "EMAIL" exactly
}

Duplicate headers → first column wins. Given id,id,name, only the first id binds.

Missing columns → empty string. If a column is absent, the field gets "". Use validate:"required" to catch this.

Excel → first sheet only. Additional sheets in .xlsx are silently skipped.

Saving output memory → use ProcessToWriter. Process buffers the entire output in memory. ProcessToWriter skips that buffer and writes directly to any io.Writer. Note that input records are still loaded into memory for preprocessing; this only eliminates the output copy:

f, _ := os.Create("output.csv")
defer f.Close()

result, err := processor.ProcessToWriter(input, &records, f)

Advanced Examples

Complex Data Preprocessing and Validation

This example demonstrates the full power of fileprep: combining multiple preprocessors and validators to clean and validate real-world messy data.

package main

import (
    "fmt"
    "strings"

    "github.com/nao1215/fileprep"
)

// Employee represents employee data with comprehensive preprocessing and validation
type Employee struct {
    // ID: pad to 6 digits, must be numeric
    EmployeeID string `name:"id" prep:"trim,pad_left=6:0" validate:"required,numeric,len=6"`

    // Name: clean whitespace, required alphabetic with spaces
    FullName string `name:"name" prep:"trim,collapse_space" validate:"required,alphaspace"`

    // Email: normalize to lowercase, validate format
    Email string `prep:"trim,lowercase" validate:"required,email"`

    // Department: normalize case, must be one of allowed values
    Department string `prep:"trim,uppercase" validate:"required,oneof=ENGINEERING SALES MARKETING HR"`

    // Salary: keep only digits, validate range
    Salary string `prep:"trim,keep_digits" validate:"required,numeric,gte=30000,lte=500000"`

    // Phone: extract digits, validate E.164 format after adding country code
    Phone string `prep:"trim,keep_digits,prefix=+1" validate:"e164"`

    // Start date: validate datetime format
    StartDate string `name:"start_date" prep:"trim" validate:"required,datetime=2006-01-02"`

    // Manager ID: required only if department is not HR
    ManagerID string `name:"manager_id" prep:"trim,pad_left=6:0" validate:"required_unless=Department HR"`

    // Website: fix missing scheme, validate URL
    Website string `prep:"trim,lowercase,fix_scheme=https" validate:"url"`
}

func main() {
    // Messy real-world CSV data
    csvData := `id,name,email,department,salary,phone,start_date,manager_id,website
  42,  John   Doe  ,JOHN.DOE@COMPANY.COM,engineering,"$75,000",555-123-4567,2023-01-15,000001,company.com/john
7,Jane Smith,jane@COMPANY.com,  Sales  ,"$120,000",(555) 987-6543,2022-06-01,000002,WWW.LINKEDIN.COM/in/jane
123,Bob Wilson,bob.wilson@company.com,HR,45000,555.111.2222,2024-03-20,,
99,Alice Brown,alice@company.com,Marketing,$88500,555-444-3333,2023-09-10,000003,https://alice.dev
`

    processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
    var employees []Employee

    _, result, err := processor.Process(strings.NewReader(csvData), &employees)
    if err != nil {
        fmt.Printf("Fatal error: %v\n", err)
        return
    }

    fmt.Printf("=== Processing Result ===\n")
    fmt.Printf("Total rows: %d, Valid rows: %d\n\n", result.RowCount, result.ValidRowCount)

    for i, emp := range employees {
        fmt.Printf("Employee %d:\n", i+1)
        fmt.Printf("  ID:         %s\n", emp.EmployeeID)
        fmt.Printf("  Name:       %s\n", emp.FullName)
        fmt.Printf("  Email:      %s\n", emp.Email)
        fmt.Printf("  Department: %s\n", emp.Department)
        fmt.Printf("  Salary:     %s\n", emp.Salary)
        fmt.Printf("  Phone:      %s\n", emp.Phone)
        fmt.Printf("  Start Date: %s\n", emp.StartDate)
        fmt.Printf("  Manager ID: %s\n", emp.ManagerID)
        fmt.Printf("  Website:    %s\n\n", emp.Website)
    }
}

Output:

=== Processing Result ===
Total rows: 4, Valid rows: 4

Employee 1:
  ID:         000042
  Name:       John Doe
  Email:      john.doe@company.com
  Department: ENGINEERING
  Salary:     75000
  Phone:      +15551234567
  Start Date: 2023-01-15
  Manager ID: 000001
  Website:    https://company.com/john

Employee 2:
  ID:         000007
  Name:       Jane Smith
  Email:      jane@company.com
  Department: SALES
  Salary:     120000
  Phone:      +15559876543
  Start Date: 2022-06-01
  Manager ID: 000002
  Website:    https://www.linkedin.com/in/jane

Employee 3:
  ID:         000123
  Name:       Bob Wilson
  Email:      bob.wilson@company.com
  Department: HR
  Salary:     45000
  Phone:      +15551112222
  Start Date: 2024-03-20
  Manager ID: 000000
  Website:

Employee 4:
  ID:         000099
  Name:       Alice Brown
  Email:      alice@company.com
  Department: MARKETING
  Salary:     88500
  Phone:      +15554443333
  Start Date: 2023-09-10
  Manager ID: 000003
  Website:    https://alice.dev

Detailed Error Reporting

When validation fails, fileprep provides precise error information including row number, column name, and specific validation failure reason.

package main

import (
    "fmt"
    "strings"

    "github.com/nao1215/fileprep"
)

// Order represents an order with strict validation rules
type Order struct {
    OrderID    string `name:"order_id" validate:"required,uuid4"`
    CustomerID string `name:"customer_id" validate:"required,numeric"`
    Email      string `validate:"required,email"`
    Amount     string `validate:"required,number,gt=0,lte=10000"`
    Currency   string `validate:"required,len=3,uppercase"`
    Country    string `validate:"required,alpha,len=2"`
    OrderDate  string `name:"order_date" validate:"required,datetime=2006-01-02"`
    ShipDate   string `name:"ship_date" validate:"datetime=2006-01-02,gtfield=OrderDate"`
    IPAddress  string `name:"ip_address" validate:"required,ip_addr"`
    PromoCode  string `name:"promo_code" validate:"alphanumeric"`
    Quantity   string `validate:"required,numeric,gte=1,lte=100"`
    UnitPrice  string `name:"unit_price" validate:"required,number,gt=0"`
    TotalCheck string `name:"total_check" validate:"required,eqfield=Amount"`
}

func main() {
    // CSV with multiple validation errors
    csvData := `order_id,customer_id,email,amount,currency,country,order_date,ship_date,ip_address,promo_code,quantity,unit_price,total_check
550e8400-e29b-41d4-a716-446655440000,12345,alice@example.com,500.00,USD,US,2024-01-15,2024-01-20,192.168.1.1,SAVE10,2,250.00,500.00
invalid-uuid,abc,not-an-email,-100,US,USA,2024/01/15,2024-01-10,999.999.999.999,PROMO-CODE-TOO-LONG!!,0,0,999
550e8400-e29b-41d4-a716-446655440001,,bob@test,50000,EURO,J1,not-a-date,,2001:db8::1,VALID20,101,-50,50000
123e4567-e89b-42d3-a456-426614174000,99999,charlie@company.com,1500.50,JPY,JP,2024-02-28,2024-02-25,10.0.0.1,VIP,5,300.10,1500.50
`

    processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
    var orders []Order

    _, result, err := processor.Process(strings.NewReader(csvData), &orders)
    if err != nil {
        fmt.Printf("Fatal error: %v\n", err)
        return
    }

    fmt.Printf("=== Validation Report ===\n")
    fmt.Printf("Total rows:     %d\n", result.RowCount)
    fmt.Printf("Valid rows:     %d\n", result.ValidRowCount)
    fmt.Printf("Invalid rows:   %d\n", result.RowCount-result.ValidRowCount)
    fmt.Printf("Total errors:   %d\n\n", len(result.ValidationErrors()))

    if result.HasErrors() {
        fmt.Println("=== Error Details ===")
        for _, e := range result.ValidationErrors() {
            fmt.Printf("Row %d, Column '%s': %s\n", e.Row, e.Column, e.Message)
        }
    }
}

Output:

=== Validation Report ===
Total rows:     4
Valid rows:     1
Invalid rows:   3
Total errors:   23

=== Error Details ===
Row 2, Column 'order_id': value must be a valid UUID version 4
Row 2, Column 'customer_id': value must be numeric
Row 2, Column 'email': value must be a valid email address
Row 2, Column 'amount': value must be greater than 0
Row 2, Column 'currency': value must have exactly 3 characters
Row 2, Column 'country': value must have exactly 2 characters
Row 2, Column 'order_date': value must be a valid datetime in format: 2006-01-02
Row 2, Column 'ip_address': value must be a valid IP address
Row 2, Column 'promo_code': value must contain only alphanumeric characters
Row 2, Column 'quantity': value must be greater than or equal to 1
Row 2, Column 'unit_price': value must be greater than 0
Row 2, Column 'ship_date': value must be greater than field OrderDate
Row 2, Column 'total_check': value must equal field Amount
Row 3, Column 'customer_id': value is required
Row 3, Column 'email': value must be a valid email address
Row 3, Column 'amount': value must be less than or equal to 10000
Row 3, Column 'currency': value must have exactly 3 characters
Row 3, Column 'country': value must contain only alphabetic characters
Row 3, Column 'order_date': value must be a valid datetime in format: 2006-01-02
Row 3, Column 'quantity': value must be less than or equal to 100
Row 3, Column 'unit_price': value must be greater than 0
Row 3, Column 'ship_date': value must be greater than field OrderDate
Row 4, Column 'ship_date': value must be greater than field OrderDate

Preprocessing Tags (prep)

Multiple tags can be combined: prep:"trim,lowercase,default=N/A"

Basic Preprocessors
Tag Description Example
trim Remove leading/trailing whitespace prep:"trim"
ltrim Remove leading whitespace prep:"ltrim"
rtrim Remove trailing whitespace prep:"rtrim"
lowercase Convert to lowercase prep:"lowercase"
uppercase Convert to uppercase prep:"uppercase"
default=value Set default if empty prep:"default=N/A"
String Transformation
Tag Description Example
replace=old:new Replace all occurrences prep:"replace=;:,"
prefix=value Prepend string to value prep:"prefix=ID_"
suffix=value Append string to value prep:"suffix=_END"
truncate=N Limit to N characters prep:"truncate=100"
strip_html Remove HTML tags prep:"strip_html"
strip_newline Remove newlines (LF, CRLF, CR) prep:"strip_newline"
collapse_space Collapse multiple spaces into one prep:"collapse_space"
Character Filtering
Tag Description Example
remove_digits Remove all digits prep:"remove_digits"
remove_alpha Remove all alphabetic characters prep:"remove_alpha"
keep_digits Keep only digits prep:"keep_digits"
keep_alpha Keep only alphabetic characters prep:"keep_alpha"
trim_set=chars Remove specified characters from both ends prep:"trim_set=@#$"
Padding
Tag Description Example
pad_left=N:char Left-pad to N characters prep:"pad_left=5:0"
pad_right=N:char Right-pad to N characters prep:"pad_right=10: "
Advanced Preprocessors
Tag Description Example
normalize_unicode Normalize Unicode to NFC form prep:"normalize_unicode"
nullify=value Treat specific string as empty prep:"nullify=NULL"
coerce=type Type coercion (int, float, bool) prep:"coerce=int"
fix_scheme=scheme Add or fix URL scheme prep:"fix_scheme=https"
regex_replace=pattern:replacement Regex-based replacement prep:"regex_replace=\\d+:X"

Validation Tags (validate)

Multiple tags can be combined: validate:"required,email"

Basic Validators
Tag Description Example
required Field must not be empty validate:"required"
omitempty Skip subsequent validators if value is empty validate:"omitempty,email"
boolean Must be true, false, 0, or 1 validate:"boolean"
Character Type Validators
Tag Description Example
alpha ASCII alphabetic characters only validate:"alpha"
alphaunicode Unicode letters only validate:"alphaunicode"
alphaspace Alphabetic characters or spaces validate:"alphaspace"
alphanumeric ASCII alphanumeric characters validate:"alphanumeric"
alphanumunicode Unicode letters or digits validate:"alphanumunicode"
numeric Valid integer validate:"numeric"
number Valid number (integer or decimal) validate:"number"
ascii ASCII characters only validate:"ascii"
printascii Printable ASCII characters (0x20-0x7E) validate:"printascii"
multibyte Contains multibyte characters validate:"multibyte"
Numeric Comparison Validators
Tag Description Example
eq=N Value equals N validate:"eq=100"
ne=N Value not equals N validate:"ne=0"
gt=N Value greater than N validate:"gt=0"
gte=N Value greater than or equal to N validate:"gte=1"
lt=N Value less than N validate:"lt=100"
lte=N Value less than or equal to N validate:"lte=99"
min=N Value at least N validate:"min=0"
max=N Value at most N validate:"max=100"
len=N Exactly N characters validate:"len=10"
String Validators
Tag Description Example
oneof=a b c Value is one of the allowed values validate:"oneof=active inactive"
lowercase Value is all lowercase validate:"lowercase"
uppercase Value is all uppercase validate:"uppercase"
eq_ignore_case=value Case-insensitive equality validate:"eq_ignore_case=yes"
ne_ignore_case=value Case-insensitive not equal validate:"ne_ignore_case=no"
String Content Validators
Tag Description Example
startswith=prefix Value starts with prefix validate:"startswith=http"
startsnotwith=prefix Value does not start with prefix validate:"startsnotwith=_"
endswith=suffix Value ends with suffix validate:"endswith=.com"
endsnotwith=suffix Value does not end with suffix validate:"endsnotwith=.tmp"
contains=substr Value contains substring validate:"contains=@"
containsany=chars Value contains any of the chars validate:"containsany=abc"
containsrune=r Value contains the rune validate:"containsrune=@"
excludes=substr Value does not contain substring validate:"excludes=admin"
excludesall=chars Value does not contain any of the chars validate:"excludesall=<>"
excludesrune=r Value does not contain the rune validate:"excludesrune=$"
Format Validators
Tag Description Example
email Valid email address validate:"email"
uri Valid URI validate:"uri"
url Valid URL validate:"url"
http_url Valid HTTP or HTTPS URL validate:"http_url"
https_url Valid HTTPS URL validate:"https_url"
url_encoded URL encoded string validate:"url_encoded"
datauri Valid data URI validate:"datauri"
datetime=layout Valid datetime matching Go layout validate:"datetime=2006-01-02"
uuid Valid UUID (any version) validate:"uuid"
uuid3 Valid UUID version 3 validate:"uuid3"
uuid4 Valid UUID version 4 validate:"uuid4"
uuid5 Valid UUID version 5 validate:"uuid5"
ulid Valid ULID validate:"ulid"
e164 Valid E.164 phone number validate:"e164"
latitude Valid latitude (-90 to 90) validate:"latitude"
longitude Valid longitude (-180 to 180) validate:"longitude"
hexadecimal Valid hexadecimal string validate:"hexadecimal"
hexcolor Valid hex color code validate:"hexcolor"
rgb Valid RGB color validate:"rgb"
rgba Valid RGBA color validate:"rgba"
hsl Valid HSL color validate:"hsl"
hsla Valid HSLA color validate:"hsla"
Network Validators
Tag Description Example
ip_addr Valid IP address (v4 or v6) validate:"ip_addr"
ip4_addr Valid IPv4 address validate:"ip4_addr"
ip6_addr Valid IPv6 address validate:"ip6_addr"
cidr Valid CIDR notation validate:"cidr"
cidrv4 Valid IPv4 CIDR validate:"cidrv4"
cidrv6 Valid IPv6 CIDR validate:"cidrv6"
mac Valid MAC address validate:"mac"
fqdn Valid fully qualified domain name validate:"fqdn"
hostname Valid hostname (RFC 952) validate:"hostname"
hostname_rfc1123 Valid hostname (RFC 1123) validate:"hostname_rfc1123"
hostname_port Valid hostname:port validate:"hostname_port"
Cross-Field Validators
Tag Description Example
eqfield=Field Value equals another field validate:"eqfield=Password"
nefield=Field Value not equals another field validate:"nefield=OldPassword"
gtfield=Field Value greater than another field validate:"gtfield=MinPrice"
gtefield=Field Value >= another field validate:"gtefield=StartDate"
ltfield=Field Value less than another field validate:"ltfield=MaxPrice"
ltefield=Field Value <= another field validate:"ltefield=EndDate"
fieldcontains=Field Value contains another field's value validate:"fieldcontains=Keyword"
fieldexcludes=Field Value excludes another field's value validate:"fieldexcludes=Forbidden"
Conditional Required Validators
Tag Description Example
required_if=Field value Required if field equals value validate:"required_if=Status active"
required_unless=Field value Required unless field equals value validate:"required_unless=Type guest"
required_with=Field Required if field is present validate:"required_with=Email"
required_without=Field Required if field is absent validate:"required_without=Phone"

Examples:

type User struct {
    Role    string
    // Profile is required when Role is "admin", optional for other roles
    Profile string `validate:"required_if=Role admin"`
    // Bio is required unless Role is "guest"
    Bio     string `validate:"required_unless=Role guest"`
}

type Contact struct {
    Email string
    Phone string
    // Name is required when Email is non-empty
    Name  string `validate:"required_with=Email"`
    // At least one of Email or BackupEmail must be provided
    BackupEmail string `validate:"required_without=Email"`
}

Supported File Formats

Format Extension Compressed Extensions
CSV .csv .csv.gz, .csv.bz2, .csv.xz, .csv.zst, .csv.z, .csv.snappy, .csv.s2, .csv.lz4
TSV .tsv .tsv.gz, .tsv.bz2, .tsv.xz, .tsv.zst, .tsv.z, .tsv.snappy, .tsv.s2, .tsv.lz4
LTSV .ltsv .ltsv.gz, .ltsv.bz2, .ltsv.xz, .ltsv.zst, .ltsv.z, .ltsv.snappy, .ltsv.s2, .ltsv.lz4
JSON .json .json.gz, .json.bz2, .json.xz, .json.zst, .json.z, .json.snappy, .json.s2, .json.lz4
JSONL .jsonl .jsonl.gz, .jsonl.bz2, .jsonl.xz, .jsonl.zst, .jsonl.z, .jsonl.snappy, .jsonl.s2, .jsonl.lz4
Excel .xlsx .xlsx.gz, .xlsx.bz2, .xlsx.xz, .xlsx.zst, .xlsx.z, .xlsx.snappy, .xlsx.s2, .xlsx.lz4
Parquet .parquet .parquet.gz, .parquet.bz2, .parquet.xz, .parquet.zst, .parquet.z, .parquet.snappy, .parquet.s2, .parquet.lz4
Supported Compression Formats
Format Extension Library Notes
gzip .gz compress/gzip Standard library
bzip2 .bz2 compress/bzip2 Standard library
xz .xz github.com/ulikunitz/xz Pure Go
zstd .zst github.com/klauspost/compress/zstd Pure Go, high performance
zlib .z compress/zlib Standard library
snappy .snappy github.com/klauspost/compress/snappy Pure Go, high performance
s2 .s2 github.com/klauspost/compress/s2 Snappy-compatible, faster
lz4 .lz4 github.com/pierrec/lz4/v4 Pure Go

Note on Parquet compression: The external compression (.parquet.gz, etc.) is for the container file itself. Parquet files may also use internal compression (Snappy, GZIP, LZ4, ZSTD) which is handled transparently by the parquet-go library.

Integration with filesql

// Process file with preprocessing and validation
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var records []MyRecord

reader, result, err := processor.Process(file, &records)
if err != nil {
    return err
}

// Check for validation errors
if result.HasErrors() {
    for _, e := range result.ValidationErrors() {
        log.Printf("Row %d, Column %s: %s", e.Row, e.Column, e.Message)
    }
}

// Pass preprocessed data to filesql using Builder pattern
ctx := context.Background()
builder := filesql.NewBuilder().
    AddReader(reader, "my_table", filesql.FileTypeCSV)

validatedBuilder, err := builder.Build(ctx)
if err != nil {
    return err
}

db, err := validatedBuilder.Open(ctx)
if err != nil {
    return err
}
defer db.Close()

// Execute SQL queries on preprocessed data
rows, err := db.QueryContext(ctx, "SELECT * FROM my_table WHERE age > 20")

Processor Options

NewProcessor accepts functional options to customize behavior:

WithStrictTagParsing

By default, invalid tag arguments (e.g., eq=abc where a number is expected) are silently ignored. Enable strict mode to catch these misconfigurations:

processor := fileprep.NewProcessor(fileprep.FileTypeCSV, fileprep.WithStrictTagParsing())
var records []MyRecord

// Returns an error if any tag argument is invalid (e.g., "eq=abc", "truncate=xyz")
_, _, err := processor.Process(input, &records)

WithValidRowsOnly

By default, the output includes all rows (valid and invalid). Use WithValidRowsOnly to filter the output to only valid rows:

processor := fileprep.NewProcessor(fileprep.FileTypeCSV, fileprep.WithValidRowsOnly())
var records []MyRecord

reader, result, err := processor.Process(input, &records)
// reader contains only rows that passed all validations
// records contains only valid structs
// result.RowCount includes all rows; result.ValidRowCount has the valid count
// result.Errors still reports all validation failures

Options can be combined:

processor := fileprep.NewProcessor(fileprep.FileTypeCSV,
    fileprep.WithStrictTagParsing(),
    fileprep.WithValidRowsOnly(),
)

Design Considerations

Name-Based Column Binding

Struct fields are mapped to file columns by name, not by position. Field names are automatically converted to snake_case to match column headers. Column order in the file does not matter.

type User struct {
    UserName string `name:"user"`       // matches "user" column (not "user_name")
    Email    string `name:"mail_addr"`  // matches "mail_addr" column (not "email")
    Age      string                     // matches "age" column (auto snake_case)
}

If your LTSV keys use hyphens (user-id) or Parquet/XLSX columns use camelCase (userId), use the name tag to specify the exact column name.

See Gotchas for case-sensitivity rules, duplicate header behavior, and missing column handling.
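The conversion can be pictured like this. It is an illustrative sketch; fileprep's actual converter may handle edge cases such as acronyms (UserID) differently, so use the name tag when in doubt:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// toSnakeCase converts a Go field name like "UserName" to "user_name"
// by inserting an underscore before each interior uppercase letter.
func toSnakeCase(s string) string {
	var b strings.Builder
	for i, r := range s {
		if unicode.IsUpper(r) {
			if i > 0 {
				b.WriteByte('_')
			}
			b.WriteRune(unicode.ToLower(r))
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(toSnakeCase("UserName")) // user_name
	fmt.Println(toSnakeCase("Age"))      // age
}
```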

Memory Usage

fileprep loads the entire file into memory for processing. This enables random access and multi-pass operations but has implications for large files:

File Size Approx. Memory Recommendation
< 100 MB ~2-3x file size Direct processing
100-500 MB ~500 MB - 1.5 GB Monitor memory, consider chunking
> 500 MB > 1.5 GB Split files or use streaming alternatives

For compressed inputs (gzip, bzip2, xz, zstd, zlib, snappy, s2, lz4), memory usage is based on decompressed size.

Performance

Benchmark results processing CSV files with a complex struct containing 21 columns. Each field uses multiple preprocessing and validation tags:

Preprocessing tags used: trim, lowercase, uppercase, keep_digits, pad_left, strip_html, strip_newline, collapse_space, truncate, fix_scheme, default

Validation tags used: required, alpha, numeric, email, uuid, ip_addr, url, oneof, min, max, len, printascii, ascii, eqfield

Records Time Memory Allocs/op
100 0.6 ms 0.9 MB 7,654
1,000 6.1 ms 9.6 MB 74,829
10,000 69 ms 101 MB 746,266
50,000 344 ms 498 MB 3,690,281

# Quick benchmark (100 and 1,000 records)
make bench

# Full benchmark (all sizes including 50,000 records)
make bench-all

Tested on AMD Ryzen AI MAX+ 395, Go 1.24, Linux. Results vary by hardware.

Contributing

Contributions are welcome! Please see the Contributing Guide for more details.

Support

If you find this project useful, please consider:

  • Giving it a star on GitHub - it helps others discover the project
  • Becoming a sponsor - your support keeps the project alive and motivates continued development

Your support, whether through stars, sponsorships, or contributions, is what drives this project forward. Thank you!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Documentation

Overview

Package fileprep re-exports fileparser types for backward compatibility.

Package fileprep provides preprocessing and validation for file formats supported by filesql (CSV, TSV, LTSV, JSON, JSONL, Parquet, and Excel, with gzip, bzip2, xz, zstd, zlib, snappy, s2, and lz4 support).

fileprep complements filesql by providing data preprocessing before loading into SQLite. It uses struct tags for validation ("validate" tag) and preprocessing ("prep" tag).

Basic Usage

type Record struct {
    Name  string `prep:"trim" validate:"required"`
    Email string `prep:"trim,lowercase" validate:"email"`
    Age   int    `validate:"gte=0,lte=150"`
}

file, _ := os.Open("data.csv")
defer file.Close()

var records []Record
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
reader, result, err := processor.Process(file, &records)
if err != nil {
    log.Fatal(err)
}

// reader can be passed directly to filesql
// result.Errors contains validation errors with row/column information
// result.RowCount and result.ValidRowCount provide processing statistics

Streaming Output with ProcessToWriter

For large datasets, ProcessToWriter writes preprocessed output directly to an io.Writer, avoiding the output buffer allocation:

var buf bytes.Buffer
result, err := processor.ProcessToWriter(file, &records, &buf)

Memory Usage

fileprep loads the entire file into memory for processing. This enables multi-pass operations (preprocessing then validation) but means memory usage scales with file size. For large files, use ProcessToWriter to reduce peak memory by avoiding the output buffer.

Format-specific limitations:

  • XLSX: Only the first sheet is processed
  • LTSV: Maximum line size is 10MB
  • JSON/JSONL: Data has a single "data" column containing raw JSON strings

Supported File Formats

fileprep supports the same formats as filesql:

  • CSV (.csv)
  • TSV (.tsv)
  • LTSV (.ltsv)
  • JSON (.json)
  • JSONL (.jsonl)
  • Parquet (.parquet)
  • Excel (.xlsx)

All formats support compression:

  • gzip (.gz)
  • bzip2 (.bz2)
  • xz (.xz)
  • zstd (.zst)
  • zlib (.z)
  • snappy (.snappy)
  • s2 (.s2)
  • lz4 (.lz4)

Prep Tags

The "prep" tag specifies preprocessing operations applied before validation:

  • trim: Remove leading and trailing whitespace
  • ltrim: Remove leading whitespace
  • rtrim: Remove trailing whitespace
  • lowercase: Convert to lowercase
  • uppercase: Convert to uppercase
  • default=value: Set default value if empty

Validate Tags

The "validate" tag specifies validation rules (compatible with go-playground/validator):

  • required: Field must not be empty
  • email: Must be a valid email address
  • url: Must be a valid URL
  • And many more...

See https://pkg.go.dev/github.com/nao1215/fileprep for the complete list of supported validators.

Example
package main

import (
	"fmt"
	"io"
	"strings"

	"github.com/nao1215/fileprep"
)

// User represents a user record with preprocessing and validation
type User struct {
	Name  string `prep:"trim" validate:"required"`
	Email string `prep:"trim,lowercase" validate:"required"`
	Age   string
}

func main() {
	// Sample CSV data with whitespace that needs trimming
	csvData := `name,email,age
  John Doe  ,JOHN@EXAMPLE.COM,30
Jane Smith,jane@example.com,25
`

	// Create a processor for CSV files
	processor := fileprep.NewProcessor(fileprep.FileTypeCSV)

	// Prepare a slice to hold the parsed records
	var users []User

	// Process the data
	reader, result, err := processor.Process(strings.NewReader(csvData), &users)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	// Print processing results
	fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)

	// Print parsed records (with preprocessing applied)
	for i, user := range users {
		fmt.Printf("User %d: Name=%q, Email=%q\n", i+1, user.Name, user.Email)
	}

	// The reader can be passed to filesql
	// For demonstration, we just read and show the preprocessed output
	output, err := io.ReadAll(reader)
	if err != nil {
		fmt.Printf("Error reading output: %v\n", err)
		return
	}
	fmt.Printf("Output for filesql:\n%s", output)

}
Output:
Processed 2 rows, 2 valid
User 1: Name="John Doe", Email="john@example.com"
User 2: Name="Jane Smith", Email="jane@example.com"
Output for filesql:
name,email,age
John Doe,john@example.com,30
Jane Smith,jane@example.com,25
Example (ComplexPrepAndValidation)

Example_complexPrepAndValidation demonstrates comprehensive preprocessing and validation using multiple prep tags and validation rules for realistic data processing scenarios.

package main

import (
	"fmt"
	"io"
	"strings"

	"github.com/nao1215/fileprep"
)

func main() {
	// Product represents a product record with comprehensive preprocessing and validation
	type Product struct {
		// Name: trim whitespace, require non-empty
		Name string `prep:"trim" validate:"required"`
		// SKU: trim, uppercase, require non-empty and alphanumeric
		SKU string `prep:"trim,uppercase" validate:"required,alphanumeric"`
		// Price: trim, set default if empty, validate as number (int or decimal)
		Price string `prep:"trim,default=0.00" validate:"number"`
		// Quantity: trim, coerce to int, validate as integer
		Quantity string `prep:"trim,coerce=int" validate:"numeric"`
		// Category: trim, lowercase, collapse multiple spaces
		Category string `prep:"trim,lowercase,collapse_space"`
		// Description: trim, strip HTML, truncate to 200 chars
		Description string `prep:"trim,strip_html,truncate=200"`
		// URL: trim, fix scheme (add https:// if missing)
		URL string `prep:"trim,fix_scheme=https"`
		// Tags: trim, replace semicolons with commas
		Tags string `prep:"trim,replace=;:,"`
	}

	// Sample CSV with various data quality issues
	csvData := `name,sku,price,quantity,category,description,url,tags
  Widget Pro  ,ABC123,19.99,100,  Electronics   ,<p>A <b>great</b> product!</p>,example.com/widget,tag1;tag2;tag3
 Gadget Plus ,DEF456,29.99,50,  home  goods  ,Simple description,https://example.com/gadget,electronics;sale
  ,ABC@#$,not_a_number,5,test,desc,url,tags
`

	processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
	var products []Product

	reader, result, err := processor.Process(strings.NewReader(csvData), &products)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	// Show processing summary
	fmt.Printf("Processing Summary:\n")
	fmt.Printf("  Total rows: %d\n", result.RowCount)
	fmt.Printf("  Valid rows: %d\n", result.ValidRowCount)
	fmt.Printf("  Invalid rows: %d\n", result.InvalidRowCount())

	// Show validation errors
	if result.HasErrors() {
		fmt.Printf("\nValidation Errors:\n")
		for _, ve := range result.ValidationErrors() {
			fmt.Printf("  Row %d, Field %q: %s\n", ve.Row, ve.Field, ve.Message)
		}
	}

	// Show preprocessed records
	fmt.Printf("\nPreprocessed Records:\n")
	for i, p := range products {
		if p.Name != "" { // Skip invalid rows for display
			fmt.Printf("  [%d] Name=%q, SKU=%q, Price=%q, Category=%q\n",
				i+1, p.Name, p.SKU, p.Price, p.Category)
		}
	}

	// Show that URL scheme was fixed
	fmt.Printf("\nURL Examples (scheme fixed):\n")
	for i, p := range products {
		if p.URL != "" && p.Name != "" {
			fmt.Printf("  [%d] %s\n", i+1, p.URL)
		}
	}

	// The reader can be used with filesql
	_, _ = io.ReadAll(reader) //nolint:errcheck // Example code - ignoring error

}
Output:
Processing Summary:
  Total rows: 3
  Valid rows: 2
  Invalid rows: 1

Validation Errors:
  Row 3, Field "Name": value is required
  Row 3, Field "SKU": value must contain only alphanumeric characters
  Row 3, Field "Price": value must be a valid number

Preprocessed Records:
  [1] Name="Widget Pro", SKU="ABC123", Price="19.99", Category="electronics"
  [2] Name="Gadget Plus", SKU="DEF456", Price="29.99", Category="home goods"

URL Examples (scheme fixed):
  [1] https://example.com/widget
  [2] https://example.com/gadget
Example (CrossFieldValidation)

Example_crossFieldValidation demonstrates validation rules that compare values between different fields, such as password confirmation matching.

package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileprep"
)

func main() {
	// UserForm represents a user registration form with cross-field validation
	type UserForm struct {
		Username        string `prep:"trim,lowercase" validate:"required"`
		Password        string `validate:"required"`
		ConfirmPassword string `validate:"required,eqfield=Password"`
	}

	csvData := `username,password,confirm_password
  Alice  ,secret123,secret123
Bob,password1,wrongpass
`

	processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
	var users []UserForm

	_, result, err := processor.Process(strings.NewReader(csvData), &users)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Valid: %d/%d\n", result.ValidRowCount, result.RowCount)

	if result.HasErrors() {
		fmt.Printf("Errors:\n")
		for _, ve := range result.ValidationErrors() {
			fmt.Printf("  Row %d, Field %q: %s\n", ve.Row, ve.Field, ve.Message)
		}
	}

}
Output:
Valid: 1/2
Errors:
  Row 2, Field "ConfirmPassword": value must equal field Password
Example (DetailedErrorReporting)

Example_detailedErrorReporting demonstrates precise error information including row number, column name, and specific validation failure reason.

package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileprep"
)

func main() {
	// Order represents an order with strict validation rules
	type Order struct {
		OrderID    string `name:"order_id" validate:"required,uuid4"`
		CustomerID string `name:"customer_id" validate:"required,numeric"`
		Email      string `validate:"required,email"`
		Amount     string `validate:"required,number,gt=0,lte=10000"`
		Currency   string `validate:"required,len=3,uppercase"`
		Country    string `validate:"required,alpha,len=2"`
		OrderDate  string `name:"order_date" validate:"required,datetime=2006-01-02"`
		ShipDate   string `name:"ship_date" validate:"datetime=2006-01-02,gtfield=OrderDate"`
		IPAddress  string `name:"ip_address" validate:"required,ip_addr"`
		PromoCode  string `name:"promo_code" validate:"alphanumeric"`
		Quantity   string `validate:"required,numeric,gte=1,lte=100"`
		UnitPrice  string `name:"unit_price" validate:"required,number,gt=0"`
		TotalCheck string `name:"total_check" validate:"required,eqfield=Amount"`
	}

	// CSV with multiple validation errors
	csvData := `order_id,customer_id,email,amount,currency,country,order_date,ship_date,ip_address,promo_code,quantity,unit_price,total_check
550e8400-e29b-41d4-a716-446655440000,12345,alice@example.com,500.00,USD,US,2024-01-15,2024-01-20,192.168.1.1,SAVE10,2,250.00,500.00
invalid-uuid,abc,not-an-email,-100,US,USA,2024/01/15,2024-01-10,999.999.999.999,PROMO-CODE-TOO-LONG!!,0,0,999
550e8400-e29b-41d4-a716-446655440001,,bob@test,50000,EURO,J1,not-a-date,,2001:db8::1,VALID20,101,-50,50000
123e4567-e89b-42d3-a456-426614174000,99999,charlie@company.com,1500.50,JPY,JP,2024-02-28,2024-02-25,10.0.0.1,VIP,5,300.10,1500.50
`

	processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
	var orders []Order

	_, result, err := processor.Process(strings.NewReader(csvData), &orders)
	if err != nil {
		fmt.Printf("Fatal error: %v\n", err)
		return
	}

	fmt.Printf("=== Validation Report ===\n")
	fmt.Printf("Total rows:     %d\n", result.RowCount)
	fmt.Printf("Valid rows:     %d\n", result.ValidRowCount)
	fmt.Printf("Invalid rows:   %d\n", result.RowCount-result.ValidRowCount)
	fmt.Printf("Total errors:   %d\n\n", len(result.ValidationErrors()))

	if result.HasErrors() {
		fmt.Println("=== Error Details ===")
		for _, e := range result.ValidationErrors() {
			fmt.Printf("Row %d, Column '%s': %s\n", e.Row, e.Column, e.Message)
		}
	}

}
Output:
=== Validation Report ===
Total rows:     4
Valid rows:     1
Invalid rows:   3
Total errors:   23

=== Error Details ===
Row 2, Column 'order_id': value must be a valid UUID version 4
Row 2, Column 'customer_id': value must be numeric
Row 2, Column 'email': value must be a valid email address
Row 2, Column 'amount': value must be greater than 0
Row 2, Column 'currency': value must have exactly 3 characters
Row 2, Column 'country': value must have exactly 2 characters
Row 2, Column 'order_date': value must be a valid datetime in format: 2006-01-02
Row 2, Column 'ip_address': value must be a valid IP address
Row 2, Column 'promo_code': value must contain only alphanumeric characters
Row 2, Column 'quantity': value must be greater than or equal to 1
Row 2, Column 'unit_price': value must be greater than 0
Row 2, Column 'ship_date': value must be greater than field OrderDate
Row 2, Column 'total_check': value must equal field Amount
Row 3, Column 'customer_id': value is required
Row 3, Column 'email': value must be a valid email address
Row 3, Column 'amount': value must be less than or equal to 10000
Row 3, Column 'currency': value must have exactly 3 characters
Row 3, Column 'country': value must contain only alphabetic characters
Row 3, Column 'order_date': value must be a valid datetime in format: 2006-01-02
Row 3, Column 'quantity': value must be less than or equal to 100
Row 3, Column 'unit_price': value must be greater than 0
Row 3, Column 'ship_date': value must be greater than field OrderDate
Row 4, Column 'ship_date': value must be greater than field OrderDate
Example (DetectFileType)

Example_detectFileType demonstrates automatic file type detection from file paths.

package main

import (
	"fmt"

	"github.com/nao1215/fileprep"
)

func main() {
	files := []string{
		"data.csv",
		"data.csv.gz",
		"report.xlsx",
		"logs.tsv.bz2",
		"events.parquet",
		"access.ltsv.zst",
		"config.json",
		"config.json.gz",
		"events.jsonl",
		"events.jsonl.zst",
	}

	for _, f := range files {
		ft := fileprep.DetectFileType(f)
		fmt.Printf("%s -> %s (compressed: %v)\n", f, ft, fileprep.IsCompressed(ft))
	}

}
Output:
data.csv -> CSV (compressed: false)
data.csv.gz -> CSV (gzip) (compressed: true)
report.xlsx -> XLSX (compressed: false)
logs.tsv.bz2 -> TSV (bzip2) (compressed: true)
events.parquet -> Parquet (compressed: false)
access.ltsv.zst -> LTSV (zstd) (compressed: true)
config.json -> JSON (compressed: false)
config.json.gz -> JSON (gzip) (compressed: true)
events.jsonl -> JSONL (compressed: false)
events.jsonl.zst -> JSONL (zstd) (compressed: true)
Example (EmployeePreprocessing)

Example_employeePreprocessing demonstrates the full power of fileprep: combining multiple preprocessors and validators to clean and validate real-world messy data.

package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileprep"
)

func main() {
	// Employee represents employee data with comprehensive preprocessing and validation
	type Employee struct {
		// ID: pad to 6 digits, must be numeric
		EmployeeID string `name:"id" prep:"trim,pad_left=6:0" validate:"required,numeric,len=6"`
		// Name: clean whitespace, required alphabetic with spaces
		FullName string `name:"name" prep:"trim,collapse_space" validate:"required,alphaspace"`
		// Email: normalize to lowercase, validate format
		Email string `prep:"trim,lowercase" validate:"required,email"`
		// Department: normalize case, must be one of allowed values
		Department string `prep:"trim,uppercase" validate:"required,oneof=ENGINEERING SALES MARKETING HR"`
		// Salary: keep only digits, validate range
		Salary string `prep:"trim,keep_digits" validate:"required,numeric,gte=30000,lte=500000"`
		// Phone: extract digits, validate E.164 format after adding country code
		Phone string `prep:"trim,keep_digits,prefix=+1" validate:"e164"`
		// Start date: validate datetime format
		StartDate string `name:"start_date" prep:"trim" validate:"required,datetime=2006-01-02"`
		// Manager ID: required only if department is not HR
		ManagerID string `name:"manager_id" prep:"trim,pad_left=6:0" validate:"required_unless=Department HR"`
		// Website: fix missing scheme, validate URL
		Website string `prep:"trim,lowercase,fix_scheme=https" validate:"url"`
	}

	// Messy real-world CSV data
	csvData := `id,name,email,department,salary,phone,start_date,manager_id,website
42,  John   Doe  ,JOHN.DOE@COMPANY.COM,engineering,75000,5551234567,2023-01-15,1,company.com/john
7,Jane Smith,jane@COMPANY.com,  Sales  ,120000,5559876543,2022-06-01,2,WWW.LINKEDIN.COM/in/jane
123,Bob Wilson,bob.wilson@company.com,HR,45000,5551112222,2024-03-20,,
99,Alice Brown,alice@company.com,Marketing,88500,5554443333,2023-09-10,3,https://alice.dev
`

	processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
	var employees []Employee

	_, result, err := processor.Process(strings.NewReader(csvData), &employees)
	if err != nil {
		fmt.Printf("Fatal error: %v\n", err)
		return
	}

	fmt.Printf("=== Processing Result ===\n")
	fmt.Printf("Total rows: %d, Valid rows: %d\n\n", result.RowCount, result.ValidRowCount)

	for i, emp := range employees {
		fmt.Printf("Employee %d:\n", i+1)
		fmt.Printf("  ID:         %s\n", emp.EmployeeID)
		fmt.Printf("  Name:       %s\n", emp.FullName)
		fmt.Printf("  Email:      %s\n", emp.Email)
		fmt.Printf("  Department: %s\n", emp.Department)
		fmt.Printf("  Salary:     %s\n", emp.Salary)
		fmt.Printf("  Phone:      %s\n", emp.Phone)
		fmt.Printf("  Start Date: %s\n", emp.StartDate)
		fmt.Printf("  Manager ID: %s\n", emp.ManagerID)
		if emp.Website != "" {
			fmt.Printf("  Website:    %s\n\n", emp.Website)
		} else {
			fmt.Printf("  Website:    (none)\n\n")
		}
	}

}
Output:
=== Processing Result ===
Total rows: 4, Valid rows: 3

Employee 1:
  ID:         000042
  Name:       John Doe
  Email:      john.doe@company.com
  Department: ENGINEERING
  Salary:     75000
  Phone:      +15551234567
  Start Date: 2023-01-15
  Manager ID: 000001
  Website:    https://company.com/john

Employee 2:
  ID:         000007
  Name:       Jane Smith
  Email:      jane@company.com
  Department: SALES
  Salary:     120000
  Phone:      +15559876543
  Start Date: 2022-06-01
  Manager ID: 000002
  Website:    https://www.linkedin.com/in/jane

Employee 3:
  ID:         000123
  Name:       Bob Wilson
  Email:      bob.wilson@company.com
  Department: HR
  Salary:     45000
  Phone:      +15551112222
  Start Date: 2024-03-20
  Manager ID: 000000
  Website:    (none)

Employee 4:
  ID:         000099
  Name:       Alice Brown
  Email:      alice@company.com
  Department: MARKETING
  Salary:     88500
  Phone:      +15554443333
  Start Date: 2023-09-10
  Manager ID: 000003
  Website:    https://alice.dev
Example (JsonProcessing)

Example_jsonProcessing demonstrates processing JSON data. JSON arrays are parsed into rows, each containing the raw JSON element in a single "data" column.

package main

import (
	"fmt"
	"io"
	"strings"

	"github.com/nao1215/fileprep"
)

func main() {
	type Record struct {
		Data string `name:"data" validate:"required"`
	}

	jsonData := `[{"name":"Alice","age":30},{"name":"Bob","age":25}]`

	processor := fileprep.NewProcessor(fileprep.FileTypeJSON)
	var records []Record

	reader, result, err := processor.Process(strings.NewReader(jsonData), &records)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)
	for i, r := range records {
		fmt.Printf("Row %d: %s\n", i+1, r.Data)
	}

	output, _ := io.ReadAll(reader) //nolint:errcheck // Example code
	fmt.Printf("JSONL output:\n%s", output)

}
Output:
Processed 2 rows, 2 valid
Row 1: {"name":"Alice","age":30}
Row 2: {"name":"Bob","age":25}
JSONL output:
{"name":"Alice","age":30}
{"name":"Bob","age":25}
Example (ProcessToWriter)

Example_processToWriter demonstrates ProcessToWriter, which writes preprocessed output directly to an io.Writer instead of buffering the entire output in memory.

package main

import (
	"bytes"
	"fmt"
	"strings"

	"github.com/nao1215/fileprep"
)

func main() {
	type Record struct {
		Name  string `prep:"trim" validate:"required"`
		Email string `prep:"trim,lowercase"`
	}

	csvData := "name,email\n  Alice  ,ALICE@EXAMPLE.COM\n  Bob  ,BOB@EXAMPLE.COM\n"
	processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
	var records []Record

	var buf bytes.Buffer
	result, err := processor.ProcessToWriter(strings.NewReader(csvData), &records, &buf)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)
	fmt.Printf("Output:\n%s", buf.String())
}
Output:
Processed 2 rows, 2 valid
Output:
name,email
Alice,alice@example.com
Bob,bob@example.com
Example (Validation)
package main

import (
	"fmt"
	"strings"

	"github.com/nao1215/fileprep"
)

// User represents a user record with preprocessing and validation
type User struct {
	Name  string `prep:"trim" validate:"required"`
	Email string `prep:"trim,lowercase" validate:"required"`
	Age   string
}

func main() {
	// CSV data with validation error (empty required name)
	csvData := `name,email,age
,john@example.com,30
Jane,jane@example.com,25
`

	processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
	var users []User

	_, result, err := processor.Process(strings.NewReader(csvData), &users)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	// Check for validation errors
	if result.HasErrors() {
		fmt.Printf("Found %d validation error(s):\n", len(result.ValidationErrors()))
		for _, e := range result.ValidationErrors() {
			fmt.Printf("  Row %d, Column %q: %s\n", e.Row, e.Column, e.Message)
		}
	}

	fmt.Printf("Valid rows: %d/%d\n", result.ValidRowCount, result.RowCount)

}
Output:
Found 1 validation error(s):
  Row 1, Column "name": value is required
Valid rows: 1/2


Constants

const (
	FileTypeCSV         = fileparser.CSV
	FileTypeTSV         = fileparser.TSV
	FileTypeLTSV        = fileparser.LTSV
	FileTypeParquet     = fileparser.Parquet
	FileTypeXLSX        = fileparser.XLSX
	FileTypeCSVGZ       = fileparser.CSVGZ
	FileTypeCSVBZ2      = fileparser.CSVBZ2
	FileTypeCSVXZ       = fileparser.CSVXZ
	FileTypeCSVZSTD     = fileparser.CSVZSTD
	FileTypeTSVGZ       = fileparser.TSVGZ
	FileTypeTSVBZ2      = fileparser.TSVBZ2
	FileTypeTSVXZ       = fileparser.TSVXZ
	FileTypeTSVZSTD     = fileparser.TSVZSTD
	FileTypeLTSVGZ      = fileparser.LTSVGZ
	FileTypeLTSVBZ2     = fileparser.LTSVBZ2
	FileTypeLTSVXZ      = fileparser.LTSVXZ
	FileTypeLTSVZSTD    = fileparser.LTSVZSTD
	FileTypeParquetGZ   = fileparser.ParquetGZ
	FileTypeParquetBZ2  = fileparser.ParquetBZ2
	FileTypeParquetXZ   = fileparser.ParquetXZ
	FileTypeParquetZSTD = fileparser.ParquetZSTD
	FileTypeXLSXGZ      = fileparser.XLSXGZ
	FileTypeXLSXBZ2     = fileparser.XLSXBZ2
	FileTypeXLSXXZ      = fileparser.XLSXXZ
	FileTypeXLSXZSTD    = fileparser.XLSXZSTD

	// zlib compression formats (v0.2.0+)
	FileTypeCSVZLIB     = fileparser.CSVZLIB
	FileTypeTSVZLIB     = fileparser.TSVZLIB
	FileTypeLTSVZLIB    = fileparser.LTSVZLIB
	FileTypeParquetZLIB = fileparser.ParquetZLIB
	FileTypeXLSXZLIB    = fileparser.XLSXZLIB

	// snappy compression formats (v0.2.0+)
	FileTypeCSVSNAPPY     = fileparser.CSVSNAPPY
	FileTypeTSVSNAPPY     = fileparser.TSVSNAPPY
	FileTypeLTSVSNAPPY    = fileparser.LTSVSNAPPY
	FileTypeParquetSNAPPY = fileparser.ParquetSNAPPY
	FileTypeXLSXSNAPPY    = fileparser.XLSXSNAPPY

	// s2 compression formats (v0.2.0+)
	FileTypeCSVS2     = fileparser.CSVS2
	FileTypeTSVS2     = fileparser.TSVS2
	FileTypeLTSVS2    = fileparser.LTSVS2
	FileTypeParquetS2 = fileparser.ParquetS2
	FileTypeXLSXS2    = fileparser.XLSXS2

	// lz4 compression formats (v0.2.0+)
	FileTypeCSVLZ4     = fileparser.CSVLZ4
	FileTypeTSVLZ4     = fileparser.TSVLZ4
	FileTypeLTSVLZ4    = fileparser.LTSVLZ4
	FileTypeParquetLZ4 = fileparser.ParquetLZ4
	FileTypeXLSXLZ4    = fileparser.XLSXLZ4

	// JSON/JSONL file types (v0.5.0+)
	FileTypeJSON  = fileparser.JSON
	FileTypeJSONL = fileparser.JSONL

	// JSON compression formats (v0.5.0+)
	FileTypeJSONGZ     = fileparser.JSONGZ
	FileTypeJSONBZ2    = fileparser.JSONBZ2
	FileTypeJSONXZ     = fileparser.JSONXZ
	FileTypeJSONZSTD   = fileparser.JSONZSTD
	FileTypeJSONZLIB   = fileparser.JSONZLIB
	FileTypeJSONSNAPPY = fileparser.JSONSNAPPY
	FileTypeJSONS2     = fileparser.JSONS2
	FileTypeJSONLZ4    = fileparser.JSONLZ4

	// JSONL compression formats (v0.5.0+)
	FileTypeJSONLGZ     = fileparser.JSONLGZ
	FileTypeJSONLBZ2    = fileparser.JSONLBZ2
	FileTypeJSONLXZ     = fileparser.JSONLXZ
	FileTypeJSONLZSTD   = fileparser.JSONLZSTD
	FileTypeJSONLZLIB   = fileparser.JSONLZLIB
	FileTypeJSONLSNAPPY = fileparser.JSONLSNAPPY
	FileTypeJSONLS2     = fileparser.JSONLS2
	FileTypeJSONLLZ4    = fileparser.JSONLLZ4

	FileTypeUnsupported = fileparser.Unsupported
)

File type constants re-exported from fileparser for backward compatibility.

Variables

var (
	// ErrStructSlicePointer is returned when the value is not a pointer to a struct slice
	ErrStructSlicePointer = errors.New("value must be a pointer to a struct slice")
	// ErrUnsupportedFileType is returned when the file type is not supported
	ErrUnsupportedFileType = errors.New("unsupported file type")
	// ErrEmptyFile is returned when the file is empty
	ErrEmptyFile = errors.New("file is empty")
	// ErrInvalidTagFormat is returned when the tag format is invalid
	ErrInvalidTagFormat = errors.New("invalid tag format")
	// ErrInvalidJSONAfterPrep is returned when preprocessing destroys JSON structure
	// in the "data" column of a JSON/JSONL file. This is a hard error because
	// invalid JSON lines in JSONL output cause downstream parsers to fail.
	ErrInvalidJSONAfterPrep = errors.New("preprocessing produced invalid JSON")
	// ErrEmptyJSONOutput is returned when all JSON/JSONL rows are empty or invalid
	// after preprocessing, resulting in no output lines. An empty JSONL output is
	// unparseable by downstream consumers.
	ErrEmptyJSONOutput = errors.New("JSON/JSONL output has no valid rows after preprocessing")
	// ErrNilWriter is returned when a nil io.Writer is passed to ProcessToWriter.
	ErrNilWriter = errors.New("writer must not be nil")
	// ErrNilReader is returned when a nil io.Reader is passed to Process or ProcessToWriter.
	ErrNilReader = errors.New("reader must not be nil")
)

Sentinel errors for fileprep

Functions

func IsCompressed added in v0.3.0

func IsCompressed(ft FileType) bool

IsCompressed returns true if the file type is compressed. This is a convenience wrapper around fileparser.IsCompressed.

Types

type CrossFieldValidator

type CrossFieldValidator interface {
	// Validate checks if the source value is valid compared to the target value
	// Returns empty string if validation passes, error message otherwise
	Validate(srcValue, targetValue string) string
	// Name returns the name of the validator for error reporting
	Name() string
	// TargetField returns the name of the field to compare against
	TargetField() string
}

CrossFieldValidator defines the interface for validators that compare values across fields
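A custom cross-field validator only needs these three methods. The sketch below is a minimal, self-contained illustration: it re-declares the interface exactly as documented and implements a hypothetical `eqfold` comparator (case-insensitive field equality); how a custom validator is registered with a Processor is outside the scope of this excerpt.

```go
package main

import (
	"fmt"
	"strings"
)

// CrossFieldValidator mirrors the interface documented above.
type CrossFieldValidator interface {
	Validate(srcValue, targetValue string) string
	Name() string
	TargetField() string
}

// eqFold is a hypothetical validator requiring the source value to
// equal the target field's value, ignoring case.
type eqFold struct {
	target string // name of the field to compare against
}

func (v eqFold) Validate(srcValue, targetValue string) string {
	if strings.EqualFold(srcValue, targetValue) {
		return "" // empty string signals that validation passed
	}
	return fmt.Sprintf("value must equal field %s (case-insensitive)", v.target)
}

func (v eqFold) Name() string        { return "eqfold" }
func (v eqFold) TargetField() string { return v.target }

func main() {
	var cv CrossFieldValidator = eqFold{target: "Password"}
	fmt.Printf("%q\n", cv.Validate("Secret123", "secret123")) // passes: ""
	fmt.Printf("%q\n", cv.Validate("Secret123", "different")) // fails with a message
}
```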

type FileType

type FileType = fileparser.FileType

FileType is an alias for fileparser.FileType for backward compatibility.

func DetectFileType

func DetectFileType(path string) FileType

DetectFileType detects file type from extension. This is a convenience wrapper around fileparser.DetectFileType.

type Option added in v0.5.0

type Option func(*Processor)

Option configures a Processor.

func WithStrictTagParsing added in v0.5.0

func WithStrictTagParsing() Option

WithStrictTagParsing enables strict tag parsing mode. When enabled, invalid tag arguments (e.g., "eq=abc" where a number is expected) return an error during Process() instead of being silently ignored.

Example:

processor := fileprep.NewProcessor(fileparser.CSV, fileprep.WithStrictTagParsing())

func WithValidRowsOnly added in v0.5.0

func WithValidRowsOnly() Option

WithValidRowsOnly configures the Processor to include only valid rows in the output io.Reader and struct slice. Rows that fail validation are excluded from the output but still counted in ProcessResult.RowCount and reported in ProcessResult.Errors.

Example:

processor := fileprep.NewProcessor(fileparser.CSV, fileprep.WithValidRowsOnly())
reader, result, err := processor.Process(input, &records)
// reader contains only rows that passed all validations
// result.RowCount includes all rows, result.ValidRowCount has valid count

type PrepError

type PrepError struct {
	Row     int    // 1-based row number
	Column  string // Column name
	Field   string // Struct field name
	Tag     string // The prep tag that failed
	Message string // Human-readable error message
}

PrepError represents a preprocessing error.

Example:

for _, pe := range result.PrepErrors() {
    fmt.Printf("Row %d, Column %q: %s (tag=%q)\n",
        pe.Row, pe.Column, pe.Message, pe.Tag)
}

func (*PrepError) Error

func (e *PrepError) Error() string

Error implements the error interface

type Preprocessor

type Preprocessor interface {
	// Process applies preprocessing to the value and returns the result
	Process(value string) string
	// Name returns the name of the preprocessor for error reporting
	Name() string
}

Preprocessor defines the interface for preprocessing values
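Implementing this interface is straightforward. The sketch below re-declares the interface as documented and implements a hypothetical preprocessor that drops every non-digit rune (similar in spirit to the built-in `keep_digits` prep rule); wiring it into a Processor is not shown here.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Preprocessor mirrors the interface documented above.
type Preprocessor interface {
	Process(value string) string
	Name() string
}

// keepDigits is a hypothetical preprocessor that keeps only digit runes.
type keepDigits struct{}

func (keepDigits) Process(value string) string {
	var b strings.Builder
	for _, r := range value {
		if unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func (keepDigits) Name() string { return "keep_digits" }

func main() {
	var p Preprocessor = keepDigits{}
	fmt.Println(p.Process("(555) 123-4567")) // 5551234567
}
```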

type ProcessResult

type ProcessResult struct {
	// Errors contains all validation and preprocessing errors
	Errors []error
	// RowCount is the total number of data rows processed (excluding header)
	RowCount int
	// ValidRowCount is the number of rows that passed all validations
	ValidRowCount int
	// Columns contains the column names from the header
	Columns []string
	// OriginalFormat is the file type that was processed
	OriginalFormat fileparser.FileType
	// contains filtered or unexported fields
}

ProcessResult contains the results of processing a file.

Example:

reader, result, err := processor.Process(input, &records)
if result.HasErrors() {
    for _, ve := range result.ValidationErrors() {
        fmt.Printf("Row %d: %s\n", ve.Row, ve.Message)
    }
}
fmt.Printf("Valid: %d/%d rows\n", result.ValidRowCount, result.RowCount)

func (*ProcessResult) HasErrors

func (r *ProcessResult) HasErrors() bool

HasErrors returns true if there are any errors

func (*ProcessResult) InvalidRowCount

func (r *ProcessResult) InvalidRowCount() int

InvalidRowCount returns the number of rows that failed validation

func (*ProcessResult) PrepErrors

func (r *ProcessResult) PrepErrors() []*PrepError

PrepErrors returns only preprocessing errors

func (*ProcessResult) ValidationErrors

func (r *ProcessResult) ValidationErrors() []*ValidationError

ValidationErrors returns only validation errors

type Processor

type Processor struct {
	// contains filtered or unexported fields
}

Processor handles preprocessing and validation of file data

func NewProcessor

func NewProcessor(fileType fileparser.FileType, opts ...Option) *Processor

NewProcessor creates a new Processor for the specified file type. Options can be provided to configure behavior such as strict tag parsing and output filtering.

Example:

processor := fileprep.NewProcessor(fileparser.CSV)
var records []MyRecord
reader, result, err := processor.Process(input, &records)

// With options:
processor := fileprep.NewProcessor(fileparser.CSV,
    fileprep.WithStrictTagParsing(),
    fileprep.WithValidRowsOnly(),
)

func (*Processor) Process

func (p *Processor) Process(input io.Reader, structSlicePointer any) (io.Reader, *ProcessResult, error)

Process reads from the input reader, applies preprocessing and validation, populates the struct slice, and returns an io.Reader with preprocessed data.

The returned io.Reader preserves the original file format:

  • CSV input → CSV output
  • TSV input → TSV output (tab-delimited)
  • LTSV input → LTSV output (label:value format)
  • JSON input → JSONL output (one JSON value per line)
  • JSONL input → JSONL output (one JSON value per line)
  • XLSX input → CSV output (tabular data)
  • Parquet input → CSV output (tabular data)

The returned io.Reader can be passed directly to filesql.AddReader:

reader, result, err := processor.Process(input, &records)
db.AddReader(reader, "table", parser.CSV)

For format information, use ProcessResult.OriginalFormat or cast to Stream:

stream := reader.(fileprep.Stream)
fmt.Println(stream.Format()) // CSV, TSV, etc.

Example:

type User struct {
    Name  string `prep:"trim" validate:"required"`
    Email string `prep:"trim,lowercase" validate:"email"`
    Age   string `validate:"numeric,min=0,max=150"`
}

csvData := "name,email,age\n  John  ,JOHN@EXAMPLE.COM,30\n"
processor := fileprep.NewProcessor(parser.CSV)
var users []User
reader, result, err := processor.Process(strings.NewReader(csvData), &users)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)

func (*Processor) ProcessToWriter added in v0.6.0

func (p *Processor) ProcessToWriter(input io.Reader, structSlicePointer any, w io.Writer) (*ProcessResult, error)

ProcessToWriter works like Process but writes the preprocessed output directly to w instead of buffering it in memory. This is useful for large datasets where holding the full output buffer is undesirable.

Example:

var buf bytes.Buffer
result, err := processor.ProcessToWriter(input, &records, &buf)

type Stream

type Stream interface {
	io.Reader
	// Format returns the actual output format of the stream data.
	// For CSV/TSV/LTSV input, this matches the input format.
	// For JSON/JSONL input, this returns JSONL since the output is JSONL-formatted.
	// For XLSX/Parquet input, this returns CSV since the output is CSV-formatted.
	Format() fileparser.FileType
	// OriginalFormat returns the original input file type including compression
	OriginalFormat() fileparser.FileType
}

Stream represents a preprocessed data stream with format information. It implements io.Reader and provides metadata about the file format.

type ValidationError

type ValidationError struct {
	Row     int    // 1-based row number (excluding header)
	Column  string // Column name
	Field   string // Struct field name
	Value   string // The value that failed validation
	Tag     string // The validation tag that failed
	Message string // Human-readable error message
}

ValidationError represents a validation error with row and column information.

Example:

for _, ve := range result.ValidationErrors() {
    fmt.Printf("Row %d, Column %q: %s (value=%q)\n",
        ve.Row, ve.Column, ve.Message, ve.Value)
}

func (*ValidationError) Error

func (e *ValidationError) Error() string

Error implements the error interface

type Validator

type Validator interface {
	// Validate checks if the value is valid and returns an error message if not
	// Returns empty string if validation passes
	Validate(value string) string
	// Name returns the name of the validator for error reporting
	Name() string
}

Validator defines the interface for validating values
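A single-value validator only needs these two methods. The sketch below re-declares the interface as documented and implements a hypothetical currency-code check (exactly three uppercase ASCII letters, in the style of ISO 4217 codes); how the validator is attached to a field is outside the scope of this excerpt.

```go
package main

import "fmt"

// Validator mirrors the interface documented above.
type Validator interface {
	Validate(value string) string
	Name() string
}

// currencyCode is a hypothetical validator for three-letter,
// uppercase currency codes such as USD or JPY.
type currencyCode struct{}

func (currencyCode) Validate(value string) string {
	if len(value) != 3 {
		return "value must have exactly 3 characters"
	}
	for _, r := range value {
		if r < 'A' || r > 'Z' {
			return "value must contain only uppercase letters"
		}
	}
	return "" // empty string signals that the value is valid
}

func (currencyCode) Name() string { return "currency_code" }

func main() {
	var v Validator = currencyCode{}
	fmt.Printf("%q\n", v.Validate("USD"))  // passes: ""
	fmt.Printf("%q\n", v.Validate("EURO")) // fails: wrong length
}
```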
