Dataply

Warning

Dataply is currently in Alpha version. It is experimental and not yet suitable for production use. Internal data structures and file formats are subject to change at any time.

Dataply is a lightweight, high-performance Record Store designed for Node.js. It focuses on storing arbitrary data and providing an auto-generated Primary Key (PK) for ultra-fast retrieval, while supporting core enterprise features like MVCC, WAL, and atomic transactions.

Key Features

Dataply provides essential features for high-performance data management:

  • Identity-Based Access: Manage records through auto-generated Primary Keys for ultra-fast retrieval.
  • High-Performance B+Tree: Asynchronous B+Tree structure optimizes both lookups and insertions.
  • MVCC & Isolation: Snapshot isolation via Multi-Version Concurrency Control (MVCC) enables non-blocking reads.
  • Reliability (WAL): Write-Ahead Logging (WAL) ensures data integrity and automatic crash recovery.
  • Atomic Transactions: Full support for ACID-compliant Commit and Rollback operations.
  • Efficient Storage: Fixed-size page management with LRU-based page caching and Free List space optimization (Bitmap-based management is deprecated).
  • Type Safety: Comprehensive TypeScript definitions for a seamless developer experience.

Installation

Prerequisites

  • Node.js: v18.0.0 or higher

Install via npm:

npm install dataply

Quick Start

import { Dataply } from 'dataply'

// Open Dataply instance
const dataply = new Dataply('./data.db', {
  wal: './data.db.wal'
})

async function main() {
  // Initialization (Required)
  await dataply.init()

  // Insert data
  const pk = await dataply.insert('Hello, Dataply!')
  console.log(`Inserted row with PK: ${pk}`)

  // Update data
  await dataply.update(pk, 'Updated Data')
  console.log(`Updated row with PK: ${pk}`)

  // Select data
  const data = await dataply.select(pk)
  console.log(`Read data: ${data}`)

  // Delete data
  await dataply.delete(pk)
  console.log(`Deleted row with PK: ${pk}`)

  // Close dataply
  await dataply.close()
}

main()

Integration Example (Express.js)

Dataply's auto-generated Primary Key (PK) is perfect for use as a unique identifier in web applications.

import express from 'express'
import { Dataply } from 'dataply'

const app = express()
const db = new Dataply('./web.db')

app.use(express.json())

app.post('/posts', async (req, res) => {
  // Dataply returns a numeric PK immediately after insertion
  const pk = await db.insert(JSON.stringify(req.body))
  res.status(201).json({ id: pk, message: 'Post created!' })
})

app.get('/posts/:id', async (req, res) => {
  const data = await db.select(Number(req.params.id))
  if (!data) return res.status(404).send('Not Found')
  res.json(JSON.parse(data.toString()))
})

// Initialize DB before starting server
db.init().then(() => {
  app.listen(3000, () => console.log('Server running on http://localhost:3000'))
})

Tip

For more advanced usage like search and optimization, check the Technical Structure Guide.

Transaction Management

Explicit Transactions

You can group multiple operations into a single unit of work to ensure atomicity.

const tx = dataply.createTransaction()

try {
  await dataply.insert('Data 1', tx)
  await dataply.update(pk, 'Updated Data', tx)
  
  await tx.commit() // Persist changes to disk and clear WAL on success
} catch (error) {
  await tx.rollback() // Revert all changes on failure (Undo)
}

Global Transactions

You can perform atomic operations across multiple Dataply instances using the GlobalTransaction class. This uses a 2-Phase Commit (2PC) mechanism to ensure that either all instances commit successfully or all are rolled back.

import { Dataply, GlobalTransaction } from 'dataply'

const db1 = new Dataply('./db1.db', { wal: './db1.wal' })
const db2 = new Dataply('./db2.db', { wal: './db2.wal' })

await db1.init()
await db2.init()

const tx1 = db1.createTransaction()
const tx2 = db2.createTransaction()

const globalTx = new GlobalTransaction()
globalTx.add(tx1)
globalTx.add(tx2)

try {
  await db1.insert('Data for DB1', tx1)
  await db2.insert('Data for DB2', tx2)
  
  // Commit transactions across all instances
  // Note: This is a best-effort atomic commit.
  await globalTx.commit() 
} catch (error) {
  await globalTx.rollback()
}

Auto-Transaction

If you omit the tx argument, Dataply creates an internal transaction for each operation.

  • Safety: Atomicity is guaranteed even for single operations.
  • Optimization Tip: For bulk operations, use an explicit transaction to significantly reduce I/O overhead and increase performance.

API Reference

Dataply Class

constructor(file: string, options?: DataplyOptions): Dataply

Opens a database file. If the file does not exist, it creates and initializes a new one.

  • options.pageSize: Size of a page (Default: 8192, must be a power of 2)
  • options.pageCacheCapacity: Maximum number of pages to keep in memory (Default: 10000)
  • options.wal: Path to the WAL file. If omitted, WAL is disabled.
  • options.pagePreallocationCount: The number of pages to preallocate when new pages are required (Default: 1000).
  • options.walCheckpointThreshold: The total number of pages written to the WAL before automatically clearing it (Default: 1000).

async init(): Promise<void>

Initializes the instance. Must be called before performing any CRUD operations.

async insert(data: string | Uint8Array, tx?: Transaction): Promise<number>

Inserts new data. Returns the Primary Key (PK) of the created row.

async insertAsOverflow(data: string | Uint8Array, tx?: Transaction): Promise<number>

Forcibly inserts data into an overflow page, even if it could fit within a standard data page. Returns the Primary Key (PK).

async insertBatch(dataList: (string | Uint8Array)[], tx?: Transaction): Promise<number[]>

Inserts multiple rows at once. This is significantly faster than multiple individual inserts as it minimizes internal transaction overhead.

async select(pk: number, asRaw?: boolean, tx?: Transaction): Promise<string | Uint8Array | null>

Retrieves data by PK. Returns a Uint8Array if asRaw is true, otherwise a string. Returns null if no row exists for the given PK.

async selectMany(pks: number[] | Float64Array, asRaw?: boolean, tx?: Transaction): Promise<(string | Uint8Array | null)[]>

Retrieves multiple data records in batch based on the provided PKs. This is more efficient than individual select calls for multiple lookups.

async update(pk: number, data: string | Uint8Array, tx?: Transaction): Promise<void>

Updates existing data.

async delete(pk: number, tx?: Transaction): Promise<void>

Marks data as deleted.

async getMetadata(tx?: Transaction): Promise<DataplyMetadata>

Returns the current metadata of the database, including pageSize, pageCount, and rowCount.

createTransaction(): Transaction

Creates a new transaction instance.

async close(): Promise<void>

Closes the file handles and shuts down safely.

Transaction Class

async commit(): Promise<void>

Permanently reflects all changes made during the transaction to disk and releases locks.

async rollback(): Promise<void>

Cancels all changes made during the transaction and restores the original state.

GlobalTransaction Class

add(tx: Transaction): void

Registers a transaction from a Dataply instance to the global unit.

async commit(): Promise<void>

Executes a coordinated commit across all registered transactions. Note that without a prepare phase, this is a best-effort atomic commit.

async rollback(): Promise<void>

Rolls back all registered transactions simultaneously.

Extending Dataply

If you want to extend Dataply's functionality, use the DataplyAPI class. Unlike the standard Dataply class, DataplyAPI provides direct access to internal components like PageFileSystem or RowTableEngine, offering much more flexibility for custom implementations.

For a detailed guide and examples on how to extend Dataply using Hooks, see Extending Dataply Guide.

Using DataplyAPI

import { DataplyAPI } from 'dataply'

class CustomDataply extends DataplyAPI {
  // Leverage internal protected members (pfs, rowTableEngine, etc.)
  async getInternalStats() {
    return {
      pageSize: this.options.pageSize,
      // Custom internal logic here
    }
  }
}

const custom = new CustomDataply('./data.db')
await custom.init()

const stats = await custom.getInternalStats()
console.log(stats)

Internal Architecture

Dataply implements the core principles of high-performance storage systems in a lightweight and efficient manner.

For a detailed visual guide on Dataply's internal architecture, class diagrams, and transaction flow, please refer to the Architecture Guide.

1. Architectural Principles

  • Layered Architecture: Clear separation of concerns between API, Engine, Page System, and I/O Strategy.
  • MVCC & Snapshot Isolation: Separation of read/write paths using Undo Snapshots.
  • WAL-based Durability: Sequential log writing for reliability and crash recovery.

2. Page-Based Storage and Caching

  • Fixed-size Pages: All data is managed in fixed-size units (default 8KB) called pages.
  • Page Cache: Minimizes disk I/O by caching frequently accessed pages in memory (LRU Strategy).
  • Dirty Page Tracking: Tracks modified pages (Dirty) to synchronize them with disk efficiently only at the time of commit.
  • Free List Management: Efficiently tracks the allocation and deallocation of pages using a Free List (stack-like structure), facilitating fast space reclamation and reuse. (The older Bitmap-based mechanism is deprecated but remains for backward compatibility). For more details on this mechanism, see Page Reclamation and Reuse Guide.
  • Detailed Structure: For technical details on the physical layout, see structure.md.
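
The LRU caching strategy described above can be sketched with a plain JavaScript Map, which preserves insertion order. This is an illustrative sketch only — the class name and internals are not Dataply's actual implementation:

```javascript
// Minimal LRU page cache sketch. A JS Map preserves insertion order:
// re-inserting a key moves it to the end, so the first key is always
// the least recently used.
class LruPageCache {
  constructor(capacity) {
    this.capacity = capacity
    this.pages = new Map() // pageId -> page content
  }

  get(pageId) {
    if (!this.pages.has(pageId)) return undefined
    const page = this.pages.get(pageId)
    // Refresh recency: move the entry to the end of the Map.
    this.pages.delete(pageId)
    this.pages.set(pageId, page)
    return page
  }

  set(pageId, page) {
    if (this.pages.has(pageId)) this.pages.delete(pageId)
    this.pages.set(pageId, page)
    if (this.pages.size > this.capacity) {
      // Evict the least recently used page (first key in the Map).
      const lru = this.pages.keys().next().value
      this.pages.delete(lru)
    }
  }
}

const cache = new LruPageCache(2)
cache.set(1, 'page-1')
cache.set(2, 'page-2')
cache.get(1)           // touch page 1, so page 2 becomes LRU
cache.set(3, 'page-3') // exceeds capacity: evicts page 2
```

In Dataply, the equivalent capacity is controlled by the pageCacheCapacity option.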

Page & Row Layout

Dataply uses a Slotted Page architecture to manage records efficiently:

  • Pages: Each page consists of a 100-byte header (containing type, id, checksum, etc.) and a body where rows are stored. Slot offsets are stored at the end of the page to track row positions.
  • Rows: Each row has a 9-byte header (flags, size, PK) followed by the actual data. Large records automatically trigger Overflow Pages to handle data exceeding page capacity.
  • Keys & Identifiers: Uses a 6-byte Primary Key (PK) for logical mapping and a 6-byte Record Identifier (RID) (Slot + Page ID) for direct physical addressing.
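
To make the 9-byte row header concrete, here is an illustrative decoder. The field order (1-byte flags, 2-byte size, 6-byte PK) and little-endian byte order are assumptions for illustration, not Dataply's documented on-disk format:

```javascript
// Hypothetical decoder for a 9-byte row header laid out as
// [1-byte flags][2-byte size][6-byte PK], little-endian (assumed).
function decodeRowHeader(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const flags = view.getUint8(0)
  const size = view.getUint16(1, true)
  // A 6-byte PK (max 2^48 - 1) fits safely in a JS number (< 2^53).
  const pk = view.getUint32(3, true) + view.getUint16(7, true) * 2 ** 32
  return { flags, size, pk }
}

// flags = 1, size = 512 (0x0200), pk = 1024 (0x000000000400)
const header = Uint8Array.from([1, 0x00, 0x02, 0x00, 0x04, 0, 0, 0, 0])
const row = decodeRowHeader(header)
```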

3. MVCC and Snapshot Isolation

  • Non-blocking Reads: Read operations are not blocked by write operations.
  • Undo Log: When a transaction modifies a page, it keeps the original data in an Undo Buffer. Other transactions trying to read the same page are served this snapshot to ensure consistent reads.
  • Rollback Mechanism: Upon transaction failure, the Undo Buffer is used to instantly restore pages to their original state.
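
The undo-snapshot mechanism above can be sketched as follows. The class and method names are illustrative, not Dataply's internals:

```javascript
// Minimal undo-buffer sketch: before a page is modified, its original
// content is saved. Readers are served the snapshot, and rollback
// restores every touched page.
class UndoBuffer {
  constructor(store) {
    this.store = store         // pageId -> current page content
    this.snapshots = new Map() // pageId -> content before modification
  }

  write(pageId, newContent) {
    // Keep only the first pre-image per page for this transaction.
    if (!this.snapshots.has(pageId)) {
      this.snapshots.set(pageId, this.store.get(pageId))
    }
    this.store.set(pageId, newContent)
  }

  readSnapshot(pageId) {
    // Concurrent readers see the pre-modification version.
    return this.snapshots.has(pageId)
      ? this.snapshots.get(pageId)
      : this.store.get(pageId)
  }

  rollback() {
    for (const [pageId, original] of this.snapshots) {
      this.store.set(pageId, original)
    }
    this.snapshots.clear()
  }
}

const store = new Map([[1, 'original']])
const tx = new UndoBuffer(store)
tx.write(1, 'modified')
const seenByReader = tx.readSnapshot(1) // concurrent reader sees 'original'
tx.rollback()                           // store restored to 'original'
```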

4. WAL (Write-Ahead Logging) and Crash Recovery

  • Performance and Reliability: All changes are recorded in a sequential log file (WAL) before being written to the actual data file. This converts random writes into sequential writes for better performance and ensures data integrity.
  • Crash Recovery: When restarting after an unexpected shutdown, Dataply reads the WAL to automatically replay (Redo) any changes that weren't yet reflected in the data file.
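
Conceptually, the redo phase replays committed log records in order, skipping anything from transactions that never committed. This sketch assumes a simplified in-memory record format, not Dataply's actual WAL layout:

```javascript
// Conceptual WAL redo sketch: committed records are replayed in log
// order; reapplying full page images is idempotent, so recovery can
// safely rerun after a second crash.
function recover(dataPages, walRecords) {
  for (const record of walRecords) {
    // Skip records from transactions that never committed.
    if (!record.committed) continue
    dataPages.set(record.pageId, record.content)
  }
  return dataPages
}

// Simulated crash: the write to page 2 reached the WAL (and was
// committed) but never made it into the data file.
const pages = new Map([[1, 'A']])
const wal = [
  { pageId: 1, content: 'A', committed: true },
  { pageId: 2, content: 'B', committed: true },
  { pageId: 3, content: 'C', committed: false }, // uncommitted: ignored
]
recover(pages, wal)
```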

5. Concurrency Control and Indexing

  • Page-level Locking: Prevents data contention by controlling sequential access to pages through the LockManager.
  • B+Tree Index: Uses a B+Tree structure guaranteeing $O(\log N)$ performance for maximized PK lookup efficiency.

Performance

Dataply is optimized for high-speed data processing. Automated benchmarks are executed on every push to the main branch to ensure consistent performance.

Performance Trend

You can view the real-time performance trend and detailed metrics on our Performance Dashboard.

Tip

Continuous Monitoring: We use github-action-benchmark to monitor performance changes. For every PR, a summary of the performance impact is automatically commented to help maintain high efficiency.

Limitations

As Dataply is currently in Alpha, there are several limitations to keep in mind:

  • PK-Only Access: Data can only be retrieved or modified using the Primary Key. No secondary indexes or complex query logic are available yet.
  • No SQL Support: This is a low-level Record Store. It does not support SQL or any higher-level query language.
  • Memory Usage: The page cache size is bounded by pageCacheCapacity, but workloads with many large records should still be sized with care.

Q&A

Q: Why should I use Dataply instead of a simple JSON file?

While JSON is simple, Dataply is designed for scalable and reliable data management:

| Feature | JSON File Approach | Dataply Record Store |
| --- | --- | --- |
| Memory usage | Loads entire file into RAM | Constant memory via page-based I/O |
| Search speed | Linear scan (O(N)) | B+Tree index lookups (O(log N)) |
| Integrity | High risk of corruption on crash | Protected by WAL and transactions |
| Concurrency | Single-user only | Multi-user via MVCC & locking |

Q: What can I build with Dataply?

Dataply is a low-level record store that provides high-performance ACID persistence. You can use it to build:

  • Simple Websites: Create forums or blogs using local files without complex database setup.
  • Post Identity Management: The Primary Key (PK) automatically generated and returned during insert can be directly used as a unique URL ID for posts (e.g., /post/1024).
  • Custom Storage Engines: Implement domain-specific document databases, caching layers, or log collectors.

Q: Can I extend Dataply to implement a full-featured database?

Absolutely! By leveraging DataplyAPI, you can implement custom indexing (like secondary indexes), query parsers, and complex data schemas. Dataply handles the difficult aspects of transaction management, crash recovery (WAL), and concurrency control, letting you focus on your database's unique features.

Q: How many rows can be inserted per page?

Dataply uses 2-byte slot offsets for data positioning within a page. This allows a theoretical maximum of 65,536 ($2^{16}$) rows per page.

Q: What is the total maximum number of rows a database can hold?

With $2^{32}$ possible pages and $2^{16}$ rows per page, the theoretical limit is 281 trillion ($2^{48}$) rows. In practice, the limit is typically governed by the physical storage size (approx. 32TB for default settings).

Q: Is there a maximum database file size limit?

Using 4-byte (unsigned int) Page IDs and the default 8KB page size, Dataply can manage up to 32TB of data ($2^{32} \times 8KB$).
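
The capacity figures in the two answers above follow directly from the stated field widths:

```javascript
// Capacity math from the documented field widths.
const slotsPerPage = 2 ** 16            // 2-byte slot offsets
const maxPages = 2 ** 32                // 4-byte page IDs
const pageSize = 8 * 1024               // default 8 KB pages

const maxRows = slotsPerPage * maxPages // 2^48 ≈ 281 trillion rows
const maxBytes = maxPages * pageSize    // 2^45 bytes
const maxTiB = maxBytes / 2 ** 40       // 32 TB
```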

Q: Is WAL (Write-Ahead Logging) mandatory?

It is optional. While disabling WAL can improve write performance by reducing synchronous I/O, it is highly recommended for any production-like environment to ensure data integrity and automatic recovery after a system crash.

Q: How does Dataply ensure data consistency during concurrent access?

Dataply utilizes a combination of page-level locking and MVCC (Multi-Version Concurrency Control). This allows for Snapshot Isolation, meaning readers can access a consistent state of the data without being blocked by ongoing write operations.

Contributing

Contributions are welcome! Since Dataply is currently in its Alpha stage, your feedback, bug reports, and feature suggestions are invaluable for shaping the future of this project.

  • Report Bugs: If you find a bug, please open an issue with detailed steps to reproduce.
  • Suggest Features: Have an idea for a new feature? We'd love to hear it!
  • Submit PRs: Feel free to submit Pull Requests for bug fixes or improvements. Please ensure your code follows the existing style and includes appropriate tests.

License

MIT
