            Vector Search

            Vector search on machine learning embeddings: CrateDB is all you need.

            Overview

            CrateDB can be used as a vector database (VDBMS) for storing and retrieving vector embeddings.

            CrateDB’s FLOAT_VECTOR data type implements a vector store and the k-nearest neighbor (kNN) search algorithm to find vectors that are similar to a query vector. This works by using its accompanying KNN_MATCH and VECTOR_SIMILARITY functions to perform HNSW-based semantic similarity search, also known as vector search.
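To make the nearest-neighbor idea concrete, here is a minimal sketch of exact (brute-force) kNN search in plain Python. It is not how CrateDB or Lucene implement it: HNSW builds a graph index to approximate this result without scanning every vector. The function name `knn` and the two-dimensional sample vectors are hypothetical and chosen only for illustration.

```python
import heapq
import math


def knn(query, vectors, k):
    """Return the k (label, distance) pairs closest to `query` by Euclidean distance."""
    scored = ((label, math.dist(query, vec)) for label, vec in vectors.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical sample data, not taken from the documentation.
vectors = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 3.0],
    "d": [4.0, 4.0],
}

nearest = knn([0.9, 0.1], vectors, k=2)
print([label for label, _ in nearest])  # prints ['b', 'a']
```

A brute-force scan like this compares the query against every stored vector, which is why large collections usually rely on an approximate index instead.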

            About

            Vector search leverages machine learning (ML) to capture the meaning and context of unstructured data, including text and images, transforming it into a numeric representation.

            Frequently used for semantic search, vector search finds similar data using approximate nearest neighbor (ANN) algorithms. Compared to traditional keyword search, vector search often yields more relevant results when meaning matters more than exact wording, and it can execute faster.

            Feature vectors are computed from raw data via ML methods such as feature extraction, word embeddings, or deep neural networks.

            Details

            CrateDB uses Lucene as a storage layer, so it inherits the implementation and concepts of Lucene Vector Search, in the same spirit as Elasticsearch.

            To learn more details about what’s inside, please refer to the HNSW graph search algorithm, how Lucene implemented it, how Elasticsearch now also builds on it, and why effectively Lucene Is All You Need.

            While Elasticsearch uses a query DSL based on JSON, in CrateDB, you can work with Lucene Vector Search using SQL.

            Reference Manual

            • FLOAT_VECTOR

            • KNN_MATCH

            • VECTOR_SIMILARITY

            Related

            • SQL

            • Full-Text Search

            • Geospatial Search

            • Hybrid Search

            • Machine learning

            • Advanced Querying

            Tags: SQL • Semantic Search • Machine Learning • ML • Embeddings • Vector Store

            Synopsis

            Store and query word embeddings using similarity search based on Euclidean distance.

            DDL

            CREATE TABLE word_embeddings (
              text STRING PRIMARY KEY,
              embedding FLOAT_VECTOR(4)
            );
            

            DML

            INSERT INTO word_embeddings (text, embedding)
            VALUES
              ('Exploring the cosmos', [0.1, 0.5, -0.2, 0.8]),
              ('Discovering moon', [0.2, 0.4, 0.1, 0.7]),
              ('Discovering galaxies', [0.2, 0.4, 0.2, 0.9]),
              ('Sending the mission', [0.5, 0.9, -0.1, -0.7])
            ;
            

            DQL

            WITH param AS
              (SELECT [0.3, 0.6, 0.0, 0.9] AS sv)
            SELECT
              text,
              VECTOR_SIMILARITY(embedding, (SELECT sv FROM param))
                AS score
            FROM
              word_embeddings
            WHERE
              KNN_MATCH(embedding, (SELECT sv FROM param), 2)
            ORDER BY
              score DESC;
            

            Result

            +----------------------+-----------+
            | text                 |     score |
            +----------------------+-----------+
            | Discovering galaxies | 0.9174312 |
            | Exploring the cosmos | 0.9090909 |
            | Discovering moon     | 0.9090909 |
            | Sending the mission  | 0.2702703 |
            +----------------------+-----------+
            SELECT 4 rows in set (0.078 sec)
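The scores in the result table can be reproduced outside the database. The sketch below assumes that the score for Euclidean similarity is computed as 1 / (1 + d²), where d is the Euclidean distance between the two vectors; this formula is an assumption, not stated on this page, although it matches all four values shown above. The function name `euclidean_similarity` is hypothetical.

```python
def euclidean_similarity(a, b):
    """Assumed scoring formula: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)


# Query vector and rows copied from the DML and DQL examples above.
query = [0.3, 0.6, 0.0, 0.9]
rows = {
    "Exploring the cosmos": [0.1, 0.5, -0.2, 0.8],
    "Discovering moon": [0.2, 0.4, 0.1, 0.7],
    "Discovering galaxies": [0.2, 0.4, 0.2, 0.9],
    "Sending the mission": [0.5, 0.9, -0.1, -0.7],
}

scores = {text: euclidean_similarity(vec, query) for text, vec in rows.items()}
for text, score in scores.items():
    print(f"{text:<22}{score:.7f}")
```

The query returns four rows although the third argument to KNN_MATCH is 2. As far as the KNN_MATCH reference describes it, that argument limits matches per shard rather than for the whole query, so adding a LIMIT clause is the way to cap the total number of rows.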
            

            Usage

            Working with vector data in CrateDB.

            Pure SQL

            CrateDB’s vector store features are available through SQL and can be used by any application that speaks it. A FLOAT_VECTOR value is fundamentally a plain array of floating point numbers, and it is communicated as such through CrateDB’s HTTP and PostgreSQL interfaces.
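As an illustration of the "plain array" point, the following sketch builds a request for CrateDB's HTTP endpoint using only the Python standard library. It assumes a cluster listening on `localhost:4200` (the default HTTP port) and the `word_embeddings` table from the synopsis; the helper name `build_sql_request` and the inserted row are hypothetical. The request is constructed but not sent, so the snippet runs without a database.

```python
import json
import urllib.request


def build_sql_request(stmt, args, url="http://localhost:4200/_sql"):
    """Build a POST request carrying an SQL statement and its parameters as JSON."""
    payload = json.dumps({"stmt": stmt, "args": args}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical row; the embedding travels as an ordinary JSON array.
request = build_sql_request(
    "INSERT INTO word_embeddings (text, embedding) VALUES (?, ?)",
    ["Landing on Mars", [0.4, 0.3, 0.1, 0.6]],
)
body = json.loads(request.data)
print(body["args"][1])

# To send it against a running cluster (assumption: no authentication required):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Binding the vector as a parameter keeps the statement text constant and avoids formatting float values into the SQL string by hand.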

            Framework Integrations

            CrateDB supports applications that use the vector data type through corresponding framework adapters. The page about Machine learning illustrates all of them, covering both machine learning operations (MLOps) and vector database operations (similarity search).

            Learn

            Learn how to set up your database for vector search, how to create the relevant indices, and how to semantically query your data efficiently. A few must-reads for anyone looking to make sense of large volumes of unstructured text data.

            Tutorials

            Vector Support and KNN Search through SQL

            The addition of vector support and KNN search makes CrateDB the optimal multi-model database for all types of data. Whether it is structured, semi-structured, or unstructured data, CrateDB stands as the all-in-one solution, capable of handling diverse data types with ease.

            In this feature-focused blog post, we will introduce how CrateDB can be used as a vector database and how the vector store is implemented. We will also explore the possibilities of the K-Nearest Neighbors (KNN) search, and demonstrate vector capabilities with easy-to-follow examples.

            Tags: Blog • Introduction • Vector Store • SQL

            Retrieval Augmented Generation (RAG) with CrateDB and SQL

            This notebook illustrates CrateDB’s vector store using pure SQL, by way of an example that exercises a RAG workflow.

            It uses the white paper Time series data in manufacturing as input data, generates embeddings using an OpenAI embedding model, stores them in a table using FLOAT_VECTOR(1536), and queries it using the KNN_MATCH and VECTOR_SIMILARITY functions.

            Notebook on GitHub • Notebook on Colab • Notebook on Binder

            Tags: Fundamentals • Vector Store • LangChain • pandas • SQL

            Technologies

            Support for Vector Search in Apache Lucene

            Uwe Schindler talks at Berlin Buzzwords 2023 about the new vector search features of Lucene 9, and about the journey of implementing HNSW from 2016 to 2021.

            • Uwe Schindler - What’s coming next with Apache Lucene?

             

            Tags: Fundamentals • Lucene • Vector Search

            See also

            Features: Advanced Querying • Full-Text Search

            Domains: Industrial big data • Machine learning • Time series data

            Product: Relational Database • Vector Database
