Originally published on 2021-09-28 10:18:45 by floridanewstimes.com
Suppose you want to implement a music service that behaves like Spotify and finds songs that resemble your favorite song. How would you go about it? One way is to characterize each song by a number of features, store those “vectors” in an indexed database, and search the database for the vectors that are “near” the vector describing your favorite song. In other words, you could perform a vector similarity search.
What is a vector similarity search?
Vector similarity search usually has four components: vector embeddings that capture the key characteristics of the original objects, such as songs, images, or text; a distance metric that represents the “closeness” between vectors; a search algorithm; and a database that holds the vectors and supports vector search using indexes.
What is vector embedding?
Vector embeddings are essentially feature vectors, as that term is understood in the context of machine learning and deep learning. They can be defined by performing feature engineering manually or by using the output of a model.
For example, text strings can be converted to word embeddings (feature vectors) using neural networks, dimensionality reduction of the word co-occurrence matrix, probabilistic models, explainable knowledge-based methods, and explicit representations in terms of the context in which words appear. Common models for training and using word embeddings include word2vec (Google), GloVe (Stanford), ELMo (Allen Institute / University of Washington), BERT (Google), and fastText (Facebook).
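As a rough illustration of the simplest approach in that list, an explicit representation of each word by the context it appears in, the sketch below counts co-occurrences within a small window. The toy corpus, the window size, and the function name are assumptions made for this example; models such as word2vec learn dense vectors rather than raw counts.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Return (vocab, vectors): one count vector per word over its context words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    vocab = sorted(counts)
    # Each word's vector lists how often every vocabulary word appeared near it.
    vectors = {w: [counts[w][c] for c in vocab] for w in vocab}
    return vocab, vectors

# Hypothetical two-sentence corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
vocab, vectors = cooccurrence_vectors(corpus)
```

In this toy corpus “cat” and “dog” appear in the same contexts, so they end up with identical vectors, which is the intuition behind treating words with similar contexts as similar.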
Images are often embedded by capturing the output of a convolutional neural network (CNN) model or a transformer model. These models automatically reduce the dimensionality of the feature vectors by rolling up (“convolving”) patches of pixels into features and downsampling with pooling layers.
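To make the dimensionality reduction concrete, here is a minimal sketch of one convolution followed by one max-pooling step on a tiny grayscale image. The image values and the 2x2 kernel are made up for illustration; a trained CNN learns many kernels and stacks many such layers.

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation, the operation used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Downsample by keeping the maximum of each non-overlapping size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# Hypothetical 5x5 image and a hand-picked kernel (a real model learns its kernels)
image = [[(r * 5 + c) % 7 for c in range(5)] for r in range(5)]
kernel = [[1, 0], [0, -1]]
features = convolve2d(image, kernel)             # 5x5 -> 4x4
pooled = max_pool(features)                      # 4x4 -> 2x2
embedding = [v for row in pooled for v in row]   # flattened 4-element vector
```

The 25 input pixels shrink to a 4-element vector after a single convolution and pooling pass.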
Product recommendations may be based on embeddings of the words and phrases in the product description, on embeddings of the product images, or on both. Audio embeddings may be based on the Fourier transform of the audio (which gives the spectrum); on descriptions of the composer, genre, artist, tempo, rhythm, and loudness; or on both the spectrum and keywords. The field is evolving rapidly, so I expect new embedding techniques to appear in many application areas.
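One way a spectrum could become a short feature vector is sketched below: compute the magnitude spectrum with a naive discrete Fourier transform, then average it into a few frequency bands. The band count, the function names, and the synthetic sine-wave input are assumptions for the example; production systems would use a fast FFT and richer features.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude spectrum via a naive O(n^2) discrete Fourier transform."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def spectrum_embedding(samples, bands=4):
    """Average the magnitude spectrum into a few bands to get a short vector."""
    mags = dft_magnitudes(samples)[1:]  # drop the DC (zero-frequency) component
    step = len(mags) // bands
    return [sum(mags[b * step:(b + 1) * step]) / step for b in range(bands)]

# Hypothetical input: a 64-sample sine wave with 4 cycles in the window
n = 64
tone = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
embedding = spectrum_embedding(tone)
```

For this pure tone nearly all of the energy lands in the lowest band, so the first component dominates and the rest are close to zero.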
What is a distance metric?
We usually think of distance in terms of straight lines in two or three dimensions. Vector embeddings often have more than 10 dimensions, and 1,000 dimensions is not at all unusual. The general formula for distance is named after Hermann Minkowski, who is best known (at least to physicists) for formulating Einstein’s special theory of relativity as a four-dimensional space-time. The Minkowski metric (or distance) is a generalization of both the Euclidean distance (direct straight lines) and the Manhattan distance (jagged lines, like walking city blocks).
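The Minkowski distance of order p between two vectors is the p-th root of the sum of the absolute coordinate differences each raised to the power p. A minimal sketch, with a function name chosen for this example, shows how p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance:

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

origin, point = [0, 0], [3, 4]
manhattan = minkowski_distance(origin, point, 1)  # walk 3 blocks, then 4 blocks
euclidean = minkowski_distance(origin, point, 2)  # straight-line distance
```

The same function works unchanged for vectors with hundreds or thousands of dimensions.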
The Euclidean distance, also known as the L2 distance or L2 norm, is the most common metric used in clustering algorithms. Another metric, cosine similarity, is often used for text processing, where the direction of the embedding vectors is important but the distance between them is not.
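The difference between the two metrics is easy to see with two vectors that point in the same direction but have different lengths. In this sketch (function names and sample vectors are illustrative), the cosine similarity is 1 even though the Euclidean distance is clearly nonzero:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; depends on direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """L2 distance; depends on both direction and magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [1.0, 2.0, 3.0]
long_doc = [2.0, 4.0, 6.0]   # same direction, twice the magnitude
similarity = cosine_similarity(short_doc, long_doc)
distance = euclidean_distance(short_doc, long_doc)
```

This is why cosine similarity suits text: a long document and a short one about the same topic can have word-count vectors of very different lengths but similar direction.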
What are the algorithms that can perform vector similarity searches?
In general, a K-nearest neighbor (KNN) algorithm is likely to give good answers to vector search problems. The main problem with KNN is that it is computationally expensive, in both processor and memory usage.
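A brute-force KNN search can be sketched in a few lines, which also shows where the cost comes from: every query is compared against every stored vector, and all vectors must be held in memory. The song names and three-element feature vectors below are hypothetical.

```python
import heapq
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_search(query, vectors, k=3):
    """Exact KNN: score every stored vector, then keep the k closest."""
    scored = ((euclidean_distance(query, vec), key) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored)

# Hypothetical catalog: each song described by three made-up features
songs = {
    "song_a": [0.90, 0.80, 0.70],
    "song_b": [0.10, 0.20, 0.10],
    "song_c": [0.85, 0.75, 0.80],
    "song_d": [0.50, 0.50, 0.50],
}
favorite = [0.90, 0.80, 0.75]
neighbors = knn_search(favorite, songs, k=2)   # list of (distance, song) pairs
nearest_names = [name for _, name in neighbors]
```

With n stored vectors of d dimensions, each query costs on the order of n times d operations, which becomes impractical at billions of vectors.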
Alternatives to KNN include approximate nearest neighbor (ANN) search algorithms and a variation on ANN, space partition tree and graph (SPTAG). SPTAG was released to open source by Microsoft Research and Bing. A similar variation on ANN released to open source by Facebook is Facebook AI Similarity Search (Faiss). Product quantizers and the IndexIVFPQ index help to speed up Faiss and some other ANN variants. As I mentioned earlier, vector databases often build vector indexes to improve search speed.
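The general idea behind a partitioned (inverted-file style) index can be sketched as follows: group the vectors into cells around centroids, then search only the cells closest to the query. This is a simplified illustration and not the Faiss or SPTAG implementation; the centroids are hand-picked here, whereas a real index would learn them from the data (for example with k-means), and the `nprobe` parameter name is used only to mean "number of cells to visit".

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_cells(vectors, centroids):
    """Assign each vector id to the cell of its nearest centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: euclidean_distance(vec, centroids[i]))
        cells[nearest].append(vid)
    return cells

def ann_search(query, vectors, centroids, cells, k=1, nprobe=1):
    """Approximate search: scan only the nprobe cells closest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda i: euclidean_distance(query, centroids[i]))
    candidates = [vid for i in order[:nprobe] for vid in cells[i]]
    candidates.sort(key=lambda vid: euclidean_distance(query, vectors[vid]))
    return candidates[:k]

# Hypothetical 2-D data and hand-picked centroids
vectors = {"a": [0.1, 0.2], "b": [0.2, 0.1], "c": [0.9, 0.8], "d": [0.8, 0.9]}
centroids = [[0.0, 0.0], [1.0, 1.0]]
cells = build_cells(vectors, centroids)
result = ann_search([0.85, 0.8], vectors, centroids, cells, k=1, nprobe=1)
```

The search touches only half of the stored vectors here. The trade-off is that a true nearest neighbor sitting in an unvisited cell would be missed, which is why such methods are called approximate.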
Faiss was built to search for multimedia documents that are similar to a query document in a billion-vector database. For evaluation purposes, the developers used Deep1B, a collection of one billion images. Faiss lets you customize vector preprocessing, database partitioning, and vector encoding (product quantization) so that the dataset fits in the available RAM. Faiss is implemented separately for CPUs and GPUs. On CPUs, Faiss can achieve a 40% recall score on the one-billion-image dataset in 2 ms, which translates to 500 queries per second per core. On a Pascal-class Nvidia GPU, Faiss searches 20 times faster than on the CPU.
SPTAG was built for a similar purpose, although it uses slightly different methods. Bing vectorized over 150 billion pieces of data indexed by the search engine to improve results over traditional keyword matching. The vectorized data includes single words, characters, web page snippets, full queries, and other media. The authors of SPTAG built on their previous research on ANN at Microsoft Research Asia, which used query-driven iterative neighborhood graph search, and implemented both a kd-tree algorithm (better for index building) and a balanced k-means tree algorithm (better for search accuracy). Searches start with several random seeds, then iterate continuously in the trees and the graph.
Pinecone is a fully managed vector database with an API that makes it easy to add vector search to production applications. Pinecone’s similarity search service is distributed, serverless, persistent, consistent, sharded, and replicated across many nodes. Pinecone can handle billions of vector embeddings, and you can run similarity searches from Python or Java applications and notebooks.
Pinecone claims a latency of less than 50 ms, even with billions of items and thousands of queries per second. It runs on hardened AWS infrastructure. Data is stored in isolated containers and encrypted in transit.
What are the applications of vector search?
In addition to the image search demonstrated by Facebook and the semantic text search implemented by Microsoft Bing, vector similarity search serves many other use cases. Examples include product recommendations, FAQ answers, personalization, voice search, deduplication, and threat detection in IT event logs.
Copyright © 2021 IDG Communications, Inc.