Pinecone, a serverless vector database for machine learning, leaves stealth with $10M funding

Feb. 18, 2021, 1:36 a.m.

TL;DR

Pinecone, a new startup from the people who helped launch Amazon SageMaker, has built a vector database that generates data in a specialized format to help build machine learning applications faster, something previously accessible only to the largest organizations. Today, the company came out of stealth with a new product and announced an initial investment of $10 million led by Wing Venture Capital.

Key Facts

1. The vector engine contains all the data structures and algorithms needed to index large amounts of high-dimensional vector data
2. The vector engine converts data into a machine learning-ready format
3. Pinecone was created to make this technology available to any business
4. Vectors are ubiquitous in machine learning

Details

Edo Liberty, the company's co-founder, says he founded the company out of a fundamental belief that the industry was being held back by the lack of broader access to this type of database.

"The data that a machine learning model expects is not a JSON record, it is a high-dimensional vector that is either a list of features or what is called an embedding, which is a numerical representation of the items or objects of the world. This format is much more semantically rich and actionable for machine learning," he explained.

He says this is a concept widely understood by data scientists and supported by research. Still, until now, only the largest and most technically sophisticated companies, like Google or Pinterest, could take advantage of it.
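The distinction Liberty draws between a JSON record and a vector can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record fields, the `to_feature_vector` helper, and the three-dimensional embeddings are invented here and are not taken from Pinecone. Real embeddings come from a trained model and usually have hundreds of dimensions.

```python
import math

# Hypothetical product record, as it might be stored in a JSON document.
record = {"price": 20.0, "rating": 4.5, "in_stock": True}


def to_feature_vector(rec):
    """Flatten a record into a fixed-order list of numbers (a feature vector)."""
    return [float(rec["price"]), float(rec["rating"]), 1.0 if rec["in_stock"] else 0.0]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings: similar items are assigned nearby vectors.
embeddings = {
    "running shoe": [0.9, 0.1, 0.0],
    "trail shoe": [0.8, 0.2, 0.1],
    "coffee mug": [0.0, 0.1, 0.9],
}
```

With these toy values, `cosine_similarity(embeddings["running shoe"], embeddings["trail shoe"])` is much higher than the similarity between the running shoe and the coffee mug. That geometric closeness is what makes the vector form useful to a model in a way a raw record is not.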

Liberty and his team created Pinecone to make this kind of technology available to any business.

The startup spent the last few years building the solution, which consists of three main components. The first is a vector engine that converts the data into this machine learning-ingestible format.

Liberty says that this is the piece of technology that contains all the data structures and algorithms that allow them to index substantial amounts of high-dimensional vector data and search through it efficiently and accurately.
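To make "index and search" concrete, here is a toy sketch of what a vector index does. It is not Pinecone's engine or API: the `LinearScanIndex` class and its `upsert` and `query` methods are hypothetical names, and the search is an exact linear scan over every stored vector. Production systems typically rely on approximate nearest-neighbor data structures to avoid that full scan.

```python
import heapq
import math


class LinearScanIndex:
    """Toy in-memory vector index: stores vectors by id, answers top-k queries."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, item_id, vector):
        # Normalize once at insert time so a query only needs a dot product.
        # Assumes the vector is non-zero.
        norm = math.sqrt(sum(x * x for x in vector))
        self._vectors[item_id] = [x / norm for x in vector]

    def query(self, vector, top_k=2):
        norm = math.sqrt(sum(x * x for x in vector))
        q = [x / norm for x in vector]
        scored = (
            (sum(a * b for a, b in zip(q, v)), item_id)
            for item_id, v in self._vectors.items()
        )
        # Keep only the k highest-scoring (most similar) items.
        return [(item_id, score) for score, item_id in heapq.nlargest(top_k, scored)]
```

A query returns `(item_id, score)` pairs ordered from most to least similar. The cost of this version grows linearly with the number of stored vectors, which is the problem a purpose-built index is meant to solve.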

The second is a cloud-hosted system that applies all of that converted data to the machine learning model while handling things like index lookups and pre- and post-processing, everything a data science team needs to run a machine learning project at scale, with very high workloads and throughput.

The third is a management layer to track all of this and manage data transfer between source locations.

A classic example Liberty uses is an e-commerce recommendation engine. While this has been a standard part of online sales for years, he believes that a vectorized data approach will give much more accurate recommendations. He says data science research confirms this.
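One simple way a vector-based recommender could work is sketched below. This is an assumption, not Liberty's or Pinecone's method: the item names, the two-dimensional vectors, and the `recommend` function are invented for illustration. The sketch averages the embeddings of a shopper's purchases and ranks the remaining items by cosine similarity to that average.

```python
import math

# Made-up item embeddings; a real system would learn these from behavior data.
ITEM_VECTORS = {
    "tent": [0.9, 0.1],
    "sleeping bag": [0.8, 0.3],
    "camp stove": [0.7, 0.2],
    "blender": [0.1, 0.9],
    "toaster": [0.2, 0.8],
}


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recommend(purchased, top_k=1):
    """Rank items the shopper has not bought by similarity to their average purchase."""
    dims = len(next(iter(ITEM_VECTORS.values())))
    profile = [
        sum(ITEM_VECTORS[name][d] for name in purchased) / len(purchased)
        for d in range(dims)
    ]
    candidates = [name for name in ITEM_VECTORS if name not in purchased]
    candidates.sort(key=lambda name: cosine(profile, ITEM_VECTORS[name]), reverse=True)
    return candidates[:top_k]
```

With these toy vectors, a shopper who bought the tent and the sleeping bag is pointed to the camp stove rather than to kitchen appliances.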

"It used to be that implementing something like a recommendation engine was actually incredibly complex, and if you have access to a production-grade database, 90% of the difficulty and heavy lifting in creating those solutions disappears, and that is why we are building this. We believe it is the new standard," he said.

Finally, Pinecone has its own language and supports the CRUD operations typical of databases.

However, it does not use the SQL-like query language typical of other databases. How, then, do you retrieve documents created after a particular date that contain a certain keyword?