LESSWRONG

Machine Learning (ML) · AI · Frontpage


[ Question ]

Vector search on a large dataset?

by camsdixon
10th Nov 2023
1 min read


1 answer, sorted by top scoring

leogao

Nov 11, 2023


10k vectors is pretty small. you should be able to get away with packing all of your vectors into a matrix in e.g. PyTorch and doing a simple matrix-vector product. should take ~tens of milliseconds.
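A minimal sketch of that brute-force approach, shown with NumPy for self-containedness (in PyTorch the scoring line is the same `db @ query`); the 384-dimensional embeddings and cosine scoring are illustrative assumptions, not from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10k embeddings of dimension 384 (both numbers are illustrative assumptions).
db = rng.standard_normal((10_000, 384)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)   # normalize once so dot product = cosine similarity

query = rng.standard_normal(384).astype(np.float32)
query /= np.linalg.norm(query)

scores = db @ query                      # one matrix-vector product scores every vector
top5 = np.argsort(scores)[-5:][::-1]     # exact top-5 nearest neighbours, best first
```

At this scale an exact scan is fast enough that an approximate index (FAISS, HNSW, etc.) buys little; you only pay the indexing complexity once the corpus is orders of magnitude larger.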

gwern

That won't help if his embeddings are bad:

"if the data is similar enough."

He's not complaining about slowness, otherwise he'd say something like 'we tried FAISS and it was still too many milliseconds per lookup'. If his embeddings are bad, then even an exact nearest-neighbors lookup by brute-forcing every possible match (which, as you say, is more feasible than people usually think) won't help. You'll get the same bad answer.


Does anyone have advice, or recommendations for books, on how to accomplish vector search with large datasets?

What I've found so far is that vector DBs are very bad at larger datasets, even on the order of 10,000s of vectors if the data is similar enough. Some ideas we've gone through so far:

  1. Smaller chunks and larger chunks -> spaCy and others
  2. Summarizing chunks, using the summary as the embedding and the actual chunk as the metadata
  3. Traditional search with vector search on top (this seems the best), but it still runs into issues with chunks cutting off at the wrong locations
  4. Of course, all the "traditional" systems like LangChain etc.
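Option 3 can be sketched as a two-stage pipeline: a traditional keyword filter narrows the candidates, then vector similarity reranks only the survivors. Everything here is hypothetical for illustration — the names `toy_embed` and `hybrid_search` are made up, and the hashed bag-of-words embedding stands in for a real sentence encoder:

```python
import numpy as np

def toy_embed(text, dim=64):
    # Stand-in embedding: hashed bag-of-words. A real system would call a sentence-encoder model.
    v = np.zeros(dim, dtype=np.float32)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

chunks = [
    "vector databases degrade when embeddings are too similar",
    "summarize each chunk and embed the summary",
    "keyword search narrows candidates before vector scoring",
    "cooking pasta requires salted boiling water",
]

def hybrid_search(query, chunks, top_k=2):
    # Stage 1: traditional keyword filter -- keep chunks sharing any token with the query.
    q_tokens = set(query.lower().split())
    candidates = [i for i, c in enumerate(chunks) if q_tokens & set(c.lower().split())]
    if not candidates:                    # nothing matched: fall back to scoring everything
        candidates = list(range(len(chunks)))
    # Stage 2: vector rerank of the surviving candidates only.
    q_vec = toy_embed(query)
    scored = sorted(candidates, key=lambda i: -float(toy_embed(chunks[i]) @ q_vec))
    return scored[:top_k]
```

The keyword stage keeps obviously off-topic chunks out of the vector comparison entirely, which is what helps when many embeddings sit close together; the chunk-boundary problem the question mentions still has to be handled at splitting time.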

Any help, ideas, or recommendations on where I can read would be very much appreciated!