Arvix Query

The Alexandria project embedded all of the abstracts on arXiv, using the Instructor-XL model to encode the text. The dataset is a snapshot of arXiv's contents at a point in the past. They took the embeddings, along with the original content, and put them into parquet files. The complete dataset with all the abstracts is 6.8GB, which is impractical for this workshop. Instead we are just going to use one of the parquet files (340MB).

Steps to get ready

  1. For some reason I can't get the devcontainer to download this file, so please do it manually in the terminal:
wget -qb https://github.com/thesteve0/arvix-query/releases/download/0.1/abstracts_1.parquet
  2. Run arxiv-import.py, which creates the table with the pgvector extension and then moves the data into the database.
  3. Run the SQL commands in sql_commands.sql to create the HNSW index.
  4. Now you can run a straight vector similarity search using arxiv-vector-query. There are multiple types of queries in that file, and you can change the abstract text you are sending (a rough sketch of such a query follows this list).
  5. arxiv-hybrid-query combines the results from PostgreSQL full-text search and a pgvector search to provide a combination of similarity and exact word matching.
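Here is a minimal sketch of such a vector similarity query, assuming psycopg (v3), a table named abstracts with columns abstract and embedding, and a local connection string. These names are assumptions; the real ones live in arxiv-import.py and arxiv-vector-query, so adjust accordingly.

# Sketch of a pgvector cosine-distance query; table/column names are assumptions
import psycopg
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")
instruction = "Represent the Research Paper abstract for retrieval; Input:"
sentence = "habitat corridors"

# Encode the query text the same way the stored abstracts were encoded
query_vec = model.encode([[instruction, sentence]])[0].tolist()

with psycopg.connect("dbname=postgres user=postgres host=localhost") as conn:
    rows = conn.execute(
        """
        SELECT abstract, embedding <=> %s::vector AS distance
        FROM abstracts
        ORDER BY distance
        LIMIT 5
        """,
        (str(query_vec),),
    ).fetchall()

for abstract, distance in rows:
    print(f"{distance:.4f}  {abstract[:80]}")

The <=> operator is pgvector's cosine distance, so ordering by it returns the closest abstracts first.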

There is a typo in the project name which I have kept here as well. It is not Arvix, it is Arxiv.

Caveat: This is not production-level code. It was written to make teaching easy, and there are a bunch of bad practices in it. PLEASE do not copy and paste any of this directly into your production code.

Here is the Arxiv site: http://arxiv.org/

Original Data is here: https://alex.macrocosm.so/download

We load one of the parquet files from here. The original file containing all the abstracts was 6.8GB. This smaller file is enough to demonstrate the concept (it is only 340MB): https://github.com/thesteve0/arvix-query/releases/download/0.1/abstracts_1.parquet
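If you want to peek at what is inside the parquet file before importing it, a quick pyarrow sketch works; only the file name comes from this project, and the columns printed are whatever the file actually contains.

# Inspect the downloaded parquet file's schema and a couple of rows
import pyarrow.parquet as pq

table = pq.read_table("abstracts_1.parquet")
print(table.schema)                   # column names and types, including the embedding column
print(table.num_rows, "rows")
print(table.slice(0, 2).to_pylist())  # first two records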

Then use arxiv-import.py to import the data from the parquet file. Make sure to update the connection parameters as needed.
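For orientation, here is a hedged sketch of what an import script like this might do. The table name abstracts, the column names abstract and embedding, and the 768-dimension figure for Instructor-XL vectors are assumptions; the real script may differ.

# Sketch only: load the parquet file and insert rows into a pgvector table
import pyarrow.parquet as pq
import psycopg

CONN_INFO = "dbname=postgres user=postgres host=localhost"  # update connection parameters

records = pq.read_table("abstracts_1.parquet").to_pylist()

with psycopg.connect(CONN_INFO) as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS abstracts (
            id bigserial PRIMARY KEY,
            abstract text,
            embedding vector(768)  -- assuming 768-dimensional Instructor-XL embeddings
        )
        """
    )
    with conn.cursor() as cur:
        for rec in records:
            cur.execute(
                "INSERT INTO abstracts (abstract, embedding) VALUES (%s, %s::vector)",
                (rec["abstract"], str(list(rec["embedding"]))),
            )

Row-by-row inserts are slow for a 340MB file; a real import would batch them (for example with executemany or COPY), but this keeps the sketch readable.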

After the import we need to run the commands in sql_commands.sql to build the index.
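The actual statements are in sql_commands.sql; as an illustration of the shape of an HNSW index in pgvector (table and column names assumed as above), it is typically something like:

# Illustration only: build an HNSW index for cosine distance on the embedding column
import psycopg

with psycopg.connect("dbname=postgres user=postgres host=localhost") as conn:
    conn.execute(
        "CREATE INDEX IF NOT EXISTS abstracts_embedding_hnsw "
        "ON abstracts USING hnsw (embedding vector_cosine_ops)"
    )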

The data was originally encoded using this model: https://huggingface.co/hkunlp/instructor-xl

Here are the instruction and sentence needed for the Instructor embedding. Change the sentence to look for other abstracts with similar content:

sentence = "habitat corridors"
instruction  = "Represent the Research Paper abstract for retrieval; Input:"
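To turn that pair into a vector, the InstructorEmbedding package is typically used as below; this is a sketch, not necessarily the exact code in the query scripts.

# Encode an [instruction, sentence] pair with Instructor-XL
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-xl")
sentence = "habitat corridors"
instruction = "Represent the Research Paper abstract for retrieval; Input:"

embedding = model.encode([[instruction, sentence]])[0]
print(embedding.shape)  # a single vector, ready to compare against the stored embeddings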

This project requires a running PostgreSQL instance with the pgvector extension.
https://github.com/pgvector/pgvector
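A quick way to confirm the instance is reachable and the extension is installed (the connection string is a placeholder; use your own):

# Check that PostgreSQL is up and pgvector is available
import psycopg

with psycopg.connect("dbname=postgres user=postgres host=localhost") as conn:
    row = conn.execute(
        "SELECT extversion FROM pg_extension WHERE extname = 'vector'"
    ).fetchone()
    print("pgvector version:", row[0] if row else "not installed - run CREATE EXTENSION vector")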