Download Dataset & Dataset Format
This page describes how to download the datasets and explains their format.
Download
You can get the datasets from the GitHub Release Page.
If you only need a single dataset, you can download it from the following links:
Each dataset is distributed as a zip file. After downloading, unzip it to obtain the dataset files. For more details about the data format, refer to the following sections.
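For example, a downloaded archive can be unpacked with Python's standard library. A minimal sketch, where the file name is a placeholder for whichever dataset you downloaded:

```python
import zipfile

# Placeholder file name; use the zip file you actually downloaded from the release page.
archive = "single_table_fungi_dataset.zip"

with zipfile.ZipFile(archive) as zf:
    zf.extractall(".")  # unpack into the current directory
```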
Single table fungi dataset
This dataset involves a single table. After unzipping the dataset, you will get the following files:
```
single_table_fungi_dataset
├── answers
│   ├── learn_knn.csv
│   └── test_knn.csv
├── bool_filter
│   ├── learn_filter.csv
│   └── test_filter.csv
├── metadata
│   └── base_metadata.csv
└── vectors
    ├── base_vectors.npy
    ├── learn_vectors.npy
    └── test_vectors.npy
```
There are four folders in the dataset: `answers`, `bool_filter`, `metadata` and `vectors`. The content of each folder is explained below.
- `answers`: This folder contains the reference results of the learn queries and test queries. There are two files in this folder: `learn_knn.csv` and `test_knn.csv`. The former holds the reference results of the learn queries, and the latter holds the reference results of the test queries. Both are CSV files with the following columns:
  - `id`: type is `int`, the id of the query.
  - `max_distance`: type is `float`, the L2 distance between the query vector and the 100-th nearest vector. Since there may be multiple vectors at the same distance from the query vector, this value is used to determine whether a result returned by the system is also correct.
  - `L2_nearest_ids`: type is `list`; each element in the list comes from the `id` column of `base_metadata.csv`, and the list is sorted by L2 distance in ascending order.
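  A minimal sketch of reading the reference results with pandas; it assumes the `L2_nearest_ids` column is serialized as a Python-style list literal, so `ast.literal_eval` is used to parse it:

  ```python
  import ast

  import pandas as pd

  # Reference results of the learn queries for the single table dataset.
  answers = pd.read_csv("single_table_fungi_dataset/answers/learn_knn.csv")

  # Assumption: L2_nearest_ids is stored as a list literal such as "[3, 17, ...]".
  answers["L2_nearest_ids"] = answers["L2_nearest_ids"].apply(ast.literal_eval)

  row = answers.iloc[0]
  print(row["id"], row["max_distance"], row["L2_nearest_ids"][:5])
  ```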
- `bool_filter`: This folder contains the boolean expressions of the learn queries and test queries. There are two files in this folder: `learn_filter.csv` and `test_filter.csv`. The two files share the same columns:
  - `id`: type is `int`, the id of the query.
  - `expression_num`: type is `int`, the number of sub-expressions in the boolean expression.
  - `bool_expression`: type is `str`, the boolean expression of the query.
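  A `bool_expression` can be dropped directly into the `WHERE` clause of a filtered k-NN query. A rough sketch that builds such queries for the `fungis` table, reusing the `L2Distance(input, embedding)` clause from the example query shown later in this section (adapt the vector-search part to your own database):

  ```python
  import pandas as pd

  filters = pd.read_csv("single_table_fungi_dataset/bool_filter/learn_filter.csv")

  # Build one filtered k-NN query per learn query; `input` stands for the query vector.
  for _, row in filters.head(3).iterrows():
      sql = (
          "SELECT id FROM fungis "
          f"WHERE {row['bool_expression']} "
          "ORDER BY L2Distance(input, embedding) LIMIT 100;"
      )
      print(row["id"], row["expression_num"], sql)
  ```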
- `metadata`: This folder contains the metadata of the dataset. There is only one file in this folder, `base_metadata.csv`, which has 295,938 rows. The columns of the file are as follows:
  - `id`: type is `int`, the id of the base vector and its corresponding metadata.
  - `year`: type is `int`.
  - `month`: type is `int`.
  - `day`: type is `int`. These three columns record when the image was taken in the Danish Fungi dataset.
  - `countryCode`: type is `str`, the country where the image was taken.
  - `scientificName`: type is `str`, the scientific name of the fungus in the original image.
  - `Substrate`: type is `str`, the substrate of the fungus in the original image.
  - `Latitude`: type is `float`, the latitude at which the image was taken.
  - `Longitude`: type is `float`, the longitude at which the image was taken.
  - `Habitat`: type is `str`, the habitat of the fungus in the original image.
  - `poisonous`: type is `bool`, whether the fungus species is poisonous to humans.
  When building the database, add a column `embedding` to store the feature vectors from the `base_vectors.npy` file. Here is an example SQL script to create the table for this dataset:

  ```sql
  DROP TABLE if exists `fungis`;
  CREATE TABLE `fungis` (
      id int,
      year int,
      month int,
      day int,
      countryCode char(2),
      scientificName varchar(110),
      Substrate varchar(50),
      Latitude float,
      Longitude float,
      Habitat varchar(60),
      poisonous boolean,
      embedding vector(768),
      primary key (id)
  );
  ```

  Combining a boolean expression with a query vector, an example query looks like this (`input` stands for the query vector):

  ```sql
  SELECT id FROM fungis
  WHERE poisonous == 0 and Substrate != 'cones' or Habitat == 'nan' and day >= 26
  ORDER BY L2Distance(input, embedding) LIMIT 100;
  ```
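  The files above fit together as follows: filter `base_metadata.csv` with a query's `bool_expression`, rank the surviving rows of `base_vectors.npy` by L2 distance to the query vector, and compare the top 100 ids with `learn_knn.csv` (ties at `max_distance` can make the exact sets differ). A rough brute-force sketch; it assumes that a query's id is also its row index in `learn_vectors.npy` and that the expressions happen to be accepted by pandas `DataFrame.query`:

  ```python
  import ast

  import numpy as np
  import pandas as pd

  root = "single_table_fungi_dataset"
  meta = pd.read_csv(f"{root}/metadata/base_metadata.csv", keep_default_na=False)
  base = np.load(f"{root}/vectors/base_vectors.npy")    # (295938, 768)
  learn = np.load(f"{root}/vectors/learn_vectors.npy")  # (60832, 768)
  filters = pd.read_csv(f"{root}/bool_filter/learn_filter.csv")
  answers = pd.read_csv(f"{root}/answers/learn_knn.csv")

  row = filters.iloc[0]
  qid = row["id"]  # assumption: the query id indexes the rows of learn_vectors.npy

  # Rows of base_metadata.csv that satisfy the boolean filter.
  candidates = meta.query(row["bool_expression"])

  # L2 distances to the candidates; row i of base_vectors.npy matches row i of base_metadata.csv.
  dists = np.linalg.norm(base[candidates.index] - learn[qid], axis=1)
  top100 = candidates["id"].to_numpy()[np.argsort(dists)[:100]]

  expected = ast.literal_eval(answers.loc[answers["id"] == qid, "L2_nearest_ids"].item())
  print("recall@100:", len(set(top100) & set(expected)) / 100)
  ```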
- `vectors`: This folder contains the base vectors, learn vectors and test vectors. There are three files in this folder: `base_vectors.npy`, `learn_vectors.npy` and `test_vectors.npy`. Each file is a numpy array. Their shapes are `(295938, 768)`, `(60832, 768)` and `(60225, 768)` respectively, which correspond to the original train, validation and test splits of the Danish Fungi dataset. The first dimension is the number of vectors, and the second dimension is the dimensionality of each vector. The vectors were extracted from the images in the Danish Fungi dataset using the pretrained `vit_small_patch16_224` model from the timm library. The rows in each numpy array correspond to the rows in the matching CSV file, based on which the database can be built and the queries issued.

Multi table movie dataset
This dataset involves multiple tables. After unzipping the dataset, you will get the following files:
```
multi_table_movie_dataset
├── answers
│   ├── learn_knn.csv
│   └── test_knn.csv
├── bool_filter
│   ├── learn_filter.csv
│   └── test_filter.csv
├── metadata
│   ├── conversations.csv
│   ├── genres.csv
│   ├── movies.csv
│   ├── movies_genres.csv
│   ├── speakers.csv
│   └── utterances.csv
└── vectors
    ├── base_vectors.npy
    ├── learn_vectors.npy
    └── test_vectors.npy
```
Similar to the single table fungi dataset, there are four folders in the dataset: `answers`, `bool_filter`, `metadata` and `vectors`. The content of each folder is explained below.

- `answers`: This folder contains the reference results of the learn queries and test queries. There are two files in this folder: `learn_knn.csv` and `test_knn.csv`. The former holds the reference results of the learn queries, and the latter holds the reference results of the test queries. Both are CSV files with the following columns:
  - `id`: type is `int`, the id of the query.
  - `max_distance`: type is `float`, the L2 distance between the query vector and the 100-th nearest vector. Since there may be multiple vectors at the same distance from the query vector, this value is used to determine whether a result returned by the system is also correct.
  - `L2_nearest_ids`: type is `list`; each element in the list comes from the `utterance_id` column of `utterances.csv`, and the list is sorted by L2 distance in ascending order.
- `bool_filter`: This folder contains the boolean expressions of the learn queries and test queries. There are two files in this folder: `learn_filter.csv` and `test_filter.csv`. Since this dataset contains multiple tables, a query involves joins across several tables and the boolean expressions are more complex than in the single table dataset. The columns of the two files are as follows:
  - `id`: type is `int`, the id of the query.
  - `select_clause`: type is `str`.
  - `join_clause`: type is `str`.
  - `where_clause`: type is `str`.

  Concatenating these three columns gives an SQL-style query, which can easily be translated into the syntax required by other databases. Here is an example query:

  ```sql
  SELECT DISTINCT utterances.utterance_id
  FROM utterances
  JOIN conversations ON utterances.conversation_id = conversations.conversation_id
  JOIN movies ON conversations.movie_id = movies.movie_id
  JOIN movies_genres ON movies_genres.movie_id = movies.movie_id
  WHERE conversations.movie_id != 351
     OR movies_genres.genre_id = 23
     OR utterances.reply_to = -1
     OR conversations.conversation_id > 203908 AND conversations.conversation_id != 606452 AND conversations.movie_id = 206;
  ```
- `metadata`: This folder contains the metadata of the dataset. There are six files in this folder: `conversations.csv`, `genres.csv`, `movies.csv`, `movies_genres.csv`, `speakers.csv` and `utterances.csv`. The content of each file is explained below.
  - `conversations.csv`: This file contains the metadata of conversations. Its columns are as follows:
    - `conversation_id`: type is `int`, the id of the conversation.
    - `movie_id`: type is `int`, the id of the movie.
  - `genres.csv`: This file contains the metadata of genres. Its columns are as follows:
    - `genre_id`: type is `int`, the id of the genre.
    - `genre`: type is `str`, the name of the genre.
  - `movies.csv`: This file contains the metadata of movies. Its columns are as follows:
    - `movie_id`: type is `int`, the id of the movie.
    - `movie_name`: type is `str`, the name of the movie.
    - `release_year`: type is `int`, the release year of the movie.
    - `rating`: type is `float`, the rating of the movie.
    - `votes`: type is `int`, the number of votes for the movie.
  - `movies_genres.csv`: This file contains the metadata of the relationship between movies and genres. Its columns are as follows:
    - `movie_id`: type is `int`, the id of the movie.
    - `genre_id`: type is `int`, the id of the genre.
  - `speakers.csv`: This file contains the metadata of speakers. Its columns are as follows:
    - `speaker_id`: type is `int`, the id of the speaker.
    - `character_name`: type is `str`, the name of the character.
    - `movie_id`: type is `int`, the id of the movie.
    - `gender`: type is `char`, the gender of the speaker in the movie.
    - `credit_pos`: type is `int`, the credit position of the speaker in the movie.
  - `utterances.csv`: This file contains the metadata of utterances. Its columns are as follows:
    - `utterance_id`: type is `int`, the id of the utterance.
    - `conversation_id`: type is `int`, the id of the conversation that the utterance belongs to.
    - `speaker_id`: type is `int`, the id of the speaker.
    - `reply_to`: type is `int`, the id of the utterance that the current utterance replies to; `-1` means the utterance is the first one in its conversation.
  When building the database, add a column `embedding` in table `utterances` to store the feature vectors from the `base_vectors.npy` file. Here is an example SQL script to create tables for this dataset:

  ```sql
  DROP TABLE if exists `movies`;
  CREATE TABLE `movies` (
      movie_id int,
      movie_name varchar(100),
      release_year int,
      rating float,
      votes int,
      primary key (movie_id)
  );

  DROP TABLE if exists `genres`;
  CREATE TABLE `genres` (
      genre_id int,
      genre varchar(20),
      primary key (genre_id)
  );

  DROP TABLE if exists `movies_genres`;
  CREATE TABLE `movies_genres` (
      movie_id int references movies(movie_id),
      genre_id int references genres(genre_id),
      foreign key (movie_id) references movies(movie_id),
      foreign key (genre_id) references genres(genre_id)
  );

  DROP TABLE if exists `speakers`;
  CREATE TABLE `speakers` (
      speaker_id int,
      character_name varchar(100),
      movie_id int references movies(movie_id),
      gender char(1),
      credit_pos int,
      primary key (speaker_id),
      foreign key (movie_id) references movies(movie_id)
  );

  DROP TABLE if exists `conversations`;
  CREATE TABLE `conversations` (
      conversation_id int,
      movie_id int references movies(movie_id),
      primary key (conversation_id),
      foreign key (movie_id) references movies(movie_id)
  );

  DROP TABLE if exists `utterances`;
  CREATE TABLE `utterances` (
      utterance_id int,
      conversation_id int references conversations(conversation_id),
      speaker_id int references speakers(speaker_id),
      reply_to int,
      embedding vector(768),
      primary key (utterance_id),
      foreign key (conversation_id) references conversations(conversation_id),
      foreign key (speaker_id) references speakers(speaker_id)
  );
  ```

  Combining a boolean expression with a query vector, an example query looks like this:

  ```sql
  SELECT DISTINCT utterances.utterance_id
  FROM utterances
  JOIN conversations ON utterances.conversation_id = conversations.conversation_id
  WHERE utterances.speaker_id > 8935 AND utterances.reply_to = -1
  ORDER BY L2Distance(input, embedding) LIMIT 100;
  ```
  The `DISTINCT` keyword in the select clause is needed because utterance ids may be duplicated after the JOINs (a movie may belong to more than one genre).
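  To experiment with the relational part of the workload without a vector database, the six metadata CSVs can be loaded into an in-memory SQLite database. A rough sketch; the `embedding` column is omitted because the vectors live in `base_vectors.npy`, so only the boolean-filter part of the example query is executed:

  ```python
  import sqlite3

  import pandas as pd

  root = "multi_table_movie_dataset/metadata"
  con = sqlite3.connect(":memory:")

  # One SQLite table per metadata CSV; column types are inferred by pandas.
  for name in ["movies", "genres", "movies_genres", "speakers", "conversations", "utterances"]:
      pd.read_csv(f"{root}/{name}.csv").to_sql(name, con, index=False)

  # The boolean-filter part of the example query above.
  sql = """
  SELECT DISTINCT utterances.utterance_id
  FROM utterances
  JOIN conversations ON utterances.conversation_id = conversations.conversation_id
  WHERE utterances.speaker_id > 8935 AND utterances.reply_to = -1
  """
  candidate_ids = pd.read_sql_query(sql, con)["utterance_id"]
  print(len(candidate_ids))
  ```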
- `vectors`: This folder contains the base vectors, learn vectors and test vectors. There are three files in this folder: `base_vectors.npy`, `learn_vectors.npy` and `test_vectors.npy`. Each file is a numpy array. Their shapes are `(284713, 768)`, `(10000, 768)` and `(10000, 768)` respectively. The first dimension is the number of vectors, and the second dimension is the dimensionality of each vector. The vectors were extracted from the texts in the Cornell Movie-Dialogs Corpus using the pretrained `bert-base-uncased` model from the transformers library. The rows in each numpy array correspond to the rows in the matching CSV file, based on which the database can be built and the queries issued.
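  A quick sanity check of the stated row correspondence between `base_vectors.npy` and `utterances.csv`:

  ```python
  import numpy as np
  import pandas as pd

  root = "multi_table_movie_dataset"
  base = np.load(f"{root}/vectors/base_vectors.npy")
  utterances = pd.read_csv(f"{root}/metadata/utterances.csv")

  # Row i of base_vectors.npy is the embedding of the utterance in row i of utterances.csv.
  assert base.shape == (len(utterances), 768)
  print(base.shape)
  ```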