Download Dataset & Dataset Format
This page describes how to download the datasets and explains their format.
Download
You can get the datasets from the GitHub Release Page.
If you only need a single dataset, you can download it from the following links:
Each dataset is distributed as a zip file. After downloading, unzip it to obtain the dataset files. For more details about the data format, refer to the following sections.
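For example, a downloaded archive can be unpacked with Python's standard library. A minimal sketch, where the file name is a placeholder for whichever dataset you downloaded:

```python
import zipfile

# Placeholder file name; use the zip file you actually downloaded from the release page.
archive = "single_table_fungi_dataset.zip"

with zipfile.ZipFile(archive) as zf:
    zf.extractall(".")  # unpack into the current directory
```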
Single table fungi dataset
This dataset involves a single table. After unzipping the dataset, you will get the following files:
```
single_table_fungi_dataset
├── answers
│   ├── learn_knn.csv
│   └── test_knn.csv
├── bool_filter
│   ├── learn_filter.csv
│   └── test_filter.csv
├── metadata
│   └── base_metadata.csv
└── vectors
    ├── base_vectors.npy
    ├── learn_vectors.npy
    └── test_vectors.npy
```
There are four folders in the dataset: `answers`, `bool_filter`, `metadata` and `vectors`. The content of each folder is explained below.
- `answers`: This folder contains the reference results of the learn queries and test queries. There are two files in this folder: `learn_knn.csv` and `test_knn.csv`. The former holds the reference results of the learn queries, and the latter holds the reference results of the test queries. Both are CSV files with the following columns:
  - `id`: type is `int`, the id of the query.
  - `max_distance`: type is `float`, the L2 distance between the query vector and the 100-th nearest vector. Since there may be multiple vectors at the same distance from the query vector, this value is used to determine whether a result returned by the system is also correct.
  - `L2_nearest_ids`: type is `list`; each element in the list comes from the `id` column of `base_metadata.csv`, and the list is sorted by L2 distance in ascending order.
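  A minimal sketch of reading the reference results with pandas; it assumes the `L2_nearest_ids` column is serialized as a Python-style list literal, so `ast.literal_eval` is used to parse it:

  ```python
  import ast

  import pandas as pd

  # Reference results of the learn queries for the single table dataset.
  answers = pd.read_csv("single_table_fungi_dataset/answers/learn_knn.csv")

  # Assumption: L2_nearest_ids is stored as a list literal such as "[3, 17, ...]".
  answers["L2_nearest_ids"] = answers["L2_nearest_ids"].apply(ast.literal_eval)

  row = answers.iloc[0]
  print(row["id"], row["max_distance"], row["L2_nearest_ids"][:5])
  ```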
- `bool_filter`: This folder contains the boolean expressions of the learn queries and test queries. There are two files in this folder: `learn_filter.csv` and `test_filter.csv`. The two files share the same columns:
  - `id`: type is `int`, the id of the query.
  - `expression_num`: type is `int`, the number of sub-expressions in the boolean expression.
  - `bool_expression`: type is `str`, the boolean expression of the query.
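  A `bool_expression` can be dropped directly into the `WHERE` clause of a filtered k-NN query. A rough sketch that builds such queries for the `fungis` table, reusing the `L2Distance(input, embedding)` clause from the example query shown later in this section (adapt the vector-search part to your own database):

  ```python
  import pandas as pd

  filters = pd.read_csv("single_table_fungi_dataset/bool_filter/learn_filter.csv")

  # Build one filtered k-NN query per learn query; `input` stands for the query vector.
  for _, row in filters.head(3).iterrows():
      sql = (
          "SELECT id FROM fungis "
          f"WHERE {row['bool_expression']} "
          "ORDER BY L2Distance(input, embedding) LIMIT 100;"
      )
      print(row["id"], row["expression_num"], sql)
  ```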
- `metadata`: This folder contains the metadata of the dataset. There is only one file in this folder, `base_metadata.csv`, which has 295,938 rows. The columns of the file are as follows:
  - `id`: type is `int`, the id of the base vector and its corresponding metadata.
  - `year`: type is `int`.
  - `month`: type is `int`.
  - `day`: type is `int`. These three columns record when the image was taken in the Danish Fungi dataset.
  - `countryCode`: type is `str`, the country where the image was taken.
  - `scientificName`: type is `str`, the scientific name of the fungus in the original image.
  - `Substrate`: type is `str`, the substrate of the fungus in the original image.
  - `Latitude`: type is `float`, the latitude at which the image was taken.
  - `Longitude`: type is `float`, the longitude at which the image was taken.
  - `Habitat`: type is `str`, the habitat of the fungus in the original image.
  - `poisonous`: type is `bool`, whether the fungus species is poisonous to humans.
  When building the database, add a column `embedding` to store the feature vectors from the `base_vectors.npy` file. Here is an example SQL script to create the table for this dataset:

  ```sql
  DROP TABLE if exists `fungis`;
  CREATE TABLE `fungis` (
      id int,
      year int,
      month int,
      day int,
      countryCode char(2),
      scientificName varchar(110),
      Substrate varchar(50),
      Latitude float,
      Longitude float,
      Habitat varchar(60),
      poisonous boolean,
      embedding vector(768),
      primary key (id)
  );
  ```

  Combining a boolean expression with a query vector, an example query looks like this (`input` stands for the query vector):

  ```sql
  SELECT id FROM fungis
  WHERE poisonous == 0 and Substrate != 'cones' or Habitat == 'nan' and day >= 26
  ORDER BY L2Distance(input, embedding) LIMIT 100;
  ```
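  The files above fit together as follows: filter `base_metadata.csv` with a query's `bool_expression`, rank the surviving rows of `base_vectors.npy` by L2 distance to the query vector, and compare the top 100 ids with `learn_knn.csv` (ties at `max_distance` can make the exact sets differ). A rough brute-force sketch; it assumes that a query's id is also its row index in `learn_vectors.npy` and that the expressions happen to be accepted by pandas `DataFrame.query`:

  ```python
  import ast

  import numpy as np
  import pandas as pd

  root = "single_table_fungi_dataset"
  meta = pd.read_csv(f"{root}/metadata/base_metadata.csv", keep_default_na=False)
  base = np.load(f"{root}/vectors/base_vectors.npy")    # (295938, 768)
  learn = np.load(f"{root}/vectors/learn_vectors.npy")  # (60832, 768)
  filters = pd.read_csv(f"{root}/bool_filter/learn_filter.csv")
  answers = pd.read_csv(f"{root}/answers/learn_knn.csv")

  row = filters.iloc[0]
  qid = row["id"]  # assumption: the query id indexes the rows of learn_vectors.npy

  # Rows of base_metadata.csv that satisfy the boolean filter.
  candidates = meta.query(row["bool_expression"])

  # L2 distances to the candidates; row i of base_vectors.npy matches row i of base_metadata.csv.
  dists = np.linalg.norm(base[candidates.index] - learn[qid], axis=1)
  top100 = candidates["id"].to_numpy()[np.argsort(dists)[:100]]

  expected = ast.literal_eval(answers.loc[answers["id"] == qid, "L2_nearest_ids"].item())
  print("recall@100:", len(set(top100) & set(expected)) / 100)
  ```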
- `vectors`: This folder contains the base vectors, learn vectors and test vectors. There are three files in this folder: `base_vectors.npy`, `learn_vectors.npy` and `test_vectors.npy`. Each file is a numpy array. Their shapes are `(295938, 768)`, `(60832, 768)` and `(60225, 768)` respectively, which correspond to the original train, validation and test splits of the Danish Fungi dataset. The first dimension is the number of vectors, and the second dimension is the dimensionality of each vector. The vectors were extracted from the images in the Danish Fungi dataset using the pretrained `vit_small_patch16_224` model from the timm library. The rows in each numpy array correspond to the rows in the matching CSV file, based on which the database can be built and the queries issued.

Multi table movie dataset
This dataset involves multiple tables. After unzipping the dataset, you will get the following files:
```
multi_table_movie_dataset
├── answers
│   ├── learn_knn.csv
│   └── test_knn.csv
├── bool_filter
│   ├── learn_filter.csv
│   └── test_filter.csv
├── metadata
│   ├── conversations.csv
│   ├── genres.csv
│   ├── movies.csv
│   ├── movies_genres.csv
│   ├── speakers.csv
│   └── utterances.csv
└── vectors
    ├── base_vectors.npy
    ├── learn_vectors.npy
    └── test_vectors.npy
```
Similar to the single table fungi dataset, there are four folders in the dataset: `answers`, `bool_filter`, `metadata` and `vectors`. The content of each folder is explained below.

- `answers`: This folder contains the reference results of the learn queries and test queries. There are two files in this folder: `learn_knn.csv` and `test_knn.csv`. The former holds the reference results of the learn queries, and the latter holds the reference results of the test queries. Both are CSV files with the following columns:
  - `id`: type is `int`, the id of the query.
  - `max_distance`: type is `float`, the L2 distance between the query vector and the 100-th nearest vector. Since there may be multiple vectors at the same distance from the query vector, this value is used to determine whether a result returned by the system is also correct.
  - `L2_nearest_ids`: type is `list`; each element in the list comes from the `utterance_id` column of `utterances.csv`, and the list is sorted by L2 distance in ascending order.
- `bool_filter`: This folder contains the boolean expressions of the learn queries and test queries. There are two files in this folder: `learn_filter.csv` and `test_filter.csv`. Since this dataset contains multiple tables, a query involves joins across several tables and the boolean expressions are more complex than in the single table dataset. The columns of the two files are as follows:
  - `id`: type is `int`, the id of the query.
  - `select_clause`: type is `str`.
  - `join_clause`: type is `str`.
  - `where_clause`: type is `str`.

  Concatenating these three columns gives an SQL-style query, which can easily be translated into the syntax required by other databases. Here is an example query:

  ```sql
  SELECT DISTINCT utterances.utterance_id
  FROM utterances
  JOIN conversations ON utterances.conversation_id = conversations.conversation_id
  JOIN movies ON conversations.movie_id = movies.movie_id
  JOIN movies_genres ON movies_genres.movie_id = movies.movie_id
  WHERE conversations.movie_id != 351
     OR movies_genres.genre_id = 23
     OR utterances.reply_to = -1
     OR conversations.conversation_id > 203908 AND conversations.conversation_id != 606452 AND conversations.movie_id = 206;
  ```
- `metadata`: This folder contains the metadata of the dataset. There are six files in this folder: `conversations.csv`, `genres.csv`, `movies.csv`, `movies_genres.csv`, `speakers.csv` and `utterances.csv`. The content of each file is explained below.
  - `conversations.csv`: This file contains the metadata of conversations. Its columns are as follows:
    - `conversation_id`: type is `int`, the id of the conversation.
    - `movie_id`: type is `int`, the id of the movie.
  - `genres.csv`: This file contains the metadata of genres. Its columns are as follows:
    - `genre_id`: type is `int`, the id of the genre.
    - `genre`: type is `str`, the name of the genre.
  - `movies.csv`: This file contains the metadata of movies. Its columns are as follows:
    - `movie_id`: type is `int`, the id of the movie.
    - `movie_name`: type is `str`, the name of the movie.
    - `release_year`: type is `int`, the release year of the movie.
    - `rating`: type is `float`, the rating of the movie.
    - `votes`: type is `int`, the number of votes for the movie.
  - `movies_genres.csv`: This file contains the metadata of the relationship between movies and genres. Its columns are as follows:
    - `movie_id`: type is `int`, the id of the movie.
    - `genre_id`: type is `int`, the id of the genre.
  - `speakers.csv`: This file contains the metadata of speakers. Its columns are as follows:
    - `speaker_id`: type is `int`, the id of the speaker.
    - `character_name`: type is `str`, the name of the character.
    - `movie_id`: type is `int`, the id of the movie.
    - `gender`: type is `char`, the gender of the speaker in the movie.
    - `credit_pos`: type is `int`, the credit position of the speaker in the movie.
  - `utterances.csv`: This file contains the metadata of utterances. Its columns are as follows:
    - `utterance_id`: type is `int`, the id of the utterance.
    - `conversation_id`: type is `int`, the id of the conversation that the utterance belongs to.
    - `speaker_id`: type is `int`, the id of the speaker.
    - `reply_to`: type is `int`, the id of the utterance that the current utterance replies to; `-1` means the utterance is the first one in its conversation.
  When building the database, add a column `embedding` in table `utterances` to store the feature vectors from the `base_vectors.npy` file. Here is an example SQL script to create tables for this dataset:

  ```sql
  DROP TABLE if exists `movies`;
  CREATE TABLE `movies` (
      movie_id int,
      movie_name varchar(100),
      release_year int,
      rating float,
      votes int,
      primary key (movie_id)
  );

  DROP TABLE if exists `genres`;
  CREATE TABLE `genres` (
      genre_id int,
      genre varchar(20),
      primary key (genre_id)
  );

  DROP TABLE if exists `movies_genres`;
  CREATE TABLE `movies_genres` (
      movie_id int references movies(movie_id),
      genre_id int references genres(genre_id),
      foreign key (movie_id) references movies(movie_id),
      foreign key (genre_id) references genres(genre_id)
  );

  DROP TABLE if exists `speakers`;
  CREATE TABLE `speakers` (
      speaker_id int,
      character_name varchar(100),
      movie_id int references movies(movie_id),
      gender char(1),
      credit_pos int,
      primary key (speaker_id),
      foreign key (movie_id) references movies(movie_id)
  );

  DROP TABLE if exists `conversations`;
  CREATE TABLE `conversations` (
      conversation_id int,
      movie_id int references movies(movie_id),
      primary key (conversation_id),
      foreign key (movie_id) references movies(movie_id)
  );

  DROP TABLE if exists `utterances`;
  CREATE TABLE `utterances` (
      utterance_id int,
      conversation_id int references conversations(conversation_id),
      speaker_id int references speakers(speaker_id),
      reply_to int,
      embedding vector(768),
      primary key (utterance_id),
      foreign key (conversation_id) references conversations(conversation_id),
      foreign key (speaker_id) references speakers(speaker_id)
  );
  ```

  Combining a boolean expression with a query vector, an example query looks like this:

  ```sql
  SELECT DISTINCT utterances.utterance_id
  FROM utterances
  JOIN conversations ON utterances.conversation_id = conversations.conversation_id
  WHERE utterances.speaker_id > 8935 AND utterances.reply_to = -1
  ORDER BY L2Distance(input, embedding) LIMIT 100;
  ```
  The `DISTINCT` keyword in the select clause is needed because utterance ids may be duplicated after the JOINs (a movie may belong to more than one genre).
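  To experiment with the relational part of the workload without a vector database, the six metadata CSVs can be loaded into an in-memory SQLite database. A rough sketch; the `embedding` column is omitted because the vectors live in `base_vectors.npy`, so only the boolean-filter part of the example query is executed:

  ```python
  import sqlite3

  import pandas as pd

  root = "multi_table_movie_dataset/metadata"
  con = sqlite3.connect(":memory:")

  # One SQLite table per metadata CSV; column types are inferred by pandas.
  for name in ["movies", "genres", "movies_genres", "speakers", "conversations", "utterances"]:
      pd.read_csv(f"{root}/{name}.csv").to_sql(name, con, index=False)

  # The boolean-filter part of the example query above.
  sql = """
  SELECT DISTINCT utterances.utterance_id
  FROM utterances
  JOIN conversations ON utterances.conversation_id = conversations.conversation_id
  WHERE utterances.speaker_id > 8935 AND utterances.reply_to = -1
  """
  candidate_ids = pd.read_sql_query(sql, con)["utterance_id"]
  print(len(candidate_ids))
  ```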
- `vectors`: This folder contains the base vectors, learn vectors and test vectors. There are three files in this folder: `base_vectors.npy`, `learn_vectors.npy` and `test_vectors.npy`. Each file is a numpy array. Their shapes are `(284713, 768)`, `(10000, 768)` and `(10000, 768)` respectively. The first dimension is the number of vectors, and the second dimension is the dimensionality of each vector. The vectors were extracted from the texts in the Cornell Movie-Dialogs Corpus using the pretrained `bert-base-uncased` model from the transformers library. The rows in each numpy array correspond to the rows in the matching CSV file, based on which the database can be built and the queries issued.
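  A quick sanity check of the stated row correspondence between `base_vectors.npy` and `utterances.csv`:

  ```python
  import numpy as np
  import pandas as pd

  root = "multi_table_movie_dataset"
  base = np.load(f"{root}/vectors/base_vectors.npy")
  utterances = pd.read_csv(f"{root}/metadata/utterances.csv")

  # Row i of base_vectors.npy is the embedding of the utterance in row i of utterances.csv.
  assert base.shape == (len(utterances), 768)
  print(base.shape)
  ```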