Tutorial: Build a Scalable Question Answering System
Last Updated: April 26, 2023
- Level: Beginner
- Time to complete: 20 minutes
- Nodes Used: ElasticsearchDocumentStore, BM25Retriever, FARMReader
- Goal: After completing this tutorial, you’ll have built a scalable search system that runs on text files and can answer questions about Game of Thrones. You’ll then be able to expand this system for your needs.
Overview
Learn how to set up a question answering system that can search through complex knowledge bases and highlight answers to questions such as “Who is the father of Arya Stark?”. In this tutorial, we’ll work on a set of Wikipedia pages about Game of Thrones, but you can adapt it to search through internal wikis or a collection of financial reports, for example.
This tutorial introduces you to all the concepts needed to build such a question answering system. It also uses Haystack components, such as indexing pipelines, querying pipelines, and DocumentStores backed by external database services.
Let’s learn how to build a question answering system and discover more about the marvelous seven kingdoms!
Preparing the Colab Environment
Installing Haystack
To start, let’s install the latest release of Haystack with pip:
%%bash
pip install --upgrade pip
pip install farm-haystack[colab,preprocessing,elasticsearch]
Enabling Telemetry
Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following lines. See Telemetry for more details.
from haystack.telemetry import tutorial_running
tutorial_running(3)
Set the logging level to INFO:
import logging
logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
Initializing the ElasticsearchDocumentStore
A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. Here, we’re using the ElasticsearchDocumentStore, which connects to a running Elasticsearch service. It’s a fast and scalable text-focused storage option. This service runs independently from Haystack and persists even after the Haystack program has finished running. To learn more about the DocumentStore and the different types of external databases that we support, see DocumentStore.
- Download, extract, and set the permissions for the Elasticsearch installation archive:
%%bash
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2
- Start the server:
%%bash --bg
sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch
If you are working in an environment where Docker is available, you can also start Elasticsearch using Docker. You can do this manually or by using our launch_es() utility function.
- Wait 30 seconds for the server to fully start up:
import time
time.sleep(30)
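A fixed 30-second sleep usually works in Colab, but a readiness poll is more robust. Below is a minimal, generic sketch (not part of Haystack) that retries a health-check callable until it succeeds or a timeout elapses; for Elasticsearch, the callable could attempt an HTTP request to localhost:9200 and return True on success:

```python
import time

def wait_until_ready(check, timeout=60.0, interval=2.0):
    """Call `check` repeatedly until it returns True or `timeout` elapses.

    `check` is any zero-argument callable returning True once the
    service is ready, e.g. one that pings http://localhost:9200.
    Returns True if the service became ready, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```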
- Initialize the ElasticsearchDocumentStore:
import os
from haystack.document_stores import ElasticsearchDocumentStore
# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")
document_store = ElasticsearchDocumentStore(
    host=host,
    username="",
    password="",
    index="document"
)
ElasticsearchDocumentStore is up and running and ready to store the Documents.
Indexing Documents with a Pipeline
The next step is adding the files to the DocumentStore. The indexing pipeline turns your files into Document objects and writes them to the DocumentStore. Our indexing pipeline will have two nodes: TextConverter, which turns .txt files into Haystack Document objects, and PreProcessor, which cleans and splits the text within a Document.
Once we combine these nodes into a pipeline, the pipeline will ingest .txt file paths, preprocess them, and write them into the DocumentStore.
- Download 517 articles from the Game of Thrones Wikipedia. You can find them in data/build_a_scalable_question_answering_system as a set of .txt files.
from haystack.utils import fetch_archive_from_http
doc_dir = "data/build_a_scalable_question_answering_system"
fetch_archive_from_http(
url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt3.zip",
output_dir=doc_dir
)
- Initialize the pipeline, TextConverter, and PreProcessor:
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor
indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)
To learn more about the parameters of the PreProcessor, see Usage. To understand why document splitting is important for your question answering system’s performance, see Document Length.
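To make the split_length and split_overlap settings concrete, here is a simplified sketch of word-based splitting with overlap (it ignores the sentence-boundary handling and cleaning steps that the real PreProcessor performs):

```python
def split_by_word(text, split_length=200, split_overlap=20):
    """Split `text` into chunks of `split_length` words, where each
    chunk shares `split_overlap` words with the previous one."""
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which helps the Reader later on.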
- Add the nodes into an indexing pipeline. You should provide the name or names of the preceding nodes as the inputs argument. Note that in an indexing pipeline, the input to the first node is File.
indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])
- Run the indexing pipeline to write the text data into the DocumentStore:
files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)
The code in this tutorial uses Game of Thrones data, but you can also supply your own .txt
files and index them in the same way.
As an alternative, you can cast your text data into Document objects and write them into the DocumentStore using DocumentStore.write_documents().
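As a rough sketch of that alternative, write_documents() also accepts plain dictionaries with a content field and optional meta. The helper below (an illustration, not a Haystack utility) builds such a list from a directory of .txt files:

```python
import os

def txt_files_to_dicts(doc_dir):
    """Read every .txt file in `doc_dir` into a dict of the shape
    that DocumentStore.write_documents() can ingest."""
    docs = []
    for file_name in sorted(os.listdir(doc_dir)):
        if not file_name.endswith(".txt"):
            continue
        path = os.path.join(doc_dir, file_name)
        with open(path, encoding="utf-8") as f:
            docs.append({"content": f.read(), "meta": {"name": file_name}})
    return docs
```

You could then call document_store.write_documents(txt_files_to_dicts(doc_dir)), though note this skips the cleaning and splitting that the indexing pipeline performs.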
Now that the Documents are in the DocumentStore, let’s initialize the nodes we want to use in our query pipeline.
Initializing the Retriever
Our query pipeline is going to use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only those that are relevant to the question. This tutorial uses the BM25Retriever. This is the recommended Retriever for a question answering system like the one we’re creating. For more Retriever options, see Retriever.
from haystack.nodes import BM25Retriever
retriever = BM25Retriever(document_store=document_store)
The BM25Retriever is initialized and ready for the pipeline.
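For intuition about what BM25 ranking does, here is a tiny self-contained sketch of the scoring formula. The real retriever delegates this to Elasticsearch; the parameter values k1=1.5 and b=0.75 are common defaults, not Haystack-specific settings:

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query tokens.

    Terms that are rare across the corpus (high IDF) and frequent in a
    document (high TF) raise that document's score; the denominator
    damps scores for unusually long documents.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for doc in docs:
        score = 0.0
        for term in query:
            df = sum(1 for d in docs if term in d)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores
```

Because this is pure term matching, it is fast enough to sift through all the Documents before the slower Reader sees any of them.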
Initializing the Reader
Our query pipeline also needs a Reader, so we’ll initialize it next. A Reader scans the texts it receives from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. This tutorial uses a FARMReader with a base-sized RoBERTa question answering model called deepset/roberta-base-squad2. It’s a good all-round model to start with. To find a model that’s best for your use case, see Models.
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
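Conceptually, an extractive Reader scores every candidate start and end token in a passage and returns the best-scoring span. A toy sketch of that final span-selection step (the real model computes the scores; the numbers in the test below are made up):

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the (start, end) token pair with the highest combined
    score, subject to start <= end and a maximum span length."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best, best_score
```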
Creating the Retriever-Reader Pipeline
You can combine the Reader and Retriever in a querying pipeline using the Pipeline class. The combination of the two speeds up processing because the Reader only processes the Documents that it receives from the Retriever.
Initialize the Pipeline object and add the Retriever and Reader as nodes. You should provide the name or names of the preceding nodes as the inputs argument. Note that in a querying pipeline, the input to the first node is Query.
from haystack import Pipeline
querying_pipeline = Pipeline()
querying_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
querying_pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])
That’s it! Your pipeline’s ready to answer your questions!
Asking a Question
- Use the pipeline’s run() method to ask a question. The query argument is where you type your question. Additionally, you can set the number of documents you want the Reader and Retriever to return using the top_k parameter. To learn more about setting arguments, see Arguments. To understand the importance of the top_k parameter, see Choosing the Right top-k Values.
prediction = querying_pipeline.run(
    query="Who is the father of Arya Stark?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)
Here are some questions you could try out:
- Who is the father of Arya Stark?
- Who created the Dothraki vocabulary?
- Who is the sister of Sansa?
- Print out the answers the pipeline returns:
from pprint import pprint
pprint(prediction)
- Simplify the printed answers:
from haystack.utils import print_answers
print_answers(
    prediction,
    details="minimum"  # Choose from `minimum`, `medium`, and `all`
)
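If you want the answer texts programmatically rather than printed, the prediction is a dictionary whose answers entry holds the ranked answers. A small sketch of pulling out the top strings; it accepts either objects exposing an .answer attribute (as Haystack’s Answer objects do) or plain dicts, which the test below uses as a stand-in:

```python
def top_answers(prediction, n=3):
    """Return the top-n answer texts from a pipeline prediction."""
    answers = prediction.get("answers", [])[:n]
    return [a["answer"] if isinstance(a, dict) else a.answer for a in answers]
```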
And there you have it! Congratulations on building a scalable, machine learning-based question answering system!
Next Steps
To learn how to improve the performance of the Reader, see Fine-Tune a Reader.