Retrieval

Overview

Retrieval operations allow you to search, query, and access the processed data stored in your Clarifeye warehouses. Clarifeye provides multiple retrieval methods:

Semantic search for finding relevant objects or chunks
Full-text search for finding relevant objects or chunks
Hybrid search, combining semantic and full-text searches, for finding relevant objects or chunks
Query-based retrieval (Cypher)
Direct retrieval from the tables

All of these retrieval methods can also be used in the Natural Language Retrieval Tab.

Retrieval Methods

Semantic Search

You can use semantic search to find chunks or extracted objects that are semantically close to the query.

import pandas as pd

# `client` is assumed to be an already-initialized Clarifeye client
warehouse = client.get_warehouse("your_warehouse_id")
results = warehouse.retrieve_with_semantic_search("Job descriptions for data engineers",
                                                 n_objects=5,
                                                 indexes=["objects", "chunks"])
retrieved_objects = pd.DataFrame(results)

Semantic search is only available for objects and chunks that have been pushed to retrieval (see push_to_retrieval).

With semantic search, results are sorted by relevance score (cosine similarity), and this score is available in the score field of each retrieved object. The results also contain a reference URL pointing to the Clarifeye viewer, where users can explore the source of the retrieved objects.
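To make the score field more concrete, here is a minimal sketch of ranking by cosine similarity. It is not Clarifeye's implementation: the two-dimensional vectors are toy stand-ins for real embeddings, and the helper names cosine_similarity and rank_by_similarity are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, items):
    """items: list of (label, vector). Returns (label, score), best first."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy "embeddings"; real ones have hundreds of dimensions.
items = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
ranking = rank_by_similarity([1.0, 0.0], items)
```

Vectors pointing in the same direction as the query score close to 1, and orthogonal vectors score 0, which is why a higher score means a closer match.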

Full Text Search

You can also use full text search to retrieve data from the warehouse.

results = warehouse.retrieve_with_full_text_search("Job descriptions for data engineers",
                                                   n_objects=5,
                                                   indexes=["objects", "chunks"])
retrieved_objects = pd.DataFrame(results)

Full text search is only available for objects and chunks that have been pushed to retrieval (see push_to_retrieval).

Full text search uses the BM25 ranking function to sort the results. BM25 scores are non-negative real numbers: a score of 0 means the document contains none of the query terms, and higher scores indicate greater relevance to the query. The score is available in the score field of each retrieved object. The results also contain a reference URL pointing to the Clarifeye viewer, where users can explore the source of the retrieved objects.
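The properties above (non-negative scores, 0 when no query term matches) can be illustrated with a small BM25 sketch. This is an assumption-laden illustration, not the engine Clarifeye runs: it uses whitespace tokenization, the Lucene-style IDF variant log(1 + ...), which cannot go negative, and the commonly used defaults k1=1.2 and b=0.75. The function name bm25_scores is hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a Lucene-style BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(query.lower().split())
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # absent terms contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "data engineer with spark experience",
    "frontend developer react",
    "senior data engineer data pipelines",
]
scores = bm25_scores("data engineer", docs)
```

The second document shares no term with the query, so its score is exactly 0; the other two receive positive scores.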

Hybrid Search

You can also use hybrid search to retrieve data from the warehouse. This will combine the semantic search and the full text search.

results = warehouse.retrieve_with_hybrid_search("Job descriptions for data engineers",
                                                n_objects=5,
                                                indexes=["objects", "chunks"])
retrieved_objects = pd.DataFrame(results)

Hybrid search is only available for objects and chunks that have been pushed to retrieval (see push_to_retrieval).

Hybrid search uses the Reciprocal Rank Fusion (RRF) algorithm to combine the two rankings and sort the results. Each retrieved object has three score values:

score: semantic similarity score (cosine similarity)
fulltext_score: full text score (BM25)
rrf_score: combined score (RRF), used to sort the final results
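As a rough illustration of how RRF merges the two rankings, the sketch below sums 1 / (k + rank) for each item across the input rankings. The constant k=60 is the value commonly used in the RRF literature; the value Clarifeye uses is not stated here, so treat it as an assumption. The function name reciprocal_rank_fusion is hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of id lists, each ordered best first.

    Returns (id, rrf_score) pairs sorted by descending fused score.
    """
    fused = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            fused[item_id] = fused.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

semantic_ranking = ["a", "b", "c"]   # e.g. ordered by cosine similarity
fulltext_ranking = ["b", "d", "a"]   # e.g. ordered by BM25
fused = reciprocal_rank_fusion([semantic_ranking, fulltext_ranking])
```

Because RRF only looks at ranks, it can combine cosine similarities and BM25 scores even though the two live on different scales. Items that appear near the top of both lists ("b" and "a" here) end up ahead of items that appear in only one.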

Filtering with tags

You can use tag filters to restrict the search to chunks and objects carrying a specific tag. In that case:

only the chunks that have been tagged with the given tag will be retrieved.
only the objects that belong to a chunk that has been tagged with the given tag will be retrieved.

If multiple tags are provided, the chunks and objects that have been tagged with any of the tags will be retrieved.

results = warehouse.retrieve_with_full_text_search("Job descriptions for data engineers",
                                                   n_objects=5,
                                                   indexes=["objects", "chunks"],
                                                   included_tags=[{"parent": "job-category", "children": "data-engineer"}])
retrieved_objects = pd.DataFrame(results)

You can also use the excluded_tags parameter to exclude chunks or objects that have been tagged with the given tag.

results = warehouse.retrieve_with_semantic_search("Job descriptions for data engineers",
                                                  n_objects=5,
                                                  indexes=["objects", "chunks"],
                                                  excluded_tags=[{"parent": "job-category", "children": "data-engineer"}])
retrieved_objects = pd.DataFrame(results)
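The include and exclude semantics can be sketched locally on plain dictionaries. This is only an illustration of the "any of the tags" rule described above, not Clarifeye's server-side logic: the tags key on each item and the helper names are hypothetical, and applying the same any-match rule to excluded_tags is an assumption.

```python
def matches_any(item_tags, filter_tags):
    """True if the item carries at least one of the filter tags."""
    wanted = {(t["parent"], t["children"]) for t in filter_tags}
    return any((t["parent"], t["children"]) in wanted for t in item_tags)

def filter_by_tags(items, included_tags=None, excluded_tags=None):
    """Keep items matching any included tag and none of the excluded tags."""
    kept = []
    for item in items:
        tags = item.get("tags", [])  # hypothetical field name
        if included_tags and not matches_any(tags, included_tags):
            continue
        if excluded_tags and matches_any(tags, excluded_tags):
            continue
        kept.append(item)
    return kept

data_engineer = {"parent": "job-category", "children": "data-engineer"}
designer = {"parent": "job-category", "children": "designer"}
chunks = [
    {"id": 1, "tags": [data_engineer]},
    {"id": 2, "tags": [designer]},
    {"id": 3, "tags": []},
]
only_data_engineers = filter_by_tags(chunks, included_tags=[data_engineer])
without_data_engineers = filter_by_tags(chunks, excluded_tags=[data_engineer])
```

Note that an untagged chunk is dropped by an include filter but survives an exclude filter.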

Generate the answer

You can generate answers based on the data returned by the different retrieval methods. The first option is to leverage the functions generate_answer_with_semantic_search, generate_answer_with_full_text_search, generate_answer_with_hybrid_search.

answer = warehouse.generate_answer_with_full_text_search(query="Job descriptions for data engineers",
                                                         question="What are the key required skills for data engineers?",
                                                         n_objects=5,
                                                         indexes=["objects", "chunks"],
                                                         included_tags=[{"parent": "job-category", "children": "data"}])

The query is the natural language query used to retrieve the data, whereas the question is the question used to generate the answer. The remaining parameters are the same as the ones used in the retrieval methods.

This option can be tested in the Natural Language Retrieval Tab: once your retrieval is done, click the “Answer Question” button to generate the answer.

The second option is to pass the retrieved data directly to an LLM with the prompt you would like to use.
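For the second option, a prompt can be assembled from the retrieved results before calling the LLM of your choice. The sketch below makes no assumption about the fields in each result: it serializes every result as JSON. The function name build_prompt, the prompt wording, and the sample results are all hypothetical.

```python
import json

def build_prompt(question, results, max_results=5):
    """Build a context-grounded prompt from a list of retrieved results."""
    context_blocks = []
    for i, result in enumerate(results[:max_results], start=1):
        # default=str keeps the sketch robust to non-JSON types such as dates.
        serialized = json.dumps(result, ensure_ascii=False, default=str)
        context_blocks.append(f"[{i}] {serialized}")
    context = "\n".join(context_blocks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Made-up results standing in for the output of a retrieval call.
sample_results = [
    {"score": 0.91, "text": "Requires SQL and Spark."},
    {"score": 0.85, "text": "Experience with Airflow is a plus."},
    {"score": 0.40, "text": "Office located in Paris."},
]
prompt = build_prompt("What are the key required skills for data engineers?",
                      sample_results, max_results=2)
```

The resulting string can then be sent to any LLM client; numbering the context entries makes it easier to ask the model to cite its sources.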

Additional options

For each of the provided retrieval methods, you can also use some of the following advanced options:

enrich_with_chunks

Controls whether the source chunk of each object is included in the results. When enabled with enrich_with_chunks=True, the chunk content is available under the source_chunk_content key of each result.

rerank

You can improve the relevance and quality of retrieved results by passing rerank=True to any of the three search methods. Reranking uses Cohere’s rerank-v3.5 model. When enabled, each result contains a relevance_score ranging from 0 (low relevance) to 1 (high relevance). Reranking typically adds the most value when the query:

is long and complex
includes ambiguous or polysemous terms (e.g. “AI jobs involving data labeling”)
has multiple intents or facets (e.g. “Remote React jobs with equity at seed-stage companies”)
needs matching with implicit semantics (e.g. “What jobs suit someone with NLP research experience?”)
contains negative or comparative conditions (e.g. “Jobs that are not junior-level”)
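Once results carry a relevance_score, one possible post-processing step is to drop weakly relevant results before using them. The sketch below works on plain dictionaries; the 0.5 threshold and the helper name keep_relevant are arbitrary illustrative choices, not recommendations from Clarifeye.

```python
def keep_relevant(results, min_relevance=0.5):
    """Keep results whose relevance_score meets the threshold, best first."""
    kept = [r for r in results if r.get("relevance_score", 0.0) >= min_relevance]
    return sorted(kept, key=lambda r: r["relevance_score"], reverse=True)

# Made-up reranked results for illustration.
reranked = [
    {"id": "x", "relevance_score": 0.35},
    {"id": "y", "relevance_score": 0.92},
    {"id": "z", "relevance_score": 0.61},
]
relevant = keep_relevant(reranked)
```

A suitable threshold depends on the data and the queries, so it is worth inspecting a few score distributions before fixing one.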

reformulate_query

Clarifeye supports HyDE (Hypothetical Document Embeddings), a technique that generates a hypothetical answer to your query and uses that answer to retrieve objects. When enabled, your query is first sent to OpenAI’s gpt-4o-mini endpoint to generate hypothetical answers. You can enable this by passing reformulate_query=True. Typical queries that benefit the most from HyDE are:

open-domain questions (e.g. “Which jobs involve working with distributed systems?”)
long-tail or niche queries (e.g. “Roles where you need experience with WebAssembly”)
underspecified queries (e.g. “Jobs where I can work on core systems”)
queries containing jargon (e.g. “Positions that follow ISO 27001 compliance”)
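The HyDE flow can be summarized in a few lines: generate a hypothetical answer, embed that answer instead of the raw query, then search with the resulting vector. The sketch below uses toy stand-ins (a canned answer, a bag-of-words "embedding", and word-overlap search) so that it runs without any external service. All names are hypothetical, and a real setup would plug in an LLM, an embedding model, and a vector index.

```python
def hyde_retrieve(query, generate_answer, embed, search, n_objects=5):
    """Retrieve with the embedding of a hypothetical answer, not of the query."""
    hypothetical_answer = generate_answer(query)  # an LLM call in a real system
    answer_vector = embed(hypothetical_answer)
    return search(answer_vector, n_objects)

# Toy corpus and stand-in components, for illustration only.
corpus = {
    "doc1": "backend engineer building distributed systems with kafka",
    "doc2": "graphic designer creating marketing campaigns",
}

def toy_generate(query):
    return "backend engineer working on distributed systems such as kafka clusters"

def toy_embed(text):
    return set(text.lower().split())  # bag of words instead of a dense vector

def toy_search(vector, n_objects):
    overlap = {doc_id: len(vector & toy_embed(text)) for doc_id, text in corpus.items()}
    return sorted(overlap, key=overlap.get, reverse=True)[:n_objects]

top = hyde_retrieve("Which jobs involve working with distributed systems?",
                    toy_generate, toy_embed, toy_search, n_objects=1)
```

The hypothetical answer is written in the vocabulary of the documents themselves, which is why it tends to land closer to relevant documents than a short or underspecified question does.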

Structured Queries

You can also use Cypher queries to retrieve data from the warehouse.

# Node labels are scoped to the warehouse: "p" + warehouse ID with dashes replaced by underscores
warehouse_identifier = "p" + warehouse.warehouse_id.replace("-", "_")
query = f"""
MATCH (a:{warehouse_identifier}:Attribute)-[r]-(o:{warehouse_identifier}:JobDescription)
WHERE a._type = "job_title_job_descriptions"
AND a.value = "Data Engineer"
RETURN o
"""
result = warehouse.retrieve_with_cypher(query)
retrieved_objects = pd.DataFrame(result["retrieval_results"])

The above query will return all the job descriptions that have a job title attribute with the value “Data Engineer”.
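If you build such queries often, the label construction and the query template can be wrapped in small helpers. This is a sketch based only on the pattern shown in the example above: the helper names are hypothetical, the warehouse ID is made up, and the backslash escaping of quotes follows standard Cypher string-literal rules.

```python
def warehouse_label(warehouse_id):
    """Node label used for a warehouse: 'p' + ID with dashes replaced by underscores."""
    return "p" + warehouse_id.replace("-", "_")

def attribute_match_query(warehouse_id, object_label, attribute_type, value):
    """Cypher query returning objects linked to an attribute with the given value."""
    label = warehouse_label(warehouse_id)
    # Escape backslashes and double quotes so the value stays inside the literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"MATCH (a:{label}:Attribute)-[r]-(o:{label}:{object_label})\n"
        f'WHERE a._type = "{attribute_type}"\n'
        f'AND a.value = "{escaped}"\n'
        "RETURN o"
    )

query = attribute_match_query(
    "1234-abcd-5678",  # made-up warehouse ID
    "JobDescription",
    "job_title_job_descriptions",
    "Data Engineer",
)
```

The resulting string can be passed to retrieve_with_cypher in the same way as the hand-written query.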

Cypher queries are only available for objects and chunks that have been pushed to retrieval (see push_to_retrieval).

The results contain a reference URL pointing to the Clarifeye viewer, where users can explore the source of the retrieved objects. You can also generate the Cypher query from a natural-language instruction using the text-to-Cypher feature:

instruction = "All the documents which contain chunks about financial data"

response = warehouse.generate_cypher_query(instruction)
cypher_query = response["generated_cypher_query"]

Direct retrieval from the tables

The extracted objects and chunks can also be retrieved directly from the tables. This is useful if you want to do some advanced Python processing before adding the data to the LLM context.

table = warehouse.get_table("parsed_documents")
parsed_documents = table.get_data()

table = warehouse.get_table("chunks")
chunks = table.get_data()

table = warehouse.get_table("job_descriptions")
job_descriptions = table.get_data()
