Retrieval Operations

The platform supports three retrieval modes: semantic (vector) search, full-text (keyword) search, and hybrid search, which combines the two.

Sync Data to Retrieval Engine

Before performing retrieval, sync your data to the retrieval engine (Neo4j graph database):

# Sync all data to retrieval
sync_task = warehouse.run_sync_to_retrieval()
result = sync_task.wait_for_completion()
print(f"Sync status: {result['status']}")

# Or sync specific tables
chunks_table = warehouse.get_table("chunks")
sync_task = chunks_table.push_to_retrieval()
sync_task.wait_for_completion()

Semantic Search (Vector Search)

Retrieve relevant content using embedding-based similarity:

# Semantic search on chunks
results = warehouse.retrieve_with_semantic_search(
    query="What are the payment terms?",
    n_objects=10,  # Number of results
    indexes=["chunks"],  # Search in chunks
    reformulate_query=False,  # Set to True to have an LLM reformulate the query
    rerank=True  # Apply a reranker to the retrieved results
)

# Access results
for result in results:
    print(f"Score: {result['score']}")
    print(f"Content: {result['content'][:200]}...")
    print(f"Document: {result.get('name', 'N/A')}")
    print("---")
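
To make the idea of embedding-based similarity more concrete, here is a minimal sketch that ranks toy chunks by cosine similarity. It is illustrative only: the vectors are made up, `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and the platform's actual embedding model and scoring are not described here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, chunks, n_objects=10):
    """Return the n_objects chunks most similar to the query, best first."""
    scored = [
        {"content": c["content"], "score": cosine_similarity(query_vec, c["embedding"])}
        for c in chunks
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:n_objects]


# Toy 3-dimensional embeddings (hypothetical values, for illustration only)
chunks = [
    {"content": "Payment is due within 30 days.", "embedding": [0.9, 0.1, 0.0]},
    {"content": "The office is closed on Sundays.", "embedding": [0.0, 0.2, 0.9]},
]
top = rank_by_similarity([1.0, 0.0, 0.0], chunks, n_objects=1)
print(top[0]["content"])
```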

Full-Text Search

Retrieve content using keyword-based search:

# Full-text search
results = warehouse.retrieve_with_full_text_search(
    query="invoice payment terms",
    indexes=["chunks"],
    n_objects=10
)

Hybrid Search

Combine semantic and full-text search, merging the two rankings with Reciprocal Rank Fusion:

# Hybrid search (recommended)
results = warehouse.retrieve_with_hybrid_search(
    query="payment terms and conditions",
    indexes=["chunks"],
    n_objects=10,
    rrf_k=60,  # Reciprocal Rank Fusion parameter
    rerank=True
)
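
The `rrf_k` parameter refers to Reciprocal Rank Fusion, in which each item's fused score is the sum of 1 / (k + rank) over the rankings it appears in. The sketch below applies that standard formula to two toy rankings. How the platform applies it internally is an assumption, and `reciprocal_rank_fusion` is a hypothetical helper, not part of the client.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list of (id, score), best first.

    Ranks start at 1; an id missing from a ranking contributes nothing for it.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy rankings (hypothetical chunk ids)
semantic = ["c1", "c2", "c3"]
full_text = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([semantic, full_text], k=60)
print([doc_id for doc_id, _ in fused])
```

A larger `k` reduces the gap between the contribution of a top-ranked item and a lower-ranked one, so agreement between the two rankings matters more than position in either one.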

Search with Filters

Apply filters to narrow down results:

# Filter by tags
results = warehouse.retrieve_with_semantic_search(
    query="financial data",
    n_objects=10,
    indexes=["chunks"],
    included_tags={"document_type": "invoice"},  # Only invoices
    excluded_tags={"status": "draft"}  # Exclude drafts
)

# Filter by documents
results = warehouse.retrieve_with_semantic_search(
    query="contract terms",
    n_objects=10,
    indexes=["chunks"],
    selected_documents=[{"id": "doc-id-1"}, {"id": "doc-id-2"}]
)

# Filter by objects
results = warehouse.retrieve_with_semantic_search(
    query="high value items",
    n_objects=10,
    indexes=["objects"],
    included_objects=[{"table_name": "invoices"}]
)

Generate Answers with Retrieval

Get AI-generated answers based on retrieved context:

# Generate answer with semantic search
answer = warehouse.generate_answer_with_semantic_search(
    question="What are the payment terms in the contract?",
    query=None,  # If None, uses the question as query
    n_objects=10,
    indexes=["chunks"]
)

print(f"Answer: {answer}")

# Generate answer with hybrid search (recommended)
answer = warehouse.generate_answer_with_hybrid_search(
    question="What are the payment terms?",
    query="payment terms conditions",
    n_objects=10,
    indexes=["chunks"],
    rrf_k=60
)
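
Conceptually, answer generation places the retrieved content into a prompt for a language model. The sketch below shows one way such a prompt could be assembled from search results shaped like those in the earlier examples (`content` and `name` keys). The prompt wording and the `build_prompt` helper are hypothetical, and the platform's own prompt is not documented here.

```python
def build_prompt(question, results, max_chars=500):
    """Assemble a grounded prompt from retrieved results (hypothetical format)."""
    context_blocks = []
    for i, result in enumerate(results, start=1):
        snippet = result["content"][:max_chars]  # Truncate long chunks
        source = result.get("name", "N/A")
        context_blocks.append(f"[{i}] ({source}) {snippet}")
    context = "\n".join(context_blocks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Stand-in for retrieved results (made-up data)
results = [{"content": "Payment is due within 30 days.", "name": "contract.pdf"}]
prompt = build_prompt("What are the payment terms?", results)
print(prompt)
```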

Cypher Queries

For advanced use cases, query the knowledge graph directly:

# Execute a Cypher query
cypher_query = """
MATCH (c:Chunk)-[:BELONGS_TO]->(d:Document)
WHERE d.name CONTAINS 'invoice'
RETURN c.content, d.name
LIMIT 10
"""

results = warehouse.retrieve_with_cypher(cypher_query)

# Use templated queries
template = """
MATCH (c:Chunk)-[:BELONGS_TO]->(d:Document)
WHERE d.name CONTAINS $doc_name
RETURN c.content
LIMIT $limit
"""

results = warehouse.retrieve_with_templated_query(
    query_template=template,
    template_variables={"doc_name": "invoice", "limit": 10}
)
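
Before sending a templated query, it can help to check that every `$placeholder` has a value in `template_variables`. The sketch below is a client-side check only: `missing_template_variables` is a hypothetical helper, not part of the client, and the regular expression assumes simple `$name` placeholders like those in the template above.

```python
import re


def missing_template_variables(query_template, template_variables):
    """Return placeholder names used in the template but absent from the variables."""
    placeholders = set(re.findall(r"\$([A-Za-z_][A-Za-z0-9_]*)", query_template))
    return sorted(placeholders - set(template_variables))


template = """
MATCH (c:Chunk)-[:BELONGS_TO]->(d:Document)
WHERE d.name CONTAINS $doc_name
RETURN c.content
LIMIT $limit
"""

# "limit" has no value here, so it is reported as missing
print(missing_template_variables(template, {"doc_name": "invoice"}))
```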

Filtering with Attributes

When searching the objects index, you can use an attributes filter to restrict the search to objects with specific attribute values. With included_attributes, only objects that have the given attribute values are retrieved; excluded_attributes removes objects that match.

The attributes filter is only available for objects that have been pushed to retrieval with push_to_retrieval. Attribute values are case-sensitive. The operator can be AND (every listed value must match) or OR (at least one must match).

# Assume the objects follow these Pydantic models:
#
# from pydantic import BaseModel
# class Benefits(BaseModel):
#     healthcare: str
#
# class JobDescription(BaseModel):
#     work_type: str
#     role_type: str
#     location: str
#     benefits: Benefits

included_attributes = {
    "values": [
        {"work_type": "remote"},
        {"role_type": "individual contributor"},
    ],
    "operator": "AND",
}

excluded_attributes = {
    "values": [
        {"location": "Not Specified"},
        {"healthcare": "Not Specified"},
    ],
    "operator": "OR",
}

import pandas as pd

results = warehouse.retrieve_with_semantic_search(
    query="Job descriptions for data engineers",
    n_objects=5,
    indexes=["objects"],
    included_attributes=included_attributes,
    excluded_attributes=excluded_attributes
)

# Load the retrieved objects into a DataFrame for inspection
retrieved_objects = pd.DataFrame(results)
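
The AND/OR semantics described above can be sketched client-side on plain dictionaries. This only illustrates the described behavior: `matches_attributes` and `keep_object` are hypothetical helpers, the default operator is an assumption, and the objects are kept flat for simplicity, so how the platform resolves nested fields such as `benefits.healthcare` is not covered.

```python
def matches_attributes(obj, attribute_filter):
    """True if obj satisfies the filter's values under its AND/OR operator."""
    checks = [
        obj.get(key) == value  # Exact, case-sensitive comparison
        for condition in attribute_filter["values"]
        for key, value in condition.items()
    ]
    # Assumption: treat a missing operator as AND
    if attribute_filter.get("operator", "AND") == "AND":
        return all(checks)
    return any(checks)


def keep_object(obj, included=None, excluded=None):
    """Keep obj if it matches the included filter and not the excluded one."""
    if included is not None and not matches_attributes(obj, included):
        return False
    if excluded is not None and matches_attributes(obj, excluded):
        return False
    return True


included = {
    "values": [{"work_type": "remote"}, {"role_type": "individual contributor"}],
    "operator": "AND",
}
excluded = {"values": [{"location": "Not Specified"}], "operator": "OR"}

# Made-up objects: the second differs only in case, the third is excluded
jobs = [
    {"work_type": "remote", "role_type": "individual contributor", "location": "Paris"},
    {"work_type": "Remote", "role_type": "individual contributor", "location": "Paris"},
    {"work_type": "remote", "role_type": "individual contributor", "location": "Not Specified"},
]
kept = [job for job in jobs if keep_object(job, included, excluded)]
print(len(kept))
```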
