Batch Add Documents

This guide explains how to add a batch of documents to your semantic index.

Overview

Gainly's API supports adding one document at a time through the Add Document endpoint.

To efficiently add multiple documents, you can implement:

  • A simple batching mechanism, and
  • Rate limiting to avoid exceeding the API rate limit

Implementation

The implementation requires two main components:

  1. A function to add individual documents
  2. A batch processing function that handles multiple documents with rate limiting

Here's a complete example in Python:

import time
import requests
from typing import Dict, List

def add_document(title: str, content: str) -> Dict:
    """
    Add a document to Gainly's index.

    Args:
        title: Document title (required)
        content: Document content
    """
    BASE_URL = "https://api.gainly.ai"
    VERSION = "v20241104"
    headers = {
        "Content-Type": "application/json",
        "X-API-Key": "YOUR_API_KEY_HERE"  # Replace with your actual API key
    }

    response = requests.post(
        f"{BASE_URL}/{VERSION}/documents",
        headers=headers,
        json={
            "title": title,
            "content": content
        }
    )
    return response.json()

def bulk_add_documents(documents: List[Dict[str, str]], delay_seconds: float = 10.0) -> List[str]:
    """
    Add multiple documents with rate limiting.

    Args:
        documents: List of dicts with 'title' and 'content' keys
        delay_seconds: Delay between API calls, in seconds (default 10.0, suitable for the Free Plan)

    Returns:
        List[str]: List of document IDs that were successfully added
    """
    document_ids = []
    for doc in documents:
        try:
            result = add_document(title=doc['title'], content=doc['content'])
            if isinstance(result, dict) and 'id' in result:
                document_ids.append(result['id'])
        except Exception as e:
            print(f"Error adding document '{doc['title']}': {e}")
        time.sleep(delay_seconds)  # Rate limiting delay (applied even after errors)
    return document_ids

Usage Example

Here's how to use the batch processing function:

# Example documents to add
documents_to_add = [
    {
        "title": "Product Features",
        "content": "Gainly offers both lexical and AI semantic search capabilities..."
    },
    {
        "title": "Getting Started",
        "content": "To begin using Gainly, first obtain an API key from your dashboard..."
    },
    {
        "title": "API Reference",
        "content": "The Gainly API provides RESTful endpoints for search and document management..."
    }
]

# Add documents in batch
results = bulk_add_documents(documents_to_add)
print(f"Successfully added {len(results)} documents")

Best Practices

Rate Limiting

The sample implementation uses a default delay of 10 seconds between requests, which is appropriate for the Free Plan.

Once you upgrade to a Paid Plan, you can adjust this value to 1 second or even lower.
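
For example, once you're on a plan with a higher rate limit, you can pass a shorter delay explicitly:

# On a Paid Plan, a shorter delay between requests is typically safe
results = bulk_add_documents(documents_to_add, delay_seconds=1.0)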

Batch Size & Chunking

In practice, you will often need to add far more documents than you can comfortably pass to the bulk_add_documents function defined earlier in a single call, due to memory and processing time constraints.

You can address this by splitting large batches into smaller chunks.

Here's how you can implement chunking:

def process_in_chunks(documents: List[Dict[str, str]], chunk_size: int = 100) -> List[str]:
    """
    Process documents in smaller chunks to manage memory and processing time.

    Args:
        documents: List of documents to process
        chunk_size: Number of documents to process in each chunk

    Returns:
        List[str]: List of document IDs that were successfully added
    """
    document_ids = []
    total_chunks = (len(documents) + chunk_size - 1) // chunk_size

    # Split documents into chunks
    for i in range(0, len(documents), chunk_size):
        chunk = documents[i:i + chunk_size]
        print(f"Processing chunk {i//chunk_size + 1} of {total_chunks} ({len(chunk)} documents)")

        # Process this chunk and extend IDs
        chunk_ids = bulk_add_documents(chunk)
        document_ids.extend(chunk_ids)

    return document_ids

# Example usage
large_document_set = [
    {"title": f"Document {i}", "content": f"Content {i}"} 
    for i in range(1000)
]

# Process 1000 documents in chunks of 100
document_ids = process_in_chunks(large_document_set, chunk_size=100)
print(f"Successfully processed {len(document_ids)} documents")

Some guidelines for choosing chunk sizes:

  • For most use cases, chunks of 100-500 documents work well
  • If your documents are very large (>100KB), consider smaller chunks of 50-100
  • For very small documents (<1KB), you can use larger chunks of 500-1000
  • Monitor memory usage and adjust the chunk size accordingly
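
If you prefer to derive a starting chunk size from your data, here is a rough heuristic; the byte thresholds simply mirror the guidelines above and are not API limits:

def choose_chunk_size(documents: List[Dict[str, str]]) -> int:
    """
    Pick a starting chunk size from the average document size (heuristic).
    """
    if not documents:
        return 100
    avg_bytes = sum(len(doc.get('content', '').encode('utf-8')) for doc in documents) / len(documents)
    if avg_bytes > 100_000:  # very large documents (>100KB)
        return 50
    if avg_bytes < 1_000:    # very small documents (<1KB)
        return 500
    return 100               # sensible default for most use cases

# Example usage
chunk_size = choose_chunk_size(large_document_set)
document_ids = process_in_chunks(large_document_set, chunk_size=chunk_size)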

Error Handling

Make sure to add error handling to suit your needs. For example, you can update the add_document function to raise exceptions and handle them appropriately.

from typing import Tuple

class GainlyAPIError(Exception):
    """Custom exception for Gainly API errors"""
    def __init__(self, message: str, error_type: str, status_code: int):
        self.message = message
        self.error_type = error_type
        self.status_code = status_code
        super().__init__(self.message)

def add_document(title: str, content: str) -> Tuple[Dict, int]:
    """
    Add a document to Gainly's index.

    Args:
        title: Document title (required)
        content: Document content

    Returns:
        Tuple[Dict, int]: A tuple containing the response data and status code

    Raises:
        GainlyAPIError: If the API returns an error response
        requests.RequestException: For network-related errors
    """
    BASE_URL = "https://api.gainly.ai"
    VERSION = "v20241104"
    headers = {
        "Content-Type": "application/json",
        "X-API-Key": "YOUR_API_KEY_HERE"  # Replace with your actual API key
    }

    try:
        response = requests.post(
            f"{BASE_URL}/{VERSION}/documents",
            headers=headers,
            json={
                "title": title,
                "content": content
            }
        )

        # Always get the status code
        status_code = response.status_code

        if response.status_code == 200:
            return response.json(), status_code

        # Handle error responses
        try:
            error_data = response.json().get("error", {})
            raise GainlyAPIError(
                message=error_data.get("message", "Unknown error occurred"),
                error_type=error_data.get("type", "unknown_error"),
                status_code=status_code
            )
        except ValueError:
            # Handle non-JSON error responses
            raise GainlyAPIError(
                message=response.text or "Unknown error occurred",
                error_type="unknown_error",
                status_code=status_code
            )

    except requests.RequestException as e:
        # Get the status code from the exception's response, if any (note: a
        # Response object is falsy for error statuses, so compare against None)
        status_code = e.response.status_code if e.response is not None else 0
        raise GainlyAPIError(
            message=str(e),
            error_type="network_error",
            status_code=status_code
        )
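
With add_document now raising GainlyAPIError, you can update bulk_add_documents to report failures in more detail. A minimal sketch, assuming the tuple-returning version above:

def bulk_add_documents(documents: List[Dict[str, str]], delay_seconds: float = 10.0) -> List[str]:
    """
    Add multiple documents with rate limiting, reporting API errors per document.
    """
    document_ids = []
    for doc in documents:
        try:
            result, status_code = add_document(title=doc['title'], content=doc['content'])
            if isinstance(result, dict) and 'id' in result:
                document_ids.append(result['id'])
        except GainlyAPIError as e:
            # A 429 status here means you are exceeding your plan's rate limit
            print(f"Failed to add '{doc['title']}': {e.error_type} (HTTP {e.status_code}): {e.message}")
        time.sleep(delay_seconds)  # Rate limiting delay
    return document_ids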

Monitoring

Keep track of which documents were added successfully and which failed, so that failures can be retried.
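
One simple approach is a small wrapper (illustrative, not part of the API) that records failures alongside successes, assuming the error-raising add_document above and that successful responses include an id field, as in the earlier examples:

def bulk_add_with_report(documents: List[Dict[str, str]], delay_seconds: float = 10.0) -> Dict[str, List]:
    """
    Track added document IDs and failed documents so failures can be retried.
    """
    report = {"added_ids": [], "failed": []}
    for doc in documents:
        try:
            result, _ = add_document(title=doc['title'], content=doc['content'])
            report["added_ids"].append(result['id'])
        except GainlyAPIError as e:
            report["failed"].append({"document": doc, "error": e.message})
        time.sleep(delay_seconds)  # Rate limiting delay
    return report

# Example usage
report = bulk_add_with_report(documents_to_add)
print(f"Added: {len(report['added_ids'])}, failed: {len(report['failed'])}")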

Additional Options

The example above shows basic title and content fields. You can extend the implementation to include other supported fields from the API:

  • metadata: JSON object for custom metadata
  • source_uri: Source URI of the document
  • language: Language code (defaults to en)
  • tenant_id: Tenant ID for multi-tenant applications
  • created_at: Document creation timestamp
  • updated_at: Last modification timestamp
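
For example, the request body in add_document could be extended like this (the field values shown are illustrative):

response = requests.post(
    f"{BASE_URL}/{VERSION}/documents",
    headers=headers,
    json={
        "title": title,
        "content": content,
        "metadata": {"category": "docs"},  # JSON object for custom metadata
        "source_uri": "https://example.com/docs/features",
        "language": "en",                  # defaults to en
        "tenant_id": "tenant_123"          # for multi-tenant applications
    }
)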

For more details about these fields, see the Add Document API Reference.