# Batch Add Documents
This guide explains how to add a batch of documents to your semantic index.
## Overview
Gainly's API supports adding one document at a time through the Add Document endpoint.
To efficiently add multiple documents, you can implement:
- A simple batching mechanism, plus
- Rate limiting to avoid hitting the API rate limit
## Implementation
The implementation requires two main components:
- A function to add individual documents
- A batch processing function that handles multiple documents with rate limiting
Here's a complete example in Python:
```python
import time
import requests
from typing import Dict, List


def add_document(title: str, content: str) -> Dict:
    """
    Add a document to Gainly's index.

    Args:
        title: Document title (required)
        content: Document content
    """
    BASE_URL = "https://api.gainly.ai"
    VERSION = "v20241104"

    headers = {
        "Content-Type": "application/json",
        "X-API-Key": "YOUR_API_KEY_HERE"  # Replace with your actual API key
    }

    response = requests.post(
        f"{BASE_URL}/{VERSION}/documents",
        headers=headers,
        json={
            "title": title,
            "content": content
        }
    )
    return response.json()


def bulk_add_documents(documents: List[Dict[str, str]], delay_seconds: float = 10.0) -> List[str]:
    """
    Add multiple documents with rate limiting.

    Args:
        documents: List of dicts with 'title' and 'content' keys
        delay_seconds: Delay between API calls (default: 10.0 seconds, suitable for the Free Plan)

    Returns:
        List[str]: List of document IDs that were successfully added
    """
    document_ids = []
    for doc in documents:
        try:
            result = add_document(title=doc['title'], content=doc['content'])
            if isinstance(result, dict) and 'id' in result:
                document_ids.append(result['id'])
            time.sleep(delay_seconds)  # Rate limiting delay
        except Exception as e:
            print(f"Error adding document '{doc['title']}': {e}")
    return document_ids
```
## Usage Example
Here's how to use the batch processing function:
```python
# Example documents to add
documents_to_add = [
    {
        "title": "Product Features",
        "content": "Gainly offers both lexical and AI semantic search capabilities..."
    },
    {
        "title": "Getting Started",
        "content": "To begin using Gainly, first obtain an API key from your dashboard..."
    },
    {
        "title": "API Reference",
        "content": "The Gainly API provides RESTful endpoints for search and document management..."
    }
]

# Add documents in batch
results = bulk_add_documents(documents_to_add)
print(f"Successfully added {len(results)} documents")
```
## Best Practices
### Rate Limiting
The sample implementation uses a default delay of 10 seconds between requests, which is appropriate for the Free Plan.
Once you upgrade to a Paid Plan, you can adjust this value to 1 second or even lower.
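If you know your plan's request-per-minute limit, you can derive the delay rather than hard-code it. Here is a minimal sketch; the per-minute limits used in the example are illustrative, so check your plan's actual limits:

```python
def delay_for_rate_limit(requests_per_minute: int) -> float:
    """Return the minimum delay (in seconds) between calls
    that keeps you under a per-minute rate limit."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return 60.0 / requests_per_minute


# A hypothetical limit of 6 requests/minute implies 10 s between calls
print(delay_for_rate_limit(6))   # 10.0
print(delay_for_rate_limit(60))  # 1.0
```

The result can be passed directly as the `delay_seconds` argument of `bulk_add_documents`.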
### Batch Size & Chunking
Most practical use cases involve adding a large number of documents, often too many to pass to the `bulk_add_documents` function above in a single call because of memory and processing time constraints. You can address this by breaking large batches into smaller chunks.
Here's how you can implement chunking:
```python
def process_in_chunks(documents: List[Dict[str, str]], chunk_size: int = 100) -> List[str]:
    """
    Process documents in smaller chunks to manage memory and processing time.

    Args:
        documents: List of documents to process
        chunk_size: Number of documents to process in each chunk

    Returns:
        List[str]: List of document IDs that were successfully added
    """
    document_ids = []
    total_chunks = (len(documents) + chunk_size - 1) // chunk_size

    # Split documents into chunks
    for i in range(0, len(documents), chunk_size):
        chunk = documents[i:i + chunk_size]
        print(f"Processing chunk {i//chunk_size + 1} of {total_chunks} ({len(chunk)} documents)")

        # Process this chunk and collect the IDs
        chunk_ids = bulk_add_documents(chunk)
        document_ids.extend(chunk_ids)

    return document_ids


# Example usage
large_document_set = [
    {"title": f"Document {i}", "content": f"Content {i}"}
    for i in range(1000)
]

# Process 1000 documents in chunks of 100
document_ids = process_in_chunks(large_document_set, chunk_size=100)
print(f"Successfully processed {len(document_ids)} documents")
```
Some guidelines for choosing chunk sizes:
- For most use cases, chunks of 100-500 documents work well
- If your documents are very large (>100KB), consider smaller chunks of 50-100
- For very small documents (<1KB), you can use larger chunks of 500-1000
- Monitor memory usage and adjust the chunk size accordingly
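The guidelines above can be folded into a small helper that suggests a chunk size from the average document size. This is a rough heuristic using the thresholds from the list, not a tuned recommendation:

```python
from typing import Dict, List


def suggest_chunk_size(documents: List[Dict[str, str]]) -> int:
    """Suggest a chunk size from the average content size,
    following the guidelines above."""
    if not documents:
        return 100
    avg_bytes = sum(
        len(d.get("content", "").encode("utf-8")) for d in documents
    ) / len(documents)
    if avg_bytes > 100 * 1024:   # very large documents (>100KB)
        return 50
    if avg_bytes < 1024:         # very small documents (<1KB)
        return 500
    return 100                   # sensible default for most cases


print(suggest_chunk_size([{"title": "t", "content": "x" * 2000}]))  # 100
```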
### Error Handling
Make sure to add error handling to suit your needs. For example, you can update the `add_document` function to raise exceptions and handle them appropriately:
```python
import requests
from typing import Dict, Tuple


class GainlyAPIError(Exception):
    """Custom exception for Gainly API errors"""
    def __init__(self, message: str, error_type: str, status_code: int):
        self.message = message
        self.error_type = error_type
        self.status_code = status_code
        super().__init__(self.message)


def add_document(title: str, content: str) -> Tuple[Dict, int]:
    """
    Add a document to Gainly's index.

    Args:
        title: Document title (required)
        content: Document content

    Returns:
        Tuple[Dict, int]: A tuple containing the response data and status code

    Raises:
        GainlyAPIError: If the API returns an error response
        requests.RequestException: For network-related errors
    """
    BASE_URL = "https://api.gainly.ai"
    VERSION = "v20241104"

    headers = {
        "Content-Type": "application/json",
        "X-API-Key": "YOUR_API_KEY_HERE"  # Replace with your actual API key
    }

    try:
        response = requests.post(
            f"{BASE_URL}/{VERSION}/documents",
            headers=headers,
            json={
                "title": title,
                "content": content
            }
        )

        # Always get the status code
        status_code = response.status_code
        if response.status_code == 200:
            return response.json(), status_code

        # Handle error responses
        try:
            error_data = response.json().get("error", {})
        except ValueError:
            # Handle non-JSON error responses
            raise GainlyAPIError(
                message=response.text or "Unknown error occurred",
                error_type="unknown_error",
                status_code=status_code
            )
        raise GainlyAPIError(
            message=error_data.get("message", "Unknown error occurred"),
            error_type=error_data.get("type", "unknown_error"),
            status_code=status_code
        )
    except requests.RequestException as e:
        # Get the status code from the exception if available
        status_code = e.response.status_code if hasattr(e, 'response') and e.response else 0
        raise GainlyAPIError(
            message=str(e),
            error_type="network_error",
            status_code=status_code
        )
```
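Once `add_document` raises structured exceptions, callers can retry transient failures such as HTTP 429 (rate limited). Below is a hedged sketch of exponential backoff; the retry count and base delay are illustrative, and a minimal stand-in for `GainlyAPIError` is repeated so the snippet runs on its own:

```python
import time
from typing import Callable, TypeVar


# Minimal stand-in for the GainlyAPIError class defined above,
# repeated here so this sketch is self-contained.
class GainlyAPIError(Exception):
    def __init__(self, message: str, error_type: str, status_code: int):
        super().__init__(message)
        self.error_type = error_type
        self.status_code = status_code


T = TypeVar("T")


def with_backoff(call: Callable[[], T], max_retries: int = 3, base_delay: float = 1.0) -> T:
    """Retry `call` with exponential backoff when it raises a
    GainlyAPIError carrying status code 429 (rate limited)."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except GainlyAPIError as exc:
            if exc.status_code != 429 or attempt == max_retries:
                raise  # non-retryable error, or retries exhausted
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```

You could then call, for example, `with_backoff(lambda: add_document("Title", "Content"))` to retry rate-limited requests automatically.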
### Monitoring
Keep track of the results to ensure all documents are added successfully.
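For example, comparing the number of returned IDs against the number of documents submitted flags how many failed and need investigation or retry. A minimal sketch:

```python
from typing import Dict, List


def summarize_run(total_submitted: int, document_ids: List[str]) -> Dict[str, int]:
    """Summarize a batch run: how many documents were submitted,
    how many were added, and how many failed."""
    added = len(document_ids)
    return {
        "submitted": total_submitted,
        "added": added,
        "failed": total_submitted - added,
    }


# Example: 3 submitted, 2 IDs returned -> 1 failure to investigate
print(summarize_run(3, ["doc_1", "doc_2"]))
```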
## Additional Options
The examples above use only the basic `title` and `content` fields. You can extend the implementation to include other supported fields from the API:

- `metadata`: JSON object for custom metadata
- `source_uri`: Source URI of the document
- `language`: Language code (defaults to `en`)
- `tenant_id`: Tenant ID for multi-tenant applications
- `created_at`: Document creation timestamp
- `updated_at`: Last modification timestamp
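The extra fields go straight into the request body alongside `title` and `content`. The field names come from the list above; the values below are purely illustrative:

```python
# Example request body with optional fields (illustrative values)
payload = {
    "title": "Product Features",
    "content": "Gainly offers both lexical and AI semantic search capabilities...",
    "metadata": {"category": "marketing"},         # custom metadata
    "source_uri": "https://example.com/features",  # where the document came from
    "language": "en",                              # defaults to "en" if omitted
    "tenant_id": "tenant_123",                     # for multi-tenant applications
}

# This dict would be passed as `json=payload` to requests.post(),
# in place of the two-field body used in add_document above.
print(sorted(payload))
```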
For more details about these fields, see the Add Document API Reference.