AI Augmentation with Lucene
Lucee's AI capabilities can be enhanced with Retrieval-Augmented Generation (RAG) using the Lucene extension. This powerful combination allows AI models to reference your indexed content when responding to queries, creating more accurate and contextually relevant answers.
Overview
The Lucene extension enables you to create searchable collections of content that can be used to augment AI queries with relevant context. This approach improves AI responses by providing domain-specific information from your data sources, whether they're local documentation, databases, external APIs, or other resources.
Requirements
- Lucee 7.0 or higher
- Lucene Extension version 3.0 or higher (Maven-based version)
- Configured AI endpoint (see AI Documentation)
How It Works
- A collection is created to store searchable content
- Content is indexed from various sources (databases, files, web content, APIs, etc.)
- When a query is sent to an AI, it's first used to search the collection for relevant information
- The search results, including content chunks from matches, are added as context to the original query
- The augmented query is sent to the AI endpoint, enabling more informed responses
Implementation
Here's how to implement AI augmentation with Lucene in your Lucee application:
Step 1: Create a Collection
Create a searchable collection to store your indexed content:
// This snippet uses var and the local scope, so it is assumed to run inside a function body
// Define collection name
collectionName = "my_knowledge_base";
// Create if needed
collection action="list" name="local.collections";
var hasColl = false;
loop query=collections {
    if(collections.name == collectionName) {
        hasColl = true;
        break;
    }
}
if(!hasColl) {
    // Define collection directory
    var collDirectory = expandPath("{lucee-config-dir}/collections/knowledge");
    if(!directoryExists(collDirectory)) {
        directoryCreate(collDirectory, true);
    }
    // Create collection
    collection action="Create" collection=collectionName path=collDirectory;
}
Step 2: Index Your Content
You can index content from virtually any source you can access in CFML:
// Example 1: Index content from a database
function indexDatabaseContent() {
    // Query your data source
    var qryContent = queryExecute("
        SELECT
            id AS url,
            title,
            description AS summary,
            content,
            categories AS keywords
        FROM knowledge_articles
        WHERE is_active = 1
    ");
    // Index the content
    index action="update"
        type="custom"
        collection=collectionName
        key="url"
        title="title"
        body="content,summary"
        custom1="keywords"
        query="qryContent";
}
// Example 2: Index content from files
function indexFileContent() {
    // Get list of files
    var files = directoryList(expandPath("./resources/docs"), true, "path", "*.md");
    // Create query object to hold file contents
    var qryFiles = queryNew("url,title,body,keywords");
    // Process each file
    for(var filePath in files) {
        var content = fileRead(filePath);
        // Use the file name (last path segment) as the title
        var title = listLast(filePath, "/\");
        // Extract metadata from file content if applicable
        // This is just an example - adapt to your file format
        var keywords = "";
        if(content contains "Keywords:") {
            // Keep the matches in their own variable so keywords is always a string
            var matches = reMatch("Keywords:(.+?)[\r\n]", content);
            if(arrayLen(matches)) {
                keywords = trim(replace(matches[1], "Keywords:", ""));
            }
        }
        // Add to query
        queryAddRow(qryFiles);
        querySetCell(qryFiles, "url", filePath);
        querySetCell(qryFiles, "title", title);
        querySetCell(qryFiles, "body", content);
        querySetCell(qryFiles, "keywords", keywords);
    }
    // Index the files
    index action="update"
        type="custom"
        collection=collectionName
        key="url"
        title="title"
        body="body"
        custom1="keywords"
        query="qryFiles";
}
// Example 3: Index web content
function indexWebContent() {
    // Define URLs to index
    var urls = [
        "https://example.com/api/docs",
        "https://example.com/api/reference",
        "https://example.com/api/tutorials"
    ];
    // Create query object
    var qryWeb = queryNew("url,title,body,keywords");
    // Process each URL (named pageUrl to avoid clashing with the built-in url scope)
    for(var pageUrl in urls) {
        var httpService = new http();
        httpService.setURL(pageUrl);
        var result = httpService.send().getPrefix();
        if(result.status_code == 200) {
            // Extract title and content (simplified example)
            var title = reMatchNoCase("<title>(.+?)</title>", result.fileContent);
            title = arrayLen(title) ? replaceNoCase(title[1], "<title>", "") : pageUrl;
            title = replaceNoCase(title, "</title>", "");
            // Strip HTML for indexing body
            var body = reReplaceNoCase(result.fileContent, "<[^>]*>", " ", "ALL");
            // Add to query
            queryAddRow(qryWeb);
            querySetCell(qryWeb, "url", pageUrl);
            querySetCell(qryWeb, "title", title);
            querySetCell(qryWeb, "body", body);
            querySetCell(qryWeb, "keywords", "api,documentation,web");
        }
    }
    // Index the web content
    if(qryWeb.recordCount) {
        index action="update"
            type="custom"
            collection=collectionName
            key="url"
            title="title"
            body="body"
            custom1="keywords"
            query="qryWeb";
    }
}
Step 3: Augment AI Queries
Use the indexed content to augment AI queries with the enhanced content chunks feature in Lucene Extension 3.0:
function augmentQuery(userQuery) {
    // Escape special characters to ensure proper search
    var criteria = rereplace(userQuery, "([+\-&|!(){}\[\]\^""~*?:\\\/])", "\\1", "ALL");
    // Perform search with content chunks using the new contextpassages feature
    search
        contextpassages=5 // Number of passages to retrieve
        contextHighlightBegin="<mark>" // Highlighting for matched terms
        contextHighlightEnd="</mark>"
        contextBytes=4000 // Total bytes of context to retrieve
        contextpassageLength=800 // Length of each passage
        name="local.searchResults"
        collection=collectionName
        criteria=criteria
        suggestions="always"
        maxrows=5; // Limit number of results
    // Format the augmented query
    var augmentedQuery = "User Query: #userQuery#";
    var contextData = [];
    // Process search results
    loop query=searchResults {
        // Access context passages
        var contextInfo = searchResults.context;
        var passages = contextInfo.passages;
        // Prepare source information
        var sourceInfo = {
            "title": searchResults.title,
            "summary": searchResults.summary,
            "score": searchResults.score,
            "passages": []
        };
        // Process each passage in the result
        loop query=passages {
            arrayAppend(sourceInfo.passages, {
                "score": passages.score,
                "content": passages.original
            });
        }
        // Add this source to our context data
        arrayAppend(contextData, sourceInfo);
    }
    // Only add context if we found relevant information
    if(arrayLen(contextData)) {
        augmentedQuery &= chr(10) & "Context Information: #serializeJSON(contextData)#";
    }
    return augmentedQuery;
}
Usage with AI Functions
You can integrate the augmentation functionality directly with Lucee's AI functions:
// Create an AI session
aiSession = LuceeCreateAISession(name:"myclaude");
// User query
userQuery = "How do I optimize database queries in my application?";
// Augment the query with relevant context from indexed content
augmentedQuery = augmentQuery(userQuery);
// Send to AI with augmented context
response = LuceeInquiryAISession(aiSession, augmentedQuery);
// Display the response
echo(response);
Benefits of AI Augmentation
- Enhanced Relevance: AI responses are informed by your specific content
- Reduced Hallucinations: Grounds responses in factual information from your data
- Domain Knowledge: AI can provide answers specific to your organization or industry
- Content Currency: Responses reflect your latest data, not just the AI's training cutoff
- Customizable Context: Index exactly what matters for your specific use case
- Efficiency: Better initial responses reduce the need for follow-up queries
- Privacy: The index itself stays within your system; only the passages matched for a query are sent to the AI endpoint as context (see Security Considerations)
Advanced Features
Content Chunk Optimization
The Lucene Extension 3.0 in Lucee 7 provides enhanced content chunking capabilities that allow for better context extraction:
search
    contextpassages=5 // Number of distinct passages to extract
    contextHighlightBegin="<mark>" // Optional highlighting for matched terms
    contextHighlightEnd="</mark>"
    contextBytes=5000 // Total context bytes across all passages
    contextpassageLength=1000 // Length of each individual passage
    name="local.searchResults"
    collection=collectionName
    criteria=criteria;
These parameters let you fine-tune how much context is provided to the AI:
- contextpassages: Controls how many separate text segments are returned
- contextBytes: Sets the maximum total size of all context returned
- contextpassageLength: Controls the maximum size of each individual passage
Content Source Flexibility
You can index virtually any content that you can access in CFML:
- Database records from any datasource
- Local files in any format (parse as needed)
- Web content from APIs or scraped pages
- Application logs or metrics
- User-generated content
- PDF, Word, or other document formats (with appropriate text extraction)
- External knowledge bases or documentation
Security Considerations
- The augmentation process includes indexed content in queries sent to AI providers
- Use local AI endpoints (like Ollama) for sensitive data scenarios
- Implement data filtering to avoid exposing confidential information
- Consider encrypting sensitive indexed content and implementing decryption at query time
- Add audit logging for all AI interactions
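As one possible way to approach the data filtering point above, the sketch below masks obviously sensitive patterns in a passage before it is added to the context. The function name redactSensitive and both patterns are illustrative assumptions, not part of Lucee or the Lucene extension; real filtering rules depend on your data.

```cfml
// Hypothetical helper (not a Lucee built-in): masks e-mail addresses
// and long digit runs in a text passage before it is sent to an AI provider
function redactSensitive(required string text) {
    // Replace anything that looks like an e-mail address
    var cleaned = reReplaceNoCase(
        arguments.text,
        "[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}",
        "[redacted-email]",
        "ALL"
    );
    // Replace runs of 12 or more digits (for example card or account numbers)
    cleaned = reReplace(cleaned, "[0-9]{12,}", "[redacted-number]", "ALL");
    return cleaned;
}
```

In augmentQuery from Step 3, this could be applied where each passage is collected, for example "content": redactSensitive(passages.original).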
Performance Optimization
For best performance with the new Lucene Extension 3.0:
- Index strategically - focus on high-value content
- Use appropriate text segmentation for your domain
- Fine-tune search parameters like maxrows and contextpassages
- Implement caching for frequent queries
- Schedule index maintenance during low-traffic periods
- Monitor performance metrics to optimize configuration
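Caching for frequent queries could look roughly like the following sketch, which wraps augmentQuery from Step 3. It assumes a default object cache is configured in Lucee; the wrapper name cachedAugmentQuery, the "rag_" key prefix, and the 10-minute lifetime are illustrative choices rather than recommendations.

```cfml
// Hypothetical wrapper: reuses the augmented query for repeated user queries
function cachedAugmentQuery(required string userQuery) {
    // Build a cache key from the normalised query text
    var cacheKey = "rag_" & hash(lCase(trim(arguments.userQuery)), "SHA-256");
    // cacheGet returns null when the key is not in the cache
    var cached = cacheGet(cacheKey);
    if(!isNull(cached)) {
        return cached;
    }
    // Cache miss: run the Lucene search and build the augmented query
    var augmented = augmentQuery(arguments.userQuery);
    // Keep the result for 10 minutes (days, hours, minutes, seconds)
    cachePut(cacheKey, augmented, createTimeSpan(0, 0, 10, 0));
    return augmented;
}
```

A short lifetime keeps cached context from lagging far behind the index; after re-indexing, entries would either need to expire or be cleared explicitly.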
Examples of Use Cases
- Customer Support: Augment AI with product documentation, FAQs, and support history
- Development Assistance: Index code repositories, API docs, and coding standards
- Knowledge Management: Connect AI to your company's internal knowledge base
- Training: Create AI tutors with domain-specific knowledge from your course materials
- Research Assistant: Index research papers and data to enable AI analysis in your field
- Data Analysis: Combine AI with indexed analysis of your business metrics