This skill fills templates using a hybrid retrieval approach that combines BM25 keyword ranking with TF-IDF vector similarity. It matches template fields against knowledge base data and fills Word documents (.docx) and Excel spreadsheets (.xlsx) automatically.
Use this skill when:
Common trigger phrases:
This skill uses a hybrid retrieval approach combining two algorithms:
IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × doc_length / avgdl))
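As a rough illustration of the BM25 formula above, the sketch below scores a query against a tiny tokenized corpus. The function name `bm25_score`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions made for this example; the skill's `BM25Retriever` may differ in all three.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one document against a query with the BM25 formula.

    corpus is a list of token lists; k1 and b are assumed defaults.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_tokens):
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue
        # Smoothed IDF (an assumed variant); it is always positive.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * (tf[term] * (k1 + 1)) / norm
    return score

corpus = [
    ["legal", "representative", "zhang", "san"],
    ["registered", "capital", "ten", "million"],
    ["legal", "address", "beijing"],
]
scores = [bm25_score(["legal", "representative"], doc, corpus) for doc in corpus]
```

The first document matches both query terms and scores highest; the second matches neither and scores zero.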
final_score = 0.5 × BM25_score + 0.5 × TF-IDF_score
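The weighted sum above can be sketched as follows. The source does not say whether the two scores are rescaled before fusion, so the min-max normalization here is an assumption, as are the helper names `min_max` and `fuse`.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25_scores, tfidf_scores, bm25_weight=0.5, tfidf_weight=0.5):
    """Weighted sum of normalized scores, one entry per document."""
    b = min_max(bm25_scores)
    t = min_max(tfidf_scores)
    return [bm25_weight * x + tfidf_weight * y for x, y in zip(b, t)]

# Three candidate documents, scored by each retriever.
final = fuse([2.0, 0.0, 1.0], [0.1, 0.3, 0.5])
best = max(range(len(final)), key=final.__getitem__)
```

Normalizing first keeps one retriever from dominating simply because its raw scores live on a larger scale. In this example the third document (index 2) ranks first.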
The system uses a multi-level matching strategy:
Ensure the knowledge base is a JSON file with the following structure:
{
  "filename.xlsx": {
    "filename": "filename.xlsx",
    "type": "xlsx",
    "content": "=== Sheet: SheetName\nA1[Header1] | A2[Value1] | ..."
  },
  "filename.docx": {
    "filename": "filename.docx",
    "type": "docx",
    "content": {
      "paragraphs": ["text content..."],
      "tables": [...]
    }
  }
}
Supported formats in JSON:
A1[Value] | B2[Value] pattern
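Before running the filler, it can help to check that a knowledge base matches the structure shown above. The following is a minimal sketch, assuming the two entry shapes above (string content for `xlsx`, object content for `docx`) are the only ones; `validate_kb` is a hypothetical helper and not part of the skill's scripts.

```python
import json

def validate_kb(kb):
    """Return a list of problems found in a knowledge base dict."""
    problems = []
    for name, entry in kb.items():
        for key in ("filename", "type", "content"):
            if key not in entry:
                problems.append(f"{name}: missing '{key}'")
        kind = entry.get("type")
        content = entry.get("content")
        if kind == "xlsx" and not isinstance(content, str):
            problems.append(f"{name}: xlsx content should be a string")
        if kind == "docx" and not isinstance(content, dict):
            problems.append(f"{name}: docx content should be an object")
    return problems

sample = json.loads("""
{
  "a.xlsx": {"filename": "a.xlsx", "type": "xlsx", "content": "=== Sheet: S1\\nA1[Name]"},
  "b.docx": {"filename": "b.docx", "type": "docx", "content": "plain text"}
}
""")
problems = validate_kb(sample)
```

Here `problems` holds a single entry, flagging `b.docx` because its content is a string rather than an object.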
Execute the main filling script:
python scripts/smart_filler.py
The script will:
The system generates:
Purpose: Core hybrid retrieval engine implementation
Key Classes:
BM25Retriever: BM25 ranking algorithm implementation
TFIDFRetriever: TF-IDF vector search implementation
HybridRetriever: Fusion of both retrieval methods
VectorKnowledgeBase: Knowledge base management and indexing
Usage Example:
from vector_kb import VectorKnowledgeBase
# Initialize and load knowledge base
kb = VectorKnowledgeBase()
kb.load_knowledge_base('knowledge_base.json').build_index()
# Search for values; '法人代表' means "legal representative"
results = kb.search('法人代表', top_k=5)
for result in results:
    print(f"Score: {result['score']}, Value: {result['document']}")
Purpose: Main template filling orchestration
Key Classes:
TextExcelParser: Parses text-based Excel content
SmartFillSystem: Orchestrates the entire filling process
Usage Example:
from smart_filler import SmartFillSystem
# Configure paths
system = SmartFillSystem(
    kb_path='knowledge_base.json',
    template_dir='templates/',
    output_dir='filled/'
)
# Initialize and process
system.load_kb()
system.process_all()
Configuration:
kb_path: Path to knowledge base JSON file
template_dir: Directory containing template files
output_dir: Directory for filled output files
Excel Content Format (text-based):
=== Sheet: SheetName ===
A1[Header1] | A2[Value1] | B1[Header2] | B2[Value2]
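To make the text-based Excel layout concrete, here is a minimal parsing sketch. It is not the skill's `TextExcelParser`; the function name `parse_excel_text` and both regular expressions are assumptions, and the sheet header is accepted with or without the trailing `===`.

```python
import re

SHEET_RE = re.compile(r"^===\s*Sheet:\s*(.+?)\s*(?:===)?\s*$")
CELL_RE = re.compile(r"([A-Z]+[0-9]+)\[([^\]]*)\]")

def parse_excel_text(text):
    """Parse the text layout into {sheet_name: {cell_ref: value}}."""
    sheets = {}
    current = None
    for line in text.splitlines():
        header = SHEET_RE.match(line.strip())
        if header:
            current = sheets.setdefault(header.group(1), {})
            continue
        if current is None:
            continue  # ignore cell lines that appear before any sheet header
        for ref, value in CELL_RE.findall(line):
            current[ref] = value
    return sheets

text = "=== Sheet: Info ===\nA1[Company] | A2[Acme] | B1[Year] | B2[2023]"
parsed = parse_excel_text(text)
```

With this input, `parsed["Info"]` maps `A1` to `Company`, `A2` to `Acme`, `B1` to `Year`, and `B2` to `2023`. Cell values that themselves contain `]` would need a more careful grammar than this sketch uses.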
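Document Content Format (field extraction), where 字段名 is the field name and 值 is the value, separated by an ASCII or full-width colon and optional whitespace:
字段名[::\s]*值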
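The field pattern above can be turned into a small extraction helper. This is a sketch under stated assumptions: `extract_field` is a hypothetical name, the character class accepts both the ASCII and the full-width colon, and the value is taken to run to the end of the line.

```python
import re

def extract_field(text, field_name):
    """Return the value that follows field_name, or None if it is absent."""
    # Field name, then any mix of colons and whitespace, then the value.
    pattern = re.escape(field_name) + r"[::\s]*([^\n\r]+)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

# 法人代表 = legal representative, 注册资本 = registered capital
doc = "法人代表:张三\n注册资本: 1000万元"
representative = extract_field(doc, "法人代表")
capital = extract_field(doc, "注册资本")
missing = extract_field(doc, "地址")  # 地址 = address, not in the text
```

Because `\s` also matches line breaks, a field name followed directly by a newline picks up its value from the next line, which may or may not be desirable for a given template.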
Year-based Data:
Based on real-world testing:
| Metric | Value |
|---------|--------|
| Knowledge Base Fields | 89+ |
| Files Processed | 5+ |
| Total Fields Filled | 388+ |
| Fields Per File (Average) | 77.6 |
| XX基金 ("XX Fund") Replacement Rate | 100% |
| Precision Improvement | 50%+ over keyword matching |
| Efficiency Gain | 90%+ over manual filling |
Cause: Knowledge base content format incompatible
Solution: Ensure Excel content uses A1[Value] format; check JSON structure
Cause: Field name ambiguity
Solution: Adjust hybrid retrieval weights; use more specific field names in templates
Cause: Non-UTF-8 characters in knowledge base
Solution: Ensure knowledge base JSON is UTF-8 encoded; use sys.stdout.reconfigure(encoding='utf-8') in scripts
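One way to surface encoding problems early is to load the knowledge base with an explicit encoding and fail with a clear message. The sketch below assumes a helper named `load_kb_utf8`, which is not part of the skill's scripts; `sys.stdout.reconfigure` requires Python 3.7 or later.

```python
import json
import sys

def load_kb_utf8(path):
    """Load a knowledge base JSON file, insisting on UTF-8."""
    try:
        # utf-8-sig also accepts files that start with a byte order mark
        with open(path, encoding="utf-8-sig") as fh:
            return json.load(fh)
    except UnicodeDecodeError as exc:
        raise ValueError(f"{path} is not valid UTF-8; re-save it as UTF-8") from exc

# Make console output safe for non-ASCII text where the stream supports it.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")
```

A file saved in a legacy encoding such as GBK then raises a `ValueError` that names the file, instead of producing garbled field values later in the run.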
Modify the hybrid retrieval weight balance in HybridRetriever:
# Default: BM25 0.5, TF-IDF 0.5
# Change to emphasize TF-IDF vector similarity over BM25:
self.bm25_weight = 0.3
self.tfidf_weight = 0.7
Extend TextExcelParser._extract_from_text() to support additional patterns:
patterns = {
    # '新字段' means "new field"; replace it with the label used in your documents
    'new_field': r'新字段[::\s]*([^\n\r]+)',
    # Add more patterns...
}
Process multiple knowledge bases:
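from pathlib import Path
from smart_filler import SmartFillSystem
kb_files = ['kb1.json', 'kb2.json', 'kb3.json']
for kb_file in kb_files:
    # Name the output directory after the file stem: filled_kb1/, not filled_kb1.json/
    out_dir = f'filled_{Path(kb_file).stem}/'
    system = SmartFillSystem(kb_file, 'templates/', out_dir)
    system.load_kb()
    system.process_all()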
Potential improvements for future versions:
For deeper understanding: