Bio Ontology Mapper

Map unstructured biomedical text to standardized ontologies (SNOMED CT, MeSH, ICD-10) for terminology normalization and semantic interoperability. Extracts m...
renhaosu2024
Bio-Ontology Mapper

Overview

Bio-Ontology Mapper is a biomedical terminology normalization tool. It maps free-text clinical and scientific concepts to standardized ontologies to support semantic interoperability and data harmonization.

Key Capabilities:

  • Multi-Ontology Support: SNOMED CT, MeSH, ICD-10, LOINC, RxNorm, HGNC
  • Entity Extraction: NER for diseases, symptoms, procedures, drugs
  • Fuzzy Matching: Handle typos, abbreviations, and synonyms
  • Confidence Scoring: Reliability metrics for each mapping
  • Batch Processing: Normalize large datasets efficiently
  • Cross-Mapping: Translate between ontology systems

When to Use

✅ Use this skill when:

  • Normalizing clinical notes for EHR integration
  • Standardizing terminology for multi-site studies
  • Mapping legacy data to modern ontologies
  • Preparing data for clinical data warehouses
  • Converting free-text to coded data for analysis
  • Building semantic search for biomedical literature
  • Teaching biomedical informatics principles

❌ Do NOT use when:

  • Clinical diagnosis or decision support → Use clinical decision tools
  • Real-time patient care → Latency too high for acute settings
  • Replacing expert coding → Use for pre-coding only; final review by a qualified coder is still needed
  • Processing PHI without de-identification → Ensure HIPAA compliance

Integration:

  • Upstream: clinical-data-cleaner (data preparation), ehr-semantic-compressor (text extraction)
  • Downstream: clinical-data-cleaner (SDTM mapping), unstructured-medical-text-miner (NLP pipelines)

Core Capabilities

1. Entity Recognition and Mapping

Extract and map biomedical entities to ontologies:

from scripts.mapper import BioOntologyMapper

mapper = BioOntologyMapper()

# Map clinical text
result = mapper.map_text(
    text="Patient has diabetes and hypertension, taking metformin",
    ontologies=["snomed", "mesh", "rxnorm"],
    confidence_threshold=0.7
)

for entity in result.entities:
    print(f"{entity.text} → {entity.concept_id} ({entity.ontology})")
    print(f"  Preferred: {entity.preferred_term}")
    print(f"  Confidence: {entity.confidence:.2f}")

Supported Ontologies:

| Ontology | Domain | Use Case |
|----------|--------|----------|
| SNOMED CT | Clinical | EHR interoperability |
| MeSH | Literature | PubMed indexing |
| ICD-10 | Billing | Diagnosis codes |
| LOINC | Labs | Test result standardization |
| RxNorm | Drugs | Medication normalization |
| HGNC | Genes | Gene name standardization |

2. Cross-Ontology Translation

Map concepts between different ontologies:

# Cross-map SNOMED to ICD-10
translation = mapper.cross_map(
    source_id="22298006",  # SNOMED: Myocardial infarction
    source_ontology="snomed",
    target_ontology="icd10"
)

print(f"ICD-10: {translation.target_id} - {translation.target_term}")
# Output: ICD-10: I21.9 - Acute myocardial infarction, unspecified

Cross-Mapping Coverage:

  • SNOMED CT ↔ ICD-10-CM (clinical modifications)
  • MeSH ↔ SNOMED CT (literature to clinical)
  • RxNorm ↔ ATC (drug classifications)
  • LOINC ↔ SNOMED (lab to clinical)
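
Conceptually, a cross-map is a lookup in a crosswalk table. The sketch below shows one way to model that, assuming a small in-memory dictionary. The `CROSSWALK` table, the `Translation` type, and this standalone `cross_map` function are hypothetical and are not the skill's internals. Its single entry mirrors the SNOMED to ICD-10 example above.

```python
from typing import NamedTuple


class Translation(NamedTuple):
    target_id: str
    target_term: str


# Hypothetical crosswalk keyed by (source ontology, target ontology, source ID).
# Real crosswalks are loaded from published map files, not hard-coded.
CROSSWALK = {
    ("snomed", "icd10", "22298006"): [
        Translation("I21.9", "Acute myocardial infarction, unspecified"),
    ],
}


def cross_map(source_id, source_ontology, target_ontology):
    """Return every known target candidate; an empty list means no mapping."""
    return CROSSWALK.get((source_ontology, target_ontology, source_id), [])
```

Returning a list rather than a single code keeps one-to-many mappings visible, so the caller decides how to resolve them.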

3. Batch Normalization

Process large datasets:

# Batch process CSV
results = mapper.batch_map(
    input_file="clinical_terms.csv",
    text_column="diagnosis_description",
    ontologies=["snomed", "icd10"],
    output_format="csv",
    max_workers=4
)

# Results include:
# - Original term
# - Mapped concept ID
# - Confidence score
# - Alternative mappings (if ambiguous)

Performance:

  • ~100 terms/second (with caching)
  • ~20 terms/second (API lookup)
  • Parallel processing for large datasets
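
One plausible reason caching helps is that clinical term lists repeat heavily. The sketch below de-duplicates normalized terms and looks up each distinct term once in a thread pool. It is an illustration, not the skill's batch processor: `batch_lookup`, `demo_lookup`, and `LOOKUP_LOG` are hypothetical names, and `demo_lookup` stands in for a slow API call.

```python
from concurrent.futures import ThreadPoolExecutor

LOOKUP_LOG = []  # records which terms actually reached the lookup function


def demo_lookup(term):
    """Hypothetical stand-in for a slow remote ontology lookup."""
    LOOKUP_LOG.append(term)
    return {"heart attack": "22298006"}.get(term)


def batch_lookup(terms, lookup, max_workers=4):
    """Look up each distinct normalized term once; return results in input order."""
    normalized = [t.strip().lower() for t in terms]
    unique = list(dict.fromkeys(normalized))  # de-duplicate, keep first-seen order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        resolved = dict(zip(unique, pool.map(lookup, unique)))
    return [resolved[t] for t in normalized]
```

De-duplicating before dispatch, rather than relying on a shared cache inside the workers, guarantees that a repeated term is never looked up twice.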

4. Confidence Scoring and Validation

Assess mapping reliability:

scoring = mapper.score_mapping(
    term="heart attack",
    candidate="22298006",  # Myocardial infarction
    factors=["string_similarity", "context_match", "frequency"]
)

print(f"Overall confidence: {scoring.confidence:.2f}")
print(f"Breakdown: {scoring.factors}")

Scoring Factors:

  • String similarity: Levenshtein distance, n-grams
  • Context match: Surrounding words alignment
  • Frequency: Common usage in corpus
  • Semantic similarity: Vector embeddings
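
The two string-similarity measures named above can be sketched in plain Python, together with one way to combine factor scores. The source does not say how the skill weights its factors, so the weighted average in `combined_confidence` and all function names here are assumptions.

```python
def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance onto [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0  # both strings shorter than n
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def combined_confidence(factors, weights):
    """Weighted average of factor scores (assumed scheme, each score in [0, 1])."""
    total = sum(weights[name] for name in factors)
    return sum(factors[name] * weights[name] for name in factors) / total
```

Pure string measures score "heart attack" against "myocardial infarction" very low, which is why the context, frequency, and semantic factors matter for synonyms.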

Common Patterns

Pattern 1: Clinical Note Normalization

Scenario: Convert free-text diagnoses to SNOMED codes.

# Normalize clinical notes
python scripts/main.py \
  --input notes.csv \
  --column diagnosis_text \
  --ontology snomed \
  --threshold 0.8 \
  --output coded_diagnoses.csv

# Results: "heart attack" → 22298006 (Myocardial infarction)

Post-Processing:

  • Review low-confidence mappings (<0.8)
  • Handle ambiguous terms manually
  • Validate against clinical context
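
The post-processing steps above amount to splitting results into an accepted set and a review queue. A minimal sketch follows, assuming each mapping is a dictionary with hypothetical `term`, `concept_id`, and `confidence` keys. The real output columns of `scripts/main.py` may differ.

```python
def triage(mappings, threshold=0.8):
    """Split mappings into auto-accepted rows and rows queued for manual review."""
    accepted, review = [], []
    for row in mappings:
        unmapped = row.get("concept_id") is None
        if unmapped or row["confidence"] < threshold:
            review.append(row)
        else:
            accepted.append(row)
    # Lowest confidence first, so reviewers see the riskiest rows early.
    review.sort(key=lambda r: r.get("confidence", 0.0))
    return accepted, review
```

Unmapped terms go to the review queue rather than being dropped, so they are logged and seen by a person.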

Pattern 2: Literature Indexing

Scenario: Map research paper keywords to MeSH.

# Map keywords to MeSH
mesh_terms = mapper.map_to_mesh(
    keywords=["cancer immunotherapy", "checkpoint inhibitors", "PD-1"],
    include_tree_numbers=True,
    include_qualifiers=True
)

for term in mesh_terms:
    print(f"{term.input} → {term.descriptor}")
    print(f"  Tree: {term.tree_numbers}")
    print(f"  Entry terms: {term.synonyms}")

Pattern 3: Drug Name Normalization

Scenario: Standardize medication names across datasets.

# Normalize drug names
drugs = ["Tylenol", "Advil", "Motrin", "acetaminophen"]

for drug in drugs:
    result = mapper.map_to_rxnorm(drug)
    print(f"{drug} → {result.rxcui}: {result.name}")
    # Tylenol → 161: Acetaminophen
    # Advil → 5640: Ibuprofen
    # Motrin → 5640: Ibuprofen
    # acetaminophen → 161: Acetaminophen

Pattern 4: EHR Data Harmonization

Scenario: Merge data from multiple hospital systems.

# Harmonize diagnoses from 3 hospitals
python scripts/main.py \
  --batch \
  --inputs "hospital_a.csv,hospital_b.csv,hospital_c.csv" \
  --target-ontology snomed \
  --cross-map-to icd10 \
  --output harmonized_data.csv

Complete Workflow Example

From free-text to coded database:

from scripts.mapper import BioOntologyMapper
from scripts.validator import MappingValidator

# Initialize
mapper = BioOntologyMapper()
validator = MappingValidator()

# Step 1: Extract entities from text
clinical_note = "Patient has Type 2 diabetes and hypertension..."
entities = mapper.extract_entities(clinical_note)

# Step 2: Map to SNOMED
mappings = []
for entity in entities:
    mapping = mapper.map_to_snomed(
        entity.text,
        context=clinical_note,
        top_n=3
    )
    mappings.append(mapping)

# Step 3: Validate mappings
for mapping in mappings:
    validation = validator.validate(
        mapping,
        check_clinical_plausibility=True
    )
    if not validation.is_valid:
        print(f"Review needed: {mapping}")

# Step 4: Export to database format
db_records = [m.to_database_record() for m in mappings]

Quality Checklist

Pre-Mapping:

  • [ ] Text preprocessed (lowercase, punctuation handled)
  • [ ] Abbreviations expanded where possible
  • [ ] Language identified (multilingual support)

During Mapping:

  • [ ] Confidence threshold appropriate (>0.7 for clinical)
  • [ ] Multiple candidates considered for ambiguous terms
  • [ ] Context used for disambiguation

Post-Mapping:

  • [ ] Low-confidence mappings flagged for review
  • [ ] Unmapped terms logged
  • [ ] CRITICAL: Clinical expert validation for high-stakes use

Before Production:

  • [ ] Mapping accuracy validated on gold standard
  • [ ] False positive rate acceptable (<5%)
  • [ ] Recall acceptable for use case (>90%)
  • [ ] API rate limits respected
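
The accuracy checks in this list can be computed from a gold-standard set with a few lines of Python. The sketch below makes one assumption explicit: it reads "false positive rate" as the share of emitted mappings that are wrong, which is 1 minus precision. The `evaluate` function and its dictionary inputs are hypothetical.

```python
def evaluate(predicted, gold):
    """Compare predicted concept IDs with gold-standard IDs, both keyed by term.

    A value of None (or a missing key in `predicted`) means "no mapping".
    """
    emitted = [t for t in gold if predicted.get(t) is not None]
    mappable = [t for t in gold if gold[t] is not None]
    correct = [t for t in emitted if predicted[t] == gold[t]]
    wrong = len(emitted) - len(correct)
    return {
        "precision": len(correct) / len(emitted) if emitted else 0.0,
        "recall": len(correct) / len(mappable) if mappable else 0.0,
        # Assumed reading of "false positive rate": wrong mappings / emitted mappings.
        "wrong_share": wrong / len(emitted) if emitted else 0.0,
    }
```

With four gold terms, one correct mapping, one wrong mapping, and two terms left unmapped, this yields precision 0.5, recall 0.25, and a wrong share of 0.5, which would fail both checklist thresholds.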

Common Pitfalls

Mapping Errors:

  • Abbreviation ambiguity → "MI" = Myocardial infarction OR Michigan
    ✅ Use context; flag for manual review
  • Outdated terms → Old terminology not in current ontology
    ✅ Use historical mappings; update terminology
  • False confidence → High score for wrong concept
    ✅ Always review top-3 candidates
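
Context-based disambiguation of an abbreviation such as "MI" can be sketched as a cue-word overlap count. The `SENSES` inventory, the cue words, and the `disambiguate` function are all hypothetical and kept deliberately simple. The skill's actual context matching is not documented here.

```python
import re

# Hypothetical sense inventory: abbreviation -> {sense: cue words}.
SENSES = {
    "MI": {
        "Myocardial infarction": {"chest", "troponin", "cardiac", "ecg", "stemi"},
        "Michigan": {"state", "resident", "address", "detroit"},
    },
}


def disambiguate(abbreviation, context, min_margin=1):
    """Pick the sense whose cue words overlap the context most.

    Returns None when there is no evidence or the top two senses are too
    close, which signals that the term should be flagged for manual review.
    """
    words = set(re.findall(r"[a-z0-9]+", context.lower()))
    senses = SENSES.get(abbreviation, {})
    scores = {sense: len(cues & words) for sense, cues in senses.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return None
    return ranked[0][0]
```

Returning `None` instead of a best guess keeps ambiguous cases in the manual review path recommended above.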

Technical Issues:

  • API failures → No local fallback
    ✅ Implement caching; use local reference files
  • Version mismatches → Different ontology versions
    ✅ Track ontology version used
  • PHI exposure → Sending patient data to external APIs
    ✅ De-identify before API calls; use local processing when possible
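
Two of the mitigations above, a local fallback and version tracking, can be combined in one small wrapper. This is a sketch under stated assumptions: `lookup_with_fallback`, `failing_api`, the record keys, and the choice of exceptions to catch are illustrative and not part of the skill.

```python
def failing_api(term):
    """Hypothetical remote lookup that simulates an outage."""
    raise ConnectionError("simulated outage")


def lookup_with_fallback(term, remote_lookup, local_table, ontology_version):
    """Try the remote lookup first; fall back to a local reference table on failure.

    The returned record notes where the answer came from and which ontology
    version was in use, so later audits can detect version mismatches.
    """
    key = term.strip().lower()
    try:
        concept_id = remote_lookup(key)
        source = "api"
    except (ConnectionError, TimeoutError):
        concept_id = local_table.get(key)  # None if the term is not cached locally
        source = "local"
    return {
        "term": term,
        "concept_id": concept_id,
        "source": source,
        "ontology_version": ontology_version,
    }
```

Only de-identified text should ever reach `remote_lookup`. The fallback path keeps the pipeline running during outages but does not by itself address PHI exposure.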

References

Available in references/ directory:

  • snomed_ct_guide.md - SNOMED CT hierarchy and relationships
  • mesh_structure.md - MeSH tree structure and qualifiers
  • ontology_mappings.md - Crosswalks between systems
  • nlp_best_practices.md - Biomedical text processing
  • api_documentation.md - External service integration
  • validation_datasets.md - Gold standard test sets

Scripts

Located in scripts/ directory:

  • main.py - CLI interface for mapping
  • mapper.py - Core ontology mapping engine
  • extractor.py - Named entity recognition
  • cross_mapper.py - Ontology-to-ontology translation
  • scorer.py - Confidence calculation
  • batch_processor.py - Large dataset handling
  • validator.py - Mapping quality checks
  • caching.py - Local storage for frequent lookups

Limitations

  • Ambiguity: Many-to-many mappings common; context required
  • Coverage: Rare diseases and new concepts may not be in ontologies
  • Versioning: Ontology updates can change mappings over time
  • Language: Best support for English; other languages limited
  • Real-time: Not suitable for time-critical clinical applications
  • API Dependency: Requires internet for most lookups (caching helps)

⚠️ Critical: Ontology mapping is for research and data integration, not clinical decision-making. Always validate mappings with domain experts before use in patient care contexts. Never process PHI without appropriate de-identification and compliance measures.

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| --term | str | Required | Single term to map |
| --input | str | Required | Input file path |
| --output | str | Required | Output file path |
| --ontology | str | 'both' | |
| --threshold | float | 0.7 | |
| --format | str | 'json' | |
| --use-api | str | Required | Use UMLS/MeSH APIs |
| --api-key | str | Required | |

Version History

1 version in total

  • v0.1.0 (current)
    2026-03-19 18:03
