Protein Phylogeny

Comprehensive protein family phylogenetic analysis workflow with quality control, conservation analysis, coevolution network analysis, and publication-ready...
billwanttobetop
Uncategorized clawhub v1.5.0 1 version 100000 Key: not required

Overview

Protein Family Phylogenetic Analysis

Complete workflow for protein family evolutionary analysis: quality control → conservation → coevolution → phylogeny → publication report.

Quick Start

Input: FASTA file with protein sequences (any family, any size)

Output: Publication-ready report with phylogenetic tree, conservation analysis, coevolution networks, and high-quality figures

Typical workflow:

# 1. Quality control (removes low-quality sequences)
bash scripts/01_quality_control.sh input.fasta output_dir/

# 2. Conservation analysis
bash scripts/02_conservation.sh output_dir/qc/final.fasta output_dir/

# 3. Coevolution analysis
bash scripts/03_coevolution.sh output_dir/qc/final.fasta output_dir/

# 4. Phylogenetic tree
bash scripts/04_phylogeny.sh output_dir/qc/final.fasta output_dir/

# 5. Generate figures
bash scripts/05_visualize.sh output_dir/

# 6. Create report
bash scripts/06_report.sh output_dir/ "Family Name"

Workflow Overview

Stage 1: Quality Control (references/01-quality-control.md)

Purpose: Filter raw sequences to high-quality, non-redundant dataset

Steps:

  1. Literature validation (remove predicted sequences)
  2. Length filtering (remove fragments/fusions)
  3. CD-HIT redundancy removal (90% identity)
  4. Complexity check (remove low-complexity regions)
  5. Motif validation (confirm family membership)
  6. MAFFT alignment (high accuracy mode)
  7. trimAl trimming (automatic strategy)
  8. Final validation (gap ratio, coverage)
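
Steps 3, 6, and 7 each correspond to a single command-line call. The sketch below only builds those commands as argument lists; it does not run them. The flags that scripts/01_quality_control.sh actually passes are not shown in this document, so the choices here are assumptions: MAFFT L-INS-i (--localpair --maxiterate 1000) stands in for "high accuracy mode" and trimAl -automated1 for "automatic strategy". The file names in the usage lines are hypothetical.

```python
from typing import List


def cdhit_cmd(fasta_in: str, fasta_out: str, identity: float = 0.90) -> List[str]:
    """Build a CD-HIT call; word size 5 is the recommended setting for identity 0.7-1.0."""
    if not 0.70 <= identity <= 1.0:
        raise ValueError("word size 5 is only appropriate for identity >= 0.70")
    return ["cd-hit", "-i", fasta_in, "-o", fasta_out,
            "-c", f"{identity:.2f}", "-n", "5"]


def mafft_cmd(fasta_in: str) -> List[str]:
    """Build a MAFFT L-INS-i call (assumed meaning of 'high accuracy mode').

    MAFFT writes the alignment to stdout, so the caller must redirect it.
    """
    return ["mafft", "--localpair", "--maxiterate", "1000", fasta_in]


def trimal_cmd(aln_in: str, aln_out: str) -> List[str]:
    """Build a trimAl call using its -automated1 heuristic (assumed 'automatic strategy')."""
    return ["trimal", "-in", aln_in, "-out", aln_out, "-automated1"]


if __name__ == "__main__":
    # Hypothetical file names, printed as shell-style command lines.
    for cmd in (cdhit_cmd("filtered.fasta", "nr90.fasta"),
                mafft_cmd("nr90.fasta"),
                trimal_cmd("aligned.fasta", "trimmed.fasta")):
        print(" ".join(cmd))
```

Returning argument lists rather than shell strings keeps the commands safe to pass to subprocess.run without quoting problems.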

Key parameters:

  • CD-HIT threshold: 90% (adjustable 70-95%)
  • Length range: mean ± 2 SD
  • Gap threshold: < 30% per position
  • Motif coverage: > 50%
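
The "mean ± 2 SD" length rule can be sketched in a few lines of Python. This is an illustration, not the script's code: the function name is hypothetical, and the use of the sample standard deviation is an assumption, since the document does not say which estimator is used.

```python
from statistics import mean, stdev
from typing import Dict


def length_filter(seqs: Dict[str, str], n_sd: float = 2.0) -> Dict[str, str]:
    """Keep sequences whose length lies within mean +/- n_sd standard deviations."""
    lengths = [len(s) for s in seqs.values()]
    if len(lengths) < 2:
        # A standard deviation needs at least two values; nothing to filter.
        return dict(seqs)
    mu = mean(lengths)
    sd = stdev(lengths)  # sample SD (assumption)
    lower, upper = mu - n_sd * sd, mu + n_sd * sd
    return {name: s for name, s in seqs.items() if lower <= len(s) <= upper}


if __name__ == "__main__":
    # Nine sequences of typical length plus one likely fusion.
    demo = {f"seq{i}": "A" * 100 for i in range(9)}
    demo["fusion"] = "A" * 1000
    kept = length_filter(demo)
    print(sorted(set(demo) - set(kept)))
```

In very small datasets a single outlier inflates the standard deviation enough to pass the filter, so the rule is most useful once a family has a few dozen sequences.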

Output: qc/final.fasta (high-quality aligned sequences)
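
The per-position gap criterion used in final validation can be checked directly on the aligned sequences. The sketch below is a minimal version under stated assumptions: the alignment is already loaded as equal-length strings, both "-" and "." count as gaps, and the function names are hypothetical.

```python
from typing import List


def gap_fractions(alignment: List[str], gap_chars: str = "-.") -> List[float]:
    """Return the fraction of gap characters in each alignment column."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    return [sum(row[i] in gap_chars for row in alignment) / n
            for i in range(width)]


def columns_over_threshold(alignment: List[str], threshold: float = 0.30) -> List[int]:
    """Return 0-based indices of columns that fail the '< threshold' gap rule."""
    return [i for i, g in enumerate(gap_fractions(alignment)) if g >= threshold]


if __name__ == "__main__":
    aln = ["AC-T", "A--T", "ACGT", "AC-T"]
    print(gap_fractions(aln))
    print(columns_over_threshold(aln))
```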

Stage 2: Conservation Analysis (references/02-conservation.md)

Purpose: Identify functionally important conserved residues

Method: Shannon entropy per alignment column, normalized to the range 0-1 (H_norm)

  • H_norm < 0.3: Highly conserved
  • H_norm 0.3-0.6: Moderately conserved
  • H_norm > 0.6: Variable
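
A minimal sketch of how H_norm could be computed for one alignment column and mapped to the three classes above. The normalization is an assumption: the entropy is divided by log2(20), the maximum for a 20-letter amino acid alphabet, and gap characters are ignored. The reference file may define it differently.

```python
import math
from collections import Counter


def normalized_entropy(column: str, alphabet_size: int = 20,
                       gap_chars: str = "-.") -> float:
    """Shannon entropy of one column, divided by log2(alphabet_size)."""
    residues = [c for c in column.upper() if c not in gap_chars]
    if not residues:
        return float("nan")  # an all-gap column has no defined entropy
    n = len(residues)
    counts = Counter(residues)
    h = sum(-(k / n) * math.log2(k / n) for k in counts.values())
    return h / math.log2(alphabet_size)


def classify(h_norm: float) -> str:
    """Map H_norm to the conservation classes listed in the text."""
    if math.isnan(h_norm):
        return "undefined"
    if h_norm < 0.3:
        return "highly conserved"
    if h_norm <= 0.6:
        return "moderately conserved"
    return "variable"


if __name__ == "__main__":
    for col in ("AAAAAAAA", "AAAAKKRR", "ACDEFGHIKLMNPQRSTVWY"):
        h = normalized_entropy(col)
        print(f"{col}: H_norm={h:.3f} -> {classify(h)}")
```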

Output:

  • Conserved positions list
  • Conservation landscape plot
  • Gap vs conservation scatter plot

Stage 3: Coevolution Analysis (references/03-coevolution.md)

Purpose: Identify residue pairs that evolve together

Method: Normalized Mutual Information (NMI)

  • Corrects for phylogenetic bias
  • Identifies structural/functional coupling
  • Builds coevolution network
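
The core NMI calculation for a pair of alignment columns can be sketched as follows. Two caveats apply. The normalization shown, MI divided by the joint entropy H(X,Y), is only one of several in common use, and the document does not say which one the workflow applies. The sketch also performs no phylogenetic correction; that step is left to the workflow itself.

```python
import math
from collections import Counter


def _entropy(counts: Counter, n: int) -> float:
    """Shannon entropy in bits from a table of counts summing to n."""
    return sum(-(k / n) * math.log2(k / n) for k in counts.values())


def nmi(col_i: str, col_j: str, gap_chars: str = "-.") -> float:
    """Mutual information of two columns, normalized by their joint entropy.

    Sequences with a gap in either column are skipped.
    """
    if len(col_i) != len(col_j):
        raise ValueError("columns must come from the same alignment")
    pairs = [(a, b) for a, b in zip(col_i, col_j)
             if a not in gap_chars and b not in gap_chars]
    n = len(pairs)
    if n == 0:
        return 0.0
    h_i = _entropy(Counter(a for a, _ in pairs), n)
    h_j = _entropy(Counter(b for _, b in pairs), n)
    h_ij = _entropy(Counter(pairs), n)
    if h_ij == 0.0:
        return 0.0  # both columns invariant: no information to share
    return (h_i + h_j - h_ij) / h_ij


if __name__ == "__main__":
    print(nmi("AAKK", "DDEE"))  # perfectly coupled columns
    print(nmi("AAKK", "DEDE"))  # independent columns
```

Invariant columns score zero under this definition, so fully conserved positions do not appear as coevolving partners.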

Output:

  • Coevolved position pairs (NMI scores)
  • Network graph (hub identification)
  • Hub residue heatmap
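
Hub identification reduces to counting, for each position, how many strongly coupled partners it has. The sketch below uses plain Python instead of NetworkX to stay self-contained. The score cutoff and the minimum degree are hypothetical values, not the workflow's settings, and each pair is assumed to appear only once in the input.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_hubs(pairs: Iterable[Tuple[int, int, float]],
              min_score: float = 0.5,
              min_degree: int = 3) -> List[Tuple[int, int]]:
    """Return (position, degree) for positions with at least min_degree strong partners.

    pairs: (position_i, position_j, score) tuples, each pair listed once.
    """
    degree: Dict[int, int] = defaultdict(int)
    for i, j, score in pairs:
        if i != j and score >= min_score:
            degree[i] += 1
            degree[j] += 1
    hubs = [(pos, d) for pos, d in degree.items() if d >= min_degree]
    # Highest degree first; ties broken by position for stable output.
    return sorted(hubs, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    demo_pairs = [(10, 20, 0.9), (10, 30, 0.8), (10, 40, 0.7), (20, 30, 0.2)]
    print(find_hubs(demo_pairs))
```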

Stage 4: Phylogenetic Analysis (references/04-phylogeny.md)

Purpose: Reconstruct evolutionary relationships

Method: IQ-TREE maximum likelihood

  • Automatic model selection (ModelFinder)
  • UFBoot2 ultrafast bootstrap (1000 replicates)
  • Convergence check (UFBoot correlation coefficient > 0.99 required)
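
For orientation, the settings above correspond to an IQ-TREE 2 invocation along the following lines. The sketch builds the argument list without running it. The executable name iqtree2, the seed value, and the output prefix are assumptions, and scripts/04_phylogeny.sh may pass different or additional options.

```python
from typing import List


def iqtree_cmd(alignment: str, prefix: str, replicates: int = 1000,
               seed: int = 42, threads: str = "AUTO") -> List[str]:
    """Build an IQ-TREE 2 call with ModelFinder and ultrafast bootstrap."""
    if replicates < 1000:
        # IQ-TREE refuses ultrafast bootstrap runs with fewer than 1000 replicates.
        raise ValueError("UFBoot needs at least 1000 replicates")
    return [
        "iqtree2",
        "-s", alignment,         # input alignment
        "-m", "MFP",             # ModelFinder Plus: select a model, then build the tree
        "-B", str(replicates),   # ultrafast bootstrap replicates
        "--seed", str(seed),     # fixed seed for reproducibility
        "-T", threads,           # thread count, or AUTO to let IQ-TREE choose
        "--prefix", prefix,      # prefix for .treefile, .contree, .iqtree, .log
    ]


if __name__ == "__main__":
    # Hypothetical paths.
    print(" ".join(iqtree_cmd("output_dir/qc/final.fasta", "output_dir/phylogeny/family")))
```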

Output:

  • Phylogenetic tree (.treefile)
  • Bootstrap consensus tree (.contree)
  • Model parameters (.iqtree)
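
Branch support can be summarized straight from the Newick text of the tree file. The sketch below is a rough illustration under stated assumptions: internal node labels are plain UFBoot percentages (not combined labels such as SH-aLRT/UFBoot), and taxon names contain no parentheses. The 95% cutoff follows the usual guidance for interpreting UFBoot values; it is not a setting taken from this workflow.

```python
import re
from typing import List

# A support value is the number that directly follows a closing parenthesis.
_SUPPORT = re.compile(r"\)(\d+(?:\.\d+)?)")


def support_values(newick: str) -> List[float]:
    """Extract internal-node support values from a Newick string."""
    return [float(value) for value in _SUPPORT.findall(newick)]


def fraction_supported(newick: str, cutoff: float = 95.0) -> float:
    """Fraction of labeled internal branches with support >= cutoff."""
    values = support_values(newick)
    if not values:
        return 0.0
    return sum(v >= cutoff for v in values) / len(values)


if __name__ == "__main__":
    tree = "((A:0.1,B:0.2)98:0.05,(C:0.1,D:0.1)72:0.05,E:0.3);"
    print(support_values(tree))
    print(fraction_supported(tree))
```

For real trees with quoted labels or comments, a Newick parser such as Bio.Phylo is the safer choice; the regular expression only covers the simple case.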

Stage 5: Visualization (references/05-visualization.md)

Purpose: Generate publication-quality figures (300 DPI)

Figures:

  1. Workflow diagram
  2. Conservation heatmap
  3. Coevolution network
  4. Hub analysis
  5. Quality metrics
  6. Phylogenetic tree
  7. Bootstrap distribution
  8. Supplementary plots

Style: Clean, colorblind-friendly, Nature/Science standards

Stage 6: Report Generation (references/06-report.md)

Purpose: Create comprehensive analysis report

Sections:

  1. Overview (dataset summary)
  2. Quality control (methods + results)
  3. Conservation analysis (algorithms + findings)
  4. Coevolution analysis (networks + hubs)
  5. Phylogenetic analysis (tree + support)
  6. Quality assessment (standards comparison)
  7. Conclusions (biological insights)

Format: Markdown → Feishu/Word/PDF

Key Features

AI-Friendly Design

  • Modular scripts: Each stage is independent
  • Clear parameters: All thresholds documented
  • Error handling: Automatic validation at each step
  • Progress tracking: JSON state files
  • Resume capability: Skip completed stages
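
Progress tracking and resume can be pictured as a small JSON file that records completed stages. The sketch below is an illustration only: the stage names, the file layout, and the function names are hypothetical, and the state files the scripts actually write may look different.

```python
import json
from pathlib import Path
from typing import List

# Hypothetical stage names, in execution order.
STAGES = ["quality_control", "conservation", "coevolution",
          "phylogeny", "visualize", "report"]


def load_state(path: Path) -> dict:
    """Read the state file, or return an empty state if it does not exist yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": []}


def mark_done(path: Path, stage: str) -> None:
    """Record a finished stage; marking the same stage twice is harmless."""
    state = load_state(path)
    if stage not in state["completed"]:
        state["completed"].append(stage)
    path.write_text(json.dumps(state, indent=2))


def pending_stages(path: Path) -> List[str]:
    """Stages that still need to run, in order."""
    done = set(load_state(path)["completed"])
    return [stage for stage in STAGES if stage not in done]
```

A driver would call pending_stages before each run and mark_done after each stage exits successfully, so an interrupted analysis restarts at the first unfinished stage.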

Token Efficiency

  • Progressive disclosure: Load only needed references
  • Compact instructions: Essential info only
  • Script execution: No need to read code
  • Cached results: Reuse intermediate files

Professional Quality

  • Publication standards: All methods peer-reviewed
  • Reproducible: Fixed random seeds, versioned tools
  • Validated: Tested on 10+ protein families
  • Documented: Complete algorithm explanations

Dependencies

Required tools:

  • CD-HIT v4.8.1+
  • MAFFT v7.490+
  • trimAl v1.4+
  • IQ-TREE v2.0+
  • Python 3.8+ (BioPython, NumPy, Matplotlib, NetworkX)
  • R 4.0+ (ape, phytools)

Installation:

bash scripts/install_dependencies.sh

Common Pitfalls

1. Low Sequence Similarity (< 25%)

Problem: Alignment unreliable, phylogeny uncertain

Solution:

  • Use profile HMM (HMMER) instead of MAFFT
  • Consider domain-based analysis
  • Increase CD-HIT threshold to 95%

2. High Gap Ratio (> 30%)

Problem: Many unreliable positions

Solution:

  • Stricter trimAl settings (-gt 0.8 keeps only columns in which at least 80% of sequences have a residue)
  • Manual inspection of alignment
  • Remove problematic sequences

3. Bootstrap Convergence Failure (< 0.99)

Problem: Tree topology unstable

Solution:

  • Increase bootstrap replicates (2000+) or the maximum number of UFBoot iterations (IQ-TREE -nm, default 1000)
  • Try different substitution models
  • Check for long-branch attraction

4. No Conserved Motifs

Problem: Family definition unclear

Solution:

  • Verify sequences are truly homologous
  • Use structural alignment (DALI, TM-align)
  • Consider broader superfamily analysis

Advanced Usage

Custom Quality Control

Edit scripts/01_quality_control.sh parameters:

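CDHIT_THRESHOLD=0.85  # Identity cutoff; lower than the 0.90 default, so more sequences are merged
MIN_LENGTH=200        # Minimum sequence length to keep (residues)
MAX_LENGTH=600        # Maximum sequence length to keep (residues)
GAP_THRESHOLD=0.25    # Maximum gap fraction per position (stricter than the 0.30 default)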

Alternative Phylogeny Methods

See references/04-phylogeny.md for:

  • Bayesian inference (MrBayes)
  • Distance methods (FastTree)
  • Parsimony (PAUP*)

Custom Visualization

Edit scripts/05_visualize.sh for:

  • Color schemes
  • Figure dimensions
  • Font sizes
  • Layout styles

Troubleshooting

Issue: CD-HIT crashes with large datasets

Fix: Split input, process in batches, merge results

Issue: IQ-TREE runs forever

Fix: Use -fast mode (--fast in IQ-TREE 2) or reduce bootstrap replicates (UFBoot accepts no fewer than 1000)

Issue: Figures look pixelated

Fix: Increase DPI in scripts/05_visualize.sh (default 300)

Issue: Report generation fails

Fix: Check all intermediate files exist, rerun failed stages

References

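For detailed methodology, see the per-stage reference files:

  • references/01-quality-control.md
  • references/02-conservation.md
  • references/03-coevolution.md
  • references/04-phylogeny.md
  • references/05-visualization.md
  • references/06-report.md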

Citation

If you use this workflow, please cite:

  • CD-HIT: Li & Godzik (2006) Bioinformatics
  • MAFFT: Katoh & Standley (2013) Mol Biol Evol
  • trimAl: Capella-Gutiérrez et al. (2009) Bioinformatics
  • IQ-TREE: Nguyen et al. (2015) Mol Biol Evol
  • This workflow: [Your publication]

Example Usage

# Download your sequences
# (from UniProt, NCBI, or your own database)

# Run full workflow
bash scripts/run_full_workflow.sh sequences.fasta analysis_output/ "Your Family Name"

# Results in analysis_output/:
# - qc/final.fasta (high-quality sequences)
# - conservation/ (conserved positions)
# - coevolution/ (coevolved pairs)
# - phylogeny/ (phylogenetic tree)
# - figures/ (publication-quality plots)
# - report.md (complete analysis)

Version History

1 version in total

  • v1.5.0 (current)
    2026-05-07 12:51 (security checks: safe)
