Batch Image Mask Adder v4.3

Purpose

This skill automates batch processing of images: it detects faces and overlays masks on them. Detection uses the OpenCV DNN SSD face detector (ResNet-10 backbone), with multi-scale tiled detection for small or distant faces and post-processing that filters out false positives. The skill also provides automatic front/side face recognition to select the appropriate mask template, an enhanced detection mode (--enhanced) for crowded classroom or group photos with many small or distant faces, upward-angle adaptive positioning for selfie and wide-angle photos, manual face coordinate injection for missed detections, and model instance reuse for efficient batch processing.

Note: The DNN model weights file (res10_300x300_ssd_iter_140000.caffemodel, ~10MB) is not bundled in the skill package. It is automatically downloaded from OpenCV's official GitHub repository on first use. Subsequent runs reuse the downloaded models with no network required.
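
The download-once behaviour described in the note can be sketched as below. This is a minimal illustration, not the skill's actual code: `ensure_model` is a hypothetical helper, and the download URL is left as a parameter because the exact location is not stated here.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_model(path, url, fetch=urlretrieve):
    """Download the weights to `path` only when no cached copy exists.

    Returns True if a download happened, False if the cached file was reused.
    `fetch` is injectable so the logic can be exercised without network access.
    """
    path = Path(path)
    if path.exists() and path.stat().st_size > 0:
        return False  # cached copy found: no network needed
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".part")
    fetch(url, str(tmp))  # write to a temporary name first
    tmp.replace(path)     # rename so a broken download never looks complete
    return True
```

Downloading to a temporary name and renaming afterwards means an interrupted first run does not leave a truncated weights file that later runs would mistake for a valid cache.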

When to Use This Skill

Use this skill when users need to:

Add masks to faces in images automatically
Process multiple images in batch
Handle images with many people, small faces, or varying distances (e.g., classroom photos, group photos)
Use enhanced detection mode (--enhanced) for crowded scenes with 20+ people where standard mode misses small/distant faces
Manually specify face coordinates when auto-detection misses some faces (--manual-faces)
Handle selfie/wide-angle photos where mask positioning needs auto-adjustment
Set up Python environment and install required libraries (OpenCV, Pillow, NumPy)

Key Technical Improvements (v4.3 vs v4.1 vs v4.0 vs v3.1 vs v2.0 vs v1.0)

Face Detection: DNN SSD vs Haar Cascade

|---------|---------------------|-----------------|---------------------|----------------------|---------------|

| Manual face override | No | No | No | No | Yes (--manual-faces) |

| Batch dual-template | No | No | No | No | Yes |

Why DNN is Better

The Haar Cascade classifier relies on simple rectangular features and struggles with:

Small faces (under ~50px)
Slightly angled faces
Varying lighting conditions
Dense face clusters (classrooms, group photos)

The DNN SSD detector (ResNet-10 backbone) uses deep learning for much more robust face detection, combined with our multi-scale approach.

Workflow

Initial Setup (if needed)

Verify Python Installation

Check if Python 3.7+ is installed and accessible
Ensure Python is added to system PATH

Install Required Libraries

```bash
pip install opencv-python pillow numpy
```

Image Processing Workflow

Load Resources

Load DNN face detection model (ResNet-10 SSD, models/ directory)
Load the mask template image and make its dark or white background transparent (an existing alpha channel is preserved)

Detect Faces (Multi-Scale)

Standard full-image detection at 300x300
Multi-level tiled detection: 2×2, 3×3, 4×4 tiles with 30% overlap (standard mode)
★ Enhanced mode (v4.0): Extended tiling from 2×2 through 6×6 (2×2 through 8×8 before the v4.1 tuning), plus local region enhancement (back row, right half, center area)
NMS deduplication to merge overlapping detections
Post-processing filter to eliminate false positives
★ Enhanced mode: Multi-stage filtering (aspect ratio, area anomaly, spatial zones) + overlap cleanup
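
The NMS deduplication step can be illustrated with a plain greedy implementation. This is a sketch only: the `(x, y, w, h, conf)` tuple layout and the IoU threshold of 0.3 are assumptions, since the values the scripts actually use are not given here.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    iw = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.3):
    """Greedy NMS over (x, y, w, h, conf) tuples.

    Boxes are visited from highest to lowest confidence; a box is kept only
    if it does not overlap an already kept box by more than the threshold.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(iou(det[:4], k[:4]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept
```

Because the same face is usually found by the full-image pass and by several overlapping tiles, this merge step is what keeps each face from receiving more than one mask.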

Add Masks

Position the mask from 30% of the face height (just below the eyes) down to 125% (below the chin)
Mask width extends 10% beyond face edges on both sides for full cheek coverage
This ensures eyes remain visible while nose, mouth, chin and cheeks are fully covered
★ Front/Side face multi-feature recognition (v3.1): Uses AR deviation + edge asymmetry + gradient bias scoring
★ Dual mask template support: Automatically selects front or side mask based on face orientation
★ Side face direction detection: Automatically flips side mask for left/right facing
Alpha compositing with transparency for smooth blending

Save Results

Save processed images to output directory
Preserve original quality (JPEG quality=95)
Images with no faces detected are copied as-is

Bundled Resources

Models

`models/deploy.prototxt` & `models/res10_300x300_ssd_iter_140000.caffemodel`

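OpenCV DNN face detection model files (ResNet-10 SSD architecture), pre-trained on a face detection dataset and significantly more accurate than Haar Cascade. The .caffemodel weights are not shipped in the skill package; they are downloaded into models/ on first use.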

Scripts

`scripts/add_mask_with_image.py`

Core script for adding masks to images using DNN detection.

Usage:

# Single mask mode (all faces use same mask)
python scripts/add_mask_with_image.py <input_image> [mask_image] [output_path]

# Dual mask mode (auto front/side face recognition)
python scripts/add_mask_with_image.py --front <front_mask> --side <side_mask> <input_image> [output_path]

# Enhanced detection mode (for crowded classroom/group photos with many small faces)
python scripts/add_mask_with_image.py --enhanced <input_image> [mask_image] [output_path]
python scripts/add_mask_with_image.py --enhanced --front <front_mask> --side <side_mask> <input_image> [output_path]

# Manual face coordinates (for faces missed by auto-detection)
python scripts/add_mask_with_image.py --manual-faces "x,y,w,h;x,y,w,h" <input_image> [mask_image] [output_path]

Features:

DNN deep learning face detection
Multi-scale tiled detection (standard: 2×2 to 4×4; enhanced: 2×2 to 6×6)
Intelligent false-positive filtering
Auto background removal for mask templates (supports both dark and white backgrounds)
Smart alpha channel detection — preserves existing transparency in PNG templates
Processes ALL detected faces (not just the largest one)
Optimized mask positioning (30%-125% of face height, 10% side extension)
★ Front/side face multi-feature recognition (AR deviation + edge asymmetry + gradient bias)
★ Dual mask template support with automatic selection
★ Side face direction detection with auto-flip
★ Enhanced detection mode (--enhanced) for crowded scenes:
Local region enhancement (back row, right half, center area independently detected)
Multi-stage false positive filtering (aspect ratio constraints, spatial zone filtering)
Overlap detection cleanup (prevents duplicate masks on same face)
★ Skin color verification (v4.1): HSV-based skin tone check eliminates blackboard/desk/backpack false positives
★ Hand/arm region spatial filtering (v4.1): lateral spatial analysis removes hand/gesture misdetections
★ Cartoon/virtual character filtering (v4.2): auto-skip anime/illustration/virtual faces, only add masks to real people
Texture uniformity analysis (Laplacian median/mean ratio)
Flat region detection (Sobel gradient near-zero pixel ratio)
Extreme saturation check (high-saturation pixel ratio)
★ Upward-angle adaptive positioning (v4.3): auto-adjusts mask y-position for selfie/wide-angle photos
Detects elongated face boxes (fh/fw > 1.35) indicating upward camera angle
Shifts mask start from 30% to 15% to prevent masks sitting too low
★ Manual face coordinates (v4.3): --manual-faces "x,y,w,h;x,y,w,h" for detection misses
★ Model instance reuse (v4.3): accepts pre-loaded net/mask objects for batch efficiency

`scripts/batch_process.py`

Batch processing script with shared model instance for performance.

Usage:

# Single mask batch processing
python scripts/batch_process.py <input_dir> [mask_image] [output_dir]

# Dual mask batch processing
python scripts/batch_process.py --front <front_mask> --side <side_mask> <input_dir> [output_dir]

# Enhanced detection batch processing
python scripts/batch_process.py --enhanced --front <front_mask> --side <side_mask> <input_dir> [output_dir]

Features:

Shared DNN model & mask template across all images (loaded once)
★ Full dual-template support (v3.0): --front / --side for batch processing
★ Enhanced detection support (v3.0): --enhanced for batch processing
★ Cartoon filtering in batch mode (v3.0)
Recursive directory scanning
Progress tracking with time estimates
Detailed statistics (faces found, front/side counts, cartoon skipped, timing)
Preserves directory structure in output
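
Recursive scanning with a mirrored output tree could look roughly like this. `plan_outputs` is a hypothetical helper written for illustration, and the extension set is taken from the Supported Formats list; the real batch_process.py may differ.

```python
from pathlib import Path

# Extensions from the Supported Formats section (matched case-insensitively).
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".webp"}


def plan_outputs(input_dir, output_dir):
    """Yield (source, destination) pairs for every image under input_dir.

    The destination keeps the path relative to input_dir, so sub-directories
    are mirrored under output_dir.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    for src in sorted(input_dir.rglob("*")):
        if src.is_file() and src.suffix.lower() in IMAGE_EXTS:
            yield src, output_dir / src.relative_to(input_dir)
```

A caller would create `dst.parent` before writing each result, and would typically keep the output directory outside the input directory so processed images are not picked up on a later scan.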

Assets

`assets/mask.png`

Default mask template (currently the "little red flower" mask design). Used in single-mask mode.

`assets/mask_front.png`

Front-face mask template. Used in dual-mask mode for faces detected as frontal.

`assets/mask_side.png`

Side-face mask template. Used in dual-mask mode for faces detected as side-facing. Auto-flipped horizontally when the face is facing right (template default is left-facing).

References

`references/usage_guide.md`

Complete user guide with troubleshooting and examples.

Configuration Parameters

These can be adjusted in add_mask_with_image.py:

# Detection (Standard mode)
CONFIDENCE_THRESHOLD = 0.25   # Balanced threshold for recall vs false positive control
MIN_FACE_SIZE = 12            # Minimum face size in pixels
ASPECT_RATIO_RANGE = (0.4, 2.5)  # Face width/height ratio filter

# Enhanced mode detection (--enhanced flag)
ENHANCED_CONFIDENCE_THRESHOLD = 0.20  # Lower threshold for higher recall (v4.1: tuned from 0.15)
ENHANCED_MIN_FACE_SIZE = 12           # Minimum face size in enhanced mode
ENHANCED_TILE_LEVELS = [2,3,4,5,6]    # Extended tiling levels (v4.1: tuned from 2-8 to 2-6)
ENHANCED_OVERLAP_THRESHOLD = 0.30     # Overlap cleanup threshold (v4.1: tuned from 0.35)

# Skin color verification (v4.1)
SKIN_COLOR_MIN_RATIO = 0.15           # Min skin pixel ratio in upper 40% of face box
SKIN_COLOR_CONF_THRESHOLD = 0.45      # Detections below this conf require skin check
SKIN_HSV_LOWER1 = (0, 30, 60)         # Skin HSV range 1 lower bound
SKIN_HSV_UPPER1 = (20, 180, 255)      # Skin HSV range 1 upper bound
SKIN_HSV_LOWER2 = (160, 30, 60)       # Skin HSV range 2 lower bound (red wrap-around)
SKIN_HSV_UPPER2 = (180, 180, 255)     # Skin HSV range 2 upper bound

# Upward-angle adaptive positioning (v4.3)
UPWARD_ANGLE_ASPECT_THRESHOLD = 1.35  # fh/fw > 1.35 → upward angle detected
UPWARD_ANGLE_Y_START_RATIO = 0.15     # Mask starts at 15% (vs normal 30%) for selfies
UPWARD_ANGLE_Y_END_RATIO = 1.10       # Slightly shorter mask for selfies

# Mask positioning (ratios relative to face bounding box)
MASK_Y_START_RATIO = 0.30    # Start at 30% = below eyes (covers nose, mouth, chin)
MASK_Y_END_RATIO = 1.25      # End below chin for full coverage
MASK_X_START_RATIO = -0.10   # Extend 10% beyond left edge (cover cheeks)
MASK_X_END_RATIO = 1.10      # Extend 10% beyond right edge (cover cheeks)

# Front/side face classification (v3.1 multi-feature)
# Composite scoring: ar_deviation_score * 0.4 + edge_asymmetry_score * 0.3 + gradient_bias_score * 0.3
# total_score > 0.40 → side face

Lessons Learned (from iterative optimization)

1. Mask Positioning Is Critical

30% start = optimal for privacy protection — covers everything below eyes (nose, mouth, chin, cheeks)
50% start = too low for full coverage — only covers mouth area, nose and cheeks partially exposed
35% start = good compromise — covers nose and mouth but may miss some cheek area
Width should extend 10% beyond face edges — exact face width leaves side cheeks exposed on angled faces
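
To make the ratios concrete, the sketch below turns a face bounding box into a mask rectangle using the default values from the configuration section. `mask_rect` is a hypothetical name, and rounding to whole pixels is an assumption.

```python
# Defaults from the Configuration Parameters section.
MASK_Y_START_RATIO = 0.30   # just below the eyes
MASK_Y_END_RATIO = 1.25     # below the chin
MASK_X_START_RATIO = -0.10  # 10% beyond the left edge
MASK_X_END_RATIO = 1.10     # 10% beyond the right edge


def mask_rect(face):
    """Return (left, top, right, bottom) of the mask for a face (x, y, w, h)."""
    fx, fy, fw, fh = face
    left = round(fx + fw * MASK_X_START_RATIO)
    right = round(fx + fw * MASK_X_END_RATIO)
    top = round(fy + fh * MASK_Y_START_RATIO)
    bottom = round(fy + fh * MASK_Y_END_RATIO)
    return left, top, right, bottom
```

For a face at x=100, y=200 that is 80 px wide and 100 px tall, this gives a mask spanning x 92 to 188 and y 230 to 325: 20% wider than the face and starting 30 px below the top of the box.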

2. Multi-Scale Tiled Detection Is Essential for Group Photos

Standard 300×300 input cannot resolve faces smaller than ~50px in the original image. Multi-scale tiled detection solves this:

2×2 tiles: catches medium-distance faces
3×3 tiles: catches far-distance faces
4×4 tiles: catches very small/distant faces
30% tile overlap prevents faces at tile boundaries from being missed
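
One way to generate the overlapping tiles is sketched below. It assumes that "30% overlap" means neighbouring tiles share 30% of a tile's width or height; `tile_boxes` and `to_image_coords` are hypothetical names, not functions from the scripts.

```python
def tile_boxes(width, height, n, overlap=0.30):
    """Return n*n tiles (x, y, w, h) covering the image.

    Tile size is chosen so that n tiles, each shifted by (1 - overlap) of a
    tile, span the full width and height.
    """
    tile_w = width / (n - overlap * (n - 1))
    tile_h = height / (n - overlap * (n - 1))
    step_x, step_y = tile_w * (1 - overlap), tile_h * (1 - overlap)
    boxes = []
    for row in range(n):
        for col in range(n):
            x, y = round(col * step_x), round(row * step_y)
            w = min(round(tile_w), width - x)   # clamp to the image border
            h = min(round(tile_h), height - y)
            boxes.append((x, y, w, h))
    return boxes


def to_image_coords(det, tile):
    """Shift an (x, y, w, h, conf) detection found inside `tile` back to
    full-image coordinates."""
    x, y, w, h, conf = det
    return (x + tile[0], y + tile[1], w, h, conf)
```

Each tile is resized to the 300×300 network input on its own, so a face that occupies 20 px in a 1700 px wide photo occupies proportionally more of a 2×2 tile and more again of a 4×4 tile. Detections from every tile are shifted back with `to_image_coords` and then merged by NMS.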

3. Balanced Confidence Threshold + Two-Stage Smart Filtering

Using a high threshold (e.g., 0.45) misses many real faces. A very low threshold (0.18) catches nearly all faces but introduces too many false positives. The optimal approach: moderate threshold (0.25) + two-stage post-processing:

Stage 1 (Position/Size rules): Filter by location, area anomaly, and size consistency
Stage 2 (Spatial relationship): Filter body parts — if a low-confidence detection is directly below a high-confidence face, it's likely a chest/waist/knee misdetection

4. Body Part Spatial Filtering Is Critical for Group Photos

In standing group photos, the DNN detector often picks up clothing patterns, buttons, belt buckles, and knee textures as faces. These false positives share a key spatial property: they are directly below a real face (same x-range). The body-part filter checks if a low-confidence detection falls within the vertical column under any high-confidence face anchor, effectively eliminating chest/waist/knee misdetections.

5. Dark/White Background Removal for Mask Templates

Many mask template images have black or white backgrounds. Simply overlaying them creates ugly rectangles. Solution:

Smart alpha detection: If PNG already has alpha channel with >10% transparent pixels, preserve as-is
Auto background removal (for images without alpha):
Pixels with R,G,B all < 30 → fully transparent (dark background)
Pixels with R,G,B between 30-60 → semi-transparent (smooth edge)
Pixels with R,G,B all > 245 → fully transparent (white background)
Pixels with R,G,B between 230-245 → semi-transparent (smooth edge)
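
The thresholds above can be read as a per-pixel alpha rule. The sketch below is one interpretation: the linear ramps inside the semi-transparent bands, and the use of the brightest channel for the dark test and the darkest channel for the white test, are assumptions.

```python
def background_alpha(r, g, b):
    """Alpha (0-255) for one RGB pixel of a template without an alpha channel."""
    brightest, darkest = max(r, g, b), min(r, g, b)
    if brightest < 30:    # all channels < 30: dark background
        return 0
    if brightest < 60:    # 30-60 band: fade in linearly (assumed ramp)
        return round(255 * (brightest - 30) / 30)
    if darkest > 245:     # all channels > 245: white background
        return 0
    if darkest > 230:     # 230-245 band: fade out linearly (assumed ramp)
        return round(255 * (245 - darkest) / 15)
    return 255            # mask content stays opaque
```

The semi-transparent bands matter mostly on anti-aliased template edges, where a hard cut-off would leave a dark or white fringe around the mask.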

6. Front/Side Face Recognition Strategy (v3.1)

DNN SSD detection boxes don't distinguish face orientation well (boxes are often near-square for both). The v3.1 strategy uses multi-feature composite scoring for significantly improved accuracy:

AR deviation score (weight 0.4): How much the aspect ratio deviates from 1.0 (both directions). Normalized by 0.12.
Edge asymmetry score (weight 0.3): Sobel gradient magnitude difference between left/right halves. Normalized by 35%.
Gradient bias score (weight 0.3): Absolute horizontal gradient distribution offset. Normalized by 10.0.
Composite score > 0.55 → side face
Definite thresholds: ratio > 0.95 → always front; ratio < 0.70 → always side
Side face direction: Brighter half = face-facing direction (more skin visible)
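
The composite scoring can be sketched as follows. The weights and normalizers come from the list above; clipping each feature score to 1.0 is an assumption. The decision threshold is left as a parameter because this section quotes 0.55 while the configuration comments quote 0.40. The "definite thresholds" rule is omitted since the ratio it refers to is not defined precisely here.

```python
def side_face_score(aspect_ratio, edge_asymmetry, gradient_bias):
    """Weighted side-face score in the range [0, 1].

    aspect_ratio   : width / height of the detection box
    edge_asymmetry : left/right Sobel magnitude difference as a fraction (0.35 = 35%)
    gradient_bias  : horizontal gradient distribution offset
    """
    ar_score = min(abs(aspect_ratio - 1.0) / 0.12, 1.0)
    edge_score = min(abs(edge_asymmetry) / 0.35, 1.0)
    grad_score = min(abs(gradient_bias) / 10.0, 1.0)
    return 0.4 * ar_score + 0.3 * edge_score + 0.3 * grad_score


def classify_orientation(aspect_ratio, edge_asymmetry, gradient_bias, threshold=0.55):
    """Return 'side' when the composite score exceeds the threshold, else 'front'."""
    score = side_face_score(aspect_ratio, edge_asymmetry, gradient_bias)
    return "side" if score > threshold else "front"
```

A near-square, symmetric box such as aspect ratio 1.06, asymmetry 7%, bias 2.0 scores about 0.32 and is treated as frontal under either threshold.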

7. Process ALL Faces, Not Just the Largest

The original v1.0 script only processed the single largest face. For group photos (classrooms, teams), ALL faces need masks.

8. Enhanced Detection Mode for Crowded Scenes (v4.0)

Standard detection (2-4 tile levels, conf=0.25) works well for most scenarios but misses many small/distant faces in crowded classroom photos (20+ people). The enhanced mode (--enhanced) addresses this with:

Extended tiling (2-8 levels in v4.0, reduced to 2-6 in v4.1): More granular tiles mean small faces get proportionally larger in each tile, improving detection
Local region enhancement: Key areas (back row 28-55%, right half 50-100%, center 30-70%) are cropped and independently detected. Each crop gets its own full + tiled detection passes, dramatically increasing recall for faces that were too small in the full image
Lower confidence threshold (0.15 in v4.0, raised to 0.20 in v4.1, vs 0.25 in standard mode): Catches more borderline detections
Multi-stage false positive filtering: Tighter aspect ratio constraints (0.65-1.50 vs 0.4-2.5), multi-level bottom/top spatial zone filtering, and anomaly area detection
Overlap cleanup: After NMS, an additional pass removes detections where more than 35% of a smaller box (30% since v4.1) overlaps a higher-confidence box, preventing duplicate masks
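
The overlap cleanup differs from NMS in that it measures overlap against the smaller box rather than the union, so a small box nested inside a large one is caught even when its IoU is tiny. A minimal sketch, assuming `(x, y, w, h, conf)` tuples and the v4.1 default of 0.30 (function names are hypothetical):

```python
ENHANCED_OVERLAP_THRESHOLD = 0.30  # v4.1 value from the configuration section


def overlap_of_smaller(a, b):
    """Intersection area divided by the area of the smaller of the two boxes."""
    iw = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    smaller = min(a[2] * a[3], b[2] * b[3])
    return iw * ih / smaller if smaller > 0 else 0.0


def overlap_cleanup(detections, threshold=ENHANCED_OVERLAP_THRESHOLD):
    """Drop any detection that overlaps a higher-confidence one too much."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(overlap_of_smaller(det, k) <= threshold for k in kept):
            kept.append(det)
    return kept
```

In the test case of a 20×20 box fully inside a 100×100 box, IoU is only 0.04, yet the smaller box is entirely covered, so this pass removes it where NMS alone would have kept both.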

9. Skin Color Verification for False Positive Reduction (v4.1)

Even with spatial filtering, some low-confidence detections on blackboards, desks, backpacks, or wall decorations slip through. The v4.1 skin color verification provides a definitive signal:

Core idea: Real human faces have skin in the upper portion (forehead/eye area), while non-face objects (text, patterns, furniture) do not
Implementation: Extract upper 40% of detection box → convert to HSV → check skin-tone pixel ratio against dual HSV ranges (covers most human skin tones)
Threshold: skin_ratio >= 15% → pass as real face. This is conservative enough to accept most faces (even partially occluded) while rejecting obvious non-face detections
Only applied to low-confidence detections (conf < 0.45) — high-confidence detections are already reliable and don't need verification
Real-world impact: Eliminated blackboard/wall decoration false positives in the 1班 ("Class 1") photo (3 non-face detections removed with skin=0.00-0.01), while preserving all real faces in 4.jpg (3 low-confidence faces passed with skin=0.25-0.74)
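
The skin check can be illustrated without OpenCV by converting pixels to OpenCV-style HSV with the standard library. This is a sketch under assumptions: the face crop is passed as rows of `(r, g, b)` tuples, and the function names are hypothetical. The HSV ranges and the 15% ratio come from the configuration section.

```python
import colorsys

# Dual HSV ranges (OpenCV scale: H 0-180, S and V 0-255) from the configuration.
SKIN_RANGES = [((0, 30, 60), (20, 180, 255)), ((160, 30, 60), (180, 180, 255))]
SKIN_COLOR_MIN_RATIO = 0.15


def is_skin(r, g, b):
    """True if an RGB pixel (0-255 per channel) falls inside either skin range."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    hsv = (h * 180, s * 255, v * 255)
    return any(all(lo[i] <= hsv[i] <= hi[i] for i in range(3)) for lo, hi in SKIN_RANGES)


def skin_ratio(face_rows):
    """Share of skin-toned pixels in the upper 40% of a face crop."""
    top = face_rows[: max(1, int(len(face_rows) * 0.4))]
    pixels = [p for row in top for p in row]
    if not pixels:
        return 0.0
    return sum(is_skin(*p) for p in pixels) / len(pixels)


def passes_skin_check(face_rows):
    """Apply the 15% minimum ratio used for low-confidence detections."""
    return skin_ratio(face_rows) >= SKIN_COLOR_MIN_RATIO
```

With OpenCV available, the same test is usually expressed with `cv2.cvtColor` and two `cv2.inRange` masks; the pure-Python form above only serves to show the arithmetic.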

10. Hand/Arm Region Spatial Filtering (v4.1)

In addition to the body-below filter (v4.0), v4.1 adds lateral spatial filtering to catch hand/arm misdetections:

Hands appear at 0.8-3.0× face width distance laterally from the head
They're below eye level but not necessarily directly below the face
They're typically smaller than actual faces (< 50% of median face area)
Combined with the low-confidence threshold, this effectively filters raised hands, gesturing arms, etc.
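
These four conditions can be combined into a single predicate, as in the sketch below. Several details are assumptions made for illustration: lateral distance is measured between box centers, the eye line is placed at 40% of the anchor face height, and the confidence limit of 0.45 is the one quoted in the enhanced-mode rules.

```python
def is_hand_candidate(det, anchor, median_face_area, conf_limit=0.45):
    """True if `det` looks like a hand or arm next to the `anchor` face.

    Both boxes are (x, y, w, h, conf); `anchor` is a high-confidence face.
    """
    x, y, w, h, conf = det
    ax, ay, aw, ah, _ = anchor
    lateral = abs((x + w / 2) - (ax + aw / 2)) / aw  # in units of face width
    eye_level = ay + 0.4 * ah                        # assumed eye line
    return (
        conf < conf_limit
        and 0.8 <= lateral <= 3.0
        and (y + h / 2) > eye_level
        and w * h < 0.5 * median_face_area
    )
```

The lower bound of 0.8 face widths keeps the rule from overlapping with the body-below filter, which handles detections directly under the face.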

11. Cartoon/Virtual Character Detection (v4.2)

When processing screenshots or images containing both real people and cartoon/anime/illustration characters (e.g., PPT slides with avatar illustrations), the DNN detector picks up cartoon faces too. The v4.2 cartoon filter distinguishes them using three texture-based features:

Texture uniformity (Laplacian median/mean ratio): Real skin has pervasive micro-texture (pores, fine lines) → median ≈ 0.30-0.47 of mean. Cartoon has large flat regions with sharp edges → median ≈ 0.10-0.25 of mean.
Flat pixel ratio (Sobel gradient near-zero): Real faces have subtle gradients everywhere → flat ratio 0.00-0.06. Cartoon has solid color blocks → flat ratio > 0.10.
High saturation ratio (S > 150 pixels): Cartoon uses vivid colors with many super-saturated pixels. Real skin stays in moderate saturation range.

Key design decisions:

Cartoon score threshold set at 0.50: real faces score at most ~0.30 and clear cartoons score above 0.50, which leaves a 0.20 safety margin
All features are resolution-normalized (resized to 64×64 for analysis) to handle varying face sizes
Works even on small, JPEG-compressed cartoon faces (tested with 72×84 PPT illustration)

Real-world results of enhanced detection mode (lesson 8) on classroom photos:

4.jpg (1706×1279, ~40 students): 24 faces (standard) → 37 faces (enhanced), +54%
5.jpg (3264×2448, ~25 students): 13 faces (standard) → 26 faces (enhanced), +100%

Trade-off: Enhanced mode is 3-5x slower due to more detection passes. Use standard mode for simple photos, enhanced mode for crowded scenes.

12. Upward-Angle Adaptive Mask Positioning (v4.3)

Selfies and wide-angle group photos taken from a low camera angle produce elongated face bounding boxes (height/width ratio > 1.35). With the standard mask positioning (y_start=30%), the mask sits too low — leaving the nose exposed or barely covered. The v4.3 adaptive positioning detects this condition and shifts the mask upward:

Trigger condition: face_height / face_width > 1.35 AND orientation is 'front'
Adjusted positioning: y_start moves from 0.30 to 0.15 (catches from eyebrow level down), y_end from 1.25 to 1.10 (tighter chin fit)
Why front-only: Side faces already have different proportions; applying the shift would over-correct
Real-world impact: Fixed mask misalignment on 合影-05 ("group photo 05", a selfie-angle group photo) where masks were displaced downward due to elongated detection boxes
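
The trigger condition and the adjusted ratios can be sketched as a small selection function. The constants are the ones listed in the configuration section; `mask_y_ratios` and `mask_y_span` are hypothetical names.

```python
UPWARD_ANGLE_ASPECT_THRESHOLD = 1.35
NORMAL_Y = (0.30, 1.25)   # MASK_Y_START_RATIO, MASK_Y_END_RATIO
UPWARD_Y = (0.15, 1.10)   # UPWARD_ANGLE_Y_START_RATIO, UPWARD_ANGLE_Y_END_RATIO


def mask_y_ratios(face_w, face_h, orientation):
    """Pick (y_start, y_end) ratios; the shift applies to elongated front faces only."""
    elongated = face_w > 0 and face_h / face_w > UPWARD_ANGLE_ASPECT_THRESHOLD
    if orientation == "front" and elongated:
        return UPWARD_Y
    return NORMAL_Y


def mask_y_span(face, orientation):
    """Return (top, bottom) pixel rows of the mask for a face (x, y, w, h)."""
    _, fy, fw, fh = face
    y_start, y_end = mask_y_ratios(fw, fh, orientation)
    return round(fy + fh * y_start), round(fy + fh * y_end)
```

For a 100×140 front face whose box starts at y=100, the mask spans rows 121 to 254 instead of 142 to 275, which moves it up by 21 px.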

13. Manual Face Coordinates and Model Reuse (v4.3)

No face detector achieves 100% recall. In production, some faces are consistently missed — especially small/partially-occluded faces at frame edges or unusual angles. Rather than endlessly tuning detector parameters (which risks introducing new false positives), v4.3 adds a practical escape hatch:

Manual face injection (--manual-faces "x,y,w,h;x,y,w,h"): Allows specifying exact bounding boxes for missed faces. These are merged with auto-detected results before mask application, inheriting all mask processing (front/side classification, template selection, positioning)
Model instance reuse: add_mask_with_image() now accepts pre-loaded net, mask_front, mask_side, mask_single objects. This avoids redundant model loading when processing multiple images (e.g., batch scripts or fix_masks.py workflows), reducing per-image overhead from ~2s to <0.1s
Design principle: Manual coordinates are a complement, not a replacement. Always run auto-detection first, then add manual overrides only for persistent misses
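
Parsing the --manual-faces value is straightforward; a possible sketch is shown below. `parse_manual_faces` is a hypothetical helper, and treating coordinates as integers with a strictly positive width and height is an assumption about what the script accepts.

```python
def parse_manual_faces(spec):
    """Parse "x,y,w,h;x,y,w,h" into a list of (x, y, w, h) integer tuples."""
    faces = []
    for chunk in spec.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing ';'
        parts = [int(p) for p in chunk.split(",")]
        if len(parts) != 4 or parts[2] <= 0 or parts[3] <= 0:
            raise ValueError(f"expected x,y,w,h with positive size, got {chunk!r}")
        faces.append(tuple(parts))
    return faces
```

The parsed boxes would then be appended to the auto-detected list before mask application, so they go through the same orientation classification and positioning as any other face.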

Post-Processing Filter Rules (Two-Stage)

Stage 1: Position & Size Filtering

Rule 1: conf < 0.40 AND area > 3× avg_high_conf_area → REJECT (large object)
Rule 2: conf < 0.45 AND face_center_y > 70% image_height → REJECT (furniture/floor)
Rule 3: conf < 0.35 AND face_y < 10% image_height → REJECT (ceiling/lights)
Rule 4: conf < 0.35 AND (face_x < 3% OR face_right > 97%) → REJECT (edge noise)
Rule 5: conf < 0.45 AND area < 25% median_high_conf_area → REJECT (tiny noise)
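
The five Stage 1 rules translate directly into code. In the sketch below the reference areas (average and median area of high-confidence faces) are passed in as parameters, since how the scripts compute them is not spelled out here; the function name is hypothetical.

```python
def stage1_reject_reason(det, img_w, img_h, avg_high_area, median_high_area):
    """Return the reason a detection (x, y, w, h, conf) is rejected, or None to keep it."""
    x, y, w, h, conf = det
    area = w * h
    center_y = y + h / 2
    if conf < 0.40 and area > 3 * avg_high_area:
        return "large object"        # Rule 1
    if conf < 0.45 and center_y > 0.70 * img_h:
        return "furniture/floor"     # Rule 2
    if conf < 0.35 and y < 0.10 * img_h:
        return "ceiling/lights"      # Rule 3
    if conf < 0.35 and (x < 0.03 * img_w or x + w > 0.97 * img_w):
        return "edge noise"          # Rule 4
    if conf < 0.45 and area < 0.25 * median_high_area:
        return "tiny noise"          # Rule 5
    return None
```

Every rule is gated on confidence, so a high-confidence face near the bottom edge or the image border is never dropped by this stage.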

Stage 2: Body Part Spatial Filtering

For each low-confidence detection (conf < 0.50):
  For each high-confidence anchor face (conf >= 0.50):
    IF x-aligned (within 1.5× anchor width) 
    AND below anchor face bottom
    AND within 4× anchor face height distance
    → REJECT (body part: chest/waist/knee/shoe)
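
The Stage 2 pseudocode can be written out as follows. The interpretation of "x-aligned" as a center-to-center distance of at most 1.5 anchor widths is an assumption, as is the `(x, y, w, h, conf)` tuple layout; `is_body_part` is a hypothetical name.

```python
def is_body_part(det, detections, conf_split=0.50):
    """True if a low-confidence `det` sits in the column under a high-confidence face."""
    x, y, w, h, conf = det
    if conf >= conf_split:
        return False  # only low-confidence detections are filtered
    center_x = x + w / 2
    for ax, ay, aw, ah, aconf in detections:
        if aconf < conf_split:
            continue  # anchors must be high-confidence faces
        anchor_bottom = ay + ah
        x_aligned = abs(center_x - (ax + aw / 2)) <= 1.5 * aw
        below = y >= anchor_bottom
        near = (y - anchor_bottom) <= 4 * ah
        if x_aligned and below and near:
            return True
    return False
```

The 4× height limit keeps the rule local to one person: in a multi-row group photo, a face in the front row far below an anchor in the back row is not rejected.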

Enhanced Mode Post-Processing (v4.1)

Stage 1: Tighter filtering rules
Rule 0: conf < 0.50 AND (aspect < 0.65 OR aspect > 1.50) → REJECT (abnormal shape)
Rule 1: conf < 0.55 AND area > 2× avg_high_conf_area → REJECT (oversized object)
Rule 2a: face_center_y > 68% image_height → REJECT unconditionally (floor/furniture)
Rule 2b: conf < 0.50 AND face_center_y > 55% image_height → REJECT (bottom zone)
Rule 3: conf < 0.40 AND face_y < 25% image_height → REJECT (ceiling/signs)
Rule 4: conf < 0.25 AND (face_x < 2% OR face_right > 98%) → REJECT (edge noise)
Rule 5a: conf < 0.40 AND area < 20% median_high_conf_area → REJECT (too small)
Rule 5b: conf < 0.55 AND area < 10% median_high_conf_area → REJECT (extremely small)

Stage 2: Body part + hand spatial filtering (v4.1 improved)
  - Body part: conf < 0.45, x-aligned with anchor face, below anchor, within 3× height → REJECT
  - Hand region (NEW): conf < 0.45, lateral to anchor (0.8-3.0× face width), below eye level,
    area < 50% median → REJECT (hand/arm misdetection)

Stage 3: Skin color verification (v4.1 NEW)
  - For detections with conf < 0.45:
    - Extract upper 40% of detection box (forehead/eye region)
    - Convert to HSV and check skin-tone pixel ratio
    - HSV ranges: H=0-20 or 160-180, S=30-180, V=60-255
    - skin_ratio >= 0.15 → PASS (real face)
    - skin_ratio < 0.15 → REJECT (blackboard, desk, backpack, etc.)

Stage 4: Overlap cleanup (>30% overlap with higher-conf detection → REJECT; threshold was 35% before v4.1)

Supported Formats

Input: JPG, JPEG, PNG, BMP, TIFF, WebP
Output: Same format as input (JPEG output is saved at quality=95)

Troubleshooting

DNN Model Files Not Found

Ensure models/deploy.prototxt and models/res10_300x300_ssd_iter_140000.caffemodel exist
Download from OpenCV official repository if missing

False Positives (Non-Face Objects Getting Masks)

The two-stage post-processing filter should catch most cases automatically
Stage 2 body-part filter is especially effective for group standing photos
Increase CONFIDENCE_THRESHOLD slightly (e.g., to 0.30) if still seeing issues
Adjust filter rules in post_filter_faces() for specific scenarios

Not Enough Faces Detected

First try: Use --enhanced flag for enhanced detection mode (recommended for 20+ people)
Try lowering CONFIDENCE_THRESHOLD (e.g., from 0.25 to 0.18)
Ensure input images have sufficient resolution
Very small faces (< 12px) cannot be detected reliably

Mask Covers Eyes

Increase MASK_Y_START_RATIO (e.g., from 0.30 to 0.40 or 0.50)
Current default of 0.30 is optimized for privacy protection (covers below eyes)

Module Not Found

Install: pip install opencv-python pillow numpy
OpenCV must include DNN module (standard opencv-python package does)

Security Considerations

All image processing happens locally; the only network operation is the one-time download of the DNN model weights on first use
No external API calls or data transmission
No personal data collection
Model files are standard OpenCV pre-trained weights

Batch Face Mask Adder for Images

Overview