2 min readfrom Machine Learning

Per-pixel bounding-box regression + DBSCAN for handwritten word detection - visual walkthrough of WordDetectorNet [P]

Per-pixel bounding-box regression + DBSCAN for handwritten word detection - visual walkthrough of WordDetectorNet [P]
Per-pixel bounding-box regression + DBSCAN for handwritten word detection - visual walkthrough of WordDetectorNet [P]

Overview of WordDetectorNN architecture.

Sharing a visual breakdown of WordDetectorNet, Harald Scheidl's handwritten-word detection model. I think the design choice at its core is unusual enough to be worth a closer look - and I haven't seen it written up in detail anywhere else.

The mechanism: Instead of anchor-based detection + NMS, every pixel the network classifies as a "word pixel" also regresses 4 scalar distances (top/right/bottom/left) to the enclosing bounding box. Each word pixel therefore reconstructs one candidate box, producing thousands of overlapping candidates per word. These are then collapsed with DBSCAN using distance = 1 − IoU as the metric, taking the median box per cluster as the final detection.

Architecture: ResNet18 backbone (modified to 1-channel grayscale input, with intermediate features exposed after each residual block) → FPN-style decoder that upscales and concatenates features at all scales → head producing 6 output channels per pixel (2 segmentation logits + 4 distance values). Loss = cross-entropy + IoU, equally weighted. Trained on IAM with 448×448 inputs → 224×224 outputs.

What I find interesting about the design:

  1. The per-pixel distance regression means there is nothing to tune like anchors or NMS thresholds.
  2. The 1 − IoU distance for DBSCAN is conceptually clean: spatially-overlapping candidates cluster together by construction.

What I don't like about the design:

  1. The pairwise IoU distance matrix is O(n²) in the number of candidate boxes, and this is genuinely the runtime bottleneck in practice (not the forward pass).
  2. The clustering step blocks end-to-end training — hyperparameters like DBSCAN's eps have to be set manually.

Full visual write-up with figures (one per pipeline stage + an architecture diagram): https://lellep.xyz/blog/worddetectornet-visually-explained.html

Credit where credit is due: Original architecture by Harald Scheidl, see here https://github.com/githubharald/WordDetectorNN

submitted by /u/martin_lellep
[link] [comments]

Want to read more?

Check out the full article on the original site

View original article

Tagged with

#automated anomaly detection
#financial modeling with spreadsheets
#rows.com
#no-code spreadsheet solutions
#natural language processing for spreadsheets
#generative AI for data analysis
#Excel alternatives for data analysis
#cloud-based spreadsheet applications
#WordDetectorNet
#per-pixel bounding-box regression
#handwritten word detection
#DBSCAN
#IoU
#ResNet18 backbone
#NMS
#anchor-based detection
#FPN-style decoder
#distance metric
#cross-entropy loss
#1-channel grayscale input