Training data for ideas
taking shape.
Zyka Foundry is a precision labeling studio for teams building computer vision, video intelligence, and speech AI. We handle the unglamorous part of the model — so your team can stay focused on the interesting part.
Models are only as honest as the data behind them. We treat labels like the product they are.
Specialists, not a crowd
Radiologists do medical imaging. Linguists do phoneme work. Drivers do LiDAR. Matching expertise to modality is not optional — it's the whole job.
Quality as a measured system
Every pool runs under a QA framework: gold tasks, consensus scoring, calibrated reviewers, and agreement deltas shipped with every batch.
Your taxonomy, not ours
We don't hand you a generic schema. We sit with your ML team, iterate on edge cases, and encode your judgment into the annotation guide.
Pixel-level honesty.
Motion, traced
frame by frame.
Sound with meaning attached.
thanks for calling, how can I help? yeah uh my order it hasn't [0.4s hesitation] arrived yet and I'm kinda losing patience here
active: speaker · agent
intent: order.status.check
entity: order_id (missing)
sentiment: neutral
When the model is right, we stay out of the way. When it isn't,
we close the loop.
Your model makes predictions at full throughput.
~85% auto-accepted. ~15% routed by entropy + uncertainty.
Matched specialists label, correct, and record rationale.
New checkpoint rolls forward. Confidence gate retightens.
Corrected labels join SFT / RLHF training pool.
Disagreement routed to senior review. Rationale logged.
- 01 /3–10× throughput
Pre-label & correct
Your model proposes boxes, masks, transcripts, or captions. A trained reviewer approves, fixes, or rejects. On mature schemas, throughput climbs 3–10× and label cost drops in step.
- 02 /Uncertainty-routed
Active learning
We route only the samples your model is least sure about — low-confidence predictions, high-entropy distributions, near-decision-boundary cases. The labels you pay for are the ones that move the needle.
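A minimal sketch of what that routing gate can look like, assuming softmax outputs; the thresholds and function names here are illustrative, not our production defaults.

```python
import numpy as np

CONFIDENCE_FLOOR = 0.90   # auto-accept only above this top-class probability
ENTROPY_CEILING = 0.5     # route to humans above this normalized entropy

def route(probs: np.ndarray) -> str:
    """Return 'auto_accept' or 'human_review' for one softmax distribution."""
    top = probs.max()
    # Normalized Shannon entropy: 0 = fully confident, 1 = uniform.
    entropy = -(probs * np.log(probs + 1e-12)).sum() / np.log(len(probs))
    if top >= CONFIDENCE_FLOOR and entropy <= ENTROPY_CEILING:
        return "auto_accept"
    return "human_review"

# A near-decision-boundary prediction lands in the review queue.
print(route(np.array([0.48, 0.47, 0.05])))  # -> human_review
print(route(np.array([0.97, 0.02, 0.01])))  # -> auto_accept
```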
- 03 /Every Nth inference
Production review pool
A fixed sample rate of production traffic is mirrored into a review queue. Drift, regressions, and silent failures show up in the agreement delta the day they start — not in the quarterly model eval.
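A minimal sketch of the mirroring step, assuming a simple in-process counter and a stand-in queue; the sample rate is illustrative.

```python
import itertools
from collections import deque

SAMPLE_EVERY_N = 200            # illustrative: 0.5% of production traffic
review_queue: deque = deque()   # stand-in for whatever queue you already run

_seen = itertools.count(1)

def maybe_mirror(inference_record: dict) -> None:
    """Call on every production inference; mirrors every Nth into review."""
    if next(_seen) % SAMPLE_EVERY_N == 0:
        review_queue.append(inference_record)
```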
- 04 /Specialist pools on standby
Edge-case triage
When your model hits something weird — a rare medical phenotype, a code-switched utterance, a long-tail class — the task is auto-routed to a specialist pool. No generic-crowd guessing on data that deserves expertise.
- 05 /Pairwise · rubric · Likert
RLHF & preference
Pairwise A/B ranking, rubric-scored absolute rating, and multi-turn dialogue judgment for generative models — image, video, speech, text. We run the calibration pipeline that keeps subjective judgment internally consistent.
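One simple ingredient of that calibration, shown as a sketch rather than our actual pipeline, is per-rater normalization of rubric scores so lenient and harsh raters land on the same scale. Column names and values are illustrative.

```python
import pandas as pd

# Same items scored by two raters with different personal scales.
ratings = pd.DataFrame({
    "rater": ["a", "a", "a", "b", "b", "b"],
    "item":  [1, 2, 3, 1, 2, 3],
    "score": [4, 5, 3, 2, 3, 1],
})

# Z-score each rater against their own mean and spread so a "4" from a
# lenient rater and a "2" from a harsh one land in the same place.
ratings["calibrated"] = ratings.groupby("rater")["score"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)
print(ratings)
```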
- 06 /Policy · jailbreak · adversarial
Safety & red-team
Jailbreak probing, policy-violation review, and adversarial prompt labeling for safety-tuned models. Trained reviewers who know the taxonomy cold and log rationale alongside the label.
- 07 /Signal, not noise
Disagreement routing
When three annotators disagree, the schema is under-specified — not the annotators. Disagreements are routed to a senior reviewer, resolved with written rationale, and become test cases for the next schema revision.
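A minimal sketch of the resolution rule, assuming three-way labeling with unanimity as the bar; field names and the escalation record are illustrative.

```python
from collections import Counter

def resolve(labels: list[str]) -> dict:
    """Accept a unanimous label; otherwise escalate with the vote split."""
    votes = Counter(labels)
    label, count = votes.most_common(1)[0]
    if count == len(labels):
        return {"label": label, "status": "consensus"}
    # The disagreement itself is the signal: keep the full split for the
    # retro and require a written rationale from the senior reviewer.
    return {"status": "escalate_to_senior", "votes": dict(votes), "rationale": None}

print(resolve(["pedestrian", "pedestrian", "pedestrian"]))
print(resolve(["pedestrian", "cyclist", "pedestrian"]))
```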
- 08 /Statistical + human review
Drift detection & reroute
Population-level statistics on the production inference stream surface distributional shift. Flagged slices are pulled into a re-labeling queue before the model's confidence catches up with its actual accuracy.
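One common way to quantify that shift is the population stability index on the model's predicted-class mix; a minimal sketch with illustrative numbers, where the 0.2 cut is a widely used rule of thumb rather than a guarantee of ours.

```python
import numpy as np

def psi(reference: np.ndarray, production: np.ndarray) -> float:
    """Population stability index between two discrete distributions."""
    ref = np.clip(reference, 1e-6, None)
    prod = np.clip(production, 1e-6, None)
    return float(((prod - ref) * np.log(prod / ref)).sum())

reference = np.array([0.70, 0.20, 0.10])   # class mix at the last checkpoint
this_week = np.array([0.45, 0.25, 0.30])   # class mix in current traffic

score = psi(reference, this_week)          # ~0.34 for these toy numbers
if score > 0.2:                            # common rule-of-thumb threshold
    print(f"PSI {score:.2f}: pull this slice into the re-labeling queue")
```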
A loop implies something that runs. Most HITL pipelines we inherit from customers are not actually loops — they're a one-way conveyor from humans to training data, never back. The difference shows up in model performance twelve months later. We build them as real loops.
A quiet, careful process
that you never have to manage.
Good annotation looks boring from the outside. No heroic late-night pushes, no Slack fires, no "we'll fix it in the next batch."
The work shows up on time, the agreement numbers are where they should be, and the edge cases are the ones your guide already anticipated.
- 01 /1–2 weeks
Kickoff & taxonomy
We sit with your ML team for 2–3 working sessions to understand the model's failure modes, the decision boundary you want to enforce, and the edge cases that keep your PMs up at night. Output: a versioned annotation guide, test set, and success metric.
- 02 /3–5 days
Gold set & calibration
We hand-build 200–1,000 gold tasks with your team, then calibrate annotator pools against them. Annotators who don't hit the agreement bar don't see production data. Simple as that.
- 03 /3–7 days
Pilot batch
A small production batch (1–5% of scope) runs end-to-end through the tool, pool, QA, and export pipeline. We surface schema ambiguities early, before they compound across millions of labels.
- 04 /Ongoing
Production at scale
Full throughput with layered QA — peer review, senior review, gold-task injection, and consensus scoring on disputed tasks. Rolling dashboards on agreement, throughput, and reviewer calibration.
- 05 /Per milestone
Retrospective & schema v2
Every batch ends with a retro. What did annotators argue about? What did the model get wrong after training on this data? That feedback becomes the next version of the guide.
Numbers you can audit,
not adjectives.
Every delivery ships with an agreement report, reviewer calibration curve, and a delta against your gold set. You don't have to trust us — you can check.
Disagreement is the most valuable signal in any annotation pipeline. We surface it, root-cause it, and feed it back into the next schema revision instead of averaging it away.
Inter-annotator agreement
Krippendorff's α, Cohen's κ, or F1 depending on task shape. Reported per batch and per reviewer.
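For a two-annotator categorical task, Cohen's κ is a one-liner with scikit-learn; a minimal sketch with toy labels standing in for a real batch, and per-batch or per-reviewer reporting is just a groupby on top.

```python
from sklearn.metrics import cohen_kappa_score

annotator_a = ["car", "car", "truck", "bus", "car", "truck"]
annotator_b = ["car", "car", "truck", "car", "car", "truck"]

# Chance-corrected agreement between the two annotators on the same items.
print(round(cohen_kappa_score(annotator_a, annotator_b), 3))
```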
Gold-task injection
Annotators see hidden gold tasks at a 3–5% rate. Performance below the bar triggers recalibration.
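A minimal sketch of what injection and the recalibration trigger can look like; the 4% rate and 0.92 bar are illustrative stand-ins, not published defaults.

```python
import random

GOLD_RATE = 0.04       # roughly 3-5% of an annotator's queue
AGREEMENT_BAR = 0.92   # illustrative bar on gold-task accuracy

def build_queue(production_tasks: list, gold_tasks: list) -> list:
    """Interleave hidden gold tasks into a production queue at GOLD_RATE."""
    queue = []
    for task in production_tasks:
        if random.random() < GOLD_RATE:
            queue.append(random.choice(gold_tasks))  # hidden gold task
        queue.append(task)
    return queue

def needs_recalibration(gold_accuracy: float) -> bool:
    """Below the bar, the annotator is pulled off production data."""
    return gold_accuracy < AGREEMENT_BAR
```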
N-way consensus
Critical or ambiguous tasks routed to 3+ annotators. Conflict resolution by senior reviewer with written rationale.
Reviewer drift monitoring
Weekly calibration checks for reviewers themselves. Humans get lax over time — we control for it explicitly.
Teams whose data earns its keep.
- AV · drones · robotics
Autonomous systems
LiDAR cuboids, sensor-fusion tracking, lane and drivable-surface segmentation, behavior prediction labels.
14 AV programs shipped
- radiology · pathology · ophthalmology
Healthcare & life sciences
DICOM-native workflows with clinician annotators. HIPAA environment, IRB-aware de-identification, audit-ready trails.
4 FDA-cleared customers
- SFT · RLHF · safety red-team
Generative AI
Preference ranking, rubric-driven aesthetic scoring, and multi-turn dialogue judgment for text-to-image, video, and voice models.
1.2M pairwise judgments
- CCTV · perimeter · behavior
Security & surveillance
Anomaly detection, re-identification across cameras, loitering and intrusion events, crowd density estimation at scale.
24/7 monitored pool
- catalog · shelf · fit
Retail & e-commerce
Product attribute tagging, shelf compliance audits, try-on pose and garment segmentation, aesthetic search preference data.
6M SKUs processed
- EO · SAR · aerial
Geospatial & satellite
Building footprints, land-use segmentation, change detection, vessel and vehicle detection from overhead imagery.
EO + SAR trained pools
- ASR · intent · QA
Contact centers & voice
Transcription, diarization, intent and entity schema work for voice agents, IVR, and conversational analytics.
7 languages live
- QA · predictive maintenance
Industrial & manufacturing
Surface defect segmentation, machine anomaly audio, thermal imaging classification, assembly compliance.
Line-speed throughput
A team of specialists. Not a crowd.
We're a hybrid model. A full-time annotation team of 140+ domain specialists handles the hard work. A vetted extension pool of 800+ trained contributors scales us up when volume spikes.
Annotators are paid above local living wage, receive health coverage, and are employed under local labor law — not gig-work contracts. We believe the quality of a label reflects the conditions it was made in.
We meet you where your stack already is.
- CVAT (self-hosted)
- Label Studio
- Scale Nucleus
- Labelbox
- Supervisely
- Segments.ai
- Deepen AI
- Custom PCD viewer
- Praat
- ELAN
- Audino
- Custom MUSHRA rig
- Argilla
- SurgeHQ schema
- Custom pairwise UI
- S3 / GCS / Azure Blob
- SFTP
- Signed-URL handoff
- COCO
- YOLO
- Pascal VOC
- CVAT XML
- KITTI
- nuScenes
- DICOM-SR
- WebVTT
- TextGrid
- custom schemas
Your data stays your data.
We operate on a principle of least access. Annotators see only the fields they need, from workstations inside our secure VPN — no downloads, no screenshots, no phone cameras allowed in sensitive pools.
PII redaction, watermarking, and full audit trails on every task are available by default on healthcare, financial, and government engagements.
Clean-room workstations
Locked-down VDI, no USB, no external network egress, recorded sessions on flagged pools.
End-to-end encryption
AES-256 at rest, TLS 1.3 in transit. Customer-managed keys available on enterprise plans.
Signed BAAs & DPAs
HIPAA BAA and GDPR DPA on every engagement. Sub-processor list published and versioned.
Full audit trail
Every annotation is attributed, timestamped, and reviewable. Deletion certificates on project close-out.
The nicest thing a customer has said is that they forgot we existed.
We switched from a crowd platform after six months of arguing with the data. Agreement numbers on our mammography schema went from 0.71 to 0.94 in the first pilot batch. That was the day the CTO stopped asking me about label quality.
Their LiDAR team knew the sensor stack. Nobody had to explain what a moving ground return was. We lost a quarter of engineering time on a previous vendor just trying to communicate basic things. That time came back.
Annotation is supposed to be invisible when it's working. For two years now I've been able to forget it's even part of our process — which is the highest compliment I can give a vendor.
We ran the same 5,000-sample evaluation batch with three other vendors first. Zyka Foundry was the only one whose disagreement pattern was consistent with our own internal review. They were not labeling a different model than us.
Priced for the real shape of the work.
Pilot
- Up to 20,000 simple labels
- 1 modality · 1 taxonomy
- Shared annotation guide authoring
- Gold set + calibration included
- Agreement report per batch
- 2-week standard turnaround
Program
- Dedicated annotator pool (20–80 people)
- Dedicated program manager + QA lead
- SLA on throughput and agreement
- Weekly reviewer calibration
- Live dashboards on agreement & throughput
- Multi-modality supported
- Direct Slack line to the pool PM
Enterprise
- Clinician / licensed specialist pools
- On-premise or customer VPC deployment
- Customer-managed encryption keys
- Custom BAA / DPA / DPIA
- Dedicated legal + security review
- 24/7 program staffing
- Data residency guarantees (US / EU / APAC)
Some things worth knowing before we start.
Let's get your data in shape.
Tell us what you're building. We'll reply within one working day with a scope sketch, a rough timeline, and a short list of the sharpest questions we'd ask in a kickoff session.