First Photo Spot Checks — Pasley Hill Specs

The Problem Today

One person, 2,000+ photos a day, eyeballed by hand.

Every day, someone at Redline opens the day's batch of 2,000+ first photos — the lead listing photo for each used car on a dealer lot — and looks at each one to decide whether it's framed correctly. When a photo is bad, they manually compose an email to the Merchandising Specialist (MS) who shot it and that MS's Regional Manager (RM), so the car can be reshot and re-uploaded.

This is the kind of work that is both critically important and soul-crushing to do by hand. A bad lead photo is the first thing a buyer sees; it directly affects how fast a car sells. But asking a human to make the same yes/no framing judgment two thousand times before lunch guarantees three things:

2,000+: first photos to review per day
~3–5 hrs: of a person's day, every day
Variable: standard drifts by reviewer & fatigue
Hours: lag before MS hears a photo was rejected

The honest failure mode isn't that the reviewer is bad at it — it's that attention is finite. By photo 1,500, the bar quietly moves. Two reviewers won't reject the same set. And the reshoot email is a second manual job stacked on top of the first: find the MS, find their RM, attach the photo, describe what's wrong, hit send. That second job is exactly the part that gets dropped when the day gets busy — so bad photos stay live longer than anyone wants.

Why this is the flagship pilot: The task is narrow, repetitive, has a clear right answer, and already produces a labeled history (every accept/reject Redline has ever made). That's the ideal shape for a vision model. It's also high-visibility: if it works, everyone sees it work the next morning.

How It Works

A vision model with an explicit rubric, a confidence band that routes borderline cases to a human, and a templated reshoot email.

The core idea, in plain terms

We don't ask a model "is this a good photo?" and trust a vibe. We give it the same checklist the human uses, score each item, and combine those into an accept / reject / not-sure decision. The model never silently invents a standard — it answers a fixed rubric, and we can read its reasons.

The framing rubric

Each first photo is checked against a small, concrete set of pass/fail criteria. These are the things a Redline reviewer is actually looking for:

Vehicle centered — the car is roughly in the middle of the frame, not shoved to one edge.
Full vehicle in frame — no wheels, roof, or bumper cropped off.
Correct angle — the standard 3-quarter front (front + driver side), not a flat side or rear shot.
Not cut off / no clipping — the whole body fits with reasonable margin.
No obstructions — no people, price boards, cones, other cars, or hoses blocking the vehicle.
Adequate lighting — not blown out, not so dark the body color is unreadable.

This list is the spec we tune against, and it's the first thing we'll confirm with Redline so the model is scored on exactly the rules the team enforces today.
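
One way to pin the rubric down so people and code read the same list is to hold it as data. The sketch below is only an illustration: the criterion keys and the `CriterionResult` shape are assumed names, not anything Redline has confirmed.

```python
from dataclasses import dataclass

# Criterion keys are illustrative names for the six rubric items above;
# the final list is whatever Redline confirms.
RUBRIC = {
    "vehicle_centered": "Car is roughly in the middle of the frame.",
    "full_vehicle_in_frame": "No wheels, roof, or bumper cropped off.",
    "correct_angle": "Standard 3-quarter front (front + driver side).",
    "no_clipping": "Whole body fits with reasonable margin.",
    "no_obstructions": "Nothing blocking the vehicle.",
    "adequate_lighting": "Not blown out, not too dark to read body color.",
}


@dataclass(frozen=True)
class CriterionResult:
    passed: bool
    reason: str


def failed_criteria(results: dict) -> list:
    """Return the rubric keys that failed, in rubric order."""
    missing = set(RUBRIC) - set(results)
    if missing:
        # A verdict that skips a rubric item is incomplete, not a pass.
        raise ValueError(f"missing rubric items: {sorted(missing)}")
    return [key for key in RUBRIC if not results[key].passed]
```

Treating a missing item as an error rather than a silent pass keeps the scorer tied to the full checklist.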

Two viable approaches — we'll start with the cheaper one to learn fast

There are two honest ways to build the scorer. They trade setup speed against per-image cost and consistency:

(A) Vision-LLM + rubric prompt (start here). A hosted vision-language model (via Amazon Bedrock) is given the photo plus the rubric and returns a structured verdict per criterion with a short reason and a confidence. Near-zero training data needed on day one, fast to stand up, easy to read why it decided what it did. Cost is per-image inference. This is how we get a working pilot in days, not months.
(B) Fine-tuned / few-shot image classifier (graduate into this). Once we have a labeled set from Redline's accept/reject history, we train a dedicated classifier (or few-shot embed-and-compare). It's cheaper per image at 2,000/day volume, faster, and more consistent — but it only earns its keep after we've harvested labels and proven the rubric. We move criteria onto it one at a time as accuracy justifies.

The path is: ship (A) fast, use it to generate and confirm labels, then quietly swap individual checks over to (B) where it's clearly better and cheaper. Redline sees a working system the whole time.
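
For approach (A), the parts worth sketching are the rubric prompt and the check on what comes back. This is a minimal sketch under assumptions: the criterion keys, the JSON reply shape, and both function names are hypothetical, and the actual Bedrock call is left out so the example stays runnable offline.

```python
import json

# Hypothetical criterion keys; the real list comes from the confirmed rubric.
CRITERIA = [
    "vehicle_centered", "full_vehicle_in_frame", "correct_angle",
    "no_clipping", "no_obstructions", "adequate_lighting",
]


def build_prompt(criteria: list) -> str:
    """Ask for one JSON object keyed by criterion, and nothing else."""
    lines = [
        "Score this first photo against each criterion.",
        "Reply with a single JSON object and no other text.",
        'Each key maps to {"pass": true|false, "reason": "<short>", '
        '"confidence": <0.0 to 1.0>}.',
        "Criteria:",
    ]
    lines += [f"- {name}" for name in criteria]
    return "\n".join(lines)


def parse_verdict(raw_reply: str, criteria: list) -> dict:
    """Validate the model reply; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError("reply is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    verdict = {}
    for name in criteria:
        entry = data.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing criterion: {name}")
        passed = entry.get("pass")
        conf = entry.get("confidence")
        if not isinstance(passed, bool):
            raise ValueError(f"{name}: 'pass' must be true or false")
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            raise ValueError(f"{name}: 'confidence' must be a number")
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"{name}: 'confidence' out of range")
        verdict[name] = {
            "pass": passed,
            "reason": str(entry.get("reason", "")),
            "confidence": float(conf),
        }
    return verdict
```

A reply that fails validation is probably best treated as borderline and sent to a person rather than retried blindly.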

Building the labeled set (the real asset)

Redline already has the most valuable thing for this project: a history of human decisions. Every photo previously flagged-and-reshot is a labeled "reject"; the vast majority that passed are "accepts." We pull a few thousand of each, have the current reviewer confirm a clean validation slice (a few hundred images they re-judge carefully against the rubric), and use that as ground truth. That validation set is what we measure precision and recall against — and what catches the model drifting.
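
Drawing the validation slice could look like the sketch below. It assumes the history can be exported as `(photo_id, label)` pairs with labels `"accept"` and `"reject"`; that export format and the function name are assumptions for illustration.

```python
import random


def draw_validation_slice(history, per_label, seed=0):
    """Pick the same number of past accepts and rejects for careful re-judging.

    history: list of (photo_id, label) pairs, label in {"accept", "reject"}.
    """
    rng = random.Random(seed)  # fixed seed so the slice is reproducible
    chosen = []
    for label in ("accept", "reject"):
        pool = sorted(pid for pid, lab in history if lab == label)
        if len(pool) < per_label:
            raise ValueError(f"only {len(pool)} '{label}' photos available")
        chosen += [(pid, label) for pid in rng.sample(pool, per_label)]
    # Shuffle so the reviewer does not see all accepts first, then all rejects.
    rng.shuffle(chosen)
    return chosen
```

Balancing the two labels is a choice made here, not a requirement from the spec: rejects are the minority, so a purely random draw would leave few of them to measure reject precision against.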

The confidence band + human-in-the-loop

This is the part that decides whether MSs trust the system or learn to ignore it. The model returns a confidence; we split it into three lanes:

per-photo verdict
  ├─ HIGH-confidence PASS      → silently accept, log it
  ├─ HIGH-confidence FAIL      → queue reshoot email (MS + RM)
  └─ BORDERLINE (uncertain)    → human review queue
                                  reviewer clicks Accept / Reject
                                  → that click becomes a new training label

The whole point is that the human now only looks at the borderline few percent instead of all 2,000 — and every click they make feeds back as a fresh label. The system gets more confident over time, so the review queue shrinks. We tune the band's width to hit our error targets, not to hit a tidy number.
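
The three lanes can be sketched as a small routing function. Everything specific here is an assumption: the two floor values are placeholders to be tuned on the validation set, and combining per-criterion confidences with min and max is one reasonable rule, not a settled design.

```python
from enum import Enum


class Lane(Enum):
    AUTO_PASS = "auto_pass"
    AUTO_FAIL = "auto_fail"
    HUMAN_REVIEW = "human_review"


def route(verdict, pass_floor=0.90, fail_floor=0.95):
    """Pick a lane from per-criterion results.

    verdict: {criterion: {"pass": bool, "confidence": float}}
    pass_floor / fail_floor: placeholder thresholds, not tuned values.
    """
    if not verdict:
        raise ValueError("empty verdict")
    fails = [v for v in verdict.values() if not v["pass"]]
    if fails:
        # One confidently failed criterion is enough to request a reshoot.
        if max(v["confidence"] for v in fails) >= fail_floor:
            return Lane.AUTO_FAIL
        return Lane.HUMAN_REVIEW
    # Auto-pass only when every criterion passed with high confidence.
    if min(v["confidence"] for v in verdict.values()) >= pass_floor:
        return Lane.AUTO_PASS
    return Lane.HUMAN_REVIEW
```

The fail floor is set higher than the pass floor in this sketch because a wrong reshoot request costs more than sending one extra photo to review.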

Precision and recall — and why precision matters more here

We will deliberately bias the system toward precision on rejects. A false reject (telling an MS to reshoot a perfectly good photo) is the expensive mistake: it wastes a field rep's drive time and, worse, teaches them the tool cries wolf. Once an MS believes that, they ignore every email — and the project is dead even if it's technically accurate.

Starting targets (to be confirmed against the validation set):
Reject precision ≥ 95%: when we tell someone to reshoot, we're almost always right.
Reject recall of 80–85% or better to start: we'll catch the clear bad ones automatically; genuinely ambiguous ones go to the human, not into a false accusation. Recall climbs as labels accumulate.

The borderline lane is what lets us be aggressive on precision without dropping real problems on the floor: anything we're not sure about goes to a person; it doesn't get auto-rejected or auto-passed.
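
For reference, reject precision is the share of auto-rejects that were truly bad photos, and reject recall is the share of truly bad photos that were auto-rejected. A minimal sketch of that arithmetic, assuming each validation photo is reduced to a pair of booleans (the pair format and function name are illustrative):

```python
def reject_metrics(pairs):
    """Compute reject precision and recall.

    pairs: iterable of (auto_rejected, truly_reject) booleans, one per
    validation photo. Borderline photos count as not auto-rejected.
    """
    tp = fp = fn = 0
    for auto_rejected, truly_reject in pairs:
        if auto_rejected and truly_reject:
            tp += 1  # correct reshoot request
        elif auto_rejected:
            fp += 1  # false reject: the expensive mistake
        elif truly_reject:
            fn += 1  # bad photo that was not auto-rejected
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"precision": precision, "recall": recall}
```

With 20 auto-rejects of which 19 are truly bad, precision is 19/20 = 95%, which is right at the starting target.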

The reshoot notification

When a photo is a confirmed fail (auto or human), the system composes a templated email to the MS and CCs their RM. It includes the photo itself, the VIN/stock context, which rubric items failed in plain language, and concrete reshoot instructions for that specific problem (e.g. "back up ~6 feet and shoot the front-driver 3-quarter angle so the full car fits"). Generic "reshoot this" emails get ignored; specific ones get acted on.

SUBJECT: Reshoot needed — [Year Make Model] · Stock #[####] · [Dealer]

Hi [MS first name],

The first photo for this vehicle needs a reshoot before it lists well:

  • Vehicle is cut off on the right (rear bumper out of frame)
  • Angle is closer to a flat side than the standard 3-quarter front

How to fix:
  • Step back ~6 ft and frame the full car with margin on all sides
  • Stand off the front-driver corner for the 3-quarter front angle

[ photo thumbnail ]   Reshoot & re-upload when you're next on the lot.

— Redline Merchandising QA (automated)   ·   RM: [RM name] cc'd
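
Rendering that template from the failed rubric items might look like the following sketch. The `GUIDANCE` table, its keys, and the function name are hypothetical; the wording is adapted from the sample above, and the real text would be agreed with Redline.

```python
# Plain-language problem and fix per rubric key. Keys and wording are
# illustrative; only two items are filled in for this sketch.
GUIDANCE = {
    "full_vehicle_in_frame": (
        "Vehicle is cut off (part of the car is out of frame)",
        "Step back ~6 ft and frame the full car with margin on all sides",
    ),
    "correct_angle": (
        "Angle is closer to a flat side than the standard 3-quarter front",
        "Stand off the front-driver corner for the 3-quarter front angle",
    ),
}


def render_reshoot_email(ms_first_name, vehicle, stock, dealer, rm_name, failed):
    """Return (subject, body) for a confirmed fail."""
    if not failed:
        raise ValueError("no failed criteria; nothing to send")
    unknown = [key for key in failed if key not in GUIDANCE]
    if unknown:
        # Refuse to fall back to a generic email: each failure needs
        # specific instructions or it should not be sent.
        raise ValueError(f"no guidance text for: {unknown}")
    problems = "\n".join(f"  • {GUIDANCE[key][0]}" for key in failed)
    fixes = "\n".join(f"  • {GUIDANCE[key][1]}" for key in failed)
    subject = f"Reshoot needed: {vehicle} · Stock #{stock} · {dealer}"
    body = (
        f"Hi {ms_first_name},\n\n"
        "The first photo for this vehicle needs a reshoot before it lists well:\n\n"
        f"{problems}\n\n"
        f"How to fix:\n{fixes}\n\n"
        "Reshoot & re-upload when you're next on the lot.\n\n"
        f"Redline Merchandising QA (automated) · RM: {rm_name} cc'd\n"
    )
    return subject, body
```

The photo thumbnail is omitted here; attaching it depends on how the email is finally sent.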

Architecture & Stack

Event-driven on AWS. New photo lands → screen it → route it → notify. No standing servers to babysit.

Amazon S3 · AWS Lambda · AWS Batch (backfill) · Amazon Bedrock (vision-LLM) · SageMaker (later: fine-tuned classifier) · Amazon SES (email) · DynamoDB (verdicts/log) · EventBridge · SQS (review queue) · CloudWatch · Lightweight web dashboard (S3 + CloudFront)

Data flow

[ Photo source ]  (DAM / S3 bucket / pipeline API — TBC w/ Redline)
       │  new first-photo event
       ▼
[ EventBridge ] ──► [ Lambda: ingest ]  pull image + VIN/stock + MS/RM mapping
       │
       ▼
[ Lambda/Batch: score ] ──► [ Bedrock vision-LLM + rubric ]
       │                          returns per-criterion verdict + confidence
       ▼
  ┌──────── confidence router ────────┐
  │ PASS (high)   → log to DynamoDB    │
  │ FAIL (high)   → SES reshoot email  │──► MS + RM
  │ BORDERLINE    → SQS review queue   │──► human dashboard → click = new label
  └────────────────────────────────────┘
       │
       ▼
[ DynamoDB log ] ──► [ Dashboard: daily ops + accuracy ]
                     [ labeled-data store → future fine-tune ]

Why this shape

It's event-driven and serverless, so it costs almost nothing when idle and scales to a daily spike of 2,000+ without provisioning. The backfill path (AWS Batch) is a one-time job that scores the historical photo set to build the validation/label set. The steady-state path (Lambda) handles each new photo as it arrives. The dashboard is a static front-end reading DynamoDB through a thin API — nothing exotic to run.
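
The entry point of the steady-state path could be sketched as below. This assumes the photo source turns out to be an S3 bucket that publishes "Object Created" events to EventBridge, which is still to be confirmed with Redline; the handler name, return shape, and suffix list are illustrative.

```python
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png")  # assumed upload formats


def handler(event, context=None):
    """Lambda-style ingest step: turn an upload event into a work item.

    Assumes the EventBridge shape for S3 "Object Created" events, where the
    bucket and key live under event["detail"].
    """
    if event.get("source") != "aws.s3" or event.get("detail-type") != "Object Created":
        return {"status": "ignored", "reason": "not an S3 object-created event"}
    detail = event.get("detail", {})
    bucket = detail.get("bucket", {}).get("name")
    key = detail.get("object", {}).get("key")
    if not bucket or not key:
        return {"status": "ignored", "reason": "missing bucket or key"}
    if not key.lower().endswith(IMAGE_SUFFIXES):
        return {"status": "ignored", "reason": "not an image"}
    # Downstream steps (metadata lookup, scoring, routing) would consume this.
    return {"status": "queued", "bucket": bucket, "key": key}
```

If the source ends up being a DAM export or a vendor API instead, only this ingest step should need to change.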

Build Plan

Phased so Redline sees a working screener fast, then we tune accuracy and close the loop. AI-assisted dev makes this genuinely quick — the gating item is photo-source access, not code.

Phase 0 — Discovery & access

~2–4 days · the real unlock

Confirm where photos live and how we read them (DAM export, S3 bucket, or pipeline API), confirm the MS→RM mapping source, and lock the rubric with the current reviewer. Pull a sample of accept/reject history. Nothing downstream starts cleanly until image access is real — this is the dependency that actually matters.

Phase 1 — Rubric scorer on real photos

~3–5 days · "it judges photos"

Wire Bedrock vision-LLM to the rubric, return structured per-criterion verdicts + confidence, and run it over a few hundred labeled samples. Produce a first precision/recall read against the validation set. This is the demo that proves the concept.

Phase 2 — Confidence band + review queue

~3–5 days · "humans only see the hard ones"

Tune the three-lane router for our precision target, stand up the SQS-backed human review queue, and build the reviewer screen (Accept / Reject, one click = one label). Verify the borderline volume is small enough to be sane.
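
Tuning the reject side of the band for a precision target can be sketched as a scan over the validation set. The input format and function name are assumptions; the scan picks the lowest fail threshold that still meets the target, which gives the most recall among the thresholds that qualify.

```python
def pick_fail_floor(scored, target_precision=0.95):
    """Choose the auto-reject threshold from validation data.

    scored: list of (fail_confidence, truly_reject) pairs.
    Returns (threshold, precision, recall), or None if no threshold
    reaches the target.
    """
    total_rejects = sum(1 for _, truth in scored if truth)
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Take in every photo scored exactly at this threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        if precision >= target_precision:
            recall = tp / total_rejects if total_rejects else 0.0
            # Later (lower) thresholds overwrite earlier ones.
            best = (threshold, precision, recall)
    return best
```

Photos scoring below the chosen threshold are not auto-rejected; they fall into the borderline lane, and that count is the review-queue volume to sanity-check.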

Phase 3 — Auto-notification

~2–3 days · "MSs get actionable emails"

Templated SES emails to MS + RM with photo, failed criteria in plain language, and reshoot instructions. Start in a shadow/approval mode — emails are drafted and reviewed before they actually send — until precision is proven, then flip to auto-send.

Phase 4 — Dashboard & accuracy tracking

~3–4 days · "we can see it working"

Daily ops view (photos screened, auto-passed, flagged, in review queue, emails sent) plus an accuracy panel tracking precision/recall against the rolling validation set and the false-reject rate. This is how Redline trusts it and how we catch drift.
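
The daily ops counts could be derived from the verdict log roughly as follows. The row fields (`lane`, `reviewed`, `email_sent`) are hypothetical names for whatever the DynamoDB log ends up storing.

```python
from collections import Counter


def daily_summary(rows):
    """Summarize one day's log rows for the ops view.

    rows: list of dicts with "lane" in {"auto_pass", "auto_fail",
    "human_review"} plus optional "reviewed" and "email_sent" booleans.
    """
    lanes = Counter(row["lane"] for row in rows)
    return {
        "screened": len(rows),
        "auto_passed": lanes["auto_pass"],
        "flagged": lanes["auto_fail"],
        # Only borderline photos nobody has clicked on yet are still queued.
        "in_review_queue": sum(
            1 for row in rows
            if row["lane"] == "human_review" and not row.get("reviewed", False)
        ),
        "emails_sent": sum(1 for row in rows if row.get("email_sent", False)),
    }
```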

Phase 5 — Graduate to fine-tuned classifier (optional)

later · cost & consistency win

Once labels have accumulated, train a dedicated classifier and move high-volume checks off the LLM where it's clearly cheaper and more consistent. Pure optimization — only done when the numbers justify it.

Honest estimate: Roughly 3–4 weeks to a trustworthy production pilot running in shadow mode, assuming photo access is granted promptly in Phase 0. The first useful demo (Phase 1) lands inside the first week.

Data & Access Needed

Almost all the risk-to-schedule is here, not in the build. The single most important item is read access to the photos.

What	Why we need it	Form
Read access to first photos (the blocker)	To screen each photo. Need to confirm the source: photo pipeline / DAM, an S3 bucket, or a vendor API. Everything depends on this.	S3 bucket + IAM role, or API endpoint + credentials
New-photo trigger	To screen in near-real-time as photos land rather than batch-polling.	S3 event / webhook / queue — whatever the source supports
Photo → VIN / stock / dealer metadata	So the email says which car, at which dealer.	Field in the photo record or a lookup we can join on
Photo → MS mapping	To know who shot it / who reshoots it.	From the photo metadata or a roster (possibly Rippling/Predian)
MS → RM mapping	To CC the right Regional Manager.	Org roster / spreadsheet / HR export
MS & RM email addresses	To send the reshoot notification.	Directory export
Historical accept/reject examples	To build the validation set and, later, fine-tune.	Past flagged-photo emails / logs, or a tagged sample
The framing rubric, confirmed	So the model scores exactly Redline's current standard.	30–60 min with the current reviewer + a few annotated examples
SES sending domain	So emails come from a verified Redline-branded address and don't land in spam.	Domain verification / DKIM in SES
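
The photo → MS → RM rows above amount to two lookups before any email can go out. A minimal sketch, assuming the rosters can be loaded as plain dictionaries keyed by a hypothetical `ms_id` field:

```python
def resolve_recipients(photo, ms_email_by_id, rm_email_by_ms):
    """Find who gets the reshoot email (MS) and who is CC'd (their RM).

    photo: dict with "photo_id" and "ms_id" (assumed field names).
    ms_email_by_id: {ms_id: MS email}
    rm_email_by_ms: {ms_id: RM email}
    """
    ms_id = photo.get("ms_id")
    if ms_id not in ms_email_by_id:
        raise LookupError(f"no MS email for photo {photo.get('photo_id')}")
    if ms_id not in rm_email_by_ms:
        raise LookupError(f"no RM mapped for MS {ms_id}")
    return {"to": [ms_email_by_id[ms_id]], "cc": [rm_email_by_ms[ms_id]]}
```

Raising on a missing mapping, instead of sending to a default address, means roster gaps show up as errors to fix rather than as emails that never reach anyone.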

Risks & Open Questions

The build is low-risk. The two things that can hurt us are model accuracy and getting clean access to the photos.

RISK · false rejects erode trust: If MSs get told to reshoot good photos, they'll stop reading the emails, and then even correct flags get ignored. Mitigation: bias hard toward reject-precision, route anything uncertain to a human (never auto-reject borderline), and run notifications in shadow/approval mode until precision is proven on real traffic.

RISK · photo-source access is the gating dependency: We can't screen what we can't read. If the source is a closed vendor system with no API or export, Phase 0 stalls. Mitigation: nail this down first; identify a fallback (scheduled export to an S3 drop bucket) before committing a timeline.

RISK · rubric ambiguity & standard drift: "Correctly framed" is partly judgment, and even humans disagree at the margins. If the rubric isn't pinned down, the model chases a moving target. Mitigation: lock the rubric with the reviewer up front, keep a labeled validation set as the source of truth, and review disagreements monthly.

RISK · edge cases & lot conditions: Weather, glare, snow, cars parked tight, convertibles, oddball body styles. Mitigation: the borderline lane absorbs these into human review rather than guessing; they become labeled training data over time.

OPEN QUESTIONS
Where exactly do first photos live, and can we get an event on upload?
Is there a clean photo→MS→RM mapping, or do we assemble it?
Roughly what fraction of photos get rejected today (sets review-queue volume)?
Should the email go only on hard fails, or also nudge borderline-but-passed photos?
Does Redline want a "reshoot received/closed" loop tracked, or is sending the email enough for the pilot?

Cost to Own

Build is cheap and fast. The ongoing story is model-inference cost and a little accuracy tuning — not server maintenance.

Run cost: The infrastructure is serverless and nearly free at idle. The real recurring line item is vision-model inference at 2,000+ images/day via Bedrock. That's a modest, predictable monthly cost while on the LLM path, and it's exactly the cost that drops once high-volume checks graduate to a fine-tuned classifier (Phase 5). S3 storage, Lambda invocations, SES email, and the dashboard are rounding errors by comparison.

The honest version: the cost to own this is not engineering hours after launch — there's very little to maintain. The two things that need ongoing attention are:

Accuracy tuning. Periodically re-check precision/recall against the validation set, fold in the human-reviewer labels, and adjust the confidence band. This is light, occasional work — measured in hours per month, not a standing role — but it's the work that keeps MSs trusting the emails.
Image-access integration. If Redline's photo source changes (new vendor, new pipeline), the ingest connector needs updating. This is the same dependency that gates the build, just over the system's lifetime.

Bottom line: A high-leverage pilot. It gives back hours of a person's day every day, surfaces bad photos in minutes instead of hours, and applies one consistent standard to all 2,000+. The build is fast and low-risk; the durable investment is keeping the model honest and the photo feed connected. Both are small and well understood.