REPLY AI Agent Challenge - April 2026

I participated in the REPLY AI Agent Challenge in April 2026.

What the challenge actually was

Stripping away the narrative, the problem was:

Given multiple datasets (transactions, users, locations, SMS, emails),
decide which transactions are fraudulent.

Key constraints:

fraud behavior evolves over time
decisions have asymmetric cost (false positives vs false negatives)
systems must generalize across datasets
only the evaluation dataset determines the official score
only the first submission on evaluation data is accepted

Beyond accuracy, the system was also evaluated on:

cost
speed
efficiency

Pre-Challenge Execution

Before the challenge started, I focused on something that usually gets ignored:

how the team would operate under constraint.

The team was split between Japan and Brazil, so I introduced a minimal structure early:

one shared channel for final decisions and conclusions
private loops for iteration and feedback
explicit separation between signal and noise

Roles were defined upfront:

Architecture / Direction (me) → system design, decisions, scope control
Core Engineering → implementation and integration
Validation / Output → testing and final result

We aligned on a simple execution loop:

define → implement → connect → test → repeat

This was not about adding process overhead. It was about avoiding:

duplicated work
misalignment under time pressure
decision bottlenecks

Result:

the team could move in parallel without losing direction

Strategy Before Data

Before seeing the full problem, we aligned on a few working assumptions:

this is a decision system over data, not a UI problem
iteration speed matters more than initial accuracy
we need something that works early and improves incrementally

Based on that, we prepared:

a basic data pipeline
a modular scoring structure
logging to support fast iteration

So when the challenge started:

we were adapting a system, not starting from zero

System Design

We built a layered decision system to balance signal quality, cost, and speed.

L1 — Statistical Baseline

For each user, we computed:

median transaction amount
MAD (Median Absolute Deviation)
behavioral references (recipients, methods, activity hours)

This established a stable notion of “normal” behavior per user.
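
The per-user baseline above can be sketched as follows. This is a minimal illustration, not the code used in the challenge: the field names "user_id" and "amount" and the function name build_baselines are assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def build_baselines(transactions):
    """Return {user_id: {"median": ..., "mad": ...}}.

    Each transaction is a dict with "user_id" and "amount" keys
    (hypothetical field names, chosen for this sketch).
    """
    amounts = defaultdict(list)
    for tx in transactions:
        amounts[tx["user_id"]].append(tx["amount"])

    baselines = {}
    for user_id, values in amounts.items():
        med = median(values)
        # MAD: median of the absolute deviations from the user's median.
        mad = median(abs(v - med) for v in values)
        baselines[user_id] = {"median": med, "mad": mad}
    return baselines


history = [
    {"user_id": "u1", "amount": 20.0},
    {"user_id": "u1", "amount": 25.0},
    {"user_id": "u1", "amount": 30.0},
    {"user_id": "u1", "amount": 500.0},
]
baselines = build_baselines(history)
```

Median and MAD are used instead of mean and standard deviation because a single large outlier (the 500.0 above) barely moves them, so the baseline stays stable even when the history already contains fraud.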

L2 — Feature-Based Scoring

Each transaction was transformed into a set of features:

amount deviation (MAD-based)
amount vs income
balance drain ratio
new recipient detection
geo inconsistency (when applicable)
description-based signals
phishing exposure prior to the transaction
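
A few of these features can be sketched as small pure functions. The function names, the 1.4826 scaling constant (a common convention that makes MAD comparable to a standard deviation for normally distributed data), and the handling of edge cases are assumptions for illustration; the write-up does not specify them.

```python
def amount_deviation(amount, med, mad, floor=1e-6):
    """Robust z-score: distance from the user's median in scaled-MAD units."""
    # The floor avoids division by zero when all past amounts are identical.
    scaled_mad = max(1.4826 * mad, floor)
    return abs(amount - med) / scaled_mad


def balance_drain_ratio(amount, balance_before):
    """Share of the available balance consumed by the transaction, capped at 1."""
    if balance_before <= 0:
        # No usable balance information: report no signal (an assumption).
        return 0.0
    return min(amount / balance_before, 1.0)


def is_new_recipient(recipient, known_recipients):
    """True when the user has never paid this recipient before."""
    return recipient not in known_recipients
```

For example, with a median of 27.5 and a MAD of 5.0, an amount of 500.0 sits more than 60 scaled-MAD units away from the user's normal range, while an amount equal to the median scores 0.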

Phishing exposure was modeled by:

parsing SMS and emails
detecting suspicious patterns
building a timeline per user
correlating transaction timing with prior events

This added context that is not visible in the transaction alone.
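
The four steps above can be sketched as a keyword match, a sorted per-user timeline, and a lookback window. The regular expression, the 48-hour window, and the field names "user_id", "text", and "timestamp" are illustrative assumptions, not the patterns or parameters used in the challenge.

```python
import re
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical patterns; a real list would be tuned on the provided messages.
SUSPICIOUS = re.compile(
    r"verify your account|urgent|click (here|the link)|password",
    re.IGNORECASE,
)


def build_timeline(messages):
    """Map each user to a sorted list of suspicious-message timestamps."""
    timeline = defaultdict(list)
    for msg in messages:
        if SUSPICIOUS.search(msg["text"]):
            timeline[msg["user_id"]].append(msg["timestamp"])
    for stamps in timeline.values():
        stamps.sort()
    return timeline


def phishing_exposure(timeline, user_id, tx_time, window=timedelta(hours=48)):
    """True if a suspicious message arrived within `window` before tx_time."""
    stamps = timeline.get(user_id, [])
    # Index of the first suspicious message at or after the window start.
    start = bisect_left(stamps, tx_time - window)
    return start < len(stamps) and stamps[start] < tx_time


messages = [
    {"user_id": "u1", "text": "Your parcel has shipped",
     "timestamp": datetime(2026, 4, 1, 9, 0)},
    {"user_id": "u1", "text": "URGENT: verify your account now",
     "timestamp": datetime(2026, 4, 10, 9, 0)},
]
timeline = build_timeline(messages)
exposed = phishing_exposure(timeline, "u1", datetime(2026, 4, 11, 8, 0))
```

Keeping the timestamps sorted lets each transaction be checked with a binary search instead of a scan over all of the user's messages.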

L3 — Composite Scoring

Features were combined into a weighted score.

Design choices:

no single feature is decisive
known benign patterns reduce score
small transactions are heavily down-weighted
certain transaction types are penalized

We used a dynamic threshold, flagging roughly the top 10% of transactions by score rather than applying a fixed cutoff.

This ensured:

output that stayed within the submission's validity constraints
adaptability across datasets
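
A minimal sketch of the weighted score and the rank-based threshold follows. The weights, the small-amount cutoff, and the feature names are placeholders invented for the example, features are assumed to be pre-scaled to the range 0 to 1, and the penalty for specific transaction types is left out for brevity.

```python
# Hypothetical weights; the negative entry models "known benign patterns".
WEIGHTS = {
    "amount_deviation": 0.35,
    "balance_drain": 0.25,
    "new_recipient": 0.15,
    "phishing_exposure": 0.25,
    "known_benign": -0.30,
}


def composite_score(features, amount, small_amount=10.0, small_factor=0.2):
    """Weighted sum of features, down-weighted for small transactions."""
    score = sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    if amount < small_amount:
        score *= small_factor
    return max(score, 0.0)


def flag_top_fraction(scores, fraction=0.10):
    """Return the ids of the highest-scoring `fraction` of transactions.

    `scores` maps transaction id to composite score. At least one id is
    returned for non-empty input so the output is never empty.
    """
    if not scores:
        return set()
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])
```

Because the cutoff is a rank rather than a fixed score, the same code flags a comparable share of transactions on datasets whose score distributions differ.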

L4 — Selective LLM Usage

The LLM was used selectively, not as the primary mechanism.

We sent:

the ~30% highest-risk transactions (ranked by composite score)

to a Groq-hosted model.

The model received:

transaction context
user profile
recent communications

and returned a probability score.

In this system:

the LLM acts as a secondary signal layered on top of the statistical and behavioral analysis
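
The routing idea can be sketched without committing to any particular client library. Here llm_score is a hypothetical callable standing in for the request to the hosted model, and the equal-weight blend of local score and model probability is an assumption; the write-up does not say how the two signals were combined.

```python
def route_to_llm(scores, llm_score, fraction=0.30, llm_weight=0.5):
    """Blend an LLM probability into the top `fraction` of local scores.

    scores:    dict of transaction id -> local score in [0, 1]
    llm_score: callable taking a transaction id and returning a probability
               (hypothetical stand-in for the hosted model call)
    """
    if not scores:
        return {}
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    escalated = set(ranked[:k])

    final = {}
    for tx_id, local in scores.items():
        if tx_id in escalated:
            # Clamp the model output so a malformed reply cannot dominate.
            p = min(max(llm_score(tx_id), 0.0), 1.0)
            final[tx_id] = (1 - llm_weight) * local + llm_weight * p
        else:
            # Low-risk transactions are resolved locally, with no model call.
            final[tx_id] = local
    return final
```

With 10 transactions and the default fraction, only the three highest-scoring ones trigger a model call; the remaining seven keep their local score unchanged.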

Key Design Choice

Instead of trying to detect fraud directly, we focused on:

what happens before the transaction.

Specifically:

phishing messages
suspicious emails
timing between contact and action

This allowed us to model the context that plausibly led to a transaction, not just isolated anomalies.

Efficiency Considerations

Efficiency was treated as a first-class constraint:

most transactions resolved locally
LLM called only for the highest-risk transactions
batch processing to control latency
simple statistical methods over heavier models

This kept the system responsive and predictable under load.
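
Batching can be as simple as slicing the escalated transactions into fixed-size groups so that each group becomes one unit of work. The helper name chunks and the batch size are assumptions for illustration.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


escalated_ids = ["t1", "t2", "t3", "t4", "t5"]
batches = list(chunks(escalated_ids, 2))
```

A fixed batch size bounds the amount of work per request, which makes total latency roughly proportional to the number of batches rather than to the number of transactions.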

Result

40th place out of 1,971 teams

Given:

a 6-hour constraint
a distributed team
evolving datasets

this result reflects consistent execution rather than a single optimization.

What I would improve

Under the time constraint, several decisions were made to prioritize speed and reliability.

With more time, I would focus on:

Temporal modeling
Transactions were mostly evaluated independently. Incorporating sequence-aware analysis would better capture evolving fraud behavior.
Adaptive weighting
Feature weights were fixed. A data-driven approach would improve generalization across datasets.
Tighter feedback loop
Thresholding and scoring were calibrated per dataset, but not continuously refined during execution.
More selective LLM routing
The LLM was already limited to the highest-risk transactions, but the selection could be tuned further for cost versus impact.

These were deliberate trade-offs:

prioritize a system that works reliably under constraint over a more complex system that would require more time to stabilize.

Final Takeaway

This challenge was not about building a perfect model.

It was about:

making decisions under uncertainty
balancing accuracy with cost and speed
structuring a system that can adapt quickly

And at the team level:

creating enough structure so execution remains stable under pressure