I participated in the REPLY AI Agent Challenge in April 2026.
What the challenge actually was
Stripping away the narrative, the problem was:
Given multiple datasets (transactions, users, locations, SMS, emails),
decide which transactions are fraudulent.
Key constraints:
- fraud behavior evolves over time
- decisions have asymmetric cost (false positives vs false negatives)
- systems must generalize across datasets
- only the evaluation dataset determines the official score
- only the first submission on evaluation data is accepted
Beyond accuracy, the system was evaluated on:
- cost
- speed
- efficiency
Pre-Challenge Execution
Before the challenge started, I focused on something that usually gets ignored:
how the team would operate under constraint.
The team was split between Japan and Brazil, so I introduced a minimal structure early:
- one shared channel for final decisions and conclusions
- private loops for iteration and feedback
- explicit separation between signal and noise
Roles were defined upfront:
- Architecture / Direction (me) → system design, decisions, scope control
- Core Engineering → implementation and integration
- Validation / Output → testing and final result
We aligned on a simple execution loop:
define → implement → connect → test → repeat
This was not about process overhead — it was about avoiding:
- duplicated work
- misalignment under time pressure
- decision bottlenecks
Result:
the team could move in parallel without losing direction
Strategy Before Data
Before seeing the full problem, we aligned on a few working assumptions:
- this is a decision system over data, not a UI problem
- iteration speed matters more than initial accuracy
- we need something that works early and improves incrementally
Based on that, we prepared:
- a basic data pipeline
- a modular scoring structure
- logging to support fast iteration
So when the challenge started:
we were adapting a system, not starting from zero
System Design
We built a layered decision system to balance signal quality, cost, and speed.
L1 — Statistical Baseline
For each user, we computed:
- median transaction amount
- MAD (Median Absolute Deviation)
- behavioral references (recipients, methods, activity hours)
This established a stable notion of “normal” behavior per user.
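A minimal sketch of this baseline, not the code we ran. It assumes each transaction is a dict with hypothetical `user_id` and `amount` keys, and the `mad_floor` guard against a zero MAD is an illustrative choice:

```python
from collections import defaultdict
from statistics import median


def build_baselines(transactions):
    """Compute the per-user median amount and MAD (Median Absolute Deviation)."""
    amounts = defaultdict(list)
    for tx in transactions:
        amounts[tx["user_id"]].append(tx["amount"])

    baselines = {}
    for user_id, values in amounts.items():
        med = median(values)
        mad = median([abs(v - med) for v in values])
        baselines[user_id] = {"median": med, "mad": mad}
    return baselines


def amount_deviation(amount, baseline, mad_floor=1.0):
    """Distance from the user's median, measured in MAD units.

    mad_floor (an assumption) avoids division by zero for users whose
    amounts are all identical.
    """
    return abs(amount - baseline["median"]) / max(baseline["mad"], mad_floor)
```

The median and MAD are used instead of mean and standard deviation because a single extreme transaction barely moves them, so the "normal" reference stays stable even when the history already contains outliers.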
L2 — Feature-Based Scoring
Each transaction was transformed into a set of features:
- amount deviation (MAD-based)
- amount vs income
- balance drain ratio
- new recipient detection
- geo inconsistency (when applicable)
- description-based signals
- phishing exposure prior to the transaction
Phishing exposure was modeled by:
- parsing SMS and emails
- detecting suspicious patterns
- building a timeline per user
- correlating transaction timing with prior events
This added context that is not visible in the transaction alone.
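The timeline idea can be sketched roughly as follows. The keyword pattern, the 72-hour window, the linear decay, and the dict keys (`user_id`, `text`, `timestamp`) are all assumptions made for illustration, not the rules we actually used:

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern; real detection would need a broader rule set.
SUSPICIOUS = re.compile(
    r"verify your account|urgent|click here|password", re.IGNORECASE
)


def build_timeline(messages):
    """Map each user to the sorted timestamps of their suspicious messages."""
    timeline = {}
    for msg in messages:
        if SUSPICIOUS.search(msg["text"]):
            timeline.setdefault(msg["user_id"], []).append(msg["timestamp"])
    for events in timeline.values():
        events.sort()
    return timeline


def phishing_exposure(tx, timeline, window=timedelta(hours=72)):
    """Score in [0, 1]: 1.0 right after a suspicious message, 0.0 once
    the most recent prior message is older than `window`."""
    prior = [t for t in timeline.get(tx["user_id"], []) if t <= tx["timestamp"]]
    if not prior:
        return 0.0
    gap = tx["timestamp"] - prior[-1]
    if gap > window:
        return 0.0
    return 1.0 - gap / window
```

Only messages that precede the transaction count, so a phishing SMS received afterwards does not raise the score of an earlier payment.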
L3 — Composite Scoring
Features were combined into a weighted score.
Design choices:
- no single feature is decisive
- known benign patterns reduce score
- small transactions are heavily down-weighted
- certain transaction types are penalized
We used a dynamic threshold (roughly the top 10% of scores) to select suspicious transactions.
This ensured:
- valid output constraints
- adaptability across datasets
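A sketch of how such a composite score and relative threshold could look. The weights, the feature names, the small-transaction cutoff, and the "flag at least one" rule are placeholders chosen for the example, not our tuned values:

```python
# Hypothetical weights; features are assumed to be normalized to [0, 1].
WEIGHTS = {
    "amount_deviation": 0.30,
    "new_recipient": 0.20,
    "balance_drain": 0.20,
    "phishing_exposure": 0.30,
    "known_benign": -0.25,  # known benign patterns reduce the score
}


def composite_score(features, amount, small_amount=20.0, small_factor=0.2):
    """Weighted sum of features; small transactions are down-weighted."""
    score = sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    if amount < small_amount:
        score *= small_factor
    return max(score, 0.0)


def flag_top_fraction(scores, fraction=0.10):
    """Return the ids of the highest-scoring transactions.

    `scores` maps transaction id -> score. At least one id is returned
    for non-empty input (an assumption about what a valid output needs).
    """
    if not scores:
        return set()
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])
```

Because the cutoff is a rank rather than a fixed score, the number of flagged transactions scales with the dataset even when the score distribution shifts between datasets.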
L4 — Selective LLM Usage
The LLM was used selectively, not as the primary mechanism.
We sent roughly the 30% highest-risk transactions to a Groq-hosted model.
The model received:
- transaction context
- user profile
- recent communications
and returned a probability score.
In this system:
the LLM acts as a secondary signal layered on top of statistical and behavioral analysis
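The routing step can be sketched like this. The model call is represented by a caller-supplied function (`llm_score_fn`, a hypothetical name) so that no specific client API is implied, and the blending weight is an assumed value:

```python
def route_to_llm(scores, llm_score_fn, fraction=0.30, llm_weight=0.4):
    """Blend an LLM probability into the highest-risk transactions only.

    scores:       dict of transaction id -> composite score in [0, 1]
    llm_score_fn: callable taking a transaction id and returning a
                  probability; stands in for the hosted model call
    """
    if not scores:
        return {}
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    selected = set(ranked[:k])

    final = {}
    for tx_id, base in scores.items():
        if tx_id in selected:
            # Clamp the model output so a malformed reply cannot dominate.
            p = min(max(llm_score_fn(tx_id), 0.0), 1.0)
            final[tx_id] = (1 - llm_weight) * base + llm_weight * p
        else:
            final[tx_id] = base  # resolved locally, no model call
    return final
```

Transactions outside the selected fraction keep their local score untouched, which is what keeps the model a secondary signal rather than the decision maker.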
Key Design Choice
Instead of trying to detect fraud directly, we focused on:
what happens before the transaction.
Specifically:
- phishing messages
- suspicious emails
- timing between contact and action
This allowed us to model causal context, not just isolated anomalies.
Efficiency Considerations
Efficiency was treated as a first-class constraint:
- most transactions resolved locally
- LLM used only when necessary
- batch processing to control latency
- simple statistical methods over heavier models
This kept the system responsive and predictable under load.
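As an illustration of the batching point, here is a small sketch. The batch size and the `score_batch_fn` callable are hypothetical; the idea is only that the expensive step is called once per group instead of once per transaction:

```python
def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def score_in_batches(tx_ids, score_batch_fn, batch_size=20):
    """Call `score_batch_fn` once per batch and merge the results.

    score_batch_fn takes a list of ids and returns a dict id -> score.
    """
    results = {}
    for batch in batched(tx_ids, batch_size):
        results.update(score_batch_fn(batch))
    return results
```

With a fixed batch size, the number of expensive calls is known in advance from the number of routed transactions, which makes total latency easier to bound.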
Result
40th place out of 1,971 teams
Given:
- a 6-hour constraint
- a distributed team
- evolving datasets
this result reflects consistent execution rather than a single optimization.
What I would improve
Under the time constraint, several decisions were made to prioritize speed and reliability.
With more time, I would focus on:
- Temporal modeling: Transactions were mostly evaluated independently. Incorporating sequence-aware analysis would better capture evolving fraud behavior.
- Adaptive weighting: Feature weights were fixed. A data-driven approach would improve generalization across datasets.
- Tighter feedback loop: Thresholding and scoring were calibrated per dataset, but not continuously refined during execution.
- More selective LLM routing: The LLM was already used selectively, but selection could be further optimized for cost vs. impact.
These were deliberate trade-offs:
prioritize a system that works reliably under constraint over a more complex system that would require more time to stabilize.
Final Takeaway
This challenge was not about building a perfect model.
It was about:
- making decisions under uncertainty
- balancing accuracy with cost and speed
- structuring a system that can adapt quickly
And at the team level:
creating enough structure so execution remains stable under pressure