🧬 AI Prompt Evolution System

Automatically evolves LLM prompts using evolutionary algorithms (mutation, selection, and ranking) to optimize accuracy, reasoning quality, and coherence.

Python 3.9+ · MIT License · Evolutionary AI · No External Deps

- All-Time Best Score: 0.847
- Generations Run: 10
- Population Size: 8
- Score Improvement: +41%

📈 Evolution Convergence

🔬 Mutation Effectiveness

๐Ÿ† Best Prompt โ€” Detailed Metrics

"Think step by step and analyze and respond to: Explain the concept of machine learning in simple terms, including how it works and real-world applications. Explain your reasoning."
Accuracy
0.882
Reasoning Quality
0.821
Coherence
0.794

🥇 Prompt Leaderboard

| Rank | Generation | Composite Score | Prompt Preview | Tags |
|------|------------|-----------------|----------------|------|
| 1 | Gen 9 | 0.847 | "Think step by step and analyze: Explain ML... Explain your reasoning." | CoT, mutated |
| 2 | Gen 8 | 0.831 | "You are an expert assistant. Step 1: Understand. Step 2: Apply..." | structured, elite |
| 3 | Gen 10 | 0.819 | "### Task Explain ML... ### Instructions - Be accurate..." | structured, crossover |
| 4 | Gen 7 | 0.795 | "As a knowledgeable AI, provide an expert response to: Explain ML..." | template |
| 5 | Gen 5 | 0.773 | "Let's think through this carefully... Question: Explain ML..." | few-shot, mutated |

โš™๏ธ Evolution Pipeline

๐ŸŒฑ
Generate
Initial population
of diverse prompts
โ†’
๐Ÿค–
LLM Inference
Run each prompt
through the model
โ†’
๐Ÿ“Š
Evaluate
Score accuracy,
reasoning, coherence
โ†’
๐Ÿ†
Rank & Select
Elitism + tournament
selection
โ†’
๐Ÿงฌ
Mutate
Prefix, paraphrase,
crossover operators
โ†ป
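The loop structure of the pipeline can be sketched in a few lines. This is an illustrative skeleton only: the function names (`evolve`, `score_fn`), the toy string-append "mutation", and the length-based fitness are assumptions for demonstration, not the project's actual API.

```python
import random

def evolve(seed_prompts, score_fn, generations=10, pop_size=8, elite_fraction=0.25):
    """Sketch of the generate -> infer -> evaluate -> select -> mutate loop."""
    population = list(seed_prompts)[:pop_size]
    best = None
    for gen in range(1, generations + 1):
        # Evaluate: score_fn stands in for LLM inference plus metric scoring.
        scored = sorted(((score_fn(p), p) for p in population), reverse=True)
        if best is None or scored[0][0] > best[0]:
            best = scored[0]
        # Rank & select: keep elites unchanged, refill the rest via tournaments.
        n_elite = max(1, int(elite_fraction * pop_size))
        elites = [p for _, p in scored[:n_elite]]
        children = []
        while len(elites) + len(children) < pop_size:
            contenders = random.sample(scored, k=min(3, len(scored)))
            parent = max(contenders)[1]
            # Toy mutation: a real system would apply the operators listed above.
            children.append(parent + " Explain your reasoning.")
        population = elites + children
    return best  # (score, prompt) pair

# Toy fitness for illustration: longer prompts score higher.
best = evolve(["Explain ML.", "Describe ML simply."],
              score_fn=lambda p: len(p) / 100, generations=3, pop_size=4)
```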

📖 Evolutionary Prompt Optimization: Documentation

What is Evolutionary Prompt Optimization?

Evolutionary Prompt Optimization (EPO) applies Genetic Algorithm principles to the problem of prompt engineering for LLMs. A population of prompt candidates is maintained across generations. Each candidate is scored using a fitness function that measures output quality. High-scoring prompts reproduce (are selected as parents), and new offspring are created through mutation and crossover. Over generations, the population converges toward prompts that consistently elicit high-quality responses.

Fitness Function

The composite fitness score combines three metrics: accuracy × 0.40 + reasoning_quality × 0.35 + coherence × 0.25. Accuracy measures keyword overlap with expected answers. Reasoning quality scores the use of logical connectives, structured steps, and response depth. Coherence evaluates transition word usage, sentence variety, and paragraph structure. All three scores are in [0, 1].
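The weighted sum translates directly to code; a minimal sketch of the stated formula (the function name is ours, not the project's):

```python
def composite_fitness(accuracy, reasoning_quality, coherence):
    """Composite score from three [0, 1] metrics, using the documented weights."""
    return 0.40 * accuracy + 0.35 * reasoning_quality + 0.25 * coherence

score = composite_fitness(0.9, 0.8, 0.7)  # 0.36 + 0.28 + 0.175 = 0.815
```

Because the weights sum to 1, the composite score also stays in [0, 1].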

Mutation Operators

Five mutation operators are applied stochastically: prefix_injection prepends instruction prefixes; suffix_modification appends reasoning directives; instruction_paraphrase replaces instruction verbs with synonyms; structure_mutation toggles markdown formatting; and temperature_word_swap randomly substitutes words from a paraphrase pool. A separate crossover operator recombines sentence halves from two parents to create two offspring.
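Two of the operators plus crossover might look like the following sketch. The prefix pool and the sentence-split heuristic are assumptions for illustration, not the project's actual pools or splitting logic:

```python
import random

def prefix_injection(prompt, rng=random):
    """Prepend an instruction prefix drawn from a (hypothetical) prefix pool."""
    prefixes = ["Think step by step and ", "You are an expert assistant. "]
    return rng.choice(prefixes) + prompt

def suffix_modification(prompt):
    """Append a reasoning directive to the prompt."""
    return prompt + " Explain your reasoning."

def crossover(parent_a, parent_b):
    """Recombine sentence halves from two parents into two offspring."""
    a, b = parent_a.split(". "), parent_b.split(". ")
    cut_a, cut_b = len(a) // 2, len(b) // 2
    child_1 = ". ".join(a[:cut_a] + b[cut_b:])
    child_2 = ". ".join(b[:cut_b] + a[cut_a:])
    return child_1, child_2
```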

Selection Strategy

Elitist selection preserves the top elite_fraction (default 25%) of prompts unchanged each generation, preventing fitness regression. The remaining slots are filled via tournament selection: k candidates are sampled randomly and the best is chosen. This balances exploitation of high-fitness prompts with exploration of the broader population.
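Combining elitism and tournament selection is a short routine; a sketch over (score, prompt) pairs, with function and parameter names assumed rather than taken from the codebase:

```python
import random

def select(scored, pop_size, elite_fraction=0.25, k=3, rng=random):
    """Elitism plus tournament selection over (score, prompt) pairs."""
    ranked = sorted(scored, reverse=True)
    n_elite = max(1, int(elite_fraction * pop_size))
    survivors = [prompt for _, prompt in ranked[:n_elite]]  # elites pass through unchanged
    while len(survivors) < pop_size:
        tournament = rng.sample(ranked, k=min(k, len(ranked)))
        survivors.append(max(tournament)[1])  # best of k randomly sampled candidates
    return survivors
```

Raising k makes tournaments more exploitative (stronger candidates win more often); lowering it preserves more diversity.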

Adaptive Mutation Rate

The mutation rate decays linearly from 0.7 at the start of the run to 0.1 by the final generation, following rate = 0.7 - 0.6 × (gen / max_gen). This follows the exploration-exploitation trade-off: high mutation early explores diverse prompt structures, while low mutation late refines the best candidates found.
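The decay schedule in one function (parameter names are ours; the endpoints 0.7 and 0.1 come from the text):

```python
def mutation_rate(gen, max_gen, start=0.7, end=0.1):
    """Linear decay from `start` at gen 0 to `end` at the final generation."""
    return start - (start - end) * (gen / max_gen)

# Early generations mutate aggressively; late ones barely mutate at all.
early, late = mutation_rate(1, 10), mutation_rate(10, 10)  # 0.64, 0.1
```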

Convergence

The system tracks per-generation best, average, and worst scores. Convergence is reached when best scores plateau across consecutive generations. The RankingSystem stores the all-time best prompt and can export the full evolution history as JSON for analysis or visualization.
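A plateau check over the tracked best scores could be implemented as follows. The window size, tolerance, and function name are assumptions; this relies on best scores being non-decreasing, which elitism guarantees:

```python
def has_converged(best_scores, window=3, tol=1e-3):
    """True when the best score improved by less than `tol` over the last `window` generations."""
    if len(best_scores) < window + 1:
        return False  # not enough history to judge a plateau
    recent = best_scores[-(window + 1):]
    return recent[-1] - recent[0] < tol
```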