🧬 AI Prompt Evolution System

Automatically evolves LLM prompts using evolutionary algorithms (mutation, selection, and ranking) to optimize accuracy, reasoning quality, and coherence.

Python 3.9+ · MIT License · Evolutionary AI · No External Deps

- All-Time Best Score: 0.847
- Generations Run: 10
- Population Size: 8
- Score Improvement: +41%

📈 Evolution Convergence

🔬 Mutation Effectiveness

๐Ÿ† Best Prompt โ€” Detailed Metrics

"Think step by step and analyze and respond to: Explain the concept of machine learning in simple terms, including how it works and real-world applications. Explain your reasoning."
Accuracy
0.882
Reasoning Quality
0.821
Coherence
0.794

🥇 Prompt Leaderboard

| Rank | Generation | Composite Score | Prompt Preview | Tags |
|------|------------|-----------------|----------------|------|
| 1 | Gen 9 | 0.847 | "Think step by step and analyze: Explain ML... Explain your reasoning." | CoT, mutated |
| 2 | Gen 8 | 0.831 | "You are an expert assistant. Step 1: Understand. Step 2: Apply..." | structured, elite |
| 3 | Gen 10 | 0.819 | "### Task Explain ML... ### Instructions - Be accurate..." | structured, crossover |
| 4 | Gen 7 | 0.795 | "As a knowledgeable AI, provide an expert response to: Explain ML..." | template |
| 5 | Gen 5 | 0.773 | "Let's think through this carefully... Question: Explain ML..." | few-shot, mutated |

โš™๏ธ Evolution Pipeline

๐ŸŒฑ
Generate
Initial population
of diverse prompts
โ†’
๐Ÿค–
LLM Inference
Run each prompt
through the model
โ†’
๐Ÿ“Š
Evaluate
Score accuracy,
reasoning, coherence
โ†’
๐Ÿ†
Rank & Select
Elitism + tournament
selection
โ†’
๐Ÿงฌ
Mutate
Prefix, paraphrase,
crossover operators
โ†ป
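The loop structure of the pipeline can be sketched in a few lines. This is an illustrative skeleton only: the function names (`evolve`, `score_fn`), the toy string-append "mutation", and the length-based fitness are assumptions for demonstration, not the project's actual API.

```python
import random

def evolve(seed_prompts, score_fn, generations=10, pop_size=8, elite_fraction=0.25):
    """Sketch of the generate -> infer -> evaluate -> select -> mutate loop."""
    population = list(seed_prompts)[:pop_size]
    best = None
    for gen in range(1, generations + 1):
        # Evaluate: score_fn stands in for LLM inference plus metric scoring.
        scored = sorted(((score_fn(p), p) for p in population), reverse=True)
        if best is None or scored[0][0] > best[0]:
            best = scored[0]
        # Rank & select: keep elites unchanged, refill the rest via tournaments.
        n_elite = max(1, int(elite_fraction * pop_size))
        elites = [p for _, p in scored[:n_elite]]
        children = []
        while len(elites) + len(children) < pop_size:
            contenders = random.sample(scored, k=min(3, len(scored)))
            parent = max(contenders)[1]
            # Toy mutation: a real system would apply the operators listed above.
            children.append(parent + " Explain your reasoning.")
        population = elites + children
    return best  # (score, prompt) pair

# Toy fitness for illustration: longer prompts score higher.
best = evolve(["Explain ML.", "Describe ML simply."],
              score_fn=lambda p: len(p) / 100, generations=3, pop_size=4)
```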

📖 Evolutionary Prompt Optimization: Documentation

What is Evolutionary Prompt Optimization?

Evolutionary Prompt Optimization (EPO) applies Genetic Algorithm principles to the problem of prompt engineering for LLMs. A population of prompt candidates is maintained across generations. Each candidate is scored using a fitness function that measures output quality. High-scoring prompts reproduce (are selected as parents), and new offspring are created through mutation and crossover. Over generations, the population converges toward prompts that consistently elicit high-quality responses.

Fitness Function

The composite fitness score combines three metrics: accuracy × 0.40 + reasoning_quality × 0.35 + coherence × 0.25. Accuracy measures keyword overlap with expected answers. Reasoning quality scores the use of logical connectives, structured steps, and response depth. Coherence evaluates transition word usage, sentence variety, and paragraph structure. All three scores are in [0, 1].
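The weighted sum translates directly to code; a minimal sketch of the stated formula (the function name is ours, not the project's):

```python
def composite_fitness(accuracy, reasoning_quality, coherence):
    """Composite score from three [0, 1] metrics, using the documented weights."""
    return 0.40 * accuracy + 0.35 * reasoning_quality + 0.25 * coherence

score = composite_fitness(0.9, 0.8, 0.7)  # 0.36 + 0.28 + 0.175 = 0.815
```

Because the weights sum to 1, the composite score also stays in [0, 1].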

Mutation Operators

Five mutation operators are applied stochastically: prefix_injection prepends instruction prefixes; suffix_modification appends reasoning directives; instruction_paraphrase replaces instruction verbs with synonyms; structure_mutation toggles markdown formatting; and temperature_word_swap randomly substitutes words from a paraphrase pool. A separate crossover operator recombines sentence halves from two parents to create two offspring.
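Two of the operators plus crossover might look like the following sketch. The prefix pool and the sentence-split heuristic are assumptions for illustration, not the project's actual pools or splitting logic:

```python
import random

def prefix_injection(prompt, rng=random):
    """Prepend an instruction prefix drawn from a (hypothetical) prefix pool."""
    prefixes = ["Think step by step and ", "You are an expert assistant. "]
    return rng.choice(prefixes) + prompt

def suffix_modification(prompt):
    """Append a reasoning directive to the prompt."""
    return prompt + " Explain your reasoning."

def crossover(parent_a, parent_b):
    """Recombine sentence halves from two parents into two offspring."""
    a, b = parent_a.split(". "), parent_b.split(". ")
    cut_a, cut_b = len(a) // 2, len(b) // 2
    child_1 = ". ".join(a[:cut_a] + b[cut_b:])
    child_2 = ". ".join(b[:cut_b] + a[cut_a:])
    return child_1, child_2
```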

Selection Strategy

Elitist selection preserves the top elite_fraction (default 25%) of prompts unchanged each generation, preventing fitness regression. The remaining slots are filled via tournament selection: k candidates are sampled randomly and the best is chosen. This balances exploitation of high-fitness prompts with exploration of the broader population.
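Combining elitism and tournament selection is a short routine; a sketch over (score, prompt) pairs, with function and parameter names assumed rather than taken from the codebase:

```python
import random

def select(scored, pop_size, elite_fraction=0.25, k=3, rng=random):
    """Elitism plus tournament selection over (score, prompt) pairs."""
    ranked = sorted(scored, reverse=True)
    n_elite = max(1, int(elite_fraction * pop_size))
    survivors = [prompt for _, prompt in ranked[:n_elite]]  # elites pass through unchanged
    while len(survivors) < pop_size:
        tournament = rng.sample(ranked, k=min(k, len(ranked)))
        survivors.append(max(tournament)[1])  # best of k randomly sampled candidates
    return survivors
```

Raising k makes tournaments more exploitative (stronger candidates win more often); lowering it preserves more diversity.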

Adaptive Mutation Rate

The mutation rate decays linearly from 0.7 at the start of the run to 0.1 by the final generation, following rate = 0.7 - 0.6 × (gen / max_gen). This follows the exploration-exploitation trade-off: high mutation early explores diverse prompt structures, while low mutation late refines the best candidates found.
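The decay schedule in one function (parameter names are ours; the endpoints 0.7 and 0.1 come from the text):

```python
def mutation_rate(gen, max_gen, start=0.7, end=0.1):
    """Linear decay from `start` at gen 0 to `end` at the final generation."""
    return start - (start - end) * (gen / max_gen)

# Early generations mutate aggressively; late ones barely mutate at all.
early, late = mutation_rate(1, 10), mutation_rate(10, 10)  # 0.64, 0.1
```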

Convergence

The system tracks per-generation best, average, and worst scores. Convergence is reached when best scores plateau across consecutive generations. The RankingSystem stores the all-time best prompt and can export the full evolution history as JSON for analysis or visualization.
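A plateau check over the tracked best scores could be implemented as follows. The window size, tolerance, and function name are assumptions; this relies on best scores being non-decreasing, which elitism guarantees:

```python
def has_converged(best_scores, window=3, tol=1e-3):
    """True when the best score improved by less than `tol` over the last `window` generations."""
    if len(best_scores) < window + 1:
        return False  # not enough history to judge a plateau
    recent = best_scores[-(window + 1):]
    return recent[-1] - recent[0] < tol
```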