Interactive visualizations of machine learning concepts running directly in your browser.
The reward tracks how many timesteps the agent stays balanced in each episode.
Training a model to predict housing prices (Target) based on area (Feature).
Dataset: California Housing (Normalized)
Left: The regression line (red) fitting the data points (blue). As loss decreases, the line aligns better with the data trend.
Linear regression is mathematically identical to a single-neuron neural network with a linear activation. Watch the weight (w) and bias (b) update in real time as gradient descent optimizes them, just as in complex deep networks!
Try a high learning rate (> 0.5) to see divergence. The path will oscillate or fly off the landscape.
Green circle = optimal parameters (global minimum). Red = current position. Yellow = descent path.
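The loop the demo animates can be sketched in a few lines. This is a minimal illustration, not the demo's actual code: the data here is a hypothetical toy set (the demo uses normalized California Housing), and the learning rates are chosen only to show convergence vs. divergence.

```python
# Toy single-neuron linear regression trained by gradient descent.
# Hypothetical data following y = 2x + 1; the demo uses real housing data.
def train(lr, steps):
    xs = [0.0, 1.0, 2.0, 3.0]   # feature: area (normalized)
    ys = [1.0, 3.0, 5.0, 7.0]   # target: price
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * dw            # the update the demo visualizes
        b -= lr * db
    return w, b

w, b = train(lr=0.05, steps=2000)
print(round(w, 2), round(b, 2))   # settles near the optimum w=2, b=1

w_bad, _ = train(lr=0.6, steps=50)
print(abs(w_bad) > 100)           # too-high learning rate: the path diverges
```

With `lr=0.05` the parameters converge to the global minimum; with `lr=0.6` each step overshoots and the error compounds, which is exactly the "fly off the landscape" behavior the demo shows.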
The "Juice Un-mixer" Analogy
Imagine you have a smoothie made of apples, kale, and ginger. If you taste it, you just taste "smoothie" (a messy, mixed signal). An SAE is like a magical machine that takes one sip and tells you exactly how many grams of apple, kale, and ginger were used. It "un-mixes" the ingredients into their original, pure forms.
AI models are "greedy." To save space, they often use a single neuron to represent multiple unrelated things (like "Dogs" and "The Eiffel Tower"). This is called Polysemanticity. It makes the model efficient but impossible for humans to read.
An SAE creates a "Learned Dictionary" of thousands of simple templates. By checking the messy AI signal against this dictionary, it finds the few specific "templates" that match the current thought.
Usually, neural networks try to use every neuron a little bit. We force the SAE to use as few "dictionary items" as possible (Sparsity). This pressure forces the AI to find pure, high-level concepts instead of blurry mixtures.
In Demo 1, new dictionary patches start random and gray. As training progresses, the SAE realizes most of them are useless noise. The L1 Regularization (sparsity penalty) forces these useless features to zero (they fade to black). Only the most useful features that explain real patterns (like edges) survive. This is "Feature Selection" in action.
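The trade-off driving that fade-to-black behavior can be shown with toy numbers. This is a hypothetical illustration of the SAE objective (reconstruction error plus an L1 sparsity penalty), not the demo's training code; the activations and coefficient below are made up for clarity.

```python
# SAE objective sketch: loss = reconstruction_error + l1_coeff * sum(|activations|)
def sae_loss(recon_err, activations, l1_coeff=0.5):
    sparsity_penalty = l1_coeff * sum(abs(a) for a in activations)
    return recon_err + sparsity_penalty

# Two encodings that reconstruct the input equally well (same recon_err):
dense  = [0.3, 0.3, 0.3, 0.3, 0.3, 0.3]   # every dictionary item "a little bit"
sparse = [0.9, 0.0, 0.0, 0.0, 0.0, 0.0]   # one strong, pure feature

print(round(sae_loss(0.1, dense), 4))     # higher loss: penalized for density
print(round(sae_loss(0.1, sparse), 4))    # lower loss: sparsity wins
```

Because the L1 term charges for every nonzero activation, gradient descent keeps shrinking features that don't pay for themselves in reconstruction quality, which is why the useless gray patches in Demo 1 fade to black.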
Draw a shape and click "Train." Watch as the "Learned Dictionary" patches evolve from random noise into specific edge detectors that represent your drawing.
These are the templates the AI uses to "read" your drawing.
In Large Language Models, concepts overlap in a confused state called Polysemanticity. One neuron might respond to both "Apple (the fruit)" and "Apple (the company)." By using an SAE, researchers can "separate" these concepts into individual, Monosemantic features.
Densely packed: overlapping signals where one neuron carries multiple meanings.
The SAE "unmixes" the noise: revealing the specific concepts the AI is processing.
| Concept | LLM Reality (The Problem) | SAE Solution (The Fix) |
|---|---|---|
| Superposition | AI packs too many concepts into too few neurons. | Expands concepts into a massive overcomplete layer. |
| Polysemanticity | One neuron handles "Bananas" and "The Space Shuttle." | Each dictionary item isolates a single monosemantic idea. |
| Black Box Loss | AI behavior is inscrutable and "alien" to humans. | Transforms weights into a map of human concepts. |
In 2024, Anthropic used SAEs on their Claude 3 Sonnet model to discover millions of features, including a specific "Golden Gate Bridge" feature. When they manually clamped this feature to "ON", the model became obsessed with the bridge, mentioning it in unrelated conversations. This proved that SAEs don't just find correlation; they find the actual controls of the AI's mind.
How do 5 different features fit into just 2 neurons? The AI learns to arrange them in a star-like shape. The SAE solves a "matching problem" to reconstruct data from this compressed space.
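The star-like arrangement can be computed directly. In this hypothetical sketch, five feature directions are spaced evenly around a circle in a 2-neuron space: each feature reads itself back perfectly, but neighbors overlap slightly, and that interference is the price of superposition.

```python
import math

# Five hypothetical feature directions packed into 2 neurons,
# spaced evenly (72 degrees apart) to minimize interference.
features = [
    (math.cos(2 * math.pi * k / 5), math.sin(2 * math.pi * k / 5))
    for k in range(5)
]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# A feature aligned with itself recovers full strength...
print(round(dot(features[0], features[0]), 3))   # 1.0
# ...but overlaps with its neighbor by cos(72 degrees):
print(round(dot(features[0], features[1]), 3))   # ~0.309
```

No arrangement of 5 directions in 2 dimensions can make all pairs orthogonal; the SAE's job is to undo exactly this kind of overlap when reconstructing which features were active.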
Compare how Serial processing (CPU) differs from Parallel processing (GPU) on matrix tasks.
Designed with a few very fast, versatile cores, CPUs are optimized for sequential processing (doing one thing after another very quickly) and for handling complex logic and branching.
Designed with thousands of smaller, specialized cores. While each individual core may be slower than a CPU core, their massive parallelism makes them dramatically faster at the vector and matrix operations used in ML and gaming.
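Matrix multiplication shows why this split exists. Each output cell is an independent dot product, so a GPU can assign one core per cell and compute them all at once, while a CPU walks through them one after another. The loop below is a serial, CPU-style sketch of that structure:

```python
# Why matrix multiply parallelizes: every output cell is an
# independent dot product, so no cell waits on another cell's result.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

def cell(i, j):
    # Depends only on row i of A and column j of B.
    return sum(A[i][k] * B[k][j] for k in range(len(B)))

# Serial CPU-style traversal; a GPU would launch all four cells in parallel.
C = [[cell(i, j) for j in range(2)] for i in range(2)]
print(C)  # [[19, 22], [43, 50]]
```

With thousands of cores, a GPU evaluates thousands of `cell(i, j)` calls simultaneously, which is why large matrix workloads favor it so heavily.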
TPUs (Tensor Processing Units) and NPUs (Neural Processing Units) take specialization even further. They are essentially stripped-down GPUs designed exclusively for the mathematical operations (tensor arithmetic) used in Deep Learning. By assuming the workload is always neural networks, they remove graphics-specific hardware (like texture mapping and ROPs) to pack even more compute density for AI tasks.