SmolML: Machine Learning from Scratch, Made Clear!

Introduction

SmolML is a pure Python machine learning library built from the ground up for educational purposes. It aims to provide a transparent, understandable implementation of core machine learning concepts. Unlike production libraries such as Scikit-learn, PyTorch, or TensorFlow, SmolML uses only Python and its standard-library collections, random, and math modules. No NumPy, no SciPy, no C++ extensions – just Python, all the way down. The goal isn’t to compete with production-grade libraries on speed or features, but to help users understand how ML really works.

Core Components

Automatic Differentiation & N-Dimensional Arrays

The foundation of SmolML is a pair of custom building blocks: an autograd engine and an array type. Value is a simple automatic differentiation engine that tracks operations and computes gradients automatically, which is the heart of training neural networks. MLArray is an N-dimensional array inspired by NumPy that supports the common mathematical operations needed for ML. Because it is written in pure Python it is extremely inefficient, but that also makes it easy to read and well suited to understanding how N-dimensional arrays work.
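To make the idea of tracking operations and computing gradients concrete, here is a minimal sketch of reverse-mode automatic differentiation on scalars. The Scalar class and its attributes are hypothetical names chosen for this illustration; this is not SmolML's Value implementation, and it supports only addition and multiplication.

```python
class Scalar:
    """Hypothetical autograd scalar for illustration; not SmolML's Value class."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents           # nodes this value was computed from
        self._backward = lambda: None     # pushes this node's grad to its parents

    def __add__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        out = Scalar(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        out = Scalar(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Order the graph so each node is processed after everything that uses it.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


x = Scalar(2.0)
y = Scalar(3.0)
z = x * y + x      # z = 2 * 3 + 2
z.backward()       # fills x.grad with dz/dx = y + 1 and y.grad with dz/dy = x
```

Each operation records its inputs and a small closure that applies the chain rule, so calling backward() on the final result is enough to populate every gradient.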

Preprocessing Tools

SmolML provides essential preprocessing tools, including scalers like StandardScaler and MinMaxScaler, which are fundamental for preparing data. Many algorithms perform better when features are on a similar scale, and these tools help achieve that.
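As a rough illustration of what these two kinds of scaling do, the sketch below applies standardization and min-max scaling to a single feature column. The function names are hypothetical helpers for this example, not the methods of SmolML's StandardScaler or MinMaxScaler.

```python
import math


def standard_scale(column):
    """Shift to zero mean and divide by the (population) standard deviation."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance) or 1.0   # avoid dividing by zero for constant columns
    return [(v - mean) / std for v in column]


def min_max_scale(column):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0            # avoid dividing by zero for constant columns
    return [(v - lo) / span for v in column]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
standardized = standard_scale(heights_cm)
normalized = min_max_scale(heights_cm)   # [0.0, 0.25, 0.5, 0.75, 1.0]
```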

Build Your Own Neural Networks

SmolML allows you to build your own neural networks with various components:

Activation Functions

It offers non-linearities like relu, sigmoid, softmax, and tanh that allow networks to learn complex patterns.
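For reference, these four functions can be written in a few lines of plain Python. The sketch below works on floats and lists of floats; the signatures are assumptions made for this example, and SmolML's own versions may take different argument types.

```python
import math


def relu(x):
    """Pass positive inputs through unchanged, clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Squash x into (0, 1), branching on sign to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    """Squash x into (-1, 1)."""
    return math.tanh(x)


def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])   # largest score gets the largest probability
```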

Weight Initializers

Smart strategies like Xavier and He are provided to set initial network weights for stable training.
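The sketch below shows one common formulation of each strategy: Xavier (Glorot) uniform, which draws from a range scaled by both layer sizes, and He normal, which scales the standard deviation by the input size only. The function names and the list-of-lists weight layout are assumptions for this example, and SmolML may use different variants.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Uniform in [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def he_normal(fan_in, fan_out, rng):
    """Gaussian with mean 0 and std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)            # seeded so the example is repeatable
w_xavier = xavier_uniform(4, 3, rng)
w_he = he_normal(4, 3, rng)
```

Xavier initialization is usually paired with sigmoid or tanh layers, while He initialization is usually paired with relu layers.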

Loss Functions

You can use loss functions such as mse_loss, binary_cross_entropy, and categorical_cross_entropy to measure model error.
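As a sketch of what these three losses compute, here they are for plain Python lists. The short names mse, bce, and cce and the eps clipping parameter are choices made for this example, not SmolML's signatures.

```python
import math


def mse(y_pred, y_true):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)


def bce(y_pred, y_true, eps=1e-12):
    """Binary cross-entropy; y_pred holds probabilities of the positive class."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)      # keep log() away from 0
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def cce(probs, one_hot, eps=1e-12):
    """Categorical cross-entropy for one sample with a one-hot target."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, one_hot))


regression_error = mse([1.0, 2.0], [1.0, 4.0])     # (0 + 4) / 2
binary_error = bce([0.5], [1])                     # -log(0.5)
multiclass_error = cce([0.25, 0.75], [0, 1])       # -log(0.75)
```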

Optimizers

Algorithms like SGD, Adam, and AdaGrad are available to update model weights based on gradients to minimize loss.
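The simplest of these is the plain gradient-descent update that SGD applies after each batch: move every parameter a small step against its gradient. The sketch below uses a hypothetical sgd_step helper (not SmolML's optimizer interface) to minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3).

```python
def sgd_step(params, grads, lr):
    """Return new parameters after one step against the gradient."""
    return [p - lr * g for p, g in zip(params, grads)]


w = [0.0]                                  # start far from the minimum at w = 3
for _ in range(100):
    grads = [2.0 * (w[0] - 3.0)]           # derivative of (w - 3)^2
    w = sgd_step(w, grads, lr=0.1)
# w[0] is now very close to 3.0
```

Adam and AdaGrad follow the same loop but keep running statistics of past gradients to adapt the step size per parameter.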

Classic ML Models

SmolML also includes implementations of classic ML models:

Regression

It provides implementations of Linear and Polynomial regression for predicting continuous values.
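For a single feature, linear regression has a closed-form least-squares solution, sketched below. This is only an illustration of the model being fitted; the fit_line helper is hypothetical, and SmolML's implementation may fit its parameters differently (for example, with gradient-based training).

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                  # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
prediction = slope * 4.0 + intercept       # predict y for x = 4
```

Polynomial regression reuses the same idea after expanding each input x into powers such as x, x^2, and x^3.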

Tree-Based Models

You can use Decision Tree and Random Forest implementations for classification and regression tasks.
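The core step of growing a classification tree is choosing a split. The sketch below scores candidate thresholds on one feature by weighted Gini impurity, one commonly used criterion; the helper names are hypothetical, and SmolML's trees may use a different criterion or search strategy.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure group, higher for a mixed one."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= threshold' split."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue                        # a split must send samples both ways
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_threshold(values, labels)   # splits cleanly at 3.0
```

A decision tree repeats this search recursively on each side of the split, and a random forest trains many such trees on resampled data and combines their predictions.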

K-Means Clustering

The KMeans clustering algorithm is available for grouping similar data points together.
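The usual procedure behind K-Means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Here is a minimal sketch with fixed starting centroids so the result is repeatable; the kmeans function and its parameters are hypothetical, not SmolML's KMeans interface.

```python
def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign-then-update rounds and return the centroids."""
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```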

Who is SmolML For?

SmolML is suitable for students learning ML concepts for the first time, developers curious about the internals of ML libraries they use daily, and educators looking for a simple, transparent codebase to demonstrate ML principles. It’s also for anyone who enjoys learning by building!

Limitations

SmolML is built for learning, not for breaking speed records or handling massive datasets. Being pure Python, it’s much slower than libraries using optimized C/C++/Fortran backends. It’s best suited for small datasets and toy problems where understanding the mechanics is more important than computation time. SmolML is not recommended for production applications; for real-world tasks, use battle-tested libraries like Scikit-learn, PyTorch, TensorFlow, or JAX.

Getting Started with SmolML

You can start using SmolML by cloning the repository and exploring the code and examples:

git clone https://github.com/rodmarkun/SmolML
cd SmolML

You can also run the tests in the tests/ folder. They compare SmolML against standard libraries such as TensorFlow and sklearn and generate plots with matplotlib, so install the dependencies listed in requirements.txt first:

cd tests
pip install -r requirements.txt

Contributing to SmolML

Contributions to SmolML are always welcome. If you’re interested in contributing, you can fork the repository and create a new branch for your changes. Once you’re done, submit a pull request to merge your changes into the main branch.

Supporting SmolML

If you find SmolML useful and want to support it, you can star the project on GitHub, donate to the Ko-fi page, or share the project with your friends.

By studying and using SmolML, you can gain a deeper understanding of core machine learning principles and build a solid foundation for further work in the field.