[
  
    {
      "title"       : "Who owns the copyright for an AI generated creative work?",
      "category"    : "opinion",
      "tags"        : "copyright, creativity, neural networks, machine learning, artificial intelligence",
      "url"         : "./AI-and-intellectual-property.html",
      "date"        : "2021-04-20 00:00:00 +0000",
      "description" : "As neural networks are used more and more in the creative process, text, images and even music are now created by AI, but who owns the copyright for those works?",
      "content"     : "Recently I was reading an article about a cool project that intends to have a neural network create songs of the late club of the 27 (artists that have tragically died at age 27 or near, and in the height of their respective careers), artists such as Amy Winehouse, Jimmy Hendrix, Curt Cobain and Jim Morrison.The project was created by Over the Bridge, an organization dedicated to increase awareness on mental health and substance abuse in the music industry, trying to denormalize and remove the glamour around such illnesses within the music community.They are using Google’s Magenta, which is a neural network that precisely was conceived to explore the role of machine learning within the creative process. Magenta has been used to create a brand new “Beatles” song or even there was a band that used it to write a full album in 2019.So, while reading the article, my immediate thought was: who owns the copyright of these new songs?Think about it, imagine one of this new songs becomes a massive hit with millions of youtube views and spotify streams, who can claim the royalties generated?At first it seems quite simple, Over the Bridge should be the ones reaping the benefits, since they are the ones who had the idea, gathered the data and then fed the neural network to get the “work of art”. But in a second thought, didn’t the original artists provide the basis for the work the neural network generated? shouldn’t their state get credit? what about Google whose tool was used, should they get credit too?Neural networks have been also used to create poetry, paintings and to write news articles, but how do they do it? A computer program developed for machine learning purposes is an algorithm that “learns” from data to make future decisions. When applied to art, music and literary works, machine learning algorithms are actually learning from some input data to generate a new piece of work, making independent decisions throughout the process to determine what the new work looks like. An important feature of this is that while programmers can set the parameters, the work is actually generated by the neural network itself, in a process akin to the thought processes of humans.Now, creative works qualify for copyright protection if they are original, with most definitions of originality requiring a human author. Most jurisdictions, including Spain and Germany, specifically state that only works created by a human can be protected by copyright. In the United States, for example, the Copyright Office has declared that it will “register an original work of authorship, provided that the work was created by a human being.”So as we currently stand, a human author is required to grant a copyright, which makes sense, there is no point of having a neural network be the beneficiary of royalties of a creative work (no bank would open an account for them anyways, lol).I think amendments have to be made to the law to ensure that the person who undertook all the arrangements necessary for the work to be created by the neural network gets the credit but also we need to modify copyright law to ensure the original authors of the body of work used as data input to produce the new piece get their corresponding share of credit. This will get messy if someone uses for example the #1 song of every month in a decade to create the decade song, then there would be as many as 120 different artists to credit.In a computer generated artistic work, both the person who undertook all the arrangements necessary for its creation as well as the original authors of the data input need to be credited.There will still be some ambiguity as to who undertook the arrangements necessary, only the one who gathered the data and pressed the button to let the network learn, or does the person who created the neural network’s model also get credit? Shall we go all the way and say that even the programmer of the neural network gets some credit as well?There are some countries, in particular the UK where some progress has been made to amend copyright laws to cater for computer generated works of art, but I believe this is one of those fields where technology will surpass our law making capacity and we will live under a grey area for a while, and maybe this is just what we need, by having these works ending up free for use by anyone in the world, perhaps a new model for remunerating creative work can be established, one that does not require commercial success to be necessary for artists to make a living, and thus they can become free to explore their art.Perhaps a new model for remunerating creative work can be established, one that does not require commercial success to be necessary for artists to make a living.The Next Rembrandt is a computer-generated 3-D–printed painting developed by a facial-recognition algorithm that scanned data from 346 known paintings by the Dutch painter in a process lasting 18 months. The portrait is based on 168,263 fragments from Rembrandt’s works."
    } ,
  
    {
      "title"       : "So, what is a neural network?",
      "category"    : "theory",
      "tags"        : "neural networks, machine learning, artificial intelligence",
      "url"         : "./back-to-basics.html",
      "date"        : "2021-04-02 00:00:00 +0000",
      "description" : "ELI5: what is a neural network.",
      "content"     : "The omnipresence of technology nowadays has made it commonplace to read news about AI, just a quick glance at today’s headlines, and I get: This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. How A.I.-powered companies dodged the worst damage from COVID AI technology detects ‘ticking time bomb’ arteries AI in Drug Discovery Starts to Live Up to the Hype Pentagon seeks commercial solutions to get its data ready for AITopics from business, manufacturing, supply chain, medicine and biotech and even defense are covered in those news headlines, definitively the advancements on the fields of artificial intelligence, in particular machine learning and deep neural networks have permeated into our daily lives and are here to stay. But, do the general population know what are we talking about when we say “an AI”? I assume most people correctly imagine a computer algorithm or perhaps the more adventurous minds think of a physical machine, an advanced computer entity or even a robot, getting smarter by itself with every use-case we throw at it. And most people will be right, when “an AI” is mentioned it is indeed an algorithm run by a computer, and there is where the boundary of their knowledge lies.They say that the best way to learn something is to try to explain it, so in a personal exercise I will try to do an ELI5 (Explain it Like I am 5) version of what is a neural network.Let’s start with a little history, humans have been tinkering with the idea of an intelligent machine for a while now, some even say that the idea of artificial intelligence was conceived by the ancient greeks (source), and several attempts at devising “intelligent” machines have been made through history, a notable one was ‘The Analytical Engine’ created by Charles Babbage in 1837:The Analytical Engine of Charles Babbage - 1837Then, in the middle of last century by trying to create a model of how our brain works, Neural Networks were born. Around that time, Frank Rosenblatt at Cornell trying to understand the simple decision system present in the eye of a common housefly, proposed the idea of a perceptron, a very simple system that processes certain inputs with basic math operations and produces an output.To illustrate, let’s say that the brain of the housefly is a perceptron, its inputs are whatever values are produced by the multiple cells in its eyes, when the eye cell detects “something” it’s output will be a 1, and if there is nothing a 0. Then the combination of all those inputs can be processed by the perceptron (the fly brain), and the output is a simple 0 or 1 value. If it is a 1 then the brain is telling the fly to flee and if it is a 0 it means it is safe to stay where it is.We can imagine then that if many of the eye cells of the fly produce 1s, it means that an object is quite near, and therefore the perceptron will calculate a 1, it is time to flee.The perceptron is just a math operation, one that multiplies certain input values with preset “parameters” (called weights) and adds up the resulting multiplications to generate a value.Then the magic spark was ignited, the parameters (weights) of the perceptron could be “learnt” by a process of minimizing the difference between known results of particular observations, and what the perceptron is actually calculating. It is this process of learning what we call training the neural network.This idea is so powerful that even today it is one of the fundamental building blocks of what we call AI.From this I will try to explain how this simple concept can have such diverse applications as natural language processing (think Alexa), image recognition like medical diagnosis from a CTR scan, autonomous vehicles, etc.A basic neural network is a combination of perceptrons in different arrangements, the perceptron therefore was downgraded from “fly brain” to “network neuron”.A neural network has different components, in its basic form it has: Input Hidden layers OutputInputThe inputs of a neural network are in their essence just numbers, therefore anything that can be converted to a number can become an input. Letters in a text, pixels in an image, frequencies in a sound wave, values from a sensor, etc. are all different things that when converted to a numerical value serve as inputs for the neural network. This is one of the reasons why applications of neural networks are so diverse.Inputs can be as many as one need for the task at hand, from maybe 9 inputs to teach a neural network how to play tic-tac-toe to thousands of pixels from a camera for an autonomous vehicle. Since the input of a perceptron needs to be a single value, if for example a color pixel is chosen as input, it most likely will be broken into three different values; its red, green and blue components, hence each pixel will become 3 different inputs for the neural network.Hidden layersA “layer” within a neural network is just a group of perceptrons that all perform the same exact mathematical operation to the inputs and produce an output. The catch is that each of them have different weights (parameters), therefore their output for a given input will be different amongst them. There are many types of layers, the most typical of them being a “dense” layer, which is another word to say that all the inputs are connected to all the neurons (individual perceptrons), and as said before, each of these connections have a weight associated with it, so that the operation that each neuron performs is a simple weighted sum of all the inputs.The hidden layer is then typically connected to another dense layer, and their connection means that each output of a neuron from the first layer is treated effectively as an input for the subsequent one, and it is thus connected to every neuron.A neural network can have from one to as many layers as one can think, and the number of layers depends solely on the experience we have gathered on the particular problem we would like to solve.Another critical parameter of a hidden layer is the number of neurons it has, and again, we need to rely on experience to determine how many neurons are needed for a given problem. I have seen networks that vary from a couple of neurons to the thousands. And of course each hidden layer can have as many neurons as we please, so the number of combinations is vast.To the number of layers, their type and how many neurons each have, is what we call the network topology (including the number of inputs and outputs).OutputAt the very end of the chain, another layer lies (which behaves just like a hidden layer), but has the peculiarity that it is the final layer, and therefore whatever it calculates will be the output values of the whole network. The number of outputs the network has is a function of the problem we would like to solve. It could be as simple as one output, with its value representing a probability of an action (like in the case of the flee reaction of the housefly), to many outputs, perhaps if our network is trying to distinguish images of animals, one would have an output for each animal species, and the output would represent how much confidence the network has that the particular image belongs to the corresponding species.As we said, the neural network is just a collection of individual neurons, doing basic math operations on certain inputs in series of layers that eventually generate an output. This mesh of neurons is then “trained” on certain output values from known cases of the inputs; once it has learned it can then process new inputs, values that it has never seen before with surprisingly accurate results.Many of the problems neural networks solve, could be certainly worked out by other algorithms, however, since neural networks are in their core very basic operations, once trained, they are extremely efficient, hence much quicker and economical to produce results.There are a few more details on how a simple neural network operate that I purposedly left out to make this explanation as simple as possible. Thinks like biases, the activation functions and the math behind learning, the backpropagation algorithm, I will leave to a more in depth article. I will also write (perhaps in a series) about the more complex topologies combining different types of layers and other building blocks, a part from the perceptron.Things like “Alexa”, are a bit more complex, but work on exactly the same principles. Let’s break down for example the case of asking “Alexa” to play a song in spotify. Alexa uses several different neural networks to acomplish this:1. Speech recognitionAs a basic input we have our speech: the command “Alexa, play Van Halen”. This might seem quite simple for us humans to process, but for a machine is an incredible difficult feat to be able to understand speech, things like each individual voice timbre, entonation, intention and many more nuances of human spoken language make it so that traditional algorithms have struggled a lot with this. In our simplified example let’s say that we use a neural network to transform our spoken speech into text characters a computer is much more familiarized to learn.2. Understanding what we mean (Natural Language Understanding)Once the previous network managed to succesfuly convert our spoken words into text, there comes the even more difficult task of making sense of what we said. Things that we humans take for granted such as context, intonation and non verbal communication, help give our words meaning in a very subtle, but powerful way, a machine will have to do with much less information to correctly understand what we mean. It has to correctly identify the intention of our sentence and the subject or entities of what we mean.The neural network has to identify that it received a command (by identifying its name), the command (“play music”), and our choice (“Van Halen”). And it does so by means of simple math operations as described before. Of course the network involved is quite complex and has different types of neurons and connection types, but the underlying principles remain.3. Replying to usOnce Alexa understood what we meant, it then proceeds to execute the action of the command it interpreted and it replies to us in turn using natural language. This is accomplished using a technique called speech synthesis, things like pitch, duration and intensity of the words and phonems are selected based on the “meaning” of what Alexa will respond to us: “Playing songs by Van Halen on Spotify” sounding quite naturally. And all is accomplished with neural networks executing many simple math operations.Although it seems quite complex, the process for AI to understand us can be boiled down to simple math operationsOf course Amazon’s Alexa neural networks have undergone quite a lot of training to get to the level where they are, the beauty is that once trained, to perform their magic they just need a few mathematical operations.As said before, I will continue to write about the basics of neural networks, the next article in the series will dive a bit deeper into the math behind a basic neural network."
    } ,
  
    {
      "title"       : "Starting the adventure",
      "category"    : "",
      "tags"        : "general blogging, thoughts, life",
      "url"         : "./starting-the-adventure.html",
      "date"        : "2021-03-24 00:00:00 +0000",
      "description" : "Midlife career change: a disaster or an opportunity?",
      "content"     : "In the midst of a global pandemic caused by the SARS-COV2 coronavirus; I decided to start blogging. I wanted to blog since a long time, I have always enjoyed writing, but many unknowns and having “no time” for it prevented me from taking it up. Things like: “I don’t really know who my target audience is”, “what would my topic or topics be?”, “I don’t think I am a world-class expert in anything”, and many more kept stopping me from setting up my own blog. Now seemed like a good time as any so with those and tons of other questions in my mind I decided it was time to start.Funnily, this is not my first post. The birth of the blog came very natural as a way to “document” my newly established pursuit for getting myself into Machine Learning. This new adventure of mine comprises several things, and if I want to succeed I need to be serious about them all: I want to start coding again! I used to code a long time ago, starting when I was 8 years old in a Tandy Color Computer hooked up to my parent’s TV. Machine Learning is a vast, wide subject, I want to learn the generals, but also to select a few areas to focus on. Setting up a blog to document my journey and share it: Establish a learning and blogging routine. If I don’t do this, I am sure this endeavour will die off soon.As for the focus areas I will start with: Neural Networks fundamentals: history, basic architecture and math behind them Deep Neural Networks Reinforcement Learning Current state of the art: what is at the cutting edge now in terms of Deep Neural Networks and Reinforcement Learning?I selected the above areas to focus on based on my personal interests, I have been fascinated by the developments in reinforcement learning for a long time, in particular Deep Mind’s awesome Go, Chess and Starcraft playing agents. Therefore, I started reading a lot about it and even started a personal project for coding a tic-tac-toe learning agent.With my limited knowledge I have drafted the following learning path: Youtube: Three Blue One Brown’s videos on Neural Networks, Calculus and Linear Algebra. I cannot recommend them enough, they are of sufficient depth and use animation superbly to facilitate the understanding of the subjects. Coursera: Andrew Ng’s Machine Learning course Book: Deep Learning with Python by Francois Chollet Book: Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. BartoAs for practical work I decided to start by coding my first models from scratch (without using libraries such as Tensorflow), to be able to deeply understand the math and logic behind the models, so far it has proven to be priceless.For my next project I think I will start to do the basic hand-written digits recognition, which is the Machine Learning Hello World, for this I think I will start to use Tensorflow already.I will continue to write about my learning road, what I find interesting and relevant, and to document all my practical exercises, as well as news and the state of the art in the world of AI.So far, all I have learned has been so engaging that I am seriously thinking of a career change. I have 17 years of international experience in multinational corporations across various functions, such as Information Services, Sales, Customer Care and New Products Introduction, and sincerely, I am finding more joy in artificial intelligence than anything else I have worked on before. Let’s see where the winds take us.Thanks for reading!P.S. For the geeks like me, here is a snippet on the technical side of the blog.Static Website GeneratorI researched a lot on this, when I started I didn’t even know I needed a static website generator. I was just sure of one thing, I wanted my blog site to look modern, be easy to update and not to have anything extra or additional content or functionality I did not need.There is a myriad of website generators nowadays, after a lengthy search the ones I ended up considering are: wordpress wix squarespace ghost webflow netlify hugo gatsby jekyllI started with the web interfaced generators with included hosting in their offerings:wordpress is the old standard, it is the one CMS I knew from before, and I thought I needed a fully fledged CMS, so I blindly ran towards it. Turns out, it has grown a lot since I remembered, it is now a fully fledged platform for complex websites and ecommerce development, even so I decided to give it a try, I picked a template and created a site. Even with the most simplistic and basic template I could find, there is a lot going on in the site. Setting it up was not as difficult or cumbersome as others claim, it took me about one hour to have it up and running, it looks good, but a bit crowded for my personal taste, and I found out it serves ads in your site for the readers, that is a big no for me.I have tried wix and squarespace before, they are fantastic for quick and easy website generation, but their free offering has ads, so again, a big no for me.I discovered ghost as the platform used by one of the bloggers I follow (Sebastian Ruder), turns out is a fantastic evolution over wordpress. It runs on the latest technologies, its interface is quite modern, and it is focused on one thing only: publishing. They have a paid hosting service, but the software is open sourced, therefore free to use in any hosting.I also tested webflow and even created a mockup there, the learning curve was quite smooth, and its CMS seems quite robust, but a bit too much for the functionalities I required.Next were the generators that don’t have a web interface, but can be easily set up:The first I tried was netlify, I also set up a test site in it. Netlify provides free hosting, and to keep your source files it uses GitHub (a repository keeps the source files where it publishes from). It has its own CMS, Netlify CMS, and you have a choice of site generators: Hugo, Gatsby, MiddleMan, Preact CLI, Next.js, Elevently and Nuxt.js, and once you choose there are some templates for each. I did not find the variety of templates enticing enough, and the set up process was much more cumbersome than with wordpress (at least for my knowledge level). I choose Hugo for my test site.I also tested gatsby with it’s own Gatsby Cloud hosting service, here is my test site. They also use GitHub as a base to host the source files to build the website, so you create a repository, and it is connected to it. I found the free template offerings quite limited for what I was looking for.Finally it came the turn for jekyll, although an older, and slower generator (compared to Hugo and Gatsby), it was created by one of the founders of GitHub, so it’s integration with GitHub Pages is quite natural and painless, so much so, that to use them together you don’t even have to install Jekyll in your machine! You have two choices: keep it all online, by having one repository in Github keep all the source files, modify or add them online, and having Jekyll build and publish your site to the special gh-pages repository everytime you change or add a new file to the source repository. Have a synchronized local copy of the source files for the website, this way you can edit your blog and customize it in your choice of IDE (Integrated Development Environment). Then, when you update any file on your computer, you just “push” the changes to GitHub, and GitHub Pages automatically uses Jekyll to build and publish your site.I chose the second option, specially because I can manipulate files, like images, in my laptop, and everytime I sync my local repository with GitHub, they are updated and published automatically. Quite convenient.After testing with several templates to get the feel for it, I decided to keep Jekyll for my blog for several reasons: the convenience of not having to install anything extra on my computer to build my blog, the integration with GitHub Pages, the ease of use, the future proofing via integration with modern technologies such as react or vue and the vast online community that has produced tons of templates and useful information for issue resolution, customization and added functionality.I picked up a template, just forked the repository and started modifying the files to customize it, it was fast and easy, I even took it upon myself to add some functionality to the template (it served as a coding little project) like: SEO meta tags Dark mode (configurable in _config.yml file) automatic sitemap.xml automatic archive page with infinite scrolling capability new page of posts filtered by a single tag (without needing autopages from paginator V2), also with infinite scrolling click to tweet functionality (just add a &lt;tweet&gt; &lt;/tweet&gt; tag in your markdown. custom and responsive 404 page responsive and automatic Table of Contents (optional per post) read time per post automatically calculated responsive post tags and social share icons (sticky or inline) included linkedin, reddit and bandcamp icons copy link to clipboard sharing option (and icon) view on github link button (optional per post) MathJax support (optional per post) tag cloud in the home page ‘back to top’ button comments ‘courtain’ to mask the disqus interface until the user clicks on it (configurable in _config.yml) CSS variables to make it easy to customize all colors and fonts added several pygments themes for code syntax highlight configurable from the _config.yml file. See the highlighter directory for reference on the options. responsive footer menu and footer logo (if setup in the config file) smoother menu animationsAs a summary, Hugo and Gatsby might be much faster than Jekyll to build the sites, but their complexity I think makes them useful for a big site with plenty of posts. For a small site like mine, Jekyll provides sufficient functionality and power without the hassle.You can use the modified template yourself by forking my repository. Let me know in the comments or feel free to contact me if you are interested in a detailed walkthrough on how to set it all up.HostingSince I decided on Jekyll to generate my site, the choice for hosting was quite obvious, Github Pages is very nicely integrated with it, it is free, and it has no ads! Plus the domain name isn’t too terrible (the-mvm.github.io).Interplanetary File SystemTo contribute to and test IPFS I also set up a mirror in IPFS by using fleek.co. I must confess that it was more troublesome than I imagined, it was definetively not plug and play because of the paths used to fetch resources. The nature of IPFS makes short absolute paths for website resources (like images, css and javascript files) inoperative; the easiest fix for this is to use relative paths, however the same relative path that works for the root directory (i.e. /index.html) does not work for links inside directories (i.e. /tags/), and since the site is static, while generating it, one must make the distinction between the different directory levels for the page to be rendered correctly.At first I tried a simple (but brute force solution):# determine the level of the current file{% assign lvl = page.url | append:'X' | split:'/' | size %}# create the relative base (i.e. \"../\"){% capture relativebase %}{% for i in (3..lvl) %}../{% endfor %}{% endcapture %}{% if relativebase == '' %} {% assign relativebase = './' %}{% endif %}...# Eliminate unecesary double backslashes{% capture post_url %}{{ relativebase }}{{ post.url }}{% endcapture %}{% assign post_url = post_url | replace: \"//\", \"/\" %}This jekyll/liquid code was executed in every page (or include) that needed to reference a resource hosted in the same server.But this fix did not work for the search function, because it relies on a search.json file (also generated programmatically to be served as a static file), therefore when generating this file one either use the relative path for the root directory or for a nested directory, thus the search results will only link correctly the corresponding pages if the page where the user searched for something is in the corresponding scope.So the final solution was to make the whole site flat, meaning to live in a single directory. All pages and posts will live under the root directory, and by doing so, I can control how to address the relative paths for resources."
    } ,
  
    {
      "title"       : "Deep Q Learning for Tic Tac Toe",
      "category"    : "",
      "tags"        : "machine learning, artificial intelligence, reinforcement learning, coding, python",
      "url"         : "./deep-q-learning-tic-tac-toe.html",
      "date"        : "2021-03-18 21:14:20 +0000",
      "description" : "Inspired by Deep Mind's astonishing feats of having their Alpha Go, Alpha Zero and Alpha Star programs learn (and be amazing at it) Go, Chess, Atari games and lately Starcraft; I set myself to the task of programming a neural network that will learn by itself how to play the ancient game of tic tac toe. How hard could it be?",
      "content"     : "BackgroundAfter many years of a corporate career (17) diverging from computer science, I have now decided to learn Machine Learning and in the process return to coding (something I have always loved!).To fully grasp the essence of ML I decided to start by coding a ML library myself, so I can fully understand the inner workings, linear algebra and calculus involved in Stochastic Gradient Descent. And on top learn Python (I used to code in C++ 20 years ago).I built a general purpose basic ML library that creates a Neural Network (only DENSE layers), saves and loads the weights into a file, does forward propagation and training (optimization of weights and biases) using SGD. I tested the ML library with the XOR problem to make sure it worked fine. You can read the blog post for it here.For the next challenge I am interested in reinforcement learning greatly inspired by Deep Mind’s astonishing feats of having their Alpha Go, Alpha Zero and Alpha Star programs learn (and be amazing at it) Go, Chess, Atari games and lately Starcraft; I set myself to the task of programming a neural network that will learn by itself how to play the ancient game of tic tac toe (or noughts and crosses).How hard could it be?Of course the first thing to do was to program the game itself, so I chose Python because I am learning it, so it gives me a good practice opportunity, and PyGame for the interface.Coding the game was quite straightforward, albeit for the hiccups of being my first PyGame and almost my first Python program ever.I created the game quite openly, in such a way that it can be played by two humans, by a human vs. an algorithmic AI, and a human vs. the neural network. And of course the neural network against a choice of 3 AI engines: random, minimax or hardcoded (an exercise I wanted to do since a long time).While training, the visuals of the game can be disabled to make training much faster.Now, for the fun part, training the network, I followed Deep Mind’s own DQN recommendations:The network will be an approximation for the Q value function or Bellman equation, meaning that the network will be trained to predict the \"value\" of each move available in a given game state.A replay experience memory was implemented. This meant that the neural network will not be trained after each move. Each move will be recorded in a special \"memory\" alongside with the state of the board and the reward it received for taking such an action (move).After the memory is sizable enough, batches of random experiences sampled from the replay memory are used for every training roundA secondary neural network (identical to the main one) is used to calculate part of the Q value function (Bellman equation), in particular the future Q values. And then it is updated with the main network's weights every n games. This is done so that we are not chasing a moving target.Designing the neural networkThe Neural Network chosen takes 9 inputs (the current state of the game) and outputs 9 Q values for each of the 9 squares in the board of the game (possible actions). Obviously some squares are illegal moves, hence while training there was a negative reward given to illegal moves hoping that the model would learn not to play illegal moves in a given position.I started out with two hidden layers of 36 neurons each, all fully connected and activated via ReLu. The output layer was initially activated using sigmoid to ensure that we get a nice value between 0 and 1 that represents the QValue of a given state action pair.The many models…Model 1 - the first tryAt first the model was trained by playing vs. a “perfect” AI, meaning a hard coded algorithm that never looses and that will win if it is given the chance. After several thousand training rounds, I noticed that the Neural Network was not learning much; so I switched to training vs. a completely random player, so that it will also learn how to win. After training vs. the random player, the Neural Network seems to have made progress and is steadily diminishing the loss function over time.However, the model was still generating many illegal moves, so I decided to modify the reinforcement learning algorithm to punish more the illegal moves. The change consisted in populating with zeros all the corresponding illegal moves for a given position at the target values to train the network. This seemed to work very well for diminishing the illegal moves:Nevertheless, the model was still performing quite poorly winning only around 50% of games vs. a completely random player (I expected it to win above 90% of the time). This was after only training 100,000 games, so I decided to keep training and see the results:Wins: 65.46% Losses: 30.32% Ties: 4.23%Note that when training restarts, the loss and illegal moves are still high in the beginning of the training round, and this is caused by the epsilon greedy strategy that prefers exploration (a completely random move) over exploitation, this preference diminishes over time.After another round of 100,000 games, I can see that the loss function actually started to diminish, and the win rate ended up at 65%, so with little hope I decided to carry on and do another round of 100,000 games (about 2 hours in an i7 MacBook Pro):Wins: 46.40% Losses: 41.33% Ties: 12.27%As you can see in the chart, the calculated loss not even plateaued, but it seemed to increase a bit over time, which tells me the model is not learning anymore. This was confirmed by the win rate decreasing with respect of the previous round to a meek 46.4% that looks no better than a random player.Model 2 - Linear activation for the outputAfter not getting the results I wanted, I decided to change the output activation function to linear, since the output is supposed to be a Q value, and not a probability of an action.Wins: 47.60% Losses: 39% Ties: 13.4%Initially I tested with only 1000 games to see if the new activation function was working, the loss function appears to be decreasing, however it reached a plateau around a value of 1, hence still not learning as expected. I came across a technique by Brad Kenstler, Carl Thome and Jeremy Jordan called Cyclical Learning Rate, which appears to solve some cases of stagnating loss functions in this type of networks. So I gave it a go using their Triangle 1 model.With the cycling learning rate in place, still no luck after a quick 1,000 games training round; so I decided to implement on top a decaying learning rate as per the following formula:The resulting learning rate combining the cycles and decay per epoch is:Learning Rate = 0.1, Decay = 0.0001, Cycle = 2048 epochs, max Learning Rate factor = 10xtrue_epoch = epoch - c.BATCH_SIZElearning_rate = self.learning_rate*(1/(1+c.DECAY_RATE*true_epoch))if c.CLR_ON: learning_rate = self.cyclic_learning_rate(learning_rate,true_epoch)@staticmethoddef cyclic_learning_rate(learning_rate, epoch): max_lr = learning_rate*c.MAX_LR_FACTOR cycle = np.floor(1+(epoch/(2*c.LR_STEP_SIZE))) x = np.abs((epoch/c.LR_STEP_SIZE)-(2*cycle)+1) return learning_rate+(max_lr-learning_rate)*np.maximum(0,(1-x))c.DECAY_RATE = learning rate decay ratec.MAX_LR_FACTOR = multiplier that determines the max learning ratec.LR_STEP_SIZE = the number of epochs each cycle lastsWith these many changes, I decided to restart with a fresh set of random weights and biases and try training more (much more) games.1,000,000 episodes, 7.5 million epochs with batches of 64 moves eachWins: 52.66% Losses: 36.02% Ties: 11.32%After 24 hours!, my computer was able to run 1,000,000 episodes (games played), which represented 7.5 million training epochs of batches of 64 plays (480 million plays learned), the learning rate did decreased (a bit), but is clearly still in a plateau; interestingly, the lower boundary of the loss function plot seems to continue to decrease as the upper bound and the moving average remains constant. This led me to believe that I might have hit a local minimum.Model 3 - new network topologyAfter all the failures I figured I had to rethink the topology of the network and play around with combinations of different networks and learning rates.100,000 episodes, 635,000 epochs with batches of 64 moves eachWins: 76.83% Losses: 17.35% Ties: 5.82%I increased to 200 neurons each hidden layer. In spite of this great improvement the loss function was still in a plateau at around 0.1 (Mean Squared Error). Which, although it is greatly reduced from what we had, still was giving out only 77% win rate vs. a random player, the network was playing tic tac toe as a toddler!*I can still beat the network most of the time! (I am playing with the red X)*100,000 more episodes, 620,000 epochs with batches of 64 moves eachWins: 82.25% Losses: 13.28% Ties: 4.46%Finally we crossed the 80% mark! This is quite an achievement, it seems that the change in network topology is working, although it also looks like the loss function is stagnating at around 0.15.After more training rounds and some experimenting with the learning rate and other parameters, I couldn’t improve past the 82.25% win rate.These have been the results so far:It is quite interesting to learn how the many parameters (hyper-parameters as most authors call them) of a neural network model affect its training performance, I have played with: the learning rate the network topology and activation functions the cycling and decaying learning rate parameters the batch size the target update cycle (when the target network is updated with the weights from the policy network) the rewards policy the epsilon greedy strategy whether to train vs. a random player or an “intelligent” AI.And so far the most effective change has been the network topology, but being so close but not quite there yet to my goal of 90% win rate vs. a random player, I will still try to optimize further.Network topology seems to have the biggest impact on a neural network's learning ability.Model 4 - implementing momentumI reached out to the reddit community and a kind soul pointed out that maybe what I need is to apply momentum to the optimization algorithm. So I did some research and ended up deciding to implement various optimization methods to experiment with: Stochastic Gradient Descent with Momentum RMSProp: Root Mean Square Plain Momentum NAG: Nezterov’s Accelerated Momentum Adam: Adaptive Moment Estimation and keep my old vanilla Gradient Descent (vGD) ☺Click here for a detailed explanation and code of all the implemented optimization algorithms.So far, I have not been able to get better results with Model 4, I have tried all the momentum optimization algorithms with little to no success.Model 5 - implementing one-hot encoding and changing topology (again)I came across an interesting project in Github that deals exactly with Deep Q Learning, and I noticed that he used “one-hot” encoding for the input as opposed to directly entering the values of the player into the 9 input slots. So I decided to give it a try and at the same time change my topology to match his:So, ‘one hot’ encoding is basically changing the input of a single square in the tic tac toe board to three numbers, so that each state is represented with different inputs, thus the network can clearly differentiate the three of them. As the original author puts it, the way I was encoding, having 0 for empty, 1 for X and 2 for O, the network couldn’t easily tell that, for instance, O and X both meant occupied states, because one is two times as far from 0 as the other. With the new encoding, the empty state will be 3 inputs: (1,0,0), the X will be (0,1,0) and the O (0,0,1) as in the diagram.Still, no luck even with Model 5, so I am starting to think that there could be a bug in my code.To test this hypothesis, I decided to implement the same model using Tensorflow / Keras.Model 6 - Tensorflow / Kerasself.PolicyNetwork = Sequential()for layer in hidden_layers: self.PolicyNetwork.add(Dense( units=layer, activation='relu', input_dim=inputs, kernel_initializer='random_uniform', bias_initializer='zeros'))self.PolicyNetwork.add(Dense( outputs, kernel_initializer='random_uniform', bias_initializer='zeros'))opt = Adam(learning_rate=c.LEARNING_RATE, beta_1=c.GAMMA_OPT, beta_2=c.BETA, epsilon=c.EPSILON, amsgrad=False)self.PolicyNetwork.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])As you can see I am reusing all of my old code, and just replacing my Neural Net library with Tensorflow/Keras, keeping even my hyper-parameter constants.The training function changed to:reduce_lr_on_plateau = ReduceLROnPlateau(monitor='loss', factor=0.1, patience=25)history = self.PolicyNetwork.fit(np.asarray(states_to_train), np.asarray(targets_to_train), epochs=c.EPOCHS, batch_size=c.BATCH_SIZE, verbose=1, callbacks=[reduce_lr_on_plateau], shuffle=True)With Tensorflow implemented, the first thing I noticed, was that I had an error in the calculation of the loss, although this only affected reporting and didn’t change a thing on the training of the network, so the results kept being the same, the loss function was still stagnating! My code was not the issue.Model 7 - changing the training scheduleNext I tried to change the way the network was training as per u/elBarto015 advised me on reddit.The way I was training initially was: Games begin being simulated and the outcome recorded in the replay memory Once a sufficient ammount of experiences are recorded (at least equal to the batch size) the Network will train with a random sample of experiences from the replay memory. The ammount of experiences to sample is the batch size. The games continue to be played between the random player and the network. Every move from either player generates a new training round, again with a random sample from the replay memory. This continues until the number of games set up conclude.The first change was to train only after every game concludes with the same ammount of data (a batch). This was still not giving any good results.The second change was more drastic, it introduced the concept of epochs for every training round, it basically sampled the replay memory for epochs * batch size experiences, for instance if epochs selected were 10, and batch size was 81, then 810 experiences were sampled out of the replay memory. With this sample the network was then trained for 10 epochs randomly using the batch size.This meant that I was training now effectively 10 (or the number of epochs selected) times more per game, but in batches of the same size and randomly shuffling the experiences each epoch.After still playing around with some hyperparameters I managed to get similar performance as I got before, reaching 83.15% win rate vs. the random player, so I decided to keep training in rounds of 2,000 games each to evaluate performance. With almost every round I could see improvement:As of today, my best result so far is 87.5%, I will leave it rest for a while and keep investigating to find a reason for not being able to reach at least 90%. I read about self play, and it looks like a viable option to test and a fun coding challenge. However, before embarking in yet another big change I want to ensure I have been thorough with the model and have tested every option correctly.I feel the end is near… should I continue to update this post as new events unfold or shall I make it a multi post thread?"
    } ,
  
    {
      "title"       : "Neural Network Optimization Methods and Algorithms",
      "category"    : "",
      "tags"        : "coding, machine learning, optimization, deep Neural networks",
      "url"         : "./neural-network-optimization-methods.html",
      "date"        : "2021-03-12 19:32:20 +0000",
      "description" : "Some neural network optimization algorithms mostly to implement momentum when doing back propagation.",
      "content"     : "For the seemingly small project I undertook of creating a machine learning neural network that could learn by itself to play tic-tac-toe, I bumped into the necesity of implementing at least one momentum algorithm for the optimization of the network during backpropagation.And since my original post for the TicTacToe project is quite large already, I decided to post separately these optimization methods and how did I implement them in my code.AdamsourceAdaptive Moment Estimation (Adam) is an optimization method that computes adaptive learning rates for each weight and bias. In addition to storing an exponentially decaying average of past squared gradients \\(v_t\\) and an exponentially decaying average of past gradients \\(m_t\\), similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface. We compute the decaying averages of past and past squared gradients \\(m_t\\) and \\(v_t\\) respectively as follows:\\(\\begin{align}\\begin{split}m_t &amp;= \\beta_1 m_{t-1} + (1 - \\beta_1) g_t \\\\v_t &amp;= \\beta_2 v_{t-1} + (1 - \\beta_2) g_t^2\\end{split}\\end{align}\\)\\(m_t\\) and \\(v_t\\) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As \\(m_t\\) and \\(v_t\\) are initialized as vectors of 0's, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. \\(\\beta_1\\) and \\(\\beta_2\\) are close to 1).They counteract these biases by computing bias-corrected first and second moment estimates:\\(\\begin{align}\\begin{split}\\hat{m}_t &amp;= \\dfrac{m_t}{1 - \\beta^t_1} \\\\\\hat{v}_t &amp;= \\dfrac{v_t}{1 - \\beta^t_2} \\end{split}\\end{align}\\)We then use these to update the weights and biases which yields the Adam update rule:\\(\\theta_{t+1} = \\theta_{t} - \\dfrac{\\eta}{\\sqrt{\\hat{v}_t} + \\epsilon} \\hat{m}_t\\).The authors propose defaults of 0.9 for \\(\\beta_1\\), 0.999 for \\(\\beta_2\\), and \\(10^{-8}\\) for \\(\\epsilon\\).view on github# decaying averages of past gradientsself.v[\"dW\" + str(i)] = ((c.BETA1 * self.v[\"dW\" + str(i)]) + ((1 - c.BETA1) * np.array(self.gradients[i]) ))self.v[\"db\" + str(i)] = ((c.BETA1 * self.v[\"db\" + str(i)]) + ((1 - c.BETA1) * np.array(self.bias_gradients[i]) ))# decaying averages of past squared gradientsself.s[\"dW\" + str(i)] = ((c.BETA2 * self.s[\"dW\"+str(i)]) + ((1 - c.BETA2) * (np.square(np.array(self.gradients[i]))) ))self.s[\"db\" + str(i)] = ((c.BETA2 * self.s[\"db\" + str(i)]) + ((1 - c.BETA2) * (np.square(np.array( self.bias_gradients[i]))) ))if c.ADAM_BIAS_Correction: # bias-corrected first and second moment estimates self.v[\"dW\" + str(i)] = self.v[\"dW\" + str(i)] / (1 - (c.BETA1 ** true_epoch)) self.v[\"db\" + str(i)] = self.v[\"db\" + str(i)] / (1 - (c.BETA1 ** true_epoch)) self.s[\"dW\" + str(i)] = self.s[\"dW\" + str(i)] / (1 - (c.BETA2 ** true_epoch)) self.s[\"db\" + str(i)] = self.s[\"db\" + str(i)] / (1 - (c.BETA2 ** true_epoch))# apply to weights and biasesweight_col -= ((eta * (self.v[\"dW\" + str(i)] / (np.sqrt(self.s[\"dW\" + str(i)]) + c.EPSILON))))self.bias[i] -= ((eta * (self.v[\"db\" + str(i)] / (np.sqrt(self.s[\"db\" + str(i)]) + c.EPSILON))))SGD MomentumsourceVanilla SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum.Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction \\(\\gamma\\) of the update vector of the past time step to the current update vector:\\(\\begin{align}\\begin{split}v_t &amp;= \\beta_1 v_{t-1} + \\eta \\nabla_\\theta J( \\theta) \\\\\\theta &amp;= \\theta - v_t\\end{split}\\end{align}\\)The momentum term \\(\\beta_1\\) is usually set to 0.9 or a similar value.Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. \\(\\beta_1 &lt; 1\\)). The same thing happens to our weight and biases updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.view on githubself.v[\"dW\"+str(i)] = ((c.BETA1*self.v[\"dW\" + str(i)]) +(eta*np.array(self.gradients[i]) ))self.v[\"db\"+str(i)] = ((c.BETA1*self.v[\"db\" + str(i)]) +(eta*np.array(self.bias_gradients[i]) ))weight_col -= self.v[\"dW\" + str(i)]self.bias[i] -= self.v[\"db\" + str(i)]Nesterov accelerated gradient (NAG)sourceHowever, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We'd like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.Nesterov accelerated gradient (NAG) is a way to give our momentum term this kind of prescience. We know that we will use our momentum term \\(\\beta_1 v_{t-1}\\) to move the weights and biases \\(\\theta\\). Computing \\( \\theta - \\beta_1 v_{t-1} \\) thus gives us an approximation of the next position of the weights and biases (the gradient is missing for the full update), a rough idea where our weights and biases are going to be. We can now effectively look ahead by calculating the gradient not w.r.t. to our current weights and biases \\(\\theta\\) but w.r.t. the approximate future position of our weights and biases:\\(\\begin{align}\\begin{split}v_t &amp;= \\beta_1 v_{t-1} + \\eta \\nabla_\\theta J( \\theta - \\beta_1 v_{t-1} ) \\\\\\theta &amp;= \\theta - v_t\\end{split}\\end{align}\\)Again, we set the momentum term \\(\\beta_1\\) to a value of around 0.9. While Momentum first computes the current gradient and then takes a big jump in the direction of the updated accumulated gradient, NAG first makes a big jump in the direction of the previous accumulated gradient, measures the gradient and then makes a correction, which results in the complete NAG update. This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of Neural Networks on a number of tasks.Now that we are able to adapt our updates to the slope of our error function and speed up SGD in turn, we would also like to adapt our updates to each individual weight and bias to perform larger or smaller updates depending on their importance.view on githubv_prev = {\"dW\" + str(i): self.v[\"dW\" + str(i)], \"db\" + str(i): self.v[\"db\" + str(i)]}self.v[\"dW\" + str(i)] = (c.NAG_COEFF * self.v[\"dW\" + str(i)] - eta * np.array(self.gradients[i]))self.v[\"db\" + str(i)] = (c.NAG_COEFF * self.v[\"db\" + str(i)] - eta * np.array(self.bias_gradients[i]))weight_col += ((-1 * c.BETA1 * v_prev[\"dW\" + str(i)]) + (1 + c.BETA1) * self.v[\"dW\" + str(i)])self.bias[i] += ((-1 * c.BETA1 * v_prev[\"db\" + str(i)]) + (1 + c.BETA1) * self.v[\"db\" + str(i)])RMSpropsourceRMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera Class.RMSprop was developed stemming from the need to resolve other method's radically diminishing learning rates.\\(\\begin{align}\\begin{split}E[\\theta^2]_t &amp;= \\beta_1 E[\\theta^2]_{t-1} + (1-\\beta_1) \\theta^2_t \\\\\\theta_{t+1} &amp;= \\theta_{t} - \\dfrac{\\eta}{\\sqrt{E[\\theta^2]_t + \\epsilon}} \\theta_{t}\\end{split}\\end{align}\\)RMSprop divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests \\(\\beta_1\\) to be set to 0.9, while a good default value for the learning rate \\(\\eta\\) is 0.001.view on githubself.s[\"dW\" + str(i)] = ((c.BETA1 * self.s[\"dW\" + str(i)]) + ((1-c.BETA1) * (np.square(np.array(self.gradients[i]))) ))self.s[\"db\" + str(i)] = ((c.BETA1 * self.s[\"db\" + str(i)]) + ((1-c.BETA1) * (np.square(np.array(self.bias_gradients[i]))) ))weight_col -= (eta * (np.array(self.gradients[i]) / (np.sqrt(self.s[\"dW\"+str(i)]+c.EPSILON))) )self.bias[i] -= (eta * (np.array(self.bias_gradients[i]) / (np.sqrt(self.s[\"db\"+str(i)]+c.EPSILON))) )Complete codeAll in all the code ended up like this:view on github@staticmethoddef cyclic_learning_rate(learning_rate, epoch): max_lr = learning_rate * c.MAX_LR_FACTOR cycle = np.floor(1 + (epoch / (2 * c.LR_STEP_SIZE)) ) x = np.abs((epoch / c.LR_STEP_SIZE) - (2 * cycle) + 1) return learning_rate + (max_lr - learning_rate) * np.maximum(0, (1 - x))def apply_gradients(self, epoch): true_epoch = epoch - c.BATCH_SIZE eta = self.learning_rate * (1 / (1 + c.DECAY_RATE * true_epoch)) if c.CLR_ON: eta = self.cyclic_learning_rate(eta, true_epoch) for i, weight_col in enumerate(self.weights): if c.OPTIMIZATION == 'vanilla': weight_col -= eta * np.array(self.gradients[i]) / c.BATCH_SIZE self.bias[i] -= eta * np.array(self.bias_gradients[i]) / c.BATCH_SIZE elif c.OPTIMIZATION == 'SGD_momentum': self.v[\"dW\"+str(i)] = ((c.BETA1 *self.v[\"dW\" + str(i)]) +(eta *np.array(self.gradients[i]) )) self.v[\"db\"+str(i)] = ((c.BETA1 *self.v[\"db\" + str(i)]) +(eta *np.array(self.bias_gradients[i]) )) weight_col -= self.v[\"dW\" + str(i)] self.bias[i] -= self.v[\"db\" + str(i)] elif c.OPTIMIZATION == 'NAG': v_prev = {\"dW\" + str(i): self.v[\"dW\" + str(i)], \"db\" + str(i): self.v[\"db\" + str(i)]} self.v[\"dW\" + str(i)] = (c.NAG_COEFF * self.v[\"dW\" + str(i)] - eta * np.array(self.gradients[i])) self.v[\"db\" + str(i)] = (c.NAG_COEFF * self.v[\"db\" + str(i)] - eta * np.array(self.bias_gradients[i])) weight_col += ((-1 * c.BETA1 * v_prev[\"dW\" + str(i)]) + (1 + c.BETA1) * self.v[\"dW\" + str(i)]) self.bias[i] += ((-1 * c.BETA1 * v_prev[\"db\" + str(i)]) + (1 + c.BETA1) * self.v[\"db\" + str(i)]) elif c.OPTIMIZATION == 'RMSProp': self.s[\"dW\" + str(i)] = ((c.BETA1 *self.s[\"dW\" + str(i)]) +((1-c.BETA1) *(np.square(np.array(self.gradients[i]))) )) self.s[\"db\" + str(i)] = ((c.BETA1 *self.s[\"db\" + str(i)]) +((1-c.BETA1) *(np.square(np.array(self.bias_gradients[i]))) )) weight_col -= (eta *(np.array(self.gradients[i]) /(np.sqrt(self.s[\"dW\"+str(i)]+c.EPSILON))) ) self.bias[i] -= (eta *(np.array(self.bias_gradients[i]) /(np.sqrt(self.s[\"db\"+str(i)]+c.EPSILON))) ) if c.OPTIMIZATION == \"ADAM\": # decaying averages of past gradients self.v[\"dW\" + str(i)] = (( c.BETA1 * self.v[\"dW\" + str(i)]) + ((1 - c.BETA1) * np.array(self.gradients[i]) )) self.v[\"db\" + str(i)] = (( c.BETA1 * self.v[\"db\" + str(i)]) + ((1 - c.BETA1) * np.array(self.bias_gradients[i]) )) # decaying averages of past squared gradients self.s[\"dW\" + str(i)] = ((c.BETA2 * self.s[\"dW\"+str(i)]) + ((1 - c.BETA2) * (np.square( np.array( self.gradients[i]))) )) self.s[\"db\" + str(i)] = ((c.BETA2 * self.s[\"db\" + str(i)]) + ((1 - c.BETA2) * (np.square( np.array( self.bias_gradients[i]))) )) if c.ADAM_BIAS_Correction: # bias-corrected first and second moment estimates self.v[\"dW\" + str(i)] = self.v[\"dW\" + str(i)] / (1 - (c.BETA1 ** true_epoch)) self.v[\"db\" + str(i)] = self.v[\"db\" + str(i)] / (1 - (c.BETA1 ** true_epoch)) self.s[\"dW\" + str(i)] = self.s[\"dW\" + str(i)] / (1 - (c.BETA2 ** true_epoch)) self.s[\"db\" + str(i)] = self.s[\"db\" + str(i)] / (1 - (c.BETA2 ** true_epoch)) # apply to weights and biases weight_col -= ((eta * (self.v[\"dW\" + str(i)] / (np.sqrt(self.s[\"dW\" + str(i)]) + c.EPSILON)))) self.bias[i] -= ((eta * (self.v[\"db\" + str(i)] / (np.sqrt(self.s[\"db\" + str(i)]) + c.EPSILON)))) self.gradient_zeros()"
    } ,
  
    {
      "title"       : "Machine Learning Library in Python from scratch",
      "category"    : "",
      "tags"        : "machine learning, coding, neural networks, python",
      "url"         : "./ML-Library-from-scratch.html",
      "date"        : "2021-02-28 18:32:20 +0000",
      "description" : "Single neuron perceptron that classifies elements learning quite quickly.",
      "content"     : "It must sound crazy that in this day and age, when we have such a myriad of amazing machine learning libraries and toolkits all open sourced, all quite well documented and easy to use, I decided to create my own ML library from scratch.Let me try to explain; I am in the process of immersing myself into the world of Machine Learning, and to do so, I want to deeply understand the basic concepts and its foundations, and I think that there is no better way to do so than by creating myself all the code for a basic neural network library from scratch. This way I can gain in depth understanding of the math that underpins the ML algorithms.Another benefit of doing this is that since I am also learning Python, the experiment brings along good exercise for me.To call it a Machine Learning Library is perhaps a bit of a stretch, since I just intended to create a multi-neuron, multi-layered perceptron.The library started very narrowly, with just the following functionality: create a neural network based on the following parameters: number of inputs size and number of hidden layers number of outputs learning rate forward propagate or predict the output values when given some inputs learn through back propagation using gradient descentI restricted the model to be sequential, and the layers to be only dense / fully connected, this means that every neuron is connected to every neuron of the following layer. Also, as a restriction, the only activation function I implemented was sigmoid:With my neural network coded, I tested it with a very basic problem, the famous XOR problem.XOR is a logical operation that cannot be solved by a single perceptron because of its linearity restriction:As you can see, when plotted in an X,Y plane, the logical operators AND and OR have a line that can clearly separate the points that are false from the ones that are true, hence a perceptron can easily learn to classify them; however, for XOR there is no single straight line that can do so, therefore a multilayer perceptron is needed for the task.For the test I created a neural network with my library:import Neural_Network as nninputs = 3hidden_layers = [2, 1]outputs = 1learning_rate = 0.03NN = nn.NeuralNetwork(inputs, hidden_layers, outputs, learning_rate)The three inputs I decided to use (after a lot of trial and error) are the X and Y coordinate of a point (between X = 0, X = 1, Y = 0 and Y = 1) and as the third input the multiplication of both X and Y. Apparently it gives the network more information, and it ends up converging much more quickly with this third input.Then there is a single hidden layer with 2 neurons and one output value, that will represent False if the value is closer to 0 or True if the value is closer to 1.Then I created the learning data, which is quite trivial for this problem, since we know very easily how to compute XOR.training_data = []for n in range(learning_rounds): x = rnd.random() y = rnd.random() training_data.append([x, y, x * y, 0 if (x &lt; 0.5 and y &lt; 0.5) or (x &gt;= 0.5 and y &gt;= 0.5) else 1])And off we go into training:for data in training_data: NN.train(data[:3].reshape(inputs), data[3:].reshape(outputs))The ML library can only train on batches of 1 (another self-imposed coding restriction), therefore only one “observation” at a time, this is why the train function accepts two parameters, one is the inputs packed in an array, and the other one is the outputs, packed as well in an array.To see the neural net in action I decided to plot the predicted results in both a 3d X,Y,Z surface plot (z being the network’s predicted value), and a scatter plot with the color of the points representing the predicted value.This was plotted in MatPlotLib, so we needed to do some housekeeping first:fig = plt.figure()fig.canvas.set_window_title('Learning XOR Algorithm')fig.set_size_inches(11, 6)axs1 = fig.add_subplot(1, 2, 1, projection='3d')axs2 = fig.add_subplot(1, 2, 2)Then we need to prepare the data to be plotted by generating X and Y values distributed between 0 and 1, and having the network calculate the Z value:x = np.linspace(0, 1, num_surface_points)y = np.linspace(0, 1, num_surface_points)x, y = np.meshgrid(x, y)z = np.array(NN.forward_propagation([x, y, x * y])).reshape(num_surface_points, num_surface_points)As you can see, the z values array is reshaped as a 2d array of shape (x,y), since this is the way Matplotlib interprets it as a surface:axs1.plot_surface(x, y, z, rstride=1, cstride=1, cmap='viridis', vmin=0, vmax=1, antialiased=True)The end result looks something like this:Then we reshape the z array as a one dimensional array to use it to color the scatter plot:z = z.reshape(num_surface_points ** 2)scatter = axs2.scatter(x, y, marker='o', s=40, c=z.astype(float), cmap='viridis', vmin=0, vmax=1)To actually see the progress while learning, I created a Matplotlib animation, and it is quite interesting to see as it learns. So my baby ML library is completed for now, but still I would like to enhance it in several ways: include multiple activation functions (ReLu, linear, Tanh, etc.) allow for multiple optimizers (Adam, RMSProp, SGD Momentum, etc.) have batch and epoch training schedules functionality save and load trained model to fileI will get to it soon…"
    } ,
  
    {
      "title"       : "Conway&#39;s Game of Life",
      "category"    : "",
      "tags"        : "coding, python",
      "url"         : "./conways-game-of-life.html",
      "date"        : "2021-02-10 19:32:20 +0000",
      "description" : "Taking on the challenge of picking up coding again through interesting small projects, this time it is the turn of Conway's Game of Life.",
      "content"     : "I&nbsp;am lately trying to take on coding again. It had always been a part of my life since my early years when I&nbsp;learned to program a Tandy Color Computer at the age of 8, the good old days.Tandy Color Computer TRS80 IIIHaving already programed in Java, C# and of course BASIC, I&nbsp;thought it would be a great idea to learn Python since I&nbsp;have great interest in data science and machine learning, and those two topics seem to have an avid community within Python coders.For one of my starter quick programming tasks, I&nbsp;decided to code Conway's Game of Life, a very simple cellular automata that basically plays itself.The game consists of a grid of n size, and within each block of the grid a cell could either be dead or alive according to these rules:If a cell has less than 2 neighbors, meaning contiguous alive cells, the cell will die of lonelinessIf a cell has more than 3 neighbors, it will die of overpopulationIf an empty block has exactly 3 contiguous alive neighbors, a new cell will be born in that spotIf an alive cell has 2 or 3 alive neighbors, it continues to liveConway’s rules for the Game of LifeTo make it more of a challenge I&nbsp;also decided to implement an \"sparse\" method of recording the game board, this means that instead of the typical 2d array representing the whole board, I&nbsp;will only record the cells which are alive. Saving a lot of memory space and processing time, while adding some spice to the challenge.The trickiest part was figuring out how to calculate which empty blocks had exactly 3 alive neighbors so that a new cell will spring to life there, this is trivial in the case of recording the whole grid, because we just iterate all over the board and find the alive neighbors of ALL&nbsp;the blocks in the grid, but in the case of only keeping the alive cells proved quite a challenge.In the end the algorithm ended up as follows:Iterate through all the alive cells and get all of their neighborsdef get_neighbors(self, cell): neighbors = [] for x in range(-1, 2, 1): for y in range(-1, 2, 1): if not (x == 0 and y == 0): if (0 &amp;lt;= (cell[0] + x) &amp;lt;= self.size_x) and (0 &amp;lt;= (cell[1] + y) &amp;lt;= self.size_y): neighbors.append((cell[0] + x, cell[1] + y)) return neighborsMark all the neighboring blocks as having +1 neighbor each time a particular cell is encountered. This way, for each neighboring alive cell the counter of the particular block will increase, and in the end it will contain the total number of live cells which are contiguous to it.def next_state(self): alive_neighbors = {} for cell in self.alive_cells: if cell not in alive_neighbors: alive_neighbors[cell] = 0 neighbors = self.get_neighbors(cell) for neighbor in neighbors: if neighbor not in alive_neighbors: alive_neighbors[neighbor] = 1 else: alive_neighbors[neighbor] += 1The trick was using a dictionary to keep the record of the blocks that have alive neighbors and the cells who are alive in the current state but have zero alive neighbors (thus will die).With the dictionary it became easy just to add cells and increase their neighbor counter each time it was encountered as a neighbor of an alive cell.Having the dictionary now filled with all the cells that have alive neighbors and how many they have, it was just a matter of applying the rules of the game:for cell in alive_neighbors: if alive_neighbors[cell] &amp;lt; 2 or alive_neighbors[cell] &gt; 3: self.alive_cells.discard(cell) elif alive_neighbors[cell] == 3: self.alive_cells.add(cell)Notice that since I am keeping an array of the coordinates of only the cells who are alive, I could apply just 3 rules, die of loneliness, die of overpopulation and become alive from reproduction (exactly 3 alive neighbors) because the ones who have 2 or 3 neighbors and are already alive, can remain alive in the next iteration.I&nbsp;found it very interesting to implement the Game of Life like this, it was quite a refreshing challenge and I am beginning to feel my coding skills ramping up again."
    } ,
  
    {
      "title"       : "Single Neuron Perceptron",
      "category"    : "",
      "tags"        : "machine learning, coding, neural networks",
      "url"         : "./single-neuron-perceptron.html",
      "date"        : "2021-01-25 19:32:20 +0000",
      "description" : "Single neuron perceptron that classifies elements learning quite quickly.",
      "content"     : "As an entry point to learning python and getting into Machine Learning, I decided to code from scratch the Hello World! of the field, a single neuron perceptron.What is a perceptron?A perceptron is the basic building block of a neural network, it can be compared to a neuron, And its conception is what detonated the vast field of Artificial Intelligence nowadays.Back in the late 1950’s, a young Frank Rosenblatt devised a very simple algorithm as a foundation to construct a machine that could learn to perform different tasks.In its essence, a perceptron is nothing more than a collection of values and rules for passing information through them, but in its simplicity lies its power.Imagine you have a ‘neuron’ and to ‘activate’ it, you pass through several input signals, each signal connects to the neuron through a synapse, once the signal is aggregated in the perceptron, it is then passed on to one or as many outputs as defined. A perceptron is but a neuron and its collection of synapses to get a signal into it and to modify a signal to pass on.In more mathematical terms, a perceptron is an array of values (let’s call them weights), and the rules to apply such values to an input signal.For instance a perceptron could get 3 different inputs as in the image, lets pretend that the inputs it receives as signal are: $x_1 = 1, \\; x_2 = 2\\; and \\; x_3 = 3$, if it’s weights are $w_1 = 0.5,\\; w_2 = 1\\; and \\; w_3 = -1$ respectively, then what the perceptron will do when the signal is received is to multiply each input value by its corresponding weight, then add them up.\\(\\begin{align}\\begin{split}\\left(x_1 * w_1\\right) + \\left(x_2 * w_2\\right) + \\left(x_3 * w_3\\right)\\end{split}\\end{align}\\)\\(\\begin{align}\\begin{split}\\left(0.5 * 1\\right) + \\left(1 * 2\\right) + \\left(-1 * 3\\right) = 0.5 + 2 - 3 = -0.5\\end{split}\\end{align}\\)Typically when this value is obtained, we need to apply an “activation” function to smooth the output, but let’s say that our activation function is linear, meaning that we keep the value as it is, then that’s it, that is the output of the perceptron, -0.5.In a practical application, the output means something, perhaps we want our perceptron to classify a set of data and if the perceptron outputs a negative number, then we know the data is of type A, and if it is a positive number then it is of type B.Once we understand this, the magic starts to happen through a process called backpropagation, where we “educate” our tiny one neuron brain to have it learn how to do its job.The magic starts to happen through a process called backpropagation, where we \"educate\" our tiny one neuron brain to have it learn how to do its job.For this we need a set of data that it is already classified, we call this a training set. This data has inputs and their corresponding correct output. So we can tell the little brain when it misses in its prediction, and by doing so, we also adjust the weights a bit in the direction where we know the perceptron committed the mistake hoping that after many iterations like this the weights will be so that most of the predictions will be correct.After the model trains successfully we can have it classify data it has never seen before, and we have a fairly high confidence that it will do so correctly.The math behind this magical property of the perceptron is called gradient descent, and is just a bit of differential calculus that helps us convert the error the brain is having into tiny nudges of value of the weights towards their optimum. This video series by 3 blue 1 brown explains it wonderfuly.My program creates a single neuron neural network tuned to guess if a point is above or below a randomly generated line and generates a visualization based on graphs to see how the neural network is learning through time.The neuron has 3 inputs and weights to calculate its output:input 1 is the X coordinate of the point,Input 2 is the y coordinate of the point,Input 3 is the bias and it is always 1Input 3 or the bias is required for lines that do not cross the origin (0,0)The Perceptron starts with weights all set to zero and learns by using 1,000 random points per each iteration.The output of the perceptron is calculated with the following activation function: if x * weight_x + y weight_y + weight_bias is positive then 1 else 0The error for each point is calculated as the expected outcome of the perceptron minus the real outcome therefore there are only 3 possible error values: Expected Calculated Error 1 -1 1 1 1 0 -1 -1 0 -1 1 -1 With every point that is learned if the error is not 0 the weights are adjusted according to:New_weight = Old_weight + error * input * learning_ratefor example: New_weight_x = Old_weight_x + error * x * learning rateA very useful parameter in all of neural networks is teh learning rate, which is basically a measure on how tiny our nudge to the weights is going to be.In this particular case, I coded the learning_rate to decrease with every iteration as follows:learning_rate = 0.01 / (iteration + 1)this is important to ensure that once the weights are nearing the optimal values the adjustment in each iteration is subsequently more subtle.In the end, the perceptron always converges into a solution and finds with great precision the line we are looking for.Perceptrons are quite a revelation in that they can resolve equations by learning, however they are very limited. By their nature they can only resolve linear equations, so their problem space is quite narrow.Nowadays the neural networks consist of combinations of many perceptrons, in many layers, and other types of “neurons”, like convolution, recurrent, etc. increasing significantly the types of problems they solve."
    } 
  
]
