<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Deep Learning | Tengku Hanis</title>
    <link>https://tengkuhanis.netlify.app/category/deep-learning/</link>
      <atom:link href="https://tengkuhanis.netlify.app/category/deep-learning/index.xml" rel="self" type="application/rss+xml" />
    <description>Deep Learning</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>©Tengku Hanis 2020-2025 Made with [blogdown](https://github.com/rstudio/blogdown)</copyright><lastBuildDate>Sun, 09 May 2021 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tengkuhanis.netlify.app/images/icon_hua2ec155b4296a9c9791d015323e16eb5_11927_512x512_fill_lanczos_center_2.png</url>
      <title>Deep Learning</title>
      <link>https://tengkuhanis.netlify.app/category/deep-learning/</link>
    </image>
    
    <item>
      <title>Exponentially Weighted Average in Deep Learning</title>
      <link>https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/</link>
      <pubDate>Sun, 09 May 2021 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/</guid>
      <description>


&lt;p&gt;I have been reading about loss functions and optimisers in deep learning for the last couple of days, and I stumbled upon the term Exponentially Weighted Average (EWA). In this post, I aim to explain my understanding of EWA.&lt;/p&gt;
&lt;div id=&#34;overview-of-ewa&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Overview of EWA&lt;/h2&gt;
&lt;p&gt;EWA is an important concept in deep learning and is used in several optimisers to smooth out noise in the data (in optimisers, the gradients).&lt;/p&gt;
&lt;p&gt;Let’s see the formula for EWA:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;formula.png&#34; width=&#34;60%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; is the smoothed value at time &lt;em&gt;t&lt;/em&gt;, while &lt;em&gt;S&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; is the data point at time &lt;em&gt;t&lt;/em&gt;. &lt;em&gt;B&lt;/em&gt; here is a hyperparameter that we need to tune in our network. The choice of &lt;em&gt;B&lt;/em&gt; determines roughly how many data points &lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; is averaged over, as shown below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;beta.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
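&lt;p&gt;As a rough sketch of the formula in base R, assuming it has the common recursive form &lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; = &lt;em&gt;B&lt;/em&gt;·&lt;em&gt;V&lt;sub&gt;t-1&lt;/sub&gt;&lt;/em&gt; + (1 - &lt;em&gt;B&lt;/em&gt;)·&lt;em&gt;S&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; with a starting value of 0. The function name &lt;code&gt;ewa&lt;/code&gt; and the noisy toy signal are made up for illustration, and the "1 / (1 - B) points" figure in the comments is only the usual rule of thumb:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Exponentially weighted average, assuming V_t = b * V_(t-1) + (1 - b) * S_t
# and a starting value V_0 = 0 (both are assumptions, not read off the figure)
ewa &amp;lt;- function(s, b) {
  v &amp;lt;- numeric(length(s))
  prev &amp;lt;- 0
  for (t in seq_along(s)) {
    prev &amp;lt;- b * prev + (1 - b) * s[t]
    v[t] &amp;lt;- prev
  }
  v
}

set.seed(1)
s &amp;lt;- sin(seq(0, 2 * pi, length.out = 50)) + rnorm(50, sd = 0.3)  # noisy toy signal
v_small &amp;lt;- ewa(s, b = 0.5)  # averages over roughly 1 / (1 - 0.5) = 2 points
v_large &amp;lt;- ewa(s, b = 0.9)  # averages over roughly 1 / (1 - 0.9) = 10 points&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Plotting &lt;code&gt;s&lt;/code&gt;, &lt;code&gt;v_small&lt;/code&gt; and &lt;code&gt;v_large&lt;/code&gt; together is a quick way to see how a larger &lt;em&gt;B&lt;/em&gt; gives a smoother but more lagged curve.&lt;/p&gt;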
&lt;/div&gt;
&lt;div id=&#34;ewa-in-deep-learnings-optimiser&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;EWA in deep learning optimisers&lt;/h2&gt;
&lt;p&gt;Some of the optimisers that use EWA are listed below (the red box indicates the EWA part in each formula):&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Stochastic gradient descent (SGD) with momentum&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The issue with SGD is the presence of noise in the gradients while searching for the minimum of the loss. SGD with momentum integrates EWA, which reduces this noise and helps the network converge faster.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;SGD-momentum2.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
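&lt;p&gt;A minimal sketch of the idea in R, assuming one common form of the update: &lt;em&gt;v&lt;/em&gt; = &lt;em&gt;B&lt;/em&gt;·&lt;em&gt;v&lt;/em&gt; + (1 - &lt;em&gt;B&lt;/em&gt;)·&lt;em&gt;g&lt;/em&gt;, followed by &lt;em&gt;w&lt;/em&gt; = &lt;em&gt;w&lt;/em&gt; - &lt;em&gt;lr&lt;/em&gt;·&lt;em&gt;v&lt;/em&gt;. The exact formula in the figure may differ, and the quadratic loss, the noise level, the learning rate and the function name are all made up for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Minimise f(w) = w^2 with noisy gradients, using an EWA of the gradients.
# Assumed update: v = b * v + (1 - b) * g ; w = w - lr * v
sgd_momentum &amp;lt;- function(w, b = 0.9, lr = 0.1, steps = 200) {
  v &amp;lt;- 0
  for (i in seq_len(steps)) {
    g &amp;lt;- 2 * w + rnorm(1, sd = 0.5)  # true gradient 2w plus noise
    v &amp;lt;- b * v + (1 - b) * g         # EWA of the noisy gradients
    w &amp;lt;- w - lr * v                  # step along the smoothed gradient
  }
  w
}

set.seed(1)
sgd_momentum(w = 5)  # ends close to the minimum at w = 0&lt;/code&gt;&lt;/pre&gt;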
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Adaptive delta (Adadelta) and Root Mean Square Propagation (RMSprop)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Adadelta and RMSprop were proposed in an attempt to solve the diminishing learning rate of the adaptive gradient (Adagrad) optimiser. Adagrad accumulates the sum of all past squared gradients, so its effective learning rate keeps shrinking; both optimisers replace that sum with an EWA of squared gradients, which lets old gradients fade out. The two optimisers have quite similar formulas; attached below is the formula for Adadelta.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;adadelta2.png&#34; width=&#34;80%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
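&lt;p&gt;To make the role of EWA concrete, here is a rough sketch of an RMSprop-style step in R. It assumes the commonly cited form &lt;em&gt;s&lt;/em&gt; = &lt;em&gt;B&lt;/em&gt;·&lt;em&gt;s&lt;/em&gt; + (1 - &lt;em&gt;B&lt;/em&gt;)·&lt;em&gt;g&lt;/em&gt;&lt;sup&gt;2&lt;/sup&gt; and &lt;em&gt;w&lt;/em&gt; = &lt;em&gt;w&lt;/em&gt; - &lt;em&gt;lr&lt;/em&gt;·&lt;em&gt;g&lt;/em&gt; / sqrt(&lt;em&gt;s&lt;/em&gt; + &lt;em&gt;eps&lt;/em&gt;). It is not a transcription of the Adadelta formula in the figure, and the function name, default values and toy loss are illustrative:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# One RMSprop-style step: keep an EWA of squared gradients and scale the step by it.
# Adagrad would instead use s = s + g^2, which only grows, so its steps keep shrinking.
rmsprop_step &amp;lt;- function(w, g, s, b = 0.9, lr = 0.01, eps = 1e-8) {
  s &amp;lt;- b * s + (1 - b) * g^2        # EWA of squared gradients
  w &amp;lt;- w - lr * g / sqrt(s + eps)   # per-parameter scaled update
  list(w = w, s = s)
}

state &amp;lt;- list(w = c(1, -2), s = c(0, 0))
for (i in 1:100) {
  g &amp;lt;- 2 * state$w                  # gradient of sum(w^2)
  state &amp;lt;- rmsprop_step(state$w, g, state$s)
}
state$w  # both parameters have moved towards 0&lt;/code&gt;&lt;/pre&gt;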
&lt;ol start=&#34;3&#34; style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Adaptive moment estimation (ADAM)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;ADAM essentially combines SGD with momentum and Adadelta. As shown earlier, both of these use EWA, so ADAM keeps one EWA of the gradients and another of the squared gradients.&lt;/p&gt;
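&lt;p&gt;A rough sketch of a single ADAM step in R, following the widely used form with bias correction. The defaults &lt;code&gt;b1 = 0.9&lt;/code&gt;, &lt;code&gt;b2 = 0.999&lt;/code&gt; and &lt;code&gt;eps = 1e-8&lt;/code&gt; are the commonly quoted ones; the function name, the learning rate and the toy loss are made up for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# One ADAM step: m is an EWA of gradients (the momentum part),
# v is an EWA of squared gradients (the RMSprop/Adadelta part).
adam_step &amp;lt;- function(w, g, m, v, t, lr = 0.1, b1 = 0.9, b2 = 0.999, eps = 1e-8) {
  m &amp;lt;- b1 * m + (1 - b1) * g
  v &amp;lt;- b2 * v + (1 - b2) * g^2
  m_hat &amp;lt;- m / (1 - b1^t)   # bias correction, because m and v start at 0
  v_hat &amp;lt;- v / (1 - b2^t)
  w &amp;lt;- w - lr * m_hat / (sqrt(v_hat) + eps)
  list(w = w, m = m, v = v)
}

st &amp;lt;- list(w = 3, m = 0, v = 0)
for (t in 1:200) {
  g &amp;lt;- 2 * st$w             # gradient of w^2
  st &amp;lt;- adam_step(st$w, g, st$m, st$v, t)
}
st$w  # close to the minimum at w = 0&lt;/code&gt;&lt;/pre&gt;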
&lt;/div&gt;
&lt;div id=&#34;more-details-on-ewa&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;More details on EWA&lt;/h2&gt;
&lt;p&gt;Now, let’s go back to EWA. Here is an example of how EWA is calculated:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;seq1.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Keep in mind that &lt;em&gt;t&lt;sub&gt;3&lt;/sub&gt;&lt;/em&gt; is the latest time point, followed by &lt;em&gt;t&lt;sub&gt;2&lt;/sub&gt;&lt;/em&gt; and &lt;em&gt;t&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt;, respectively. So, if we want to calculate &lt;em&gt;V&lt;sub&gt;3&lt;/sub&gt;&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;seq2.png&#34; width=&#34;90%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
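&lt;p&gt;The same kind of expansion can be checked numerically. This sketch assumes the recursion &lt;em&gt;V&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; = &lt;em&gt;B&lt;/em&gt;·&lt;em&gt;V&lt;sub&gt;t-1&lt;/sub&gt;&lt;/em&gt; + (1 - &lt;em&gt;B&lt;/em&gt;)·&lt;em&gt;a&lt;sub&gt;t&lt;/sub&gt;&lt;/em&gt; with a starting value of 0, so that &lt;em&gt;V&lt;sub&gt;3&lt;/sub&gt;&lt;/em&gt; = (1 - &lt;em&gt;B&lt;/em&gt;)·&lt;em&gt;a&lt;sub&gt;3&lt;/sub&gt;&lt;/em&gt; + (1 - &lt;em&gt;B&lt;/em&gt;)·&lt;em&gt;B&lt;/em&gt;·&lt;em&gt;a&lt;sub&gt;2&lt;/sub&gt;&lt;/em&gt; + (1 - &lt;em&gt;B&lt;/em&gt;)·&lt;em&gt;B&lt;/em&gt;&lt;sup&gt;2&lt;/sup&gt;·&lt;em&gt;a&lt;sub&gt;1&lt;/sub&gt;&lt;/em&gt;. The values of &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; are arbitrary:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;a &amp;lt;- c(10, 20, 30)  # a1, a2, a3 (a3 is the latest point); arbitrary values
b &amp;lt;- 0.9

# Recursive form, starting from V_0 = 0 (an assumption)
v &amp;lt;- 0
for (a_t in a) v &amp;lt;- b * v + (1 - b) * a_t

# Expanded form: each point is weighted by (1 - b) * b^(age of the point)
w &amp;lt;- (1 - b) * b^(2:0)  # weights for a1, a2, a3
v_expanded &amp;lt;- sum(w * a)

all.equal(v, v_expanded)  # TRUE: the two forms agree&lt;/code&gt;&lt;/pre&gt;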
&lt;p&gt;To see how the weights change as we vary the value of &lt;em&gt;B&lt;/em&gt; (while the values of &lt;em&gt;a&lt;sub&gt;1&lt;/sub&gt;…a&lt;sub&gt;n&lt;/sub&gt;&lt;/em&gt; remain constant), we can plot them in R.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)

# Weights of the 20 data points for a given beta: t1 (oldest) gets
# (1 - b) * b^19 and t20 (most recent) gets (1 - b) * b^0
func &amp;lt;- function(b) (1 - b) * b^((20:1) - 1)
beta &amp;lt;- seq(0.1, 0.9, by=0.2)

# One row per beta, one column per time point
dat &amp;lt;- t(sapply(beta, func)) %&amp;gt;% 
  as.data.frame()
colnames(dat)[1:20] &amp;lt;- 1:20

# Reshape to long format and plot the weight at each time point, by beta
dat %&amp;gt;%  
  mutate(beta = as_factor(beta)) %&amp;gt;%
  pivot_longer(cols = 1:20, names_to = &amp;quot;data_point&amp;quot;, values_to = &amp;quot;weight&amp;quot;) %&amp;gt;% 
  ggplot(aes(x=as.numeric(data_point), y=weight, color=beta)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:20) +
  labs(title = &amp;quot;Change of Exponentially Weighted Average function&amp;quot;, 
       subtitle = &amp;quot;Time at t20 is the recent time, and t1 is the initial time&amp;quot;) +
  scale_colour_discrete(&amp;quot;Beta:&amp;quot;) +
  xlab(&amp;quot;Time(t)&amp;quot;) +
  ylab(&amp;quot;Weights/Coefficients&amp;quot;) +
  theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://tengkuhanis.netlify.app/post/exponentially-weighted-average-in-deep-learning/index.en_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Note that t&lt;sub&gt;20&lt;/sub&gt; is the most recent time point and t&lt;sub&gt;1&lt;/sub&gt; is the earliest. The two main points from the above plot are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
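&lt;li&gt;The EWA weights decay exponentially as we move back in time from the most recent data point.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;As beta, &lt;em&gt;B&lt;/em&gt;, increases, the weight on the most recent data point, (1 - &lt;em&gt;B&lt;/em&gt;), gets smaller and the weights decay more slowly, so older data points contribute more and the average is smoother. A small &lt;em&gt;B&lt;/em&gt; puts most of the emphasis on the most recent data point.&lt;/li&gt;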
&lt;/ol&gt;
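&lt;p&gt;To read the numbers directly rather than from the plot, the sketch below uses the same weights, (1 - &lt;em&gt;B&lt;/em&gt;)·&lt;em&gt;B&lt;/em&gt;&lt;sup&gt;age&lt;/sup&gt;, and tabulates the weight on the most recent point and the combined weight of the three most recent points for each beta. The column names are made up for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;beta &amp;lt;- seq(0.1, 0.9, by = 0.2)

# Weight on the most recent point (age 0) is (1 - b) * b^0 = 1 - b
recent_weight &amp;lt;- 1 - beta

# Combined weight of the 3 most recent points (ages 0, 1 and 2)
top3_share &amp;lt;- sapply(beta, function(b) sum((1 - b) * b^(0:2)))

data.frame(beta, recent_weight, top3_share)&lt;/code&gt;&lt;/pre&gt;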
&lt;p&gt;&lt;em&gt;Side note: I have tried to do the plot in plotly, not sure why it did not work&lt;/em&gt; 😕&lt;/p&gt;
&lt;p&gt;References:&lt;br /&gt;
1) &lt;a href=&#34;https://towardsdatascience.com/deep-learning-optimizers-436171c9e23f&#34; class=&#34;uri&#34;&gt;https://towardsdatascience.com/deep-learning-optimizers-436171c9e23f&lt;/a&gt; (all the equations are from this reference)&lt;br /&gt;
2) &lt;a href=&#34;https://youtu.be/NxTFlzBjS-4&#34; class=&#34;uri&#34;&gt;https://youtu.be/NxTFlzBjS-4&lt;/a&gt;&lt;br /&gt;
3) &lt;a href=&#34;https://medium.com/@dhartidhami/exponentially-weighted-averages-5de212b5be46&#34; class=&#34;uri&#34;&gt;https://medium.com/@dhartidhami/exponentially-weighted-averages-5de212b5be46&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
