<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Data | Tengku Hanis</title>
    <link>https://tengkuhanis.netlify.app/tag/data/</link>
      <atom:link href="https://tengkuhanis.netlify.app/tag/data/index.xml" rel="self" type="application/rss+xml" />
    <description>Data</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>©Tengku Hanis 2020-2025 Made with [blogdown](https://github.com/rstudio/blogdown)</copyright><lastBuildDate>Thu, 18 Jul 2024 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tengkuhanis.netlify.app/images/icon_hua2ec155b4296a9c9791d015323e16eb5_11927_512x512_fill_lanczos_center_2.png</url>
      <title>Data</title>
      <link>https://tengkuhanis.netlify.app/tag/data/</link>
    </image>
    
    <item>
      <title>Basic data wrangling with Python</title>
      <link>https://tengkuhanis.netlify.app/post/basic-data-wrangling-with-python/</link>
      <pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/basic-data-wrangling-with-python/</guid>
      <description>


&lt;p&gt;&lt;img src=&#34;images/img2.jpeg&#34; style=&#34;width:60.0%;height:40.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Python is one of the most popular programming language and software. In this post, I will demonstrate how to do a basic data wrangling with Python. This is going to be one of the several series of post related to Python (hopefully). My plan is to cover these topics:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Basic data wrangling with Python&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Basic plotting with matplotlib and seaborn&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Comparison of ggplot in R versus in Python&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once I finish writing any of the topics, I will link it to the above.&lt;/p&gt;
&lt;p&gt;So, let’s start.&lt;/p&gt;
&lt;div id=&#34;loading-necessary-packages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading necessary packages&lt;/h2&gt;
&lt;p&gt;Before loading the packages, you need to install the packages. Basically, there are two ways to install the Python packages. Either by pip command or conda command. I will skip this part, but you can refer to &lt;a href=&#34;https://packaging.python.org/en/latest/tutorials/installing-packages/#installing-packages&#34;&gt;this link to install the packages using pip command&lt;/a&gt; or &lt;a href=&#34;https://conda.io/projects/conda/en/latest/user-guide/concepts/installing-with-conda.html#installing-with-conda&#34;&gt;this link to install the packages using conda command&lt;/a&gt;. For those who has both R and Python in your machine, I suggest to use a conda command.&lt;/p&gt;
&lt;p&gt;Let’s load the required packages.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;import numpy as np 
import pandas as pd
from seaborn import load_dataset&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All the functions from each package can be assessed from the alias or the abbreviated text above. For example, functions in &lt;code&gt;pandas&lt;/code&gt; package can be accessed through &lt;code&gt;pd&lt;/code&gt; or to be specific &lt;code&gt;pd.&lt;/code&gt;. You will see this many times through out this blog post, so do not worry much about this. I am sure you will get the gist of it once you see this later on. In practice, you don’t actually need to use &lt;code&gt;pd&lt;/code&gt; for &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;np&lt;/code&gt; for &lt;code&gt;numpy&lt;/code&gt;, but this is a convention or standard practice widely adopted in the Python community.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;load-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Load the data&lt;/h2&gt;
&lt;p&gt;We going to use &lt;a href=&#34;https://archive.ics.uci.edu/dataset/53/iris&#34;&gt;iris dataset&lt;/a&gt;. This dataset is readily available in seaborn package.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris = load_dataset(&amp;#39;iris&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we load the data, we need to check the variable type.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.dtypes&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length    float64
## sepal_width     float64
## petal_length    float64
## petal_width     float64
## species          object
## dtype: object&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Variable species, by right, is a categorical variable. So, we can use &lt;code&gt;Categorical()&lt;/code&gt; from &lt;code&gt;pandas&lt;/code&gt; to change it from an object variable type to a category. &lt;code&gt;pd.&lt;/code&gt; here, means we access the function from &lt;code&gt;pandas&lt;/code&gt; package as I explained it previously.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;species&amp;#39;] = pd.Categorical(iris[&amp;#39;species&amp;#39;])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we check the variable type again, we can see the species variable is a category.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.dtypes&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length     float64
## sepal_width      float64
## petal_length     float64
## petal_width      float64
## species         category
## dtype: object&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we can also see the data. Let’s see the first 10 rows.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.head(10)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          3.9           1.7          0.4  setosa
## 6           4.6          3.4           1.4          0.3  setosa
## 7           5.0          3.4           1.5          0.2  setosa
## 8           4.4          2.9           1.4          0.2  setosa
## 9           4.9          3.1           1.5          0.1  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;slicing-and-indexing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Slicing and indexing&lt;/h2&gt;
&lt;p&gt;To see a specific column, we can index as below. Notice, that the row number starts with 0 as opposed to R (if you have used R previously) in which the row number starts with 1.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;sepal_length&amp;#39;][0:10]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similarly, we can also index as below to get the first 10 rows of sepal_length variable.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;sepal_length&amp;#39;][:10]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 0    5.1
## 1    4.9
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## 6    4.6
## 7    5.0
## 8    4.4
## 9    4.9
## Name: sepal_length, dtype: float64&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next to access the first 5 rows, we can do as below.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[0:5]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also use &lt;code&gt;iloc()&lt;/code&gt; and &lt;code&gt;loc()&lt;/code&gt; functions. The main difference between the two functions is that &lt;code&gt;iloc()&lt;/code&gt; can only accept a numerical value and &lt;code&gt;loc()&lt;/code&gt; function can accept a string value.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.iloc[0:2, 0:3] #rows, then columns&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length
## 0           5.1          3.5           1.4
## 1           4.9          3.0           1.4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.loc[0:2, [&amp;#39;sepal_length&amp;#39;, &amp;#39;species&amp;#39;]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length species
## 0           5.1  setosa
## 1           4.9  setosa
## 2           4.7  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Subsequently, we can also slice according a logical condition. Below, we slice the petal_length variable that is above the value of 6.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;ind = iris[&amp;#39;petal_length&amp;#39;] &amp;gt; 6
iris[&amp;#39;petal_length&amp;#39;][ind]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 105    6.6
## 107    6.3
## 109    6.1
## 117    6.7
## 118    6.9
## 122    6.7
## 130    6.1
## 131    6.4
## 135    6.1
## Name: petal_length, dtype: float64&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s say we want our data to include only setosa species.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;ind = iris[&amp;#39;species&amp;#39;] == &amp;#39;setosa&amp;#39;
iris.loc[ind, :].head()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           5.1          3.5           1.4          0.2  setosa
## 1           4.9          3.0           1.4          0.2  setosa
## 2           4.7          3.2           1.3          0.2  setosa
## 3           4.6          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once we know about slicing and indexing, we can use this knowledge to change certain values. For example, below we change:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;row 1, 2, 3, and 4 of sepal_length to NA values&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;row 6 of species and sepal_width to NA values&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.loc[0:3, &amp;#39;sepal_length&amp;#39;] = np.nan 
iris.iloc[5, [1, 4]] = np.nan&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s see the result.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.head(6)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    sepal_length  sepal_width  petal_length  petal_width species
## 0           NaN          3.5           1.4          0.2  setosa
## 1           NaN          3.0           1.4          0.2  setosa
## 2           NaN          3.2           1.3          0.2  setosa
## 3           NaN          3.1           1.5          0.2  setosa
## 4           5.0          3.6           1.4          0.2  setosa
## 5           5.4          NaN           1.7          0.4     NaN&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;missing-values&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Missing values&lt;/h2&gt;
&lt;p&gt;If we want to see if we have any missing values in our data, we can use &lt;code&gt;isnull()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.isnull().any().any() #For overall&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## True&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.isnull().any() #Check for each column&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length     True
## sepal_width      True
## petal_length    False
## petal_width     False
## species          True
## dtype: bool&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can further calculate how many missing values that we have.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.isnull().sum()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## sepal_length    4
## sepal_width     1
## petal_length    0
## petal_width     0
## species         1
## dtype: int64&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;descriptive-statistics&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Descriptive statistics&lt;/h2&gt;
&lt;p&gt;To get a basic descriptive statistics, we can use &lt;code&gt;describe()&lt;/code&gt; function. Below, we additionally use &lt;code&gt;round()&lt;/code&gt; to round up the results into one decimal points.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.describe().round()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        sepal_length  sepal_width  petal_length  petal_width
## count         146.0        149.0         150.0        150.0
## mean            6.0          3.0           4.0          1.0
## std             1.0          0.0           2.0          1.0
## min             4.0          2.0           1.0          0.0
## 25%             5.0          3.0           2.0          0.0
## 50%             6.0          3.0           4.0          1.0
## 75%             6.0          3.0           5.0          2.0
## max             8.0          4.0           7.0          2.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that the results above only include numerical variables. So, to get the results for categorical variables as well, we need to add &lt;code&gt;include = all&lt;/code&gt; as below.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris.describe(include = &amp;#39;all&amp;#39;).round()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         sepal_length  sepal_width  petal_length  petal_width     species
## count          146.0        149.0         150.0        150.0         149
## unique           NaN          NaN           NaN          NaN           3
## top              NaN          NaN           NaN          NaN  versicolor
## freq             NaN          NaN           NaN          NaN          50
## mean             6.0          3.0           4.0          1.0         NaN
## std              1.0          0.0           2.0          1.0         NaN
## min              4.0          2.0           1.0          0.0         NaN
## 25%              5.0          3.0           2.0          0.0         NaN
## 50%              6.0          3.0           4.0          1.0         NaN
## 75%              6.0          3.0           5.0          2.0         NaN
## max              8.0          4.0           7.0          2.0         NaN&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, we can also calculate the unique values for the categorical variable. &lt;code&gt;value_counts()&lt;/code&gt; only calculate the non-missing values.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;species&amp;#39;].value_counts()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## species
## versicolor    50
## virginica     50
## setosa        49
## Name: count, dtype: int64&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similarly, for numerical variable we can also do manually each statistics. For example to calculate mean, we can use &lt;code&gt;mean()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;iris[&amp;#39;sepal_width&amp;#39;].mean().round()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 3.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s it. These are the basics of handling a dataset in Python. With this knowledge, I hope you feel ready to dive in and explore more on your own.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What makes data &#34;good enough&#34; for a statistical analysis?</title>
      <link>https://tengkuhanis.netlify.app/post/what-makes-data-good-enough-for-a-statistical-analysis/</link>
      <pubDate>Thu, 29 Feb 2024 00:00:00 +0000</pubDate>
      <guid>https://tengkuhanis.netlify.app/post/what-makes-data-good-enough-for-a-statistical-analysis/</guid>
      <description>


&lt;p&gt;&lt;img src=&#34;images/_34facba1-9993-41d5-ae25-c0cab57f2184.jpg&#34; style=&#34;width:60.0%;height:40.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;A few days earlier, someone asked me to help her with the data analysis. However, the data that she gave me was so bad that it was completely impossible to run the analysis unless a serious data cleaning was done first.&lt;/p&gt;
&lt;p&gt;So, I am thinking about what is a general guideline to consider a data is good enough to run the statistical analysis with it.&lt;/p&gt;
&lt;p&gt;First thing first, what are the basic format of a “good enough” data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Each row represents an observation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. For a categorical variable, make sure the levels are standardised.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, for gender variable, make sure to have only “male” and “female” instead of “male”, “female”, “men”, and “women”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. For a numerical variable, make sure the value is numeric and do not contain any text.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, for height variable, do not put “1.68m” or “1.68 meter”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. For numerical variable as well, make sure the numeric values in the variable are in the same scale.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, for weight variable, do not mix the weight in grams and kilograms. If you want to use grams, use it consistently throughout the data, or at least throughout the variable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Do not use symbol in your data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, do not use “X” as no and “/” as yes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;6. The data should be an individual data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;An individual data means that each row consists of information about each sample or observation. Each observation in the dataset represents a single entity or unit (e.g., a person, a transaction, a product) and includes all relevant attributes or variables for that unit. Individual data allow for detailed analysis at the level of individual observations. Here is an example of individual data:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;Id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Age&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Obese&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Aggregated data, on the other hand, combines multiple individual observations into summary statistics or groups. Instead of representing individual units, aggregated data presents information at a higher level of abstraction, such as groups, categories, or intervals. This aggregation typically involves summarizing data using functions like sums, averages, counts, or percentages. Here is an example of aggregated data based on the individual data previously:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Age&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Obese&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;no&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;gt;50&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;yes&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I think the above six points are the basics of building a good enough dataset for a statistical analysis. While we are at it, let’s go through the two main formats of a dataset. These formats are more common when you have a repeated measure study design whereby each participant has several values/measurements/responds at several time points.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: lower-alpha&#34;&gt;
&lt;li&gt;Wide format&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the wide format, the response of each participant will be in a single row. For example, below is the data of time taking by two participants in answering three questions in second. As we can see each row consists of the time in second in answering all three questions.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;ID&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question1&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question2&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;ol start=&#34;2&#34; style=&#34;list-style-type: lower-alpha&#34;&gt;
&lt;li&gt;Long format&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In the long format (also known as tidy format in R community), the response at each time point of each participant will be in a single row. By using the same data previously, below is the in the long format. As we can see the data is arrange in format that each row represents each time taking to answer the question by each participant.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;ID&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Time&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
</description>
    </item>
    
  </channel>
</rss>
