Jekyll2020-09-03T08:17:02+10:00https://hsteinshiromoto.github.io/feed.xmlHumberto STEIN SHIROMOTOHumberto's websiteHumberto STEIN SHIROMOTOh.stein.shiromoto@gmail.comQuerying the Latest Record2020-08-13T00:00:00+10:002020-08-13T00:00:00+10:00https://hsteinshiromoto.github.io/posts/2020/08/13/blog-post_querying_the_latest_record<p>In this gist, I show how to get the latest record of a user based on a datetime column.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">t1</span><span class="p">.</span><span class="n">row_id</span>
<span class="p">,</span><span class="nb">DATE</span><span class="p">(</span><span class="n">t1</span><span class="p">.</span><span class="n">start_dt</span><span class="p">)</span>
<span class="p">,</span><span class="nb">DATE</span><span class="p">(</span><span class="n">t1</span><span class="p">.</span><span class="n">end_dt</span><span class="p">)</span>
<span class="k">FROM</span> <span class="k">schema</span><span class="p">.</span><span class="k">table</span> <span class="n">t1</span>
<span class="k">INNER</span> <span class="k">JOIN</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">row_id</span>
<span class="p">,</span><span class="k">max</span><span class="p">(</span><span class="n">start_dt</span><span class="p">)</span> <span class="k">AS</span> <span class="n">MaxStartDate</span>
<span class="p">,</span><span class="k">max</span><span class="p">(</span><span class="n">end_dt</span><span class="p">)</span> <span class="k">AS</span> <span class="n">MaxEndDate</span>
<span class="k">FROM</span> <span class="k">schema</span><span class="p">.</span><span class="k">table</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">row_id</span>
<span class="p">)</span> <span class="n">t2</span>
<span class="k">ON</span> <span class="n">t1</span><span class="p">.</span><span class="n">row_id</span> <span class="o">=</span> <span class="n">t2</span><span class="p">.</span><span class="n">row_id</span>
<span class="k">AND</span> <span class="p">(</span><span class="n">t1</span><span class="p">.</span><span class="n">end_dt</span> <span class="o">=</span> <span class="n">t2</span><span class="p">.</span><span class="n">MaxEndDate</span> <span class="k">OR</span> <span class="n">t1</span><span class="p">.</span><span class="n">end_dt</span> <span class="k">IS</span> <span class="k">NULL</span> <span class="k">AND</span> <span class="n">t2</span><span class="p">.</span><span class="n">MaxEndDate</span> <span class="k">is</span> <span class="k">NULL</span><span class="p">)</span>
<span class="k">AND</span> <span class="n">t1</span><span class="p">.</span><span class="n">start_dt</span> <span class="o">=</span> <span class="n">t2</span><span class="p">.</span><span class="n">MaxStartDate</span>
</code></pre></div></div>
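<p>To try the idea without a database server, here is a minimal sketch in Python using the standard-library <code class="language-plaintext highlighter-rouge">sqlite3</code> module. The table name <code class="language-plaintext highlighter-rouge">events</code>, the sample rows, and the helper <code class="language-plaintext highlighter-rouge">latest_records</code> are assumptions made for illustration, and the join is simplified to the <code class="language-plaintext highlighter-rouge">start_dt</code> condition only.</p>

```python
import sqlite3


def latest_records(rows):
    """Return, for each row_id, the record with the greatest start_dt."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table; datetimes are stored as ISO-8601 text so MAX() sorts correctly
    con.execute("CREATE TABLE events (row_id INTEGER, start_dt TEXT, end_dt TEXT)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    query = """
        SELECT t1.row_id, DATE(t1.start_dt), DATE(t1.end_dt)
        FROM events t1
        INNER JOIN (
            SELECT row_id, MAX(start_dt) AS MaxStartDate
            FROM events
            GROUP BY row_id
        ) t2
        ON t1.row_id = t2.row_id
        AND t1.start_dt = t2.MaxStartDate
        ORDER BY t1.row_id
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


# Made-up sample data: user 1 has two records, the later one still open (end_dt is NULL)
rows = [
    (1, "2020-01-01 09:00:00", "2020-01-31 17:00:00"),
    (1, "2020-02-01 09:00:00", None),
    (2, "2020-03-05 08:30:00", "2020-03-06 08:30:00"),
]
print(latest_records(rows))
```

<p>Only the most recent row per <code class="language-plaintext highlighter-rouge">row_id</code> survives the join, because the subquery keeps a single maximum start date for each group.</p>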
<h1 id="references">References</h1>
<p><a href="https://stackoverflow.com/questions/2411559/how-do-i-query-sql-for-a-latest-record-date-for-each-user/2411763#2411763">https://stackoverflow.com/questions/2411559/how-do-i-query-sql-for-a-latest-record-date-for-each-user/2411763#2411763</a></p>Humberto STEIN SHIROMOTOh.stein.shiromoto@gmail.comIn this gist, I show how to get the latest record of a user based on a datetime column.Pandas Value Counts2020-08-13T00:00:00+10:002020-08-13T00:00:00+10:00https://hsteinshiromoto.github.io/posts/2020/08/27/blog-post_pandas_value_counts<p>The <code class="language-plaintext highlighter-rouge">value_counts()</code> function in the popular Python data science library <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html">Pandas</a> is a quick way to count the occurrences of each unique value in a single column, otherwise known as a Series.</p>
<p>This function is extremely useful for very quickly performing some basic data analysis on specific columns of data contained in a Pandas DataFrame.</p>
<p>This article has been copied from a <a href="https://medium.com/m/signin?operation=login&redirect=https%3A%2F%2Ftowardsdatascience.com%2Fvaluable-data-analysis-with-pandas-value-counts-d87bbdf42f79">towardsdatascience</a> post; you should visit the original to check the author's other articles. This post shows how, with a few additions to your code, you can do quite a lot of analysis using this function.</p>
<div class="fluidMedia" style="height: 100vh;">
<iframe src="https://nbviewer.jupyter.org/github/hsteinshiromoto/blog/blob/dev/notebooks/gist.pandas.value_counts/gist.pandas.value_counts.ipynb" style="height: 100%; width: 100%;" frameborder="0" id="iframe"> </iframe>
</div>Humberto STEIN SHIROMOTOh.stein.shiromoto@gmail.comThe value_counts() function in the popular python data science library Pandas is a quick way to count the unique values in a single column otherwise known as a series of data.My Quick Reference Guide For A Few Natural Language Processing Techniques2020-07-02T00:00:00+10:002020-07-02T00:00:00+10:00https://hsteinshiromoto.github.io/posts/2020/07/02/blog-post_my_quick_reference_guide_for_a_few_natural_language_processing_techniques<p>Natural language processing (NLP) is a field of study dedicated to
the analysis of natural languages, in particular using statistics and
algorithms.</p>
<p>This blog post is a quick reference guide on what each method is, when
to use it, and how to use it.</p>
<ul>
<li><a href="#regular-expressions-regex">Regular Expressions (RegEx)</a>
<ul>
<li><a href="#what-it-is">What it is</a></li>
<li><a href="#when-to-use-it">When to use it</a></li>
<li><a href="#how-to-use-it">How to use it</a></li>
</ul>
</li>
<li><a href="#word-tokenization">Word Tokenization</a>
<ul>
<li><a href="#what-it-is-1">What it is</a></li>
<li><a href="#when-to-use-it-1">When to use it</a></li>
<li><a href="#how-to-use-it-1">How to use it</a></li>
</ul>
</li>
<li><a href="#bag-of-words">Bag of Words</a>
<ul>
<li><a href="#what-it-is-2">What it is</a></li>
<li><a href="#when-to-use-it-2">When to use it</a></li>
<li><a href="#how-to-use-it-2">How to use it</a></li>
</ul>
</li>
<li><a href="#word-as-vectors">Word as Vectors</a>
<ul>
<li><a href="#what-it-is-3">What it is</a></li>
<li><a href="#when-to-use-it-3">When to use it</a></li>
<li><a href="#how-to-use-it-3">How to use it</a></li>
</ul>
</li>
<li><a href="#tf-idf">Tf-idf</a>
<ul>
<li><a href="#what-it-is-4">What it is</a></li>
<li><a href="#when-to-use-it-4">When to use it</a></li>
<li><a href="#how-to-use-it-4">How to use it</a></li>
</ul>
</li>
<li><a href="#name-entity-recognition">Named Entity Recognition</a>
<ul>
<li><a href="#what-it-is-5">What it is</a></li>
<li><a href="#when-to-use-it-5">When to use it</a></li>
<li><a href="#how-to-use-it-5">How to use it</a></li>
</ul>
</li>
<li><a href="#references">References</a></li>
</ul>
<h1 id="regular-expressions-regex">Regular Expressions (RegEx)</h1>
<h2 id="what-it-is">What it is</h2>
<p>Regular expressions are strings written in a special syntax that
describe a pattern to be matched against other strings.</p>
<h2 id="when-to-use-it">When to use it</h2>
<p>RegEx is a powerful technique that can be used with programming languages such
as C, Python, Java and others.</p>
<p>It consists of checking whether a specific pattern occurs in the string
under consideration.</p>
<p>In many cases, using RegEx may create problems instead of solving them, as
the patterns to be searched for are not easy to
understand at first.</p>
<p>The website in [1] provides a sandbox for testing regular
expressions.</p>
<h2 id="how-to-use-it">How to use it</h2>
<p>Python’s regex module, <code class="language-plaintext highlighter-rouge">re</code>, is part of the standard library. To use it in
your script, just write<br />
<code class="language-plaintext highlighter-rouge">import re</code></p>
<p>The functions that I frequently use are</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">split(pattern, string)</code>: split the string at each occurrence of the pattern</li>
<li><code class="language-plaintext highlighter-rouge">findall(pattern, string)</code>: find all non-overlapping matches of the pattern in the string</li>
<li><code class="language-plaintext highlighter-rouge">search(pattern, string)</code>: search for the first match of the pattern anywhere in the string</li>
<li><code class="language-plaintext highlighter-rouge">match(pattern, string)</code>: match the pattern at the beginning of the string (use <code class="language-plaintext highlighter-rouge">fullmatch</code> to match the entire string)</li>
</ul>
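<p>The four functions listed above can be sketched as follows. The sample sentence and the patterns are made up for illustration only.</p>

```python
import re

# Made-up sample text
text = "Order 66 shipped on 2020-07-02; order 67 ships on 2020-07-09."

# split: break the text wherever the pattern matches
parts = re.split(r";\s*", text)

# findall: every non-overlapping match, returned as a list of strings
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# search: the first match anywhere in the string (or None if there is none)
first_number = re.search(r"\d+", text)

# match: succeeds only if the pattern matches at the start of the string
starts_with_order = re.match(r"Order", text)
starts_with_date = re.match(r"\d{4}", text)

print(parts)
print(dates)
print(first_number.group())
print(starts_with_order is not None, starts_with_date is None)
```

<p>Note that <code class="language-plaintext highlighter-rouge">search</code> and <code class="language-plaintext highlighter-rouge">match</code> return a match object rather than a string, so the matched text is read with <code class="language-plaintext highlighter-rouge">.group()</code>.</p>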
<h1 id="word-tokenization">Word Tokenization</h1>
<h2 id="what-it-is-1">What it is</h2>
<p>Tokenization is the process of transforming a string or text into tokens
(smaller pieces).</p>
<h2 id="when-to-use-it-1">When to use it</h2>
<p>Tokenization is useful for removing unwanted parts of a text, separating
punctuation, and processing hashtags in social media. For example,
tokenization can be used to split the content of a text into
sentences.</p>
<h2 id="how-to-use-it-1">How to use it</h2>
<p>The third-party Python package nltk provides word tokenization. The functions
that tokenize words and sentences are imported as follows.<br />
<code class="language-plaintext highlighter-rouge">from nltk.tokenize import word_tokenize, sent_tokenize</code><br />
<br />
These functions can be used as<br />
<code class="language-plaintext highlighter-rouge">word_tokenize(string)</code><br />
or<br />
<code class="language-plaintext highlighter-rouge">sent_tokenize(string)</code></p>
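<p>To show what a tokenizer produces without installing anything, here is a rough stand-in built on the standard-library <code class="language-plaintext highlighter-rouge">re</code> module. It is not nltk's algorithm: the helper names <code class="language-plaintext highlighter-rouge">simple_word_tokenize</code> and <code class="language-plaintext highlighter-rouge">simple_sent_tokenize</code> and their patterns are assumptions, and nltk's output will differ on harder text such as abbreviations or contractions.</p>

```python
import re


def simple_word_tokenize(text):
    """Split text into words (keeping inner apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def simple_sent_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(simple_word_tokenize("Hello, world! #nlp"))
print(simple_sent_tokenize("It works. Does it? Yes!"))
```

<p>A naive sentence splitter like this one breaks on every full stop, which is one reason a trained tokenizer is usually preferable for real text.</p>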
<h1 id="bag-of-words">Bag of Words</h1>
<h2 id="what-it-is-2">What it is</h2>
<p>Bag of words is a method that represents a string (or text) as the collection of its words, ignoring their order.</p>
<h2 id="when-to-use-it-2">When to use it</h2>
<p>This methodology provides a way to count how many times a specific
word appears in a text or a string.</p>
<h2 id="how-to-use-it-2">How to use it</h2>
<p>A bag of words can be created by counting the tokens produced by the tokenize functions shown in
the section above.</p>
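<p>A minimal sketch of a bag of words using only the standard library: the text is lower-cased, split into word tokens with a simple regular expression (an assumption standing in for a proper tokenizer), and counted with <code class="language-plaintext highlighter-rouge">collections.Counter</code>. The helper name <code class="language-plaintext highlighter-rouge">bag_of_words</code> is hypothetical.</p>

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how many times each lower-cased word token appears in text."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


bow = bag_of_words("The cat sat on the mat. The mat was flat.")
print(bow.most_common(2))
```

<p>Lower-casing before counting makes "The" and "the" fall into the same bucket, which is usually what is wanted for word counts.</p>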
<h1 id="word-as-vectors">Word as Vectors</h1>
<h2 id="what-it-is-3">What it is</h2>
<p>A word vector is a multidimensional representation of a word. This
representation allows the distance between words to be analyzed, according to an
appropriate metric.</p>
<p>These vectors are built using a model that has been trained over a
myriad of texts.</p>
<h2 id="when-to-use-it-3">When to use it</h2>
<p>Word vectors can be used to understand relationships between words. For
instance, the offset between the vectors for man and woman is similar to the
offset between the vectors for king and queen, according to the metric obtained from the model.</p>
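<p>The relationship between the pairs (man, woman) and (king, queen) can be sketched with cosine similarity. The three-dimensional vectors below are invented toy values, not the output of any trained model, and the helper names are hypothetical; real word vectors typically have hundreds of dimensions.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, chosen so that the second axis separates the word pairs
vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 0.9],
    "queen": [1.0, 1.0, 0.9],
}


def offset(a, b):
    """Vector pointing from word a to word b."""
    return [y - x for x, y in zip(vectors[a], vectors[b])]


similarity = cosine_similarity(offset("man", "woman"), offset("king", "queen"))
print(similarity)
```

<p>A similarity close to 1 means the two offsets point in nearly the same direction, which is how the analogy shows up geometrically.</p>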
<h2 id="how-to-use-it-3">How to use it</h2>
<p>The <code class="language-plaintext highlighter-rouge">gensim</code> module for Python allows the construction of word
vectors. Gensim builds a corpus by mapping each token to an integer id and
representing each document by the ids and counts of the tokens it contains.</p>
<h1 id="tf-idf">Tf-idf</h1>
<h2 id="what-it-is-4">What it is</h2>
<p>Term Frequency – Inverse Document Frequency (tf-idf) is a measurement
of the number of occurrences of a specific token in a document
normalized by the number of documents that contain this token.
Mathematically, this is expressed as</p>
\[w_{i,j}=tf_{i,j}\log\left(\dfrac{N}{df_i}\right)\;,\]
<p>where</p>
<ul>
<li>$w_{i,j}$ is the weight for the token $i$ in document $j$</li>
<li>$tf_{i,j}$ is the number of occurrences of token $i$ in document $j$</li>
<li>$df_i$ is the number of documents that contain token $i$</li>
<li>$N$ is the total number of documents</li>
</ul>
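<p>The formula above can be sketched in plain Python as follows. This is a direct transcription under two assumptions: the logarithm is the natural one, and no normalization is applied. Library implementations such as gensim's may use a different base or normalize the weights, so their numbers need not match. The helper name <code class="language-plaintext highlighter-rouge">tf_idf</code> and the toy documents are made up.</p>

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(documents)
    # df[token] = number of documents that contain the token
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # tf[token] = occurrences of the token in this document
        weights.append(
            {tok: count * math.log(n_docs / df[tok]) for tok, count in tf.items()}
        )
    return weights


# Made-up toy corpus of three tokenized documents
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran", "ran"]]
w = tf_idf(docs)
print(w[2])
```

<p>A token that appears in every document, such as "the" here, gets weight zero because $\log(N/df_i)=\log(1)=0$.</p>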
<h2 id="when-to-use-it-4">When to use it</h2>
<p>Tf-idf is useful to analyze the distribution of a token across multiple
documents.</p>
<h2 id="how-to-use-it-4">How to use it</h2>
<p>The Python module Gensim provides a tf-idf model, imported as follows:<br />
<code class="language-plaintext highlighter-rouge">from gensim.models.tfidfmodel import TfidfModel</code><br />
The tf-idf class is instantiated using the corpus built with gensim:<br />
<code class="language-plaintext highlighter-rouge">tfidf = TfidfModel(corpus)</code></p>
<h1 id="name-entity-recognition">Named Entity Recognition</h1>
<h2 id="what-it-is-5">What it is</h2>
<p>It is the process of identifying tokens present in the text under
consideration as people, places, organizations, dates, states etc.</p>
<h2 id="when-to-use-it-5">When to use it</h2>
<p>It is employed when tokens need to be classified into categories.</p>
<h2 id="how-to-use-it-5">How to use it</h2>
<p>The Python modules <code class="language-plaintext highlighter-rouge">nltk</code> and <code class="language-plaintext highlighter-rouge">spaCy</code> (with its <code class="language-plaintext highlighter-rouge">displaCy</code> visualizer) allow the
classification of tokens into categories.</p>
<h1 id="references">References</h1>
<p>[1] <a href="http://www.pyregex.com/">http://www.pyregex.com/</a></p>Humberto STEIN SHIROMOTOh.stein.shiromoto@gmail.comNatural language processing (NLP) is a field of study dedicated to analyze of natural languages. In particular, using statistics and algorithms.Find Row Closest to a Value2020-06-25T00:00:00+10:002020-06-25T00:00:00+10:00https://hsteinshiromoto.github.io/posts/2020/06/25/blog-post_find_row_closest_value_to_input<p>In this gist, I find what is the closest row to a given value.</p>
<script src="https://gist.github.com/2cfc1173b01d333139c3a60291359c3e.js"> </script>Humberto STEIN SHIROMOTOh.stein.shiromoto@gmail.comIn this gist, I find what is the closest row to a given value.Jupyter Notebook Header2020-06-18T00:00:00+10:002020-06-18T00:00:00+10:00https://hsteinshiromoto.github.io/posts/2020/06/18/blog-post_jupyter_notebook_header<p>This gist contains my default settings for a Jupyter notebook as a header.</p>
<script src="https://gist.github.com/88d6d67ca10927de15501b4239fde8f2.js"> </script>Humberto STEIN SHIROMOTOh.stein.shiromoto@gmail.comThis gist contains my default settings for a Jupyter notebook as a header.Datetime Resample2020-06-11T00:00:00+10:002020-06-11T00:00:00+10:00https://hsteinshiromoto.github.io/posts/2020/06/11/blog-post_datetime_resample<p>In this gist, I aggregate the datetime column according to different periods (e.g. day, week, and month).</p>
<script src="https://gist.github.com/58b448e8398fd18ad374c940fab1436b.js"> </script>Humberto STEIN SHIROMOTOh.stein.shiromoto@gmail.comIn this gist, I aggregate the datetime column according to different periods (e.g. day, week, and month).Cumulative Sum with Pandas2020-06-04T00:00:00+10:002020-06-04T00:00:00+10:00https://hsteinshiromoto.github.io/posts/2020/06/04/blog-post_cumulative_sum_with_pandas<p>In this gist, I calculate the cumulative sum of the column <code class="language-plaintext highlighter-rouge">no</code>, based on the columns <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">day</code>.</p>
<script src="https://gist.github.com/19a87b5821760dbace62b4d42900fe2d.js"> </script>Humberto STEIN SHIROMOTOh.stein.shiromoto@gmail.comIn this gist, I calculate the cumulative sum of the column no, based on the columns name and day.Signing git commits with gpg2020-05-28T00:00:00+10:002020-05-28T00:00:00+10:00https://hsteinshiromoto.github.io/posts/2020/05/28/blog-post_signing_commits_with_gpg<p>I found this post on how to sign commits with GPG on Medium and copied it to my blog to keep for my records. Please visit the original source at:</p>
<p><a href="https://medium.com/better-programming/how-to-sign-your-git-commits-1014edaf1e85"><code class="language-plaintext highlighter-rouge">https://medium.com/better-programming/how-to-sign-your-git-commits-1014edaf1e85</code></a></p>
<p>Contents of this post:</p>
<ul>
<li><a href="#1-how-to-sign-your-git-commits">1. How to Sign Your Git Commits</a></li>
<li><a href="#2-why-sign-git-commits">2. Why Sign Git Commits?</a></li>
<li><a href="#3-cryptographic-signatures-and-gpg">3. Cryptographic Signatures and GPG</a>
<ul>
<li><a href="#31-asymmetric-cryptography">3.1. Asymmetric cryptography</a></li>
<li><a href="#32-about-signatures">3.2. About signatures</a></li>
<li><a href="#33-gpg-the-gnu-privacy-guard">3.3. GPG: the GNU privacy guard</a></li>
</ul>
</li>
<li><a href="#4-set-up-your-git-to-sign-commits">4. Set Up Your Git to Sign Commits</a>
<ul>
<li><a href="#41-install-gpg">4.1. Install GPG</a></li>
<li><a href="#42-generate-a-gpg-key-pair">4.2. Generate a GPG key pair</a></li>
<li><a href="#43-add-multiple-emails">4.3. Add multiple emails</a></li>
<li><a href="#44-configure-git-to-sign-your-commits">4.4. Configure Git to sign your commits</a></li>
<li><a href="#45-add-the-gpg-key-to-github">4.5. Add the GPG key to GitHub</a></li>
<li><a href="#46-make-a-signed-commit">4.6. Make a signed commit</a></li>
<li><a href="#47-configure-visual-studio-code-for-signing-commits">4.7. Configure Visual Studio Code for signing commits</a></li>
<li><a href="#48-using-hardware-tokens">4.8. Using hardware tokens</a></li>
</ul>
</li>
<li><a href="#5-a-troubleshoot-when-brew-updates">5. A Troubleshoot when Brew Updates</a></li>
<li><a href="#6-references">6. References</a></li>
</ul>
<h1 id="1-how-to-sign-your-git-commits">1. How to Sign Your Git Commits</h1>
<p><img src="https://miro.medium.com/max/1400/0*RKNk1lKu0vGzguDe.jpg" alt="" /></p>
<p>Even if you don’t know about signed Git commits, you might have seen the screen above on GitHub.</p>
<p>Let’s leave everything else aside for a moment: isn’t it oddly satisfying to have a large, green “Verified” badge on your work?</p>
<p>Making a commit verified, or to be more precise, signed, is not as hard as you might think. Just as it sounds, signed commits are, well, signed cryptographically using a GPG key.</p>
<h1 id="2-why-sign-git-commits">2. Why Sign Git Commits?</h1>
<p>Before we get into the <em>how</em> let’s talk for a moment about <em>why</em> you
should sign your Git commits. Besides the desire to get that green
“Verified” badge on your work on GitHub, there are some concrete
benefits.</p>
<p>When you commit a change with Git, it accepts as author whatever value
you want. This means you could claim to be whoever you want when you
create a commit.</p>
<p>For example, here’s a repo I just created. As you can see, my esteemed
colleague and friend <a href="https://twitter.com/martinwoodward">@MartinWoodward</a> from GitHub committed in it right away:</p>
<p><img src="https://miro.medium.com/max/1400/0*CPCqmuJmjGLNXeh3.jpg" alt="" /></p>
<p>There’s only one problem: Martin did not do that; I did.</p>
<p>To make GitHub (and everyone) believe that Martin authored that really
terrible commit, I just had to run <code class="language-plaintext highlighter-rouge">git config user.name</code> and <code class="language-plaintext highlighter-rouge">git config user.email</code> with values that match Martin's. Those are not hard to get at all. It only
took me one minute to clone one of his repos and then run <code class="language-plaintext highlighter-rouge">git log</code> in it.</p>
<p>From the point of view of Git, this is actually working as intended. The
committer details are designed just to identify who of your
collaborators made a change and are not meant to be used for
authenticating people. Being able to impersonate other committers does
not introduce a vulnerability per se. For example, just by setting my
<code class="language-plaintext highlighter-rouge">user.name</code> to Martin's, I do not get the
ability to push code to his repositories. GitHub would require me to
authenticate with his credentials before I could do that.</p>
<p>However, while this is not a security vulnerability per se, it can cause
other issues. When you see an unsigned commit, you have no guarantee
that:</p>
<ul>
<li>The author is really the person whose name is on the commit</li>
<li>The code change you see is really what the author wrote (i.e., it’s
not been tampered with)</li>
</ul>
<p>Making a habit of signing your Git commits instead gives you the ability
to prove that you were the author of a specific code change. It also
gives you the ability to ensure that no one can modify your commit (or
its metadata, such as the time you claimed it was made at) in the
future.</p>
<p>The more sensitive the code you’re working on (e.g., things related to
security, or mission-critical applications), the more you should pay
attention. Attacks on the software supply chain are getting more common,
and their potential consequences more dangerous. <a href="https://www.zdnet.com/article/fbi-warns-about-ongoing-attacks-against-software-supply-chain-companies/">The FBI has warned
us.</a></p>
<p>Here’s how two hypothetical attacks on the software supply chain could
look, with unsigned commits. First, imagine the case of a disgruntled
employee who might purposely want to introduce a backdoor into an app
they’re working on (on a repo they already have write access to), so
they impersonate one of their teammates when submitting the code, to
keep the blame away from themself.</p>
<p>Another example is someone creating a malicious pull request in an
open-source project. They could make it look like someone else (for
example, someone with a great reputation) co-authored it, to make it
more likely that the PR will be accepted (if you maintain open-source
libraries, you know how time-consuming it can be to fully, thoroughly
review every PR).</p>
<p>Please note that just because you sign your Git commits, it doesn’t stop
others from impersonating you. I have been regularly signing my commits
for about a year, but you could still make a code change and put my name
on it. There’s no way I can stop you from doing that. However, whoever
reads your code won’t see my digital signature (or the “Verified”
badge), and so they at least have the ability to question the
authenticity of that commit or its integrity. On the other hand, people
who do follow my repositories can see that I’ve authored all the commits
in the last year.</p>
<p>For your own projects, if your Git hosting service allows it, you can
also enforce a policy that all commits must be signed. On GitHub, that’s
done with <a href="https://help.github.com/en/github/administering-a-repository/about-required-commit-signing">protected
branches</a>.</p>
<h1 id="3-cryptographic-signatures-and-gpg">3. Cryptographic Signatures and GPG</h1>
<p>If you’ve never heard of cryptographic signatures or GPG, this brief,
simplified explanation might help you.</p>
<h2 id="31-asymmetric-cryptography">3.1. Asymmetric cryptography</h2>
<p>You might have heard that there are two main kinds of cryptographic
algorithms: symmetric and asymmetric ones. Symmetric cryptography is the
most understood one: first you encrypt your data using a passphrase, and
then you use the same passphrase to decrypt the message and get it in
clear-text again. If you want to share the encrypted data with another
person, you need to give them the passphrase too. This is how algorithms
like AES work, conceptually.</p>
<p>Asymmetric cryptography uses two separate keys: a public key and a
secret (or private) one. As their names suggest, while the secret key
must be protected at all costs, the public one can (and as will be our
case later on, must) be shared with the world. With asymmetric
cryptography, a message is encrypted using a public key and then
decrypted using the matching private one. If you wanted to share an encrypted
message with your friend, you’d use your friend’s public key to encrypt
it. Your friend could then use their own private key to decrypt and read
your message. Algorithms like RSA or the various elliptic curves work
this way. Despite being lesser-known among the general public,
asymmetric cryptography is widely used, and it’s what makes TLS used by
HTTPS possible, too, among other things.</p>
<p>In addition to encrypting data, asymmetric cryptography can also be used
to sign messages (and verify signatures). This works the opposite way:
You sign a message using your private key, and others can verify the
signature using your public key.</p>
<h2 id="32-about-signatures">3.2. About signatures</h2>
<p>When you sign a message, you’re adding a cryptographically strong proof
that you (or someone in possession of your private key) wrote it and
that the message was not tampered with.</p>
<p>For example, let’s say that you want to send a message to your friend
saying, “You and I will meet tomorrow at 11:30 a.m.” You want your
friend to be 100% sure that the message came from you, and you want to
make sure that no one can change its content (e.g., changing from 11:30
a.m. to 1 p.m.). You can do that by adding a cryptographic signature to
the message.</p>
<p>To do that, you have to do two things in principle:</p>
<ol>
<li>You calculate a hash (or checksum) of your message. You can use a
hashing function such as SHA-256. As you know, hashing functions are
one-way operations that generate a fixed-size, practically unique set of bytes from each
message, and they cannot be reversed. The hex-encoded SHA-256 digest
of “You and I will meet tomorrow at 11:30 a.m.” is:
<code class="language-plaintext highlighter-rouge">579c4547d8dec2c4513de8c858a490a8a2679db205a0b3471f81d5b129d29b88</code>. If you changed even just one bit in the
original message (e.g., change the time to 11:31 a.m.), the final
digest would be completely different (<a href="https://emn178.github.io/online-tools/sha256.html">try
it</a>).</li>
<li>You use your private key to sign the calculated hash, using
algorithms like RSA.</li>
</ol>
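<p>The two steps above can be sketched in Python with the standard library only. This is a toy illustration, not how GPG works internally: it uses textbook RSA without padding, a key built from two well-known Mersenne primes instead of random ones, and reduces the digest to fit the small modulus. All names (<code class="language-plaintext highlighter-rouge">sign</code>, <code class="language-plaintext highlighter-rouge">verify</code>, <code class="language-plaintext highlighter-rouge">digest_as_int</code>) are hypothetical, and nothing here should be used for real security.</p>

```python
import hashlib

# Toy key material: two known Mersenne primes keep the sketch short.
# Real keys use large random primes and a padding scheme.
p = 2**61 - 1
q = 2**89 - 1
n = p * q                           # public modulus
e = 65537                           # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (modular inverse, Python 3.8+)


def digest_as_int(message):
    """Step 1: hash the message with SHA-256 and reduce it to fit the toy modulus."""
    h = hashlib.sha256(message.encode("utf-8")).digest()
    return int.from_bytes(h, "big") % n


def sign(message):
    """Step 2: sign the hash with the private exponent."""
    return pow(digest_as_int(message), d, n)


def verify(message, signature):
    """Anyone holding the public key (n, e) can check the signature."""
    return pow(signature, e, n) == digest_as_int(message)


message = "You and I will meet tomorrow at 11:30 a.m."
signature = sign(message)
print(verify(message, signature))
print(verify("You and I will meet tomorrow at 1 p.m.", signature))
```

<p>Verification recomputes the hash of the received text, so any change to the message makes the check fail even though the signature itself is untouched.</p>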
<p>You can now send the signature together with the clear-text message, and
your friend will have no doubt that you were the one writing those
precise words.</p>
<p>Note that signatures are added to clear-text messages. Signing a message
alone does not encrypt it! So, anyone could still read your original
message and could see that you signed it. It is possible to use RSA to
both sign and encrypt a message, and that’s what’s called <em>authenticated
encryption</em>, but that’s outside the scope of this article.</p>
<h2 id="33-gpg-the-gnu-privacy-guard">3.3. GPG: the GNU privacy guard</h2>
<p>By now, I hope you at least have a general understanding of the idea
behind asymmetric cryptography. Let’s see how we can use it.</p>
<p>The <a href="https://www.openpgp.org/">OpenPGP</a>
standard contains specifications on algorithms, encodings, etc. for
real-world usage of solutions based on cryptography. Among the various
implementations of the OpenPGP standard, the most widely adopted one is
likely GPG (also known as GnuPG). This is a free, open source (libre)
application that works on Windows, macOS, and Linux as a command-line
tool. Countless tools and applications depend on GPG (or the standards
it uses) to deal with cryptography in a standardized, interoperable way.</p>
<p>One of the (many) things GPG does is give you the ability to sign
arbitrary messages or files. This works great with Git, and we’ll see
how in just a moment.</p>
<p>GPG is a really large tool, with a lot of different functionality, and
just like many things that are related to cryptography, it can get very
complicated very fast. Personally, I have been dealing with GPG for
various reasons for years, and I still have a partial understanding of
how it works. However, the good news is that signing Git commits is a
relatively simple operation, and after you set GPG up, you’ll be able to
forget it.</p>
<p>In addition to being a command-line tool, GPG also has a standard for
distributing public keys. Remember how I wrote that the public key not
only can but often needs to be distributed to the world? Public keys are
identified by an ID and map to a person’s email address(es), including
the ones used by GitHub.</p>
<p>For example, my public key’s ID is <code class="language-plaintext highlighter-rouge">0x30a525d4</code>, which also maps to my email address. One of the sub-keys, <code class="language-plaintext highlighter-rouge">0x4b33ea4c</code>, is used for signing,
and that's what is used to sign my Git commits too.</p>
<h1 id="4-set-up-your-git-to-sign-commits">4. Set Up Your Git to Sign Commits</h1>
<p>Ok, we’re finally ready to get started.</p>
<h2 id="41-install-gpg">4.1. Install GPG</h2>
<p>Besides Git, the only requirement is that you must have GPG installed. I
recommend using GPG version 2.2 or higher:</p>
<ul>
<li>
<p>On Windows, you can download the Gpg4win distribution from the <a href="https://gnupg.org/download/">GPG website</a>.</p>
</li>
<li>
<p>On macOS, the easiest thing is to use Homebrew: <code class="language-plaintext highlighter-rouge">brew install gpg</code>.</p>
</li>
<li>
<p>Most Linux distributions come with GPG pre-installed; if not, you can always find it in their official repositories.</p>
</li>
</ul>
<p>Note: In some Linux distributions, the application is called <code class="language-plaintext highlighter-rouge">gpg2</code>, so you might need to replace <code class="language-plaintext highlighter-rouge">gpg</code> with <code class="language-plaintext highlighter-rouge">gpg2</code> in the commands below. In
this case, you might also need to run <code class="language-plaintext highlighter-rouge">git config --global gpg.program $(which gpg2)</code>.</p>
<p><strong>For macOS only</strong></p>
<p>On macOS, you might also want to install a graphical pin entry application with <code class="language-plaintext highlighter-rouge">brew install pinentry-mac</code>, then add this line to <code class="language-plaintext highlighter-rouge">~/.gnupg/gpg-agent.conf</code> (if the file doesn't exist, create it):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pinentry-program /usr/local/bin/pinentry-mac
</code></pre></div></div>
<p><strong>Additional configuration for Linux and macOS</strong></p>
<p>On Linux and macOS, you can enable the GPG agent to avoid having to type
the secret key’s password every time. To do that, add this line to
<code class="language-plaintext highlighter-rouge">~/.gnupg/gpg.conf</code> (if the file doesn't exist, create it):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Enable gpg to use the gpg-agent</span>
use-agent
</code></pre></div></div>
<p>You will also need to add these two lines to your profile file
(<code class="language-plaintext highlighter-rouge">~/.bashrc</code>, <code class="language-plaintext highlighter-rouge">~/.bash_profile</code>, <code class="language-plaintext highlighter-rouge">~/.zprofile</code>, or wherever
appropriate), then re-launch your shell (or run <code class="language-plaintext highlighter-rouge">source ~/.bashrc</code> or similar):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">GPG_TTY</span><span class="o">=</span><span class="si">$(</span><span class="nb">tty</span><span class="si">)</span>
gpgconf <span class="nt">--launch</span> gpg-agent
</code></pre></div></div>
<h2 id="42-generate-a-gpg-key-pair">4.2. Generate a GPG key pair</h2>
<p>To start, generate a new GPG key pair (public and private):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gpg <span class="nt">--full-gen-key</span>
</code></pre></div></div>
<p>Configure the key with:</p>
<ul>
<li>Kind of key: type <code class="language-plaintext highlighter-rouge">4</code> for <code class="language-plaintext highlighter-rouge">(4) RSA (sign only)</code></li>
<li>Key size: <code class="language-plaintext highlighter-rouge">4096</code></li>
<li>Expiration: Choose a reasonable value, for example <code class="language-plaintext highlighter-rouge">2y</code> for two years (it can be renewed).</li>
</ul>
<p>Then answer a few questions:</p>
<ul>
<li>Real name: you could use your GitHub username here if you’d
like.</li>
<li>Email address: if you plan to use this key for more than just Git,
you might want to put your real email address. If it’s just for
GitHub, you can use the <code class="language-plaintext highlighter-rouge">@users.noreply.github.com</code> email that GitHub generates for you. You can find it on the
<a href="https://github.com/settings/emails">Email settings</a> page.</li>
</ul>
<p>You will be asked to type a passphrase which is used to encrypt your
secret key on disk. This is important; otherwise, attackers could steal
your secret key, and then they’d be able to sign messages and Git
commits pretending to be you.</p>
<p>You can verify your key was created with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>gpg <span class="nt">--list-secret-keys</span> <span class="nt">--keyid-format</span> SHORT
/root/.gnupg/pubring.kbx
<span class="nt">------------------------</span>
sec rsa4096/674CB45A 2020-05-16 <span class="o">[</span>SC] <span class="o">[</span>expires: 2022-05-16]
65B8A7455C949E73FC3B7330C16132F5674CB45A
uid <span class="o">[</span>ultimate] ItalyPaleAle-demo <43508+ItalyPaleAle@users.noreply.github.com>
</code></pre></div></div>
<p>In the example above, my new key ID is <code class="language-plaintext highlighter-rouge">rsa4096/674CB45A</code>, or just <code class="language-plaintext highlighter-rouge">674CB45A</code>.</p>
<p>You can confirm that GPG is working and able to sign messages with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"hello world"</span> | gpg <span class="nt">--clearsign</span>
</code></pre></div></div>
<p>Note: If your GPG agent is having issues, you can restart it with:</p>
<p><code class="language-plaintext highlighter-rouge">gpgconf --kill gpg-agent && gpgconf --launch gpg-agent</code></p>
<h2 id="43-add-multiple-emails">4.3. Add multiple emails</h2>
<p>You can add multiple email addresses by editing the key:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Replace 674CB45A with your key ID</span>
gpg <span class="nt">--edit-key</span> 674CB45A
</code></pre></div></div>
<p>Then, in the GPG prompt, type:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gpg> adduid
</code></pre></div></div>
<p>Again, type the real name and the email address you want to add. To
confirm, you’ll be asked to type the passphrase to decrypt the private
key.</p>
<p>Then, still in the GPG prompt, update the trust for the new identity:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Use the number of the UID of the identity</span>
gpg> uid 2
gpg> trust
<span class="c"># Type "5" (for "I trust ultimately")</span>
</code></pre></div></div>
<p>Lastly, save and exit with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gpg> save
</code></pre></div></div>
<h2 id="44-configure-git-to-sign-your-commits">4.4. Configure Git to sign your commits</h2>
<p>Once you have your private key, you can configure Git to sign your
commits with that:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Replace 674CB45A with your key ID</span>
git config <span class="nt">--global</span> user.signingkey 674CB45A
</code></pre></div></div>
<p>Now, you can sign Git commits and tags with:</p>
<ul>
<li>Add the <code class="language-plaintext highlighter-rouge">-S</code> flag when creating a commit:
<code class="language-plaintext highlighter-rouge">git commit -S</code></li>
<li>Create a tag with <code class="language-plaintext highlighter-rouge">git tag -s</code> rather than
<code class="language-plaintext highlighter-rouge">git tag -a</code></li>
</ul>
<p>You can also tell Git to automatically sign all your commits:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git config <span class="nt">--global</span> commit.gpgSign <span class="nb">true
</span>git config <span class="nt">--global</span> tag.gpgSign <span class="nb">true</span>
</code></pre></div></div>
<h2 id="45-add-the-gpg-key-to-github">4.5. Add the GPG key to GitHub</h2>
<p>In order for GitHub to accept your GPG key and show your commits as
“verified,” you first need to ensure that the email address you use when
committing a code change is both included in the GPG key and verified on
GitHub.</p>
<p>To set what email address Git uses when creating a commit, use:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git config <span class="nt">--global</span> user.email your@email.com
</code></pre></div></div>
<p>You can use your <code class="language-plaintext highlighter-rouge">@users.noreply.github.com</code>
email (from the <a href="https://github.com/settings/emails">Email settings</a> page on GitHub) or any other email address that is added to your GitHub account and verified (in the same settings page).</p>
<p>If it’s not already, that same email address must also be added to your
GPG key, as per instructions above.</p>
<p>Once you’ve done it, upload your public GPG key to GitHub and associate
it with your account. In the <a href="https://github.com/settings/keys">SSH and GPG Keys
settings</a>
page, add a new GPG key and paste your public key, which you can get
with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Replace 674CB45A with your key ID</span>
gpg <span class="nt">--armor</span> <span class="nt">--export</span> 674CB45A
</code></pre></div></div>
<p>Your public GPG key begins with
<code class="language-plaintext highlighter-rouge">-----BEGIN PGP PUBLIC KEY BLOCK-----</code> and ends
with <code class="language-plaintext highlighter-rouge">-----END PGP PUBLIC KEY BLOCK-----</code>.</p>
<h2 id="46-make-a-signed-commit">4.6. Make a signed commit</h2>
<p>After configuring all of the above, your Git commits can now be signed
with your GPG key:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Add the -S flag if you did not configure Git to sign commits by default</span>
git commit <span class="nt">-a</span> <span class="nt">-m</span> <span class="s2">"Making my first signed commit"</span>
</code></pre></div></div>
<p>You can check that the commit was signed with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git log <span class="nt">--show-signature</span> <span class="nt">-1</span>
commit 8beed807e820d34cc7a35a0d69e9913bed7b1b03 <span class="o">(</span>HEAD -> master<span class="o">)</span>
gpg: Signature made Sun May 17 01:44:55 2020 UTC
gpg: using RSA key 674CB45A
gpg: Good signature from <span class="s2">"ItalyPaleAle-demo <43508+ItalyPaleAle@users.noreply.github.com>"</span> <span class="o">[</span>ultimate]
Author: ItalyPaleAle-demo <43508+ItalyPaleAle@users.noreply.github.com>
Date: Sun May 17 01:44:55 2020 +0000
Making my first signed commit
</code></pre></div></div>
<h2 id="47-configure-visual-studio-code-for-signing-commits">4.7. Configure Visual Studio Code for signing commits</h2>
<p>If you’re using <a href="https://code.visualstudio.com">VS Code</a>, you can configure it to sign your Git commits with the “Git:
Enable commit signing” setting (<code class="language-plaintext highlighter-rouge">git.enableCommitSigning</code>).</p>
<p><img src="https://miro.medium.com/max/1400/0*DJbQRNcfTmiup3g9.png" alt="" /></p>
<h2 id="48-using-hardware-tokens">4.8. Using hardware tokens</h2>
<p>Your GPG secret key is now stored (encrypted) in your GPG keyring inside
your laptop. While this should provide enough protection for most users,
it is still possible to export it and thus steal it. Given that the key
is encrypted with a passphrase, your key is as safe as the passphrase
(choose it wisely!).</p>
<p>Additionally, having a private key in a file leaves open questions of
how to (securely) back it up and possibly sync it across multiple
devices. <a href="https://security.stackexchange.com/questions/51771/where-do-you-store-your-personal-private-gpg-key">This
Q&A</a> on Stack Exchange Information Security contains various ideas, although it’s a bit dated. Services like <a href="https://keybase.io/">Keybase</a> can help store
your secret keys on a dedicated cloud service.</p>
<p>A safer alternative, however, is to use a hardware token, for example
security keys such as a <a href="https://www.yubico.com/">YubiKey</a>. This is what I use too. Among the various technologies a YubiKey supports, it can store a GPG key in a secure enclave, from where
it cannot be extracted.</p>
<p>Setting up a YubiKey for its various functions, including storing a GPG
key (and using that for signing Git commits or for connecting to an SSH
server), takes a bit of time. If you just got a YubiKey and want to know
how to best set it up, I highly recommend <a href="https://github.com/drduh/YubiKey-Guide">this guide from
@drduh</a> published on GitHub.</p>
<h1 id="5-a-troubleshoot-when-brew-updates">5. A Troubleshoot when Brew Updates</h1>
<p>A few days ago, I tried to commit and got the following error message:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gpg failed to sign the data
fatal: failed to write commit object
</code></pre></div></div>
<p>I found the following answer on Stack Overflow, which solves this problem.</p>
<p>Upgrade to gpg2 with Homebrew:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>brew upgrade gnupg
<span class="nv">$ </span>brew link <span class="nt">--overwrite</span> gnupg
<span class="nv">$ </span>brew install pinentry-mac
<span class="nv">$ </span><span class="nb">echo</span> <span class="s2">"pinentry-program /usr/local/bin/pinentry-mac"</span> <span class="o">>></span> ~/.gnupg/gpg-agent.conf
<span class="nv">$ </span>killall gpg-agent
</code></pre></div></div>
<p>To check that the installation went well, run the following command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">echo</span> <span class="s2">"test"</span> | gpg <span class="nt">--clearsign</span>
</code></pre></div></div>
<p>If this test is successful (no error, and the output includes a PGP signature), you have successfully updated to the latest gpg version.</p>
<p>Note that Git should also be configured to use it:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git config <span class="nt">--global</span> gpg.program gpg <span class="c"># perhaps you had this already?</span>
<span class="nv">$ </span>git config <span class="nt">--global</span> commit.gpgsign <span class="nb">true</span> <span class="c"># To sign every commit</span>
</code></pre></div></div>
<h1 id="6-references">6. References</h1>
<p><a href="https://medium.com/better-programming/how-to-sign-your-git-commits-1014edaf1e85">https://medium.com/better-programming/how-to-sign-your-git-commits-1014edaf1e85</a></p>
<p><a href="https://stackoverflow.com/questions/39494631/gpg-failed-to-sign-the-data-fatal-failed-to-write-commit-object-git-2-10-0">https://stackoverflow.com/questions/39494631/gpg-failed-to-sign-the-data-fatal-failed-to-write-commit-object-git-2-10-0</a></p>
Humberto STEIN SHIROMOTO
h.stein.shiromoto@gmail.com
I found this post on how to sign commits with gpg on Medium, and I copied to my blog so I can keep for my records. Please, visit the original source at:
Hypothesis Tests Part 2: Statistical Inference
2020-05-21T00:00:00+10:00
2020-05-21T00:00:00+10:00
https://hsteinshiromoto.github.io/posts/2020/05/21/blog-post_hypothesis_tests_part_2_statistical_inference
<p>In this post, I present an overview of statistical tests. The goal of calculating a test statistic is to decide whether the null hypothesis should be rejected. Once the value of the test statistic is obtained, it is compared with a pre-defined critical value. If the test statistic is found to be greater than the critical value, then the null hypothesis is rejected.</p>
<ul>
<li><a href="#1-the-statistical-inference-setting">1. The Statistical Inference Setting</a></li>
<li><a href="#2-tests">2. Tests</a>
<ul>
<li><a href="#21-t-test">2.1. T-test</a>
<ul>
<li><a href="#211-population-and-sample">2.1.1. Population and Sample</a></li>
<li><a href="#212-different-samples">2.1.2. Different Samples</a></li>
</ul>
</li>
<li><a href="#22-anova">2.2. ANOVA</a></li>
<li><a href="#23-chi2-test">2.3. $\chi^2$ Test</a></li>
</ul>
</li>
<li><a href="#3-errors">3. Errors</a></li>
<li><a href="#4-dealing-with-non-normal-distributions">4. Dealing with Non-normal Distributions</a></li>
<li><a href="#5-further-reading">5. Further Reading</a></li>
</ul>
<h1 id="1-the-statistical-inference-setting">1. The Statistical Inference Setting</h1>
<p>A null hypothesis proposes that no statistically significant difference exists between two sets of observations. For example:</p>
<ul>
<li>$H_0$: The means of the two sets are equal</li>
<li>$H_1$: The means of the two sets are not equal</li>
</ul>
<p>To decide whether the null hypothesis must be rejected, a test statistic is calculated. This number is compared with the <em>critical value</em>, a pre-defined threshold used to decide whether the null hypothesis must be rejected.</p>
<p>In general, critical values are the boundaries of <em>critical regions</em>. If the value of the test statistic falls within one of these regions, then the null hypothesis is rejected.</p>
<p>If the considered data follow a set of assumptions regarding their distribution, the rejection of the null hypothesis can be interpreted as follows: if the null hypothesis were true, the probability of observing a test statistic at least as extreme as the one obtained would be smaller than a probability $\alpha$. Therefore, this hypothesis must be rejected.</p>
<p>From the statistical inference viewpoint, hypothesis testing is the process that establishes whether the measurement of a given statistic, such as the sample mean, is consistent with a theoretical distribution. The process of hypothesis testing requires a considerable amount of care in the definition of the hypothesis to test and in drawing conclusions.</p>
<p>Let $E\subset\mathbb{R}^n$, and let $f:E\times\mathbb{R}^n\to\mathbb{R}$ be a probability density function. Every hypothesis consists of an assumption about the parameter $\lambda$ of the distribution $x\mapsto f(x, \lambda)$.</p>
<p>The null hypothesis $H_0$ is made with respect to the parameter $\lambda$ and formulated as</p>
<p>\(H_0: \lambda=\lambda_0\;,\)
for a given $\lambda_0\in\mathbb{R}^n$. The alternative hypothesis is given as</p>
\[H_1: \lambda\neq\lambda_0\;.\]
<p>Since the null hypothesis makes a statement about the probability density in the sample space, it also predicts the probability for observing a point $x\in E$. This probability is used to define the rejection region $S_c\subset E$ with a significance level $\alpha\in(0,1)$ as</p>
\[P(x\in S_c\mid H_0)=\alpha\;.\]
<p>In other words, $S_c$ is chosen such that the probability of observing a point $x$ within $S_c$ equals $\alpha$, under the assumption that $H_0$ is true. If the point $x$ from the sample actually falls into the region $S_c$, then the hypothesis $H_0$ is rejected. Note that this equation does not define the critical region $S_c$ uniquely.</p>
<p>A closely related quantity is the <strong>p-value</strong>: the probability, under the null hypothesis $H_0$, of observing a value equal to or more extreme than the value actually observed. The null hypothesis is rejected at significance level $\alpha$ when the p-value is smaller than $\alpha$.</p>
<p>In practice, the distribution $f$ is not available due to the lack of knowledge of the population. Instead, one constructs a test statistic $T:\mathbb{R}^n\to\mathbb{R}$ and defines a region $U$ of the variable $T$ that corresponds to the critical region $S_c$, i.e., $x\mapsto T(x)$ and $S_c\mapsto U$. The null hypothesis is rejected whenever $T(x)\in U$.</p>
<p>Hypothesis testing by statistical inference can be summarized in three steps.</p>
<ol>
<li>
<p>Determine the statistic to use for the null hypothesis. The choice of statistic means that we are in a position to use the theoretical distribution function for that statistic to tell whether the actual measurements are consistent with its expected distribution, according to the null hypothesis.</p>
</li>
<li>
<p>Determine the probability or confidence level for the agreement between the statistic and its expected distribution under the null hypothesis. This confidence level defines a range of values for the statistic that are consistent with its expected distribution. This range is called the <em>acceptable region</em> for the statistic. Values of the statistic outside of the acceptable range define the <em>rejection region</em>.</p>
</li>
<li>
<p>At this point two cases are possible:</p>
<p>3.1. The measured value of the statistic falls into the rejection region. This means that the distribution function of the statistic of interest, under the null hypothesis, does not allow the measured value at the significance level $\alpha$. In this case the null hypothesis must be rejected at the stated significance level $\alpha$.</p>
<p>3.2. The measured value of the statistic is within the acceptable region. In this case the null hypothesis cannot be rejected. Sometimes this situation can be referred to as the null hypothesis being acceptable. This is, however, not the same as stating that the null hypothesis is the correct hypothesis and that the null hypothesis is accepted. In fact, there could be other hypotheses that could be acceptable and one cannot be certain that the null hypothesis tested represents the parent model for the data.</p>
</li>
</ol>
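<p>The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a two-sided z-test with a known population standard deviation, and the function name <code class="language-plaintext highlighter-rouge">z_test</code>, the sample values, and the choice $\alpha=0.05$ are hypothetical.</p>

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, assuming sigma is known."""
    n = len(sample)
    # Step 1: choose the statistic; z is standard normal under H0.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Step 2: the critical value bounds the acceptable region (-z_crit, z_crit).
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 3: reject H0 only if the statistic falls in the rejection region.
    return z, z_crit, abs(z) > z_crit


# Hypothetical measurements, used only to exercise the function.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.9]
z, z_crit, reject = z_test(sample, mu0=5.0, sigma=0.5)
print(f"z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

<p>For this made-up sample the statistic is roughly 3.04, beyond the critical value of roughly 1.96, so the null hypothesis is rejected at the 5% significance level.</p>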
<h1 id="2-tests">2. Tests</h1>
<h2 id="21-t-test">2.1. T-test</h2>
<h3 id="211-population-and-sample">2.1.1. Population and Sample</h3>
<p>In this scenario, the samples are assumed to be taken from a normally distributed population. The hypothesis to be tested is whether the population mean $\mu$ is equal to a predefined value $\mu_0$:</p>
<ul>
<li>$H_0$: $\mu=\mu_0$</li>
<li>$H_1$: $\mu\neq\mu_0$</li>
</ul>
<p>When the population variance is unknown, it can only be estimated from the data via the sample variance $s^2$, and it is necessary to allow for this uncertainty when describing the distribution of the sample mean $\bar{x}$. This additional uncertainty changes the distribution of the test statistic from the simple Gaussian shape to the Student’s t-distribution with $n-1$ degrees of freedom, where $n$ is the sample size. The test statistic is</p>
\[T=\dfrac{\bar{x}-\mu_0}{s/\sqrt{n}}\]
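<p>As a small sketch of the formula above, the statistic can be computed with the Python standard library alone. The function name <code class="language-plaintext highlighter-rouge">one_sample_t</code> and the data are hypothetical; the critical value 2.262 is the usual two-sided 5% table value for 9 degrees of freedom and is hard-coded here rather than computed.</p>

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """T = (sample mean - mu0) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    # statistics.stdev uses the unbiased (n - 1) estimator of the variance.
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))


# Hypothetical measurements (n = 10, so 9 degrees of freedom).
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.9, 5.2, 5.5]
t_stat = one_sample_t(sample, mu0=5.0)

# Two-sided 5% critical value for 9 degrees of freedom, taken from a t-table.
T_CRIT = 2.262
reject = abs(t_stat) > T_CRIT
print(f"T = {t_stat:.3f}, reject H0: {reject}")
```

<p>With these numbers the statistic is about 4.33, so $H_0$ would be rejected at the 5% level.</p>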
<h3 id="212-different-samples">2.1.2. Different Samples</h3>
<p>Given two groups (1, 2), this test is only applicable when:</p>
<ul>
<li>the two sample sizes (that is, the number $n$ of participants in each group) are equal;</li>
<li>it can be assumed that the two distributions have the same variance.</li>
</ul>
<p>The $t$ statistic to test whether the means are different can be calculated as follows:</p>
\[T=\dfrac{\bar{x}_1-\bar{x}_2}{s_p\sqrt{2/n}}\;,\]
<p>where</p>
\[s_p^2=\dfrac{s_1^2+s_2^2}{2}\]
<p>and $s_p$ is the pooled standard deviation for $n = n_1 = n_2$, while $s_1^2$ and $s_2^2$ are the unbiased estimators of the variances of the two samples.</p>
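<p>The equal-size, equal-variance statistic can be sketched as follows. The helper name <code class="language-plaintext highlighter-rouge">two_sample_t</code> and both groups of values are made up for illustration; the critical value 2.228 is the usual two-sided 5% table value for $2n-2=10$ degrees of freedom.</p>

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x1, x2):
    """T for two samples of equal size n, assuming equal population variances."""
    if len(x1) != len(x2):
        raise ValueError("this form of the test requires equal sample sizes")
    n = len(x1)
    # Pooled standard deviation: s_p^2 = (s_1^2 + s_2^2) / 2.
    s_p = sqrt((variance(x1) + variance(x2)) / 2)
    return (mean(x1) - mean(x2)) / (s_p * sqrt(2 / n))


# Hypothetical measurements for two groups of six participants each.
group_1 = [20.1, 19.8, 21.0, 20.5, 19.9, 20.7]
group_2 = [18.9, 19.5, 19.0, 19.7, 18.8, 19.3]
t_stat = two_sample_t(group_1, group_2)

# Two-sided 5% critical value for 10 degrees of freedom, taken from a t-table.
T_CRIT = 2.228
reject = abs(t_stat) > T_CRIT
print(f"T = {t_stat:.3f}, reject H0: {reject}")
```

<p>Swapping the two groups only flips the sign of $T$, which is why the comparison uses the absolute value.</p>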
<h2 id="22-anova">2.2. ANOVA</h2>
<p>ANOVA (analysis of variance) is used to analyze the differences among group means in a sample. The hypotheses being tested in ANOVA are</p>
<ul>
<li>Null: All groups have the same mean</li>
<li>Alternate: At least one group has a statistically significantly different mean</li>
</ul>
<p>The statistic that measures the significance is called the F-statistic, which compares the variability between groups with the variability within groups. For comparing $K$ groups, the formula for the F-statistic is given as</p>
\[F=\dfrac{\text{between-group variability}}{\text{within-group variability}}\;.\]
<p>The component between-group variability is calculated as</p>
\[\text{between-group variability}=\dfrac{1}{K-1}\sum_{i=1}^K n_i(\bar{x}_i-\bar{x})^2\;,\]
<p>where</p>
<ul>
<li>$\bar{x}_i$ is the sample mean of group $i$.</li>
<li>$n_i$ is the number of observations in the $i$-th group.</li>
<li>$\bar{x}$ is the overall mean of the data.</li>
<li>$K$ is the number of groups.</li>
</ul>
<p>The component within-group variability is computed by the formula</p>
\[\text{within-group variability}=\sum_{i=1}^K\sum_{j=1}^{n_i}\dfrac{(x_{ij}-\bar{x}_i)^2}{N-K}\;,\]
<p>where</p>
<ul>
<li>$x_{ij}$ is the observation $j$ in group $i$.</li>
<li>$N$ is the overall sample size.</li>
</ul>
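<p>The two variability terms translate almost line by line into code. The sketch below assumes each group is given as a list of numbers; the function name <code class="language-plaintext highlighter-rouge">f_statistic</code> and the three small groups are hypothetical.</p>

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F = between-group variability / within-group variability."""
    K = len(groups)                           # number of groups
    N = sum(len(g) for g in groups)           # overall sample size
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]

    # Between-group variability: weighted squared deviations of the group means.
    between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    ) / (K - 1)
    # Within-group variability: squared deviations from each group's own mean.
    within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (N - K)
    return between / within


# Hypothetical data: three groups of three observations.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
F = f_statistic(groups)
print(f"F = {F:.2f}")
```

<p>Here the between-group term is 13 and the within-group term is 1, giving $F=13$. The value would then be compared with the critical value of the F-distribution with $K-1$ and $N-K$ degrees of freedom.</p>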
<h2 id="23-chi2-test">2.3. $\chi^2$ Test</h2>
<p>The $\chi^2$ distribution arises as the distribution of the sum of the squares of $k$ independent standard normal random variables, i.e., $X_1^2 + X_2^2 + \cdots + X_k^2\sim\chi^2(k)$.</p>
<p>In this test, the test statistic is $\chi^2$ distributed under the null hypothesis. This test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories. For this reason, it can be applied to compare categorical variables. Also, the $\chi^2$ test can be used to test whether the variance of the population has a pre-determined value.</p>
<p>For a test of independence, the hypotheses being tested are</p>
<ul>
<li>Null: Variable A and Variable B are independent</li>
<li>Alternate: Variable A and Variable B are not independent.</li>
</ul>
<p>There are two types of $\chi^2$ tests:</p>
<ol>
<li>
<p>Goodness of fit test, which determines whether a sample matches the population.</p>
<p>a. A small chi-square value means that the data fit.</p>
<p>b. A high chi-square value means that the data do not fit.</p>
</li>
<li>
<p>Test of independence, which compares two variables in a contingency table to check whether they are related.</p>
</li>
</ol>
<p>The formula used for calculating the statistic is</p>
\[\chi^2=\sum_{i=1}^n\dfrac{(O_i-E_i)^2}{E_i}\;,\]
<p>where</p>
<ul>
<li>$O_i$ is the number of observations of type $i$</li>
<li>$E_i$ is the expected count of type $i$</li>
</ul>
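<p>As a minimal goodness-of-fit sketch, the statistic can be computed directly from observed and expected counts. The function name <code class="language-plaintext highlighter-rouge">chi_square_statistic</code> and the die-roll counts are hypothetical; the critical value 11.070 is the usual 5% table value for 5 degrees of freedom.</p>

```python
def chi_square_statistic(observed, expected):
    """Sum over categories of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for the six faces of a die rolled 100 times.
observed = [16, 18, 16, 14, 12, 24]
# Under H0 (fair die) every face is expected 100 / 6 times.
expected = [sum(observed) / 6] * 6

chi2 = chi_square_statistic(observed, expected)

# 5% critical value for 6 - 1 = 5 degrees of freedom, taken from a chi-square table.
CHI2_CRIT = 11.070
reject = chi2 > CHI2_CRIT
print(f"chi2 = {chi2:.2f}, reject H0: {reject}")
```

<p>The statistic for these counts is 5.12, below the critical value, so the hypothesis of a fair die would not be rejected at the 5% level.</p>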
<h1 id="3-errors">3. Errors</h1>
<p>Because of the statistical nature of the sample, the testing process can lead to the following errors:</p>
<ol>
<li><em>Type I Error</em>: The null hypothesis is true but the decision based on the testing process is that the null hypothesis should be rejected.</li>
<li><em>Type II Error</em>: The null hypothesis is false but the testing process fails to reject it.</li>
</ol>
<p>The probability of a Type I error (denoted by $\alpha$) is also called the significance level of the test. The Type II error occurs if one does not reject the hypothesis $H_0$ because $X$ was not in the critical region $S_c$, even though the hypothesis was actually false and an alternative hypothesis was true. This error is formalized as,</p>
\[P(X\notin S_c\mid H_1)=\beta\;.\]
<p>The following table summarizes the two types of errors.</p>
<table>
<thead>
<tr>
<th> </th>
<th>H0 is True</th>
<th>H0 is False</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>H0 is not rejected</strong></td>
<td>Correct</td>
<td>Type II Error</td>
</tr>
<tr>
<td><strong>H0 is rejected</strong></td>
<td>Type I Error</td>
<td>Correct</td>
</tr>
</tbody>
</table>
<p>This connection with the alternative hypothesis $H_1$ provides us with a method to specify the critical region $S_c$. A test is clearly most reasonable if for a given significance level $\alpha$ the critical region is chosen such that the probability $\beta$ for an error of the second kind is a minimum. The critical region and therefore the test itself naturally depend on the alternative hypothesis under consideration.</p>
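<p>To make the role of $\beta$ concrete, the sketch below computes it for one specific setting that is an assumption of this example, not something fixed by the text: a two-sided z-test with known standard deviation and a simple alternative $\mu=\mu_1$. The function name <code class="language-plaintext highlighter-rouge">type_ii_error</code> and all numbers are hypothetical.</p>

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(mu0, mu1, sigma, n, alpha=0.05):
    """beta = P(statistic not in the critical region | true mean is mu1)."""
    std_normal = NormalDist()
    # Critical value fixed by the significance level alpha (Type I error rate).
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, the z statistic is normal with this mean and unit variance.
    shift = (mu1 - mu0) * sqrt(n) / sigma
    # Probability that the statistic stays inside the acceptable region (-z_crit, z_crit).
    return std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)


beta = type_ii_error(mu0=0.0, mu1=0.5, sigma=1.0, n=25)
print(f"beta = {beta:.4f}, power = {1 - beta:.4f}")
```

<p>With these values $\beta$ is close to 0.29. Increasing the sample size while keeping $\alpha$ fixed lowers $\beta$, which is one way to see how the test depends on the alternative under consideration.</p>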
<h1 id="4-dealing-with-non-normal-distributions">4. Dealing with Non-normal Distributions</h1>
<p>As the reader may have noticed, all these tests assume that the population is normally distributed. When this is not the case, it is necessary to normalize the data. For more information on how to deal with non-normal distributions, see <a href="https://www.statisticshowto.com/probability-and-statistics/non-normal-distributions/">https://www.statisticshowto.com/probability-and-statistics/non-normal-distributions/</a>.</p>
<h1 id="5-further-reading">5. Further Reading</h1>
<ul>
<li>[1] M. Bonamente, “Statistics and Analysis of Scientific Data”, Springer, 2017</li>
<li>[2] S. Brandt, “Data Analysis”, Springer, 2014</li>
<li>[3] L.-G. Johansson, “Philosophy of Science for Scientists”, Springer, 2016</li>
</ul>
Humberto STEIN SHIROMOTO
h.stein.shiromoto@gmail.com
Hypothesis Tests Part 1: Bayesian Inference
2020-04-17T00:00:00+10:00
2020-04-17T00:00:00+10:00
https://hsteinshiromoto.github.io/posts/2020/04/17/blog-post_hypothesis_tests_part_1_bayesian_inference
<p>Every quantity that is estimated from data, such as the mean or the variance, is subject to measurement uncertainty arising from data collection. If a different sample of measurements is collected, statistical fluctuations will give rise to a different set of values, even if the experiments are performed under the same conditions. The use of different data samples to measure the same quantity results in a sampling distribution that characterizes the quantity under consideration. This distribution is used to characterize the “true” value of that quantity. This blog post presents how the collected data are employed to test hypotheses about the quantity being measured.</p>
<p>Hypothesis testing is the process that establishes whether the measurement of a given quantity, such as the mean, is consistent with another set of observations or with a distribution. The process of hypothesis testing requires a considerable amount of care in the definition of the hypothesis to test and in drawing conclusions. Hypothesis tests can be divided into two main schools of thought, based on Bayesian or statistical inference. [1]</p>
<p>A hypothesis $H$ to be tested can be formulated as follows:</p>
<ul>
<li>$H_0$ (null hypothesis): The quantity in consideration is no different from the other set</li>
<li>$H_1$ (alternative hypothesis): The quantity in consideration is different from the other set</li>
</ul>
<p>In this blog post, I present an example of hypothesis testing using Bayesian inference. In a future version, I will present an example using statistical inference.</p>
<p>Table of Contents:</p>
<ul>
<li><a href="#1-general-overview">1. General Overview</a></li>
<li><a href="#2-example">2. Example</a></li>
<li><a href="#3-references">3. References</a></li>
<li><a href="#4-further-reading">4. Further Reading</a></li>
<li><a href="#5-appendix">5. Appendix</a>
<ul>
<li><a href="#51-code-to-generate-the-prior-distribution">5.1. Code to generate the prior distribution</a></li>
</ul>
</li>
<li><a href="#6-code-to-calculate-the-conditional-probability-of-the-null-hypothesis">6. Code to calculate the conditional probability of the null hypothesis</a></li>
</ul>
<h1 id="1-general-overview">1. General Overview</h1>
<p>In the Bayesian inference framework, the problem is formulated as follows: The goal is to calculate the conditional probability for a hypothesis being true, given a certain outcome of an experiment or measurement</p>
\[P(H_0|O)=\dfrac{P(O|H_0)P(H_0)}{P(O)}\]
<p>where $P(O|H_0)$ is the likelihood of the observation under the null hypothesis, $P(H_0)$ is the prior probability of the null hypothesis, $P(O)$ is the probability of the observation, and</p>
\[P(H_0|O)\]
<p>is the probability of the null (respectively, alternative) hypothesis being true given the outcome.</p>
<p>By fixing a <em>critical value</em> $\alpha$, one can verify whether the probability of the hypothesis</p>
\[P(H_0|O)\]
<p>is greater or smaller than $\alpha$. One then accepts the null hypothesis if</p>
\[P(H_0|O)\geq\alpha\;.\]
<h1 id="2-example">2. Example</h1>
<p>Consider a coin that has been tossed 100 times. Given that the number of tails is 70, is this coin fair?</p>
<p>Assumptions:</p>
<ul>
<li>Only two outcomes are possible: heads or tails</li>
<li>Coin toss does not affect other tosses, i.e. coin tosses are independent of each other.</li>
<li>All coin tosses come from the same distribution.</li>
</ul>
<p>Thus, the coin tosses are independent and identically distributed (iid). Under these assumptions, the selected likelihood is the binomial distribution:</p>
\[P(y|\theta, N)=\dfrac{N!}{y!(N-y)!}\theta^y(1-\theta)^{N-y}\;,\]
<p>where</p>
<ul>
<li>$y$ is the number of tails</li>
<li>$\theta$ is the probability of tails</li>
<li>$N$ is the number of tosses</li>
</ul>
<p>The hypothesis to be tested is the following:</p>
<ul>
<li>$H_0$: $\theta=0.5$</li>
<li>$H_1$: $\theta\neq0.5$</li>
</ul>
<p>The hypothesis $H_0$ will be rejected if</p>
\[P(H_0|O)\leq0.5 = \alpha.\]
<p>Under this hypothesis formulation, with the observations given by $y$ and $N$, the conditional probability of $H_0$ is</p>
\[P(H_0|O) = P(\theta=0.5|N=100,y=70)\]
<p>The conditional probability of $\theta$ is calculated as</p>
\[P(\theta|N,y) = \dfrac{P(y|\theta,N)P(\theta)}{\displaystyle\int_0^1 P(y|\theta,N)P(\theta)\,d\theta}\;.\]
<p>The prior distribution is chosen as the beta distribution</p>
\[P(\theta)=\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\;,\]
<p>where</p>
<ul>
<li>$\Gamma$ is the gamma function</li>
<li>$\alpha$ and $\beta$ are parameters.</li>
</ul>
<p>The following graph plots the prior for different choices of the parameters $\alpha$ and $\beta$.</p>
<p><img src="https://raw.githubusercontent.com/hsteinshiromoto/blog/master/notebooks/statistical_tests/prior_distribution.svg" alt="svg" /></p>
<p>The posterior probability is</p>
\[P(\theta|N,y)=\dfrac{\theta^{y+\alpha-1}(1-\theta)^{N-y+\beta-1}}{\displaystyle \int_0^1\theta^{y+\alpha-1}(1-\theta)^{N-y+\beta-1}\;d\theta}\]
<p>By letting $\alpha=\beta=1$, the prior distribution becomes a uniform distribution, and the probability of $\theta=0.5$ is upper bounded by</p>
\[P(0.49\leq\theta\leq0.51|N=100,y=70)=\dfrac{\displaystyle\int_{0.49}^{0.51}\theta^{70}(1-\theta)^{30}\;d\theta}{\displaystyle\int_0^1\theta^{70}(1-\theta)^{30}\;d\theta} \;=5.158837081428554\times10^{-5}\;.\]
<p>Since the probability of the null hypothesis is less than 50%, hypothesis $H_0$ is rejected.</p>
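<p>The interval probability above can be checked numerically with the standard library alone. This is a sketch, not the code used to produce the number in this post: it integrates the Beta posterior with Simpson's rule, and the names <code class="language-plaintext highlighter-rouge">posterior_pdf</code> and <code class="language-plaintext highlighter-rouge">simpson</code> are hypothetical helpers introduced here.</p>

```python
from math import exp, lgamma, log


def posterior_pdf(theta, y, N, a=1.0, b=1.0):
    """Posterior density of theta: a Beta(y + a, N - y + b) distribution."""
    # Log of the normalizing integral, i.e. the Beta function B(y + a, N - y + b).
    log_norm = lgamma(y + a) + lgamma(N - y + b) - lgamma(N + a + b)
    log_kernel = (y + a - 1) * log(theta) + (N - y + b - 1) * log(1 - theta)
    return exp(log_kernel - log_norm)


def simpson(f, lo, hi, steps=1000):
    """Composite Simpson's rule; `steps` must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3


# Uniform prior (a = b = 1), 70 tails out of 100 tosses.
p_null = simpson(lambda t: posterior_pdf(t, y=70, N=100), 0.49, 0.51)
print(f"P(0.49 <= theta <= 0.51 | data) = {p_null:.4e}")
print("reject H0:", p_null <= 0.5)
```

<p>The result should agree with the value quoted above, roughly $5.16\times10^{-5}$, which is far below the threshold of 0.5.</p>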
<h1 id="3-references">3. References</h1>
<ul>
<li>[1] L.-G. Johansson. “Philosophy of Science for Scientists”. Springer. 2016</li>
</ul>
<h1 id="4-further-reading">4. Further Reading</h1>
<ul>
<li>O. Martin. “Bayesian Analysis with Python”. Packt. 2ed. 2018</li>
<li>A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, D. B. Rubin. “Bayesian Data Analysis”. CRC Press. 3ed. 2014</li>
</ul>
<h1 id="5-appendix">5. Appendix</h1>
<h2 id="51-code-to-generate-the-prior-distribution">5.1. Code to generate the prior distribution</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">params</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.5</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">f</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">params</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">params</span><span class="p">),</span> <span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">20</span><span class="p">),</span> <span class="n">constrained_layout</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">):</span>
        <span class="n">a</span> <span class="o">=</span> <span class="n">params</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="n">b</span> <span class="o">=</span> <span class="n">params</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">beta</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">).</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">ax</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">].</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
        <span class="n">ax</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">].</span><span class="n">plot</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"α = {:2.1f}</span><span class="se">\n</span><span class="s">β = {:2.1f}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">),</span> <span class="n">alpha</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">ax</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">].</span><span class="n">legend</span><span class="p">()</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_yticks</span><span class="p">([])</span>
<span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_xticks</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
<span class="n">f</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.025</span><span class="p">,</span> <span class="s">'θ'</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'center'</span><span class="p">)</span>
<span class="n">f</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="o">-</span><span class="mf">0.025</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="s">'p(θ)'</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'left'</span><span class="p">,</span> <span class="n">va</span><span class="o">=</span><span class="s">'center'</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"prior_distribution.svg"</span><span class="p">)</span>
</code></pre></div></div>
<h1 id="6-code-to-calculate-the-conditional-probability-of-the-null-hypothesis">6. Code to calculate the conditional probability of the null hypothesis</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">scipy.integrate</span> <span class="k">as</span> <span class="n">integrate</span>
<span class="n">integrate</span><span class="p">.</span><span class="n">quad</span><span class="p">(</span><span class="k">lambda</span> <span class="n">theta</span><span class="p">:</span> <span class="n">theta</span><span class="o">**</span><span class="mi">30</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">theta</span><span class="p">)</span><span class="o">**</span><span class="mi">70</span><span class="p">,</span> <span class="mf">0.49</span><span class="p">,</span> <span class="mf">0.51</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">integrate</span><span class="p">.</span><span class="n">quad</span><span class="p">(</span><span class="k">lambda</span> <span class="n">theta</span><span class="p">:</span> <span class="n">theta</span><span class="o">**</span><span class="mi">30</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">theta</span><span class="p">)</span><span class="o">**</span><span class="mi">70</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>