https://betanalpha.github.io/assets/case_studies/probability_theory.html

#1 dbranes:
This is excellent. Amazing (equation + picture)/text ratio.
Complaint: you don't define your notion of "space". In chapter 1 it's an informal notion that you use to motivate the definition of a set (??); in 1.3 and 1.4 it becomes clear that by space you mean "set". Later you start talking about the dimension of spaces, implying that they not only come with a topology but also have a well-defined dimension, so a locally Euclidean Hausdorff space or something, but maybe you just mean R^n.
Comment for other commenters in this thread: not all exposition is tailored for the masses. A piece of pedagogical literature isn't bad just because it doesn't appeal to your background. There's a very clear need for exposition on basic structures in probability theory, and this fits there.

#2 joker3:
I'm not really sure who the intended audience is here. There's a lot of material covered very briefly in a very short space, and not enough details that anyone who doesn't already know it would be able to pick up anything substantive.

#3 nafizh:
This is an excellent book by L. V. Tarasov on probability.
https://archive.org/details/TheWorldIsBuiltOnProbability
I have always found Russian math book writers to be on point, not going too much over your head, also respecting the reader's intelligence. If you like it, then you will love his calculus book, that one is also a real gem.

#4 egonschiele:
Echoing other comments here, this seems like a hard way to start learning probability. It sounds like the goal is to make probability easier to understand based on what you say here (https://betanalpha.github.io/writing)
> In this case study I attempt to untangle this pedagogical knot to illuminate the basic concepts and manipulations of probability theory and how they can be implemented in practice
But I think this is too hard. I really loved "Probability For The Enthusiastic Beginner" http://a.co/2kp5PZd

#5 baryphonic:
The goal is worthy, but the product is inadequate to say the least. This thing is littered with typos, and enough of the exposition is sufficiently irrelevant or incorrect to be unintuitive. That said, I like the graphics and layout.
For example, when he discusses power sets in order to introduce sigma algebras, he implies that a sigma algebra is a better-behaved alternative to a power set. However, a power set is always itself a sigma algebra (after all, even a power set of an uncountable set still is closed under complements and countable unions).
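For a finite space, the claim that a power set is itself a sigma algebra can be checked by brute force. The sketch below is illustrative only: the helper names `power_set` and `is_sigma_algebra` and the three-element space are assumptions made for the example, and it relies on the fact that on a finite space closure under countable unions reduces to closure under pairwise unions.

```python
from itertools import chain, combinations


def power_set(space):
    """Return every subset of a finite `space` as a set of frozensets."""
    items = list(space)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}


def is_sigma_algebra(family, space):
    """Check the sigma-algebra axioms for a family of subsets of a finite space.

    The family is finite here, so closure under pairwise unions already
    implies closure under countable unions.
    """
    space = frozenset(space)
    if space not in family:
        return False
    # Closed under complements.
    if any(space - a not in family for a in family):
        return False
    # Closed under (pairwise) unions.
    return all(a | b in family for a in family for b in family)


X = {1, 2, 3}  # hypothetical example space
P = power_set(X)

# A family that is NOT a sigma algebra: the complement of {1} is missing.
not_closed = {frozenset(), frozenset({1}), frozenset(X)}

print(len(P), is_sigma_algebra(P, X), is_sigma_algebra(not_closed, X))
```

The second family shows what the check rejects: it contains {1} but not its complement {2, 3}. The brute-force approach only works for small finite spaces; it says nothing about why restricting the power set matters on uncountable spaces.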
Later, when discussing probability distributions, he writes:
> [W]e want this allocation [of a conserved quantity] to be self-consistent – the allocation to any collection of disjoint sets, A_n ∩ A_m=0, n≠m, should be the same as the allocation to the union of those sets, ℙπ[∪(n=1 to N) A_n]=∑(n=1 to N)ℙπ[A_n].
The condition `A_n ∩ A_m=0, n≠m` is actually incorrect, since A_n and A_m are sets and 0 is an integer. The author means the empty set, but typo'd.
He frequently uses words like "conserved" or "well-defined" without giving us a clue as to what these mean. In what context are probabilities "conserved"? What distinguishes "well-defined" from "not well-defined"?
I'm a software engineer. A nontrivial amount of my time is devoted to reading code and finding bugs. Sloppy reasoning, inconsistencies and outright errors like that are big red flags to me. It doesn't help that the whole section on sigma algebras is somewhat irrelevant, since he doesn't really explore measure theory as the basis for modern probability.
IMO a better resource is the series of "Probability Primer" videos from mathematicalmonk on YouTube[1]. He does an excellent job (IMO) of covering all pertinent prerequisites and being mostly rigorous without necessarily proving every single fact or exhaustively covering all edge and corner cases. He also makes a good effort to recommend advanced (and rigorous) treatments of the subject (and ancillary ones like measure theory). A readable version of this YouTube series would be a great resource, and if Michael Betancourt is reading, I'd encourage him to pursue that in his next iteration of this product.

#6 mayankkaizen:
I confused this with the book "Probability and Statistics for Engineers and Scientists" by Anthony Hayter and I got excited.
I am kind of a beginner in machine learning and was struggling badly with basic probability and statistics concepts. I went through so many resources and somehow none of them clicked. Then I stumbled upon this book and realized it is exactly the kind of book I needed. It assumes no prior knowledge and is very heavy on examples. Other books just dive into jargon- and symbol-laden theory without giving simple examples or building concepts from the ground up.
I mentioned this because I feel someone might benefit from this suggestion.

#7 nl:
Wow, this seems like a particularly hard way to learn probability.
One thing I noticed about myself as I did more and more work with probability is that I started thinking in terms of distributions a lot more.
These days I find it very difficult to think without using them. In just about everything I do now I tend to think about moving probability mass around.

#8 graycat:
Here's my nonstandard, nutshell, IMHO advice in using probability theory:
(1) Random Variables. Go outside. Observe a number. Then that is the value of a random variable. To have a random variable, the number need not be random in the sense of unpredictable. For the phrase and/or criterion "truly random", mostly f'get about it, but we return to that for the subject of random number generation below. So, net, your data, all your data, are the values of random variables.
(2) Distributions. Sure, each random variable has a distribution. And there are the Gaussian, uniform, binomial, exponential, Poisson, etc. distributions.
Sometimes in practice you can use some assumptions to conclude that a random variable has such a known distribution; this is commonly the case for exercises about flipping coins, rolling dice, shuffling cards.
For another example, suppose customers are arriving at your Web site. Well, maybe the number of arrivals since noon has stationary (over time) independent increments; maybe you can confirm this just intuitively. Then, presto, bingo, the arrivals are a Poisson process, and the times between arrivals are independent, identically distributed exponential random variables; see E. Cinlar, Introduction to Stochastic Processes. Further, since you might be willing to assume that the arrivals are from many users acting independently, the renewal theorem says that the arrivals will be approximately Poisson, more accurately for more users; see W. Feller's second volume.
Sometimes the central limit theorem can be used to justify a Gaussian assumption.
Still, net, in practice, mostly we don't and can't know the distribution. To have much detail on a distribution of one variable takes a lot of data; the joint distribution on several variables takes much more data; the amount of data needed explodes exponentially with the number of joint variables. So, net, don't expect to know or find the distribution.
Often you will be able to estimate mean and variance, etc. but not the whole distribution. So, you usually need to proceed without knowing distributions. In simple terms: do distributions exist? Yup. Can we find them? Nope!
(3) Independence. Probability theory is, sure, part of math, but, really, the hugely important, unique feature is the concept of independence.
One of the main techniques in applied math is divide and conquer. Well, wherever you can make an independence assumption, it lets you so divide.
Independence? A simple criterion for practice is, suppose you are given random variables X and Y. You are even given their probability distributions (but NOT their joint probability distribution). Then X and Y are independent if and only if knowing the value of one of them tells you nothing more than you already know about the value of the other one.
The hope here is that often in practice you can check this criterion just intuitively from what you know about the real situation. E.g., does a butterfly flapping its wings in Tokyo tell you more about weather tomorrow in NYC? My intuitive guess is that this is a case of independence which means that for predicting weather of NYC tomorrow, we can just f'get about that butterfly.
(4) Conditioning. For random variables X and Y, we can have the conditional expectation of Y given X, E[Y|X]. Such conditioning is the main way X tells you about Y. Then there is a function f(X) = E[Y|X], and f(X) is the best nonlinear least squares estimate of Y. Note that E[E[Y|X]] = E[Y], which means that E[Y|X] is an unbiased estimate of Y.
(5) Correlation. If you don't have independence, then likely use the Pearson correlation; it works like the cosine of an angle. If random variables X and Y are independent, then their Pearson correlation coefficient is 0; the proof is an easy exercise just from the basic definition and properties of independence.
(6) The Classic Limit Theorems. Pay close attention to the central limit theorem (CLT) and the weak and strong laws of large numbers (LLN). The CLT is the main reason we get a Gaussian distribution, and the LLN is the main reason we take averages.
(7) Random Number Generation. A sequence of random numbers is meant to look, for some practical purposes, like a sequence of random variables that are all independent and have uniform distribution on [0,1]. Are they "truly random"? Maybe not. But if they are, then they are independent and identically distributed (i.i.d.) on [0,1], and that's all there is to it; you don't have to struggle to say or understand more.
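Points (3) through (6) above can be poked at numerically with a small simulation. This is only a rough sketch using Python's standard library: the die roll for X, the Gaussian noise for Y, the choice Z = X + Y, the sample size, and the helper names `mean` and `pearson` are all assumptions made for illustration, not anything from the comment itself.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000


def mean(values):
    return sum(values) / len(values)


def pearson(a, b):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Points (3) and (5): X and Y are drawn independently, so their sample
# correlation should be close to 0.
x = [rng.randint(1, 6) for _ in range(n)]    # a die roll
y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # unrelated Gaussian noise
corr_independent = pearson(x, y)

# Point (4): Z depends on X. Estimate f(k) = E[Z | X = k] by the mean of Z
# within each group, then average f(X) over the sample.
z = [xi + yi for xi, yi in zip(x, y)]
groups = {}
for xi, zi in zip(x, z):
    groups.setdefault(xi, []).append(zi)
f = {k: mean(vals) for k, vals in groups.items()}
tower_lhs = mean([f[xi] for xi in x])  # sample version of E[E[Z|X]]
tower_rhs = mean(z)                    # sample version of E[Z]

# Point (6): by the law of large numbers, the average of many i.i.d.
# uniform [0,1] draws settles near the true mean 0.5.
uniform_mean = mean([rng.random() for _ in range(n)])

print(f"correlation of independent X, Y: {corr_independent:+.4f}")
print(f"E[E[Z|X]] vs E[Z]: {tower_lhs:.4f} vs {tower_rhs:.4f}")
print(f"mean of uniform draws: {uniform_mean:.4f}")
```

The two sides of the conditioning check agree up to floating-point rounding because, within a sample, the frequency-weighted average of group means is the overall mean. The correlation and the uniform average are only approximately 0 and 0.5, with an error that shrinks as n grows.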

#9 ginnungagap:
I don't see the point of introducing sigma-algebras if you're not doing probability based on measure theory.
As others have said, I wouldn't suggest this exposition to someone learning probability for the first time, but it's not as bad if you're familiar with the material and need a quick review.

#10 nicbou:
> The set of all sets in a space, X, is called the power set, P(X). The power set is massive and, even if the space X is well-behaved, the corresponding power set can often contain some less mathematically savory elements. Consequently when dealing with sets we often want to consider a restriction of the power set that removes unwanted sets.
I wish people could teach math in plain English. I don't know why the math and physics world refuses to write for the reader. I took this class before, and I still don't know what the author means by "less mathematically savory elements".
Here's how you explain things to humans:
> There is a set called the power set that contains all the sets in a space. This set is huge, and it contains [less mathematically savory elements]. This is why we usually use a restricted version that removes the unwanted sets.
Seriously, there's no point to this sort of fancy language. Math is already hard. No need to make it harder.

#11 tree_of_item:
I thought this was really nice, strange to see that so many people dislike it.

#12 ak_yo:
I find that this guide unhelpfully conflates probability and inference in a few places. Probability theory on its own is interesting but not terribly useful without the infrastructure of estimation.

#13 mlevental:
i think these are mistakes
>2.5 Conditional Probability Distributions As we saw in Section 3.4,
>It turns out that in this case a σ-algebra on Z naturally defines a σ-algebra

#14 madengr:
NO NO NO!!! Don’t start with Venn diagrams, sets, and other such fluff. Reminds me of the thin, little book they tried sticking on us in my probability class; undergrad EE. It was meant for math majors.
There is a book “Probability and Statistics for Engineers and Scientists” by Ronald Walpole. That book is excellent. Rolling dice and pulling colored marbles from jars is how you teach probability.