/sci/ psg/ - probability and statistics general

taxes.png

Anonymous at Wed, 6 Mar 2024 06:36:47 UTC No. 16059442

Welcome to /psg/ - Probability and Statistics General!

If you love stats, weird numbers and counterintuitive science, this is your general. One thing about statistics is that nothing is ever quite what it appears to show at first glance. It doesn't matter whether you are a seasoned professional, a NEET or some disgruntled grad student. All are welcome.

Some people may not like it if you try to make them do your homework, others won't care and will just help you. Let's discuss theories together, ask questions and try to meme a little about this field.

So, grab your favorite statistical software, dust off your textbooks, and join me in this exciting journey through the world of /psg/ - Probability and Statistics General! Let's embark on this adventure together and unravel the mysteries of data one statistical concept at a time.

I like Sankey plots, what do you like?

Anonymous at Wed, 6 Mar 2024 06:44:52 UTC No. 16059451

>>16059442
What would explain the statistic of 13% of the population committing 50% of the violent crime?

Anonymous at Wed, 6 Mar 2024 06:48:01 UTC No. 16059456

>>16059451
There is a genetic component in this that is forbidden to talk about.

Anonymous at Wed, 6 Mar 2024 06:59:42 UTC No. 16059467

I think this is the right place to ask. How are statistical tests and distributions related, like the t-test and the t-distribution? What does 95% significance mean?

Anonymous at Wed, 6 Mar 2024 07:00:37 UTC No. 16059468

Tax money (real wealth) all goes directly to the Rothschild bankers, the government then pays its expenses using freshly printed debt AKA fiatbux.

Anonymous at Wed, 6 Mar 2024 08:13:22 UTC No. 16059543

>>16059467
There are a couple of different takes on this. First, a statistical test can be a test of anything, like "is this data normally distributed?" The general pattern is: "If I want to know XYZ about this data, I do this test and see if it is significant".

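95% significance (a 5% significance level) means that, if there were no real effect, random noise alone would produce a result at least this extreme only 5% of the time. It does not mean the observed effect is 95% likely to be real. P-hacking is basically running so many tests and analysis variants on noisy data that something comes out "significant" by chance, which amounts to overfitting to the hilt.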
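A rough way to see how a test and its distribution fit together is to simulate it. The sketch below draws many samples where the null hypothesis is true (mean 0), computes a one-sample t statistic for each, and counts how often it lands beyond the two-sided 5% cutoff of the t-distribution with 9 degrees of freedom (2.262, the standard table value). The function names, sample size and seed are assumptions made up for the illustration.

```python
import random

T_CRIT_DF9 = 2.262  # two-sided 5% critical value, t-distribution with 9 degrees of freedom


def t_statistic(sample, mu0=0.0):
    """One-sample t statistic: (mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
    return (mean - mu0) / (var / n) ** 0.5


def false_positive_rate(n_sims=20000, n=10, seed=0):
    """Share of simulated samples that reject H0 although H0 (mean = 0) is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(t_statistic(sample)) > T_CRIT_DF9:
            rejections += 1
    return rejections / n_sims


rate = false_positive_rate()
print(f"rejection rate under the null: {rate:.3f}")  # should land near 0.05
```

The t-distribution is what the statistic follows when there is no effect, so the 5% level is just the share of pure-noise samples that still cross the cutoff.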

Anonymous at Wed, 6 Mar 2024 09:51:11 UTC No. 16059646

When we do imputation on data which is MAR (missing at random), stochastic regression imputation seems able to impute the missing data such that the mean, variance and covariance of the imputed data are representative of the complete data with no missing values (let's say in a simulation). However, that is not the case for the variance of the estimator of the mean. Clearly the stochastic part of the regression doesn't model the variance of the estimator. Now apparently multiple imputation can give a more accurate variance of the estimator, but I'm not sure I understand how exactly. Does it come from the analysis part of the multiple imputation?

I've tried running some simulations where I impute the data using stochastic regression, but the mean of the variance of the estimator across those imputations doesn't seem to converge to the value obtained from [math]\sigma^2/n[/math]. Could anyone knowledgeable on this help me understand?
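For reference, the usual way multiple imputation adjusts the variance is the pooling step (Rubin's rules): the total variance is the average within-imputation variance plus an inflated between-imputation variance, T = W + (1 + 1/m) B. A minimal sketch follows, assuming each imputed data set has already been analysed separately; the function name and the numbers are hypothetical.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m imputed-data analyses with Rubin's rules.

    estimates: the point estimate (e.g. the mean) from each imputed data set
    variances: the estimated variance of that estimate from each data set
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # W: average sampling variance
    between = statistics.variance(estimates)   # B: spread across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between     # T = W + (1 + 1/m) * B
    return pooled, total


# Hypothetical numbers, for illustration only.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
```

The between-imputation term B is the part a single stochastic imputation cannot provide, since it needs several imputed data sets to be estimated at all.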

Anonymous at Wed, 6 Mar 2024 10:12:31 UTC No. 16059667

>>16059646
I would check back ab initio on how it is calculated. See if you can understand exactly how it is calculated and then go from there in your code and see if you do it the same way. The problem here is most probably some kind of cascading problem in how you input the values into your programme.

Anonymous at Wed, 6 Mar 2024 21:19:04 UTC No. 16060374

What is the best and why? R or Python?

Anonymous at Wed, 6 Mar 2024 23:04:53 UTC No. 16060494

>>16059667
I'm not sure what you mean. I calculated the variance of the estimator of the mean both from the formula and empirically on the non-imputed data (which gave me similar results, give or take 0.03). I also found empirically the variance for the data with the missing values removed (which was different, let's say 0.07). When I checked the imputation by stochastic regression I obtained 0.17, and even after doing multiple imputation the average was still around 0.17, give or take. I don't think I made any error in the code, since the estimate on the data set with no missing values matched the theoretical sigma^2/n. I simply calculated the variance of the mean across multiple simulations for that specific variable.

>>16060374
Always depends on the use case

Anonymous at Wed, 6 Mar 2024 23:14:24 UTC No. 16060507

>>16059442
>Where does tax money go?
I am able to answer this question with two letters in the alphabet.

38028124.jpg

Anonymous at Wed, 6 Mar 2024 23:24:14 UTC No. 16060529

>>16059442
Is /psg/ new?

Anonymous at Thu, 7 Mar 2024 00:08:45 UTC No. 16060591

>>16060529
No idea. But it is a legit science. It's interesting and it is very redpilled if done correctly. I like it. It's also pretty hard to bullshit statisticians.

Anonymous at Thu, 7 Mar 2024 01:09:52 UTC No. 16060656

>>16060529
It's disgusting how scientists wrap their lack of understanding into probability functions.

file.jpg

Anonymous at Thu, 7 Mar 2024 01:33:26 UTC No. 16060680

>>16060591
I just took my first stats class last quarter, and opted to change my major to focus more on stats. So I appreciate having a general about it

Anonymous at Thu, 7 Mar 2024 01:51:09 UTC No. 16060696

>>16060374
R has a lot more support for specifically statistics oriented tasks. Python is a lot better for just about everything else (including ML/approximate inference)

Anonymous at Thu, 7 Mar 2024 01:53:29 UTC No. 16060699

>>16060680
Doctors getting got by Bayes theorem isn't all that surprising. What will be surprising is if statisticians stop getting got by Bayes theorem and start actually making contributions to Bayesian stats beyond rehashing what engineering people have done for decades.

Anonymous at Thu, 7 Mar 2024 02:03:51 UTC No. 16060720

>>16060374
Making your own stats library with Lisp.

statistics.png

Anonymous at Thu, 7 Mar 2024 11:36:22 UTC No. 16061227

what are the odds?

Anonymous at Thu, 7 Mar 2024 17:10:31 UTC No. 16061534

When should I use cumulants rather than moments?

Anonymous at Thu, 7 Mar 2024 17:42:29 UTC No. 16061572

>>16059451
There is no explanation, because no such statistic exists. It is simply false to say that 13% of the population commits 50% of the violent crime. Less than 5% of the 13% demographic has a violent crime conviction, so over 95% of them do not have such a conviction. The statistic should read that around 0.6% of the population commits 50% of the violent crime.

Now, it IS correct to say that, of convicted violent criminals, 50% of it is committed by people who share a "racial" classification also possessed by 13% of the population. These criminals but have many other classifications as well, but I understand that white supremacy demands that you focus on race. You are a loyal servant and you are doing well, so I don't want to discourage you; you just missed the mark a bit this time.

Anonymous at Thu, 7 Mar 2024 17:48:35 UTC No. 16061581

>>16060680
what is the answer?

Anonymous at Thu, 7 Mar 2024 17:54:26 UTC No. 16061592

>>16061581
Roughly 16.7%. To gain the intuition, picture it this way:
Imagine you have 1000 patients.
As given in the question, 1 out of 1000 actually has the disease.
Also given in the question, 5 out of the 1000 will falsely test positive.
So, when you test everyone, you get 6 positives. Out of the 6 positives, only 1 is real.
and 1/6 ~= 0.166667.

Anonymous at Thu, 7 Mar 2024 17:56:03 UTC No. 16061593

>>16060529
It's as new as it can be. And in a time when numbers, the news and people confuse us, this is the general worth having.

Anonymous at Thu, 7 Mar 2024 18:36:07 UTC No. 16061632

>>16061227
I could tell you, but then the thought police are going to have a chat with me.

Anonymous at Thu, 7 Mar 2024 18:37:09 UTC No. 16061634

>>16061572
Post a picture of your skin colour. Because I can sense the smell of shit reeking from here.

Anonymous at Thu, 7 Mar 2024 18:40:07 UTC No. 16061637

>>16060696
I am thinking of learning both desu.
>>16060720
I see you are a redpilled gentleman who likes your oldtech.

Anonymous at Thu, 7 Mar 2024 20:03:52 UTC No. 16061727

>>16061592
The reasoning is sound, but 5% of 1000 is 50.

🗑️ Anonymous at Thu, 7 Mar 2024 20:11:28 UTC No. 16061733

>>16061227
We need genetic tests to separate apes and marranos from humans.

Anonymous at Thu, 7 Mar 2024 20:16:28 UTC No. 16061742

>>16061727
So it is 1/51 (~1.96%)?
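The same calculation can be written out with Bayes' theorem. This is a small sketch using the numbers quoted in the thread (prevalence 1 in 1000, 5% false positive rate). The sensitivity is not given, so 1.0 is assumed for the headline figure and 0.9 is a made-up value to show the effect of an imperfect test; the function name is also just illustrative.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = prevalence * sensitivity                 # sick and flagged
    false_pos = (1 - prevalence) * false_positive_rate  # healthy but flagged
    return true_pos / (true_pos + false_pos)


# Assumption: the test never misses a real case (sensitivity = 1.0).
ppv_perfect = positive_predictive_value(0.001, 1.0, 0.05)

# Assumption: hypothetical 90% sensitivity, to show the direction of the effect.
ppv_imperfect = positive_predictive_value(0.001, 0.9, 0.05)

print(f"{ppv_perfect:.4f} {ppv_imperfect:.4f}")
```

The exact value with perfect sensitivity is slightly above 1/51 because only 999 of the 1000 patients are healthy, and any sensitivity below 1 pushes the answer lower.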

Anonymous at Thu, 7 Mar 2024 21:26:04 UTC No. 16061845

>>16061733
this

mv5bogq2yzg2mtmtm....jpg

Anonymous at Thu, 7 Mar 2024 21:32:52 UTC No. 16061854

>>16061733
too expensive. let's just use a box and a gom jabbar.

Anonymous at Thu, 7 Mar 2024 22:19:19 UTC No. 16061916

>>16061727
>>16061592
Thanks, good catch.

>>16061742
Correct. You get the idea.

>>16061634
Did you mean you want a picture of my skin? Is the point to confirm whether or not I am nonwhite?

Anonymous at Thu, 7 Mar 2024 22:53:07 UTC No. 16061962

>>16060374
If you're working with very large datasets, it's quite difficult to get R to handle them efficiently. Maybe easier to write the code in R than Python but then it takes 50 times longer to run.

Anonymous at Thu, 7 Mar 2024 22:54:27 UTC No. 16061965

Weighted statistics seem like a massive scam to me

Anonymous at Thu, 7 Mar 2024 22:55:30 UTC No. 16061971

>>16061572
based post, anon.

Anonymous at Thu, 7 Mar 2024 23:06:22 UTC No. 16061990

>>16061962
Read a book.
- https://h2oai.github.io/db-benchmark/
- https://r4ds.hadley.nz/arrow

Anonymous at Thu, 7 Mar 2024 23:07:42 UTC No. 16061995

>>16061572
>These criminals but have many other classifications as well, but I understand that white supremacy demands that you focus on race.
Is there some classification (besides something trivial like "violent crime conviction") that shows a bigger disparity than the racial one?

Anonymous at Thu, 7 Mar 2024 23:58:24 UTC No. 16062053

>>16061916
A picture of your nose needs to be supplied as well. I need to see how big it is.

Anonymous at Fri, 8 Mar 2024 10:16:09 UTC No. 16062599

>>16061965
Why?

Anonymous at Fri, 8 Mar 2024 17:00:10 UTC No. 16063058

>>16060374
R has some nice touches with ggplot2 and stan where it outshines Python.
For production work Python outshines R with FastAPI, type hints and nicer db tooling (sqlalchemy etc).
My professional work is exclusively Python.

Anonymous at Fri, 8 Mar 2024 21:11:21 UTC No. 16063550

>>16063058
I have often wondered how much faster Julia is compared to both of them. What do you think?

Anonymous at Sat, 9 Mar 2024 05:04:31 UTC No. 16064406

>>16061990
Thanks for this

Anonymous at Sat, 9 Mar 2024 05:13:11 UTC No. 16064419

>>16063550
Why would you want to do direct stats work in Julia? Do you plan on programming your own statistical/ML systems and optimization from scratch or are you going to interface with external libraries (in which case you'll just be using Python libraries anyways but with Julia as a wrapper)?

I've heard that Julia is quite fast for the kinds of programs which call many simple functions an absurd number of times. Maybe that could make the data searching and sorting parts of your statistics marginally faster? The slow part with data sets (from a computational perspective) is always integration/summation if you're doing classical statistics, and training if you're doing ML/adaptive stats. I don't know if Julia would give you much of an advantage here unless you were to literally write everything from scratch yourself.

Anonymous at Sat, 9 Mar 2024 07:15:58 UTC No. 16064560

>>16063058
it's because code written in R tends to be messier and hard to maintain, but I would say the killer is that not everyone is familiar with R but everyone knows python.

Anonymous at Sat, 9 Mar 2024 09:47:46 UTC No. 16064755

>>16061742
Technically, I would include "<" since nothing was stated about the false negative rate.

I wonder how many of the 1/5 "correct" responders considered this extra nuance.

Anonymous at Sat, 9 Mar 2024 13:34:48 UTC No. 16064914

>>16064419
Because R is hard to optimize with certain stuff, like options. I need something that doesn't take minutes to get done.

Anonymous at Sat, 9 Mar 2024 14:13:25 UTC No. 16064961

>>16064914
You could probably do just about anything you'd need in Python while still being fairly optimal.

As far as I'm aware, most of the Julia libraries for existing statistical/ML programs are just using Julia as a wrapper to run Python anyways.

Anonymous at Sat, 9 Mar 2024 14:17:28 UTC No. 16064966

>>16064961
Ok, so Julia is basically just a Cython code app?

Anonymous at Sat, 9 Mar 2024 14:26:28 UTC No. 16064979

>>16064961
>most of the Julia libraries for existing statistical/ML programs are just using Julia as a wrapper to run Python anyways.
That's such an incredibly retarded claim on all counts. The "Python" libraries that are used in Python are not in Python, they're in C++. Julia ML packages try to be pure Julia, such as Turing.jl (https://turing.ml). Imagine thinking somebody would want to wrap Python in a language that outperforms C++ itself in numeric computing (which is Julia).

Anonymous at Sat, 9 Mar 2024 14:40:17 UTC No. 16064989

>>16061592
Your analysis is wrong, since you can have a malfunction in the test and be positive too. This is the gettier problem aspect of the question. Also your arithmetic was poor.

Anonymous at Sat, 9 Mar 2024 14:54:23 UTC No. 16065001

>>16064979
Are you alright?

As far as I'm aware, most of the major python ML libraries are written in Python. Scikit-learn (as an example) has their entire source code available on their GitHub. It's in Python.

Pytorch is mixed, but the vast majority of its toolkit is written in python, with only the most speed critical portions written in C++ (at least if their GitHub is to be believed).

Also, I don't doubt that some people try to make Julia packages as pure Julia. My doubt is that there are a large number of people who are spending time rewriting standard tools for things like xgboost or pytorch/tensorflow interfaces for Julia for free. Tensorflow.jl, as an example, explicitly states that it's a Julia wrapper for tensorflow, which is a mix of python and c++ (but mostly python aside from model compilers).

Anonymous at Sat, 9 Mar 2024 14:59:24 UTC No. 16065010

>>16064966
No, Julia is very versatile and used for a lot more than just this stuff. It's just new enough, and a small enough player relative to Python, that currently a lot of the ML infrastructure in Julia packages is just wrappers for existing Python ML/statistics libraries.

I'm sure this will change, and there are entirely native Julia ML packages like the example given >>16064979

As far as I'm aware, these are exceptions to the rule though. Most of the Julia ML work I'm aware of is using Julia to interface with Python ML libraries like Pytorch/Tensorflow/Keras/SKL. The turing.ml package they pointed to looks interesting though.

Anonymous at Sat, 9 Mar 2024 15:16:37 UTC No. 16065031

>>16065001
>major python ML libraries
>Scikit-learn
Your example of a major ML library is a teaching/hobbyist library, not something people actually deploy in the last decade.
>their entire source code
Oh, does that include BLAS, LAPACK, and other hard dependencies? Julia's default BLAS isn't in C++.
>some people
Nearly all people. The norm among Julia users is to rewrite everything in Julia.
>rewriting standard tools for things like xgboost or pytorch/tensorflow
These are not tools used in Julia. I've just given you a link to one of the standard Julia libraries, Turing.jl, which is pure Julia. Another one is Flux.jl, also pure Julia. The interfaces are made by a fringe of Julia users who cling to Python and don't understand the point of Julia as a scientific computing and research language. Julia isn't culturally a stats/ML language, it's a scientific computing language, famous for its best-in-any-language ODE/PDE solvers and adjacent packages, https://sciml.ai. Speaking of which, see what the SciML ecosystem depends on: https://docs.sciml.ai/Overview/stable/overview/. No Tensorflow there, the dependencies are Turing.jl and Flux.jl.

Anonymous at Sat, 9 Mar 2024 15:17:07 UTC No. 16065033

>>16065031
Great baby day

Anonymous at Sat, 9 Mar 2024 15:18:43 UTC No. 16065037

>>16065033
>no u
But good on you for recognizing the implication that you're at most an undergrad with your "example" of sklearn.

Anonymous at Sat, 9 Mar 2024 16:09:02 UTC No. 16065108

>>16065037
Sklearn is literally used every day for decision tree based learning.

Sklearn's xgboost module is literally the most popular implementation of xgboost in industry (a.k.a one of the most popular methods for decision tree based regression in data limited/under-determined problems).

I understand that you are ignorant of this because you're actually a hobbyist who is more interested in memory optimization than actually using these tools to do anything, but you couldn't have it more backwards.

Anonymous at Sat, 9 Mar 2024 16:57:00 UTC No. 16065155

>>16065108
>Sklearn's xgboost module
>implementation of xgboost
Holy shit, the confusion in your head.

XGBoost is not a "sklearn module." It's its own project in C — https://github.com/dmlc/xgboost — and definitely isn't part of sklearn's "entire source code [...] in Python". Generic gradient boosting algorithms aren't all XGBoost. "XGBoost" isn't the word for any classifier based on decision trees. Random forests are random forests, not XGBoost. LightGBM is LightGBM, not XGBoost. CatBoost is CatBoost, not XGBoost. XGBoost is XGBoost. The classifier natively implemented in sklearn isn't XGBoost.

>industry
Oh sorry, I didn't recognize the holder of a Coursera certificate of accomplishment. My bad!

Anonymous at Sat, 9 Mar 2024 17:08:36 UTC No. 16065164

>>16065031
How much faster is julia at solving PDEs compared to Python or R?

Anonymous at Sat, 9 Mar 2024 17:09:49 UTC No. 16065166

>>16065108
When I used XGBoost, I didn't use it in Python but in R.

Anonymous at Sat, 9 Mar 2024 17:17:40 UTC No. 16065175

>>16065166
XGboost is just a specific implementation of gradient boosted decision trees. It can in principle be implemented in any language, and >>16065155 is correct that it doesn't need to be implemented via Sklearn or really any specific language library (including the original XGboost library itself).

Where the guy is missing the point is that the XGboost approach is used for a lot more than just the native classification. Sklearn has a very popular module for doing regression based on the XGboost framework.

You can in principle do this sort of thing in any language so long as you properly handle the objective function, regularization, and tree updates.

My point in the post is that Sklearn's regression based on XGboost is very popular and not some "undergrad only" system. It's very commonly used for "short and fat" regression problems which are common throughout industry.
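To make the "objective function, tree updates" part concrete, here is a bare-bones sketch of gradient boosting for regression with squared error and one-split trees (stumps), in plain Python. It is not XGBoost and not what any library actually ships: XGBoost adds a regularized objective and second-order gradient information, and the only regularization here is the learning rate. All names and the toy data are assumptions for illustration.

```python
import math
import random


def fit_stump(xs, residuals):
    """Least-squares one-split tree: returns (threshold, left_mean, right_mean)."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps to the residuals (negative gradient of squared error), round by round."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((thr, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= thr else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if x <= thr else r for thr, l, r in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data: a noisy sine curve (made up for the example).
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]

model = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(f"baseline {baseline_mse:.4f}, boosted {boosted_mse:.4f}")
```

Swapping the residual line for the gradient of another loss is what changes the objective; the stump fitting and the update loop stay the same.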

Anonymous at Sat, 9 Mar 2024 17:29:23 UTC No. 16065187

>>16065031
Okay, so basically, we agree. Julia could in principle be used as a stats/ML language, but it isn't. There are early attempts at this like Turing and SciML, but they are minor players currently (and SciML has a lot of interaction with Python if you actually look at it. It's literally one of the first things on their front page).

At some point your darling language could be supported enough that it's viable without needing to rewrite all of the existing infrastructure from scratch to be purely Julia. It isn't there yet. That's where we are.

Anonymous at Sat, 9 Mar 2024 17:33:10 UTC No. 16065192

>>16065164
Anywhere from several times to a few orders of magnitude, depending on the problem and the solver(s) you pick. I don't know of a cross-language benchmark for PDEs, but here are some for ODEs:

- https://docs.sciml.ai/SciMLBenchmarksOutput/stable/MultiLanguage/ode_wrapper_packages/
- https://docs.sciml.ai/SciMLBenchmarksOutput/stable/MultiLanguage/special_benchmarks/

You may notice that neither of these is fully up to date. That's because the SciML benchmark suite is comprehensive, and people rightly take for granted that it's been massively outperforming other options for quite some time. The comparisons are mostly to earlier revisions of SciML solvers now. (DEs in Julia are the pet project of Chris Rackauckas at MIT's Julia Lab; search for his talks or comments on performance and benchmarks if you're interested.)

dude.png

Anonymous at Sat, 9 Mar 2024 17:37:36 UTC No. 16065199

>>16065187
>things on their front page
Front page: https://sciml.ai
Ctrl-F, "python": picrel, only result
This "interaction with Python" is them letting Python and R users access Julia's DifferentialEquations.jl, not the other way round.

Anonymous at Sat, 9 Mar 2024 17:43:55 UTC No. 16065205

>>16065199
I think you are intentionally being obtuse because you have some weird chip on your shoulder.

My whole point is that Julia is fine. SciML in particular looks like a great Julia native toolkit for physics informed neural networks.

Different languages have their strengths based on their support. In my field (nonlinear programming based estimation) it's still super common for people to use MATLAB because of how good MATLAB's computational linear algebra and nonlinear optimization support is. I wouldn't recommend MATLAB to someone who is asking "what language do I learn to do stats" just because there is one particular part of the field that MATLAB does very well.

Recommending someone new to stats/ML to learn Julia based on the premise that in the future it may be as well supported as Python currently, is fucking stupid.

Anonymous at Sat, 9 Mar 2024 17:54:09 UTC No. 16065215

>>16065205
I'm not recommending to learn Julia for stats/ML. I'm arguing against the claim that Julia is a wrapper for Python. For stats/ML, it isn't a wrapper for Python because it's a research language: the norm is to write in pure Julia, not call external libraries, and everything can be adjusted. There are even @edit, @code_llvm, @code_native convenience macros that drop you arbitrarily low so you can edit everything in tight feedback loops, including the source of Julia itself. Julia is what you turn to when, for example, you write custom Monte Carlo samplers. This kind of research isn't possible if you use CmdStan, for example; you wouldn't be able to @edit it.

Anonymous at Sat, 9 Mar 2024 18:02:57 UTC No. 16065226

>>16065215
I never said that Julia is a wrapper for python. I said that a lot of popular Julia stats/ML implementations (e.g., Julia's tensorflow implementation, Julia's Keras implementation and Julia's Pytorch implementation) are. That doesn't mean all of it is. There's plenty that isn't.

I just wouldn't recommend Julia to someone who wants to learn ML, because for a lot of the more popular transformer and diffusion models, the current Julia support is a Python wrapper.

That doesn't mean it will stay like this forever, or that all of the Julia libraries are wrappers. There just are many that are currently because Julia doesn't have a lot of ML support at the moment (outside of physics informed ML where it does seem to be pretty good).

Anonymous at Sat, 9 Mar 2024 18:18:07 UTC No. 16065243

>>16063550
I've not tried Julia myself so can't comment. I've been wanting to take it for a spin to try out mamba + turing to see how it compares to pymc + stan.

>>16065215
> This kind of research isn't possible if you use CmdStan
CmdStan is command line tool. That's why you wouldn't use it to write custom samplers lmao.

Anonymous at Sat, 9 Mar 2024 18:29:24 UTC No. 16065266

>>16065243
>That's why you wouldn't use it to write custom samplers lmao.
2 + 2 = 4 lmao.
You don't use any interface to Stan to write a sampler because Stan itself is the sampler, in C++. Writing a sampler means writing your own Stan, not programs that Stan accepts. I typed "CmdStan" specifically because the authors of Stan aren't really interested in maintaining rstan (and even less so other interfaces), which has been an issue recently with deprecations in Stan syntax. You'd know if you actually used rstan/brms on a regular basis.

Anonymous at Sat, 9 Mar 2024 18:53:30 UTC No. 16065293

>>16065266
> You don't use any interface to Stan to write a sampler because Stan itself is the sampler, in C++
So you've agreed with me. All that attitude just to agree with me lmao.

Anonymous at Sat, 9 Mar 2024 18:58:36 UTC No. 16065297

>>16065293
>CmdStan is command line tool. That's why you wouldn't use it to write custom samplers lmao.
You sounded like you thought you made some sort of point that contradicted me, but you didn't. I charitably assumed that you thought that non-standalone/non-CLI Stan interfaces could be used to write custom samplers. But you seem to be even more confused somehow, whoops.

Anonymous at Sat, 9 Mar 2024 18:58:49 UTC No. 16065299

>>16059456
What a unique opinion.
I'm sure nobody ever forced that opinion on /sci/ before you heckin buffoon.
race and IQ, what a unique opinion

Anonymous at Sat, 9 Mar 2024 19:01:25 UTC No. 16065306

>>16065297
You suggested editing the sampler source in Julia was special because interfaces (specifically a command line one) couldn't do this. That is hilarious.
Charitable my arse.

Anonymous at Sat, 9 Mar 2024 19:04:52 UTC No. 16065311

>>16065299
Poast nose and skincolour SAAR. I must redeem.

Anonymous at Sat, 9 Mar 2024 19:08:09 UTC No. 16065319

>>16065306
Julia is special because you can quickly edit the sampler source in a tight feedback loop, in the same language in which your model is specified, and at no loss in performance. You can clone Stan and edit its C++ source, then wait for it to compile, then adjust your R/Python/Stan script, sure. If you don't see the difference then maybe an example closer to your experience will help: you can also shove a pencil in your ass and push the keys to type that script in with that pencil rather than with your fingers.

Anonymous at Sat, 9 Mar 2024 19:43:41 UTC No. 16065375

>>16065192
Hey Julia-anon, >>16065205 NLP estimation guy here.

Do you think SciML has good support/is feasible for underwater acoustics simulation? I'm (admittedly) much more familiar with the signal processing side of UWA than the actual acoustician wave propagation stuff, but most of the industry standard toolkits for this stuff seems to be very MATLAB focused (e.g., AcTUP is pretty much all MATLAB based as far as I'm aware).

Do you think learning Julia to do some of this UWA stuff would be worthwhile considering its capabilities for ODE/PDE systems? Not being a smart-ass, I'm just curious if you think it's worth trying to switch over.

Anonymous at Sat, 9 Mar 2024 19:47:33 UTC No. 16065378

>>16065319
Editing julia's source is essentially cloning it, similar to any language. Changing dependency sources as you describe is messy as it won't be tracked nicely by vcs. Python is also capable of changing sources in this manner - it's not unique to julia.
If the C++ flow is too painful for you, pymc has good custom sampler support which doesn't require changing source.
You've pivoted your argument from comparing Julia to a terminal tool, to arguing it's easier to write than C++. Goalposts shifted, I assume point conceded.

Anonymous at Sat, 9 Mar 2024 19:52:56 UTC No. 16065390

Let's say for argument's sake you get a BSc in stats. How do you futureproof that shit? Because it would be useless to work for ten years and then get laid off because AI fucked you over and you only knew one thing.

Note that I don't think a stats degree is going to ever become obsolete, but I think you have to learn new things over time and so on. Because even MBAs do pandas and numpy shit now.

Anonymous at Sat, 9 Mar 2024 20:00:19 UTC No. 16065405

>>16065390
That's true no matter what STEM field you work in. I wouldn't worry too much about it so long as you find a niche you find interesting and skills that are worthwhile.

Don't make the mistake of being a hard-line frequentist though. That's a great way to leave yourself in the past.

Anonymous at Sat, 9 Mar 2024 20:04:37 UTC No. 16065413

>>16065405
Nah, I am interested in the entire field honestly. I don't work in it but it's my passion. I work in another field where data analysis is very important and it helps knowing stats. I like ML, Bayesian approaches, frequentist. As long as it can help me with understanding the world, I want to know it.

Anonymous at Sat, 9 Mar 2024 20:06:03 UTC No. 16065416

>>16065390
MBAs typically know enough for dashboarding and visualisation. Regression analysis and the like will give you an edge over them.
Hard to say with AI. Incorporate it into your workflow like any tool. Good people / communication skills are also already heavily in-demand.

Anonymous at Sat, 9 Mar 2024 20:36:49 UTC No. 16065460

Pretty funny that programming languages for stats will give this much of a heated debate.

I often wonder why researchers don't do everything in C++ since it's fast as fuck.

Anonymous at Sat, 9 Mar 2024 21:14:38 UTC No. 16065510

>>16065460
It's because C++ sucks and is a pain in the ass. If you don't need to get autistic about memory allocation, and can afford marginal inefficiencies why bother?

Anonymous at Sat, 9 Mar 2024 21:58:29 UTC No. 16065572

>>16065510
because i have a need for speed.

Anonymous at Sat, 9 Mar 2024 22:38:37 UTC No. 16065605

>>16065572
Fair. All the power to you. I've done some ROS work in C++, and I can do cpp if I need to. I'd definitely prefer not to do cpp if I can at all avoid it (the same for ROS for what it's worth).

Anonymous at Sat, 9 Mar 2024 22:47:19 UTC No. 16065616

>>16065605
I was just meming. I looked at a tutorial for ML in C++ and the amount of code was insane. Even if Python/R is a bit clunky and slow, the code is quite efficient.

Anonymous at Sun, 10 Mar 2024 12:28:40 UTC No. 16066313

>>16065375
Julia is a disappointing language that fails to deliver on big promises.
When they tell you of its great performance in toy benchmarks, they don't tell you about the minutes-long startup times as it compiles everything at runtime (every time). They adopted the phrase "time to first graph" because startup made visual work so painful.
They'll tell you about its mature stats packages, then fail to mention that the core language is so unstable that not long ago if-else statements were broken.
> https://github.com/JuliaLang/julia/issues/41096
Seasoned Julia devs dropped the language for serious work citing serious correctness issues. They could no longer trust the language.
> https://yuri.is/not-julia/
> https://danluu.com/julialang/
> https://viralinstruction.com/posts/badjulia/

Julia has a lot of potential but right now it's more of a teaching/hobbyist library, not something people actually deploy in the last decade.

Anonymous at Sun, 10 Mar 2024 12:50:02 UTC No. 16066325

>>16066313
Now this is what makes /sci/ a gem. This comment alone is worth making a couple of /psg/ threads over. I didn't know this but I am happy to know it now.

Anonymous at Sun, 10 Mar 2024 14:53:40 UTC No. 16066434

>>16066313
This is a very informative response. Thank you.

Large parts of my sector appear to be waffling between converting older Matlab code to python and converting everything to Julia. I'll probably need to learn it eventually anyways, but it's good to know my instincts for being hesitant with their claims weren't completely unfounded.

Anonymous at Sun, 10 Mar 2024 16:18:18 UTC No. 16066561

>>16066434
I am thinking Python isn't as buggy as Julia though? Code done in Python should work robustly.

Anonymous at Sun, 10 Mar 2024 17:30:17 UTC No. 16066668

>>16066561
Python kind of sucks at optimization relative to Matlab due to it not directly supporting matrices. You can kind of get away with it via the tricks that numpy and scipy pull, but a lot of the nonlinear programming implementations people use are just harder on python because of it not easily handling matrices and matrix operations.

It definitely might be more stable in some sense but unless something major in Python changes to allow for more straightforward implementation of ODEs/PDEs and NLP, it will be avoided by people doing real research in many fields that rely on it.

At the moment everyone in my world is using MATLAB, but people want Julia to be a replacement that doesn't rely solely on Mathworks proprietary software. Who knows if that will actually happen.

Anonymous at Mon, 11 Mar 2024 04:05:17 UTC No. 16067561

>>16059442
Pretty shit general

Anonymous at Mon, 11 Mar 2024 05:44:46 UTC No. 16067638

>>16067561
They couldn't even fill maths general and thought they needed another, /sci/ generals are doomed to fail anyways because it's supposed to be a gem in a catalog of shit, the shit is gonna leak in

Anonymous at Mon, 11 Mar 2024 08:03:59 UTC No. 16067737

>>16067561
You could always go to a general you like instead?
>>16067638
I don't think the mathbros are going to like to talk about coding and stats desu.

Anonymous at Mon, 11 Mar 2024 10:02:28 UTC No. 16067833

>>16066668
What field do you work in?

Anonymous at Mon, 11 Mar 2024 10:54:17 UTC No. 16067913

>>16067833
I guess you could call it "sonar," as a general field. The area between underwater acoustic physics, underwater acoustic signal processing, and target tracking.

Anonymous at Mon, 11 Mar 2024 11:13:39 UTC No. 16067932

>>16067913
I am slightly jelly. Sounds more interesting than finance.

Anonymous at Mon, 11 Mar 2024 11:43:26 UTC No. 16067961

>>16067932
It is pretty interesting, but it's also frustrating at times.

I'm much more on the "statistical detection" and "statistical estimation" side of the field and you tend to see it split between acousticians who know a lot about wave propagation and nothing about signal processing, and signal processing people who know nothing about acoustics. As a result you get a lot of miscommunication and doubled work because we never quite know what the other people have figured out in our area and vice versa.

Anonymous at Mon, 11 Mar 2024 12:31:12 UTC No. 16068007

>>16067961
That sounds really interesting. Would you be able to recommend a book on the topic?

Anonymous at Mon, 11 Mar 2024 13:32:19 UTC No. 16068063

>>16068007
If you want to learn the basics of estimation, your best bets are:

1) Fundamentals of Statistical Signal Processing Vol. 1, Estimation: Kay
2) Estimation with Applications to Tracking and Navigation: Bar-Shalom
3) Optimal State Estimation: Simon

1) focuses primarily on parameter estimation and is good to learn about how measurement devices work for these signal estimators. 2) is primarily focused on motivating Kalman filtering and the basics of tracking. 3) focuses almost exclusively on tracking/Kalman filtering and introduces optimal and non-linear state estimators.

Kalman filters and particle filters are used all over the place in time series estimation and data science, so it might be worth looking into anyways, even if you don't get into sonar/radar land.
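To give a feel for what these books build up to, here is a minimal sketch of a scalar Kalman filter. The random-walk model, the function name `kalman_1d`, and all the numbers are assumptions picked for illustration; none of it is taken from the books above.

```python
import random

def kalman_1d(measurements, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk state observed in noise.

    Assumed model (illustrative only):
        x[k] = x[k-1] + w,  w ~ N(0, q)   (process noise variance q)
        z[k] = x[k]   + v,  v ~ N(0, r)   (measurement noise variance r)
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state estimate carries over, its variance grows by q.
        p = p + q
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates, p

# Hypothetical data: a constant level of 5.0 seen through noise with std 2.0.
rng = random.Random(0)
truth = 5.0
zs = [truth + rng.gauss(0.0, 2.0) for _ in range(500)]
est, final_var = kalman_1d(zs, q=1e-6, r=4.0, x0=0.0, p0=100.0)
print(est[-1], final_var)
```

With a tiny `q` the filter behaves almost like a running average, so the last estimate should sit close to the true level. Raising `q` makes it trust new measurements more and track changes faster.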

For signal detection, your best bets are probably:
1) Fundamentals of Statistical Signal Processing, Vol. 2, Detection: Kay
2) Elements of Signal Detection and Estimation: Helstrom

Signal detectors are basically a special case of statistical hypothesis testing, so any book that covers hypothesis testing would also be good.

In terms of the underwater acoustics stuff, there's a million of them. A good introductory resource is USN-SP411 from the US Naval Academy's physics program. Hodges' Underwater Acoustics is also good but a bit more advanced.

Anonymous at Mon, 11 Mar 2024 14:37:03 UTC No. 16068136

>>16068063
Thanks anon - that's very helpful.
I'd been looking for a new topic to dive into. I'll be grabbing some of those books.

Anonymous at Mon, 11 Mar 2024 20:23:46 UTC No. 16068829

>>16068063
Is signal detection about detecting strong asymptotic behaviour and then treating it? Just guessing a bit since I have no idea.

Anonymous at Tue, 12 Mar 2024 05:44:57 UTC No. 16069588

>>16068829
There's a million different ways of doing signal detection, but most statistical detection approaches come down to the following general idea:

1) Figure out what the signal you are looking for is supposed to look like under ideal conditions (may be trivial or may be very involved). Is it an acoustic signal or radar signal where it's more or less a scalar time series, or is it something more complicated?

2) Figure out what the noise looks like. Is it simply ambient WSS Gaussian noise or is it colored in some way? Are the errors correlated via some sort of interference?

3) Parameterize your test hypotheses. This could be as simple as two Gaussians for H_0 and H_1, or it could involve some complicated many parameter composite hypothesis for your signal being present. Generally, you want to keep things as simple as you can justify.

4) Formulate your hypothesis test. Can you simply do a likelihood ratio test or do you need to do something more sophisticated? Are your observations coming in batches (meaning you can do a simple NP, i.e. Neyman-Pearson, style test) or are they coming sequentially as a time series (meaning you need to do an SPRT (sequential probability ratio test) or a Darling-Robbins test or something)?

5) Determine the appropriate threshold conditions. If you're doing an NP/LLR (log-likelihood ratio) style test, do you have specified Pfa (probability of false alarm) and Pmd (probability of missed detection) values you can use to set your detection thresholds? If you're doing a sequential test, do you have some metric for average sample number you need to hit?

6) Determine asymptotics. How do your error exponents work out for your test specifications? Is there a point of diminishing value (e.g., your Pfa is in the neighborhood of 1E-33 but you needed half a day's worth of observations to make your decision)? We generally use asymptotics to determine appropriate batch sizes, as well as to predict error performance for sample sets large enough that we can't directly produce MC trials with those realized error rates.
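The steps above can be sketched for the simplest textbook case: a known constant level in white Gaussian noise, tested on a batch of samples. Everything here (the signal model, the function names, the numbers) is an assumption chosen for illustration, not a description of any real sonar processing chain.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for threshold and Pd

def np_threshold(pfa, sigma, n):
    """Threshold on the sample mean that gives the requested Pfa.

    Assumed hypotheses: H0: x = w,  H1: x = A + w,  w ~ N(0, sigma^2) i.i.d.
    Under H0 the sample mean is N(0, sigma^2 / n).
    """
    return sigma / math.sqrt(n) * STD_NORMAL.inv_cdf(1.0 - pfa)

def theoretical_pd(pfa, amplitude, sigma, n):
    """Probability of detection for the threshold set by np_threshold."""
    gamma = np_threshold(pfa, sigma, n)
    return 1.0 - STD_NORMAL.cdf((gamma - amplitude) * math.sqrt(n) / sigma)

def detect(samples, gamma):
    """Declare 'signal present' when the sample mean exceeds the threshold."""
    return sum(samples) / len(samples) > gamma

def monte_carlo_rate(amplitude, sigma, n, gamma, trials, rng):
    """Fraction of simulated batches on which the detector fires."""
    hits = 0
    for _ in range(trials):
        x = [amplitude + rng.gauss(0.0, sigma) for _ in range(n)]
        hits += detect(x, gamma)
    return hits / trials

rng = random.Random(1)
pfa, amplitude, sigma, n = 0.01, 0.5, 1.0, 25
gamma = np_threshold(pfa, sigma, n)
pfa_mc = monte_carlo_rate(0.0, sigma, n, gamma, 20000, rng)       # noise only
pd_mc = monte_carlo_rate(amplitude, sigma, n, gamma, 20000, rng)  # signal present
print(gamma, pfa_mc, pd_mc, theoretical_pd(pfa, amplitude, sigma, n))
```

The Monte Carlo rates should land near the design Pfa and the closed-form Pd. This check only works because the Pfa is large enough to simulate. For something like 1E-33 you cannot run enough trials, which is where the asymptotics in step 6 come in.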

Anonymous at Tue, 12 Mar 2024 21:31:25 UTC No. 16070762

>>16069588
Very good post. I will think about it and then post some questions.

Anonymous at Tue, 12 Mar 2024 21:32:41 UTC No. 16070765

>>16059442
How do I into Bayesian stats?

Anonymous at Tue, 12 Mar 2024 21:47:45 UTC No. 16070791

>>16059442
Is it actually useful to know statistical theory for your own understanding of how things relate rather than for proving shit to somebody else, like the FDA or a grant-making body? What does one need statistics for, when one can just make a ggplot2 chart and look at the actual relationship shown by the data points? You don't need theory to see things like Simpson's paradox instantly, that's just not being retarded. You don't need theory to fish for tiny signal in noisy data because you would just look for better data (and you don't have to prove the existence of the tiny signal to someone else to justify funding).

Anonymous at Tue, 12 Mar 2024 23:15:50 UTC No. 16070929

>>16070765
Ironically, your best bet might be a machine learning book like Bishop's Pattern Recognition.

Most statisticians aren't as interested in Bayesian statistics as people who need to implement statistics in real practice (e.g., engineers, operations researchers, scientists etc.) and as a result a lot of what is useful in Bayesian inference gets called by other names (e.g., Maximum A Posteriori estimation).
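As a small illustration of what MAP estimation looks like in practice, here is a sketch for the mean of Gaussian data with a Gaussian prior. The function name, the prior, and the data are all made up for the example, and the noise standard deviation is assumed known.

```python
def map_gaussian_mean(data, sigma, mu0, tau):
    """MAP estimate of a Gaussian mean under a Gaussian prior.

    Assumed model (illustrative only):
        prior:       mu ~ N(mu0, tau^2)
        likelihood:  x_i ~ N(mu, sigma^2), sigma known
    The posterior is Gaussian, so its mode (the MAP estimate) is its mean.
    """
    n = len(data)
    prior_precision = 1.0 / tau ** 2
    data_precision = n / sigma ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (mu0 * prior_precision + sum(data) / sigma ** 2)
    return post_mean, post_var

# Hypothetical measurements and a prior centred on zero.
data = [2.1, 1.9, 2.4, 2.0]
mle = sum(data) / len(data)  # plain sample mean, ignores the prior
map_est, post_var = map_gaussian_mean(data, sigma=1.0, mu0=0.0, tau=1.0)
print(mle, map_est, post_var)
```

The MAP estimate is pulled from the sample mean toward the prior mean, and the pull shrinks as more data comes in. That shrinkage is the same idea as a regularized least-squares fit, which is one reason it shows up under other names.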

Anonymous at Tue, 12 Mar 2024 23:20:32 UTC No. 16070935

>>16070791
> Is it actually useful to know statistical theory?

It depends on what you are trying to do. If you just want to see how well one set of data relates to another without really thinking too much about it, maybe a plot is sufficient.

If you want to make an estimate of something you can't directly observe based on a set of correlated things you can observe, statistics is pretty helpful. The same thing for using a set of different observable signals to make a good guess on whether the unobservable object of interest belongs to group A vs. group B.

If you don't need to do estimation, regression, classification or hypothesis testing, you might not need statistical theory very much.

Anonymous at Tue, 12 Mar 2024 23:24:15 UTC No. 16070938

>>16070929
Interesting, thanks anon.
>Bishop's Pattern Recognition
I'll have a look at that.