/sci/ wtf computers cant calculate bio-info

1217x743

evolution-mtcox.jpg

Anonymous at Wed, 13 Mar 2024 11:52:27 UTC No. 16071858

it seems like nobody knows?
you just can't align complete genomes with a computer:
1. human genome (3 gigabytes of strings of A,T,C and G)
2. mouse genome (2.7 gigabytes of the same)

if you try to examine how much these two have in common, you must do it by hand and there must be thousands of scientists involved and it takes a decade (it has been done)

if you put both in a computer and try to do an alignment of them it looks like this:
AAAATTG (human)
ACAATTG (mouse)
the computer gradually finds the best way to line them up, so you can quantify where in the genome they differ and by how much (like one gene could be 99% identical and some other only 80%)
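
For readers who want to see what "finding where they align" means on the two toy strings above, here is a minimal sketch of textbook global alignment (Needleman-Wunsch). The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions picked for illustration, not what any particular genome tool uses.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(x, y):
    """Share of alignment columns where both strings carry the same base."""
    same = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * same / len(x)


human, mouse = global_align("AAAATTG", "ACAATTG")
print(human)
print(mouse)
print(round(percent_identity(human, mouse), 1))
```

The table has one cell per pair of positions, which is fine for 7 letters and is exactly what stops scaling when both inputs are billions of letters long.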

computers cannot do this if you input 3 gigabytes for one string and 2.7 gigabytes for other

it takes decades to calculate by a computer and it will run out of memory
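
A rough back-of-envelope check of that claim, assuming the naive approach of filling one table cell per pair of positions (the textbook dynamic-programming alignment). The one-byte-per-cell and one-billion-cells-per-second figures are assumptions for illustration, not benchmarks.

```python
human_bases = 3_000_000_000   # OP's figure
mouse_bases = 2_700_000_000   # OP's figure

cells = human_bases * mouse_bases   # one table cell per pair of positions
bytes_needed = cells                # assuming 1 byte per cell (optimistic)
cells_per_second = 1_000_000_000    # assumed throughput, not a benchmark

seconds = cells / cells_per_second
years = seconds / (365 * 24 * 3600)

print(f"cells:  {cells:.2e}")
print(f"memory: {bytes_needed / 1e18:.1f} exabytes")
print(f"time:   {years:.0f} years on one core")
```

Under these assumptions the full table is on the order of exabytes and centuries, so the naive method really is out of reach. That says nothing about heuristic methods that avoid filling the whole table.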

while this looks like /sci/ its actually more like /g/ stuff

because: there are no suitable programs (where are the programmers?). do you expect a biologist to code his own software? he is not a computer scientist

a biologist should work together with a coder to create something great but it has not happened so far

Anonymous at Wed, 13 Mar 2024 12:33:27 UTC No. 16071902

>>16071858
>do you expect a biologist codes his own software?
No, it'll be a collaboration between dozens (maybe hundreds) of different researchers. Some will be biologists, some bioinformaticians (who could come from any background really) and probably more.

Cult of Passion at Wed, 13 Mar 2024 12:36:47 UTC No. 16071907

>>16071858
You do fascinating work, man, it's approaching this from an angle I can't so it's like seeing behind a door I don't have the key to.

Anonymous at Wed, 13 Mar 2024 12:44:37 UTC No. 16071922

>>16071858
>while this looks like /sci/ its actually more like /g/ stuff
LMAO, this is more /math/ (error 404) than anything else.

Anonymous at Wed, 13 Mar 2024 12:51:00 UTC No. 16071932

>>16071858
This is a common problem. Google can probably do it on their servers but you would have to pay them enough money

Anonymous at Wed, 13 Mar 2024 14:23:08 UTC No. 16072076

>>16071858
OP create a $2M grant and I'll jump on it

oh, you want it for free for your lab in pakistan? no thanks

Anonymous at Wed, 13 Mar 2024 14:26:15 UTC No. 16072079

>>16071858
The problem is probably NP-complete or maybe NP-Hard.

Anonymous at Wed, 13 Mar 2024 14:28:32 UTC No. 16072082

Is this a bot? Why does this get spammed?

Anonymous at Wed, 13 Mar 2024 14:31:30 UTC No. 16072087

"The individual in the screenshot is discussing the challenges associated with aligning and comparing complete genomes—specifically, the human and mouse genomes. They are addressing the technical difficulty of computational genome alignment, which involves comparing the sequences of nucleotides (represented by the letters A, T, C, and G) to identify regions of similarity and difference.

They seem to be under the impression that there's no existing software capable of handling such a large amount of genetic data without manual intervention, which they suggest would take thousands of scientists decades to complete. Additionally, they touch upon the concern that computational resources might be insufficient for such tasks, leading to issues like memory overload.

However, this information is outdated or incorrect. Genome alignment and analysis is a common practice in bioinformatics, and there are several advanced computational tools and algorithms designed specifically for these purposes, such as BLAST, Bowtie, and BWA. These tools can handle large quantities of genetic data and perform alignments relatively quickly, given adequate computational resources.

The mention of collaboration between biologists and computer scientists (or programmers) is accurate. Interdisciplinary collaboration is crucial in the field of bioinformatics, where complex biological data is analyzed using computational methods. Teams often consist of individuals with expertise in biology, computer science, mathematics, and related fields. They work together to create software and algorithms that can efficiently process and analyze vast biological datasets like genomes."
>t. ChatGPT 4

Anonymous at Wed, 13 Mar 2024 14:52:52 UTC No. 16072112

>>16072076
I can give you 10 000 monies

Anonymous at Wed, 13 Mar 2024 14:57:56 UTC No. 16072126

>>16072087
>Genome alignment and analysis is a common practice in bioinformatics, and there are several advanced computational tools and algorithms designed specifically for these purposes, such as BLAST, Bowtie, and BWA.

You cant handle 3GB and 2.7GB

You must indiscriminately split them into smaller and smaller segments, let's say you take 10 megabytes from the start of the human genome and then you take 9 megabytes from the start of the mouse genome (due to the fact that human is slightly bigger anyway) and then you hope the start of the genome is very much alike in both, you will probably find areas which align well

BUT you have no way of knowing if somewhere in the middle of genome there is a jumped gene in mouse and in a human its entirely in a different place

Human and mouse genomes were already completed but it took 10+ years from hundreds or maybe even a thousand scientists, they used computers of course but computers were quite slow in let's say year 2002
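
The "jumped gene" worry above is usually handled by indexing short exact substrings (k-mers) rather than by assuming matching regions sit at the same offset. A minimal sketch of that idea follows; the function names, the toy sequences and the choice of k are all made up for illustration.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map every length-k substring of seq to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def shared_seeds(a, b, k):
    """Return (pos_in_a, pos_in_b) for every k-mer that occurs in both."""
    index = kmer_index(a, k)
    hits = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            hits.append((i, j))
    return hits


# Toy data: the same 8-letter "gene" sits at the start of one sequence
# and at the end of the other.
human = "GATTACAG" + "TTTTTTTTTT"
mouse = "CCCCCCCCCC" + "GATTACAG"
print(shared_seeds(human, mouse, k=8))
```

Each hit is a candidate anchor that a slower, exact alignment can then be run around, no matter where in either sequence it was found.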

Anonymous at Wed, 13 Mar 2024 15:14:57 UTC No. 16072153

>>16071858
>1. human genome (3 gigabytes of strings of A,T,C and G)
is that 100% full sequence?

Anonymous at Wed, 13 Mar 2024 15:26:13 UTC No. 16072169

>>16072153
yes entire complete human genome is 3GB its not bigger than that
it fits on one DVD disc

Anonymous at Wed, 13 Mar 2024 15:46:11 UTC No. 16072205

>>16072169
>it fits on one DVD disc
Pretty sure it fits on one cell.

Anonymous at Wed, 13 Mar 2024 15:51:51 UTC No. 16072222

>>16072126
Are you retarded? Do you even know how computers work

Anonymous at Wed, 13 Mar 2024 16:48:49 UTC No. 16072336

this is trivial work for a cray-cray

Anonymous at Wed, 13 Mar 2024 16:53:34 UTC No. 16072347

>>16072336
even a home computer CPU is fast enough

but nobody has written a program that can do this

Anonymous at Wed, 13 Mar 2024 16:53:57 UTC No. 16072348

>>16071858
Why hasn't it been done? Is it a size problem? Sounds like a program that you could easily write in Python...

Anonymous at Wed, 13 Mar 2024 16:57:39 UTC No. 16072354

>>16072348
biologists are retards who can't into algorithmic thinking, let alone software programming.

Anonymous at Wed, 13 Mar 2024 17:03:07 UTC No. 16072362

There's a concept called genetic distance, with many variations.
The idea is to count how many individual changes you need to turn one chain into the other, say by substituting nucleotides or by adding new ones. This gives an idea of how many mutations would be required to turn one chain into another.
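
One common way to make that count concrete is edit (Levenshtein) distance: the minimum number of single-letter substitutions, insertions and deletions. This is a sketch of that one variant only, and the function name is just for illustration.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # Keep only the previous row, so memory is O(len(b)) instead of
    # O(len(a) * len(b)). The running time is still O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("AAAATTG", "ACAATTG"))
```

Keeping two rows removes the memory problem but not the time problem, since every pair of positions is still visited once.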

Anonymous at Wed, 13 Mar 2024 17:05:12 UTC No. 16072365

>>16072169
try compressing it, curious how small it can get.
>>16072205
kek
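
On the compression question: with only four letters, each base fits in 2 bits, so four bases fit in one byte and 3 billion bases would take about 750 MB before any real compression. A minimal sketch, assuming the input contains only uppercase A, C, G and T (real genome files also contain other characters such as N, which this toy encoding does not handle):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, length):
    """Inverse of pack; length is needed because the last byte may be padded."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3]
                   for i in range(length))


seq = "AAAATTGACC"
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
```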

Anonymous at Wed, 13 Mar 2024 17:36:14 UTC No. 16072425

>>16072362
now do that between the entire human and mouse genomes to see where it leads

Anonymous at Wed, 13 Mar 2024 18:27:14 UTC No. 16072490

>>16072425
I fucking hate mathlets with all my heart. All you needed to say in your first post is that you have two datasets of 3GB and you have to combinatorially compare them.
But no, tens of sentences of word salad.

Anonymous at Wed, 13 Mar 2024 18:49:09 UTC No. 16072538

>>16072490
there still is no software for it

Anonymous at Wed, 13 Mar 2024 18:57:48 UTC No. 16072557

>>16072354
>>16072348
probably because it doesn't matter. What's the use of comparing the entire genome when you can just as easily compare parts. Imagine having supercomputers with easily a few hundred gb of RAM and maybe even 64+ gb if VRAM and still being unable to read a text file consisting of literally 4 different letters. OP is a fag and a liar. It's just that no one has taken the effort to write a program and execute it on a supercomputer.

Hell if I were to be set to this task I wouldn't bother using the entire dataset. That's retarded and slow. Break the genome up in like 10mb files or maybe a little larger. That's easily loaded by any local computer and you can still do a comparison. Gotta handle edge cases well but maybe you could separate only after a codon and then you won't separate a protein into two files. But even then it'd be easy to just open 2 10mb files. Much easier than opening a fucking 2gb text file. Ever tried that in vs code? takes ages and it'll probably crash because your local pc can't handle it. Simply put there's no reason to do it in this special way.
Basically it's the same reason why the human genome is split over 24 chromosomes. Separating the information makes it easier to use it rather than having one huge bulky thing.
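
One way to handle the edge case mentioned above is to let neighbouring pieces overlap, so a match that straddles a cut still appears whole in at least one piece. A sketch with made-up sizes; the function name and parameters are assumptions for illustration.

```python
def chunks_with_overlap(seq, size, overlap):
    """Yield (start, piece); consecutive pieces share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    start = 0
    while True:
        yield start, seq[start:start + size]
        if start + size >= len(seq):
            break
        start += step


for start, piece in chunks_with_overlap("ACGTACGTAC", size=4, overlap=2):
    print(start, piece)
```

With this layout any stretch of up to `overlap` characters is fully contained in some piece, at the cost of comparing the overlapping parts twice.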

Anonymous at Wed, 13 Mar 2024 18:59:03 UTC No. 16072559

>>16072538
it's impossible to have software for it

Anonymous at Wed, 13 Mar 2024 19:06:20 UTC No. 16072575

>>16072559
no it isn't. Any Linux supercomputer can open even a 10gb txt file easily. And then compare the letters in said file with another 10gb txt file. It's not hard. You could write the code in a single day.
>>16072490
>>16072362
but like this guy is already describing. even if you load the data. Taking permutations over such a huge txt file. It's going to take a fuck ton of time. Yeah it's possible and if you really did it properly you would multithread it and it'll be efficient. But all that takes some software skills whereas you could also just separate the file into hundreds of smaller files and then your program would be a lot shorter and easier and you could just run it locally or anywhere for that matter.
Simply put it doesn't make sense to use the whole file if you just don't need it.
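
A caveat on "compare the letters in said file with another": a plain position-by-position comparison stops being meaningful as soon as one sequence has a single inserted or deleted letter, because everything after it is shifted. A toy sketch (the sequences and the function name are made up for illustration):

```python
def positionwise_identity(a, b):
    """Fraction of positions where a[i] == b[i], over the shorter length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    same = sum(1 for i in range(n) if a[i] == b[i])
    return same / n


original = "ACGTTGCAGGCTAAC"
shifted = "T" + original   # same sequence with one base inserted at the front

print(positionwise_identity(original, original))
print(positionwise_identity(original, shifted))
```

The second number comes out far lower than the first even though the sequences differ by one letter, which is why alignment has to allow gaps instead of comparing offset to offset.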

Anonymous at Wed, 13 Mar 2024 19:07:17 UTC No. 16072578

>>16072575
you literally said you need to compare all the different variations between the two genomes you fucking idiot this is basically impossible in the lifetime of the universe

Anonymous at Wed, 13 Mar 2024 19:14:24 UTC No. 16072599

>>16071858
two questions for you:
1) How many characters long is 3 gigabytes worth of text?
2) where is the start of the genome (any genome)?

Anonymous at Wed, 13 Mar 2024 19:24:50 UTC No. 16072633

>>16072599

1) it will be 3 billion characters, you see these gene data files are plain text files (they could be in some packaged format, usually a linux format, but before they are used they will be unpacked into plain UTF text)

2) there is no universal rule for that

Anonymous at Wed, 13 Mar 2024 19:44:37 UTC No. 16072687

>>16072578
No it's not. Just use more computers lol. 3gb / 10mb is like 300 files. So take the first file of the man dataset and compare it to the 300 files of the mouse dataset.
And yeah it's going to be slow, but it's doable if you use a good server farm and have the money to run it all and the time to set up an efficient process. The hard part really is to know when you should compare which file with which file but that's just some smart software stuff and pretty doable. Others have done it before for other things. I honestly don't even understand your issue. OP wanted to align the genomes. This has already been done by splitting the files into smaller files. It's just not been done by directly comparing the two huge files because as I've said it's silly to do it like that because it's a lot of effort programmatically. Also the multi-tasking becomes harder if it's a single file.

Anonymous at Wed, 13 Mar 2024 20:38:57 UTC No. 16072823

>>16072633
1) You see nucleic acid length is expressed in nucleotides (nt), base pairs (bp) or bases (b), not in the amount of memory necessary to store a file containing the sequence.

2) > lets say you take 10 megabytes from the start of human genome
you propose to compare the starts of two different genomes while also saying that there is no universal rule to assign a start.

You couldn't be bothered to look up the unit used to express genome size, you have nothing to say.

Anonymous at Wed, 13 Mar 2024 20:56:46 UTC No. 16072849

>>16072823

>you propose to compare the starts of two different genomes while also saying that there is no universal rule to assign a start.

that's why it makes most sense to compare the entire 3GB of human to the 2.7GB of mouse, then we don't even need to know anything about where the start is exactly

Anonymous at Wed, 13 Mar 2024 20:58:18 UTC No. 16072854

>>16072823
>You see nucleic acid length is expressed in nucleotides (nt), base pairs (bp) or bases (b), not in the amount of memory necessary to store a file containing the sequence.

3GB file is made from 3 billion characters
there is no more and no less

Anonymous at Wed, 13 Mar 2024 23:01:07 UTC No. 16073036

>>16072222
The quads have spoken

Anonymous at Wed, 13 Mar 2024 23:08:46 UTC No. 16073047

>>16071858
Is this the retard who keeps making threads and posts talking about evolutionary relationships and getting it completely wrong? Like the thread about calcium in mollusc shells being evolutionarily linked to calcium teeth in mammals. Or in the recent insect larva thread where he claimed there was no genetic reason to suggest silverfish were basal insects, before posting a phylogenetic tree that showed they were exactly that

Anonymous at Wed, 13 Mar 2024 23:19:23 UTC No. 16073072

>>16071858
how about you link these dna databases? you academia cunts are gatekeeping everything

Anonymous at Thu, 14 Mar 2024 04:17:52 UTC No. 16073450

>>16073047

>Or in the recent insect larva thread where he claimed there was no genetic reason to suggest silverfish were basal insects, before posting a phylogenetic tree that showed they were exactly that

Nah I believe it was someone else, he merely took an evolutionary tree where the silverfish is, to prove his point, but he interpreted it wrong due to not actually being a biologist

The tree he used was likely from the weird evolutionary biologist who has posted lots of these trees onto /sci/ from time to time

As for which thread the tree with the silverfish originally came from, who knows

Anonymous at Thu, 14 Mar 2024 04:19:42 UTC No. 16073453

>>16073072

human genome in a .gz package (I don't know how it's opened in Windows OS)

https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz
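
A .gz file like this can be read without unpacking it first. Python's standard `gzip` module streams it line by line on any OS, so the whole genome never has to sit in memory at once. A minimal sketch, assuming the file was downloaded to a local path (the path below is hypothetical) and is in FASTA format, where header lines start with `>`:

```python
import gzip
from collections import Counter


def count_bases(path):
    """Stream a gzipped FASTA file and count sequence characters."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:   # "rt" = decompress and read as text
        for line in handle:
            if line.startswith(">"):        # FASTA header line, not sequence
                continue
            counts.update(line.strip().upper())
    return counts


# Hypothetical usage, after downloading the file linked above:
# counts = count_bases("GRCh38_latest_genomic.fna.gz")
# print(sum(counts.values()), "sequence characters")
```

Summing the counts gives the sequence length in characters, which is the number the thread keeps arguing about.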

Anonymous at Thu, 14 Mar 2024 04:38:06 UTC No. 16073475

>>16072575
>but like this guy is already describing
Im not "some guy". Fuck you

255x180

IMG_3456.gif

Anonymous at Thu, 14 Mar 2024 12:18:11 UTC No. 16074316

Sometimes I can’t believe we don’t have a way to filter undergrads from posting here. Why the fuck would you want to align an entire genome that isn’t a small bacteria or virus you retard nigger faggot most genomes have lots of junk. You wanna use 10000gb to align transposable elements and repeats? We already have programs that can make alignments and gene trees out of all the orthologous coding regions and we have programs that can make a species tree from it. I do bioinformatics for a living we have people who do algorithms for assembly and alignment (not me). Have you ever taken a genomics or bioinformatics course? Your curiosity is good to see but you ask retarded questions and have even more retarded answers

Anonymous at Thu, 14 Mar 2024 12:19:50 UTC No. 16074327

>>16074316
>Sometimes I can’t believe we don’t have a way to filter undergrads from posting here

why should it be filtered

Anonymous at Thu, 14 Mar 2024 12:20:53 UTC No. 16074333

>>16074316

>We already have programs that can make alignments and gene trees out of all the orthologous coding regions and we have programs that can make a species tree from it

So what, they can handle merely 1 megabit per species and when you put 60 species in it, even that becomes impossible

Anonymous at Thu, 14 Mar 2024 12:27:05 UTC No. 16074380

>>16072849
>it makes most sense to compare entire 3GB of human to the 2.7GB of mouse
What do you mean exactly when you say "compare". To my understanding the genetic information is encoded as a massive string? Or several different strings. What kind of comparison are you looking for? Exact locations where one differs from the other? How does that work exactly when you said theres no clear start or end? And how about the different lengths? Do you stop comparing at the end?

Anonymous at Thu, 14 Mar 2024 12:32:41 UTC No. 16074432

>>16074380

>What kind of comparison are you looking for?

A proper way to position human, mouse, lizard, bird, some fish and a urochordate into a tree of life to get some clue on how many hundreds of millions of years have passed since their divergence from each other

Mitochondria only comparisons may not tell the whole story at all

1552x675

alignmentsoftware.jpg

Anonymous at Thu, 14 Mar 2024 12:37:22 UTC No. 16074482

>>16074380
>And how about the different lengths? Do you stop comparing at the end?

Check this out

679x807

IMG_3316.jpg

Anonymous at Thu, 14 Mar 2024 12:38:00 UTC No. 16074487

>>16074327
So I don’t accidentally become stupider through sharing a board with subhumans who make posts like this
>>16074380
You’re right, but like I said before, we can determine from translation tables where the coding domain that gets transcribed begins and ends. Sometimes there is valuable information that is not in a coding region, but that still cannot be analyzed by aligning entire species genomes unless they are small

>>16074333
So you want a program that can align whole genomes across the entire tree of life? What purpose does that serve? You will get incredibly biased results because of differences in genome size. For example, some flies have only four chromosomes, while closely related ones have 8. If you aligned their whole genomes, it’s going to show them as being very unrelated. You have to look at things that translate to phenotypic traits and are subject to similar forces of evolution like brownian motion that can be simulated with maximum parsimony or MCMC chains

Anonymous at Thu, 14 Mar 2024 12:44:16 UTC No. 16074528

>>16074487
>If you aligned their whole genome, it’s going to show them as being very un-related

Nobody actually knows this for certain because no program can do it, the files are too large

Anonymous at Thu, 14 Mar 2024 12:50:11 UTC No. 16074559

>>16074487
>You’re right
About what? I only asked questions. I didn't make any statements

Anonymous at Thu, 14 Mar 2024 12:59:33 UTC No. 16074643

>>16074559
well you arent wrong

Anonymous at Thu, 14 Mar 2024 14:39:00 UTC No. 16076524

>>16074482
Cool

Anonymous at Thu, 14 Mar 2024 14:55:38 UTC No. 16076714

>>16071858
Thanks for answering my question (from when I got pissed at this) about red/grey squirrel differences

Anonymous at Thu, 14 Mar 2024 15:07:31 UTC No. 16076799

>>16071858
The fields of bioinformatics and computational biology are dedicated to solving the problems you're talking about.

Anonymous at Thu, 14 Mar 2024 15:33:30 UTC No. 16076921

>>16076799
yes it is, now go create an algorithm that is capable of taking in real big single strings of data

Anonymous at Thu, 14 Mar 2024 15:40:34 UTC No. 16076968

>>16074316
Joke's on you, it's been found that junk DNA wasn't junk after all. Get with the times old man