Sequencing Intro

This is really neat. Thanks for helping me build a technical base (eyyyy) for understanding my partner's work.

One thing that wasn't clear to me is: if you can only sequence 150 bases at a time, how do you build a complete picture of a genome? One might come away from your post thinking that next gen sequencing is only useful for getting the edges of the genome, say for simple identification purposes. Based on "chopped up your fragments" and my preexisting knowledge, I expect you do some sort of chopping and then reassembly, but I'd be curious to learn more.

Another thing I'd be excited about reading in a follow-up is even the basics of the technical hurdles of sequencing so much stuff in wastewater and how the NAO intends to overcome it, explained in jefftk-style.

[-]tgb3y50

Yeah, you break the big strand up randomly and sequence the starts and ends of each of the little fragments. Then you have to put all the little bits together, which is called either assembly (if you don’t have anything else to work off of) or alignment (if you’re just comparing your sequenced data to an existing reference genome). Both are computationally nontrivial, though alignment at least is well solved these days through tools like bwa or STAR. For something like a wastewater survey you do “metagenomics”, since the sample contains many bacterial and viral genomes.

On its own, this process wasn’t really enough to get a complete picture of the genome. The reads were too short, so some parts of the genome were hard to read: anything that is highly repetitive can’t be assembled accurately, since it all looks the same. The recent “T2T” genome is really the first complete human genome we have, even though it arrived about two decades after the Human Genome Project finished. But the earlier reference was good enough for most things. Actually the very start and end of the chromosomes, the telomeres, were some of the hardest parts to sequence, hence the name “telomere-to-telomere” for the new build. Older builds have like a million “N”s at the start and end of every chromosome, denoting unknown sequence (any of A, C, T, or G is possible) in the telomeres.
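To make the repeat problem concrete, here is a small Python sketch (not taken from any real aligner). It lists every position where a read matches a reference exactly. The `find_all` helper and the toy reference are made up for illustration; the only real-world detail is that TTAGGG is the repeat unit of human telomeres.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping matches are found too.
        start = reference.find(read, start + 1)
    return positions


# Hypothetical reference: a unique flank, five telomere-style repeats, a unique flank.
reference = "ACGT" + "TTAGGG" * 5 + "CCATG"

unique_read = "GTTTAG"        # straddles the left flank and the first repeat
repeat_read = "TTAGGGTTAGGG"  # lies entirely inside the repeat

print(find_all(reference, unique_read))
print(find_all(reference, repeat_read))
```

The read that touches the unique flank has a single placement, while the read that sits entirely inside the repeat fits in several places equally well, so short reads alone cannot say how long the repeat really is.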

[-]Tapatakt3y30

>how do you build a complete picture of a genome

You put a lot of identical DNA strands in a sequencer and use an algorithm that identifies overlaps and restores the strand sequence.

Toy example: if you have reads "GCTTTAGCCCCTAGG" and "TAGCCCCTAGGAATCC", there is probably "GCTTTAGCCCCTAGGAATCC" in the original DNA. Of course, real reads and overlaps are usually much longer.
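The toy example can be written out as a few lines of Python. This is only a sketch of the overlap idea, assuming exact matches and just two reads; real assemblers handle sequencing errors, reverse complements, and millions of reads, and use very different data structures. The `merge_reads` name and the `min_overlap` parameter are made up for illustration.

```python
from typing import Optional


def merge_reads(left: str, right: str, min_overlap: int = 5) -> Optional[str]:
    """Merge two reads if a suffix of `left` exactly equals a prefix of `right`.

    Tries the longest possible overlap first and returns None when no
    overlap of at least `min_overlap` bases exists.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            # Keep all of `left`, then append the part of `right` past the overlap.
            return left + right[size:]
    return None


merged = merge_reads("GCTTTAGCCCCTAGG", "TAGCCCCTAGGAATCC")
print(merged)
```

Here the two reads share the 11 bases "TAGCCCCTAGG", so the merge reproduces the 20-base sequence from the toy example.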

(Epistemic status: studied bioinformatics for 1 year, in 2016-2017, forgot a lot)

$ head -n 4 SRR14530724_1.fastq
@SRR14530724.1 1/1
GTTGTTATCCTGCGCGGCGACGCCAGCCTTGCTCAATTGCTCGAGCAGGGCCTGTCTCTTATACACATCTCCGAGCCCACGAGACAGGTCAGATAATCTCGTATGCCGTCTTCTGCTTGAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFF:FF,FFF::FF::F::::,FFFFFFFFFFFFFFFFFFFFFFFFFFF

Phred character	Numeric Value	Accuracy
F	37	99.98%
:	25	99.7%
,	11	92%
#	2	37%
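The table's numbers can be reproduced with a little arithmetic. A minimal Python sketch, assuming the common Phred+33 encoding (ASCII code minus 33, which matches the values in the table) and the standard relation that a score Q corresponds to an error probability of 10^(-Q/10). The function names are made up for illustration.

```python
def phred_score(char: str, offset: int = 33) -> int:
    """Convert a FASTQ quality character to its numeric Phred score.

    Assumes Phred+33 encoding: the score is the ASCII code minus 33.
    """
    return ord(char) - offset


def accuracy(score: int) -> float:
    """Probability that the base call is correct: 1 - 10^(-Q/10)."""
    return 1 - 10 ** (-score / 10)


for char in "F:,#":
    q = phred_score(char)
    print(f"{char}\t{q}\t{accuracy(q):.2%}")
```

Up to rounding, this prints the same character, score, and accuracy columns as the table.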


>SRR14530724.1 1/1
GTTGTTATCCTGCGCGGCGACGCCAGCCTTGCTCAATTGCTCGAGCAGGGCCTGTCTCTTATACACATCTCCGAGCCCACGAGACAGGTCAGATAATCTCGTATGCCGTCTTCTGCTTGAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>SRR14530724.2 2/1
CATTTTCGACGGCGTCGATGTACAAAGGTTATACCATAGTAAGTCCGAAGCTACAGGCTTATGACACCGCAGAGTCAATGTATTCCGGTGACAATGTACTGATGTACAGTGGGACTGACACTGTCTCTTATACACATCTCCGAGCCCACGA
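FASTA records like these carry only the read name and the bases, with the quality scores dropped. A minimal Python sketch of that conversion, assuming well-formed FASTQ input where every record is exactly four lines (name, bases, "+", qualities). The `fastq_to_fasta` function and the tiny `example` record are made up for illustration and are not the tool the post used.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines (">name", then bases) from 4-line FASTQ records."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, sequence, _plus, _quality = record
            # Swap the leading "@" for ">" and drop the "+" and quality lines.
            yield ">" + header[1:]
            yield sequence
            record = []


# Hypothetical one-read input, not real data.
example = ["@read1 1/1\n", "GATTACA\n", "+\n", "FFFFFFF\n"]
print("\n".join(fastq_to_fasta(example)))
```

Because it takes any iterable of lines, the same function can be fed an open file handle instead of a list.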
