A couple weeks ago I wrote a short sequencing intro. Here's a bit more, starting with a minor puzzle. I'm working with California wastewater sequencing data (Rothman et al. 2021) and I found a read that was a partial match for HIV:

>SRR14530740.1578405 1578405/2
AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT
GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG

The first 83 bases of the read (everything up to and including CCCTTTTAGTC) are an exact match for this section near the beginning of the HIV genome:

>AF033819.3 HIV-1, complete genome
GGTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTAACTA
GGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGT
AGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGAC
CCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGAC
...
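
If you want to check the length of the match yourself, one rough way is to look for the longest prefix of the read that appears in the reference. This is a minimal sketch, not how real pipelines do it (they use dedicated aligners): `longest_prefix_match` is a hypothetical helper, and `REFERENCE` is only the excerpt of AF033819.3 shown above, not the full genome.

```python
# Excerpt of AF033819.3 as shown above (the full genome is much longer).
REFERENCE = (
    "GGTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTAACTA"
    "GGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGT"
    "AGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGAC"
    "CCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGAC"
)

# The reverse read, SRR14530740.1578405/2.
READ_2 = (
    "AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT"
    "GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT"
    "GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG"
)


def longest_prefix_match(read, reference):
    """Return (length, position) for the longest prefix of `read` in `reference`.

    `position` is the 0-based offset of the first occurrence of that prefix,
    or -1 if not even the first base of the read appears in the reference.
    """
    best_len, best_pos = 0, -1
    for n in range(1, len(read) + 1):
        pos = reference.find(read[:n])
        if pos == -1:
            break  # a longer prefix cannot match if this one does not
        best_len, best_pos = n, pos
    return best_len, best_pos


length, pos = longest_prefix_match(READ_2, REFERENCE)
print(f"{length} of {len(READ_2)} bases match, at reference offset {pos}")
```

This only considers matches anchored at the start of the read and tolerates no mismatches, which happens to be enough for this example.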

This dataset was sequenced with paired-end reads, which means there's more information we can get on this particular genetic snippet. This was the 'reverse' read, so let's look at the corresponding forward read:

>SRR14530740.1578405 1578405/1
GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT
CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG

Paired-end reads are sequenced in opposite directions, working towards each other:

   forward read -->
5' --------------------------------------------- 3'
   |||||||||||||||||||||||||||||||||||||||||||||
3' --------------------------------------------- 5'
                                <-- reverse read

Because they're reading in different directions, you need to reverse one of the reads to compare them, and since they're reading complementary strands you also need to take its genetic complement (swapping A with T and C with G). Here's the reverse complement of the forward read, to match the reverse read we were already looking at:

>SRR14530740.1578405 1578405/1, reverse complement
CAAGCAGAAGACGGCATACGAGATTAATGGCAAGGTCTCGTGGGCTCGGAG
ATGTGTATAAGAGACAGAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCC
GTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTC
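
Taking a reverse complement is a couple of lines in most languages. Here's a small Python sketch; the names `COMPLEMENT` and `reverse_complement` are mine, and it assumes the read contains only A, C, G, T, and N:

```python
# Map each base to its complement; N (unknown base) stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


# The forward read, SRR14530740.1578405/1.
READ_1 = (
    "GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG"
    "GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT"
    "CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG"
)

READ_1_RC = reverse_complement(READ_1)
print(READ_1_RC)
```

Applying `reverse_complement` twice gives back the original read, which is a handy sanity check.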

Most commonly your paired-end reads go together like:

[read 1] [gap you didn't sequence] [read 2]

In this case, however, they overlap, allowing us to assemble a larger sequence. Sometimes you might have a read error, where the two don't perfectly match, but we're lucky here and there's no disagreement:

CAAGCAGAAGACGGCATACGAGATTAATGGCAAGGTCTCGTGGGCTCGGAG
ATGTGTATAAGAGACAGAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCC
GTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTCCT
GTCTCTTATACACATCTGACGCTGCCGACGACCTTCGTGATGTGTAGATCT
CGGGGGGCGGCGGGG
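
The assembly above can be reproduced by looking for the longest stretch where the end of one sequence exactly matches the start of the other. This is only a sketch under that assumption: `merge_by_overlap` and its `min_overlap` parameter are hypothetical, and real read mergers also tolerate mismatches and use base quality scores to resolve them.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_by_overlap(left, right, min_overlap=10):
    """Join `left` and `right` on their longest exact suffix/prefix overlap.

    Returns (merged, overlap_length), or (None, 0) if the end of `left`
    never matches the start of `right` for at least `min_overlap` bases.
    """
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:], k
    return None, 0


READ_1 = (
    "GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG"
    "GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT"
    "CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG"
)
READ_2 = (
    "AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT"
    "GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT"
    "GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG"
)

# Put the forward read in the same orientation as the reverse read first.
merged, overlap = merge_by_overlap(reverse_complement(READ_1), READ_2)
print(f"overlap: {overlap} bases, merged length: {len(merged)}")
```

Trying the longest candidate overlap first means a short chance match at the very ends can't win over the real one.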

Now here's the puzzle: the overlapping portion of the two happens to be exactly the sequence that matches HIV. This isn't something you'd expect to see by chance, right? How much overlap (or distance) you get between two sequences should be unpredictable. So, why is this happening?

In the sequencing process, your input DNA fragment gets more bits of DNA ("adapters") stuck on its ends, to allow the sequencer to manipulate it. At the beginning (5' end) of the target sequence this works well: sequencing uses the adapter to determine where to start the read, which then will nearly always start with the first base of your original fragment. If your initial fragment is shorter than the read length, however, the read will run past the end of the original sequence and into the adapter. Illumina has some documentation with figures explaining the process.

Here are the original reads again. The portion immediately following the HIV match is the same in both, CTGTCTCTTATACACATCT:

>SRR14530740.1578405 1578405/2
AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT
GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG
>SRR14530740.1578405 1578405/1
GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT
CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG

For the kit used in this paper, this sequence is the start of the adapter, and nothing from this point on is part of our input fragment.

With the adapters removed, we're left with just:

>SRR14530740.1578405 1578405/2
AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
GGTAACTAGAGATCCCTCAGACCCTTTTAGTC
>SRR14530740.1578405 1578405/1
GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
GCACACACTACTTGAAGCACTCAAGGCAAGCT

This is now an exact match for HIV, without any junk at the end. Most quality control pipelines contain a step where you remove adapters, just like they remove the poly-G sequences I described last time.
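
To make the trimming step concrete, here's a minimal sketch. It assumes the adapter starts with CTGTCTCTTATACACATCT, the 19 bases that follow the HIV match in both reads above; `trim_adapter` and `min_partial` are hypothetical names, and this version only handles exact matches. Dedicated trimming tools also allow for sequencing errors inside the adapter.

```python
# Start of the adapter, read off from the two reads above.
ADAPTER = "CTGTCTCTTATACACATCT"

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def trim_adapter(read, adapter=ADAPTER, min_partial=5):
    """Drop the adapter and everything after it from `read`.

    Exact matching only. If the full adapter is absent, also check whether
    the read ends partway into the adapter (at least `min_partial` bases).
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    for n in range(len(adapter) - 1, min_partial - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read  # no adapter found; leave the read alone


READ_1 = (
    "GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG"
    "GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT"
    "CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG"
)
READ_2 = (
    "AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT"
    "GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT"
    "GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG"
)

trimmed_1 = trim_adapter(READ_1)
trimmed_2 = trim_adapter(READ_2)
print(len(trimmed_1), len(trimmed_2))
print(reverse_complement(trimmed_1) == trimmed_2)
```

After trimming, the two reads are reverse complements of each other: both cover the whole original fragment and nothing else.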
