# Bioinformatics Lab 1

## Part A - seq function

a)  Create a vector where the first element is 1, the last element is 33, with an increment of 2 between elements.
b)  Create a vector with 15 equally spaced elements in which the first element is 7 and the last element is 40.
c)  Use the sample function to create a vector with variable name my.dna that consists of 20 uniformly-random letters “A”, “C”, “G”, and “T”.
d)  Use the == logic operator and other R functions on your my.dna variable to determine how many of the letters are “A”. Hint: you can use sum on a TRUE/FALSE vector or you can use the functions which and length.
e)  Confirm your answer from d) with table(my.dna). From the output of table, create a pie chart and a barplot. Add x and y labels to your barplot.
f)  Use the sample function with the option prob=c(.1,.4,.4,.1) to create a vector with variable name my.dna2 that consists of 20 non-uniformly random letters “A”, “C”, “G”, and “T”. Use table to show the nucleotide counts.
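
The functions named in Part A can be sketched as follows. This is a minimal illustration rather than a model answer: the fixed seed, the `nucleotides` helper vector, and the `factor()` wrapper (which keeps letters with a zero count in the table) are assumptions that the exercise does not ask for.

```r
# seq(): fix the step with `by`, or fix the number of elements with `length.out`
odds   <- seq(from = 1, to = 33, by = 2)
spaced <- seq(from = 7, to = 40, length.out = 15)

# sample(): replace = TRUE is needed to draw 20 letters from a pool of only 4
set.seed(1)  # assumption: fixed seed so the draw is reproducible
nucleotides <- c("A", "C", "G", "T")
my.dna <- sample(nucleotides, size = 20, replace = TRUE)

# == returns a logical vector; sum() counts each TRUE as 1
n.a.sum   <- sum(my.dna == "A")
n.a.which <- length(which(my.dna == "A"))

# table() tallies every letter at once
dna.counts <- table(factor(my.dna, levels = nucleotides))

# Non-uniform draw: the entries of prob follow the order of `nucleotides`
my.dna2 <- sample(nucleotides, size = 20, replace = TRUE,
                  prob = c(0.1, 0.4, 0.4, 0.1))
dna2.counts <- table(factor(my.dna2, levels = nucleotides))

# Only draw the plots when the code is run in an interactive session
if (interactive()) {
  pie(dna.counts)
  barplot(dna.counts, xlab = "Nucleotide", ylab = "Count")
}
```

Both counting approaches in d) should agree with the “A” entry of the table.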

## Part B - NCBI Search

### Setup

Search NCBI (http://www.ncbi.nlm.nih.gov/) for “Alzheimer human”. This will take you to Entrez gene, which shows you the hits in the NCBI databases. Choose the top hit for Alzheimer under “Gene” information.

### Evaluation

a)  What is the name of the gene?
b)  What chromosome is the gene on?
c)  What species has the most similar gene to the human version?

## Part C - Reading fasta files, nucleotide and dinucleotide frequencies

### Setup

Install and load the seqinr library
Download the fasta file found from Part B
Read the fasta file in as a string
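
One way to read a fasta file as a single string, sketched in base R so that it runs without extra packages. The temporary file and its contents are hypothetical stand-ins for the file downloaded from Part B, and the sketch assumes the file holds a single record. The seqinr function read.fasta with as.string = TRUE is another option.

```r
# Hypothetical stand-in for the downloaded file: one record wrapped over two lines
fasta.path <- tempfile(fileext = ".fasta")
writeLines(c(">toy_record example sequence", "ACGTAC", "GGTTCA"), fasta.path)

# Read every line, drop the ">" header line, and join the sequence lines
fasta.lines  <- readLines(fasta.path)
seq.lines    <- fasta.lines[!startsWith(fasta.lines, ">")]
fasta.string <- toupper(paste(seq.lines, collapse = ""))

class(fasta.string)  # "character"
nchar(fasta.string)  # 12
```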

### Evaluation

a)  What data type is the fasta?
b)  Create a function that converts the fasta string to a vector
c)  Using the function from b), how long is the sequence?
d)  Show the first 20 nucleotides of the sequence
e)  How many of each nucleotide are there in the sequence?
f)  Create a barplot of the counts, including axes labels
g)  Calculate the probability of each nucleotide
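
The steps in Part C can be sketched on toy data as shown here. The helper name `fasta_to_vector` and the toy string are hypothetical, and the sketch assumes the fasta has already been read as one string with no header. The seqinr function s2c performs the same string-to-vector conversion.

```r
# Hypothetical helper: split a sequence string into one-character elements
fasta_to_vector <- function(fasta.string) {
  strsplit(fasta.string, split = "")[[1]]
}

toy.string <- "ACGTACGGTTCAGGCATTAGCCA"  # toy data, not the gene from Part B
toy.vector <- fasta_to_vector(toy.string)

length(toy.vector)    # sequence length
head(toy.vector, 20)  # first 20 nucleotides

nt.counts <- table(toy.vector)      # count of each nucleotide
nt.probs  <- prop.table(nt.counts)  # each count divided by the total

# Only draw the plot when the code is run in an interactive session
if (interactive()) {
  barplot(nt.counts, xlab = "Nucleotide", ylab = "Count")
}
```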

## Part D - GC Content

a)  Add code to your R script to calculate the G+C content of the fasta vector
b)  How many GC pairs are there?
c)  Show a barplot of all dinucleotide counts
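
A base R sketch of the Part D calculations on toy data. It assumes that “GC pairs” means overlapping dinucleotides (a G immediately followed by a C) and that the sequence is uppercase. The variable names are hypothetical. The seqinr functions GC and count offer comparable results for real sequences.

```r
toy.vector <- strsplit("ACGTACGGTTCAGGCATTAGCCA", split = "")[[1]]  # toy data

# G+C content: fraction of positions that hold a G or a C
gc.content <- mean(toy.vector %in% c("G", "C"))

# Overlapping dinucleotides: pair each base with the base that follows it
dinucs <- paste0(head(toy.vector, -1), tail(toy.vector, -1))

# List all 16 possible dinucleotides so that absent ones show a count of 0
bases        <- c("A", "C", "G", "T")
all.dinucs   <- sort(as.vector(outer(bases, bases, paste0)))
dinuc.counts <- table(factor(dinucs, levels = all.dinucs))

# Number of "GC" dinucleotides
gc.pairs <- as.integer(dinuc.counts["GC"])

# Only draw the plot when the code is run in an interactive session
if (interactive()) {
  barplot(dinuc.counts, las = 2, xlab = "Dinucleotide", ylab = "Count")
}
```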

## Part E - Coronavirus

### Setup

Paper: https://www.ncbi.nlm.nih.gov/pubmed/32015508
DNA/RNA: https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3?report=fasta
Protein: https://www.ncbi.nlm.nih.gov/protein/QHD43415.1?report=fasta

### Evaluation

a)  Download the DNA/RNA fasta file and determine the nucleotide frequencies. Comment on how the frequencies compare with the human APOE gene.
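
A side-by-side comparison of two sets of nucleotide frequencies could be laid out as sketched here. The helper `nt_freq` and both sequences are made up for illustration; substitute the strings read from the two fasta files.

```r
# Hypothetical helper: relative frequency of A, C, G and T in a sequence string
nt_freq <- function(seq.string) {
  bases <- strsplit(toupper(seq.string), split = "")[[1]]
  vapply(c("A", "C", "G", "T"), function(b) mean(bases == b), numeric(1))
}

# Made-up placeholder sequences, not real data
first.toy  <- "ATTATAGCTTAAATGCTTAC"
second.toy <- "GGCCGCGTGCCAGGCTGAGC"

# One row per sequence, one column per nucleotide
freq.compare <- rbind(first = nt_freq(first.toy), second = nt_freq(second.toy))
round(freq.compare, 2)
```

Placing the frequencies in one matrix makes differences in A+T versus G+C content easy to read off, and the matrix can be passed to barplot with beside = TRUE for a grouped plot.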