commit 501bdb1979adcb96e4867fddbf1ab453a6cef9b5
Author: noah
Date:   Tue Aug 30 14:48:14 2022 -0500

    README and R script skeleton

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..5b6a065
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,4 @@
+.Rproj.user
+.Rhistory
+.RData
+.Ruserdata
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..4f4e118
--- /dev/null
+++ b/README.md
@@ -0,0 +1,65 @@
+# Bioinformatics Lab 1
+
+## Part A - seq function
+
+### a: Create a vector where the first element is 1, the last element is 33, with an increment of 2 between elements.
+
+### b: Create a vector with 15 equally spaced elements in which the first element is 7 and the last element is 40. Hint: use ?seq for help and the length.out option.
+
+### c: Use the sample function to create a vector with variable name my.dna that consists of 20 uniformly-random letters “A”, “C”, “G”, and “T”.
+
+### d: Use the == logical operator and other R functions on your my.dna variable to determine how many of the letters are “A”. Hint: you can use sum on a TRUE/FALSE vector or you can use the functions which and length.
+
+### e: Confirm your answer in d with table(my.dna). From the output of table, create a pie chart and barplot. Add x and y labels to your barplot.
+
+### f: Use the sample function with the option prob=c(.1,.4,.4,.1) to create a vector with variable name my.dna2 that consists of 20 non-uniformly random letters “A”, “C”, “G”, and “T”. Use table to show the nucleotide counts.
+
+## Part B: NCBI Search
+
+### Setup: Search NCBI (http://www.ncbi.nlm.nih.gov/) for “Alzheimer human.” This will take you to Entrez gene, which shows you the hits in the NCBI databases. Choose the top hit for Alzheimer under “Gene” information.
+
+### 1. What is the name of the gene?
+
+### 2. What chromosome is the gene on?
+
+### 3. What species has the most similar gene to the human version?
+
+## Part C: Reading fasta files, nucleotide and dinucleotide frequencies
+
+### Setup:
+ Install and load the seqinr library
+ Download the fasta file found from Part B
+ Read the fasta file in as a string
+
+### 1. What data type is the fasta?
+
+### 2. Create a function that converts the fasta string to a vector
+
+### 3. Using the function from C.2, how long is the sequence?
+
+### 4. Show the first 20 nucleotides of the sequence
+
+### 5. How many of each nucleotide are there in the sequence?
+
+### 6. Create a barplot of the counts, including axes labels
+
+### 7. Calculate the probability of each nucleotide
+
+## Part D: GC Content
+
+### 1. Add code to your R script to calculate the G+C content of the fasta vector
+
+### 2. How many gc pairs are there?
+
+### 3. Show a barplot of all dinucleotide counts
+
+## Part E: Coronavirus
+
+### Setup
+ Paper: https://www.ncbi.nlm.nih.gov/pubmed/32015508
+ DNA/RNA: https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3?report=fasta
+ Protein: https://www.ncbi.nlm.nih.gov/protein/QHD43415.1?report=fasta
+
+
+### 1. Download the DNA/RNA fasta file and determine the nucleotide frequencies. Comment on how the frequencies compare with the human APOE gene.
+
diff --git a/Schrick-Noah_CS-6643_Lab1.R b/Schrick-Noah_CS-6643_Lab1.R
new file mode 100644
index 0000000..d18c0c6
--- /dev/null
+++ b/Schrick-Noah_CS-6643_Lab1.R
@@ -0,0 +1,38 @@
+# Lab 1 for the University of Tulsa's CS-6643 Bioinformatics Course
+# Introduction to R, Online bioinformatics resources, nucleotide frequency statistics
+# Professor: Dr. McKinney, Fall 2022
+# Noah L. Schrick - 1492657
+
+#### Part A: Seq Function
+## a
+
+## b
+
+## c
+
+## d
+
+## e
+
+## f
+
+
+#### Part B: NCBI (no supporting R code for this part)
+
+#### Part C: Reading fasta files, nucleotide and dinucleotide frequencies
+
+## Pre-cursor: Load associated supportive libraries
+
+## 1
+
+## 2
+
+## 3
+
+#### Part D: GC Content
+
+## 1
+
+#### Part E: Coronavirus
+
+## 1
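
Note on the skeleton above: the Part A sections of the R script are left empty in this commit. As a rough illustration only, and not the author's eventual solution, the Part A prompts from the README could be sketched in base R as follows. The names `vec.a`, `vec.b`, `nucleotides`, `num.A`, `dna.counts`, and `dna2.counts` are hypothetical, as is the fixed seed; only `my.dna` and `my.dna2` come from the README.

```r
# Hypothetical sketch of Part A (a-f); not part of the commit above.
set.seed(42)  # assumption: a fixed seed, used only to make the run reproducible

# a: first element 1, last element 33, increment of 2
vec.a <- seq(1, 33, by = 2)

# b: 15 equally spaced elements from 7 to 40
vec.b <- seq(7, 40, length.out = 15)

# c: 20 uniformly random nucleotide letters (sampling with replacement)
nucleotides <- c("A", "C", "G", "T")
my.dna <- sample(nucleotides, 20, replace = TRUE)

# d: comparing with == gives a TRUE/FALSE vector; sum() counts the TRUEs
num.A <- sum(my.dna == "A")

# e: tabulate the counts; fixed factor levels keep zero counts visible
dna.counts <- table(factor(my.dna, levels = nucleotides))

# f: non-uniform sampling with the probabilities given in the README
my.dna2 <- sample(nucleotides, 20, replace = TRUE, prob = c(.1, .4, .4, .1))
dna2.counts <- table(factor(my.dna2, levels = nucleotides))
```

For the plots in prompt e, `pie(dna.counts)` and `barplot(dna.counts, xlab = "Nucleotide", ylab = "Count")` from base graphics should be enough; the axis label text here is a placeholder.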