Inspiration
When a doctor takes a cotton swab sample from somebody’s mouth, the majority of the DNA will come from bacteria rather than human cells. Analyzing the DNA in these samples can help identify disease-causing bacteria, but DNA contains many millions of nucleotides. Recognizing patterns in bacterial DNA and breaking a sample down into the bacteria present is a job well suited to machine learning. Machine learning lets computers take in large amounts of data and improve at analyzing it over time, which can make this analysis easier, faster, and potentially more robust and accurate than before.
What it does
ForensX Bacterial Contaminant Analyzer receives a genome sample in the form of a 100-character string of nucleotides and analyzes its bacterial contamination. It does this by training on bacterial genomes from the NCBI database and using machine learning to determine ‘flags’ that indicate a species of bacteria. It then searches for these flags in the given sample and identifies the bacteria present.
How I built it
Our machine learning model was written in Python, using the pandas and scikit-learn libraries for analysis and machine learning. First, we had to process and analyze our data. To do this, we took a FASTA file, extracted the sequences, and BLASTed them. We saved the results in a .csv file for the machine learning program to read and analyze. The data we obtained included the bacteria present in each sequence and the corresponding BLAST E-value. With this data in hand, the first step in the program itself was data engineering, which meant converting all of our non-integer data into integers. We converted our genome sequences to integers recursively using a simple script we wrote, and we classified our bacteria species and converted them to integer values using pandas. That completed the data engineering step. We then implemented the k-nearest neighbors (KNN) algorithm with scikit-learn and checked its accuracy with cross-validation. We achieved approximately 50% accuracy with minimal training data.
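The pipeline described above can be sketched roughly as follows. This is a minimal illustration and not our actual code: the nucleotide-to-integer mapping, the toy sequences, the species names, and the choice of `n_neighbors=3` are all assumptions made for the example, and the encoding here is a plain loop rather than the recursive script we wrote.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical mapping from nucleotide to integer (assumed for this sketch).
NUCLEOTIDE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
UNKNOWN_CODE = 4  # used for N or any other ambiguous base


def encode_sequence(seq):
    """Turn a nucleotide string into a list of small integers."""
    return [NUCLEOTIDE_CODES.get(base, UNKNOWN_CODE) for base in seq.upper()]


# Toy stand-in for the .csv of BLAST results (made-up sequences and labels).
rows = pd.DataFrame({
    "sequence": ["AAAAAAAA", "AAAAAAAC", "AAAAAACA", "AAAACAAA",
                 "GGGGGGGG", "GGGGGGGT", "GGGGGGTG", "GGGGTGGG"],
    "species": ["species_a"] * 4 + ["species_b"] * 4,
})

# Features: one integer per nucleotide position.
X = np.array([encode_sequence(s) for s in rows["sequence"]])
# Labels: pandas assigns each distinct species an integer code.
y, species_names = pd.factorize(rows["species"])

model = KNeighborsClassifier(n_neighbors=3)

# 2-fold cross-validation: train on half the rows, score on the other half.
scores = cross_val_score(model, X, y, cv=2)
print("mean cross-validated accuracy:", scores.mean())

# Fit on everything and classify a new sequence.
model.fit(X, y)
predicted = species_names[model.predict([encode_sequence("AAAAAAAT")])[0]]
print("predicted species:", predicted)
```

Treating nucleotides as ordered integers is a crude encoding, since KNN then measures distance between letters that have no natural order. It is kept here only because it mirrors the integer conversion described above.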
Challenges I ran into
The first challenge we had was determining how to use the data and precisely which data we needed to build our machine learning model. We overcame this by choosing to use only a few of the data files instead of all of them. We made this decision because of time constraints, as training on a large amount of data would take a very long time. In a finished version of the program, though, that analysis would only need to run once during setup.
Another challenge was figuring out how to get reference genomes for all of the species of bacteria we were trying to identify. We had 350 different species, and each one had its own link to a reference genome document that had to be downloaded. To work around this, we wrote a script that can download the reference genomes; however, we didn't have time to use it for the current version of the program.
We planned to analyze our data primarily with BLAST, a program hosted by NCBI that can detect sequences that are the same or very similar between two strings of nucleotides. Unfortunately, we had difficulty connecting to the online version of BLAST during the competition: the NCBI database was getting too many requests from hackathon participants and blocked us. To overcome this, we created a local database with reference sequences for bacteria species instead of full reference genomes. This became the training data for our final machine learning algorithm. The downside of this approach is that we have a smaller data set and therefore a less accurate program, but, as mentioned above, the program could be expanded to use all of the bacterial genomes from NCBI.
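To give a feel for what matching a sample against a local database of reference sequences involves, here is a small sketch that scores a query by the fraction of its k-mers (length-k substrings) found in each reference. This is not BLAST and not what our program does internally; it is a much simpler stand-in, and the function names, the value of `k`, and the reference sequences are all made up for illustration.

```python
def kmers(seq, k=4):
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_match(query, references, k=4):
    """Return (name, score) for the reference sharing the most query k-mers.

    score is the fraction of the query's distinct k-mers that also occur
    in the reference, so 1.0 means every query k-mer was found.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        raise ValueError("query is shorter than k")
    best_name, best_score = None, -1.0
    for name, ref_seq in references.items():
        shared = len(query_kmers & kmers(ref_seq, k))
        score = shared / len(query_kmers)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Made-up reference sequences standing in for a local database.
local_db = {
    "species_a": "ATGCGTACGTTAGCCGATAGGCTA",
    "species_b": "TTTTGGGGCCCCAAAATTTTGGGG",
}

name, score = best_match("CGTACGTTAGCC", local_db)
print(name, score)
```

Unlike BLAST, this sketch does no alignment and has no notion of statistical significance, so it only shows the general idea of looking up a query in local references without any network requests.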
Accomplishments that I'm proud of
We successfully built a model that can predict which type of bacteria is in a given genome sequence from a DNA sample taken from someone’s mouth. The model predicts the bacteria species with approximately 50% accuracy. In the process of creating the program, we learned a lot about both genomics and machine learning. We also significantly improved our Python programming skills.
What I learned
We learned how to BLAST a sequence from a genome to determine the different species of bacteria in it. We also learned how to use the BLAST results to train a machine learning model that takes in a sequence from a genome and determines the most likely species of bacteria in it. Along the way we learned a great deal more about machine learning, gained a more in-depth understanding of genomics, and saw how the two fields can be combined.
What's next for Challenge 4 Machine Learning Model
Given more time, we can modify this program to automatically download and train against all of the bacteria in NCBI’s database, as we already have a script that downloads the data. This would significantly improve accuracy and make setup more manageable, since at the moment we have to import data manually due to time and connection limitations. We also want to try more advanced machine learning techniques, such as neural networks and deep learning, to improve the accuracy of our model. We could also expand the input from a single sequence to an entire genome. Finally, we plan to make our model return the percentage of each species of bacteria present in the given sample.
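One possible way to approach the last two ideas is to cut a longer genome into 100-character reads, classify each read, and report how often each species was predicted. The sketch below shows only the splitting and counting; the function names are hypothetical, the list of predictions is made up, and the per-read classifier is left out.

```python
from collections import Counter


def split_into_reads(genome, read_length=100):
    """Cut a genome string into consecutive, non-overlapping reads.

    A trailing piece shorter than read_length is dropped.
    """
    last_start = len(genome) - read_length
    return [genome[i:i + read_length]
            for i in range(0, last_start + 1, read_length)]


def species_percentages(predicted_species):
    """Map each predicted species to its share of the reads, in percent."""
    counts = Counter(predicted_species)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {species: 100.0 * n / total for species, n in counts.most_common()}


# A 250-character toy genome yields two full 100-character reads.
reads = split_into_reads("ACGT" * 62 + "AC")
print(len(reads))

# Made-up per-read predictions standing in for the model's output.
predictions = ["species_a", "species_a", "species_b", "species_a"]
breakdown = species_percentages(predictions)
print(breakdown)
```

Counting reads this way gives the share of reads assigned to each species, which is only a rough proxy for the true proportion of each bacterium in the sample and depends on how accurate the per-read predictions are.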
Built With
- brain
- ipython
- pandas
- python
- scikit-learn