Inspiration
In a meeting with a Wayne PhD candidate on Friday, a side project was proposed. Most open-source genome assemblers do not use parallel programming, a method that can substantially speed up computation on large datasets. As a result, most personal devices cannot process the data in a reasonable amount of time. Many languages, such as C++ and Julia, support multi-threading and parallel programming. For this project, Julia was chosen for its high-performance computing and its growing prevalence in science.
The goal is to build an assembler that connects to a database and has a good UI for easy data storage and retrieval. It is intended to be a research tool.
What it does
Genomia is a Julia-based, parallel-processing genome assembler prototype that showed high efficiency and moderate accuracy in testing. Testing also showed that the assembler fits the long-read category better, trading accuracy for efficiency.
How we built it
The assembler takes multiple DNA sequences as input strings. Each sequence is split into k-mers, where a k-mer is a substring of length k and k is configurable. Then, a de Bruijn graph is built from the k-mers of the different DNA sequences. This graph is stored as an adjacency list.
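The k-mer splitting and graph construction described above can be sketched roughly as follows. This is an illustrative Python sketch, not Genomia's Julia code. It assumes the common de Bruijn construction in which each k-mer contributes one edge from its first k-1 characters to its last k-1 characters; the project may define nodes and edges differently. The names kmers and de_bruijn_graph are hypothetical.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return every length-k substring of `sequence`, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(sequences, k):
    """Build an adjacency list from the k-mers of all input sequences.

    Assumption: each k-mer adds one edge from its (k-1)-length prefix
    to its (k-1)-length suffix. Repeated k-mers add repeated edges.
    """
    graph = defaultdict(list)
    for seq in sequences:
        for kmer in kmers(seq, k):
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

For example, de_bruijn_graph(["ATGCA"], 3) yields the edges AT to TG, TG to GC, and GC to CA. A sequence shorter than k simply contributes no k-mers.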
Finally, a starting node is found and an Eulerian walk is performed, traversing every edge of the graph exactly once and joining the k-mers along the way into one long sequence. Thus, the sequence is reconstructed.
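A minimal sketch of the walk step, again in illustrative Python rather than the project's Julia. It assumes the graph is a non-empty adjacency list (a dict mapping each node to a list of successor nodes) that actually contains an Eulerian walk, and it uses Hierholzer's algorithm, which is one standard way to find such a walk and not necessarily what Genomia does. The names find_start, eulerian_walk, and spell are hypothetical.

```python
from collections import defaultdict


def find_start(graph):
    """Pick the node with one more outgoing than incoming edge.

    Falls back to the first node when the graph is balanced.
    Assumes `graph` is non-empty.
    """
    indegree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1
    for node, targets in graph.items():
        if len(targets) - indegree[node] == 1:
            return node
    return next(iter(graph))


def eulerian_walk(graph):
    """Return the nodes of a walk that uses every edge once (Hierholzer)."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack = [find_start(graph)]
    path = []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            # Follow an unused edge out of the current node.
            stack.append(remaining[node].pop())
        else:
            # Dead end: this node is final in the walk, so record it.
            path.append(stack.pop())
    path.reverse()
    return path


def spell(path):
    """Join the walk into a sequence: first node plus each later node's last character."""
    return path[0] + "".join(node[-1] for node in path[1:])
```

With the graph {"AT": ["TG"], "TG": ["GC"], "GC": ["CA"]}, the walk visits AT, TG, GC, CA, and spell joins it back into ATGCA. When a graph admits several Eulerian walks, different walks can spell different sequences, which is one reason assembly accuracy can vary.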
Challenges we ran into
As a Computer Engineering student, I haven't taken a biology or life science course since high school. As such, I relied heavily on publications and on university media and presentations to understand the theory behind the algorithm.
The PhD student proposed linking all systems built for his research to a database for easy data storage and retrieval. However, although countless hours were spent trying to link the system to the Oracle Database, the attempt was not successful. This will definitely be part of the future plan.
The Genie.jl web framework had changed its UI and system since the last time I used it (April 2024). As a result, the website was neither finished nor deployed. With additional time, a frontend would definitely be built.
Accomplishments that we're proud of
I'm proud of the fast-paced learning of the de Bruijn graph algorithm. Additionally, while I wasn't able to complete the database integration, I was able to learn a lot of SQL and the Oracle Database system, something that I'd definitely never have encountered in my education.
What we learned
I learned a lot about the power and complexity of bioinformatics. Through this project, I was able to work more with algorithm design again. Additionally, I learned a lot of SQL and about the Oracle Database system.
What's next for Genomia
The next step, of course, is to contact industry professionals and professors for algorithmic improvement. Additionally, the originally planned feature to connect the system to a database for simple data storage and retrieval will need to be implemented, likely through the Oracle.jl or AWS.jl packages.
Overall, Genomia has promise. With more tinkering and feedback from professionals, we can definitely build a powerful tool to aid research.
Built With
- genie.jl
- julia