Inspiration
Storing and managing data has never been easy, and with the rise of AI and deep learning we now generate enormous amounts of data, known as Big Data. Because unstructured data is made up of files like audio, video, pictures and even social media data, it's easy to see why volume is a challenge. The value of the data can get lost in the shuffle when working with so much of it. There is value to be found in unstructured data, but harnessing that information can be difficult. The ideal big data storage system would allow storage of a virtually unlimited amount of data, cope with high rates of both random write and read access, deal flexibly and efficiently with a range of different data models, support both structured and unstructured data, and, for privacy reasons, only work on encrypted data. Obviously, all these needs cannot be fully satisfied at once.
The problem statement also comes from my personal story. I was working on a wildlife conservation project to save animals from poaching, using biometric sensors to monitor body vitals. With tremendous hard work I collected a large amount of data, which was not an easy task: I had to meet park rangers daily to get it. But my negligence ruined all that hard work after a fatal HDD crash. It was such a huge loss of data and time that the project still has not been completed. This time I will ensure that the data generated by the data revolution can be harnessed to its full potential using CORTX S3 data services, which are scalable, reliable and efficient.
Hypothesis
We integrate Label Studio, an open source data annotation tool widely used by global AI brands, with S3 services for storing results and retrieving data for analytics and other data-related tasks. The latest version of Label Studio is compatible with cloud storage services, so it is a good fit for integration with CORTX S3.
What it does
Our integration is simple: it lets you import bulk data from your S3 bucket, and once you are done with data annotation, you can export the results in the most widely used annotation formats: JSON, CSV, COCO, PASCAL VOC. Moreover, you can download the exported data into any machine learning environment using this integration.
How we built it
Step 1: Download requirements
- We are integrating S3 storage with Label Studio, an open source annotation tool. Install it using the pip command:
$ pip install -U label-studio
- Start Label Studio to verify the installation; it runs on localhost
$ label-studio
- Once the tool is running, go to the account settings and copy the Access Token for Label Studio. Python methods use it to connect to Label Studio and automate tasks through REST API calls.
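As a rough illustration of how the Access Token is used, the sketch below builds an authenticated request to the Label Studio REST API with the standard library only. It assumes Label Studio is running at `http://localhost:8080` and that the token is sent in an `Authorization: Token ...` header; the helper names `build_request` and `list_projects` are hypothetical and are not part of our integration code.

```python
import json
import urllib.request

# Assumptions: local Label Studio address and a placeholder token.
LABEL_STUDIO_URL = "http://localhost:8080"
ACCESS_TOKEN = "<your-access-token>"


def build_request(path, token=ACCESS_TOKEN, base_url=LABEL_STUDIO_URL):
    """Build an authenticated GET request for a Label Studio API path."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    headers = {"Authorization": f"Token {token}"}
    return urllib.request.Request(url, headers=headers)


def list_projects(token=ACCESS_TOKEN, base_url=LABEL_STUDIO_URL):
    """Fetch the project listing; needs a running Label Studio instance."""
    request = build_request("/api/projects", token, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `list_projects()` with a valid token is a quick way to confirm that the token works before automating anything else.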
Step 2: CORTX Cloudshare VM lab setup
Follow this guide from the CORTX team: CORTX-Cloudshare-Setup-for-April-Hackathon-2021
- Once the CORTX VM is ready, run this command and note the external address, which will serve as our S3 endpoint URL
$ sudo route add default gw 192.168.2.1 ens33
- Start Windows Server 2019 Edition and open the CORTX S3 dashboard. By default, an S3 user and a test bucket are already created for you, and a txt file on the Desktop contains all the default S3 credentials
https://192.168.2.102:28100/#/dashboard
Step 3: Connect the S3 data endpoint class, whose methods upload data to the S3 bucket and download it from the bucket to anywhere. Our S3DataEndpoint file link: S3DataEndpoint Python Class
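To give an idea of what such an endpoint class can look like, here is a minimal sketch. It is not the S3DataEndpoint class linked above: the names `S3Endpoint` and `make_client` are hypothetical, and it assumes the third-party `boto3` package is used to talk to the S3-compatible endpoint. The client is passed in so the wrapper can be exercised without a live bucket.

```python
import os


class S3Endpoint:
    """Thin wrapper around an S3 client for one bucket (illustrative only)."""

    def __init__(self, client, bucket):
        self.client = client  # any object with upload_file/download_file
        self.bucket = bucket

    def upload(self, local_path, key=None):
        """Upload a local file; the key defaults to the file name."""
        key = key or os.path.basename(local_path)
        self.client.upload_file(local_path, self.bucket, key)
        return key

    def download(self, key, local_path=None):
        """Download an object; the local path defaults to the key's base name."""
        local_path = local_path or os.path.basename(key)
        self.client.download_file(self.bucket, key, local_path)
        return local_path


def make_client(endpoint_url, access_key, secret_key):
    """Create a boto3 S3 client for a custom endpoint (requires boto3)."""
    import boto3  # third-party; imported lazily so the wrapper stays testable

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
```

With the endpoint URL and credentials from Step 2, usage would look like `S3Endpoint(make_client(url, key, secret), "my-bucket").upload("data.zip")`, where the bucket and file names are placeholders.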
We use Streamlit as the frontend for our integration's user interface. The video below shows what an annotation project looks like through our integration. The label_creator XML file we uploaded describes the labels to be used for the provided data. (There is a variety of label_creator XML templates for any kind of data annotation project. Decide which one fits your use case and create the corresponding XML file to use with our integration platform: https://labelstud.io/templates/ )
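For readers who prefer to generate the label XML instead of writing it by hand, the following sketch builds a simple image classification config. It assumes the `View`, `Image`, `Choices` and `Choice` tags from the Label Studio templates; the function name and the example labels are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_image_classification_config(labels):
    """Return a Label Studio style XML config with one choice per label."""
    view = ET.Element("View")
    # The image to annotate is read from the "image" field of each task.
    ET.SubElement(view, "Image", name="image", value="$image")
    choices = ET.SubElement(view, "Choices", name="choice", toName="image")
    for label in labels:
        ET.SubElement(choices, "Choice", value=label)
    return ET.tostring(view, encoding="unicode")


if __name__ == "__main__":
    # Example labels are placeholders; use the ones your project needs.
    print(build_image_classification_config(["Elephant", "Rhino", "Tiger"]))
```

Other project types (bounding boxes, audio, text) need different tags, so treat this as a starting point and check the template page for the matching structure.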
Problems we ran into
Troubleshoot CORS and access problems: After syncing imports from the S3 bucket, your files may not load in Label Studio because cross-origin resource sharing (CORS) is not currently supported by CORTX S3. As a workaround, upload your data for labeling manually instead of syncing from S3.
Troubleshoot data not being exported or downloaded from the S3 bucket: If you are on a CORTX Cloudshare instance, make sure it is active and not suspended. If it is suspended, reconnect, run the command below on the CORTX VM, and then ping the device connection external address from your own system (not from Cloudshare).
$ sudo route add default gw 192.168.2.1 ens33
# ping it from your local system to check whether the S3 connection is okay
$ ping uvo100ebn7cuuq50c0t.vm.cld.sr
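If you would rather run this check from Python, the sketch below tests whether a TCP connection to the endpoint can be opened. Note that this is a TCP connect test, not an ICMP ping, so it also depends on the port; the function name `can_connect` is hypothetical and the port to use depends on your own setup.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


# Example (host from the command above; the port is an assumption):
# can_connect("uvo100ebn7cuuq50c0t.vm.cld.sr", 443)
```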
Troubleshoot app.py not running: Our app is made with streamlit.io, so it has to be started with the command below:
$ streamlit run app.py
Troubleshoot download and upload paths: When using the methods to download or upload files from S3, make sure you give the correct file name and its location or path.
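One way to catch path mistakes early is to validate them before calling the upload or download methods. The helpers below are a hypothetical sketch (`resolve_upload` and `resolve_download` are not part of our code) that uses only the standard library.

```python
from pathlib import Path, PurePosixPath


def resolve_upload(local_path):
    """Check that the local file exists and return (path, object key)."""
    path = Path(local_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No such file to upload: {path}")
    return path, path.name


def resolve_download(key, dest_dir="."):
    """Create the destination folder if needed and return the target path."""
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    # S3 keys use forward slashes; keep only the final component as file name.
    return dest / PurePosixPath(key).name
```

Failing with a clear `FileNotFoundError` before the transfer starts makes a wrong file name easier to spot than an error raised in the middle of an S3 call.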
Accomplishments that we're proud of
- Successfully connected CORTX S3, created a data bucket and connected it to a Label Studio project
- Used the API to upload bulk files to CORTX S3 and sync them to our Label Studio project
- One-button file export to the CORTX S3 bucket for different data annotation formats: JSON, CSV, COCO, PASCAL VOC
- Ability to download any file from the S3 bucket anywhere, which gives flexibility across multiple places and use cases, such as downloading data annotation results into a Jupyter Notebook or the TensorFlow framework to build AI and data apps
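As an example of the last point, once a JSON export has been downloaded into a notebook it can be flattened into (data, label) pairs for training. The sketch below assumes the export is a list of tasks, each with a `data` dict and an `annotations` list whose `result` entries carry `value.choices`, as in a classification project; the function name and the sample record are hypothetical.

```python
import json


def extract_choices(tasks):
    """Flatten exported tasks into (task data, chosen label) pairs."""
    rows = []
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                for label in result.get("value", {}).get("choices", []):
                    rows.append((task.get("data", {}), label))
    return rows


if __name__ == "__main__":
    # Made-up sample record, only to show the expected shape.
    sample = json.loads(
        '[{"data": {"image": "img_001.jpg"},'
        ' "annotations": [{"result": [{"value": {"choices": ["Rhino"]}}]}]}]'
    )
    print(extract_choices(sample))
```

Other export formats (CSV, COCO, PASCAL VOC) have their own layouts and would need a different loader.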
What's next for Cortx Label Studio Integration
- Integrating backend ML operations for auto-labeling tasks in Label Studio and storing results and metrics on S3
- Multi-level user access for better project control and role management
- Integrating other services such as a data encoder and decoder, and direct training on GPU with major ML frameworks
- A faster query system on CORTX S3 using the Motr layer

