Review and Progress
Research on the Construction of Cloud Genome Database and Cotton Breeding Information Platform 


Cotton Genomics and Genetics, 2025, Vol. 16, No. 4
Received: 18 Jun., 2025 Accepted: 29 Jul., 2025 Published: 21 Aug., 2025
The rapid advancement of cotton genomics and breeding technologies has heightened the need for an integrated, scalable, and efficient digital platform. In this study, we constructed a cloud-based genome database and developed a comprehensive cotton breeding information platform to address data integration, standardization, and secure management of large-scale breeding and genomic datasets. Leveraging cloud computing infrastructure, we designed a modular architecture comprising data acquisition pipelines, core functional modules, and user-friendly interfaces to support breeding decision-making. A case study on the implementation of a national cotton genomics platform demonstrated the system’s effectiveness in enhancing data accessibility, real-time analysis, and collaborative breeding efforts. This research highlights the benefits of cloud platforms in supporting open science, accelerating breeding efficiency, and enabling data-driven decision-making. Future development will focus on integrating artificial intelligence, enhancing global interoperability, and strengthening policy and capacity-building frameworks to ensure sustainable use of the platform.
1 Introduction
Advances in genomics and information technology have transformed cotton breeding. Researchers can now integrate and analyze large volumes of omics data, largely thanks to cloud-based genomic databases and specialized breeding information platforms (Yang et al., 2022a). Cotton breeding has entered a data-driven stage in which large amounts of genomic, transcriptomic, and phenotypic data have accumulated. Making good use of these data requires a powerful, easy-to-use platform for organizing, standardizing, and viewing them. Platforms such as CottonGen, CottonFGD, and COTTONOMICS were established to meet this demand. They provide tools for data retrieval, visualization, and analysis, and help researchers around the world collaborate more effectively (Yu et al., 2015; Zhu et al., 2017).
Cloud-based genomic databases are essential for handling this rapidly growing volume of data. They store data centrally, integrate different data types, and remain accessible to researchers worldwide at any time. These databases help users identify important genes, study genetic variation, and apply advanced breeding methods. The ultimate goal is to breed new cotton varieties with high yield, high quality, and disease resistance more quickly (Dai et al., 2022; Yang et al., 2022b). Cloud platforms also support multi-user collaborative projects and keep data up to date and openly available.
The main goal of this study is to integrate omics data and breeding information by establishing a cloud-based genome database and breeding information platform, to provide practical analysis and visualization tools, and to promote cooperation among cotton researchers around the world. Looking forward, such platforms may enable new breeding methods, support precision agriculture, and help address sustainability and climate adaptation in cotton cultivation. As these platforms continue to improve, researchers will be able to translate genomic findings into practice more quickly and accelerate cotton variety improvement.
2 Technological Foundations of Cloud Genome Databases
2.1 Cloud computing infrastructure
Cloud computing provides scalable, low-cost, and flexible resources, which is essential for storing and analyzing the large volumes of genomic data generated in modern breeding. Because resources can be scaled up or down on demand, researchers can process big data more efficiently. Distributed frameworks such as Hadoop allow many computing tasks to run in parallel, which suits the analysis of very large data sets (O'Driscoll et al., 2013). Many countries and international organizations have now built their own cloud platforms, some as hybrid clouds and some as combinations of multiple cloud services, which makes it easier for institutions in different countries to conduct research and share data together (Molnár-Gábor et al., 2017; Ogasawara, 2022).
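The parallel processing pattern mentioned above can be illustrated with a small map-and-reduce sketch. It is not Hadoop code and does not reflect the platform's actual implementation; the record layout, the chromosome labels, and the function names are assumptions made for illustration, and a thread pool stands in for a cluster of worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_variants(chunk):
    """Map step: count variant records per chromosome within one chunk."""
    return Counter(record["chrom"] for record in chunk)


def merge_counts(partials):
    """Reduce step: merge the per-chunk counters into a single total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def split(records, n_chunks):
    """Split records into at most n_chunks slices of roughly equal size."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def variants_per_chromosome(records, workers=4):
    """Count variants per chromosome by processing chunks in parallel."""
    chunks = split(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_variants, chunks))
    return merge_counts(partials)


# Hypothetical variant records; real data would come from VCF files or a database.
example = [
    {"chrom": "A01", "pos": 1200},
    {"chrom": "D05", "pos": 88210},
    {"chrom": "A01", "pos": 45012},
]
print(variants_per_chromosome(example, workers=2))
```

Each chunk is counted independently, so the same two functions could in principle be distributed across machines; only the merge step needs to see all partial results.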
2.2 Data integration and standardization
Often the problem is not that data are unusable but that they cannot be combined. Genomic databases in particular bring together data from different sources and in different formats, and without a unified processing method the analysis cannot proceed smoothly (Dahlquist et al., 2023). Integrating these data well takes more than a single standard. A unified extraction process is needed first, and the data formats and descriptive fields (that is, the metadata) must then be made consistent to avoid errors. In practice this is not always achieved. Where standards are implemented well, however, the data fit together and platforms become more compatible with one another. Researchers can then spend less time on data cleaning and more on meaningful analysis, and seemingly messy data can be assembled into a clear picture (Langmead and Nellore, 2018).
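The idea of consistent metadata fields can be made concrete with a minimal harmonization sketch. The alias table, the shared field names, and the required fields below are hypothetical and are not taken from any existing cotton database; the point is only to show how records from different sources might be mapped onto one schema before integration.

```python
# Hypothetical mapping from source-specific field names to one shared schema.
FIELD_ALIASES = {
    "accession": "accession_id",
    "acc_id": "accession_id",
    "sample": "accession_id",
    "fiber_len_mm": "fiber_length_mm",
    "FL": "fiber_length_mm",
    "species": "species",
    "organism": "species",
}

# Assumed minimum metadata every record must carry after harmonization.
REQUIRED_FIELDS = ("accession_id", "species")


def harmonize(record):
    """Rename known fields to the shared schema; keep unknown ones under 'extra'."""
    unified, extra = {}, {}
    for key, value in record.items():
        target = FIELD_ALIASES.get(key)
        if target is None:
            extra[key] = value
        else:
            unified[target] = value
    missing = [name for name in REQUIRED_FIELDS if name not in unified]
    if missing:
        raise ValueError(f"missing required metadata: {missing}")
    if extra:
        unified["extra"] = extra
    return unified


# Two sources describing the same kind of sample with different field names.
source_a = {"acc_id": "X1", "organism": "Gossypium hirsutum", "FL": 29.5}
source_b = {"accession": "X2", "species": "Gossypium barbadense", "collector": "lab 3"}
print(harmonize(source_a))
print(harmonize(source_b))
```

Unknown fields are preserved rather than discarded, so no information is lost while the shared fields become directly comparable across sources.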
2.3 Security and data governance
Because genomic data are sensitive, data security and governance are particularly important. Cloud platforms protect data through several mechanisms, such as multi-layer access control, user authentication, data encryption, and operation logs (Chen et al., 2018; Satish, 2024). A complete set of governance rules is also needed to regulate how data are uploaded, accessed, and used, and to ensure compliance with local laws and ethical requirements (Dove et al., 2014). In addition, newer cryptographic techniques, such as homomorphic encryption and secure computation protocols, are used to further protect privacy. These methods allow different teams to analyze data jointly without exposing the original data (Tang et al., 2016; Cheng et al., 2023; Blindenbach et al., 2024).
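As a rough illustration of access control combined with operation records, the sketch below checks a role against a permission table and logs every attempt. The roles, permissions, and dataset names are assumptions made for this example; authentication and encryption are deliberately left out, and a production system would rely on an established identity and key management service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical roles and the actions each one may perform.
PERMISSIONS = {
    "viewer": {"read"},
    "contributor": {"read", "upload"},
    "curator": {"read", "upload", "delete"},
}


@dataclass
class AuditLog:
    """In-memory operation log; a real platform would persist these entries."""
    entries: list = field(default_factory=list)

    def record(self, user, action, dataset, allowed):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "dataset": dataset,
            "allowed": allowed,
        })


def authorize(user, role, action, dataset, log):
    """Return True if the role permits the action; log every attempt either way."""
    allowed = action in PERMISSIONS.get(role, set())
    log.record(user, action, dataset, allowed)
    return allowed


log = AuditLog()
print(authorize("alice", "contributor", "upload", "resequencing_batch_1", log))
print(authorize("bob", "viewer", "delete", "resequencing_batch_1", log))
print(len(log.entries))
```

Denied requests are logged as well as granted ones, which is what makes the record useful for later compliance review.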
3 Architecture of Cotton Breeding Information Platforms
3.1 Core functional modules
The cotton breeding information platform is composed of multiple functional modules that support the storage, analysis, and display of data. The main modules include:
Search and retrieval functions: users can quickly find data on genomes, traits, and breeding.
Analysis tools: the platform can perform single-gene analysis, process batches of genes, run association studies, and plot different types of data (Yang et al., 2022b).
Data management: helps ensure that data from different sources are of good quality, correctly annotated, and smoothly integrated (Yu et al., 2013).
Special tools: genome browsers, genetic map viewers, homology analysis tools, and breeding information management systems support more in-depth analysis.
Download and statistics functions: data can be downloaded in batches and usage statistics can be viewed, which makes it easier for others to reproduce analyses and supports the transparency and credibility of the data.
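The search and retrieval module described above can be sketched as a function that applies whichever filters the user supplies. The `Gene` fields, the gene identifiers, and the annotations are hypothetical and do not correspond to records in CottonGen, CottonFGD, or CottonMD; an actual platform would run such queries against an indexed database rather than a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str
    chrom: str
    start: int
    end: int
    annotation: str


def search(genes, chrom=None, region=None, keyword=None):
    """Return the genes that satisfy every filter that was supplied."""
    hits = []
    for gene in genes:
        if chrom is not None and gene.chrom != chrom:
            continue
        if region is not None:
            low, high = region
            if gene.end < low or gene.start > high:  # no overlap with the region
                continue
        if keyword is not None and keyword.lower() not in gene.annotation.lower():
            continue
        hits.append(gene)
    return hits


# Hypothetical gene records used only to exercise the filters.
catalog = [
    Gene("gene_0001", "A01", 1000, 2500, "cellulose synthase"),
    Gene("gene_0002", "A01", 9000, 9800, "MYB transcription factor"),
    Gene("gene_0003", "D05", 1500, 3000, "cellulose synthase-like protein"),
]
print([g.gene_id for g in search(catalog, keyword="cellulose")])
print([g.gene_id for g in search(catalog, chrom="A01", region=(2000, 9500))])
```

Filters left as `None` are ignored, so the same function serves both a simple keyword lookup and a combined chromosome, region, and keyword query.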
3.2 Data acquisition and upload pipelines
Cotton research produces so many types of data that their origin is sometimes hard to trace. Omics, trait, and field data are often mixed together, and without a smooth data flow the platform cannot handle them. The common practice now is to let the system collect and pull in the various data automatically (Figure 1). The collected data cannot be used directly. They must first be cleaned and checked for format problems and for the reliability of the data themselves (Issac et al., 2023). Cleaning alone is not enough, and processing must also be efficient. Tools such as Hadoop or Azure are often used to process large-scale data in batches, which can save considerable time when data volumes are large (Thesma et al., 2024). Finally, researchers need the upload process itself to be simple. We therefore designed a relatively simple web form and an API: users who prefer manual submission can upload in a few clicks, and those who prefer automation can call the API directly from their own programs.
Figure 1 Front view of the rover deployed to collect video streams of cotton plants (Adapted from Thesma et al., 2024)
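The cleaning step in the upload pipeline, which checks both format and plausibility, might look like the following sketch. The field names, trait names, and plausible ranges are assumptions chosen for illustration and are not the platform's actual validation rules.

```python
from datetime import date

# Hypothetical plausible value ranges per trait, used as a basic sanity check.
PLAUSIBLE_RANGE = {
    "plant_height_cm": (0, 300),
    "boll_count": (0, 500),
}

REQUIRED = ("plot_id", "trait", "value", "date")


def validate_record(record):
    """Return a list of problems found in one uploaded phenotype record."""
    problems = [f"missing field: {name}" for name in REQUIRED if name not in record]
    if problems:
        return problems
    try:
        date.fromisoformat(record["date"])
    except (TypeError, ValueError):
        problems.append("date is not a valid ISO 8601 date")
    bounds = PLAUSIBLE_RANGE.get(record["trait"])
    value = record["value"]
    if bounds is None:
        problems.append(f"unknown trait: {record['trait']}")
    elif isinstance(value, bool) or not isinstance(value, (int, float)):
        problems.append("value is not numeric")
    elif not bounds[0] <= value <= bounds[1]:
        problems.append(f"value {value} outside plausible range {bounds}")
    return problems


def split_batch(records):
    """Separate a batch into accepted records and rejected ones with reasons."""
    accepted, rejected = [], []
    for record in records:
        issues = validate_record(record)
        if issues:
            rejected.append((record, issues))
        else:
            accepted.append(record)
    return accepted, rejected


batch = [
    {"plot_id": "P-001", "trait": "plant_height_cm", "value": 95.0, "date": "2024-07-15"},
    {"plot_id": "P-002", "trait": "plant_height_cm", "value": 950.0, "date": "2024-07-15"},
]
accepted, rejected = split_batch(batch)
print(len(accepted), len(rejected))
```

Returning the reasons alongside each rejected record lets the same check serve both the web form, which can show the messages to the user, and the API, which can return them in its response.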
3.3 User interface and experience design
The interface was deliberately designed to be simple rather than sophisticated. Some users are students who are just getting started, while others are experienced analysts who work with data every day, and both groups must be able to use the platform easily. The interface is therefore kept simple and the charts clear. For example, we added an interactive genome browser and map viewer that can be explored by clicking, together with an intuitive chart dashboard. To make data easy to find, the search function is detailed and supports filtering by different conditions, so users can quickly reach the results they need (Yu et al., 2015). Because not everyone can use these tools immediately, we also prepared a complete manual, video tutorials, and answers to common questions (Zhu et al., 2017). The platform is responsive as well: the web pages open smoothly on mobile phones, tablets, and older laptops, which lowers the barrier to use for researchers around the world.
4 Case Study: Implementation of a National Cotton Genomics Platform
4.1 Project overview
In recent years, the volume of cotton research data has grown steadily, and the data have become harder to find and use. Breeders and geneticists in particular often have to switch back and forth between databases, which invites errors. The National Cotton Genomics Platform was created in response, with a direct purpose: to build a unified hub. The platform organizes and manages genome sequences, gene annotations, genetic markers, transcriptome information, and resequencing data in one place (Figure 2) (Zhu et al., 2017). The platform is not intended for storage alone. Its broader goal is to connect research and practical breeding more smoothly, so that users make fewer mistakes and achieve more. Once the data are unified, research efficiency improves, and it becomes easier to identify valuable genes and advance the breeding of new varieties.
Figure 2 Overview of CottonMD. Construction pipeline of CottonMD through integration of multi-omics data (Adopted from Yang et al., 2022b)
4.2 System design and deployment
Building the platform was not conceptually complicated, but it could not be done overnight. The platform relies on several core modules, namely search, analysis, and visualization. Users can query different types of data, inspect a single gene, or analyze a batch of data together. Some functions are plain but highly practical: genome browsers, genetic map viewers, and collinearity tools are essential for daily work (Yu et al., 2015). The platform also supports bulk data download and provides detailed documentation, so users can consult the manual when they cannot find something. Although many users pay little attention to the underlying technology, the platform's use of cloud infrastructure is critical. It allows the system to scale and helps ensure stability and global accessibility. Users in any country can upload, download, and work on the platform as long as they have an Internet connection.
4.3 Outcomes and impact
Ultimately, the value of a platform depends on how well it works in practice. Judging from its operation over the past few years, the platform has advanced a good deal of work. Researchers no longer have to struggle to locate data or piece together analyses (Yang et al., 2022b). With the data centralized, candidate genes can be found faster, and genetic variation analysis and trait association studies can be carried out with more confidence. The platform is also easy to use as well as multi-functional: the interface is clear and the tools are sufficient, so both novices and experienced users can get started easily. This has gradually fostered a stable research community whose members are willing to cooperate and share ideas. Beyond improving efficiency, the platform has helped move cotton breeding toward higher yield, stronger disease resistance, and better quality. This is not a slogan but a change that is gradually taking place.
5 Benefits of Cloud-Based Cotton Breeding Systems
5.1 Efficiency and real-time access
Cloud platforms help users view large amounts of complex data, such as genomes, trait performance, and breeding-related information, quickly and in one place. Tools like CottonGen support rapid online search and analysis, allowing researchers to use these data at any time and to reuse existing data to discover new knowledge and improve cotton varieties. Cloud platforms can also scale flexibly to keep pace with growing data volumes, so that new data become discoverable soon after upload (Yu et al., 2015).
5.2 Enhanced decision-making
The cloud-based platform integrates many analytical and visualization tools, making it easier for breeders to make decisions. Breeders can apply advanced analytical tools directly to selected multi-omics data on the platform to identify important genes, determine which traits are correlated, and develop better breeding strategies. These functions can help breed high-yield, high-quality, and disease-resistant cotton varieties more quickly (Yang et al., 2022a).
5.3 Support for open science and collaboration
Not all platforms are willing to open their resources to others, but cloud platforms of this type are comparatively open in cotton research. CottonGen is a typical example. It has participated in many international cooperation projects and has built interfaces that let researchers around the world use the same data and tools. The main obstacle to collaboration is often not technical. What matters is whether a platform is willing to let others in to work together, and this one clearly is: it has built a public space in which anyone can query and upload data. Although the practice is not new, promoting it on such a large scale is rare in agricultural research (Conaty et al., 2022). This kind of sharing mechanism therefore does more than enable cooperation. It paves the way for innovation. One group may be phenotyping in China while another is doing gene editing in the United States, and once their information is matched, a new variety may emerge. Some breakthroughs depend on such collective effort.
6 Challenges and Limitations
6.1 Technical barriers
The growing volume of multi-omics and breeding data poses many technical challenges for the platform. Integrating and managing diverse data from different sources requires robust technical systems and efficient computing tools (Yang et al., 2022b). Keeping these data consistent, of uniform quality, and interoperable across platforms is difficult, especially as new data types and analysis methods continue to emerge. In addition, as data volumes grow, storage and computing requirements increase, and existing systems may struggle to cope. Real-time analysis and visualization can also slow down under large data loads.
6.2 Organizational and operational hurdles
Keeping these platforms running well in the long term is not only a technical issue but also requires coordinated effort from many people. The platform needs continuous management by professionals, active participation from community users, and coordination among different research groups. Because researchers from all over the world contribute data, clear data submission rules are needed, along with training and communication channels, so that everyone works to the same standards. Operationally, adding new tools, incorporating new data types, and keeping the platform stable and available all consume considerable resources and time and require ongoing technical support (Zhu et al., 2017). It is also not easy to get researchers, breeders, and database administrators to work well together. They have different priorities and working practices, which can lead to collaboration problems and may hinder adoption of the platform and data sharing.
6.3 Sustainability and maintenance
A platform cannot be sustained by short-term enthusiasm alone. Building it is the easier part, and keeping it going afterward is harder. Technology alone is not enough, because someone must maintain the system and keep track of updates. Once a system of this kind is online, maintenance becomes the biggest challenge. Hardware, software, and security mechanisms must all be upgraded periodically to keep pace with change (Yu et al., 2015). Pressure rises as users and data increase. New data sources keep appearing, often without anyone clearly responsible for organizing, archiving, and providing access to them. When research directions change, platform functions must change as well. All of this requires a continuous supply of staff, funding, and other resources, yet stable long-term support is hard to secure. Many projects have encountered this problem, and it is not an isolated case. Ultimately, long-term operation depends on both institutional backing and sustained financial investment, and neither can be dispensed with.
7 Future Prospects and Recommendations
7.1 Integration with AI and machine learning
Applying artificial intelligence (AI) and machine learning (ML) to cotton genomic platforms will greatly change the way we analyze data and breed. AI tools can help process massive amounts of multi-omics data, identify relationships between complex traits, and help find target genes for breeding more quickly (Yang et al., 2022a). Gene editing technologies such as CRISPR, if used together with AI analysis, are expected to breed new cotton varieties with specific superior traits more quickly, which can also promote precision breeding and green agriculture (Sheri et al., 2025). In the future, the platform should focus on developing and integrating AI modules for predictive analysis, trait selection, and automatic data processing.
7.2 Global interoperability and platform sharing
Not all databases are designed with interoperability in mind, and at the global level the proliferation of standards easily causes problems. In cotton, some platforms, such as CottonFGD and CottonMD, already allow researchers around the world to use unified data and tools, but compatibility is far from complete (Zhu et al., 2017). Without common formats and unified interfaces, collaborative research remains fragmented and communication becomes confused. There are encouraging signs, however. Initiatives such as the Global Cotton ENCODE Project are promoting open data policies and gradually bringing scattered resources together. The aim is not to showcase technology but to make it easier for more people to participate, compare results, and cross-validate. Achieving truly global collaboration will require unified data formats, standard APIs, and possibly a redesign of access systems. These are easier to describe than to implement, but without them cooperation will remain obstructed and resources will not reach their full value.
7.3 Policy, education, and capacity building
Continued improvement of the platform depends not only on technology but also on policy support, continuing education, and capacity building. Clear data sharing rules, privacy protection measures, and intellectual property policies will help increase user trust and encourage more people to participate in open science (Kun et al., 2025). Training courses and simple, easy-to-follow user guides are also needed so that researchers, breeders, and students can use these platforms smoothly. In addition, investing resources to encourage community participation, provide technical support, and promote cooperation across disciplines can help ensure that both local breeders and international research teams obtain practical help from the cloud-based cotton breeding platform.
8 Concluding Remarks
In the past, cotton breeding relied on experience and field trials, but data now play the central role. Platforms such as CottonGen, CottonFGD, and CottonMD have become indispensable tools for research. They bring together massive amounts of omics data and keep their analysis tools up to date, so that researchers can use them directly wherever and whenever they work. It would be an exaggeration to say that these platforms have changed everything, but querying data, running analyses, and managing projects are now much easier than before. Efficiency has improved and variety improvement has accelerated, and these changes are visible.
At the same time, genomic data keep growing, the people involved in the research come from different disciplines, and new technologies continue to enter breeding, so the situation is far from simple. A single mismatched data interface can stall an entire workflow. Although the direction of precision breeding is clear, much remains to be done before it is fully realized.
AI, machine learning, and high-throughput tools sound advanced, but to be effective they must be embedded in the platform step by step. The next goal is for prediction, screening, and analysis to be run automatically by the system rather than performed manually. Achieving global applicability is an even larger systems effort. Unified data formats and mutual recognition between platforms cannot be solved by any single team, and resource investment must keep pace. Basic work such as genome assembly cannot be done once and for all, and if the interface is not user-friendly, no one will want to use a tool however powerful it is.
In retrospect, these databases and platforms are more than technical products. They have become the infrastructure of modern cotton breeding. Technology and people will both change, but as long as scientists are willing to share and work together, these platforms can continue to move forward and help address future uncertainties in climate, yield, and other areas. The goal remains the same: to breed new cotton varieties that are resilient, high-yielding, and easy to manage.
Acknowledgments
I thank Mr. Li for his careful review of an earlier draft, whose comments enhanced the rigor of the argument.
Conflict of Interest Disclosure
The author affirms that this research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.
References
Blindenbach J., Kang J., Hong S., Karam C., Lehner T., and Gürsoy G., 2024, SQUiD: ultra-secure storage and analysis of genetic data for the advancement of precision medicine, Genome Biology, 25(1): 314.
https://doi.org/10.1186/s13059-024-03447-9
Chen L., Aziz M., Mohammed N., and Jiang X., 2018, Secure large-scale genome data storage and query, Computer Methods and Programs in Biomedicine, 165: 129-137.
https://doi.org/10.1016/j.cmpb.2018.08.007
Cheng K., Hou Y., and Wang L., 2023, Secure similar sequence query over multi-source genomic data on cloud, IEEE Transactions on Cloud Computing, 11(3): 2803-2819.
https://doi.org/10.1109/TCC.2022.3228906
Conaty W., Broughton K., Egan L., Li X., Li Z., Liu S., Llewellyn D., MacMillan C., Moncuquet P., Rolland V., Ross B., Sargent D., Zhu Q., Pettolino F., and Stiller W., 2022, Cotton breeding in Australia: meeting the challenges of the 21st century, Frontiers in Plant Science, 13: 904131.
https://doi.org/10.3389/fpls.2022.904131
Dahlquist J., Nelson S., and Fullerton S., 2023, Cloud-based biomedical data storage and analysis for genomic research: landscape analysis of data governance in emerging NIH-supported platforms, Human Genetics and Genomics Advances, 4(3): 100196.
https://doi.org/10.1016/j.xhgg.2023.100196
Dai F., Chen J., Zhang Z., Liu F., Li J., Zhao T., Hu Y., Zhang T., and Fang L., 2022, COTTONOMICS: a comprehensive cotton multi-omics database, Database, 2022: baac080.
https://doi.org/10.1093/database/baac080
Dove E., Joly Y., Tassé A., Kaye P., Burton P., Chisholm R., Fortier I., Goodwin P., Harris J., Hveem K., Kaye J., Kent A., Knoppers B., Lindpaintner K., Little J., Riegman P., Ripatti S., Stolk R., Knoppers M., Bobrow M., Cambon-Thomsen A., Dressler L., Joly Y., Kato K., Rodriguez L., McPherson T., Nicolàs P., Ouellette F., Romeo-Casabona C., Sarin R., Wallace S., Wiesner G., Wilson J., Zeps N., Simkevitz H., De Rienzo A., and Knoppers B., 2014, Genomic cloud computing: legal and ethical points to consider, European Journal of Human Genetics, 23(10): 1271-1278.
https://doi.org/10.1038/ejhg.2014.196
Issac A., Ebrahimi A., Velni J., and Rains G., 2023, Development and deployment of a big data pipeline for field-based high-throughput cotton phenotyping data, Smart Agricultural Technology, 5: 100265.
https://doi.org/10.1016/j.atech.2023.100265
Kun W., He S., and Zhu Y., 2025, Cotton2035: from genomics research to optimized breeding, Molecular Plant, 18(2): 298-312.
https://doi.org/10.1016/j.molp.2025.01.010
Langmead B., and Nellore A., 2018, Cloud computing for genomic data analysis and collaboration, Nature Reviews Genetics, 19(4): 208-219.
Molnár-Gábor F., Lueck R., Yakneen S., and Korbel J., 2017, Computing patient data in the cloud: practical and legal considerations for genetics and genomics research in Europe and internationally, Genome Medicine, 9(1): 58.
https://doi.org/10.1186/s13073-017-0449-6
O'Driscoll A., Daugelaite J., and Sleator R., 2013, 'Big data', Hadoop and cloud computing in genomics, Journal of Biomedical Informatics, 46(5): 774-781.
https://doi.org/10.1016/j.jbi.2013.07.001
Ogasawara O., 2022, Building cloud computing environments for genome analysis in Japan, Human Genome Variation, 9(1): 46.
https://doi.org/10.1038/s41439-022-00223-8
Satish K., 2024, Database security issues and challenges in cloud computing, International Journal on Recent and Innovation Trends in Computing and Communication, 11(11): 937-943.
https://doi.org/10.17762/ijritcc.v11i11.10396
Sheri V., Mohan H., Jogam P., Alok A., Rohela G., and Zhang B., 2025, CRISPR/Cas genome editing for cotton precision breeding: mechanisms, advances, and prospects, Journal of Cotton Research, 8(1): 4.
https://doi.org/10.1186/s42397-024-00206-w
Tang H., Jiang X., Wang X., Wang S., Sofia H., Fox D., Lauter K., Malin B., Telenti A., Xiong L., and Ohno-Machado L., 2016, Protecting genomic data analytics in the cloud: state of the art and opportunities, BMC Medical Genomics, 9(1): 63.
https://doi.org/10.1186/s12920-016-0224-3
Thesma V., Rains G., and Mohammadpour J., 2024, Development of a low-cost distributed computing pipeline for high-throughput cotton phenotyping, Sensors, 24(3): 970.
https://doi.org/10.3390/s24030970
Wang L.T., and Wang H.M., 2024, Big data in genomics: overcoming challenges through high-performance computing, Computational Molecular Biology, 14(4): 155-162.
https://doi.org/10.5376/cmb.2024.14.0018
Yang Z., Gao C., Zhang Y., Yan Q., Hu W., Yang L., Wang Z., and Li F., 2022a, Recent progression and future perspectives in cotton genomic breeding, Journal of Integrative Plant Biology, 65(2): 548-569.
https://doi.org/10.1111/jipb.13388
Yang Z., Wang J., Huang Y., Wang S., Wei L., Liu D., Weng Y., Xiang J., Zhu Q., Yang Z., Nie X., Yu Y., Yang Z., and Yang Q., 2022b, CottonMD: a multi-omics database for cotton biological study, Nucleic Acids Research, 51(D1): D1446-D1456.
https://doi.org/10.1093/nar/gkac863
Yu J., Jung S., Cheng C., Ficklin S., Lee T., Zheng P., Jones D., Percy R., and Main D., 2013, CottonGen: a genomics, genetics and breeding database for cotton research, Nucleic Acids Research, 42(D1): D1229-D1236.
https://doi.org/10.1093/nar/gkt1064
Yu J., Jung S., Cheng C., Lee T., Zheng P., Buble K., Crabb J., Humann J., Hough H., Jones D., Campbell J., Udall J., and Main D., 2015, CottonGen: the community database for cotton genomics, genetics, and breeding research, Plants, 10(12): 2805.
https://doi.org/10.3390/plants10122805
Zhu S.J., and Luo M.T., 2024, Advancements in pest management techniques for cotton crops, Bioscience Methods, 15(4): 196-206.
https://doi.org/10.5376/bm.2024.15.0020
Zhu T., Liang C., Meng Z., Sun G., Meng Z., Guo S., and Zhang R., 2017, CottonFGD: an integrated functional genomics database for cotton, BMC Plant Biology, 17(1): 101.
https://doi.org/10.1186/s12870-017-1039-x
Other articles by authors
. Kaiwen Liang
