The use of data mixing to increase the size of the dataset for colon cancer diagnosis using diffuse reflectance spectroscopy and machine learning
Valentin Kupriyanov1,2, Maria R. Pinheiro3, Sónia D. Carvalho4,5, Isa C. Carneiro4,6, Rui M. Henrique4,7, Valery V. Tuchin2,8,9, Luís M. Oliveira3,10, Marine Amouroux1, Yury Kistenev2 and Walter Blondel1
1Université de Lorraine, CNRS, CRAN UMR 7039, Vandoeuvre-Lès-Nancy, France
2Laboratory of Laser Molecular Imaging and Machine Learning, Tomsk State University, Tomsk, Russia
3Institute for Systems and Computer Engineering, Technology and Science (INESC TEC), Porto, Portugal
4Department of Pathology and Cancer Biology and Epigenetics Group, Portuguese Oncology Institute of Porto, Porto, Portugal
5Department of Pathology, Santa Luzia Hospital (ULSAM), Viana do Castelo, Portugal
6Department of Pathological, Cytological and Thanatological Anatomy, Polytechnic of Porto – School of Health (ESS), Porto, Portugal
7Department of Pathology and Molecular Immunology, Porto University – Institute of Biomedical Sciences Abel Salazar, Porto, Portugal
8Science Medical Center, Saratov State University, Saratov, Russian Federation
9A. N. Bach Institute of Biochemistry, RC “Biotechnology of the Russian Academy of Sciences,” Moscow, Russian Federation
10Physics Department, Polytechnic of Porto – School of Engineering (ISEP), Porto, Portugal
Abstract
Combining optical methods with machine learning techniques to diagnose pathologies and diseases is a promising line of research. However, accurate and reliable machine learning models require a large number of samples, which is often impossible to obtain in medical research. This problem can be partially addressed with data generation techniques, the simplest and most accessible of which is mixing. This study presents the results of using different mixing-based data generation strategies to enlarge a dataset of diffuse reflectance spectra measured on XX? healthy and cancerous ex vivo colon tissue samples in order to train a classification model. The experimental set-up consisted of an integrating sphere coupled to a broadband deuterium-halogen light source and to a XXX spectrometer, which acquired reflected intensity spectra in the range from 230 nm to 900 nm. The data were mixed in three ways: mixing the original spectra with randomly chosen weights, mixing with one randomly chosen spectrum as the base and the rest as auxiliaries, and mixing multiple randomly chosen spectra with equal weights. This contribution compares the effect of these strategies on the classification of diffuse reflectance spectra of healthy and cancerous colon tissues.
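As an illustration only, the three mixing strategies named in the abstract could be sketched in Python with NumPy as follows. This is not the authors' implementation. The function names, the number of mixed spectra `k`, the fixed `base_weight`, the use of a Dirichlet distribution for the random weights, and the restriction of mixing to spectra of a single class are all assumptions made for this sketch. The input `spectra` is assumed to be an array of shape (number of spectra, number of wavelengths) holding spectra of one class.

```python
import numpy as np


def mix_random_weights(spectra, n_new, k=3, rng=None):
    """Strategy 1 (sketch): mix k spectra with randomly chosen weights."""
    rng = np.random.default_rng(rng)
    out = np.empty((n_new, spectra.shape[1]))
    for i in range(n_new):
        idx = rng.choice(len(spectra), size=k, replace=False)
        # Assumption: weights are non-negative and sum to 1 (Dirichlet draw).
        w = rng.dirichlet(np.ones(k))
        out[i] = w @ spectra[idx]
    return out


def mix_base_plus_auxiliary(spectra, n_new, k=3, base_weight=0.7, rng=None):
    """Strategy 2 (sketch): one random base spectrum plus k-1 auxiliaries."""
    rng = np.random.default_rng(rng)
    out = np.empty((n_new, spectra.shape[1]))
    # Assumption: the base gets a fixed weight; auxiliaries share the rest equally.
    w = np.full(k, (1.0 - base_weight) / (k - 1))
    w[0] = base_weight
    for i in range(n_new):
        # The first drawn index plays the role of the base spectrum.
        idx = rng.choice(len(spectra), size=k, replace=False)
        out[i] = w @ spectra[idx]
    return out


def mix_equal_weights(spectra, n_new, k=3, rng=None):
    """Strategy 3 (sketch): average k randomly chosen spectra."""
    rng = np.random.default_rng(rng)
    out = np.empty((n_new, spectra.shape[1]))
    for i in range(n_new):
        idx = rng.choice(len(spectra), size=k, replace=False)
        out[i] = spectra[idx].mean(axis=0)
    return out
```

In this sketch every generated spectrum is a convex combination of measured spectra, so at each wavelength it stays between the smallest and largest measured values of the spectra that were mixed.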
Speaker
Valentin Kupriyanov
Université de Lorraine, CNRS, CRAN UMR 7039, Vandoeuvre-Lès-Nancy, France and Laboratory of Laser Molecular Imaging and Machine Learning, Tomsk State University, Tomsk, Russia
France, Russian Federation