# My experience at the SKA Big Data Africa School

I was lucky to be selected to attend the Square Kilometre Array (SKA) Big Data Africa School, funded by the SKA and Development in Africa with Radio Astronomy (DARA). It was held in Cape Town, South Africa.

Big Data is a cloudy idea: easy to know when you have it, hard to describe. I like thinking of it as data that is large enough that drawing information from it is no longer “easy”. Yes, this definition is ambiguous and probably incomplete. I do not claim absolute knowledge. Come up with your own definition!

While I found the dawn of big data, and its challenges for an information-flooded, knowledge-ridden society, impressive, what sparked my interest most was the specific computational schemes used to deal with the data. These techniques fall under a general scheme of computing and informatics called machine learning. A more mystic and misty (and maybe more attractive) term for these computational problem-solving schemes is artificial intelligence.

Before the conceptualization of machine learning, the dominant scheme in computer science was so-called rule-based learning. The computer was thought of as a deterministic system, and any algorithm was designed with this in mind: you had to tell the computer what to do at every step, with each previous step determining the next.

This mode of thinking is excellent for a certain class of problems, e.g. calculations. One of the earliest instances of such an approach was in the design of calculating machines. A famous example of such a device is Charles Babbage’s difference engine, which used the well-known mathematical algorithm known as the method of differences. Rule-based learning has limitations, however. Imagine you had a set of data from which you had to find some trend. For rule-based computation, you would have to know the rule *a priori*, then fit the rule to the data. It would be very difficult to get rule-based systems to find these trends independently. Factor in the massive data volumes we deal with in big data, and you have a mess.
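To make the method of differences concrete, here is a toy sketch (my own illustration, not material from the school) of how a difference engine tabulates a polynomial. For a degree-n polynomial the n-th differences are constant, so every new value can be produced by additions alone, exactly the kind of fixed, step-by-step rule a deterministic machine excels at:

```python
# Tabulating f(x) = x^2 with nothing but additions, as Babbage's
# difference engine would: the second differences of x^2 are the
# constant 2, so cascaded additions generate the whole table.

def tabulate_squares(n_values):
    """Return [0^2, 1^2, ..., (n_values-1)^2] using only additions."""
    value = 0        # f(0) = 0
    first_diff = 1   # f(1) - f(0) = 1
    second_diff = 2  # constant second difference of x^2
    table = []
    for _ in range(n_values):
        table.append(value)
        value += first_diff        # next value by addition
        first_diff += second_diff  # next first difference by addition
    return table

print(tabulate_squares(6))  # [0, 1, 4, 9, 16, 25]
```

Note that the “rule” (the constant second difference) had to be known in advance and built into the procedure, which is exactly the limitation described above.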

Machine learning steps in to save the day!

The idea is to reduce the amount of control you have over the computer. Just let it be. Let the computer learn the trends in the data on its own. There are several schemes for implementing this, falling under the general classes of supervised and unsupervised learning. In supervised learning, the computer is provided with a labeled data set and tries to associate the labels with the data. That is the training phase. After that, the machine sits an exam: it is given a data set with the labels removed and tries to label it using the association rules it came up with during training. If it fails the test, some parameters are modified to improve its performance. If it passes the test, it is given the real unlabeled data set to work with.
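The train-then-exam loop above can be sketched in a few lines. This is a deliberately minimal illustration of my own (a nearest-centroid classifier on made-up 1-D data), not the methods used at the school:

```python
# Minimal supervised-learning loop: train on labeled data, then
# "sit an exam" on held-out data whose labels the model never sees.
# The nearest-centroid classifier and the toy data are illustrative
# assumptions, not the school's actual code.

def train(samples, labels):
    """Learn one centroid (mean) per label from the labeled training set."""
    centroids = {}
    for label in set(labels):
        points = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = sum(points) / len(points)
    return centroids

def predict(centroids, sample):
    """Assign the label whose centroid lies nearest to the sample."""
    return min(centroids, key=lambda label: abs(centroids[label] - sample))

# Training phase: labeled data.
train_x = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
train_y = ["low", "low", "low", "high", "high", "high"]
model = train(train_x, train_y)

# Exam phase: labels are hidden from the model and only used for marking.
test_x = [0.9, 5.1]
test_y = ["low", "high"]
score = sum(predict(model, x) == y for x, y in zip(test_x, test_y)) / len(test_x)
print(score)  # 1.0
```

If the exam score were too low, one would adjust the model (here, perhaps the choice of features or classifier) and retrain, which is the parameter-tuning step described above.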

In unsupervised learning, no labeled data set is provided. The algorithm is designed so that the computer groups data by finding common characteristics. This approach is particularly powerful in clustering applications.
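As a concrete (and again purely illustrative) example of clustering without labels, here is a toy k-means on 1-D data. The algorithm is only told how many groups to find; it discovers the groups from proximity alone:

```python
# Toy unsupervised clustering: 1-D k-means. No labels are provided;
# the algorithm alternates between assigning points to the nearest
# centre and moving each centre to the mean of its assigned points.
# Illustrative sketch only, not code from the school.

def kmeans_1d(points, k, iterations=10):
    """Cluster 1-D points into k groups; return (centres, clusters)."""
    centers = points[:k]  # naive initialisation: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest centre wins
            nearest = min(range(k), key=lambda i: abs(centers[i] - p))
            clusters[nearest].append(p)
        # update step: move each centre to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.1, 0.9, 8.0, 8.2, 7.8], k=2)
print(sorted(round(c, 2) for c in centers))  # [1.0, 8.0]
```

The two centres settle on the two obvious groups in the data even though nothing labeled them, which is the essence of the clustering applications mentioned above.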

At the school, we were introduced to these techniques. The school uses a “learn by doing” scheme: we were split into groups and given problems to work on. Our group worked on the processing of diabetic retinopathy images. Diabetic retinopathy is a degenerative disease of the eye that affects diabetic patients. An image is taken and then classified on a zero-to-four scale, where zero represents a healthy eye and four a severely diseased one. The classification of these images presents a challenge: trained doctors have a classification accuracy of around sixty percent. Since misclassification may result in misdiagnosis, this is a serious issue.

What is needed is a precise, stable and robust classification system. That is what we set out to build at the school, using machine learning implementations in the Python programming language. We had 35,000 images to classify, but only worked with a sample of 6,000. We hit a peak accuracy of seventy-four percent, around fourteen percentage points better than the average performance of trained doctors. How about that for a week’s work!
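For readers wondering what an accuracy figure like this means in practice: it is simply the fraction of images whose predicted grade (0–4) exactly matches the true grade. The grades below are made-up illustrative values, not our actual results:

```python
# Classification accuracy on a 0-4 grading scale: the fraction of
# exact matches between true and predicted grades. The example grades
# are fabricated for illustration only.

def accuracy(true_grades, predicted_grades):
    """Fraction of positions where the two equal-length lists agree."""
    matches = sum(t == p for t, p in zip(true_grades, predicted_grades))
    return matches / len(true_grades)

true_grades      = [0, 0, 1, 2, 3, 4, 4, 2]
predicted_grades = [0, 1, 1, 2, 3, 4, 3, 2]
print(accuracy(true_grades, predicted_grades))  # 0.75
```

Note that plain accuracy treats a grade-4 eye misread as grade 3 the same as one misread as grade 0; in a medical setting the severity of the error matters too, which is part of what makes the problem hard.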

Google used similar methods on the same problem. Their algorithm did much better than ours (of course), hitting ninety percent accuracy. They reached an agreement with the National Health Service of the United Kingdom to use the algorithm for free for five years. It would be wonderful to have an open-source version of this algorithm at work in Africa. The algorithm could then be distributed to hospitals for free and perhaps even adapted to other medical imaging procedures.

The school is a wonderful platform. You are trained by some of the best minds from around Africa and the world, and we moved quickly from introductory techniques to the state of the art in machine learning. The “learn by doing” approach is extremely useful: it’s an excellent simulation of how a scientist approaches problems. I recommend the school to any young African problem solver!