University of Engineering and Technology Taxila

Data Science Lab

With a focus on offering a new postgraduate program, the MS in Data Science, the Department of Computer Science began developing a new lab based on cloud computing that offers a solution for data analysis on resizable infrastructure through an on-demand resource model. The lab focuses on the seven Vs of Big Data, i.e., Volume, Velocity, Variety, Variability, Veracity, Visualization, and Value. The Big Data Lab offers students an opportunity to run complex machine learning algorithms, social mining, and prediction-based analysis over huge datasets, petabytes of raw data, to extract information that becomes knowledge and thereby transforms every discipline of Engineering and Computer Science to discover new challenges. The lab's research agenda is to offer a multi-disciplinary research platform that supports the MS Data Science program for students with different scientific backgrounds, so that research issues related to Finance, Computing, and Engineering can be addressed at a single facility.

Goals

The goals of this lab are:

    To provide University students remote access to a single VM or a set of VMs from anywhere.
    To utilize existing lab computers and infrastructure to reduce the cost of implementing a remote lab.
    To provide VM hosting accessible through a hosting pool.
    To disseminate information about the project to communities across the Punjab through presentations at professional state and regional conferences, and to assist other interested universities in implementing the platform.
    To provide students a facility where they can work on and analyze large, complex datasets.

Tools

Hadoop Cluster

Students get an opportunity to work with Hadoop both on a single-node cluster and on a fully distributed (multi-node) Cloudera distribution cluster. Hadoop allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers, and Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

The Hadoop platform is designed to solve problems where you have big data and want to run analytics that are deep and computationally extensive, such as clustering and targeting. The majority of this data is "unstructured": complex data poorly suited to management by structured storage systems such as relational databases. Unstructured data comes from many sources and takes many forms: web logs, text files, sensor readings, user-generated content such as product reviews or text messages, audio, video, still imagery, and more. Dealing with big data requires two things:

  • Inexpensive, reliable storage; and
  • New tools for analyzing unstructured and structured data.

Cloudera offers commercial support and services to Hadoop users.
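The "simple programming model" behind Hadoop is MapReduce: a mapper emits key-value pairs, the framework sorts them by key, and a reducer aggregates each key's values. As a rough illustration of that contract, here is a minimal word-count sketch in plain Python (the sample text and function names are illustrative only, not part of any lab exercise; on a real cluster the mapper and reducer would run as separate scripts, e.g. via Hadoop Streaming):

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce step: sum the counts for each word.
    Input must be sorted by key, which Hadoop guarantees between phases."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Locally we simulate Hadoop's shuffle phase with sorted().
    text = ["big data needs big tools", "data tools scale"]
    counts = dict(reducer(sorted(mapper(text))))
    print(counts)  # e.g. {'big': 2, 'data': 2, ...}
```

On a cluster, the same mapper and reducer logic runs in parallel across many machines, each processing a local block of the input stored in HDFS; the local `sorted()` call above stands in for Hadoop's distributed shuffle-and-sort.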

Projects

  • Calculation of CPU performance, power and cost using Hadoop http://ieeexplore.ieee.org/document/7845093/?reload=true
  • Performance analysis of shared-nothing SQL-on-Hadoop frameworks based on columnar database systems http://ieeexplore.ieee.org/document/7845097/
  • Modeling sentiment terminologies: Target based polarity phenomena http://ieeexplore.ieee.org/document/7845100/

Reading Material

  • https://www.datascienceweekly.org/data-science-resources/the-big-list-of-data-science-resources
  • http://www.kdnuggets.com/2015/09/free-data-science-books.html
  • http://hpc.fs.uni-lj.si/sites/default/files/HPC_for_dummies.pdf
  • http://www.shodor.org/media/content/petascale/materials/UPModules/beginnersGuideHPC/moduleDocument_pdf.pdf