Mihir Shekhar

Ph.D. Student

Center for Data Engineering

IIIT - Hyderabad

Telangana, India

E-mail: [firstname]dot[lastname]@research.iiit.ac.in

Phone: 0091 9581 826 727



About Me

I am a Ph.D. student working with Prof. Kamalakar Karlapalem at the International Institute of Information Technology, Hyderabad. I received my Bachelor's degree in Computer Science and Engineering from Jalpaiguri Government Engineering College. Before pursuing my Master's, I worked at Tata Consultancy Services, Kolkata, as a software engineer; my desire to pursue higher studies then brought me to IIIT-H, where I joined as a Ph.D. student. I work in the Data Science and Analytics Center (DSAC) at the university.

My research focuses on scalable high-dimensional data clustering. As a research associate in DSAC, I also work on various projects that utilise Natural Language Processing and Machine Learning.

My research interests include the application of Machine Learning and NLP techniques to problems in Data Mining, Information Retrieval, Information Extraction, etc. In my free time I love to listen to music, play chess or badminton, and cook innovative dishes.


Research Interests:

  • Data Mining
  • Information Extraction and Retrieval
  • Natural Language Processing
  • Machine Learning

Publications:

  • K Santosh, Romil Bansal, Mihir Shekhar, Vasudeva Varma, Author Profiling: Predicting Age and Gender from Blogs, Notebook for PAN at CLEF 2013, Valencia, Spain. link

Current & Past Projects:

  • Deep Clustering and Outlier Detection
  • A semi-supervised deep clustering framework for simultaneous clustering and outlier/noise detection in high-dimensional data.
  • Patient Cohort Detection and Visualisation
  • A weakly supervised metric learning system for patient similarity detection from discharge summaries, which is further used to identify patient cohorts via clustering and visualisation. This project is funded by Hitachi R&D Labs.
  • Overcoming Data Sparsity in Neural Machine Translation
  • This project addresses the problem of data sparsity when training Neural Machine Translation systems.
  • Medical Document Analysis (current)
  • This project involves creating a statistical parser for medical documents and assigning medical terms, like drugs, symptoms, etc., to their corresponding diseases. This project is funded by Hitachi R&D Labs.
  • Twitter Data Analysis
  • This project involves the retrieval of semantic knowledge from tweets. The semantics of interest are: Event/Episode Detection, Sentiment Analysis, and Concept Extraction. This project is funded by Hitachi R&D Labs. link
  • Author Profiling on blogs
  • This project involves predicting the age and gender of an author from the blogs they write. SVM and decision tree classifiers were used for classification. link
  • Web Content Filtering
  • This project involves creating an automatic system for categorizing web pages into different classes based on their content. A web content filter is built on top of it to block undesired categories dynamically.
  • StackOverFlow Tag Prediction
  • This project involves predicting tags for StackOverflow data. A k-nearest neighbor approach and HMMs built on the tag graph were used to predict results.
  • Finding Most Influential Entities in Web
  • This project involves finding and ranking the most influential people among a group of Baidu users, differentiating between fake and genuine users. GraphChi was used to implement the algorithms for scalability and speed; the system can process a billion entities in 15 minutes.
  • Data-Mining on Accidents Dataset
  • This project involved pre-processing a huge dataset, followed by clustering with the K-Means and DBSCAN algorithms and frequent item-set generation, to analyse trends in the occurrence of traffic casualties under several conditions.
  • Wikipedia Search Engine
  • Created a fully functional offline search engine over a 42 GB Wikipedia corpus. Multilevel indices were built on page title, infobox, text, and outlinks to support queries over multiple fields. Everything was implemented from scratch, without existing tools such as Lucene, Lemur, or the wikixmlj parser.
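To illustrate the field-level indexing idea behind the search engine project above, here is a minimal sketch of a field-aware inverted index. The class name, field names, and documents are illustrative only, not taken from the actual project.

```python
from collections import defaultdict

class FieldIndex:
    """Toy field-aware inverted index: one posting map per field."""

    def __init__(self):
        # field -> term -> set of document ids
        self.index = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, fields):
        # fields maps a field name (e.g. "title", "text") to its raw text
        for field, text in fields.items():
            for term in text.lower().split():
                self.index[field][term].add(doc_id)

    def search(self, field, term):
        # return the sorted posting list for a single-field, single-term query
        return sorted(self.index[field][term.lower()])

idx = FieldIndex()
idx.add(1, {"title": "Alan Turing", "text": "computing pioneer"})
idx.add(2, {"title": "Turing machine", "text": "abstract model of computing"})

print(idx.search("title", "turing"))    # -> [1, 2]
print(idx.search("text", "computing"))  # -> [1, 2]
```

A real multi-field engine would additionally tokenize and stem terms, store positional postings, and merge per-field results with a ranking function; this sketch only shows how separate indices per field support queries scoped to title, infobox, text, or outlinks.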


Work Experience