
Dinesh Khanal

“100% Remote”
Expertise: GCP cloud data implementation, SQL, Python, Data Management, Big Data, and the Apache big data ecosystem

Summary:
• Dinesh has 7+ years of experience across all phases of the Software Development Life Cycle, including design, architecture, implementation, release, unit testing, refactoring, and maintenance of web-based applications using Big Data and Hadoop technologies.
• Expertise in major components of the Hadoop ecosystem, including Spark, HDFS, Hive, Pig, HBase, ZooKeeper, Sqoop, Oozie, Flume, and Kafka.
• Experience using Hadoop clusters, HDFS, and Hadoop ecosystem tools such as Spark, Kafka, and Hive for social and media data analytics.
• Excellent understanding of the Hadoop Distributed File System (HDFS) and experienced in developing efficient MapReduce jobs to process large datasets.
• Experience developing Spark applications using RDD transformations, Spark Core, Spark Streaming, and Spark SQL.
• Experience optimizing data storage in Hive using partitioning and bucketing on managed and external tables (a minimal sketch follows this list).
• Proficient in migrating data between HDFS and relational database systems in both directions using Sqoop, according to client requirements.
• Installed various Hadoop ecosystem components and Hadoop daemons.
• Experience with data processing tasks such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
• Ability to analyze search performance metrics such as query latency, throughput, and relevance scores to identify
and resolve performance bottlenecks.
• Strong collaboration skills to work effectively with cross-functional teams such as product management, software
engineering, and data science.
• Excellent communication skills to articulate technical concepts to both technical and non-technical stakeholders.
• Hands-on experience extracting data from different databases and supporting risk management.
• Hands-on experience installing Kerberos security, setting up permissions, and establishing standards and processes for Hadoop-based application design and implementation.
• Experienced in cloud architecture using the AWS ecosystem and GCP, and with Hadoop distributions (Cloudera and Hortonworks).
• Worked on Azure Databricks, Azure Data Factory and Azure Data Lake.
• Experienced in Risk Frontier for risk management and data analysis.
• Experience with Big Data platforms such as Hadoop, Microsoft Azure Data Lake, Azure Data Factory, Azure Databricks, Azure Blob Storage, and graph databases.
• Brief exposure to implementing and maintaining Hadoop and Hive security.
• Experience integrating Kafka and Spark, using Avro to serialize and deserialize data for Kafka producers and consumers.
• Expert in writing complex SQL queries with databases like MySQL, Oracle, SQL Server.
• Understanding of Hadoop Big Data architectures, data movement technologies, database partitioning, database
optimization, and building communication channels between structured and unstructured databases.
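
A minimal PySpark sketch of the Hive partitioning and bucketing approach referenced above; the session options, table, column, and path names are hypothetical and the exact layout would depend on the project.

from pyspark.sql import SparkSession

# Build a session with Hive support so saveAsTable creates managed Hive tables.
spark = (
    SparkSession.builder
    .appName("hive-partition-bucket-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical source data.
events = spark.read.parquet("/data/raw/events")

# Partition by date for partition pruning; bucket by customer_id to speed up joins.
(
    events.write
    .partitionBy("event_date")
    .bucketBy(32, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("analytics.events_bucketed")
)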

Education:
• Master's in Engineering Management, San Jose State University, San Jose, CA
• GCP Certified Google Cloud Engineer.

Technical Skills:
Hadoop Ecosystem: HDFS, MapReduce, YARN, Hive, Pig, Impala, ZooKeeper, Sqoop, Oozie, Kafka, Flume, Spark, PySpark
Languages: C, Java, SQL, PL/SQL, Scala, Shell Scripts
Databases: HBase, Oracle, SQL Server, MySQL, MongoDB, Cassandra
Cloud Services: AWS, EMR, GCP, Azure, Cloudera, Hortonworks, Databricks, Elasticsearch, Elastic Cloud, Cloud Foundry
Web Tools: HTML, JavaScript, XML, JDBC, EJB, MVC, AJAX, JSP, Servlets
Operating Systems: Linux, UNIX, Windows

Professional Experience:

Microsoft Corporation 06/2022 – Present


Data Engineer
Responsibilities:
• Worked on Azure stored procedures and migrated storage accounts from the existing on-premises server to Azure Blob Storage (see the sketch after this list).
• Coordinated with the team to migrate file shares from the existing on-premises server to Azure Files.
• Expertise in designing and implementing search solutions using open-source search technologies such as Elasticsearch, Solr, and Lucene. Strong understanding of information retrieval techniques such as term frequency-inverse document frequency (TF-IDF), relevance ranking, and clustering.
• Collaborated with technical leads in identifying the deliverable dependencies.
• Participated in Daily Scrum meeting to resolve various project related issues and tasks.
• Prepared test cases and strategies for unit testing and integration testing.
• Wrote and executed SQL queries to validate the data.
• Extracted data from source files, then analyzed, cleansed, and loaded it into the storage accounts.
• Coordinated with the team to analyze the monthly reports.
• Wrote and modified stored procedures.
• Analyzed the data coming from SQL Server.
• Developed RESTful APIs to facilitate data extraction, transformation, and loading processes, enabling seamless
integration with external systems.
• Implemented RESTful endpoints for data ingestion, allowing external applications to securely transmit data to the
data processing pipeline.
• Utilized RESTful principles to design and implement data retrieval APIs, enabling efficient and flexible data access
for downstream applications.
• Identified business and functional requirements and held detailed discussions linking tasks across functional areas based on OFAC and SAR requirements.
• Contributed to internal/external audit efforts and internal Risk Control Self-Assessments in partnership with Compliance, Legal, Risk Management, and Audit, and supported their successful deployment.
• Oversaw and reviewed the test plans, test scenarios, and test cases to ensure there were no ambiguities.
• Leveraged Azure Databricks to process and analyze large volumes of data, implementing scalable and parallel
processing solutions for efficient data transformations and analytics.
• Built and maintained Azure SQL Data Warehouse to support high-performance analytics and reporting, optimizing
data storage and query performance through intelligent indexing and partitioning strategies.
• Utilized SSIS (SQL Server Integration Services) to design and develop ETL (Extract, Transform, Load) processes for
efficient data integration and migration.
• Leveraged SSIS to extract data from various sources, including databases, flat files, and APIs, and transform it
according to business requirements.
• Developed complex data transformation workflows using SSIS to handle data cleansing, validation, and
aggregation tasks.
• Implemented error handling and data quality checks within SSIS packages to ensure the integrity and reliability of
data.
• Collaborated with cross-functional teams to gather data requirements and translate them into SSIS packages,
optimizing data flow and performance.
• Implemented data orchestration and workflow automation using Azure Logic Apps, ensuring smooth data
movement and coordination across various data processing steps.
• Integrated Azure Machine Learning into data pipelines to develop and deploy predictive models, enabling data-
driven decision-making and automated insights generation.
• Implemented data governance practices using Azure Purview, establishing data lineage, cataloging, and data
discovery capabilities to ensure data quality and compliance.
• Utilized Azure Blob Storage and Azure Data Lake Storage for scalable and cost-effective data storage solutions,
implementing data tiering strategies based on data access patterns and usage.
• Conducted UAT to ensure that all user requirements were met.
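
A minimal sketch of the on-premises-to-Azure Blob Storage upload step referenced in the first bullet above, assuming the azure-storage-blob Python SDK; the connection string variable, container name, and file paths are hypothetical.

import os
from azure.storage.blob import BlobServiceClient

# Hypothetical connection string and container; in practice these come from configuration.
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("migrated-fileshare")

def upload_file(local_path: str, blob_name: str) -> None:
    """Copy one on-premises file into the target blob container."""
    blob = container.get_blob_client(blob_name)
    with open(local_path, "rb") as data:
        blob.upload_blob(data, overwrite=True)

upload_file("/mnt/fileshare/reports/monthly.csv", "reports/monthly.csv")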

CVS Health Group, Hartford, CT                                                               08/2021 – 06/2022


Cloud Data Engineer
Responsibilities:
• Worked with Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, and Cloud Storage.
• Implemented Slowly Changing Dimensions (SCD) methodology to handle historical data updates and changes in a
large-scale data engineering project.
• Designed and developed an automated data pipeline to process and integrate CSV files into a dimensional data
model, utilizing SCD Type 2 techniques for capturing and managing historical changes.
• Collaborated with stakeholders to define business rules and requirements for handling dimension updates,
ensuring data consistency and accuracy over time.
• Keen on keeping up with the newer technology stack that Google Cloud Platform (GCP) adds.
• Worked with data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (see the DAG sketch after this list).
• Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
• Experience with Azure Databricks, Azure Data Factory, and Azure Data Lake.
• Experience building Power BI reports on Azure Analysis Services for better performance.
• Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
• Used DataStage Designer to develop processes for data extraction.
• Performed Import and Export of DataStage components and table definitions using DataStage Manager.
• Hands-on experience with search-related data structures such as inverted indexes, bloom filters, and bitmap
indexes.
• Developed and maintained Cassandra data models to optimize storage and retrieval of large-scale datasets.
• Implemented Cassandra database clusters for high availability and fault tolerance, ensuring data reliability and
minimizing downtime.
• Designed and implemented ETL processes to extract, transform, and load data into Cassandra, ensuring efficient
data ingestion and integration.
• Utilized Cassandra's distribution architecture and built data pipelines to handle real-time data streaming and
processing.
• Implemented data replication strategies and optimized data partitioning techniques in Cassandra to achieve high-
performance data access.
• Tuned and optimized Cassandra database performance by analyzing query execution plans, indexing strategies,
and data compaction techniques.
• Expertise in integrating search capabilities with other data engineering components such as data pipelines, data
lakes, and data warehouses.
• Familiarity with the latest advancements in search technologies, such as natural language processing (NLP),
machine learning, and graph databases.
• Created and maintained source code in GitHub, keeping track of changes and sharing updates to scripts and notes with the team.
• Developed a layer of application modules over the Python Pandas library, delivering various DataFrame visualization tools and performing data wrangling and cleaning with Pandas.
• Developed DAGs and set up the production environment for Apache Airflow as the scheduling and automation system that managed ETL and reporting.
• Used HiveQL for data analysis like creating tables and importing the structured data to specified tables for
reporting.
• Used Pig to perform data validation on data ingested with Sqoop and Flume, pushing the cleansed dataset into HBase.

New York Life Insurance, New York, NY 11/2019 – 07/2021


Data Engineer
Responsibilities:
• Involved in entire lifecycle of the projects including design, development, deployment, testing, implementation,
and support.
• Extracted the data from application servers into HDFS and bulk loaded the cleaned data into HBase using Spark.
• Worked on the Hadoop stack, ETL tools like Talend, reporting tools like Tableau, security with Kerberos, user provisioning with LDAP, and many other Big Data technologies across multiple use cases.
• Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
• Leveraged Amazon Elastic MapReduce (EMR) to design and implement scalable data processing pipelines for New
York Life Insurance's large-scale datasets, resulting in a significant reduction in processing time and cost.
• Developed custom Hadoop MapReduce and Spark jobs on EMR to efficiently process and analyze multi-terabyte
data, enabling faster decision-making and improved business insights for the insurance domain.
• Implemented data ingestion processes using EMR and Apache Hive to extract, transform, and load (ETL) vast
amounts of structured and semi-structured data from various sources, ensuring data integrity and accuracy in the
downstream systems.
• Optimized EMR cluster configurations and fine-tuned Spark configurations to achieve optimal performance,
enabling efficient resource utilization and enhancing the overall data processing capabilities.
• Collaborated with cross-functional teams to design and implement EMR-based data workflows, integrating data
from different sources, such as databases, data lakes, and external APIs, to provide a unified and comprehensive
view of the data landscape.
• Involved in start to end process of Hadoop cluster setup where in installation, configuration and monitoring the
Hadoop Cluster.
• Collaborated with cross-functional teams to define RESTful API specifications, ensuring consistent and
standardized data communication across systems.
• Integrated RESTful APIs with authentication and authorization mechanisms, ensuring secure access to sensitive
data.
• Optimized RESTful API performance by implementing caching strategies, data compression techniques, and
request/response optimization.
• Implemented error handling and logging mechanisms within RESTful APIs to provide robust error reporting and
troubleshooting capabilities.
• Created and ran Sqoop jobs with incremental load to populate Hive external tables.
• Developed efficient MapReduce programs to filter out unstructured data and developed multiple MapReduce jobs
to perform data cleaning and pre-processing on Hortonworks.
• Created Hive queries to process large sets of structured and semi-structured data.
• Involved in implementing security on the Hortonworks Hadoop cluster with Kerberos, working with the operations team to move the non-secured cluster to a secured cluster.
• Wrote partitioned data into Hive external tables using Parquet format.
• Transferred Streaming data from data sources into HDFS and HBase using Apache Flume.
• Involved in running Hadoop jobs for processing millions of records of text data. Troubleshoot the build issue during
the Jenkins build process. Implement Docker to create containers for Tomcat Servers, Jenkins.
• Implemented Spark using Scala and utilizing Data Frames and Spark SQL API for faster processing of data.
• Created Hive and Impala queries to spot emerging trends by comparing Hadoop data with historical metrics.
• Worked on building input adapters for data dumps from FTP servers using Apache Spark.
• Used Spark streaming with Kafka to build continuous ETL pipeline for real time analytics.
• Worked with the team for risk management and data analytics.
• Worked on Spark SQL and Data Frames for faster execution of Hive queries using Spark and AWS EMR.
• Loaded data from servers to AWS S3 buckets and set appropriate bucket permissions.
• Involved in continuous Integration of application using Jenkins.
• Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker (see the producer sketch after this list).
• Designed and implemented data ingestion pipelines using AWS Glue, enabling efficient extraction, transformation,
and loading of data from various sources into AWS Redshift.
• Optimized AWS Glue jobs to improve data processing performance and reduce processing time by implementing
partitioning, parallelization, and caching techniques.
• Implemented data quality checks and validations within AWS Glue workflows to ensure data integrity and accuracy
during the ETL process.
• Collaborated with cross-functional teams to gather requirements, design data models, and develop efficient data
processing solutions using AWS Glue and Redshift.
• Monitored multiple Hadoop clusters environments using Ganglia and Nagios. Monitored workload, job
performance and capacity planning using Ambari. Installed and configured Hortonworks and Cloudera
distributions on single node clusters for POCs.
• Handled schema changes in data streams using Kafka.
• Worked on multi-Clustered environment and set up Cloudera and Hortonworks Hadoop ecosystem.
• Helped create data ingestion framework for multiple source systems using PySpark.
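
A minimal sketch of a Kafka producer like the one described above, assuming the kafka-python client; the broker address, topic name, and record layout are hypothetical.

import json

from kafka import KafkaProducer

# Hypothetical broker address; serialize records as JSON bytes.
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def publish(record: dict) -> None:
    """Send one record pulled from an external source to the Kafka broker."""
    producer.send("external-feeds", value=record)   # hypothetical topic name

publish({"source": "ftp", "file": "positions_20210101.csv", "rows": 10432})
producer.flush()   # ensure buffered records reach the broker before exiting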

Duke Energy, Charlotte, NC 04/2018 – 10/2019


Hadoop Developer/Data Engineer
Responsibilities:
• Actively involved in analysis of the system requirements specifications and involved in client interaction during
requirements specifications.
• Developed MapReduce programs to process raw data, populated staging tables and stored the refined data in
partitioned tables.
• Created Hive queries to spot emerging trends by comparing data with historical metrics.
• Used Python Pandas to import data from web service into HDFS and transformed data using Spark RDDs.
• Developed interactive Tableau dashboards and reports to visualize and analyze complex datasets for business
stakeholders.
• Designed and implemented efficient data extraction, transformation, and loading (ETL) processes using Tableau
Prep Builder to ensure data accuracy and consistency.
• Collaborated with cross-functional teams to gather business requirements and translate them into actionable
Tableau visualizations.
• Optimized Tableau performance by utilizing techniques such as data extracts, filters, and aggregation to enhance
dashboard loading speed and user experience.
• Created data connections and established data sources in Tableau Server, allowing for secure and scalable sharing
of dashboards across the organization.
• Conducted data validation and quality assurance checks on Tableau visualizations to ensure data integrity and
reliability.
• Developed new topics to segment data from Kafka and other web servers into HDFS.
• Worked with Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of
Hadoop jobs.
• Transferred Streaming data from data sources into HDFS and HBase using Apache Flume.
• Implemented data ingestion and cluster handling in real time processing using Kafka.
• Implemented Spark RDD transformations to map business logic and applied actions on top of those transformations (see the RDD sketch after this list).
• Found solutions to data mapping for structured and unstructured data by applying schema inference.
• Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
• Importing and exporting data into RDBMS and Hive using Sqoop.
• Developed backend (server-side) in Scala.
• Deployed containerized applications using Docker for standardized service infrastructure.
• Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and a write-back tool.
• Implemented incremental loading strategies in SSIS to efficiently process large volumes of data and improve overall
system performance.
• Monitored and optimized SSIS package execution, identifying and resolving performance bottlenecks and data
processing issues.
• Worked with ELK Stack cluster for importing logs into Logstash, sending them to Elasticsearch nodes and creating
visualizations in Kibana.
• Used Oozie Scheduler system to automate the pipeline workflow and orchestrate the jobs that extract the data in
a timely manner.
• Developed ETL pipelines using Spark and Hive to perform business specific transformations.
• Automated pipelines in Spark for bulk loads as well as incremental loads of various datasets.
• Worked on building input adapters for data dumps from FTP servers using Apache Spark.
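
A minimal PySpark RDD sketch of the transformation-then-action pattern referenced above; the HDFS path and record layout are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-transformations-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical CSV usage logs: meter_id,timestamp,kwh
lines = sc.textFile("hdfs:///data/raw/usage_logs")
records = lines.map(lambda line: line.split(","))          # transformation: parse rows
valid = records.filter(lambda cols: len(cols) == 3)        # transformation: drop malformed rows
usage_by_meter = (
    valid.map(lambda cols: (cols[0], float(cols[2])))      # (meter_id, kwh)
         .reduceByKey(lambda a, b: a + b)                  # transformation: total kWh per meter
)

# Action: triggers execution of the lazy transformations above.
for meter_id, total_kwh in usage_by_meter.take(10):
    print(meter_id, total_kwh)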

Gap Details: Seeking a new job opportunity, 03/2018

Sage Rutty & Co., Rochester NY 01/2016 – 02/2018


Hadoop Developer
Responsibilities:
• Participated in requirement gathering, analysis, & documenting of business requirements by conducting meetings
with various business users.
• Developed MapReduce programs to process raw data, populated staging tables and stored the refined data in
partitioned tables.
• Automated all the jobs for pulling data from FTP server to load data into Hive tables, using Oozie workflows.
• Used Pig to preprocess logs stored in HDFS and then imported them into the Hive warehouse, enabling business analysts (BAs) to write Hive queries.
• Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log
files.
• Worked closely with front-end developers to consume RESTful APIs, ensuring seamless integration and efficient
data transfer between systems.
• Performed aggregations and analyses on large sets of log data using the Risk Frontier risk analysis tool.
• Involved in importing real-time data to Hadoop using Kafka and implemented Oozie jobs for daily imports.
• Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
• Used Spark SQL for getting customer insights, to be used for critical decision making by business users.
• Closely worked with data science team in building Spark MLlib applications to build various predictive models.
• Developed AWS Cloud Formation templates to create custom infrastructure of our pipeline.
• Developed an AWS Lambda function to invoke a Glue job as soon as a new file arrives in the inbound S3 bucket (see the sketch after this list).
• Developed custom ETL scripts in Python using AWS Glue to handle complex data transformations, data mapping,
and data cleaning tasks.
• Created and maintained AWS Glue Data Catalog, providing a centralized metadata repository for seamless data
discovery and access across multiple Redshift clusters.
• Implemented data encryption and security controls within AWS Glue and Redshift, ensuring compliance with
industry standards and protecting sensitive data.
• Optimized Redshift database performance by designing and tuning schema, indexes, and query execution plans
based on data access patterns and query performance analysis.
• Implemented automated monitoring and alerting solutions using AWS CloudWatch and AWS Glue job metrics to
proactively identify and resolve issues in the data processing pipeline.
• Conducted performance tuning and capacity planning for Redshift clusters, optimizing resource allocation and
managing cluster scaling to meet growing data processing demands.
• Developed ETL jobs to extract data from S3 and load it into a data mart in Amazon Redshift.
• Loaded data into HBase for online lookups to business using Scala.
• Worked on creating the Data Model for Cassandra from the current Oracle Data model.
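
A minimal sketch of the S3-triggered Lambda that starts a Glue job, as described above, assuming boto3; the Glue job name and argument key are hypothetical.

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by an S3 put event; start a Glue job run for each new object."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="inbound-file-etl",                          # hypothetical Glue job
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )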
