
Data Analysis And Prediction

ABSTRACT
This project aims to analyse the college's admission data for a given year so that the college authorities can frame more constructive rules and help the college grow in a positive direction.
In particular, we aim to analyse and visualise the gender proportion, the different admission categories, the number of students admitted through each qualifying exam, the cities students come from, the number of students in each branch, and related statistics.
We also aim to predict the branch that will be available to a student based on the marks he or she has obtained. This will help students check whether they are eligible for the branch they want and which branches they can get admission to.
Based on previous records, criteria are set for the marks of the different qualifying exams, namely JEE and SVET, together with parameters such as category and degree (integrated or standalone).
JEE and SVET are two different qualifying exams, and each has its own cutoff marks.


TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
NOTATION


1. INTRODUCTION


1.1 PROBLEM STATEMENT


Despite the increasing challenges facing the educational system, the dynamics and data of the college are not currently being analysed. Analysing this data would enable the college to work on the areas where it is lacking and would help improve its performance in various aspects.
Predicting the data for the coming year would let the college take its decisions with more accuracy and would help it provide quality services and meet people's expectations far more effectively, which is also not being done at present.

1.2 OBJECTIVE
Our objective is to analyse the previous years' data and provide the results to the college for its improvement, and to predict the branch for students according to their category, qualifying exam and score, and degree (integrated or standalone).

1.3 SCOPE
“Data Analysis and Prediction” targets all upcoming students, so that they can check which branch they are eligible for, and provides the college with an analysis of past years' data.

1.4 PLATFORM SPECIFICATION


1.4.1 SYSTEM REQUIREMENTS
 RAM: 4 GB
 Processor: an i3 processor is sufficient, since the project's data volume is small, but an i5 is preferred
 Operating System: Windows, macOS or Linux
 Storage: a minimum of 5 GB of free space


2. SYSTEM ANALYSIS


System analysis is the detailed study of the various operations performed by a system and their relationships within and outside the system. Analysis is the process of breaking something into its parts so that the whole may be understood. System analysis is concerned with becoming aware of the problem, identifying the relevant and most important decision variables, analysing and synthesising the various factors, and determining an optimal, or at least a satisfactory, solution. During this phase the problem is identified, alternative system solutions are studied, and recommendations are made about committing the required resources to the system.

IDENTIFICATION OF NEED AND PRELIMINARY INVESTIGATION:
The need for this project is to run the administration of our college in a more organised and data-driven way. The analysed data will help the college adjust its strategies and plans so that it runs smoothly.

PRELIMINARY INVESTIGATION:
The investigation shows that no such work has been done earlier for our college that would allow it to plan a strategy for better growth.


3. FEASIBILITY STUDY


A feasibility study evaluates the project's potential for success; therefore, perceived objectivity is an important factor in the credibility of the study for potential investors and lending institutions. The separate areas that a feasibility study examines are described below.

3.1 TECHNICAL FEASIBILITY


The project is technically feasible. To develop it we will use Jupyter, which is available for free. The resources we have are sufficient, and we are capable of turning the idea into reality.

3.2 ECONOMIC FEASIBILITY


This project is economically feasible as well. The main cost of development is training the developers to get used to Jupyter and start building in it. This cost is very low and the team can absorb it.

3.3 OPERATIONAL FEASIBILITY


The average cost of a data science project is somewhere between $200 and $400, which is comparatively high. Our team aims to reduce this cost by a considerable amount. Because our project is developed at such a low cost, it can be offered at a price greater than its development cost, and we believe the team will benefit from it if the project is taken to a larger scale.


4. LITERATURE SURVEY


4.1 WORK DONE BY OTHERS


Data analytics is a very vibrant field of study these days: a great deal of research has been done in it and much more remains to be explored. Data analytics is applied in many businesses and prediction studies; Netflix, for example, uses it to recommend the TV series and movies a user would most likely want to watch based on their interests and current viewing.
No such study has yet been applied in our college to examine how admission patterns have changed since it became an autonomous university, so we are applying data analytics to the data of the past three years and will generate predictions of the likely results in 2019, so that the college can better meet the needs of aspirants and be promoted more effectively.
Much related work, however, has been done in other data analysis and prediction projects, for example:
1. Branch Target Buffer:
Some of the design issues of the BTB and optimizations for dynamically predicted branches are discussed in [20, 21]. On the first execution of a static branch, an entry in the BTB is allocated. The instruction address of the branch is stored in the BIA field, and the target address of the branch is stored in the BTA field. Assuming for simplicity that the BTB is a fully associative cache, the BIA field is used for associative access of the BTB. The I-cache and the BTB are accessed concurrently. A hit results in the BTB when the current PC matches a BIA entry in the BTB. This implies that the current instruction being fetched has been executed before and is a branch instruction. On a BTB hit, the BTA field of the hit entry is read out and used as the NPC value if the branch is predicted taken. Hence, by accessing the BTB with the branch instruction's PC address, the NPC value is predicted within one pipeline stage. However, if the prediction turns out to be wrong, branch misprediction recovery must be invoked, essentially flushing the pipeline of the uncommitted instructions fetched after the branch instruction.

2. Gshare predictor:
The Two-Level Adaptive Training prediction scheme, a global branch prediction scheme, has the disadvantage that it fails to capture the locality information related to a branch. From an architectural perspective, a branch instruction's outcome may be completely or partially dependent on its locality, i.e., the PC address of the branch instruction. Not all branches are correlated with some other branch. One solution could therefore be to associate one global branch predictor with every branch instruction (i.e., with every PC address). However, most of the entries in each pattern table (PT) of such a predictor would not be populated. To visualize this, imagine the set of all combinations of histories of branches taken before reaching a particular branch; this can be captured with one global predictor. Now extend the same to all branch instructions: this needs N pattern tables, where N is the total number of branch instructions in a program. To get the best of both worlds, one simple logical operator can be used: the XOR operator can effectively capture both local and global information in one table, by XORing the PC address with the BHR for any given branch history and PC address [7]. This predictor is called the Gshare predictor, and it is practical and efficient to implement in hardware (A. N. Sai Parasanna, R. Raghunatha Sarma and S. Balasubramanian, International Journal of Engineering Technology Science and Research, IJETSR, ISSN 2394-3386, Vol. 4, Issue 8, August 2017).
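As a rough illustration of the XOR-indexing idea (not code from the cited paper), the sketch below simulates a gshare predictor in Python; the table size, history length and variable names are assumptions made for the example.

```python
# Minimal gshare sketch, assuming a table of 2**K two-bit counters (0..3).
K = 12                      # assumed number of index bits
TABLE_SIZE = 1 << K
pht = [1] * TABLE_SIZE      # pattern history table of 2-bit counters
ghr = 0                     # global history register, K bits

def gshare_index(pc: int) -> int:
    """XOR the low-order PC bits with the global history to form the table index."""
    return ((pc >> 2) ^ ghr) & (TABLE_SIZE - 1)

def predict(pc: int) -> bool:
    """Predict taken when the 2-bit counter is in one of its two 'taken' states."""
    return pht[gshare_index(pc)] >= 2

def update(pc: int, taken: bool) -> None:
    """Train the counter towards the actual outcome and shift it into the history."""
    global ghr
    i = gshare_index(pc)
    pht[i] = min(3, pht[i] + 1) if taken else max(0, pht[i] - 1)
    ghr = ((ghr << 1) | int(taken)) & (TABLE_SIZE - 1)
```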

3. Perceptron based predictor:


Similar to the manner in which a processor keeps a table of two-bit counters in fast SRAM, the table of perceptrons is kept. The hardware budget drives the limit on the number of perceptrons, N, and on the number of weights, which in turn is determined by the branch history length. Conceptually, the processor performs the following steps when a branch instruction is encountered in the fetch stage:
 The branch address is hashed, which produces an index i ∈ 0...N-1. This index is used to index into the table of perceptrons.
 The ith perceptron is fetched from the table and stored in a vector register P0..n. This register contains the weights of the ith perceptron.
 The dot product of P with the GHR gives the output y.
 The branch is predicted not taken when y is negative; when y is positive, the branch is predicted taken.
 Once the actual outcome of the branch is known, the training algorithm updates the weights in P based on the actual outcome and the value of y.
 The updated value of P is written back to the ith entry in the table of perceptrons.
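The steps above can be mirrored in a small simulation. The sketch below is illustrative only; the number of perceptrons, the history length and the training threshold are assumed values, not figures from the cited work.

```python
# Illustrative perceptron branch predictor, following the steps listed above.
N = 128            # assumed number of perceptrons
HIST_LEN = 16      # assumed global history length
THETA = int(1.93 * HIST_LEN + 14)   # training threshold often quoted in the literature

# Table of perceptrons: one bias weight plus one weight per history bit.
table = [[0] * (HIST_LEN + 1) for _ in range(N)]
ghr = [1] * HIST_LEN     # global history encoded as +1 (taken) / -1 (not taken)

def predict_and_train(pc: int, taken: bool) -> bool:
    i = pc % N                       # hash the branch address to an index
    w = table[i]                     # fetch the ith perceptron's weights
    y = w[0] + sum(wj * h for wj, h in zip(w[1:], ghr))   # dot product with the GHR
    prediction = y >= 0              # y negative -> not taken, otherwise taken
    t = 1 if taken else -1
    if prediction != taken or abs(y) <= THETA:            # train on mispredict / low confidence
        w[0] += t
        for j in range(HIST_LEN):
            w[j + 1] += t * ghr[j]
    ghr.pop(0)
    ghr.append(t)                    # shift the actual outcome into the history
    return prediction
```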

4. TAGE predictor:
The TAGE predictor comprises a base predictor T0, which provides a basic prediction, and a group of partially tagged predictor components Ti. These tagged components Ti, 1 ≤ i ≤ M, are indexed using different history lengths that form a geometric series, i.e., L(i) = (int)(α^(i−1) × L(1) + 0.5). The order of arrangement of the tagged tables is such that the table using the longest history is T1, followed by T2 using the next largest history length, and so on, with TM being the table using the smallest history length in the geometric series. The base predictor is a simple bimodal table with 2-bit counters, indexed using the PC value. A single entry in a tagged component consists of a signed counter ctr whose sign provides the prediction, an unsigned useful counter u, and a partial tag; u is a 2-bit counter and ctr is a 3-bit counter. For a given budget, the number of entries in the base predictor, the length of the tag bits and the number of entries in the tagged components are the remaining design parameters.


4.2 BENEFITS
Through the data analytics process we can clean, analyse and model data using appropriate tools. In the world of business, data analytics is used to build strategies that achieve the desired business results. It adds both speed and accuracy to business decisions.
4.3 PROPOSED SOLUTION
Data analytics is a combination of processes for extracting information from datasets; along with domain knowledge, it requires programming, mathematical and statistical skills to arrive at data-driven decisions.
The overall analysis and prediction of the college data includes:
 Data acquisition: collecting datasets for the last three years from the University.
 Data wrangling: data cleansing and data manipulation using modern tools and technologies.
 Exploratory data analysis: mathematical or graphical output to aid data analysis.
 Data exploration: discovery of data and identification of patterns in the data.
 Conclusions and predictions: creating and training machine learning models using mathematical and statistical functions.
 Data visualization: presenting the analysis work (a short illustrative sketch of these stages follows below).
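As an illustration of the acquisition, wrangling and exploratory steps, the following sketch assumes the three years of admission records have been exported to a CSV file with columns such as gender, category, branch and city; the file name and column names are assumptions, not the actual dataset.

```python
# Minimal sketch of loading and exploring the admission data (assumed file/column names).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("admissions_2016_2018.csv")          # assumed export of the college records
df.columns = [c.strip().lower() for c in df.columns]  # basic wrangling: tidy the headers
df = df.drop_duplicates()

# Exploratory analysis: gender proportion, category split and branch-wise strength.
print(df["gender"].value_counts(normalize=True))
print(df["category"].value_counts())

df["branch"].value_counts().plot(kind="bar", title="Students per branch")
plt.tight_layout()
plt.show()
```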

4.4 TECHNOLOGY USED


Python:
Python is a high-level, interpreted, general-purpose dynamic programming language that focuses on code readability. Its syntax helps programmers code in fewer steps compared to Java or C++. The language, created in 1991 by Guido van Rossum, makes programming easy and fun. Python is widely used in larger organizations because it supports multiple programming paradigms, typically imperative, object-oriented and functional programming. It has a comprehensive and large standard library, together with automatic memory management and dynamic features.

Jupyter:
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.


5. TECHNICAL PART


5.1 CONCEPT
The main concept of the project is to build a system that can predict a student's branch according to the qualifying exam and score. The project also aims to analyse all the details of the students.

5.2 APPLICATION AREA
 Big authorities
 Airline route planning
 Colleges
 Schools

5.3 OVERALL DESCRIPTION

The prediction workflow consists of the following steps (a hedged code sketch of this pipeline is given after the list):
 Importing the data
 Understanding the data
 Converting the data to a consistent case
 Dropping unwanted columns from the data
 Converting string values to numeric values so they can be passed to the machine learning model
 Converting the category column into numerical values
 Splitting the data into training and test sets
 Importing the Naive Bayes classifier for classification of the dataset
 Testing the accuracy of the predictions made by the Naive Bayes classifier
 Importing the logistic regression model for classification of the dataset
 Testing the accuracy of the predictions made by the logistic regression model
 Conclusion: the Naive Bayes classifier gives more accurate predictions
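A condensed sketch of this pipeline is shown below. The CSV file, the column names and the choice of GaussianNB are assumptions used for illustration; the actual notebook may differ in detail.

```python
# Sketch of the prediction pipeline described above (assumed column and file names).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("admissions_2016_2018.csv")

# Convert text columns to a consistent case and drop columns not used for prediction.
for col in ["category", "qualifying_exam", "degree", "branch"]:
    df[col] = df[col].str.strip().str.upper()
df = df.drop(columns=["name", "city"], errors="ignore")

# Convert string values to numeric codes so they can be passed to the models.
for col in ["category", "qualifying_exam", "degree"]:
    df[col] = df[col].astype("category").cat.codes

X = df[["marks", "category", "qualifying_exam", "degree"]]
y = df["branch"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare the Naive Bayes classifier with logistic regression on held-out data.
for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(acc, 3))
```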


6. SOFTWARE ENGINEERING APPROACH


6.1 SOFTWARE ENGINEERING PARADIGM


"Paradigm" is commonly used to refer to a category of entities that share a common
characteristic.
We can distinguish between three different kinds of Software Paradigms:
 Programming Paradigm is a model of how programmers coomunicate an
calculation to computers
 Software Design Paradigm is a model for implementing a group of applications
sharing common properties
 Software Development Paradigm is often referred to as Software Engineering,
may be seen as a management model for implementing big software projects
using engineering principles.

6.1.1 DESCRIPTION
PROGRAMMING PARADIGM:
In our project we have used Python, an object-oriented programming language, in the Jupyter Notebook IDE.

Extensive Support Libraries

Python provides large standard libraries that cover areas such as string operations, the Internet, web service tools, operating system interfaces and protocols. We have used various Python libraries such as matplotlib and pandas.

Integration Feature

Python supports Enterprise Application Integration, which makes it easy to develop web services by invoking COM or CORBA components.

Scalability

Despite its easy-to-learn nature and its simple syntax, Python packs surprising
amounts of power and performance right out of the box. For instance, many of the
newest innovations in the big data ecosystem, such as columnar storage, dataflow
programming, and stream processing, can all be expressed in a relatively
straightforward manner using Python.

The RAD model is based on prototyping and iterative development with no specific up-front planning involved. The process of writing the software itself incorporates the planning required for developing the product.


Rapid Application Development focuses on gathering customer requirements through workshops or focus groups, early testing of the prototypes by the customer using an iterative concept, reuse of the existing prototypes (components), continuous integration and rapid delivery.
The RAD model distributes the analysis, design, build, and test phases into a series of short, iterative development cycles. The following are the phases of the RAD model:
 Business Modeling: The business model for the product under development
is designed in terms of flow of information and the distribution of information
between various business channels. A complete business analysis is
performed to find the vital information for business, how it can be obtained,
how and when is the information processed and what are the factors driving
successful flow of information.
 Data Modeling: The information gathered in the Business Modeling phase is reviewed and analyzed to form the sets of data objects vital for the business. The attributes of all data sets are identified and defined. The relations between these data objects are established and defined in detail with respect to the business model.
 Process Modeling: The data object sets defined in the Data Modeling phase are converted to establish the business information flow needed to achieve specific business objectives as per the business model. The process model for any changes or enhancements to the data object sets is defined in this phase. Process descriptions for adding, deleting, retrieving or modifying a data object are given.
 Application Generation: The actual system is built and coding is done by
using automation tools to convert process and data models into actual
prototypes.
 Testing and Turnover: The overall testing time is reduced in the RAD model as the prototypes are independently tested during every iteration. However, the data flow and the interfaces between all the components need to be thoroughly tested with complete test coverage. Since most of the programming components have already been tested, the risk of any major issues is reduced.

6.1.2 ADVANTAGES AND DISADVANTAGES


Advantages:
 Changing requirements can be accommodated.
 Progress can be measured.
 Iteration time can be short with use of powerful RAD tools.
 Productivity with fewer people in a short time.
 Reduced development time.


 Encourages customer feedback.


 Integration from the very beginning solves a lot of integration issues.
Disadvantages:
 Dependency on technically strong team members for identifying business requirements.
 Only systems that can be modularized can be built using RAD.
 Requires highly skilled developers/designers.
 High dependency on modeling skills.
 Requires user involvement throughout the life cycle.
 Suitable only for projects requiring shorter development times.
 Management complexity is higher.

6.1.3 REASON FOR USE


 The application is easy to use.
 It becomes easy for the user to access the software as and when needed.
 24*7 availability of the software.
 Improved teaching methodology.
 Impressive learning system.
 User friendly environment.

6.2 REQUIREMENT ANALYSIS


6.2.1 SOFTWARE REQUIREMENT SPECIFICATIONS
The developed project has hardware, software, functional and non-functional requirements for smooth functioning.

Functional Requirements:
 The data entered into the data analysis and prediction system is provided by the college. The data provided was first pre-processed in Excel sheet format.

 After pre-processing, the data used for prediction is separated. Prediction algorithms are applied to the separated data, while the rest of the data is used for analysis.

 Workflow: as the user enters their details, the data is fed into the machine learning model and the required predictions are made. The most suitable algorithm, i.e., the one that gives the optimum result over the possible outcome classes, is applied.

 System report: based on the past data with which the model is trained, we predict the outcome for the test data.

 Who provides the data: any aspirant who wishes to take admission in the University can feed in his or her details and check whether he/she is eligible (see the sketch below).
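Continuing the pipeline sketch from section 5.3, a single aspirant's details could be fed to the trained model roughly as follows; the feature values, encodings and variable names are the assumed ones from that sketch, not the system's actual interface.

```python
# Hypothetical example: predict the branch for one aspirant with a model trained earlier.
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Assume X_train / y_train come from the pipeline sketch in section 5.3.
clf = GaussianNB().fit(X_train, y_train)

applicant = pd.DataFrame([{
    "marks": 182,            # qualifying exam score (assumed scale)
    "category": 0,           # numeric code from the same encoding used in training
    "qualifying_exam": 1,    # e.g. 0 = JEE, 1 = SVET in the assumed encoding
    "degree": 0,             # 0 = standalone, 1 = integrated (assumed)
}])
print("Predicted branch:", clf.predict(applicant)[0])
```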

Non-functional requirements:

Performance: Performance is the speed at which content is delivered to users and how responsive the system is. Our system is fast and provides the predicted data almost instantly.

Capacity and Scalability: Capacity is the amount of resources made available to the system, and scalability is the ability of the system to make use of these resources. The amount of resources provided to the system is sufficient to make accurate and correct predictions and analysis, and our system makes the most of the data provided.

Availability: Availability is the proportion of time that the system is online. The system is always available during the hours it is most used. Any maintenance where the system needs to be taken offline should be done outside these times.

Maintainability: Maintainability describes how well the system can be kept functional and how well it can be changed. The code is built modularly, so that independent parts accomplish independent tasks, and common coding conventions are followed.

Recovery: The system is responsible for taking backups of data, so that it can be restored to a working state.

Security and privacy: Because potentially sensitive information is contained within the database, the system must ensure security and privacy.


6.2.2 SOFTWARE REQUIREMENT SPECIFICATION

The Software Requirement Specification (SRS) is the starting point of the software development activity. As systems grew more complex, it became evident that the goals of the entire system could not be easily comprehended; hence the need for the requirements phase arose. The SRS is the means of translating the ideas in the minds of the clients (the input) into a formal document (the output of the requirements phase).

Role of SRS
The purpose of the Software Requirement Specification is to reduce the communication gap between the clients and the developers. It is the medium through which the client and user needs are accurately specified, and it forms the basis of software development. A good SRS should satisfy all the parties involved in the system.
Purpose of SRS
The purpose of this document is to describe all the external requirements of the data analysis and prediction system. It also describes the interfaces for the system.

System Requirement
The system consists of a web application which provides the service online. It should be able to help aspirants check which branch they are eligible for.


6.2.2.1 GLOSSARY
Descriptive analytics – Descriptive analytics refers to a set of techniques used to describe or
explore or profile any kind of data. Any kind of reporting usually involves descriptive
analytics. Data exploration and data preparation are essential ingredients for predictive
modelling and these rely heavily on descriptive analytics.

Inquisitive analytics – Whereas descriptive analytics is used for data presentation and exploration, inquisitive analytics answers questions such as why, what, how and what if. For example, "Why have the sales in Q4 dropped?" is a question on which inquisitive analysis could be performed over the data.

Advanced analytics – Like “Predictive analytics”, “Advanced analytics” too is a marketing


driven terminology. “Advanced” adds a little more punch, a little more glamour to
“Analytics” and is preferred by marketers.

Big data analytics – When analytics is performed on large data sets with huge volume,
variety and velocity of data it can be termed as big data analytics. The annual amount of data
we have is expected to grow from 8 zettabytes (trillion gigabytes) in 2015 to 35 zettabytes in
2020.

Data Mining – Data mining is the term that is most interchangeably used with “Analytics”.
Data Mining is an older term that was more popular in the nineties and the early 2000s.
However, data mining began to be confused with OLAP and that led to a drive to use more
descriptive terms like “Predictive analytics”.

Data Science – Data science and data analytics are mostly used interchangeably. However,
sometimes a data scientist is expected to possess higher mathematical and statistical
sophistication than a data analyst. A Data scientist is expected to be well versed in linear
algebra, calculus, machine learning and should be able to navigate the nitty-gritty details of
mathematics and statistics with much ease.

Artificial Intelligence –During the early stages of computing, there were a lot of
comparisons between computing and human learning process and this is reflected in the
terminology.

The term “Artificial intelligence” was popular in the very early stages of computing and
analytics (in the 70s and 80s) but is now almost obsolete.

Machine learning – involves using statistical methods to create algorithms. It replaces


explicit programming which can become cumbersome due to the large amounts of data,
inflexible to adapt to the solution requirements and also sometimes illegible.

It is mostly concerned with the algorithms, which can be a black box to interpret, but good models can give highly accurate results compared to conventional statistical methods. Also, visualization, domain knowledge, etc. are not included when we speak about machine learning. Neural networks, support vector machines, etc. are the terms generally associated with machine learning algorithms.

Algorithm – Usually refers to a mathematical formula which is output from the tools; the formula summarizes the model. For example, Amazon's recommendation algorithm gives a formula that can recommend the next best buy.

Machine Learning – Similar to “Artificial intelligence” this term too has lost its popularity
in the recent past to terms like “Analytics” and its derivatives.

OLAP – Online analytical processing refers to descriptive analytic techniques of slicing and
dicing the data to understand it better and discover patterns and insights. The term is derived
from another term “OLTP” – online transaction processing which comes from the data
warehousing world.

Reporting – The term “Reporting” is perhaps the most unglamorous of all terms in the world
of analytics. Yet it is also one of the most widely used practices within the field. All
businesses use reporting to aid decision making. While it is not “Advanced analytics” or even
“Predictive analytics”, effective reporting requires a lot of skill and a good understanding of
the data as well as the domain.

Data warehousing – Ok, this may actually be considered more unglamorous than even
“Reporting”. Data warehousing is the process of managing a database and involves
extraction, transformation and loading (ETL) of data. Data warehousing precedes analytics.
The data managed in a data warehouse is usually taken out and used for business analytics.

6.2.2.2 SUPPLEMENTARY SPECIFICATIONS

Purpose
Supplementary specifications define the requirements that are not easily captured in the use case model, such as legal standards, quality aspects, supportability and execution criteria of the system.
Scope
The supplementary specifications cover all the non-functional requirements of the system; their scope is limited to these non-functional requirements.


6.2.2.3 USE CASE DIAGRAM

The purpose of use case diagram is to capture the dynamic aspect of a system. However, this
definition is too generic to describe the purpose, as other four diagrams (activity, sequence,
collaboration, and Statechart) also have the same purpose. We will look into some specific
purpose, which will distinguish it from other four diagrams.

Use case diagrams are used to gather the requirements of a system including internal and
external influences. These requirements are mostly design requirements. Hence, when a
system is analyzed to gather its functionalities, use cases are prepared and actors are
identified.

When the initial task is complete, use case diagrams are modelled to present the outside
view.

In brief, the purposes of use case diagrams can be said to be as follows −

 Used to gather the requirements of a system.

 Used to get an outside view of a system.

 Identify the external and internal factors influencing the system.

 Show the interactions among the requirements and the actors.

Use case diagrams are considered for high level requirement analysis of a system. When the
requirements of a system are analyzed, the functionalities are captured in use cases.

We can say that use cases are nothing but the system functionalities written in an organized
manner. The second thing which is relevant to use cases is the actors. Actors can be defined
as something that interacts with the system.

Actors can be a human user, some internal applications, or may be some external
applications. When we are planning to draw a use case diagram, we should have the
following items identified.

 Functionalities to be represented as use case

 Actors

 Relationships among the use cases and actors.

Use case diagrams are drawn to capture the functional requirements of a system. After
identifying the above items, we have to use the following guidelines to draw an efficient use
case diagram:


 The name of a use case is very important. The name should be chosen in such a way
so that it can identify the functionalities performed.

 Give a suitable name for actors.

 Show relationships and dependencies clearly in the diagram.

 Do not try to include all types of relationships, as the main purpose of the diagram is to identify the requirements.

6.2.3 CONCEPTUAL LEVEL SEQUENCE DIAGRAM


Sequence diagrams describe interactions among classes in terms of an exchange of messages
over time. They're also called event diagrams. A sequence diagram is a good way to visualize
and validate various runtime scenarios. These can help to predict how a system will behave
and to discover responsibilities a class may need to have in the process of modeling a new
system.


Sequence Diagram
A sequence diagram can be drawn starting from one of the templates included in a modelling tool such as SmartDraw, where the required notations and symbols are docked next to the drawing area and are simply stamped onto the page and connected. Sequence diagrams are used to:

 Model and document how the system will behave in various scenarios
 Validate the logic of complex operations and functions

6.2.4 CONCEPTUAL LEVEL ACTIVITY DIAGRAM

It captures the dynamic behavior of the system. Other four diagrams are used to show the
message flow from one object to another but activity diagram is used to show message flow
from one activity to another.

Activity is a particular operation of the system. Activity diagrams are not only used for visualizing the dynamic nature of a system, but are also used to construct the executable system by using forward and reverse engineering techniques. The only thing missing in the activity diagram is the message part: it does not show any message flow from one activity to another. An activity diagram is sometimes considered a flowchart. Although the diagrams look like a flowchart, they are not. They show different flows such as parallel, branched, concurrent, and single.

The purpose of an activity diagram can be described as −

 Draw the activity flow of a system.

 Describe the sequence from one activity to another.

 Describe the parallel, branched and concurrent flow of the system.

Activity diagrams are mainly used as a flowchart that consists of activities performed by the
system. Activity diagrams are not exactly flowcharts as they have some additional
capabilities. These additional capabilities include branching, parallel flow, swimlane, etc.

Before drawing an activity diagram, we must have a clear understanding about the elements
used in activity diagram. The main element of an activity diagram is the activity itself. An
activity is a function performed by the system. After identifying the activities, we need to
understand how they are associated with constraints and conditions.

Before drawing an activity diagram, we should identify the following elements −

 Activities

 Association

 Conditions


 Constraints

Once the above-mentioned parameters are identified, we need to make a mental layout of the
entire flow. This mental layout is then transformed into an activity diagram.


6.3 PLANNING MANAGERIAL ISSUES

6.3.1 PLANNING SCOPE


The scoping process is fairly iterative: the scope gets refined both during the scoping process itself and during the project.

Step 1: Goals –This is the most critical step in the scoping process. Most projects start with a
very vague and abstract goal, get a little more concrete and keep getting refined until the goal
is both concrete and achieves the aims of the organization. This step is difficult because most
organizations haven’t explicitly defined analytical goals for many of the problems they’re
tackling. Sometimes, these goals exist but are locked implicitly in the minds of people within
the organization. Other times, there are several goals that different parts of the organization
are trying to optimize. The objective here is to take the outcome we’re trying to achieve and
turn it into a goal that is measurable and can be optimized.

Step 2: Actions – A well-scoped project ideally has a set of actions that the organization is taking that can now be better informed using data science. If the action/intervention a
public health department is taking is lead hazard inspections, the data science work can help
inform which homes to inspect. You don’t have to limit this to making existing actions better.
Often, we end up creating a new set of actions as well. Generally, it’s a good strategy to first
focus on informing existing actions instead of starting with completely new actions that the
organization isn’t familiar with implementing. Enumerating the set of actions allows the
project to be actionable. If the analysis that will be done later does not inform an action, then
it usually (but not always) does not help the organization achieve their goals and should not
be a priority.

Step 3: Data –You’ll notice that so far in the scoping process we haven’t talked about data at
all. This is intentional since we want these projects to be problem-centric and not data-centric.
Yes, data is important and we all love data but starting with the data often leads to analysis
that may not be actionable or relevant to the goals we want to achieve. Once we’ve
determined the goals and actions, the next step is to find out what data sources exist inside
(and outside) the organization that will be relevant to this problem and what data sources we
need to solve this problem effectively. For each data source, it’s good practice to find out how
it’s stored, how often it’s collected, what’s its level of granularity, how far back does it go, is
there a collection bias, how often does new data come in, and does it overwrite old fields or
does it add new rows?

You first want to make a list of data sources that are available inside the organization. This is an iterative process as well, since most organizations don't necessarily have a comprehensive list of the data sources they hold. Sometimes (if you're lucky) data may be in a central,
integrated data warehouse but even then you may find individuals and/or departments who
have additional data or different versions of the data warehouse.


Step 4: Analysis – The final step in the scoping process is to now determine the analysis that
needs to be done to inform the actions using the data we have to achieve our goals.

The analysis can use methods and tools from different areas: computer science, machine
learning, data science, statistics, and social sciences. One way to think about the analysis that
can be done is to break it down into 4 types:

1. Description: primarily focused on understanding events and behaviors that have


happened in the past. Methods used to do description are sometimes called
unsupervised learning methods and include methods for clustering.
2. Detection: Less focused on the past and more focused on ongoing events. Detection
tasks often involve detecting events and anomalies that are currently happening.
3. Prediction: Focused on the future and predicting future behaviors and events.
4. Behavior Change: Focused on causing change in behaviors of people, organizations,
neighborhoods. Typically uses methods from causal inference and behavioral
economics.

6.3.2 PROJECT RESOURCES


The resources are available from the college itself; that is, the data is provided by the college.

6.3.3 TEAM ORGANISATION


Acknowledging that data exists and can do wonders is passé. Organisations today need to do
much more than just identify big data. With the shortage of data scientist skills worldwide, it
is difficult for an organisation to fulfil its dream of capitalising on data science, as it is hard to
find the exceptionally skilled person – who is a machine learning expert, a data engineer, a
developer, a storyteller and a business analyst.

A data science team, if carefully built, with the right set of professionals, will be an asset to
any business. It’s a fact that the success of any project is dictated by the expertise of its
resources, and data science is no exception to this golden rule of thumb. Professionals with
varied skill-sets are required to successfully negotiate the challenges of a complex big data
project.

For a data science project to be on the right track, businesses need to ensure that the team has skilled
professionals.

Putting together an entire team has the potential to be more difficult. The truth is that data
science is a big field, and a cross-functional team is better prepared to handle real world
challenges and goals. Now that we have established the types of teams that can be set up, let
us look at the factors that ensure the smooth functioning of a data science team.


Skillset: Getting the right talent to fill in positions in the team is the first and foremost way to
move forward. Data scientists should be able to work on large datasets and understand the
theory behind the science. They should also be capable of developing predictive models. Data
engineers and data software developers are important, too. They need to understand
architecture, infrastructure, and distributed programming.

Some of the other roles to fill in a data science team include the data solutions architect, data
platform administrator, full-stack developer, and designer. Those companies that have teams
focusing on building data products will also likely want to have a product manager on the
team.

6.3.4 PROJECT SCHEDULING


Data science projects do not have a nice clean lifecycle with well-defined steps like the software development lifecycle (SDLC). Usually, data science projects run into delivery delays with repeated hold-ups, as some of the steps in the lifecycle of a data science project are non-linear, highly iterative and cyclical between the data science team and various other teams in an organization. It is very difficult for data scientists to determine at the beginning which is the best way to proceed. Although the data science workflow might not be clean, data scientists ought to follow a certain standard workflow to achieve the output. The lifecycle of a data science project is essentially an enhancement of the CRISP-DM workflow process with some alterations:

1. Data Acquisition
2. Data Preparation
3. Hypothesis and Modelling
4. Evaluation and Interpretation
5. Deployment
6. Operations
7. Optimization

1) Data Acquisition
For doing Data Science, you need data. The primary step in the lifecycle of data science
projects is to first identify the person who knows what data to acquire and when to acquire
based on the question to be answered. The person need not necessarily be a data scientist but
anyone who knows the real difference between the various available data sets and making
hard-hitting decisions about the data investment strategy of an organization – will be the right
person for the job.

A data science project begins with identifying the various data sources, which could be logs from web servers, social media data, data from online repositories like the US Census datasets, data streamed from online sources via APIs, data obtained through web scraping, or data present in an Excel sheet or any other source. Data acquisition involves acquiring data from all the
identified internal and external sources that can help answer the business question.

2) Data Preparation
Often referred to as the data cleaning or data wrangling phase. Data scientists often complain that this is the most boring and time-consuming task, involving the identification of various data quality issues. Data acquired in the first step of a data science project is usually not in a usable format to run the required analysis and might contain missing entries, inconsistencies and semantic errors.

Having acquired the data, data scientists have to clean and reformat it, either by manually editing it in a spreadsheet or by writing code. This step of the data science project lifecycle does not itself produce any meaningful insights. However, through regular data cleaning, data scientists can easily identify what foibles exist in the data acquisition process, what assumptions they should make and what models they can apply to produce analysis results. After reformatting, the data can be converted to JSON, CSV or any other format that makes it easy to load into one of the data science tools.
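As a small illustration of this phase, the sketch below cleans an assumed raw export (the file and column names are hypothetical) and writes it back out as CSV for the later modelling steps.

```python
# Minimal data-preparation sketch: fix case, handle missing entries, save as CSV.
import pandas as pd

raw = pd.read_csv("raw_admissions_export.csv")        # assumed raw export from the college
raw.columns = [c.strip().lower() for c in raw.columns]

raw["branch"] = raw["branch"].str.strip().str.upper() # remove inconsistent casing/spacing
raw = raw.dropna(subset=["marks", "branch"])          # rows missing key fields are unusable
raw["category"] = raw["category"].fillna("GENERAL")   # assumed default for missing category

raw.to_csv("admissions_clean.csv", index=False)       # reformatted data for the next phase
```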

3) Hypothesis and Modelling


This is the core activity of a data science project that requires writing, running and refining
the programs to analyse and derive meaningful business insights from data. Often these
programs are written in languages like Python, R, MATLAB or Perl. Diverse machine
learning techniques are applied to the data to identify the machine learning model that best
fits the business needs. All the contending machine learning models are trained with the
training data sets.

4) Evaluation and Interpretation


Different problems call for different evaluation metrics. For instance, if the machine learning model aims to predict the daily stock price, then the RMSE (root mean squared error) has to be considered for evaluation. If the model aims to classify spam emails, then performance metrics like average accuracy, AUC and log loss have to be considered. A common question professionals have when evaluating a machine learning model is which dataset they should use to measure its performance. Looking at the performance metrics on the training dataset is helpful but not always right, because the numbers obtained might be overly optimistic, as the model is already adapted to the training dataset. Machine learning model performance should be measured and compared using validation and test sets to identify the best model based on model accuracy and over-fitting.
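The kinds of metrics mentioned here can be computed with scikit-learn as in the hedged sketch below; the arrays are dummy values used purely for illustration.

```python
# Illustrative use of the evaluation metrics mentioned above (dummy values only).
import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, mean_squared_error

# Classification-style evaluation (e.g. spam detection): accuracy, AUC and log loss.
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9])   # predicted probability of the positive class
y_pred = (y_prob >= 0.5).astype(int)
print("accuracy:", accuracy_score(y_true, y_pred))
print("AUC     :", roc_auc_score(y_true, y_prob))
print("log loss:", log_loss(y_true, y_prob))

# Regression-style evaluation (e.g. daily stock price): root mean squared error.
actual = np.array([101.0, 103.5, 99.8])
forecast = np.array([100.2, 104.1, 101.0])
print("RMSE    :", np.sqrt(mean_squared_error(actual, forecast)))
```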

All the above steps from 1 to 4 are iterated as data is acquired continuously and the business understanding becomes much clearer.

5) Deployment
Machine learning models might have to be recoded before deployment because data scientists
might favour Python programming language but the production environment supports Java.
After this, the machine learning models are first deployed in a pre-production or test
environment before actually deploying them into production.


6) Operations/Maintenance
This step involves developing a plan for monitoring and maintaining the data science project in the long run. The model's performance is monitored in this phase and any performance degradation is clearly flagged. Data scientists can archive their learnings from a specific data science project for shared learning and to speed up similar data science projects in the near future.

7) Optimization
This is the final phase of any data science project that involves retraining the machine
learning model in production whenever there are new data sources coming in or taking
necessary steps to keep up with the performance of the machine learning model.

Having a well-defined workflow makes a data science project far less frustrating for data professionals to work on. The lifecycle described above is not definitive and can be altered to improve the efficiency of a specific data science project as per the business requirements.

6.3.5 ESTIMATION

1. Understand your team’s expertise & job responsibilities

2. Become the go-to expert of your company’s project process

3. Broaden your PM skill set

4. Study estimation history

5. Ask more of the right questions

6. Apply a work breakdown structure

7. Estimate projects with TeamGantt

8. Get to planning and estimation

6.3.6 RISK ANALYSIS


While the broad lifecycle phases of any analytics project remain pretty much the same, the
techniques that are used to implement each phase may vary a little depending on the type of
data, its sources, etc. The amount of time and effort invested in each phase may also vary
greatly across projects. Increased variability in method, time and effort implies less
standardization and repeatability, and hence increased chances of encountering new risks in


each new project. So while analytics is being used to manage business risk, is enough effort
consciously being applied in managing potential project risks in these analytics projects?

As with any other project, analytics projects involve the usual risks related to overruns of time and cost, caused for example by inadequate resourcing or by delayed inputs and decisions. But in addition to these, there are a number of other potential risks that
may make a difference to the reliability (and hence business usability) of analytics project
output. Analytics, and predictive analytics in particular, is a science of heuristics and
statistical probability, and so its output will always contain some degree of inherent
uncertainty. Reducing the uncertainty requires a conscious effort to manage certain risks that
could creep into the project lifecycle, or at least be aware of them. Some of these risks are as
follows.

Inadequate data.

Statistical analysis is often done on a sample of data rather than on a 100% complete data
population because it is often neither practical nor possible to have data from the whole
population. In such cases statistical results are expressed along with a calculated margin of
error, which is fine. The larger the sample size relative to the total population, the greater the
confidence in the result, i.e., the lower the potential margin of error. But what if the real
population size is not really known? Assumptions can be made, but how reliable are these
assumptions? What are they based on? It may be worth asking the question and ascertaining
from the right sources that the total population size is as accurate as it can be.

Lack of awareness about future change.

This applies especially to predictive analysis, where data from the past is used to make
probabilistic predictions about the future. When making such predictions the biggest risk is
that the future environment could be different, and therefore the past is not enough of a
determinant of the future. This is actually a fundamental rule of thumb in the world of financial investment, where the utilization of publicly unavailable knowledge to make gains is illegal in most countries. However, this is not the case in other areas of making predictions about the future, and so while the past can be used to produce good predictions, wherever possible any variables that could change value in the future should be considered, to see whether they would change the statistical prediction.

Data Quality.

Data quality could be an issue when testing of data integration and data cleansing techniques
is done on the basis of sample testing, especially when Big Data is involved. While the
techniques may produce good data in development and test samples, what is the confidence
that the sample represents the entire data population well enough? Again, this is not
something that requires a 100% check of the data population, but when it results in significant
skews or outliers it’s always worth asking the question and then going back to double check
the quality of the data points leading to such results.


Not enough team input included.

Business analytics can be an expensive investment, given the cost of talent and also the
amount of time that may be needed before benefits are realized. At the same time, the value
of soft power in analytics should not be underestimated. Getting the right data together, analysing
models correctly and asking the right questions of the output requires as much creativity,
business expertise and experience as possible, and therefore even if the core analytics team is
small it helps if they engage with as many other colleagues as possible to get additional
perspectives on their line of thinking.

Biased interpretations.

Sometimes, though, the power of experience and gut feel may also come in the way, and
that’s when there has to be that interesting discussion and debate between the statistician who
is detached from the business, and the business expert who has so much knowledge of the
domain that they may only expect to see validation of what they guess to be right.

Group think.

Group think is a twist on the biased interpretation issue. It refers to the phenomenon where
those who don’t really have an opinion or don’t wish to voice an opinion defer to another
opinion that seems more credible for whatever reason, and go along with it without really having any basis for doing so.

Unintended illegality.

This is a risk that is relatively easier to control. When the analytics team is given the freedom
to gather together whatever data they feel is necessary for their work there should be a control
in place and exercised that ensures that the collection and use of any of that data is not illegal
from the perspective of factors such as security, privacy and confidentiality. It is quite usual
to allow employees to access several kinds of data, but there could be a risk that they may be
unaware that the manner in which they intend to use it may constitute an illegality.

6.3.7 SECURITY PLAN


While secure storage media will protect data when it is not being analyzed, it is also
important to follow practices that keep data secure while it is being analyzed. Secure storage
is important, but it is only one aspect of a larger set of behaviors and habits that are important
when handling research data that must be kept confidential. Ultimately, the researcher is
responsible for appropriate use and storage of their research data.

1. STORE PAPER FORMS SECURELY: Much like electronic data, paper documents such as consent forms, printouts, or case tracking sheets that contain personal identifying information (PII) must be stored securely in locked file cabinets when not in use and must be handled only by trained staff members when actively used during research.


2. USE SECURE STORAGE FOR DETACHABLE MEDIA: Confidential data stored on


transportable media such as CDs, DVDs, flash memory devices, or portable external drives
must be stored securely in a safe or locked file cabinet and handled only by authorized staff
members.

3. PROTECT PASSWORDS: Secure data storage depends on the creation and use of passwords
that are needed to gain access to data records. The best storage and encryption technologies
can be easily undone by poor password practices. Passwords should be difficult to determine
and be protected as carefully as confidential data. They should never be shared or left on slips
of paper at work stations or desks.
4. TRAIN AND MONITOR RESEARCH ASSISTANTS: Research assistants who work with
confidential data should understand and follow all of the basic data security practices outlined
in this section. This begins with human subject research training which may be completed
on line at: Human Research/training. Research assistants and other project staff must be
acquainted with procedures and practices described in these guidelines
5. RESTRICT THE USE OF SHARED ACCOUNTS OR GROUP LOGIN IDs: Anyone who works
with confidential electronic data should identify themselves when they log on to the PC or
laptop computer that gives them access to the data. Use of group login IDs violates this
principle. Project managers must make certain that everyone working with confidential data
has a unique password that personally identifies them before they can access the data
6. KEEP USER GROUP LISTS UP-TO-DATE: User groups are a convenient way to grant
access to project files stored on a remote server. The use of user groups simplifies the
granting and revoking of access to a research project’s electronic data resources. By granting
access privileges to each of the research project's electronic folders to the group as a whole, newly authorized members of the project team can obtain access to all related electronic data resources just by being added to the group.

7. AVOID USING NON-DESC PCs OR LAPTOPS FOR COLLECTION OR STORAGE OF


CONFIDENTIAL RESEARCH DATA: The Desktop Systems Council (DeSC) oversees the
use and maintenance of computers participating in the managed environments that make up
the DeSC Program. The scope of the Council’s activities is to advise the university on
standards for the managed computing platforms for institutionally owned computers

8. ACTIVATE LOCK OUT FUNCTIONS FOR SCREEN SAVERS: Computers used for data
analysis should be configured to "lock out" after 20 minutes of inactivity. This reduces the
risk of theft or unauthorized use of data in situations where a user working with confidential
data leaves his or her desk and forgets to logoff the PC. OIT provides instructions on how to
configure the automatic lock out feature for Windows PCs.

9. USE SECURE METHODS OF FILE TRANSFER: Transfer of confidential data files between users or between institutions has the potential to result in unintended disclosure. File transfers are often the weakest part of any plan for keeping research data secure. The method used to transfer files should reflect the sensitivity level of the data.


10. USE EFFECTIVE METHODS OF DATA DESTRUCTION: When requesting IRB review for their planned studies, researchers must create a plan for the ultimate disposition of their research data. This plan specifies what will be done with the data once the objectives of the project are completed. In many cases, researchers will produce various types of reports or papers for publication, as well as a de-identified data file for use by other researchers or the general public.

6.3.8 CONFIGURATION MANAGEMENT PLAN

6.4 DESIGN

6.4.1 DESIGN CONCEPT USED


The purpose of the design phase is to plan a solution to the problem specified by the requirements document. This phase is the first step in moving from the problem domain to the solution domain. In other words, starting with what is needed, design takes us towards how to satisfy those needs. The design of the system is the most critical factor affecting the quality of the software and has a major impact on testing and maintenance. The output of this phase is the design document.


6.4.2 Design Technique

System Design:
System design provides the understanding and procedural details necessary for implementing the system recommended in the system study. Emphasis is on translating the performance requirements into design specifications. The design phase is a transition from a user-oriented document (the system proposal) to a document oriented to the programmers or database personnel.

System Design goes through two phases of development:

• Logical design

• Physical Design

A data flow diagram shows the logical flow of the system. For a system it describes the input (source), output (destination), database (data stores) and procedures (data flows), all in a format that meets the user’s requirements. When analysts prepare the logical system design, they specify the user needs at a level of detail that virtually determines the information flow into and out of the system and the required data resources. The logical design also specifies input forms and screen layouts.

The activities following logical design are the procedures of physical design, e.g., producing programs, software, files and a working system.

Logical and Output Design:

The logical design of an information system is analogous to an engineering blueprint of an automobile. It shows the major features and how they are related to one another. The detailed specification for the new system was drawn up on the basis of the user’s requirement data. The outputs, inputs and databases are designed in this phase. Output design is one of the most important features of the information system: when the output is not of good quality, the user will be averse to using the newly designed system. There are many types of output, all of which can be either highly useful or critical to the users, depending on the manner and degree to which they are used. Outputs from a computer system are required primarily to communicate the results of processing to users. They are also used to provide a permanent hard copy of these results for later consultation. The various types of outputs required can be listed as below:

• External outputs, whose destination is outside the organization

• Internal outputs, whose destination is within the organization


• Operational outputs, whose use is purely within the computer department, e.g., program listings etc.

• Interactive outputs, which involve the user communicating directly with the computer; for these it is particularly important to consider human factors when designing computer outputs.

Data Flow Diagram


A data flow diagram (DFD) maps out the flow of information for any process or system. It
uses defined symbols like rectangles, circles and arrows, plus short text labels, to show data
inputs, outputs, storage points and the routes between each destination. Data flowcharts can
range from simple, even hand-drawn process overviews, to in-depth, multi-level DFDs that
dig progressively deeper into how the data is handled. They can be used to analyze an
existing system or model a new one. Like all the best diagrams and charts, a DFD can often
visually “say” things that would be hard to explain in words, and they work for both technical
and nontechnical audiences, from developer to CEO. That’s why DFDs remain so popular
after all these years. While they work well for data flow software and systems, they are less
applicable nowadays to visualizing interactive, real-time or database-oriented software or
systems.

Using any convention’s DFD rules or guidelines, the symbols depict the four components of
data flow diagrams.

1. External entity: an outside system that sends or receives data, communicating with the
system being diagrammed. They are the sources and destinations of information entering
or leaving the system. They might be an outside organization or person, a computer
system or a business system. They are also known as terminators, sources and sinks or
actors. They are typically drawn on the edges of the diagram.
2. Process: any process that changes the data, producing an output. It might perform
computations, or sort data based on logic, or direct the data flow based on business rules.
A short label is used to describe the process, such as “Submit payment.”
3. Data store: files or repositories that hold information for later use, such as a database
table or a membership form. Each data store receives a simple label, such as “Orders.”
4. Data flow: the route that data takes between the external entities, processes and data
stores. It portrays the interface between the other components and is shown with arrows,
typically labeled with a short data name, like “Billing details.”

DFD rules and tips

• Each process should have at least one input and an output.
• Each data store should have at least one data flow in and one data flow out.
• Data stored in a system must go through a process.
• All processes in a DFD go to another process or a data store.


Level 0 DFD:

Level 1 DFD:


Level 2 DFD:

6.5 IMPLEMENTATION PHASE

6.5.1 LANGUAGE USED CHARACTERISTICS


A. Python:
The Python language has diversified applications in software development companies, such as gaming, web frameworks and applications, language development, prototyping, graphic design applications, etc. This gives the language an edge over other programming languages used in the industry. Some of its advantages are:
• Extensive Support Libraries
It provides large standard libraries that cover areas like string operations, the Internet, web service tools, operating system interfaces and protocols. Many commonly used programming tasks are already scripted into it, which limits the length of the code to be written in Python.
• Integration Feature
Python supports Enterprise Application Integration, which makes it easy to develop web services by invoking COM or CORBA components. It has powerful control capabilities as it calls directly into C, C++ or Java (via Jython). Python also processes XML and other markup languages, and it can run on all modern operating systems through the same byte code.
• Improved Programmer’s Productivity
The language has extensive support libraries and clean object-oriented designs that increase programmer productivity two- to ten-fold compared with languages like Java, VB, Perl, C, C++ and C#.
• Productivity
Its strong process integration features, unit testing framework and enhanced control capabilities contribute to increased speed and productivity for most applications. It is a great option for building scalable multi-protocol network applications.
Code Efficiency
Efficiency, as it applies to programming, means obtaining the correct results while
minimizing the need for human and computer resources. The various aspects of code
efficiency are broken down into four major components:
• Central Processing Unit (CPU) time: Compiling and executing programs take up time and space. The time the CPU needs to perform the operations assigned in the statements determines the complexity of the program. In order to make the program efficient and to reduce CPU time, we should
o execute only the necessary statements
o reduce the number of statements executed
o execute calculations only for the necessary observations.
o reduce the number of operations performed in a particular statement
o keep desired variables by using KEEP = or DROP = data set options
o create and use indexes with large data sets
o use IF-THEN/ELSE statements to process data
o avoid unnecessary sorting
o use CLASS statements in procedures
o use a subset of data set to test code before production
o consider the use of nested functions
o shorten expressions with functions
• Data Storage: Data storage is primarily concerned with temporary datasets generated during program execution, which can become very large and slow down processing. Here are some ways to reduce the amount of temporary data storage required by a program:
o Create a data set by reading long records from a flat file with an input statement, keeping only the selected records and the needed incoming variables


o Process and store only the variables that you need by using KEEP/DROP= data set
options (or KEEP/DROP statements) to retain desired variables when reading or creating a
SAS data set
o Create a new SAS data set by reading an existing SAS data set with a SET statement
with keeping selected observations based on the values of only a few incoming variables
o Create as many data sets in one DATA step as possible with OUTPUT statements
o Use LENGTH statements to reduce variable size
o Read in as many SAS data sets in one DATA step as possible (SET or MERGE
statement).
• I/O Time: I/O time is the time the computer spends on data input and output (reading
and writing data). Input refers to moving data from disk space into memory for work. Output
refers to moving the results out of memory to disk space or a display device such as a
terminal or a printer. To save I/O time, the following tips can be used:
o Read only data that is needed by subsetting data with WHERE or IF statement (or
WHERE= data step option) and using KEEP/DROP statement (or KEEP=/DROP= data set
option) instead of creating several datasets
o Avoid rereading data if several subsets are required
o Use data compression for large datasets
o Use the DATASETS procedure COPY statement to copy datasets with indexes
o Use the SQL procedure to consolidate code
o Store data in temporary SAS work datasets, not external files
o Assign a value to a constant only once (employ retain with initial values)
• Programming Time:
o Reducing I/O time and CPU usage are important, but using techniques which are
efficient in terms of the programming time it takes to develop, debug, and validate code can
be even more valuable. Much efficiency can be gained by following the good programming
practices for readability and maintainability of code as discussed in this guide.
o utilize macros for redundant code
o use the SQL procedure to consolidate the number of steps
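The tips above are written in terms of SAS (KEEP=/DROP= options, DATA steps, the SQL procedure). Since this project is implemented in Python, the same ideas carry over to pandas. The following is a minimal sketch only, assuming a file named data.csv with Gender, Category and Score columns as used in the coding section below; the 'General' label and the score cut-off of 60 are placeholders for illustration:

# Read only the columns that are actually needed (analogue of KEEP=/DROP=)
import pandas as pd

cols = ['Gender', 'Category', 'Score']
data = pd.read_csv('data.csv', usecols=cols)

# Subset rows as early as possible instead of carrying the full data set around
general = data[data['Category'] == 'General']

# Store low-cardinality text columns as categories to reduce memory use
data['Category'] = data['Category'].astype('category')

# For files too large for memory, process them in chunks (analogue of reading
# only what is needed per step rather than rereading the whole file)
total = 0
for chunk in pd.read_csv('data.csv', usecols=cols, chunksize=10000):
    total += len(chunk[chunk['Score'] >= 60])
print(total)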
Optimization of Code
Code optimization is any method of code modification to improve code quality and
efficiency. A program may be optimized so that it becomes a smaller size, consumes less
memory, executes more rapidly, or performs fewer input/output operations.
The basic requirement that optimization methods should comply with is that an optimized program must have the same output and side effects as its non-optimized version. This requirement, however, may be ignored in the case that the benefit from optimization is estimated to be more important than the probable consequences of a change in the program behavior.
Optimization can be performed by automatic optimizers, or programmers. An optimizer is
either a specialized software tool or a built-in unit of a compiler (the so-called optimizing
compiler). Modern processors can also optimize the execution order of code instructions.
Optimizations are classified into high-level and low-level optimizations. High-level
optimizations are usually performed by the programmer, who handles abstract entities
(functions, procedures, classes, etc.) and keeps in mind the general framework of the task to
optimize the design of a system. Optimizations performed at the level of elementary
structural blocks of source code - loops, branches, etc. - are usually referred to as high-level
optimizations too, while some authors classify them into a separate ("middle") level (N.
Wirth?). Low-level optimizations are performed at the stage when source code is compiled
into a set of machine instructions, and it is at this stage that automated optimization is usually
employed. Assembler programmers believe however, that no machine, however perfect, can
do this better than a skilled programmer (yet everybody agrees that a poor programmer will
do much worse than a computer).
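As a small illustration of a high-level optimization in Python (illustrative only, not part of the project code), replacing an explicit loop with a vectorized numpy operation produces the same output while executing far fewer interpreted instructions:

import numpy as np

scores = np.random.randint(0, 101, size=1_000_000)

# Non-optimized version: explicit Python loop
def count_above_loop(values, cutoff):
    count = 0
    for v in values:            # one interpreted iteration per element
        if v >= cutoff:
            count += 1
    return count

# Optimized version: the comparison and the sum run inside numpy's C code
def count_above_vectorized(values, cutoff):
    return int((values >= cutoff).sum())

# Both versions must produce the same result, which is the basic
# requirement any optimization has to satisfy
assert count_above_loop(scores, 60) == count_above_vectorized(scores, 60)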

Validation Check
Data validation is the process of ensuring that a program operates on clean, correct and useful
data. It uses routines, often called ‘validation rules’, or ‘check routines’, that check for
correctness, meaningfulness, and security of data that are input to the system. The rules may
be implemented through the automated facilities of a data dictionary, or by the inclusion of
explicit application program validation logic.
In evaluating the basics of data validation, generalizations can be made regarding the
different types of validation, according to the scope, complexity, and purpose of the various
validation operations to be carried out.
For example:
• Data type validation: Data type validation is customarily carried out on one or more simple data fields. The simplest kind of data type validation verifies that the individual characters provided through user input are consistent with the expected characters of one or more known primitive data types, as defined in a programming language or data storage and retrieval mechanism.
• Range and constraint validation: Simple range and constraint validation may examine
user input for consistency with a minimum/maximum range, or consistency with a test for
evaluating a sequence of characters, such as one or more tests against regular expressions.
• Code and Cross-reference validation: Code and cross-reference validation includes
tests for data type validation, combined with one or more operations to verify that the user-
supplied data is consistent with one or more external rules, requirements, or validity
constraints relevant to a particular organization, context or set of underlying assumptions.
These additional validity constraints may involve cross-referencing supplied data with a
known look-up table or directory information service such as LDAP.


• Structured validation: Structured validation allows for the combination of any of various basic data type validation steps, along with more complex processing. Such complex processing may include the testing of conditional constraints for an entire complex data object or set of process operations within a system.
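A minimal sketch of data type and range/constraint validation in Python is shown below; the accepted categories and the score range are illustrative assumptions, not values taken from the college's admission rules:

VALID_CATEGORIES = {'general', 'obc', 'sc', 'st', 'others'}   # cross-reference list (assumed)

def validate_record(score, category):
    """Return a list of validation errors for one input record."""
    errors = []

    # Data type validation: score must be numeric
    try:
        score = float(score)
    except (TypeError, ValueError):
        errors.append('score must be a number')
        return errors

    # Range and constraint validation: score must lie in an allowed range
    if not (0 <= score <= 100):
        errors.append('score must be between 0 and 100')

    # Code / cross-reference validation: category must be a known code
    if str(category).lower() not in VALID_CATEGORIES:
        errors.append('unknown category: %s' % category)

    return errors

print(validate_record('85', 'General'))    # []  -> record passes
print(validate_record('abc', 'XYZ'))       # two validation errors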

6.5.2 CODING
Visualization Code:
# Visualization of the admission data using pandas and matplotlib
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline   # Jupyter notebook magic to display plots inline

data = pd.read_csv('data.csv')
alpha_color = 0.5

# Gender analysis
data['Gender'].value_counts().plot(kind='bar', color=['b', 'r'], alpha=alpha_color)
plt.title('Gender Analysis')
plt.xlabel('Gender')
plt.ylabel('No of Students')
plt.show()

# Category analysis
data['Category'].value_counts().plot(kind='bar', color=['b', 'r', 'y', 'g'])
plt.title('Category Analysis')
plt.xlabel('Category')
plt.ylabel('No of Students')
plt.show()

# Score analysis
data['Score'].value_counts().plot(kind='bar', color=['b'])
plt.title('Score Analysis')
plt.xlabel('Score')
plt.ylabel('No of Students')
plt.show()

# Degree analysis
data['Degree'].value_counts().plot(kind='bar', color=['b'])
plt.title('Degree Analysis')
plt.xlabel('Degree')
plt.ylabel('No of Students')
plt.show()

# Previous college analysis
data['College'].value_counts().plot(kind='bar', color=['b'])
plt.title('College Analysis')
plt.xlabel('College')
plt.ylabel('No of Students')
plt.show()

# Category proportions as a pie chart
labels = 'General', 'Others', 'SC', 'ST', 'OBC'
sizes = [868, 1, 82, 19, 468]
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'red']
plt.pie(sizes, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()

# State-wise residence of students as a pie chart
labels = ['Maharashtra', 'New Delhi', 'Rajasthan', 'Lakshadweep', 'Bihar',
          'Uttar Pradesh', 'Chattisgarh', 'Gujarat', 'Jharkhand', 'Punjab']
sizes = [12, 1, 11, 4, 8, 7, 19, 4, 2, 1]
colors = ['gold', 'g', 'lightcoral', 'lightskyblue', 'red', 'b',
          'yellowgreen', 'k', 'm', 'c']
plt.pie(sizes, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
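The pie-chart inputs above are hard-coded from the records of the analysed year. If the charts need to be regenerated whenever the data changes, the counts can be taken directly from the dataframe instead; a minimal sketch, assuming data.csv also contains the Category and State columns used elsewhere in this report:

# Derive the pie-chart inputs from the data rather than hard-coding them
category_counts = data['Category'].value_counts()
category_counts.plot(kind='pie', autopct='%1.1f%%', startangle=140, shadow=True)
plt.axis('equal')
plt.title('Category Proportion')
plt.show()

state_counts = data['State'].value_counts()
state_counts.plot(kind='pie', autopct='%1.1f%%', startangle=140, shadow=True)
plt.axis('equal')
plt.title('State-wise Residence of Students')
plt.show()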

Prediction Code
Importing Data


import pandas as pd
import numpy as np

data = pd.read_csv(r'D:\minor project data\Book2.csv')

Understanding Data
data.head()
print(data.columns)
data.columns = ['Branch', 'Degree', 'Gender', 'Category', 'Year', 'Semester', 'State',
'City', 'College', 'QExam', 'Score']
print(data.columns)
Converting Data to Lowercase
# note: applying astype(str) converts every column (including Score) to strings
data1 = data.apply(lambda x: x.astype(str).str.lower())
data1.head()
Dropping Unwanted Columns in the Data
data1 = data1.drop(columns = ['Gender', 'Year', 'Semester', 'State', 'City', 'College'] )
data1.head()
Converting the String Values to Numeric Values to Pass to the Machine Learning Model

data1.Category.unique()

# converting Category into numerical codes
cat_dic = data1.Category.unique()
for idx, val in enumerate(cat_dic):
    data1.loc[data1['Category'] == val, 'Category'] = idx

data1.head()
data1.QExam.unique()

# converting Qualifying Exam into numerical codes
qexam_dic = data1.QExam.unique()
for idx, val in enumerate(qexam_dic):
    data1.loc[data1['QExam'] == val, 'QExam'] = idx

data1.head()
data1.Degree.unique()

# converting Degree into numerical codes
degree_dic = data1.Degree.unique()
for idx, val in enumerate(degree_dic):
    data1.loc[data1['Degree'] == val, 'Degree'] = idx

data1.head()
data1.Branch.nunique()
data1.Branch.unique()


# merge duplicate spellings of the same branch name
data1['Branch'] = data1['Branch'].replace({
    'computer and communication engineering': 'computer & communication engineering',
    'electrical and electronics engineering': 'electrical & electronics engineering'})
data1.Branch.unique()
Defining the Input and Output Classes
X = data1[['QExam', 'Score', 'Degree', 'Category']]
y = data1['Branch']
# the lowercase conversion above turned every column into strings,
# so cast the feature columns back to numbers before modelling
X = X.apply(pd.to_numeric)
X.head()
y.head()
Splitting the Data into Train and Test Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
# pass random_state to get the same train/test split on every run, e.g.:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
len(X_train)
X_train.head()
len(X_test)
X_test.head()
Importing the Naive Bayes Classifier for Classification of the Dataset
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)
print(clf.predict(X_test))
y_test
Testing the Accuracy of the Prediction by Naive Bayes Classifier
clf.score(X_test, y_test)
Importing the Logistic Regression Model for Classification of the Dataset
from sklearn.linear_model import LogisticRegression
# fit on the training split only, so that the test-set score is a fair comparison
clf = LogisticRegression(random_state=0, solver='lbfgs',
                         multi_class='multinomial').fit(X_train, y_train)
clf.predict(X_test)
y_test


Testing the Accuracy of the Prediction by Logistic Regression Model


clf.score(X_test, y_test)
Conclusion: Naive Bayes Classifier Gives More Accurate Predictions.
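Since a single random train/test split can favour one model by chance, a k-fold cross-validation comparison gives a more stable basis for this conclusion. The sketch below reuses the X and y defined above; the sample applicant at the end is hypothetical, with feature values that follow the numeric encodings built earlier:

# Compare the two classifiers with 5-fold cross-validation instead of one split
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

nb_scores = cross_val_score(MultinomialNB(), X, y, cv=5)
lr_scores = cross_val_score(
    LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=1000),
    X, y, cv=5)
print('Naive Bayes mean accuracy        :', nb_scores.mean())
print('Logistic Regression mean accuracy:', lr_scores.mean())

# Predicting the branch for one hypothetical applicant
# (QExam, Score, Degree and Category use the numeric codes assigned above)
clf = MultinomialNB().fit(X_train, y_train)
sample = pd.DataFrame([[0, 85, 0, 0]],
                      columns=['QExam', 'Score', 'Degree', 'Category'])
print(clf.predict(sample))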


Screenshots:
Data Analysis SS: [screenshots not reproduced here]

Screenshots of data prediction: [screenshots not reproduced here]

6.6 TESTING
6.6.1 PRELIMINARY TESTING
The development of a software system involves a series of production activities, and there is a chance of errors occurring at any stage. Because of the human inability to perform and communicate with perfection, a quality assurance activity accompanies software development.

Unit Testing

The resultant system after the integration of the modules was tested to ascertain its correctness in terms of input, processing and output. This was done by executing prepared test scenarios. The unit testing focused on the internal processing logic and data structures within the boundaries of a component. More often than not, the developer had to keep editing a module several times until it was complete and correct.

Security testing
Security testing attempted to verify the protection mechanisms of the system, i.e., that it is protected against unauthorized access. Usernames and passwords were deliberately entered and the reaction of the system was checked.

Different types of testing are


• Boundary Condition Testing
• Integration Testing
• Black Box Testing
• Validation Testing
• User Acceptance Testing


During the implementation of the system, each module is tested separately to uncover errors within its boundaries. The user interface is used as a guide in this process. The validations have been done for all the inputs using JavaScript.

For example, it was checked whether work was allotted to the database correctly without exceeding the defined schemas, and whether the changes were correctly reflected in the internal database.

Boundary conditions Test:

Boundary conditions, as in the case of generating sequences, were tested to ensure that the module operates properly at boundaries established to limit or restrict processing, and that it is able to handle incorrect, out-of-boundary values properly.
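As an illustration only (the project code above is written as a notebook, so the function below, predict_branch, is a hypothetical wrapper around the trained classifier), boundary checks of this kind can be automated with Python's unittest module:

import unittest

def predict_branch(score, qexam, degree, category):
    # hypothetical wrapper around the trained classifier; here it only
    # validates its inputs, which is the behaviour exercised by the tests
    if not (0 <= score <= 100):
        raise ValueError('score must be between 0 and 100')
    return 'computer science engineering'   # placeholder result

class BoundaryConditionTests(unittest.TestCase):
    def test_lower_boundary_accepted(self):
        self.assertIsInstance(predict_branch(0, 0, 0, 0), str)

    def test_upper_boundary_accepted(self):
        self.assertIsInstance(predict_branch(100, 0, 0, 0), str)

    def test_out_of_range_score_rejected(self):
        with self.assertRaises(ValueError):
            predict_branch(101, 0, 0, 0)

if __name__ == '__main__':
    unittest.main()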

Integration Test:
The objective of the integration test is to take the tested modules and build the program structure that has been defined in the design. We have done top-down integration, constructing and testing small segments where errors are easier to isolate and correct. The integration process was performed in three steps:

• The main control was used as test driver.

• Test was conducted as each module was integrated.

• Regression testing to ensure that new errors have not been introduced due to the
corrections.
Black Box Testing:
It focuses on the functional requirements of the software. Black box testing attempts to find
errors in the following categories.

• Incorrect or missing function


• Interface error
• Errors in external device access
• Performance error
• Initialization and termination errors

Validation Testing:
At the culmination of integration testing, the software is completely assembled as a package, interfacing errors have been uncovered and corrected, and a final series of software tests, namely validation tests, are performed. Validation succeeds when the software functions in a manner that can be readily accepted by the customer.

After the validation tests have been conducted, one of two possible conditions is satisfied: either the functions or performance characteristics conform to specifications and are acceptable, or deviations from specifications are uncovered and a note of what is lacking is made. The developed system has been tested satisfactorily to ensure that its performance is satisfactory and that it is working efficiently.
Screenshots of testing:

Fig 1
Fig 2
7. CONCLUSION & DISCUSSION
7.1 PRELIMINARY CONCLUSION
This project mainly aims at the analysis of the college data of a given year, which will help the college authorities to frame more constructive and better rules and help the college grow in a positive manner.
We particularly aim to analyse and visualise the gender proportion, the different categories, the number of students admitted through different qualifying exams, the cities students reside in, the number of students present in the various branches, and more.
Our aim is also to predict the branch that will be available to a student with the marks that he/she has obtained. This will help students check whether they are eligible for the branch they want, and in which branches they can get admission.
Thus, according to the previous records, criteria are set for the marks of the different qualifying exams, namely JEE and SVET, and parameters like category and degree are also used.
JEE and SVET are two different qualifying exams and each of them has different cutoff marks.
8. BIBLIOGRAPHY AND REFERENCES
8.1 REFERENCES
• Wikipedia
• Tutorials Point
• Coursera
• Data Camp
• Python For Data Science
• Python Data Science Handbook
