College Data Analysis and Prediction SRS
ABSTRACT
This project mainly aims at analysing the college data of a given year; the analysis will help the college authorities frame more constructive and better rules and help the college grow in a positive manner.
We particularly aim to analyse and visualise the gender proportion, the different categories, the number of students admitted through different qualifying exams, the students' cities of residence, the number of students in the various branches, and more.
Our aim is also to predict the branch that will be available to a student for the marks he or she has obtained. This will help students check whether they are eligible for the branch they want, and also in which branches they can get admission.
Thus, according to the previous records, cut-off criteria will be set for the marks of the different qualifying exams, namely JEE and SVET, and parameters such as category and degree are also used.
JEE and SVET are two different qualifying exams, and each of them has its own cut-off marks.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
NOTATION
1. INTRODUCTION
1.2 OBJECTIVE
Our objective is to analyse the previous years' data and provide the results to the college for its improvement, and to predict the branch for students according to their category, qualifying exam with score, and degree (integrated or standalone).
1.3 SCOPE
“Data Analysis and Prediction” targets all upcoming students, so that they can check for which branch they are eligible, and it also provides an analysis of past years' data to the college.
2. SYSTEM ANALYSIS
System analysis is the detailed study of the various operations performed by the system and their relationships within and outside the system. Analysis is the process of breaking something into its parts so that the whole may be understood. System analysis is concerned with becoming aware of the problem, identifying the relevant and most decisive variables, analysing and synthesising the various factors, and determining an optimal or at least a satisfactory solution. During this process a problem is identified, alternative system solutions are studied and recommendations are made about committing resources to the system.
PRELIMINARY INVESTIGATION:
The investigation tells us that no such work has been done earlier for our college that would let the authorities plan a strategy for better growth of the college.
3. FEASIBILITY STUDY
This feasibility study evaluates the project's potential for success; therefore, perceived objectivity is an important factor in the credibility of the study for potential investors and lending institutions. A feasibility study examines several separate areas, which are described below.
4. LITERATURE SURVEY
2. Gshare predictor:
The Two-Level Adaptive prediction scheme, a global branch prediction scheme, has the disadvantage that it fails to capture the locality information related to a branch. From an architectural perspective, a branch instruction's outcome may be completely or partially dependent on its locality, i.e., the PC address of the branch instruction. Not all branches are correlated with some other branch. Thus one solution could be to associate one global branch predictor with every branch instruction (i.e., with its PC address). However, most of the entries in each pattern table (PT) of the global predictor will not be populated. To visualize this, imagine the set of all combinations of the history of branches taken before coming to a particular branch; this can be captured with one global predictor. Now extend the same to all branch instructions: this needs N pattern tables, where N is the total number of branch instructions in a program. To get the best of both worlds, one simple logical operator can be used. The XOR operator can effectively capture both local and global information in one table, i.e., by XORing the PC address with the BHR (global branch history) for any given branch [7]. This predictor is called the Gshare predictor. It is definitely practical and efficient for implementation in hardware. (A. N. Sai Parasanna, Dr. R. Raghunatha Sarma, Dr. S. Balasubramanian, International Journal of Engineering Technology Science and Research, IJETSR, www.ijetsr.com, ISSN 2394 – 3386, Volume 4, Issue 8, August 2017.)
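To make the idea concrete, the following is a minimal, illustrative sketch of a Gshare-style predictor in Python; the table size, history length and 2-bit counter initial value are arbitrary choices for the example, not values taken from the surveyed paper.
class GsharePredictor:
    """Minimal Gshare sketch: PC XOR global history indexes one table of 2-bit counters."""
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register (BHR)
        self.table = [1] * (1 << index_bits)      # 2-bit saturating counters, start weakly not-taken
    def _index(self, pc):
        # XOR folds local (PC) and global (BHR) information into a single table index
        return (pc ^ self.history) & self.mask
    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # counter value 2 or 3 means predict taken
    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, 3)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
        # shift the actual outcome into the history register
        self.history = ((self.history << 1) | int(taken)) & self.mask

bp = GsharePredictor()
print(bp.predict(0x4004f7))   # initial prediction for a hypothetical branch address
bp.update(0x4004f7, True)     # learn the actual outcome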
4. TAGE predictor:
The TAGE predictor comprises a base predictor T0, which provides a default prediction, and a group of partially tagged predictor components Ti. These tagged components Ti, 1 ≤ i ≤ M, are indexed using different history lengths which form a geometric series, i.e., L(i) = (int)(α^(i−1) * L(1) + 0.5). The tables are therefore arranged so that T1 uses the shortest history length L(1), with the history length growing geometrically up to TM, which uses the longest history in the series. The base predictor is a simple bimodal table with 2-bit counters, indexed using the PC value. A single entry in a tagged component consists of a signed counter ctr whose sign provides the prediction, an unsigned useful counter u, and a partial tag; u is a 2-bit counter and ctr is a 3-bit counter. For a given storage budget, the number of entries in the base predictor, the length of the tag bits and the number of entries in the tagged components are chosen so that the predictor fits within that budget.
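For illustration, the geometric series of history lengths is easy to compute; the values L(1) = 5, α = 2 and six tables used below are arbitrary choices for the example, not parameters fixed by the surveyed paper.
def tage_history_lengths(num_tables, L1=5, alpha=2.0):
    # History lengths L(i) = int(alpha**(i-1) * L1 + 0.5) for tables T1..TM
    return [int(alpha ** (i - 1) * L1 + 0.5) for i in range(1, num_tables + 1)]

print(tage_history_lengths(6))   # -> [5, 10, 20, 40, 80, 160]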
3.2 BENEFITS
Through the data analytics process we can clean, analyse and model data using tools. In the world of business, data analytics is used for making strategies that achieve the desired business results. Data analytics provides both speed and accuracy to business decisions.
3.3 PROPOSED SOLUTION
Data analytics is a combination of processes used to extract information from datasets; along with domain knowledge, you need programming, mathematical and statistical skills to turn that data into a decision-making process.
The overall analysis and prediction of the college data includes:
Data acquisition: collecting datasets for the last three years from the University.
Data wrangling: data cleansing and data manipulation using modern tools and technologies.
Exploratory data analysis: mathematical or graphical output to aid data analysis.
Data exploration: discovery of data and identification of patterns in the data.
Conclusions and predictions: creating training models for machine learning, using mathematical and statistical functions.
Data visualization: presenting the analysis work.
Jupyter:
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
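As an illustration of these steps inside a notebook, a minimal sketch is shown below; the file name 'admissions_2018.csv' and the column names are placeholders for this example, not the actual dataset used in the project.
import pandas as pd

# Data acquisition: load one year's admission records (placeholder file name)
df = pd.read_csv('admissions_2018.csv')

# Data wrangling: drop exact duplicates and rows missing the qualifying-exam score
df = df.drop_duplicates()
df = df.dropna(subset=['Score'])

# Exploratory data analysis: quick numeric summary and per-branch counts
print(df.describe())
print(df['Branch'].value_counts())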
5. TECHNICAL PART
5.1 CONCEPT
The main concept of the project is to build a system that can predict a student's branch according to the qualifying exam and score. The project also aims to analyse all the details of the students.
5.2 APPLICATION AREA
Big authorities
Airline route planning
Colleges
Schools
6. SOFTWARE ENGINEERING APPROACH
6.1.1 DESCRIPTION
PROGRAMMING PARADIGM:
In our project we have used Python, an object-oriented programming language, in the Jupyter Notebook IDE.
It provides large standard libraries that cover areas such as string operations, the Internet, web service tools, operating system interfaces and protocols. We have used various Python libraries such as matplotlib and pandas.
Integration Feature
Python supports Enterprise Application Integration, which makes it easy to develop web services by invoking COM or CORBA components.
Scalability
Despite its easy-to-learn nature and its simple syntax, Python packs surprising
amounts of power and performance right out of the box. For instance, many of the
newest innovations in the big data ecosystem, such as columnar storage, dataflow
programming, and stream processing, can all be expressed in a relatively
straightforward manner using Python.
The RAD model is based on prototyping and iterative development with no specific planning involved. The process of writing the software itself involves the planning required for developing the product.
Functional Requirement:
The data to be entered into the data analysis and prediction system is provided by the college. The data that was provided was first pre-processed in Excel sheet format.
Work flow - As the user enters the data, that data is fed into the machine learning model and the required predictions are made. The most suitable algorithm, i.e. the one that gives the optimum result over the possible outcome classes, is applied.
System Report - Based on the past data with which the model is trained, we predict the outcome for the test data.
Who provides the data - Any aspirant who wishes to take admission in the University can feed in his or her details and check whether he or she is eligible.
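A hedged sketch of this prediction workflow is given below. The joblib-based model loading, the model file name and the numeric feature encoding are illustrative assumptions, since the report does not fix a particular deployment format.
import pandas as pd
import joblib

# Load a previously trained branch-prediction model (hypothetical file name)
model = joblib.load('branch_model.joblib')

# Details entered by an aspirant, already label-encoded as in the data preparation step
applicant = pd.DataFrame([{'Degree': 0, 'Category': 1, 'QExam': 0, 'Score': 82}])

# The model returns the branch class the applicant is likely to be eligible for
predicted_branch = model.predict(applicant)[0]
print('Predicted branch:', predicted_branch)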
Capacity and Scalability: The capacity is the amount of resources made available to
the system, and scalability is the ability of the system to make use of these resources.
Availability: Availability is the proportion of time that the system is online. The
system is always available during the hours it is most popular. Any maintenance
where the system needs to be taken offline should be done outside these times.
Recovery: The system is responsible for taking backups of data, so that it may be restored to a working state.
Security and privacy: Due to potentially sensitive information being contained within the database, the system must facilitate security and privacy.
Role of SRS
The purpose of the Software Requirement Specification is to reduce the communication gap between the clients and the developers. Software Requirement
Specification is the medium through which the client and user needs are accurately
specified. It forms the basis of software development. A good SRS should satisfy all
the parties involved in the system.
Purpose of SRS
The purpose of this document is to describe all external requirements for the data analysis and prediction system. It also describes the interfaces for the system.
System Requirement
The system consists of a web application which provides the online service. It should be able to help the user select the best option available to him or her.
6.2.1.1 GLOSSARY
Descriptive analytics – Descriptive analytics refers to a set of techniques used to describe or
explore or profile any kind of data. Any kind of reporting usually involves descriptive
analytics. Data exploration and data preparation are essential ingredients for predictive
modelling and these rely heavily on descriptive analytics.
Inquisitive analytics – Whereas descriptive analytics is used for data presentation and exploration, inquisitive analytics answers questions such as why, what, how and what if. Ex: "Why have the sales in Q4 dropped?" could be a question based on which inquisitive analysis is performed on the data.
Big data analytics – When analytics is performed on large data sets with huge volume,
variety and velocity of data it can be termed as big data analytics. The annual amount of data
we have is expected to grow from 8 zettabytes (trillion gigabytes) in 2015 to 35 zettabytes in
2020.
Data Mining – Data mining is the term that is most interchangeably used with “Analytics”.
Data Mining is an older term that was more popular in the nineties and the early 2000s.
However, data mining began to be confused with OLAP and that led to a drive to use more
descriptive terms like “Predictive analytics”.
Data Science – Data science and data analytics are mostly used interchangeably. However,
sometimes a data scientist is expected to possess higher mathematical and statistical
sophistication than a data analyst. A Data scientist is expected to be well versed in linear
algebra, calculus, machine learning and should be able to navigate the nitty-gritty details of
mathematics and statistics with much ease.
Artificial Intelligence –During the early stages of computing, there were a lot of
comparisons between computing and human learning process and this is reflected in the
terminology.
The term “Artificial intelligence” was popular in the very early stages of computing and
analytics (in the 70s and 80s) but is now almost obsolete.
It is mostly concerned with algorithms, which can be a black box to interpret, but good models can give highly accurate results compared to conventional statistical methods. Also, visualization, domain knowledge, etc. are not included when we speak about machine learning. Neural networks, support vector machines, etc. are the terms which are generally associated with machine learning algorithms.
Algorithm – Usually refers to a mathematical formula which is output from the tools. The formula summarizes the model.
Ex: The Amazon recommendation algorithm gives a formula that can recommend the next best buy.
Machine Learning – Similar to “Artificial intelligence” this term too has lost its popularity
in the recent past to terms like “Analytics” and its derivatives.
OLAP – Online analytical processing refers to descriptive analytic techniques of slicing and
dicing the data to understand it better and discover patterns and insights. The term is derived
from another term “OLTP” – online transaction processing which comes from the data
warehousing world.
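As a small illustration of slicing and dicing with the tools used in this project, a pivot table in pandas can summarise admissions along two dimensions at once; the file and column names below are assumptions made only for the example.
import pandas as pd

df = pd.read_csv('admissions.csv')   # placeholder file name

# "Dice" admissions by branch and category, counting students in each cell
summary = pd.pivot_table(df, index='Branch', columns='Category',
                         values='Score', aggfunc='count', fill_value=0)
print(summary)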
Reporting – The term “Reporting” is perhaps the most unglamorous of all terms in the world
of analytics. Yet it is also one of the most widely used practices within the field. All
businesses use reporting to aid decision making. While it is not “Advanced analytics” or even
“Predictive analytics”, effective reporting requires a lot of skill and a good understanding of
the data as well as the domain.
Data warehousing – Ok, this may actually be considered more unglamorous than even
“Reporting”. Data warehousing is the process of managing a database and involves
extraction, transformation and loading (ETL) of data. Data warehousing precedes analytics.
The data managed in a data warehouse is usually taken out and used for business analytics.
Purpose
Supplementary specifications define the requirements that are not easily captured in the use case model, such as legal standards, quality aspects, supportability and execution criteria of the system.
Scope
The supplementary specifications cover all the non-functional requirements of the system.
The scope of the supplementary specifications is limited to all the non-functional
requirements.
The purpose of use case diagram is to capture the dynamic aspect of a system. However, this
definition is too generic to describe the purpose, as other four diagrams (activity, sequence,
collaboration, and Statechart) also have the same purpose. We will look into some specific
purpose, which will distinguish it from other four diagrams.
Use case diagrams are used to gather the requirements of a system including internal and
external influences. These requirements are mostly design requirements. Hence, when a
system is analyzed to gather its functionalities, use cases are prepared and actors are
identified.
When the initial task is complete, use case diagrams are modelled to present the outside
view.
Use case diagrams are considered for high level requirement analysis of a system. When the
requirements of a system are analyzed, the functionalities are captured in use cases.
We can say that use cases are nothing but the system functionalities written in an organized manner. The second thing which is relevant to use cases is the actors. Actors can be defined as something that interacts with the system.
Actors can be a human user, some internal applications, or may be some external
applications. When we are planning to draw a use case diagram, we should have the
following items identified.
Actors
Use case diagrams are drawn to capture the functional requirements of a system. After
identifying the above items, we have to use the following guidelines to draw an efficient use
case diagram
The name of a use case is very important. The name should be chosen in such a way that it identifies the functionalities performed.
Do not try to include all types of relationships, as the main purpose of the diagram is to identify the requirements.
Sequence Diagram
Start with one of SmartDraw's included sequence diagram templates. You'll notice that all the
notations and symbols you need are docked to the left of your drawing area. Simply stamp
them to your page and connect the symbols.
Model and document how your system will behave in various scenarios
Validate the logic of complex operations and functions
The activity diagram captures the dynamic behavior of the system. The other four diagrams are used to show the message flow from one object to another, but the activity diagram is used to show the flow from one activity to another.
Activity is a particular operation of the system. Activity diagrams are not only used for visualizing the dynamic nature of a system, but they are also used to construct the executable system by using forward and reverse engineering techniques. The only thing missing in the activity diagram is the message part: it does not show any message flow from one activity to another. The activity diagram is sometimes considered a flowchart. Although the diagrams look like a flowchart, they are not; they show different flows such as parallel, branched, concurrent, and single.
Activity diagrams are mainly used as a flowchart that consists of activities performed by the
system. Activity diagrams are not exactly flowcharts as they have some additional
capabilities. These additional capabilities include branching, parallel flow, swimlane, etc.
Before drawing an activity diagram, we must have a clear understanding about the elements
used in activity diagram. The main element of an activity diagram is the activity itself. An
activity is a function performed by the system. After identifying the activities, we need to
understand how they are associated with constraints and conditions.
Activities
Association
Conditions
Constraints
Once the above-mentioned parameters are identified, we need to make a mental layout of the
entire flow. This mental layout is then transformed into an activity diagram.
Step 1: Goals –This is the most critical step in the scoping process. Most projects start with a
very vague and abstract goal, get a little more concrete and keep getting refined until the goal
is both concrete and achieves the aims of the organization. This step is difficult because most
organizations haven’t explicitly defined analytical goals for many of the problems they’re
tackling. Sometimes, these goals exist but are locked implicitly in the minds of people within
the organization. Other times, there are several goals that different parts of the organization
are trying to optimize. The objective here is to take the outcome we’re trying to achieve and
turn it into a goal that is measurable and can be optimized.
Step 2: Actions – A well-scoped project ideally has a set of actions that the organization is taking that can now be better informed using data science. If the action/intervention a public health department is taking is lead hazard inspections, the data science work can help inform which homes to inspect. You don't have to limit this to making existing actions better.
Often, we end up creating a new set of actions as well. Generally, it’s a good strategy to first
focus on informing existing actions instead of starting with completely new actions that the
organization isn’t familiar with implementing. Enumerating the set of actions allows the
project to be actionable. If the analysis that will be done later does not inform an action, then
it usually (but not always) does not help the organization achieve their goals and should not
be a priority.
Step 3: Data –You’ll notice that so far in the scoping process we haven’t talked about data at
all. This is intentional since we want these projects to be problem-centric and not data-centric.
Yes, data is important and we all love data but starting with the data often leads to analysis
that may not be actionable or relevant to the goals we want to achieve. Once we’ve
determined the goals and actions, the next step is to find out what data sources exist inside
(and outside) the organization that will be relevant to this problem and what data sources we
need to solve this problem effectively. For each data source, it’s good practice to find out how
it’s stored, how often it’s collected, what’s its level of granularity, how far back does it go, is
there a collection bias, how often does new data come in, and does it overwrite old fields or
does it add new rows?
You first want to make a list of data sources that are available inside the organization. This is an iterative process as well, since most organizations don't necessarily have a comprehensive list of the data sources they have. Sometimes (if you're lucky) data may be in a central, integrated data warehouse, but even then you may find individuals and/or departments who have additional data or different versions of the data warehouse.
Step 4: Analysis – The final step in the scoping process is to now determine the analysis that
needs to be done to inform the actions using the data we have to achieve our goals.
The analysis can use methods and tools from different areas: computer science, machine
learning, data science, statistics, and social sciences. One way to think about the analysis that
can be done is to break it down into 4 types:
A data science team, if carefully built, with the right set of professionals, will be an asset to
any business. It’s a fact that the success of any project is dictated by the expertise of its
resources, and data science is no exception to this golden rule of thumb. Professionals with
varied skill-sets are required to successfully negotiate the challenges of a complex big data
project.
For a data science project to be on the right track, businesses need to ensure that the team has skilled
professionals.
Putting together an entire team has the potential to be more difficult. The truth is that data
science is a big field, and a cross-functional team is better prepared to handle real world
challenges and goals. Now that we have established the types of teams that can be set up, let
us look at the factors that ensure the smooth functioning of a data science team.
Skillset: Getting the right talent to fill in positions in the team is the first and foremost way to
move forward. Data scientists should be able to work on large datasets and understand the
theory behind the science. They should also be capable of developing predictive models. Data
engineers and data software developers are important, too. They need to understand
architecture, infrastructure, and distributed programming.
Some of the other roles to fill in a data science team include the data solutions architect, data
platform administrator, full-stack developer, and designer. Those companies that have teams
focusing on building data products will also likely want to have a product manager on the
team.
1. Data Acquisition
2. Data Preparation
3. Hypothesis and Modelling
4. Evaluation and Interpretation
5. Deployment
6. Operations
7. Optimization
1) Data Acquisition
For doing data science, you need data. The primary step in the lifecycle of a data science project is to identify the person who knows what data to acquire and when to acquire it, based on the question to be answered. The person need not necessarily be a data scientist; anyone who knows the real difference between the various available data sets and can make hard-hitting decisions about the organization's data investment strategy will be the right person for the job.
A data science project begins with identifying the various data sources, which could be logs from web servers, social media data, data from online repositories like the US Census datasets, data streamed from online sources via APIs, data collected through web scraping, data present in an Excel sheet, or data from any other source. Data acquisition involves acquiring data from all the identified internal and external sources that can help answer the business question.
2) Data Preparation
Often referred to as the data cleaning or data wrangling phase. Data scientists often complain that this is the most boring and time-consuming task, involving the identification of various data quality issues. Data acquired in the first step of a data science project is usually not in a usable format to run the required analysis and might contain missing entries, inconsistencies and semantic errors.
Having acquired the data, data scientists have to clean and reformat it by manually editing it in a spreadsheet or by writing code. This step of the data science project lifecycle does not produce any meaningful insights. However, through regular data cleaning, data scientists can easily identify what foibles exist in the data acquisition process, what assumptions they should make and what models they can apply to produce analysis results. After reformatting, the data can be converted to JSON, CSV or any other format that makes it easy to load into one of the data science tools.
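A hedged sketch of such a cleaning and reformatting step is shown below; the column names and the fill rules are assumptions chosen only for illustration.
import pandas as pd

raw = pd.read_csv('raw_admissions.csv')          # placeholder source file

# Normalise inconsistent text values before analysis
raw['Category'] = raw['Category'].str.strip().str.lower()

# Handle missing entries: drop rows without a score, fill missing city with 'unknown'
clean = raw.dropna(subset=['Score']).copy()
clean['City'] = clean['City'].fillna('unknown')

# Persist the reformatted data as CSV so the analysis tools can load it easily
clean.to_csv('clean_admissions.csv', index=False)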
All the above steps from 1 to 4 are iterated as data is acquired continuously and the business understanding becomes much clearer.
5) Deployment
Machine learning models might have to be recoded before deployment, because data scientists might favour the Python programming language while the production environment supports only Java. After this, the machine learning models are first deployed in a pre-production or test environment before actually deploying them into production.
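One common way to hand a trained Python model over to a test environment, shown here only as an assumed approach rather than the one prescribed by this report, is to serialise the fitted model and smoke-test the reloaded copy before promotion.
import joblib
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the prepared admission features and branch labels
X_train = [[0, 1, 82], [1, 0, 45], [0, 0, 91], [1, 1, 60]]
y_train = ['cse', 'civil', 'cse', 'me']

# Train in the development environment and serialise the fitted model
model = DecisionTreeClassifier().fit(X_train, y_train)
joblib.dump(model, 'branch_model.joblib')

# In the pre-production/test environment, reload the artefact and smoke-test it
restored = joblib.load('branch_model.joblib')
print(restored.predict([[0, 1, 85]]))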
6) Operations/Maintenance
This step involves developing a plan for monitoring and maintaining the data science project in the long run. The model's performance is monitored and any performance degradation is clearly identified in this phase. Data scientists can archive their learnings from a specific data science project for shared learning and to speed up similar data science projects in the near future.
7) Optimization
This is the final phase of any data science project that involves retraining the machine
learning model in production whenever there are new data sources coming in or taking
necessary steps to keep up with the performance of the machine learning model.
Having a well-defined workflow for any data science project makes it less frustrating for any data professional to work on. The lifecycle of a data science project mentioned above is not definitive and can be altered to improve the efficiency of a specific data science project as per the business requirements.
6.3.5 ESTIMATION
So while analytics is being used to manage business risk, is enough effort consciously being applied to managing potential project risks in these analytics projects?
As with any other project, analytics projects involve the usual risks related to overruns of time and cost, sometimes caused by inadequate resourcing or by delayed inputs and decisions. But in addition to these, there are a number of other potential risks that
may make a difference to the reliability (and hence business usability) of analytics project
output. Analytics, and predictive analytics in particular, is a science of heuristics and
statistical probability, and so its output will always contain some degree of inherent
uncertainty. Reducing the uncertainty requires a conscious effort to manage certain risks that
could creep into the project lifecycle, or at least be aware of them. Some of these risks are as
follows.
Inadequate data.
Statistical analysis is often done on a sample of data rather than on a 100% complete data
population because it is often neither practical nor possible to have data from the whole
population. In such cases statistical results are expressed along with a calculated margin of
error, which is fine. The larger the sample size relative to the total population, the greater the
confidence in the result, ie, the lower the potential margin of error. But what if the real
population size is not really known? Assumptions can be made, but how reliable are these
assumptions? What are they based on? It may be worth asking the question and ascertaining
from the right sources that the total population size is as accurate as it can be.
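As a small worked example with illustrative numbers, the margin of error for a proportion estimated from a sample, including a finite-population correction, can be computed as follows.
import math

def margin_of_error(p, n, N, z=1.96):
    """95% margin of error for a sample proportion p, with sample size n out of population N."""
    se = math.sqrt(p * (1 - p) / n)              # standard error of the proportion
    fpc = math.sqrt((N - n) / (N - 1))           # finite-population correction
    return z * se * fpc

# A larger sample from the same population gives a smaller margin of error
print(round(margin_of_error(0.5, 100, 2000), 3))   # ~0.096
print(round(margin_of_error(0.5, 500, 2000), 3))   # ~0.038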
This applies especially to predictive analysis, where data from the past is used to make
probabilistic predictions about the future. When making such predictions the biggest risk is
that the future environment could be different, and therefore the past is not enough of a
determinant of the future. This is actually a fundamental thumb rule in the world of financial
investment, where the utilization of publicly unavailable knowledge to make gains is illegal
in most countries. However this is not the case in other areas of making predictions about the
future, and so while the past can be used to produce good predictions, wherever possible, any
variables that could change value in future should be considered to see if they would make
any change to the statistical prediction.
Data Quality.
Data quality could be an issue when testing of data integration and data cleansing techniques
is done on the basis of sample testing, especially when Big Data is involved. While the
techniques may produce good data in development and test samples, what is the confidence
that the sample represents the entire data population well enough? Again, this is not
something that requires a 100% check of the data population, but when it results in significant
skews or outliers it’s always worth asking the question and then going back to double check
the quality of the data points leading to such results.
Business analytics can be an expensive investment, given the cost of talent and also the
amount of time that may be needed before benefits are realized. At the same time, the value
of soft power in analytics cannot be underestimated. Getting the right data together, analysing
models correctly and asking the right questions of the output requires as much creativity,
business expertise and experience as possible, and therefore even if the core analytics team is
small it helps if they engage with as many other colleagues as possible to get additional
perspectives on their line of thinking.
Biased interpretations.
Sometimes, though, the power of experience and gut feel may also come in the way, and
that’s when there has to be that interesting discussion and debate between the statistician who
is detached from the business, and the business expert who has so much knowledge of the
domain that they may only expect to see validation of what they guess to be right.
Group think.
Group think is a twist on the biased interpretation issue. It refers to the phenomenon where
those who don’t really have an opinion or don’t wish to voice an opinion defer to another
opinion that seems more credible for whatever reason, and go along with it without really
having any basis for doing so
Unintended illegality.
This is a risk that is relatively easier to control. When the analytics team is given the freedom
to gather together whatever data they feel is necessary for their work there should be a control
in place and exercised that ensures that the collection and use of any of that data is not illegal
from the perspective of factors such as security, privacy and confidentiality. It is quite usual
to allow employees to access several kinds of data, but there could be a risk that they may be
unaware that the manner in which they intend to use it may constitute an illegality.
1. STORE PAPER FORMS SECURELY: Much like electronic data, paper documents such as consent forms, printouts, or case tracking sheets that contain personally identifiable information (PII) must be stored securely in locked file cabinets when not in use and must be handled only by trained staff members when actively used during research.
3. PROTECT PASSWORDS: Secure data storage depends on the creation and use of passwords
that are needed to gain access to data records. The best storage and encryption technologies
can be easily undone by poor password practices. Passwords should be difficult to determine
and be protected as carefully as confidential data. They should never be shared or left on slips
of paper at work stations or desks.
4. TRAIN AND MONITOR RESEARCH ASSISTANTS: Research assistants who work with
confidential data should understand and follow all of the basic data security practices outlined
in this section. This begins with human subject research training which may be completed
on line at: Human Research/training. Research assistants and other project staff must be
acquainted with procedures and practices described in these guidelines
5. RESTRICT USE OF SHARED ACCOUNTS OR GROUP LOGIN IDs: Anyone who works
with confidential electronic data should identify themselves when they log on to the PC or
laptop computer that gives them access to the data. Use of group login IDs violates this
principle. Project managers must make certain that everyone working with confidential data
has a unique password that personally identifies them before they can access the data
6. KEEP USER GROUP LISTS UP-TO-DATE: User groups are a convenient way to grant
access to project files stored on a remote server. The use of user groups simplifies the
granting and revoking of access to a research project’s electronic data resources. By granting
access privileges to each of the research project's electronic folders to the group as a whole, newly authorized members of the project team can obtain access to all related electronic data resources simply by being added to the group.
8. ACTIVATE LOCK OUT FUNCTIONS FOR SCREEN SAVERS: Computers used for data
analysis should be configured to "lock out" after 20 minutes of inactivity. This reduces the
risk of theft or unauthorized use of data in situations where a user working with confidential
data leaves his or her desk and forgets to logoff the PC. OIT provides instructions on how to
configure the automatic lock out feature for Windows PCs.
10. USE EFFECTIVE METHODS OF DATA DESTRUCTION: When requesting IRB review
for their planned studies, researchers must create a plan for the ultimate disposition of their
research data. This plan specifies what will be done with the data once the objectives of the
project are completed. In many cases, researchers will produce various types of reports or
papers for publication, as well as a de-identified data file for use by other researchers or the general public.
6.3.8 CONFIGURATION MANAGEMENT PLAN
6.4 DESIGN
System Design:
System design provides the understanding and procedural details necessary for implementing the system recommended in the system study. Emphasis is on translating the performance requirements into design specifications. The design phase is a transition from a user-oriented document (the system proposal) to a document oriented to the programmers or database personnel.
• Logical design
• Physical Design
A data flow diagram shows the logical flow of the system. For a system, it describes the input (source), output (destination), database (data stores) and procedures (data flows), all in a format that meets the user's requirements. When analysts prepare the logical system design, they specify the user needs at a level of detail that virtually determines the information flow into and out of the system and the required data resources. The logical design also specifies input forms and screen layouts.
The activities following logical design are the procedures followed in the physical design, e.g., producing programs, software, files and a working system.
• Operational outputs, whose use is purely within the computer department, e.g., program listings etc.
• Interactive outputs, which involve the user communicating directly with the computer. It is particularly important to consider human factors when designing computer outputs.
Using any convention’s DFD rules or guidelines, the symbols depict the four components of
data flow diagrams.
1. External entity: an outside system that sends or receives data, communicating with the
system being diagrammed. They are the sources and destinations of information entering
or leaving the system. They might be an outside organization or person, a computer
system or a business system. They are also known as terminators, sources and sinks or
actors. They are typically drawn on the edges of the diagram.
2. Process: any process that changes the data, producing an output. It might perform
computations, or sort data based on logic, or direct the data flow based on business rules.
A short label is used to describe the process, such as “Submit payment.”
3. Data store: files or repositories that hold information for later use, such as a database
table or a membership form. Each data store receives a simple label, such as “Orders.”
4. Data flow: the route that data takes between the external entities, processes and data
stores. It portrays the interface between the other components and is shown with arrows,
typically labeled with a short data name, like “Billing details.”
Level 0 DFD:
Level 1 DFD:
Level 2 DFD:
Python can make calls directly through C, C++ or Java via Jython. Python also processes XML and other markup languages, as it can run on all modern operating systems through the same byte code.
• Improved Programmer’s Productivity
The language has extensive support libraries and clean object-oriented designs that increase a programmer's productivity two- to ten-fold compared with languages like Java, VB, Perl, C, C++ and C#.
• Productivity
Its strong process integration features, unit testing framework and enhanced control capabilities contribute to increased speed and productivity for most applications. It is a great option for building scalable multi-protocol network applications.
Code Efficiency
Efficiency, as it applies to programming, means obtaining the correct results while
minimizing the need for human and computer resources. The various aspects of code
efficiency are broken down into four major components:
• Central Processing Unit (CPU) time: Compiling and executing programs take up time
and space. The required time the CPU spends to perform the operations that are assigned in
the statements determine the complexity of the program. In order to make the program
efficient and to reduce CPU time, we should
o execute only the necessary statements
o reduce the number of statements executed
o execute calculations only for the necessary observations.
o reduce the number of operations performed in a particular statement
o keep desired variables by using KEEP = or DROP = data set options
o create and use indexes with large data sets
o use IF-THEN/ELSE statements to process data
o avoid unnecessary sorting
o use CLASS statements in procedures
o use a subset of data set to test code before production
o consider the use of nested functions
o shorten expressions with functions
• Data Storage: Data storage is primarily concerned with temporary datasets generated
during program execution which can become very large and slow down processing. Here are
some ways to reduce the amount of temporary data storage required by a program:
o Create a data set by reading long records from a flat file with an INPUT statement, keeping only the selected records and the needed incoming variables
o Process and store only the variables that you need by using KEEP/DROP= data set
options (or KEEP/DROP statements) to retain desired variables when reading or creating a
SAS data set
o Create a new SAS data set by reading an existing SAS data set with a SET statement
with keeping selected observations based on the values of only a few incoming variables
o Create as many data sets in one DATA step as possible with OUTPUT statements
o Use LENGTH statements to reduce variable size
o Read in as many SAS data sets in one DATA step as possible (SET or MERGE
statement).
• I/O Time: I/O time is the time the computer spends on data input and output (reading
and writing data). Input refers to moving data from disk space into memory for work. Output
refers to moving the results out of memory to disk space or a display device such as a
terminal or a printer. To save I/O time, the following tips can be used:
o Read only data that is needed by subsetting data with WHERE or IF statement (or
WHERE= data step option) and using KEEP/DROP statement (or KEEP=/DROP= data set
option) instead of creating several datasets
o Avoid rereading data if several subsets are required
o Use data compression for large datasets
o Use the DATASETS procedure COPY statement to copy datasets with indexes
o Use the SQL procedure to consolidate code
o Store data in temporary SAS work datasets, not external files
o Assign a value to a constant only once (employ retain with initial values)
• Programming Time:
o Reducing I/O time and CPU usage are important, but using techniques which are
efficient in terms of the programming time it takes to develop, debug, and validate code can
be even more valuable. Much efficiency can be gained by following the good programming
practices for readability and maintainability of code as discussed in this guide.
o utilize macros for redundant code
o use the SQL procedure to consolidate the number of steps
Optimization of Code
Code optimization is any method of code modification to improve code quality and
efficiency. A program may be optimized so that it becomes a smaller size, consumes less
memory, executes more rapidly, or performs fewer input/output operations.
The basic requirement that optimization methods should comply with is that an optimized program must have the same output and side effects as its non-optimized version. This requirement, however, may be ignored in the case that the benefit from optimization is estimated to outweigh the consequences of any change in program behaviour.
Validation Check
Data validation is the process of ensuring that a program operates on clean, correct and useful
data. It uses routines, often called ‘validation rules’, or ‘check routines’, that check for
correctness, meaningfulness, and security of data that are input to the system. The rules may
be implemented through the automated facilities of a data dictionary, or by the inclusion of
explicit application program validation logic.
In evaluating the basics of data validation, generalizations can be made regarding the
different types of validation, according to the scope, complexity, and purpose of the various
validation operations to be carried out.
For example:
• Data type validation: Data type validation is customarily carried out on one or more simple data fields. The simplest kind of data type validation verifies that the individual characters provided through user input are consistent with the expected characters of one or more known primitive data types, as defined in a programming language or data storage and retrieval mechanism.
• Range and constraint validation: Simple range and constraint validation may examine
user input for consistency with a minimum/maximum range, or consistency with a test for
evaluating a sequence of characters, such as one or more tests against regular expressions.
• Code and Cross-reference validation: Code and cross-reference validation includes
tests for data type validation, combined with one or more operations to verify that the user-
supplied data is consistent with one or more external rules, requirements, or validity
constraints relevant to a particular organization, context or set of underlying assumptions.
These additional validity constraints may involve cross-referencing supplied data with a
known look-up table or directory information service such as LDAP.
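The sketch below illustrates these three kinds of checks on a single admission record; the field names, the 0-100 score range and the allowed-exam list are assumptions made only for this example.
import re

ALLOWED_EXAMS = {'jee', 'svet'}          # cross-reference list (assumed)

def validate_record(record):
    errors = []
    # Data type validation: score must be a whole number
    if not str(record.get('Score', '')).isdigit():
        errors.append('Score must be a whole number')
    # Range and constraint validation: score within 0-100
    elif not 0 <= int(record['Score']) <= 100:
        errors.append('Score must lie between 0 and 100')
    # Constraint validation with a regular expression: name is letters and spaces only
    if not re.fullmatch(r'[A-Za-z ]+', record.get('Name', '')):
        errors.append('Name may contain only letters and spaces')
    # Cross-reference validation: qualifying exam must be a known one
    if record.get('QExam', '').lower() not in ALLOWED_EXAMS:
        errors.append('Unknown qualifying exam')
    return errors

print(validate_record({'Name': 'Asha Verma', 'Score': '82', 'QExam': 'JEE'}))   # []
print(validate_record({'Name': 'R2D2', 'Score': '120', 'QExam': 'CAT'}))        # three errors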
6.5.2 CODING
Visualization Code:
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

data = pd.read_csv('data.csv')
alpha_color = 0.5

# Gender distribution of the admitted students
data['Gender'].value_counts().plot(kind='bar', color=['b', 'r'], alpha=alpha_color)
plt.title('Gender Analysis')
plt.xlabel('Gender')
plt.ylabel('No of Students')
plt.show()

# Category-wise distribution
data['Category'].value_counts().plot(kind='bar', color=['b', 'r', 'y', 'g'])
plt.title('Category Analysis')
plt.xlabel('Category')
plt.ylabel('No of Students')
plt.show()

# Score distribution
data['Score'].value_counts().plot(kind='bar', color=['b'])
plt.title('Score Analysis')
plt.xlabel('Score')
plt.ylabel('No of Students')
plt.show()

# Degree distribution
data['Degree'].value_counts().plot(kind='bar', color=['b'])
plt.title('Degree Analysis')
plt.xlabel('Degree')
plt.ylabel('No of Students')
plt.show()

# Previous college of the admitted students
data['College'].value_counts().plot(kind='bar', color=['b'], alpha=alpha_color)
plt.title('College Analysis')
plt.xlabel('College')
plt.ylabel('No of Students')
plt.show()

# State-wise distribution as a pie chart
labels = ['Maharashtra', 'New Delhi', 'Rajasthan', 'Lakshadweep', 'Bihar',
          'Uttar Pradesh', 'Chattisgarh', 'Gujarat', 'Jharkhand', 'Punjab']
sizes = [12, 1, 11, 4, 8, 7, 19, 4, 2, 1]
colors = ['gold', 'g', 'lightcoral', 'lightskyblue', 'red', 'b',
          'yellowgreen', 'k', 'm', 'c']
plt.pie(sizes, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
Prediction Code
Importing Data
import pandas as pd
import numpy as np
data = pd.read_csv(r'D:\minor project data\Book2.csv')
Understanding Data
data.head()
print(data.columns)
data.columns = ['Branch', 'Degree', 'Gender', 'Category', 'Year', 'Semester', 'State',
'City', 'College', 'QExam', 'Score']
print(data.columns)
Converting Data to a Uniform Lower Case
data1 = data.apply(lambda x: x.astype(str).str.lower())
data1.head()
Dropping Unwanted Columns in the Data
data1 = data1.drop(columns = ['Gender', 'Year', 'Semester', 'State', 'City', 'College'] )
data1.head()
Converting the String Values to Numeric Values to Pass it to Machine Learning Model
data1.Category.unique()
# converting Category into numerical values
cat_dic = data1.Category.unique()
for idx, val in enumerate(cat_dic):
    data1.loc[data1['Category'] == val, 'Category'] = idx
data1.head()

data1.QExam.unique()
# converting Qualifying Exam into numerical values
qexam_dic = data1.QExam.unique()
for idx, val in enumerate(qexam_dic):
    data1.loc[data1['QExam'] == val, 'QExam'] = idx
data1.head()

data1.Degree.unique()
# converting Degree into numerical values
degree_dic = data1.Degree.unique()
for idx, val in enumerate(degree_dic):
    data1.loc[data1['Degree'] == val, 'Degree'] = idx
data1.head()

data1.Branch.nunique()
data1.Branch.unique()
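The model-training cells are not reproduced in this section, so the following is only a hedged sketch of how a classifier could be trained on the prepared data1 frame from the steps above; the choice of a decision tree and of an 80/20 train/test split is an assumption, not the algorithm fixed by this report.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Features are the encoded admission attributes; the target is the admitted branch
X = data1[['Degree', 'Category', 'QExam', 'Score']].astype(float)
y = data1['Branch']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print('Test accuracy:', accuracy_score(y_test, model.predict(X_test)))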
Screenshots:
Data Analysis SS:
6.6 TESTING
6.1 PRELIMINARY TESTING
The development of a software system involves a series of production activities. There is a chance of errors occurring at any stage. Because of the human inability to perform and communicate with perfection, a quality assurance activity accompanies software development.
Unit Testing
The resultant system after the integration of the modules was tested to ascertain its correctness in terms of input, processing and output. This was done by executing prepared test scenarios. The unit testing focused on the internal processing logic and data structures within the boundaries of a component. More often than not, the developer had to keep editing a module several times until it was complete and correct.
Security testing
Security testing attempted to verify the protection mechanisms of the system, i.e., that it is protected against unauthorized access. Usernames and passwords were deliberately entered incorrectly and the reaction of the system was checked.
During the implementation of the system, each module is tested separately to uncover errors within its boundaries. The user interface is used as a guide in this process. The validations have been done for all the inputs using JavaScript.
For example, we check whether the work is allotted correctly among the database tables without exceeding the schemas, and the internal database has to be checked to confirm that the changes are reflected in the database.
Integration Test:
The objective of integration testing is to take the tested modules and build the program structure that has been defined in the design. We have done top-down integration, which involves constructing and testing small segments, where errors are easier to isolate and correct. The integration process was performed in three steps:
• Regression testing to ensure that new errors have not been introduced due to the
corrections.
Black Box Testing:
It focuses on the functional requirements of the software. Black box testing attempts to find errors in the following categories.
Validation Testing:
At the culmination of integration testing, the software is completely assembled as a package, interfacing errors have been uncovered and corrected, and a final series of software tests, namely validation tests, is performed. Validation succeeds when the software functions in a manner that can be reasonably accepted by the customer.
After the validation tests have been conducted, one of two possible conditions holds: either the functions or performance characteristics conform to the specifications and are acceptable, or a deviation from the specifications is uncovered and a note of what is lacking is made. The developed system has been tested satisfactorily to ensure that its performance is satisfactory and that it is working efficiently.
Screenshots of testing:
Fig 1
Fig 2
7. CONCLUSION & DISCUSSION
7.1 PRELIMINARY CONCLUSION
This project mainly aims at analysing the college data of a given year; the analysis will help the college authorities frame more constructive and better rules and help the college grow in a positive manner.
We particularly aim to analyse and visualise the gender proportion, the different categories, the number of students admitted through different qualifying exams, the students' cities of residence, the number of students in the various branches, and more.
Our aim is also to predict the branch that will be available to a student for the marks he or she has obtained. This will help students check whether they are eligible for the branch they want, and also in which branches they can get admission.
Thus, according to the previous records, cut-off criteria will be set for the marks of the different qualifying exams, namely JEE and SVET, and parameters such as category and degree are also used.
JEE and SVET are two different qualifying exams, and each of them has its own cut-off marks.
8. BIBLIOGRAPHY AND REFERENCES
8.1 REFERENCES
Wikipedia
Tutorials Point
Coursera
Data Camp
Python For Data Science
Python Data Science Handbook