
A Project Report On

FAKE JOB POSTING DETECTION


Submitted in partial fulfillment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
By
P. SHIRIN --------------------------- (20911A0554)
CH. SATHWIKA ------------------ (20911A0551)
G. MEGHANA --------------------- (21915A0504)
A. ANOKH -------------------------- (21915A0501)

Under the Guidance of


Mrs. V. Prathima
Assistant Professor

Department of Computer Science and Engineering

VIDYA JYOTHI INSTITUTE OF TECHNOLOGY


(An Autonomous Institution)
(Approved by AICTE, Accredited by NAAC, NBA & permanently Affiliated to JNTUH,
Hyderabad) Aziz Nagar Gate, C.B. Post, Hyderabad-500075
2023-2024
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the project report titled “FAKE JOB POSTING DETECTION”, being submitted by P. SHIRIN (20911A0554), CH. SATHWIKA (20911A0551), G. MEGHANA (21915A0504) and A. ANOKH (21915A0501) in partial fulfillment of the requirements for the award of the Degree of Bachelor of Technology in Computer Science and Engineering, is a record of bonafide work carried out by them under my guidance and supervision. The results embodied in this project report have not been submitted to any other University or Institute for the award of any degree or diploma.

Internal Guide Head of Department


Mrs. V. Prathima Dr. D Aruna Kumari
Assistant Professor Professor

External Examiner
DECLARATION

We, P. SHIRIN, CH. SATHWIKA, G. MEGHANA and A. ANOKH, hereby declare that the project entitled “FAKE JOB POSTING DETECTION”, submitted for the degree of Bachelor of Technology in Computer Science and Engineering, is original, has been done by us, and has not been copied from or submitted anywhere else for the award of any degree.

Date: 04-04-2024 P. SHIRIN (20911A0554)


Place: Hyderabad CH. SATHWIKA (20911A0551)
G. MEGHANA (21915A0504)

A. ANOKH (21915A0501)
ACKNOWLEDGEMENT

We wish to express our sincere gratitude to our project guide, Mrs. V. Prathima, Assistant Professor, Vidya Jyothi Institute of Technology, Hyderabad, for her timely cooperation and valuable suggestions while carrying out this work. It is her kindness that enabled us to learn so much from her.

We are grateful to Dr. D Aruna Kumari, Professor and Head of the Department of CSE, for her help and support during our academic year.

We wholeheartedly convey our gratitude to Principal Dr. E. Sai Baba Reddy for providing
the necessary infrastructure to complete this project.

We would like to thank our parents and all the faculty members who have contributed to our progress through the course to reach this stage.

P. SHIRIN (20911A0554)
A. ANOKH (21915A0501)
CH. SATHWIKA (20911A0551)

G. MEGHANA (21915A0504)
ABSTRACT

FAKE JOB POSTING DETECTION

This project aims to address the problem of fraudulent job postings, which is a
significant concern in the realm of Online Recruitment Frauds (ORF). These fraudulent
postings not only pose risks to individuals but also damage the reputation of companies.
With the widespread use of online platforms for job advertisements, fake job postings have
become increasingly common. These postings deceive job-seekers by offering non-
existent positions and often result in financial loss. Even reputable job advertising
platforms are not immune to such scams. To tackle this issue, an automated system based
on machine learning algorithms has been developed. Users provide details such as job
title, location, requirements, description, benefits, company profile, and industry. Through
feature extraction, the system classifies job postings as either legitimate or fraudulent by
analyzing their textual content and descriptions. Various data mining techniques and
classification algorithms, including decision trees, support vector machines, naive Bayes
classifiers, and random forest classifiers, are employed for this purpose. The dataset used
for training and testing is the Employment Scam Aegean Dataset (EMSCAD), which
contains 17,014 legitimate and 866 fraudulent job ads published between 2012 and 2014,
obtained from the University of the Aegean, Greece. This project provides a practical
solution to the problem of fraudulent job postings, contributing to a safer online job-
seeking environment and protecting both job seekers and the reputation of companies
advertising on these platforms.
INDEX

S.NO TITLE PAGE-NO

1. Introduction 1-2
2. Literature Survey 3
3. Feasibility Study 4
3.1 Technological sides 4
3.2 Web Technologies 4
3.3 Economical side 4

4. System Requirement Specifications 5-8


4.1 Existing System 5
4.2 Proposed System 6
4.3 System Requirements 7
4.3.1 Software Requirements 7
4.3.2 Hardware Requirements 7
4.4 Requirements Definition 8
4.4.1 Functional Requirements 8
4.4.2 Non-Functional Requirements 8

5. System Design 9-25


5.1 UML Diagrams 9-18
5.1.1 Use case Diagram 10-11
5.1.2 Activity Diagram 12-13
5.1.3 Class Diagram 13-14
5.1.4 Sequence Diagram 15-16
5.1.5 Dataflow Diagram 17-18
5.1.6 Architecture 19
5.1.7 Algorithms 20-25
6. Software Implementation 26-38
6.1 Technologies used in project 26

6.1.1 Python 26-27

6.1.2 Scientific and Numeric Computing 27

6.1.3 Sci-kit Learn 27

6.1.4 NLTK 28

6.1.5 Pandas 28

6.1.6 HTML 28-29

6.1.7 CSS 29

6.1.8 JavaScript 29

6.1.9 Django 30

6.2 Sample Code 30-38

6.2.1 Machine Learning Code 30-32

6.2.2 Django file 33

6.2.3 Frontend Code 33-34

6.2.4 Python code (Views.py file) 34-38

6.2.5 Charts’ Code 38

7. System Testing 39-48

7.1 Types of tests 39-40


7.1.1 Unit Testing 39

7.1.2 Integration Testing 39

7.1.3 Functional testing 40

7.2 System Test 40-41

7.2.1 White Box Testing 40-41


7.2.2 Black Box Testing 41-42
7.3 Unit Testing 42
7.4 Integration Testing 43
7.5 Acceptance Testing 43

7.6 Testing Methodologies 44-48


7.6.1 Unit Testing 44
7.6.2 Integration Testing 44-45
7.6.2.1 Top-Down Integration 44-45
7.6.2.2 Bottom-Up Integration 45
7.6.3 User Acceptance Testing 45
7.6.4 Output Testing 45-46
7.6.5 Validation Testing 46-48

8. Results and output 49-53


8.1 Homepage/Login Page 49

8.2 Registration Page 49

8.3 Service Provider/Admin Login Page 50

8.4 User Module Features 50-51

8.5 Service Provider/Admin Module Features 51-53

9. Conclusion 54

10. Future Enhancement 55

11. References 56
LIST OF FIGURES

S.NO TITLE PAGE NO

1.1 Use case Diagram 11


1.2 Activity Diagram 13
1.3 Class Diagram 14
1.4 Sequence Diagram 16
1.5 Dataflow Diagram 18
1.6 Architecture Diagram 19
2.1 Homepage/Login screen 49
2.2 New User Registration screen 49
2.3 Service Provider Login Screen 50
2.4 Real/Fraudulent Job Post Prediction screen 50-51
2.5 View your profile screen 51
2.6 Accuracy of the algorithms used screen 52
2.7 Bar chart of algorithms used 52
2.8 Line and Pie Chart of Algorithms screen 52-53
2.9 Viewing all remote users’ screen 53
CHAPTER-1

INTRODUCTION

Through job portals and recruitment sites, individuals explore potential job opportunities
based on their qualifications, experience, and preferences. The impact of social media on job
advertisement is significant, with the rapid sharing of job posts leading to an increase in fraudulent
postings. This not only poses challenges for job seekers but also for recruitment platforms.
Consequently, individuals may become wary of new job postings due to concerns about the
security and accuracy of their personal and professional information.
The proliferation of fake job posts results in wasted time and inconvenience for job seekers. Job
scams involve offering enticing job offers to candidates, only to deceive them and obtain their
money or personal information. These scams can take various forms, such as advance fee scams
or illegal money laundering schemes. Scammers often utilize fake websites and emails to lure job
seekers, targeting platforms like LinkedIn to pose as legitimate recruiters. Despite efforts to
combat cybercrime, job scams persist, with fraudsters constantly adapting their tactics.
In India, a significant percentage of job seekers encounter scams during their job search, and job scams have risen dramatically in recent years. About 56 per cent of job seekers in India face scams during their job hunt. Job scams rose 250 per cent in the first quarter of 2023, nearly triple the figure for the same period in 2022. A recent survey from the UK has shown that more than 67% of people who look for jobs through online advertisements are at great risk because they are unaware of fake job posts or job scams. Traditional cybercrime prevention methods have proven ineffective against
the evolving nature of job scams. Filtering out false job postings would significantly benefit the
recruitment process, safeguarding both job seekers and the reputation of companies. This project
aims to address this challenge by implementing strategies to predict and identify fraudulent job
posts, ultimately enhancing trust in the recruitment process and protecting the integrity of company
brands.

This project aims to achieve the following objectives:


1. Avoid financial losses - To protect users from getting scammed by paying a fee to the scammers
as a part of the recruitment procedure.
2. Efficiency: Job seekers can focus their time and efforts on legitimate opportunities, leading to a more efficient job search process.
3. Identity Protection: A detection system helps safeguard sensitive personal data, reducing the risk
of identity theft.
4. Preserving Employer Reputation: To protect the reputation of genuine employers by preventing
misuse of their branding.
5. International Reach: To be used for job markets worldwide as online job searching is a global
phenomenon and the impact of fake job postings is not limited by geographical boundaries.

To combat the growing threat of job scams, it's imperative to implement proactive measures that
go beyond traditional cybersecurity tactics. This involves leveraging advanced technologies such
as machine learning and natural language processing to analyze job postings and detect fraudulent
patterns. By harnessing the power of data-driven insights, we can develop more sophisticated
algorithms capable of identifying subtle indicators of fraudulent activity.
Additionally, collaboration between industry stakeholders, government agencies, and
cybersecurity experts is essential to share information and best practices for combating job scams
effectively. Moreover, raising awareness among job seekers about the red flags of fraudulent job
postings and educating them on how to verify the legitimacy of job opportunities can empower
individuals to protect themselves from falling victim to scams. By adopting a multi-faceted
approach that combines technological innovation, collaboration, and education, we can create a
safer and more trustworthy environment for online job seeking.

CHAPTER-2
LITERATURE SURVEY

Vidros et al. presented statistics about many real and renowned companies that produced fake job advertisements with ill motives. They experimented on the EMSCAD dataset using several classification algorithms, such as the Naive Bayes classifier and the Random Forest classifier. The Random Forest classifier showed the best performance on the dataset, with 89.5% classification accuracy. They found that logistic regression performed very poorly on the dataset.

Alghamdi et al. worked on this dataset in three steps: data pre-processing, feature selection and fraud detection using a classifier. They applied a feature selection technique to reduce the number of attributes effectively and efficiently. SVM was used for feature selection, and an ensemble classifier using Random Forest was used to detect fake job posts in the test data. This classifier showed 97.4% classification accuracy in detecting fake job posts.

“A Study on Fake Job Posts Detection Techniques using NLP” by Akhtar et al. explores various NLP techniques, including sentiment analysis, keyword extraction, and syntactic analysis. “Detecting Fraudulent Job Advertisements using Machine Learning Techniques” by Founta et al. investigates the application of machine learning techniques, including support vector machines and random forests, for detecting fraudulent job advertisements based on textual features.

"Identifying Fraudulent Job Advertisements through Text Mining" by Okou et al. focuses
on identifying fraudulent job advertisements using text mining approaches such as TF-IDF and
topic modeling. Text mining techniques enable the extraction of relevant features from job
postings, which are then used to classify them as fraudulent or legitimate.

"Detection of Fake Job Postings using Machine Learning Algorithms" by Kumar et al. proposes a
framework for detecting fake job postings using machine learning algorithms, including decision
trees and logistic regression, trained on textual data features. These algorithms are trained on
labeled data to learn patterns indicative of fraudulent job postings and then applied to classify new
postings.

CHAPTER-3
FEASIBILITY STUDY

In order to evaluate whether the project can be completed in the given time frame, we use the TEL evaluation method, which covers the feasibility of the project from technological, economical, and legal perspectives.

3.1 Technological Sides


The technologies and libraries relevant to data analysis and web development will be used to build this project. To run the application, the server has to be started first. The project runs on a local area network, so the server machine serves as the administrator while the other connected systems act as clients. The developed system is highly user friendly and can be easily understood by anyone using it, even for the first time.

3.2 Web Technologies


Several web technologies are used to implement web-based applications. During the implementation phase we will evaluate a language and its various libraries and modules (e.g., Python with NumPy, scikit-learn, Pandas, Seaborn, Matplotlib and NLTK, and a database such as MySQL) along with different server-side and client-side technologies (e.g., HTML, CSS, JavaScript and Django), and mix and match them to make use of their strong points. The system is technically feasible and can be developed with the existing facilities.

3.3 Economical Side


This project is based on free and open-source technologies and libraries that are readily available to developers and scientists, free of charge. This means that we do not have to worry about costs related to licensing or reusing source code, and that the only costs related to the project are the time and effort spent on developing it. Since the required hardware was already in place, the hardware cost is low. So, the project is economically feasible.

CHAPTER-4
SYSTEM REQUIREMENTS

4.1 EXISTING SYSTEM:


There exists a large body of research on machine learning methods for deception detection,
but most of it is focused on classifying online reviews and news. Presently, there isn't a specific
widely known website dedicated solely to detecting fake job listings. ORF detection has not
received much attention and is currently a relatively unexplored area.

Challenges of the existing system:

1. Time-consuming: Job seekers can currently identify fake postings by verifying the contact details, searching the company’s name, etc. This takes a lot of time and patience.

2. Lack of automated systems: There is no automated system in place. The lack of automation leads to inefficiency, as there are only manual ways of cross-checking a job post.

3. Not foolproof: It is not always possible to detect fraudulent postings, as scammers keep evolving their schemes and tactics to present a fake job as a real one, so it is not easy to identify a fraudulent posting manually.

4. No dedicated website or tool in place: Websites like Glassdoor, LinkedIn, ScamAdviser, etc. allow users to read reviews and ratings for a particular job posting and to verify the trustworthiness of a website. These are helpful, but not always accurate, and they do not focus solely on identifying fraudulent job postings. Scammers have learnt to bypass these websites to make their postings look legitimate.

5. Fragmentation of Resources: Existing systems are fragmented, with job seekers needing to visit
multiple websites or platforms to verify the legitimacy of job postings.

6. Limited Focus: Some existing websites and tools may not prioritize fake job posting detection as
their primary focus, leading to gaps in coverage and effectiveness.

7. Dynamic Nature of Job Scams: Job scams are constantly evolving, requiring continuous updates
and adaptations to detection strategies.
4.2 PROPOSED SYSTEM:
A more robust system can address the limitations of the existing system and offer additional
features. This system uses ML algorithms and NLP approaches.

The main steps involved are:

1. The EMSCAD dataset, which contains about 18,000 job posts with attributes such as job_id, title, location, department, salary_range, company_profile, description, requirements, benefits, telecommuting, has_company_logo, has_questions, employment_type, required_experience, required_education, industry, function and fraudulent (the class label), is used to train the system.

2. The data is cleaned to make it prediction-ready and then provided to the classifier for predictions.

3. Feature selection and extraction: Among these 18 attributes, only 7 are used, and they are converted into categorical attributes. telecommuting, has_company_logo, has_questions, employment_type, required_experience, required_education and fraudulent are changed from text values into categorical values. As an illustration, the "employment_type" attribute is encoded as follows: "none" is represented by 0, "full-time" by 1, "part-time" by 2, "others" by 3, "contract" by 4, and "temporary" by 5 (a minimal encoding sketch is given after this list). The primary objective behind converting these attributes into categorical form is to classify fraudulent job advertisements without using text processing or natural language processing techniques; this part of the study focuses exclusively on these categorical attributes.

4. The classifier is retrained regularly as new training data is fed into it (keyword analysis, pattern recognition).

5. A website (UI) is provided for administrators and users, allowing them to log in, use the tool and see the prediction.
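The categorical encoding described in step 3 can be written directly with pandas. The snippet below is only an illustrative sketch, not the project's exact code: the column names follow the EMSCAD attribute list in step 1, the file path is a placeholder, and the mapping dictionary simply reproduces the employment_type example above.

# Illustrative sketch of the categorical encoding in step 3 (assumed, not the project's exact code)
import pandas as pd

data = pd.read_csv("Job_Posting_DataSets.csv", encoding="latin1")  # placeholder path

# Explicit mapping for employment_type, as described above
employment_map = {"none": 0, "full-time": 1, "part-time": 2,
                  "others": 3, "contract": 4, "temporary": 5}
data["employment_type"] = (data["employment_type"].fillna("none")
                           .str.lower().map(employment_map))

# The other text-valued attributes can be encoded similarly using pandas'
# built-in categorical codes (the specific integer values are arbitrary)
for col in ["required_experience", "required_education"]:
    data[col] = data[col].fillna("none").astype("category").cat.codes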

Features:

1. User-Friendly Interface: A web-based interface developed with HTML and CSS using a framework like Django. Users can easily register, log in, post their job datasets, predict a job posting, view their profile and log out.

2. Free: This tool is absolutely free to use and has an easy-to-use UI that non-technical users can operate.

3. Multiple classifiers: It uses multiple algorithms, such as RF, NB, SVM and LR, to compare their accuracies and to predict effectively.

4. Database Integration: Store logs of user details like their names, emails, passwords, city, state
and country in a database for persistence.

5. Error Handling: Provide clear error messages to enhance the user experience.

Advantages of the Proposed System:

1. Improved accuracy: Machine learning algorithms can be trained on large datasets to detect
patterns that are indicative of fake job postings.

2. Efficiency: Machine learning models can process large volumes of data much more quickly
and efficiently than humans.

3. Scalability: Machine learning models can be easily scaled up to handle large volumes of job
postings.

4. Adaptability: Machine learning models can adapt to new fraud techniques and patterns as they
emerge, making them more effective at identifying fake job postings.

5. Cost-effective: More cost-effective than employing a team of human moderators.

4.3 SYSTEM REQUIREMENTS


4.3.1 SOFTWARE REQUIREMENTS

● OS: Windows 10 and above (with any web browser)


● Back-end: Python 3.5+ and its libraries (NLTK, NumPy, Scikit-learn, Pandas, Seaborn, Matplotlib), plus MySQL (WAMP Server) to store the user details.
● Front-end: HTML, CSS, JavaScript, Bootstrap
● Django ORM for integration of the frontend and backend
● A terminal to execute scripts from

4.3.2 HARDWARE REQUIREMENTS

● RAM: 4GB or higher


● Processor: Pentium IV
● Hard Disk: 20 GB

4.4 REQUIREMENTS DEFINITION
After a thorough analysis of the problems that arose in the existing system, we are now familiar with the requirements of the proposed system. The requirements that the system needs are categorized into functional and non-functional requirements. These requirements are listed below:

4.4.1 FUNCTIONAL REQUIREMENTS

Functional requirements specify which features or functions should be included in a system in order to satisfy client expectations and meet business needs. The functional requirements define the relationship between the inputs and the outputs. To obtain the output, all the operations that must be carried out on the input data must be specified. This includes defining the input and output data validity checks, the parameters affected by the operations, and any additional operations required to convert the inputs into outputs. Functional requirements specify the behavior of the system for valid inputs and outputs.

4.4.2 NON-FUNCTIONAL REQUIREMENTS


Non-functional requirements describe the features, characteristics and capacity of the system, and they may constrain the boundaries of the proposed system. The following non-functional requirements are essential with respect to performance, cost, control, security, efficiency and services:

● User friendly
● The system should provide better accuracy.
● To perform efficiently with better throughput and response time

CHAPTER-5
SYSTEM DESIGN

5.1 UML Diagrams

UML diagrams are designed to let developers and customers view a software system from different perspectives and in varying degrees of abstraction. In its simplest form, a use case can be described as a specific way of using the system from a user’s (actor’s) perspective. A more detailed description might characterize a use case as:
● a pattern of behavior the system exhibits
● a sequence of related transactions performed by an actor and the system
● delivering something of value to the actor
Use cases provide a means to
● capture system requirements
● communicate with the end users and domain experts
● test the system
Use cases are best discovered by examining the actors and defining what the actor will be able to
do with the system. Since all the needs of a system typically cannot be covered in one use case, it
is usual to have a collection of use cases. Together this use case collection specifies all the ways
of using the system. A UML system is represented using five different views that describe the
system from a distinctly different perspective. Each view is defined by a set of diagrams, which is
as follows:
User Model View
● This view represents the system from the user’s perspective.
● The analysis representation describes a usage scenario from the end
user’s perspective.
Structural model view
● In this model, the data and functionality come from inside the system.
● This model view models the static structures.
Behavioral Model View
● It represents the dynamic behavior of the system, depicting the interactions among the various structural elements described in the user model and structural model views.
Implementation Model View
● In this, the structural and behavioral parts of the system are represented as they are to be built.
Environmental Model View
● In this, the structural and behavioral aspects of the environment in which the system is to
be implemented are represented.
UML modeling is carried out in two different domains:
● UML analysis modeling, which focuses on the user model and structural model views of the system.

● UML design modeling, which focuses on the behavioral model view.

5.1.1 USE CASE DIAGRAM:


A use case diagram shows the use cases, the actors and the relationships between them. It
depicts the functionality of a system and its interaction with external entities.

Components of a Use Case Diagram:


1. Actors:

• Actors represent the users or external systems interacting with the system being modelled. They
are depicted as stick figures.

• Actors can be human users, other systems, or external entities that interact with the system.

2. Use Cases:

• Use cases represent the specific functionalities or features provided by the system.

• They describe a sequence of actions or interactions between the system and its actors to achieve a
specific goal.

• Use cases are represented as ovals.

3. Relationships:

• Relationships define how actors interact with use cases.

• The main relationship types are:

• Association: Indicates that an actor is involved in a use case.

• Generalization: Represents inheritance or specialization between use cases.

• Include: Shows that one use case includes another.

• Extend: Indicates that one use case extends another.

Principles of Use Case Diagrams:
1. Simplicity: Use case diagrams should be simple and easy to understand and avoid excessive
detail and focus on essential interactions.

2. Relevance: Include only relevant actors and use cases. Prioritize use cases based on their
importance to system functionality.

3. Clarity: Use clear and descriptive names for actors and use cases. Ensure that relationships
are clearly defined and labelled.

• Actors: Service Provider (Admin), Remote User

• Use Cases: Login, Train and Test Datasets, View Trained and Tested Accuracy in Bar Chart,
View Trained and Tested Accuracy results, Register and Login, Predict job post, View your
profile, View all remote users.

Figure. 5.1.1: Use Case Diagram

The above use case diagram represents the actors and the process in the application.
5.1.2 ACTIVITY DIAGRAM:

The activity diagram depicts the flow from one activity to another. It shows the workflow from a start point to a finish point, detailing the many decision paths that exist in the progression of events contained in the activity. The following section describes the elements that constitute an activity diagram:

1. Activities

An activity is shown as a round-cornered rectangle enclosing all the actions, control flows
and other elements that make up the activity.

2. Actions
An action represents a single step within an activity. Actions are denoted by round-cornered
rectangles.
3. Control Flow

A control flow shows the flow of control from one action to the next. Its notation is a line with
an arrowhead.

4. Initial Node

An initial or start node is depicted by a large black spot.

5. Final Node

The activity final node is depicted as a circle with a dot inside.

• Activities: Registration, Login, Submit fake job post details, Applying feature extraction,
Classify, Predict the job post, Verify true positive result.

Figure. 5.1.2: Activity Diagram

5.1.3 CLASS DIAGRAM:

The class diagram represents the classes and the relations between the classes in the activity. A class diagram consists of classes, attributes and relations, and it describes the attributes and operations of the classes. Class diagrams provide a high-level overview of a system’s design, helping to communicate and document the structure of the software. They are a fundamental tool in object-oriented design and play a crucial role in the software development lifecycle.

UML Class Notation:

● Class Name: The name of the class is typically written in the top compartment of the class
box and is centered and bold.

● Attributes: Attributes, also known as properties or fields, represent the data members of the
class. They are listed in the second compartment of the class box and often include the
visibility (e.g., public, private) and the data type of each attribute.

● Methods: Methods, also known as functions or operations, represent the behavior or functionality of the class. They are listed in the third compartment of the class box and include the visibility (e.g., public, private), return type, and parameters of each method.

• Visibility Notation:

● + for public (visible to all classes)


● - for private (visible only within the class)
● # for protected (visible to subclasses)
● ~ for package or default visibility (visible to classes in the same package)

Figure. 5.1.3: Class Diagram

5.1.4 SEQUENCE DIAGRAM:

A sequence diagram is a Unified Modeling Language (UML) diagram that illustrates the sequence
of messages between objects in an interaction. It describes the flow of messages in the system and
how and in what order the objects in a system function. It consists of Remote User, Service
Provider (Admin) and Web Server.

Sequence Diagram Notation:

1. Actors:

An actor in a UML diagram represents a type of role where it interacts with the system and
its objects. It is important to note here that an actor is always outside the scope of the system
we aim to model using the UML diagram. We use actors to depict various roles including
human users and other external subjects. We represent an actor in a UML diagram using a
stick person notation. We can have multiple actors in a sequence diagram.

2. Lifelines:

A lifeline is a named element which depicts an individual participant in a sequence diagram.


So basically, each instance in a sequence diagram is represented by a lifeline. Lifeline elements are located at the top of a sequence diagram. We display a lifeline as a rectangle, called the head, with its name and type. The head is located on top of a vertical dashed line referred to as the stem.

3. Messages:

Communication between objects is depicted using messages. The messages appear in a


sequential order on the lifeline. We represent messages using arrows. Lifelines and messages
form the core of a sequence diagram. Messages can be broadly classified into the following
categories:

Synchronous messages: A synchronous message waits for a reply before the interaction can
move forward. The sender waits until the receiver has completed the processing of the
message. The caller continues only when it knows that the receiver has processed the
previous message i.e. it receives a reply message. We use a solid arrow head to represent a
synchronous message.

Asynchronous Messages: An asynchronous message does not wait for a reply from the
receiver. The interaction moves forward irrespective of the receiver processing the previous
message or not. We use a lined arrow head to represent an asynchronous message.

4. Create Message:

We use a Create message to instantiate a new object in the sequence diagram. There are
situations when a particular message call requires the creation of an object. It is represented
with a dotted arrow and create word labelled on it to specify that it is the create Message
symbol.

5. Self Message:

Certain scenarios might arise where the object needs to send a message to itself. Such
messages are called Self Messages and are represented with a U-shaped arrow.

Figure. 5.1.4: Sequence Diagram


5.1.5 DATA FLOW DIAGRAM:

A data flow diagram shows the way information flows through a process or system. It includes data inputs and outputs, data stores, and the subprocesses the data moves through. It gives a graphical representation of the “flow” of data through a system, modelling its process aspects and helping to visualize the processes in the system. Development is done at different levels: 0, 1 and 2. It is also called a “bubble chart”.

Components of a DFD:
1. External Entities:
External entities represent sources or destinations of data outside the system being modeled.
They are depicted as squares or rectangles. Examples include users, other systems, or external
data sources.

2. Processes:
Processes represent activities or functions that transform data inputs into outputs. They
describe the actions or operations performed within the system. Processes are represented as
circles or ovals.

3. Data Stores:
Data stores represent repositories or storage locations where data is held within the system.
They can be databases, files, or other data repositories. Data stores are depicted as rectangles
with two parallel lines at the top and bottom.

4. Data Flows:
Data flows represent the movement of data between processes, data stores, and external
entities. They show how data is input, processed, and output within the system.
Data flows are depicted as arrows indicating the direction of data movement.

Level 0:

Level 1:

Figure 5.1.5: Data Flow Diagram

5.1.6 ARCHITECTURE:

Figure 5.1.6: Architecture Diagram


The system follows a pipeline architecture model, where the various steps of ML are organized in
a sequential and streamlined manner.

The following are the components in the architecture:

Data Collection: Data is collected from a dataset that contains about 18K job descriptions, of which 866 are fraudulent.

Data Pre-Processing & Cleaning: Unnecessary information such as duplicates and punctuation is removed (e.g., "How are you?" -> "How are you"). It involves activities like the following (a short NLTK sketch of these steps is given after the component list):

Lemmatization: Reducing tokens to a standard form, e.g., "running" → "run", "ate" → "eat".

Stemming: Reducing a word to its stem/root form by removing suffixes, e.g., celebrating, celebrated -> celebrate.

Tokenization: Splitting text into individual words or tokens, e.g., "This is new" -> ‘This’, ’is’, ’new’.

Stopwords: Removing words that don’t carry much meaning, such as "the", "and", and "a".

Machine Learning Models: Train machine learning models on labeled data to identify patterns
associated with fake job postings. Use algorithms such as decision trees, random forests, etc.

Data Splitting: The dataset is divided into two parts: 70% training data and 30% test data. The test data is given as input to the trained model and the output is predicted.
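A minimal sketch of the pre-processing steps listed above, using NLTK (Section 6.1.4); the function below and the sample sentence are illustrative only and not the project's exact code.

# Illustrative NLTK pre-processing of one job-post text (assumed sketch)
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                         # tokenization
    tokens = [t for t in tokens if t.isalpha()]                  # drop punctuation/numbers
    tokens = [t for t in tokens if t not in stop_words]          # stopword removal
    tokens = [lemmatizer.lemmatize(t, pos="v") for t in tokens]  # lemmatization: "running" -> "run"
    tokens = [stemmer.stem(t) for t in tokens]                   # stemming: "celebrated" -> "celebr"
    return " ".join(tokens)

print(preprocess("We are celebrating and running a new hiring drive."))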

5.1.7 ALGORITHMS:

1. RANDOM FOREST:

This algorithm consists of many decision trees. The random forest takes the prediction from each tree and, based on the majority vote of those predictions, it produces the final output. The ‘forest’ generated is trained through bagging, or bootstrap aggregating. The steps are:

a. Select a random sample from the dataset.

b. A decision tree is constructed for every sample and produces a prediction for that sample.

c. A vote is taken over the prediction results.

d. The predicted result with the highest number of votes is chosen as the final output.

Random Forest is a popular machine learning algorithm used for both classification
and regression tasks. It belongs to the ensemble learning techniques, which combine
multiple individual models to improve predictive performance.

Here's how the Random Forest algorithm works:

1. Decision Trees: Random Forest is built upon the concept of decision trees.
Decision trees are hierarchical structures that recursively split the data into subsets
based on the features, aiming to create simple decision rules.

2. Bootstrap Sampling: Random Forest employs a technique called bootstrap


sampling to create multiple subsets of the original dataset. Each subset, known as a
bootstrap sample, is generated by randomly selecting data points from the original
dataset with replacement.

3. Random Feature Selection: At each node of the decision tree, Random Forest
randomly selects a subset of features from the available features. This helps in
reducing correlation among the trees and promotes diversity in the ensemble.

4. Decision Tree Construction: For each bootstrap sample and at each node of the
tree, the algorithm selects the best split among the randomly chosen features, based
on criteria such as Gini impurity or entropy for classification tasks.
5. Ensemble of Trees: After constructing multiple decision trees using different
bootstrap samples and feature subsets, Random Forest combines the predictions of
all individual trees to make the final prediction. For classification tasks, it employs
a majority voting mechanism, while for regression tasks, it averages the predictions
of all trees.

Random Forest offers several advantages, including:

- Robustness to Overfitting: Random Forest reduces overfitting by combining


multiple weak learners (individual decision trees) into a strong ensemble model.

- Handling of High-Dimensional Data: It can handle datasets with a large number


of features and is less prone to the curse of dimensionality.

- Feature Importance: Random Forest provides a measure of feature importance,


indicating which features have the most significant impact on the prediction.

Due to its versatility, robustness, and ease of use, Random Forest is widely used in
various domains, including finance, healthcare, and marketing, for tasks such as
classification, regression, and feature selection.
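As an illustration of the steps above, a minimal scikit-learn sketch follows; it assumes the document-term matrices X_train_dtm/X_test_dtm and labels y_train/y_test produced by the vectorization and 70/30 split in Section 6.2.1, and is not the project's exact code.

# Minimal Random Forest sketch (assumes the train/test split and CountVectorizer output of Section 6.2.1)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 bootstrapped trees
rf.fit(X_train_dtm, y_train)                                    # bagging + random feature selection per node
y_pred_rf = rf.predict(X_test_dtm)                              # majority vote over all trees
print("Random Forest accuracy:", accuracy_score(y_test, y_pred_rf))

# Feature importance: which vocabulary terms weigh most in the fraudulent/legitimate decision
importances = rf.feature_importances_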

2. NAIVE BAYES:

This classification technique is based on Bayes theorem of conditional probability.


Its primary use is text classification with a huge training set. It assumes that the
presence of a particular feature in a class (words) is independent of the presence of any
other feature, given the class label. The presence or absence of a particular word (eg:
money) is assumed to have no influence on the presence or absence of any other word
(eg: buy). It is used for binary classification like spam/not spam, positive/negative,
yes/no.

The Naive Bayes algorithm is a simple yet powerful classification algorithm used in
machine learning. It's based on Bayes' theorem, which describes the probability of an
event occurring given prior knowledge of conditions related to the event.

Here's a brief overview of how the Naive Bayes algorithm works:

1. Bayes' Theorem: The algorithm is based on Bayes' theorem, which states:

P(A|B) = P(B|A) × P(A) / P(B)

In the context of classification, P(A|B) represents the probability of class A given the input features B, P(B|A) is the likelihood of observing features B given class A, P(A) is the prior probability of class A, and P(B) is the evidence probability.

2. Naive Assumption: The "naive" assumption in Naive Bayes is that the features are
conditionally independent given the class label. In other words, the presence of one
feature does not affect the presence of another feature. Although this assumption may
not hold true in many real-world scenarios, Naive Bayes often performs well in practice
despite this simplification.

3. Training: During the training phase, the algorithm calculates the prior probability P(A) for each class and the likelihood P(B|A) for each feature given each class. These probabilities are estimated from the training data.

4. Prediction: To make predictions on new data, the algorithm uses Bayes' theorem to calculate the posterior probability P(A|B) for each class given the input features B. The class with the highest posterior probability is then assigned as the predicted class.

5. Types of Naive Bayes: There are different variants of the Naive Bayes algorithm,
such as:

- Gaussian Naive Bayes: Assumes that continuous features follow a Gaussian


distribution.

- Multinomial Naive Bayes: Suitable for discrete features, often used in text
classification tasks.

- Bernoulli Naive Bayes: Assumes that features are binary (e.g., presence or absence
of a feature).

Naive Bayes is known for its simplicity, efficiency, and scalability, especially with
high-dimensional data. It's commonly used in text classification, spam filtering,
sentiment analysis, and other classification tasks where the independence assumption
holds reasonably well. However, it may not perform as well as more complex
algorithms when the independence assumption is violated or when dealing with highly
correlated features.
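Since the project works with word-count features from a CountVectorizer, the Multinomial variant is the natural choice. The sketch below is illustrative only and assumes the matrices from Section 6.2.1.

# Minimal Multinomial Naive Bayes sketch on word-count features (assumed setup from Section 6.2.1)
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

nb = MultinomialNB()                 # estimates P(class) and P(word | class) from the counts
nb.fit(X_train_dtm, y_train)
y_pred_nb = nb.predict(X_test_dtm)   # assigns the class with the highest posterior probability
print("Naive Bayes accuracy:", accuracy_score(y_test, y_pred_nb))
print(confusion_matrix(y_test, y_pred_nb))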

3. LOGISTIC REGRESSION:

It studies the association between a categorical dependent variable and a set of


independent variables and is used when the dependent variable has only two values,
such as 0 and 1. It is specifically designed for predicting binary outcomes and has the
highest accuracy in this case – 98%.

Logistic Regression is a fundamental supervised learning algorithm used for binary


classification tasks in machine learning. Despite its name, logistic regression is
primarily used for classification, not regression. It models the probability of a binary
outcome based on one or more predictor variables.

Here's how logistic regression works:

1. Linear Combination: Logistic regression starts by computing a linear combination


of the input features. Each feature is multiplied by a weight (coefficients) and summed
up with an intercept term.

2. Logistic Function (Sigmoid): The linear combination z is then passed through the logistic function (also known as the sigmoid function), σ(z) = 1 / (1 + e^(-z)), to squash the output into the range [0, 1]. This function ensures that the output represents a valid probability.

3. Probability Estimation: The output of the logistic function represents the estimated
probability that the input belongs to the positive class (class 1 in binary classification).
If the estimated probability is greater than or equal to a predefined threshold (usually
0.5), the input is classified as belonging to the positive class; otherwise, it is classified
as belonging to the negative class (class 0).

4. Training: During the training phase, logistic regression learns the optimal weights w0, w1, ..., wn by minimizing a loss function (e.g., logistic loss or cross-entropy loss) using an optimization algorithm such as gradient descent. The model is trained on labeled data, where the true class labels are known.

5. Evaluation: Once trained, the logistic regression model can make predictions on
new, unseen data by computing the probability of the input belonging to the positive
class and applying the threshold to classify it accordingly.

Logistic regression is widely used in various domains, including finance, healthcare,


and marketing, for tasks such as churn prediction, credit scoring, and disease diagnosis.
Despite its simplicity, logistic regression can be quite powerful, especially when the
relationship between the input features and the target variable is approximately linear
and the data is well-behaved.
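A minimal sketch of the probability-plus-threshold behaviour described above, again assuming the vectorized data from Section 6.2.1; it is not the project's exact code.

# Minimal Logistic Regression sketch with an explicit 0.5 decision threshold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression(max_iter=1000)        # learns the weights by minimizing the logistic loss
lr.fit(X_train_dtm, y_train)

proba = lr.predict_proba(X_test_dtm)[:, 1]    # sigmoid output: estimated P(fraudulent | features)
y_pred_lr = (proba >= 0.5).astype(int)        # classify as fraudulent when the probability reaches 0.5
print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred_lr))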

4. SUPPORT VECTOR MACHINE(SVM):

It is a powerful supervised learning algorithm used for classification and regression


tasks and is a binary classification algorithm. It performs well with high-dimensional
data, making it suitable for text classification tasks, which helps in reducing overfitting
and improving generalization performance, especially in cases where the dataset is
limited or imbalanced. It's particularly effective in high-dimensional spaces and is
widely used in various applications such as image classification, text classification, and
bioinformatics.

Here are some key points about the Support Vector Machine algorithm:

1. Binary Classification: SVM is primarily used for binary classification, where it


classifies data into two categories based on labeled training examples. However, it can
be extended to handle multi-class classification using strategies such as one-vs-one or
one-vs-all.

2. Margin Maximization: SVM aims to find the hyperplane that best separates the data points of different classes while maximizing the margin, which is the distance between the hyperplane and the nearest data points (support vectors). Maximizing the margin helps improve the generalization ability of the model and reduces the risk of overfitting.

3. Convex Optimization: Training an SVM involves solving a convex optimization


problem, which ensures that the algorithm converges to the global optimum solution.
This makes SVM efficient and reliable for large-scale datasets.

4. Sensitivity to Outliers: SVM is sensitive to outliers as they can influence the position
and orientation of the decision boundary. Techniques such as using soft-margin SVM
or robust kernel functions can help mitigate the impact of outliers on the model's
performance.

Overall, Support Vector Machine is a powerful and versatile algorithm for


classification tasks, known for its effectiveness in high-dimensional spaces, ability to
handle non-linear data, and robustness against overfitting. However, it may require
careful selection of hyperparameters and tuning for optimal performance on specific
datasets.
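For the high-dimensional sparse text features used here, a linear soft-margin SVM is a common choice; the sketch below is illustrative only and assumes the data prepared in Section 6.2.1.

# Minimal linear SVM sketch for the high-dimensional text features (assumed setup from Section 6.2.1)
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

svm = LinearSVC(C=1.0)                # soft-margin linear SVM; C trades margin width against training errors
svm.fit(X_train_dtm, y_train)
y_pred_svm = svm.predict(X_test_dtm)
print("SVM accuracy:", accuracy_score(y_test, y_pred_svm))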

CHAPTER-6

SOFTWARE IMPLEMENTATION

6.1 Technologies used in project:


6.1.1 Python:

● Python is a powerful multi-purpose programming language created by Guido van Rossum.

● It has a simple, easy-to-use syntax, making it a perfect language for someone learning computer programming for the first time.

Features Of Python:
1. Easy to code:
Python is a high-level programming language. Python is very easy to learn compared to other languages like C, C#, JavaScript and Java. It is very easy to code in Python, and anybody can learn the basics of Python in a few hours or days. It is also a developer-friendly language.

2. Free and Open Source:

Python is freely available on its official website and can be downloaded from there. Since it is open source, its source code is also available to the public. So, you can download it, use it, and share it.

3. Object-Oriented Language:
One of the key features of Python is object-oriented programming. Python supports object-oriented programming and the concepts of classes, objects, encapsulation, etc.

4. High-Level Language:

Python is a high-level language. When we write programs in python, we do not need to

remember the system architecture, nor do we need to manage the memory.

5. Python is a Portable Language:

Python is also a portable language. For example, if we have Python code written for Windows and we want to run it on another platform such as Linux, Unix or Mac, we do not need to change it; we can run the same code on any platform.

6. Interpreted Language:
Python is an interpreted language: Python code is executed line by line. Unlike languages such as C, C++ and Java, there is no separate compilation step, which makes it easier to debug our code. The source code of Python is converted into an intermediate form called bytecode.

7. Large Standard Library:

Python has a large standard library that provides a rich set of modules and functions, so you do not have to write your own code for every single thing. There are many libraries in Python for tasks such as regular expressions, unit testing, web browsers, etc.

6.1.2 SCIENTIFIC AND NUMERIC COMPUTING:


● There are numerous libraries available in Python for scientific and numeric computing. Libraries like SciPy and NumPy are used in general-purpose computing.
● Also, the language is heavily used in machine learning, data mining and deep learning.

6.1.3 Scikit - Learn:


It is used to bring algorithms into a production environment. Through scikit-learn, we can implement various machine learning models for regression, classification and clustering, along with statistical tools for analyzing these models. Scikit-learn is largely written in Python, and it integrates well with many other Python libraries, such as Matplotlib and Plotly for plotting, NumPy for array vectorization, Pandas dataframes, SciPy, and many more.

6.1.4 NLTK:
NLTK is a leading platform for building Python programs to work with human language
data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet,
along with a suite of text processing libraries for classification, tokenization, stemming, tagging,
parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active
discussion forum.

6.1.5 Pandas:
Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations for manipulating
numerical tables and time series. It is free software released under the three-clause BSD license.
The name is derived from the term "panel data", an econometrics term for data sets that include
observations over multiple time periods for the same individuals. Its name is a play on the phrase
"Python data analysis" itself.
Pandas is mainly used for data analysis and associated manipulation of tabular data in
DataFrames. Pandas allows importing data from various file formats such as comma-separated
values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel. Pandas allows
various data manipulation operations such as merging, reshaping, selecting, as well as data
cleaning, and data wrangling. The pandas library is built upon another library, NumPy, which is oriented toward working efficiently with arrays rather than DataFrames.

6.1.6 HTML:

HTML, an abbreviation for Hyper Text Markup Language, is the standard markup language used to create web pages. HTML is written in the form of HTML elements consisting of tags enclosed in angle brackets (like <html>). HTML tags most commonly come in pairs like <h1> and </h1>, although some tags represent empty elements and so are unpaired, for example <img>. The first tag in a pair is the start tag, and the second tag is the end tag (they are also called opening tags and closing tags).

The purpose of a web browser is to read HTML documents and compose them into visible or
audible web pages. The browser does not display the HTML tags but uses the tags to interpret the
content of the page. HTML elements form the building blocks of all websites. HTML allows
images and objects to be embedded and can be used to create interactive forms. It provides a means
to create structured documents by denoting structural semantics for text such as headings,
paragraphs, lists, links, quotes, and other items. It can embed scripts written in languages such as
JavaScript which affects the behavior of HTML web pages.

6.1.7 CSS:

CSS stands for Cascading Style Sheet. CSS is used to design HTML tags. CSS is a widely
used language on the web. HTML, CSS, and JavaScript are used for web designing. It helps web
designers to apply a style to HTML tags. It can also be used to allow the web page to display
differently depending on the screen size or device on which it is being viewed.

With plain HTML you define the colors and sizes of text and tables throughout your pages. If you want to change a certain element, you therefore have to work your way through the document and change it. With CSS you define the colors and sizes in "styles"; then, as you write your documents, you refer to the styles. Another big advantage is that CSS offers much more detailed attributes than plain HTML for defining the look and feel of your site.

6.1.8 JavaScript:

JavaScript is a versatile programming language primarily used for web development. It is a


client-side scripting language, meaning it runs on the user's web browser rather than on a remote
server, allowing for fast and responsive web applications. It enables developers to manipulate the
content and behavior of web pages dynamically, allowing for features such as form validation,
interactive maps, animations, and much more. In addition to client-side scripting, JavaScript is
also used on the server-side through platforms like Node.js, which allows developers to build
scalable and high-performance web applications entirely in JavaScript.

6.1.9 DJANGO:

Django is a high-level Python web framework that facilitates rapid development and clean,
pragmatic design. One of the key features of Django is its ability to seamlessly integrate the
frontend and backend components of a web application. Key features of Django include its
powerful ORM (Object-Relational Mapping) system, which simplifies database management and
interactions, along with a robust admin interface that automates administrative tasks.

Additionally, Django's URL routing mechanism enables clean and concise URL patterns, enhancing the overall readability of web applications. Moreover, Django boasts a comprehensive authentication system, aiding in the implementation of user authentication and authorization functionalities. Overall, Django's emphasis on rapid development, security, and scalability, combined with its rich feature set and thriving community, solidifies its position as one of the leading web frameworks for building robust and maintainable web applications.
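As an illustration of the ORM mentioned above, a model along the following lines could back the registration and login views shown in Section 6.2.4. The class and field names are inferred from that code; the project's actual models.py is not reproduced in this report, so treat this as an assumption.

# Hypothetical models.py sketch for the Remote_User app, inferred from the views in Section 6.2.4
from django.db import models

class ClientRegister_Model(models.Model):
    username = models.CharField(max_length=30)
    email = models.EmailField()
    password = models.CharField(max_length=30)
    phoneno = models.CharField(max_length=15)
    country = models.CharField(max_length=30)
    state = models.CharField(max_length=30)
    city = models.CharField(max_length=30)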

6.2 Sample Code:

6.2.1 Machine Learning Code:

# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the job postings dataset
data = pd.read_csv("C:/Users/PERSONAL/Desktop/New folder (5)/Fake-Job-Prediction-Model-main/Job_Posting_DataSets.csv", encoding='latin1')

# Drop columns not used by the text-based model and fill missing values
data.drop(['job_id', 'salary_range', 'telecommuting', 'has_company_logo', 'has_questions'], axis=1, inplace=True)
data.fillna(' ', inplace=True)

# Create independent and dependent features
columns = data.columns.tolist()  # store all columns of data into the columns variable

# Filter the columns to remove data we do not want
columns = [c for c in columns if c not in ["fraudulent"]]
target = "fraudulent"
state = np.random.RandomState(42)
X = data[columns]
Y = data["fraudulent"]
X_outliers = state.uniform(low=0, high=1, size=(X.shape[0], X.shape[1]))

# Balance the classes by undersampling the legitimate postings
from imblearn.under_sampling import RandomUnderSampler
under_sampler = RandomUnderSampler()
X_res, y_res = under_sampler.fit_resample(X, Y)
df1 = pd.DataFrame(X_res)
df3 = pd.DataFrame(y_res)
result = pd.concat([df1, df3], axis=1, join='inner')
display(result)  # display() is available in Jupyter/IPython
data = result

DATA PREPARATION:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# The 'text' column is assumed to be built by concatenating the main textual
# attributes, mirroring the concatenation done in views.py for live predictions
data['text'] = (data['title'] + ' ' + data['location'] + ' ' + data['company_profile'] + ' '
                + data['description'] + ' ' + data['requirements'] + ' ' + data['benefits']
                + ' ' + data['industry'])

data['text'] = data['text'].apply(lambda x: x.lower())
data['text'] = data['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

Splitting the dataset:


from sklearn.model_selection import train_test_split #import this func
X_train, X_test, y_train, y_test = train_test_split(data.text, data.fraudulent, test_size=0.3)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)

Model Building & evaluation:

from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
import joblib

# Naive Bayes
nb = MultinomialNB()
%time nb.fit(X_train_dtm, y_train)   # %time is an IPython/Jupyter magic
y_pred_nb = nb.predict(X_test_dtm)

# Decision Tree
dt = DecisionTreeClassifier()
%time dt.fit(X_train_dtm, y_train)
y_pred_class = dt.predict(X_test_dtm)

# input_text is a list containing the text of the job post to be checked
input_data_features = vect.transform(input_text)
prediction = dt.predict(input_data_features)
if prediction[0] == 1:
    print('Fraudulent Job')
else:
    print('Real Job')

# Persist the trained model so the Django app can load it later
joblib.dump(dt, 'filename.pkl')
dt_from_joblib = joblib.load('filename.pkl')

6.2.2 Django file:

"""Django's command-line utility for administrative tasks."""
import os
import sys


def main():
    """Run administrative tasks."""
    os.environ.setdefault('DJANGO_SETTINGS_MODULE',
                          'aComparative_study_onFake_jobPost.settings')
    try:
        from django.core.management import execute_from_command_line
    except ImportError as exc:
        raise ImportError(
            "Couldn't import Django. Are you sure it's installed and "
            "available on your PYTHONPATH environment variable? Did you "
            "forget to activate a virtual environment?"
        ) from exc
    execute_from_command_line(sys.argv)


if __name__ == '__main__':
    main()

6.2.3 Frontend Code (HTML, CSS, Bootstrap):

1. Remote User:
<div class="topnav">
  <a href="{% url 'Search_DataSets' %}">PREDICT JOB POST PREDICTION</a>
  <a href="{% url 'ViewYourProfile' %}">VIEW YOUR PROFILE</a>
  <a href="{% url 'login' %}">LOGOUT</a>
</div>

<div class="mainholder">
  {% block userblock %}
  {% endblock %}
</div>

2. Service Provider:
<div class="topnav">
  <a href="{% url 'train_model' %}">Train and Test Data Sets</a>
  <a href="{% url 'likeschart' 'bar' %}">View Trained and Tested Accuracy in Bar Chart</a>
  <a href="{% url 'charts1' 'line' %}">View Trained and Tested Accuracy Results</a>
  <a href="{% url 'View_Remote_Users' %}">View All Remote Users</a>
  <a href="{% url 'login' %}">Logout</a>
</div>

6.2.4 Python Code (Views.py file):

1. Remote User:
import joblib

# Load the persisted Decision Tree model and CountVectorizer once at module load
dt_from_joblib = joblib.load('Remote_User/filename.pkl')
vect = joblib.load('Remote_User/vectorizer.pkl')

def login(request):
    if request.method == "POST" and 'submit1' in request.POST:
        username = request.POST.get('username')
        password = request.POST.get('password')
        try:
            enter = ClientRegister_Model.objects.get(username=username, password=password)
            request.session["userid"] = enter.id
            return redirect('Add_DataSet_Details')
        except:
            pass
    return render(request, 'RUser/login.html')

def Register1(request):
    if request.method == "POST":
        username = request.POST.get('username')
        email = request.POST.get('email')
        password = request.POST.get('password')
        phoneno = request.POST.get('phoneno')
        country = request.POST.get('country')
        state = request.POST.get('state')
        city = request.POST.get('city')
        ClientRegister_Model.objects.create(username=username, email=email,
                                            password=password, phoneno=phoneno,
                                            country=country, state=state, city=city)
        return render(request, 'RUser/Register1.html')
    else:
        return render(request, 'RUser/Register1.html')

def Search_DataSets2(request):
    if request.method == "POST":
        title = request.POST.get('title')
        location = request.POST.get('location')
        company_profile = request.POST.get('company_profile')
        description = request.POST.get('description')
        requirements = request.POST.get('requirements')
        benefits = request.POST.get('benefits')
        industry = request.POST.get('industry')

        # Combine the submitted fields into a single text string, mirroring the
        # preprocessing applied when the model was trained
        text = [title + " " + location + " " + company_profile + " " + description + " "
                + requirements + " " + benefits + " " + industry]

        input_data_features = vect.transform(text)
        prediction = dt_from_joblib.predict(input_data_features)
        print(prediction)

        if (prediction[0] == 1):
            result = 'Fraudulent Job'
            return render(request, 'RUser/Search_DataSets.html', {'objs': result})
        else:
            result = 'Real Job'
            return render(request, 'RUser/Search_DataSets.html', {'objs': result})

        # Retained from the original source; unreachable after the returns above
        X_live = [[job_id, title, location, department, salary_range, company_profile,
                   description, requirements, benefits, telecommuting, has_company_logo,
                   has_questions, employment_type, required_experience, required_education,
                   industry, function, fraudulent]]

    return render(request, 'RUser/Search_DataSets.html')

2. Service Provider:
def serviceproviderlogin(request):
    if request.method == "POST":
        admin = request.POST.get('username')
        password = request.POST.get('password')
        if admin == "Admin" and password == "Admin":
            detection_accuracy.objects.all().delete()
            return redirect('View_Remote_Users')
    return render(request, 'SProvider/serviceproviderlogin.html')


def View_Remote_Users(request):
    obj = ClientRegister_Model.objects.all()
    return render(request, 'SProvider/View_Remote_Users.html', {'objects': obj})


def charts(request, chart_type):
    chart1 = detection_ratio.objects.values('names').annotate(dcount=Avg('ratio'))
    return render(request, "SProvider/charts.html", {'form': chart1, 'chart_type': chart_type})


def charts1(request, chart_type):
    chart1 = detection_accuracy.objects.values('names').annotate(dcount=Avg('ratio'))
    return render(request, "SProvider/charts1.html", {'form': chart1, 'chart_type': chart_type})

6.2.5 Charts’ Code:


var chart = new CanvasJS.Chart("chartContainer", {
    animationEnabled: true,
    title: {
        text: ""
    },
    data: [{
        {% if chart_type == "line" %}
        type: "line",
        {% elif chart_type == "pie" %}
        type: "pie",
        {% elif chart_type == "spline" %}
        type: "spline",
        {% endif %}
        startAngle: 240,
        yValueFormatString: "##0.00\"%\"",
        indexLabel: "{label} {y}",
        dataPoints: [
            {% for o in form %}
            {y: {{o.dcount}}, label: "{{o.names}}"},
            {% endfor %}
        ]
    }]
});
chart.render();
CHAPTER-7

SYSTEM TESTING

The purpose of testing is to discover errors. Testing is the process of looking for any flaws
or weaknesses in a piece of work. It is the process of testing software to make sure that it satisfies
user expectations and meets requirements without failing in an unacceptable way. There are
various types of tests. Each test type addresses a specific testing requirement.

7.1 TYPES OF TESTS:

7.1.1 Unit testing:


Unit testing involves designing test cases to verify that the internal program logic is functioning
correctly and that program inputs produce valid outputs. All decision branches and internal code
flow should be validated. It is the testing of individual software units of the application, carried
out after the completion of an individual unit and before integration. This is an intrusive structural
test that depends on knowledge of how the unit was built. Unit tests perform fundamental tests at
the component level and exercise a specific application, business procedure, or system
configuration. They make sure that each distinct path of a business process adheres precisely to
the documented specifications and has well-defined inputs and outputs.

7.1.2 Integration testing:

Integration tests verify that software components which have been integrated actually operate as
a single program. Testing is event driven and is more concerned with the basic outcome of screens
or fields. Successful unit tests show that the individual components were satisfactory; integration
tests show that the combination of components is correct and consistent. Integration testing is
specifically aimed at exposing the problems that arise from the combination of components.

7.1.3 Functional test:

Functional tests provide methodical evidence that the tested functions are available and meet
all technical and business requirements as well as those listed in the system documentation and
user manuals. Functional testing is centered on the following items:

Valid Input: identified classes of valid input must be accepted.

Invalid Input: identified classes of invalid input must be rejected.

Functions: identified functions must be exercised.

Output: identified classes of application output must be exercised.

Systems or Procedures: interfacing systems or procedures must be invoked.

Organization and preparation of functional tests are focused on requirements, key functions,
or special test cases. In addition, systematic coverage pertaining to identifying business process
flows, including data fields, predefined processes, and successive processes, must be considered
for testing. Before functional testing is complete, additional tests are identified and the effective
value of current tests is determined.

7.2 System Test

System testing ensures that the entire integrated software system meets requirements. It tests
a configuration to ensure known and predictable results. An example of system testing is the
configuration-oriented system integration test. System testing is based on process descriptions and
flows, emphasizing pre-driven process links and integration points.

7.2.1 White Box Testing


White box testing is a type of software testing in which the tester has knowledge of the inner
workings, structure, and language of the software, or at least of what it is intended to do. It is
used to test areas that cannot be reached from the black box level.

White box testing, also known as structural testing or glass box testing, is a software testing
technique that examines the internal structure of the software being tested. Unlike black box
testing, which focuses on testing the functionality of the software without knowledge of its internal
workings, white box testing requires knowledge of the internal code and logic.

In white box testing, testers design test cases based on the internal structure of the software,
including code paths, branches, and conditions. The goal is to ensure that all paths through the code
are tested, and that the software behaves as expected under different conditions. Techniques used
in white box testing include code coverage analysis, path testing, and mutation testing.

White box testing is particularly useful for identifying errors or bugs in the code, as well as
ensuring code quality and reliability. It is commonly used during the development phase of
software engineering, although it can also be applied during maintenance or debugging processes.

7.2.2 Black Box Testing

The term "black box testing" refers to software testing done without any prior knowledge of the
inner workings, architecture, or language of the module being tested. Black box tests, like most
other types of tests, must be created from a definitive source document, such as a specification or
requirements document. In this type of testing, the software under test is treated as a black box
that cannot be "seen" into: the test supplies inputs and checks outputs without considering how
the software operates internally.

Black box testing is a software testing technique where the internal structure, design, and
implementation of the software being tested are not known to the tester. Instead, the tester focuses
solely on the external behavior of the software, treating it as a "black box" whose internal workings
are not visible.

In black box testing, testers design test cases based on the software's specifications, requirements,
and functionality. They do not have access to the source code or knowledge of the underlying
algorithms or logic. The goal is to verify that the software behaves correctly and produces the
expected outputs based on different inputs and conditions.

Black box testing is useful for validating the functionality, usability, and reliability of the software
from an end-user perspective. It helps identify issues such as incorrect or missing features, user
interface flaws, and compatibility issues. Common techniques used in black box testing include
equivalence partitioning, boundary value analysis, and error guessing.

Black box testing is typically performed at various stages of the software development life cycle,
including requirements analysis, system testing, and acceptance testing. It complements white box
testing by providing a different perspective on the software's quality and ensuring thorough test
coverage.

7.3 Unit Testing:

Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as two
distinct phases.

Unit testing is a software testing technique where individual units or components of a software
system are tested in isolation to verify that they behave as expected. A unit can be a function,
method, class, or module, depending on the granularity of the software being tested.

In unit testing, test cases are designed to exercise specific functionalities or behaviors of the unit
under test. These test cases are typically automated, allowing them to be executed repeatedly and
efficiently. Unit tests are written to cover various scenarios, including normal inputs, boundary
conditions, and error cases.

The primary goals of unit testing are to:

1. Verify that each unit of the software performs correctly according to its specifications.

2. Detect and fix defects early in the development process, reducing the cost of debugging
and maintenance.

3. Provide a safety net for refactoring and code changes, ensuring that existing
functionalities remain intact.

Unit testing frameworks, such as JUnit for Java, pytest for Python, and NUnit for .NET,
provide tools and utilities to facilitate the creation, execution, and organization of unit tests.
Incorporating unit testing into the software development process promotes code quality, reliability,
and maintainability.
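
As a concrete illustration, a unit test for the prediction step of this project could be written with pytest. The sketch below assumes that the pickled vectorizer and Decision Tree model from Section 6.2 are available at the paths shown; the sample input text is purely illustrative.

# test_prediction.py -- minimal pytest sketch (paths and sample text are assumptions)
import joblib
import pytest

@pytest.fixture(scope="module")
def model_and_vectorizer():
    vect = joblib.load('Remote_User/vectorizer.pkl')
    model = joblib.load('Remote_User/filename.pkl')
    return vect, model

def test_prediction_returns_binary_label(model_and_vectorizer):
    vect, model = model_and_vectorizer
    features = vect.transform(["marketing intern work from home no experience required"])
    prediction = model.predict(features)
    # The classifier must label the post as either real (0) or fraudulent (1)
    assert prediction[0] in (0, 1)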

Test strategy and approach

Field testing will be performed manually, and functional tests will be documented in detail.

Test objectives

● All field entries must work properly.

● Pages must be activated from the identified link.

● The entry screen, messages, and responses must not be delayed.

Features to be tested
● Verify that the entries are of the correct format
● No duplicate entries should be allowed

● All links should take the user to the correct page.
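
A minimal sketch of how the link and page-activation checks listed above could be automated with Django's built-in test client is shown below; the URL names follow the templates in Section 6.2.3, and the expected status codes are assumptions.

# Functional check that pages are activated from the identified links
from django.test import TestCase
from django.urls import reverse

class NavigationLinkTests(TestCase):
    def test_login_page_loads(self):
        response = self.client.get(reverse('login'))
        self.assertEqual(response.status_code, 200)

    def test_predict_page_loads(self):
        response = self.client.get(reverse('Search_DataSets'))
        self.assertEqual(response.status_code, 200)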

7.4 Integration Testing

Software integration testing is the incremental integration of two or more software components
on a single platform with the aim of producing failures caused by interface defects. The task of
the integration test is to check that components or software applications, for example components
in a software system or, one step up, software applications at the company level, interact without
error.

7.5 Acceptance Testing


User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional requirements.

7.6 TESTING METHODOLOGIES
The following are the testing methodologies:
● Unit Testing.
● Integration Testing.
● User Acceptance Testing.
● Output Testing.

7.6.1 Unit Testing

Unit testing focuses verification efforts on the smallest unit of software design, which is
the module. In order to ensure complete coverage and maximum error detection, unit testing
exercises particular paths in a module's control structure. This testing examines each module
individually to make sure that it functions correctly as a unit; hence the name Unit Testing.

Each module is tested separately during this testing, and the module interfaces are checked
for consistency with the design specification. All significant processing paths are examined
for the anticipated outcomes, and every error-handling path is tested as well.

7.6.2 Integration Testing

Integration testing addresses the issues associated with the dual problems of
verification and program construction. After the software has been integrated, a series of
high-level tests are conducted. The main objective of this testing process is to take unit-
tested modules and build a program structure that has been dictated by design.

The following are the types of Integration Testing:

7.6.2.1 Top-Down Integration:


This method is an incremental approach to the construction of program structure.
Moving down the control hierarchy, starting with the main program module, modules are
integrated. Either a depth-first or a breadth-first approach is used to incorporate the modules
that are subordinate to the main program module into the structure. This approach replaces
each individual stub as the test descends, starting with the main module of the software.

7.6.2.2. Bottom-up Integration:

This method begins the construction and testing with the modules at the lowest level
in the program structure. Since the modules are integrated from the bottom up, processing
required for modules subordinate to a given level is always available, and the need for stubs
is eliminated. The bottom-up approach tests each module individually, then each module is
integrated with the main module and tested for functionality.

7.6.3 User Acceptance Testing

User Acceptance of a system is the key factor for the success of any system. The
system under consideration is tested for user acceptance by constantly keeping in touch
with the prospective system users at the time of development and making changes wherever
required. The system developed provides a friendly user interface that can be easily
understood even by a person who is new to the system.

7.6.4 Output Testing


After performing the validation testing, the next step is to test the output of the
proposed system, since no system could be useful if it did not produce the required output
in the specified format.
Output testing, also known as results-based testing or outcome testing, is a software testing
approach focused on verifying the correctness and reliability of the output produced by a
system or application. It ensures that the software generates the expected results in response
to specific inputs or actions.

Output testing is integral to various software development methodologies, including:

1. Waterfall Model: In the traditional waterfall model, output testing typically occurs towards
the end of the development life cycle, after the completion of coding and integration phases.
Testers compare the actual outputs against expected outputs based on predefined requirements
and specifications.

2. Agile Methodology: In Agile methodologies such as Scrum, output testing is performed
iteratively throughout the development process. Test cases are continuously executed against
new features or increments to validate their functionality and correctness. The emphasis is on
delivering working software in short cycles, with frequent testing and feedback loops.

3. DevOps: In DevOps practices, output testing is automated and integrated into the
continuous integration/continuous deployment (CI/CD) pipeline. Automated tests are
executed automatically upon code commits or deployments to ensure that changes do not
introduce regressions or errors in the output. This approach enables rapid and reliable
delivery of software while maintaining high quality.

4. Behavior-Driven Development (BDD): In BDD, output testing is closely tied to defining and
verifying desired behaviors or outcomes through scenarios written in a natural language format
(e.g., Gherkin syntax). Test scenarios are used as acceptance criteria to validate that the software
produces the correct output for various user interactions or system behaviors.

Regardless of the methodology used, output testing aims to ensure that the software
meets the desired functionality, performance, and quality standards, ultimately delivering
value to users and stakeholders. It involves designing appropriate test cases, executing tests,
and analyzing results to validate the correctness of the output produced by the software.

7.6.5 Validation Testing


Validation testing is a crucial aspect of software testing that focuses on ensuring
that a software product or system meets the requirements and expectations of its users
and stakeholders. It involves evaluating the software against the intended use cases and
scenarios to validate that it functions correctly, meets user needs, and aligns with
business objectives. Validation testing is typically performed towards the end of the
development process, once the software has been built and is ready for deployment.

Validation checks are performed on the following fields.
1. Text Field: A text field can contain only a number of characters less than or equal to its
declared size. An incorrect entry flashes an error message.
2. Numeric Field: A numeric field can contain only the digits 0 to 9. An entry of any other
character flashes an error message. Testing is carried out with real data, and the existence of
any program defect is inferred from the output.
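
A minimal sketch of how these field-level checks could be expressed as a Django form is given below; the field names and sizes are illustrative assumptions rather than the project's actual definitions.

from django import forms

class RegistrationForm(forms.Form):
    # Text field: input longer than the declared size is rejected
    username = forms.CharField(max_length=30)
    # Numeric field: only the digits 0-9 are accepted
    phoneno = forms.RegexField(regex=r'^[0-9]+$',
                               error_messages={'invalid': 'Only digits 0-9 are allowed.'})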

It encompasses various testing activities, including:

1. User Acceptance Testing (UAT): This involves testing the software from the perspective
of end-users to ensure that it meets their requirements and expectations. UAT is often
conducted by actual users or representatives of the target audience in a real-world
environment.

2. Functional Testing: Functional testing verifies that the software performs the functions
and tasks specified in the requirements and design documents. Test cases are designed to
cover various functional aspects of the software to ensure its correctness and completeness.

3. Usability Testing: Usability testing assesses the ease of use, intuitiveness, and user-
friendliness of the software interface. It evaluates factors such as navigation, accessibility,
and user satisfaction to ensure a positive user experience.

4. Compatibility Testing: Compatibility testing ensures that the software works correctly
across different platforms, devices, browsers, and operating systems. It verifies that the
software is compatible with various configurations and environments commonly used by
its target audience.

5. Regression Testing: Regression testing is performed to ensure that recent changes or updates
to the software have not introduced any new defects or regressions. It involves retesting the
software's existing features and functionalities to validate their continued correctness and
stability.

6. Performance Testing: Performance testing evaluates the software's performance,
scalability, and responsiveness under different load conditions. It ensures that the software
can handle expected levels of user traffic and data volume without degradation in
performance.

Overall, validation testing plays a critical role in ensuring that the software meets quality
standards, delivers value to users, and satisfies the business requirements effectively and
efficiently. It helps identify any discrepancies between the expected and actual behavior of
the software, enabling necessary adjustments and improvements to be made before
deployment.

CHAPTER-8
RESULTS AND OUTPUT SCREENS

8.1 Homepage/User Login:


This shows the user login page, which allows the user to log in using their username and
password. New users can register using the Register option below. If the admin is trying to log
in, they can click on the "Service Provider" option.

Figure. 8.1: Homepage/Login screen

8.2 Registration Page:


A new user can register by entering their name, email, password, state, city and country and
then clicking the Register button to complete registration.

Figure. 8.2: New User Registration screen


8.3 Service Provider/Admin Login page:
The service provider refers to the admin. The admin can login to this page using his
credentials.

Figure. 8.3: Service Provider Login screen

8.4 User Module Features:


Once the user logs in using his username and password, he can access a variety of features such
as:
1. Predict Job Posts: This allows the user to enter the job post details like job title,
description, location, requirements, benefits and industry. They can then click the “Predict”
button to predict if the job is fake or real. It displays either “Real Job” or “Fraudulent Job”.

Figure. 8.4.1: Real/Fraudulent Job Post Prediction screen

2. View your Profile: This allows the user to view their profile details like their name, email,
mobile number, password, city, state and country.

Figure. 8.4.2: View your profile screen

3. Logout: The user can logout of their account.

8.5 Service Provider/Admin Module Features:


1. Train and Test Datasets: Clicking this trains and tests the models on the EMSCAD dataset,
showing evaluation metrics such as Accuracy, Confusion Matrix, Precision, Recall and F1 Score
for all four algorithms used.

Figure 8.5.1: Accuracy of the algorithms used screen
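
These evaluation metrics could be computed with scikit-learn for each trained classifier; the sketch below follows the variable names used in Section 6.2.1 and is illustrative rather than the exact code behind this screen.

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# y_test holds the true labels; y_pred_class holds the Decision Tree predictions
print("Accuracy :", accuracy_score(y_test, y_pred_class))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_class))
print("Precision:", precision_score(y_test, y_pred_class))
print("Recall   :", recall_score(y_test, y_pred_class))
print("F1 Score :", f1_score(y_test, y_pred_class))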
2. View Trained and Tested Accuracy in Bar Chart:

Figure 8.5.2: Bar Chart of Algorithms screen

3. View Trained and Tested Accuracy Results (in Pie and Line Charts):

Figure 8.5.3: Line and Pie Chart of Algorithms screen

4. View all remote users: The admin can view all the registered users with their name, email, mobile
number, country, state and city.

Figure. 8.5.4: Viewing all remote users’ screen

5. Logout: The admin can logout.

CHAPTER-9
CONCLUSION

Job scams have become a great concern all over the world. In this project, we have analyzed
the impact of job scams, a promising research area that poses many challenges in detecting
fraudulent job posts. The EMSCAD dataset, which contains real-world job posts from the last
ten years, has been used for experimentation. We have experimented with machine learning
algorithms (SVM, Naïve Bayes, Random Forest and Logistic Regression). This work presents a
comparative study of the evaluation of traditional machine learning classifiers, among which
Logistic Regression is seen to have the highest classification accuracy.

This is a versatile and practical tool in the fight against fraudulent jobs. It helps users avoid
financial losses and protects their personal information. This will guide job-seekers to get only
legitimate offers from companies. It adds value to the reputation and credibility of online job
platforms and ensures that legitimate job opportunities receive the attention they deserve. It has a
user-friendly interface and high accuracy, as we have used multiple algorithms to predict the
output.

In essence, this project offers a valuable solution to the challenges posed by online job
recruitment portals, promoting a more efficient, secure, and focused digital environment for
users across various contexts, whether freshers or senior professionals.

CHAPTER-10
FUTURE ENHANCEMENT

This project has a wide scope as it is not intended for a particular platform, but can be applied to
any job recruitment platform through the job description or the details provided.

1. A mobile application could be developed catering to users who primarily access job platforms
through mobile devices.

2. Integration of image processing, to detect postings that may contain misleading multimedia
content. This could be used for job posts made via Instagram, LinkedIn, Twitter, WhatsApp, etc.

3. Cross-platform compatibility to extend the system's capabilities beyond web-based job
platforms, covering social media and other emerging job search channels. It could be integrated
to be used for job posts on social media or anywhere other than job recruitment portals.

4. Aim to deploy this alongside job portals to flag fraudulent job postings. It could be used as a tool
alongside job portals like Naukri, LinkedIn, Glassdoor, Indeed, etc, so users can securely access
them and focus their time on valid and legitimate job postings, leading to an efficient job search
process.

5. User Feedback Mechanism: Implement a user feedback mechanism where users can report
suspicious job postings, providing valuable data for model retraining and improving the accuracy
of the detection system over time.

CHAPTER-11

REFERENCES

[1] S. Vidros, C. Kolias, G. Kambourakis, and L. Akoglu, "Automatic Detection of Online
Recruitment Frauds: Characteristics, Methods, and a Public Dataset", Future Internet, 2017, 9, 6,
doi:10.3390/fi9010006.
[2] B. Alghamdi and F. Alharby, "An Intelligent Model for Online Recruitment Fraud Detection",
Journal of Information Security, Vol. 10, pp. 155-176, 2019,
https://doi.org/10.4236/iis.2019.103009.
[3] J. R. Scanlon and M. S. Gerber, "Automatic Detection of Cyber Recruitment by Violent
Extremists", Security Informatics, 3, 5, 2014, https://doi.org/10.1186/s13388-014-0005-5.
[4] Y. Kim, "Convolutional Neural Networks for Sentence Classification", arXiv preprint
arXiv:1408.5882, 2014.
[5] https://www.indusedu.org/pdfs/IJREISS/IJREISS_4057_46806.pdf
[6] https://www.irjmets.com/uploadedfiles/paper//issue_10_october_2022/30925/final/fin_irjmets
1668589338.pdf
[7] I. Rish, "An Empirical Study of the Naïve Bayes Classifier", January 2001, pp. 41–46.
[8] D. E. Walters, "Bayes's Theorem and the Analysis of Binomial Random Variables",
Biometrical Journal, vol. 30, no. 7, pp. 817–825, 1988, doi: 10.1002/bimj.4710300710.
[9] Tin Van Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, and Anh Gia-Tuan Nguyen,
"Job Prediction: From Deep Neural Network Models to Applications", RIVF International
Conference on Computing and Communication Technologies (RIVF), 2020.
