Fake Job Posting Detection Documentation
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
By
P. SHIRIN --------------------------- (20911A0554)
CH. SATHWIKA ------------------ (20911A0551)
G. MEGHANA --------------------- (21915A0504)
A. ANOKH -------------------------- (21915A0501)
CERTIFICATE
This is to certify that the project report titled “FAKE JOB POSTING DETECTION”, being submitted by P. SHIRIN (20911A0554), CH. SATHWIKA (20911A0551), G. MEGHANA (21915A0504) and A. ANOKH (21915A0501) in partial fulfillment of the requirements for the award of the Degree of Bachelor of Technology in Computer Science and Engineering, is a record of bonafide work carried out by them under my guidance and supervision. The results embodied in this project report have not been submitted to any other University or Institute for the award of any degree or diploma.
External Examiner
DECLARATION
We, P. SHIRIN, CH. SATHWIKA, G. MEGHANA, and A. ANOKH, hereby declare that the
project entitled “FAKE JOB POSTING DETECTION”, submitted for the degree of
Bachelor of Technology in Computer Science and Engineering, is original, has been done
by us, and has not been copied from or submitted anywhere else for the award of any degree.
A. ANOKH (21915A0501)
ACKNOWLEDGEMENT
We wish to express our sincere gratitude to the project guide, Mrs. V. Prathima, Assistant
Professor, Vidya Jyothi Institute of Technology, Hyderabad for her timely cooperation and
valuable suggestions while carrying out this work. It is her kindness that made us learn more
from her.
We are grateful to Dr. D Aruna Kumari, Professor, and HOD, department of CSE, for her
help and support during our academic year.
We wholeheartedly convey our gratitude to Principal Dr. E. Sai Baba Reddy for providing
the necessary infrastructure to complete this project.
We would also like to thank our parents and all the faculty members who have contributed to our
progress throughout the course to reach this stage.
P. SHIRIN (20911A0554)
A. ANOKH (21915A0501)
CH. SATHWIKA (20911A0551)
G. MEGHANA (21915A0504)
ABSTRACT
This project aims to address the problem of fraudulent job postings, which is a
significant concern in the realm of Online Recruitment Frauds (ORF). These fraudulent
postings not only pose risks to individuals but also damage the reputation of companies.
With the widespread use of online platforms for job advertisements, fake job postings have
become increasingly common. These postings deceive job-seekers by offering non-
existent positions and often result in financial loss. Even reputable job advertising
platforms are not immune to such scams. To tackle this issue, an automated system based
on machine learning algorithms has been developed. Users provide details such as job
title, location, requirements, description, benefits, company profile, and industry. Through
feature extraction, the system classifies job postings as either legitimate or fraudulent by
analyzing their textual content and descriptions. Various data mining techniques and
classification algorithms, including decision trees, support vector machines, naive Bayes
classifiers, and random forest classifiers, are employed for this purpose. The dataset used
for training and testing is the Employment Scam Aegean Dataset (EMSCAD), which
contains 17,014 legitimate and 866 fraudulent job ads published between 2012 and 2014,
obtained from the University of the Aegean, Greece. This project provides a practical
solution to the problem of fraudulent job postings, contributing to a safer online job-
seeking environment and protecting both job seekers and the reputation of companies
advertising on these platforms.
INDEX
1. Introduction 1-2
2. Literature Survey 3
3. Feasibility Study 4
3.1 Technological sides 4
3.2 Web Technologies 4
3.3 Economical side 4
6.1.4 NLTK 28
6.1.5 Pandas 28
6.1.7 CSS 29
6.1.8 JavaScript 29
6.1.9 Django 30
9. Conclusion 54
11. References 56
LIST OF FIGURES
CHAPTER-1
INTRODUCTION
Through job portals and recruitment sites, individuals explore potential job opportunities
based on their qualifications, experience, and preferences. The impact of social media on job
advertisement is significant, with the rapid sharing of job posts leading to an increase in fraudulent
postings. This not only poses challenges for job seekers but also for recruitment platforms.
Consequently, individuals may become wary of new job postings due to concerns about the
security and accuracy of their personal and professional information.
The proliferation of fake job posts results in wasted time and inconvenience for job seekers. Job
scams involve offering enticing job offers to candidates, only to deceive them and obtain their
money or personal information. These scams can take various forms, such as advance fee scams
or illegal money laundering schemes. Scammers often utilize fake websites and emails to lure job
seekers, targeting platforms like LinkedIn to pose as legitimate recruiters. Despite efforts to
combat cybercrime, job scams persist, with fraudsters constantly adapting their tactics.
In India, a significant percentage of job seekers encounter scams during their job search, with job
scams rising dramatically in recent years. About 56 per cent of job seekers in India face scams
during their job hunt, and job scam reports rose by about 250 per cent in the first quarter of 2023
compared to the same period in 2022. A recent survey from the UK has shown that more than 67%
of people who look for jobs through online advertisements are at great risk because they are unaware
of fake job posts and job scams. Traditional cybercrime prevention methods have proven ineffective against
the evolving nature of job scams. Filtering out false job postings would significantly benefit the
recruitment process, safeguarding both job seekers and the reputation of companies. This project
aims to address this challenge by implementing strategies to predict and identify fraudulent job
posts, ultimately enhancing trust in the recruitment process and protecting the integrity of company
brands.
2. Time Efficiency: To save job seekers' time and effort by filtering out fraudulent postings, leading to a
more efficient job search process.
3. Identity Protection: A detection system helps safeguard sensitive personal data, reducing the risk
of identity theft.
4. Preserving Employer Reputation: To protect the reputation of genuine employers by preventing
misuse of their branding.
5. International Reach: To be used for job markets worldwide as online job searching is a global
phenomenon and the impact of fake job postings is not limited by geographical boundaries.
To combat the growing threat of job scams, it's imperative to implement proactive measures that
go beyond traditional cybersecurity tactics. This involves leveraging advanced technologies such
as machine learning and natural language processing to analyze job postings and detect fraudulent
patterns. By harnessing the power of data-driven insights, we can develop more sophisticated
algorithms capable of identifying subtle indicators of fraudulent activity.
Additionally, collaboration between industry stakeholders, government agencies, and
cybersecurity experts is essential to share information and best practices for combating job scams
effectively. Moreover, raising awareness among job seekers about the red flags of fraudulent job
postings and educating them on how to verify the legitimacy of job opportunities can empower
individuals to protect themselves from falling victim to scams. By adopting a multi-faceted
approach that combines technological innovation, collaboration, and education, we can create a
safer and more trustworthy environment for online job seeking.
CHAPTER-2
LITERATURE SURVEY
Vidros et al. found statistics about many real and renowned companies who produced fake
job advertisements with ill-motives. They experimented on EMSCAD dataset using several
classification algorithms like Naive Bayes classifier, Random Forest classifier, etc. Random Forest
Classifier showed the best performance on the dataset with 89.5% classification accuracy. They
found that logistic regression performed very poorly on the dataset.
Alghamdi et al. worked on this dataset in three steps: data pre-processing, feature selection
and fraud detection using a classifier. They applied a feature selection technique to reduce the number
of attributes effectively and efficiently. SVM was used for feature selection, and an ensemble classifier
using Random Forest was used to detect fake job posts from the test data. This classifier showed
97.4% classification accuracy to detect fake job posts.
“A Study on Fake Job Posts Detection Techniques” using NLP by Akhtar et al. explores
various NLP techniques, including sentiment analysis, keyword extraction, and syntactic analysis.
“Detecting Fraudulent Job Advertisements using Machine Learning Techniques" by Founta et al.
investigates the application of machine learning techniques, including support vector machines
and random forests, for detecting fraudulent job advertisements based on textual features.
"Identifying Fraudulent Job Advertisements through Text Mining" by Okou et al. focuses
on identifying fraudulent job advertisements using text mining approaches such as TF-IDF and
topic modeling. Text mining techniques enable the extraction of relevant features from job
postings, which are then used to classify them as fraudulent or legitimate.
"Detection of Fake Job Postings using Machine Learning Algorithms" by Kumar et al. proposes a
framework for detecting fake job postings using machine learning algorithms, including decision
trees and logistic regression, trained on textual data features. These algorithms are trained on
labeled data to learn patterns indicative of fraudulent job postings and then applied to classify new
postings.
CHAPTER-3
FEASIBILITY STUDY
In order to evaluate whether the project can be done in the given time frame, we use the TEL
evaluation method, covering the feasibility of the project from a technological, economical,
and legal perspective.
CHAPTER-4
SYSTEM REQUIREMENTS
4.1 EXISTING SYSTEM:
1. Time-consuming: Job seekers can currently identify fake postings only by manually verifying the contact
details, searching for the company’s name, etc. This takes a lot of time and patience.
3. Not foolproof: It is not always possible to detect fraudulent postings, as scammers keep evolving
their schemes and tactics to present a fake job as a real one, so it is not easy to identify a fraudulent
posting manually.
4. No dedicated website or tool in place: Websites like Glassdoor, LinkedIn, ScamAdviser, etc
allow users to read reviews and ratings for a particular job posting, and to verify the trustworthiness
of a website. These are helpful, but not always accurate and don’t solely focus on identifying
fraudulent job postings. Scammers have learnt to bypass these websites to make their postings
look legitimate.
5. Fragmentation of Resources: Existing systems are fragmented, with job seekers needing to visit
multiple websites or platforms to verify the legitimacy of job postings.
6. Limited Focus: Some existing websites and tools may not prioritize fake job posting detection as
their primary focus, leading to gaps in coverage and effectiveness.
7. Dynamic Nature of Job Scams: Job scams are constantly evolving, requiring continuous updates
and adaptations to detection strategies.
4.2 PROPOSED SYSTEM:
A more robust system can address the limitations of the existing system and offer additional
features. This system uses ML algorithms and NLP approaches.
1. The EMSCAD dataset, which contains about 18,000 job posts with attributes such as job_id, title,
location, department, salary_range, company_profile, description, requirements, benefits,
telecommuting, has_company_logo, has_questions, employment_type,
required_experience, required_education, industry, function, and fraudulent (the class label), is
used to train the system.
2. The data is cleaned and pre-processed to make it prediction-ready before being provided to the classifier.
3. Feature selection and extraction: Among these 18 attributes, only 7 attributes are used, and these
are converted into categorical attributes. Telecommuting, has_company_logo,
has_questions, employment_type, required_experience, required_education and fraudulent are
changed from text values into categorical values. As an illustration, the "employment_type"
attribute values are transformed as follows: "none" is represented by 0, "full-time" by
1, "part-time" by 2, "others" by 3, "contract" by 4, and "temporary" by 5 (a small encoding
sketch is shown after this list). The primary objective behind converting these attributes into
categorical form is to classify fraudulent job advertisements without using text processing or
natural language processing techniques; this part of the study exclusively uses these categorical attributes.
4. The classifier is retrained regularly as new training data is fed into it (keyword
analysis, pattern recognition).
5. A website (UI) is provided for administrators and users, allowing them to login, use the tool
and see the prediction.
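Below is a minimal sketch (in Python with pandas) of the categorical encoding described in step 3; the file name is illustrative and the integer codes simply follow the "employment_type" example above, so treat it as an assumption rather than the project's exact code.

import pandas as pd

# Illustrative file name; the real path is set in the project code
data = pd.read_csv("Job_Posting_DataSets.csv", encoding="latin1")

# Encode "employment_type" text values as integers, as illustrated above
employment_map = {"none": 0, "full-time": 1, "part-time": 2, "others": 3, "contract": 4, "temporary": 5}
data["employment_type"] = data["employment_type"].fillna("none").str.lower().map(employment_map)

# The other selected attributes can be encoded the same way, e.g. via pandas category codes
for col in ["required_experience", "required_education"]:
    data[col] = data[col].fillna("none").astype("category").cat.codes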
Features:
1. User-Friendly Interface: A web-based interface developed with HTML and CSS using a
framework like Django. Users can easily register, log in, post their job datasets, predict a job
posting, view their profile and log out.
2. Free: This tool is absolutely free to use and has an easy-to-use UI that non-technical users can
interact with.
3. Multiple classifiers: It uses multiple algorithms such as Random Forest (RF), Naive Bayes (NB), SVM and Logistic Regression (LR) and compares their prediction accuracies.
4. Database Integration: Store logs of user details like their names, emails, passwords, city, state
and country in a database for persistence.
5. Error Handling: Provide clear error messages to enhance the user experience.
4.3 ADVANTAGES OF THE PROPOSED SYSTEM:
1. Improved accuracy: Machine learning algorithms can be trained on large datasets to detect
patterns that are indicative of fake job postings.
2. Efficiency: Machine learning models can process large volumes of data much more quickly
and efficiently than humans.
3. Scalability: Machine learning models can be easily scaled up to handle large volumes of job
postings.
4. Adaptability: Machine learning models can adapt to new fraud techniques and patterns as they
emerge, making them more effective at identifying fake job postings.
4.4 REQUIREMENTS DEFINITION
After a thorough and continuous analysis of the problems in the existing system, we
are now familiar with the requirements of the proposed system. The requirements that the
system needs are categorized into functional and non-functional requirements. These requirements
are listed below:
● User friendly
● The system should provide better accuracy.
● To perform efficiently with better throughput and response time
CHAPTER-5
SYSTEM DESIGN
A UML diagram is designed to let developers and customers view a software system from
different perspectives and in varying degrees of abstraction. In its simplest form, a use case can be
described as a specific way of using the system from a user’s (actor’s) perspective. A more
detailed description might characterize a use case as:
● a pattern of behavior the system exhibits
● a sequence of related transactions performed by an actor and the system
● delivering something of value to the actor
Use cases provide a means to
● capture system requirements
● communicate with the end users and domain experts
● Test the system
Use cases are best discovered by examining the actors and defining what the actor will be able to
do with the system. Since all the needs of a system typically cannot be covered in one use case, it
is usual to have a collection of use cases. Together this use case collection specifies all the ways
of using the system. A UML system is represented using five different views that describe the
system from a distinctly different perspective. Each view is defined by a set of diagrams, which is
as follows:
User Model View
● This view represents the system from the user’s perspective.
● The analysis representation describes a usage scenario from the end
user’s perspective.
Structural model view
● In this model, the data and functionality come from inside the system.
● This model view models the static structures.
Behavioral Model View
● It represents the dynamic (behavioral) aspects of the system, depicting the interactions among
the various structural elements described in the user model and structural model views.
Implementation Model View
● In this, the structural and behavioral parts of the system are represented as they are to be built.
Environmental Model View
● In this, the structural and behavioral aspects of the environment in which the system is to
be implemented are represented.
UML is specifically constructed through two different domains:
● UML analysis modeling, which focuses on the user model and structural model views of
the system.
● UML design modeling, which focuses on the behavioral model, implementation model and
environmental model views of the system.
Components of a Use Case Diagram:
1. Actors:
• Actors represent the users or external systems interacting with the system being modelled. They
are depicted as stick figures.
• Actors can be human users, other systems, or external entities that interact with the system.
2. Use Cases:
• Use cases represent the specific functionalities or features provided by the system.
• They describe a sequence of actions or interactions between the system and its actors to achieve a
specific goal.
3. Relationships:
Principles of Use Case Diagrams:
1. Simplicity: Use case diagrams should be simple and easy to understand and avoid excessive
detail and focus on essential interactions.
2. Relevance: Include only relevant actors and use cases. Prioritize use cases based on their
importance to system functionality.
3. Clarity: Use clear and descriptive names for actors and use cases. Ensure that relationships
are clearly defined and labelled.
• Use Cases: Login, Train and Test Datasets, View Trained and Tested Accuracy in Bar Chart,
View Trained and Tested Accuracy results, Register and Login, Predict job post, View your
profile, View all remote users.
The above use case diagram represents the actors and the process in the application.
5.1.2 ACTIVITY DIAGRAM:
The activity diagram depicts the flow from one activity to another. It shows the
workflow from a start point to the finish point, detailing the many decision paths that exist in the
progression of events contained in the activity. The following section describes the elements that
constitute an activity diagram:
1. Activities
An activity is shown as a round-cornered rectangle enclosing all the actions, control flows
and other elements that make up the activity.
2. Actions
An action represents a single step within an activity. Actions are denoted by round-cornered
rectangles.
3. Control Flow
A control flow shows the flow of control from one action to the next. Its notation is a line with
an arrowhead.
4. Initial Node
The initial node, shown as a small filled circle, represents the starting point of the activity.
5. Final Node
The final node, shown as a filled circle inside a larger circle, represents the end point of the activity.
• Activities: Registration, Login, Submit fake job post details, Applying feature extraction,
Classify, Predict the job post, Verify true positive result.
Figure. 5.1.2: Activity Diagram
5.1.3 CLASS DIAGRAM:
The class diagram represents the classes and the relations between the classes in the application. A class
diagram consists of classes, attributes and relations, and it describes the attributes and operations of the classes.
Class diagrams provide a high-level overview of a system’s design, helping to communicate and
document the structure of the software. They are a fundamental tool in object-oriented design and
play a crucial role in the software development lifecycle.
● Class Name: The name of the class is typically written in the top compartment of the class
box and is centered and bold.
● Attributes: Attributes, also known as properties or fields, represent the data members of the
class. They are listed in the second compartment of the class box and often include the
visibility (e.g., public, private) and the data type of each attribute.
● Methods: Methods (operations) represent the behavior of the class. They are listed in the third
compartment of the class box, usually with their visibility and parameters.
● Visibility Notation: Visibility markers indicate the accessibility of attributes and methods, for
example + (public), - (private) and # (protected).
5.1.4 SEQUENCE DIAGRAM:
A sequence diagram is a Unified Modeling Language (UML) diagram that illustrates the sequence
of messages between objects in an interaction. It describes the flow of messages in the system and
how and in what order the objects in a system function. It consists of Remote User, Service
Provider (Admin) and Web Server.
1. Actors:
An actor in a UML diagram represents a type of role where it interacts with the system and
its objects. It is important to note here that an actor is always outside the scope of the system
we aim to model using the UML diagram. We use actors to depict various roles including
human users and other external subjects. We represent an actor in a UML diagram using a
stick person notation. We can have multiple actors in a sequence diagram.
2. Lifelines:
A lifeline represents an individual participant in the interaction. It is drawn as a rectangle (or
actor symbol) with a dashed vertical line below it, showing the participant's existence over the
period of the interaction.
3. Messages:
Synchronous messages: A synchronous message waits for a reply before the interaction can
move forward. The sender waits until the receiver has completed the processing of the
message. The caller continues only when it knows that the receiver has processed the
previous message i.e. it receives a reply message. We use a solid arrow head to represent a
synchronous message.
Asynchronous Messages: An asynchronous message does not wait for a reply from the
receiver. The interaction moves forward irrespective of the receiver processing the previous
message or not. We use a lined arrow head to represent an asynchronous message.
4 . Create message:
We use a Create message to instantiate a new object in the sequence diagram. There are
situations when a particular message call requires the creation of an object. It is represented
with a dotted arrow labelled with the word "create" to specify that it is the create message
symbol.
5. Self Message:
Certain scenarios might arise where the object needs to send a message to itself. Such
messages are called Self Messages and are represented with a U-shaped arrow.
5.1.5 DATA FLOW DIAGRAM:
A data flow diagram (DFD) shows the way information flows through a process or system. It includes
data inputs and outputs, data stores, and the subprocesses the data moves through. It is a
graphical representation of the “flow” of data through a system, modelling its process aspects
and helping to visualize the processes in the system. DFDs are developed at different levels: 0, 1 and 2.
A DFD is also called a “bubble chart”.
Components of a DFD:
1. External Entities:
External entities represent sources or destinations of data outside the system being modeled.
They are depicted as squares or rectangles. Examples include users, other systems, or external
data sources.
2. Processes:
Processes represent activities or functions that transform data inputs into outputs. They
describe the actions or operations performed within the system. Processes are represented as
circles or ovals.
3. Data Stores:
Data stores represent repositories or storage locations where data is held within the system.
They can be databases, files, or other data repositories. Data stores are depicted as rectangles
with two parallel lines at the top and bottom.
4. Data Flows:
Data flows represent the movement of data between processes, data stores, and external
entities. They show how data is input, processed, and output within the system.
Data flows are depicted as arrows indicating the direction of data movement.
Level 0:
Level 1:
5.1.6 ARCHITECTURE:
Data Collection: Data is collected from the EMSCAD dataset, which contains about 18,000 job descriptions,
of which 866 are fraudulent.
Data Pre-Processing & Cleaning: Unnecessary information such as duplicates and punctuation is removed
("How are you?" -> "How are you"). It involves activities like the following (a short pre-processing sketch is given after this list):
Lemmatization: To reduce tokens to a standard dictionary form. E.g. "running" -> "run", "ate" -> "eat".
Stemming: Reducing a word to its stem/root form by removing suffixes. E.g. "Celebrating",
"Celebrated" -> "Celebrate".
Tokenization: Splitting text into individual words or tokens. E.g. "This is new" -> 'This', 'is', 'new'.
Stopwords removal: To remove words that don't carry much meaning, such as "the", "and", and "a".
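The sketch below, using NLTK, is an indicative example of these pre-processing steps only; the preprocess helper is an assumption, while the 'text' column name matches the data-preparation code shown later in this report.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lower-case, tokenize, drop punctuation/stopwords and lemmatize each token
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

# Example: data["text"] = data["text"].apply(preprocess)
print(preprocess("We are celebrating the launch of an exciting new position!"))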
Machine Learning Models: Train machine learning models on labeled data to identify patterns
associated with fake job postings. Use algorithms such as decision trees, random forests, etc.
Data Splitting: Dataset is divided into two parts: 70% training data and 30% test data. The test
dataset is given as input to the model and the output is predicted.
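A minimal scikit-learn sketch of this 70/30 split and of evaluating a model on the held-out test data follows; synthetic features stand in for the extracted job-posting features, so this is illustrative rather than the project's exact pipeline.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted features (X) and the fraudulent label (y)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=42)

# 70% training data, 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The test dataset is given as input to the model and the output is predicted
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))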
5.1.7 ALGORITHMS:
1. RANDOM FOREST:
This algorithm consists of many decision trees. The random forest takes the prediction from each
tree and, based on the majority vote of those predictions, it predicts the final output. The
‘forest’ generated is trained through bagging (bootstrap aggregating). The steps are:
a. Random samples (bootstrap samples) are drawn from the given dataset.
b. A decision tree is constructed for every sample, and each tree produces a prediction for
its sample.
Random Forest is a popular machine learning algorithm used for both classification
and regression tasks. It belongs to the ensemble learning techniques, which combine
multiple individual models to improve predictive performance.
1. Decision Trees: Random Forest is built upon the concept of decision trees.
Decision trees are hierarchical structures that recursively split the data into subsets
based on the features, aiming to create simple decision rules.
2. Bootstrap Sampling: Multiple bootstrap samples of the training data are created by sampling
with replacement, and a separate tree is grown on each sample.
3. Random Feature Selection: At each node of the decision tree, Random Forest
randomly selects a subset of features from the available features. This helps in
reducing correlation among the trees and promotes diversity in the ensemble.
4. Decision Tree Construction: For each bootstrap sample and at each node of the
tree, the algorithm selects the best split among the randomly chosen features, based
on criteria such as Gini impurity or entropy for classification tasks.
5. Ensemble of Trees: After constructing multiple decision trees using different
bootstrap samples and feature subsets, Random Forest combines the predictions of
all individual trees to make the final prediction. For classification tasks, it employs
a majority voting mechanism, while for regression tasks, it averages the predictions
of all trees.
Due to its versatility, robustness, and ease of use, Random Forest is widely used in
various domains, including finance, healthcare, and marketing, for tasks such as
classification, regression, and feature selection.
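As an indicative example only (not the project's exact configuration), a Random Forest can be trained with scikit-learn as follows; the bootstrap sampling, random feature selection and majority voting described above are handled internally by the library.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for encoded job-posting features and fraud labels
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of decision trees in the forest
    max_features="sqrt",   # random subset of features considered at each split
    criterion="gini",      # Gini impurity as the split criterion
    random_state=0,
)
rf.fit(X_train, y_train)
print("Random Forest test accuracy:", rf.score(X_test, y_test))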
2. NAIVE BAYES:
The Naive Bayes algorithm is a simple yet powerful classification algorithm used in
machine learning. It's based on Bayes' theorem, which describes the probability of an
event occurring given prior knowledge of conditions related to the event.
Here's a brief overview of how the Naive Bayes algorithm works:
1. Bayes' Theorem: The algorithm is based on Bayes' theorem, which relates the posterior
probability \(P(A|B)\) of a class \(A\) given the features \(B\) to the prior \(P(A)\) and the
likelihood \(P(B|A)\) (the formula is given after this list).
2. Naive Assumption: The "naive" assumption in Naive Bayes is that the features are
conditionally independent given the class label. In other words, the presence of one
feature does not affect the presence of another feature. Although this assumption may
not hold true in many real-world scenarios, Naive Bayes often performs well in practice
despite this simplification.
3. Training: During the training phase, the algorithm calculates the prior probabilities
\(P(A)\) for each class and the likelihoods \(P(B|A)\) for each feature given each class.
These probabilities are estimated from the training data.
4. Prediction: To make predictions on new data, the algorithm uses Bayes' theorem to
calculate the posterior probability \(P(A|B)\) for each class given the input features
\(B\). The class with the highest posterior probability is then assigned as the predicted
class.
5. Types of Naive Bayes: There are different variants of the Naive Bayes algorithm,
such as:
- Multinomial Naive Bayes: Suitable for discrete features, often used in text
classification tasks.
- Bernoulli Naive Bayes: Assumes that features are binary (e.g., presence or absence
of a feature).
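For reference, Bayes' theorem for a class \(A\) and a feature vector \(B = (b_1, \dots, b_n)\), together with the naive conditional-independence assumption described above, can be written as:

\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \qquad P(B \mid A) \;\approx\; \prod_{i=1}^{n} P(b_i \mid A) \]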
Naive Bayes is known for its simplicity, efficiency, and scalability, especially with
high-dimensional data. It's commonly used in text classification, spam filtering,
sentiment analysis, and other classification tasks where the independence assumption
holds reasonably well. However, it may not perform as well as more complex
algorithms when the independence assumption is violated or when dealing with highly
correlated features.
3. LOGISTIC REGRESSION:
Logistic regression is a statistical method used for binary classification. Here's a brief overview
of how it works:
1. Linear Combination: The algorithm first computes a linear combination \(z\) of the input
features \(x_1, ..., x_n\) and the learned weights \(w_0, w_1, ..., w_n\).
2. Logistic Function (Sigmoid): The linear combination \(z\) is then passed through a
logistic function (also known as the sigmoid function) to squash the output into the
range [0, 1]. This function ensures that the output represents a valid probability.
3. Probability Estimation: The output of the logistic function represents the estimated
probability that the input belongs to the positive class (class 1 in binary classification).
If the estimated probability is greater than or equal to a predefined threshold (usually
0.5), the input is classified as belonging to the positive class; otherwise, it is classified
as belonging to the negative class (class 0).
4. Training: During the training phase, logistic regression learns the optimal weights
\(w_0, w_1, ..., w_n\) by minimizing a loss function (e.g., logistic loss or cross-entropy
loss) using an optimization algorithm such as gradient descent. The model is trained
on labeled data, where the true class labels are known.
5. Evaluation: Once trained, the logistic regression model can make predictions on
new, unseen data by computing the probability of the input belonging to the positive
class and applying the threshold to classify it accordingly.
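For reference, the linear combination and the sigmoid (logistic) function referred to in steps 1 and 2 can be written as:

\[ z = w_0 + w_1x_1 + w_2x_2 + \cdots + w_nx_n, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid x) = \sigma(z) \]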
4. SUPPORT VECTOR MACHINE (SVM):
Here are some key points about the Support Vector Machine algorithm:
1. Hyperplane: SVM works by finding a hyperplane (a decision boundary) that separates the data
points of different classes in the feature space.
2. Margin Maximization: SVM aims to find the hyperplane that best separates the data
points of different classes while maximizing the margin, which is the distance between
the hyperplane and the nearest data points (support vectors). Maximizing the margin
helps improve the generalization ability of the model and reduces the risk of overfitting.
4. Sensitivity to Outliers: SVM is sensitive to outliers as they can influence the position
and orientation of the decision boundary. Techniques such as using soft-margin SVM
or robust kernel functions can help mitigate the impact of outliers on the model's
performance.
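To illustrate how the four classifiers discussed in this chapter can be compared on the same TF-IDF text features (as reported in the results chapter), here is a rough scikit-learn sketch; the CSV file name and the presence of a combined 'text' column and a 'fraudulent' label are assumptions based on the code shown later in this report.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Assumed CSV with a combined 'text' column and a 'fraudulent' label (0 = real, 1 = fake)
data = pd.read_csv("Job_Posting_DataSets.csv", encoding="latin1").fillna(" ")
X = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(data["text"])
y = data["fraudulent"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": LinearSVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))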
CHAPTER-6
SOFTWARE IMPLEMENTATION
Features Of Python:
1. Easy to code:
Python is a high-level programming language and is very easy to learn compared to other
languages like C, C#, JavaScript and Java. It is very easy to code in Python, and anybody can
learn the Python basics in a few hours or days. It is also a developer-friendly language.
3. Object-Oriented Language:
One of the key features of Python is object-oriented programming. Python supports object-
oriented concepts such as classes, objects and encapsulation.
4. High-Level Language:
5. Python is Portable language:
Python is also a portable language. For example, if we have Python code written for Windows
and we want to run it on another platform such as Linux, Unix or macOS, we do not
need to change it; we can run the same code on any platform.
6. Interpreted Language:
Python is an interpreted language because Python code is executed line by line. Unlike
languages such as C, C++ and Java, there is no separate compilation step, which makes it easier to
debug our code. The source code of Python is converted into an intermediate form called bytecode.
6.1.4 NLTK:
NLTK is a leading platform for building Python programs to work with human language
data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet,
along with a suite of text processing libraries for classification, tokenization, stemming, tagging,
parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active
discussion forum.
6.1.5 Pandas:
Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations for manipulating
numerical tables and time series. It is free software released under the three-clause BSD license.
The name is derived from the term "panel data", an econometrics term for data sets that include
observations over multiple time periods for the same individuals. Its name is a play on the phrase
"Python data analysis" itself.
Pandas is mainly used for data analysis and associated manipulation of tabular data in
DataFrames. Pandas allows importing data from various file formats such as comma-separated
values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel. Pandas allows
various data manipulation operations such as merging, reshaping, selecting, as well as data
cleaning, and data wrangling features. The pandas library is built upon another library, NumPy,
which is oriented toward working efficiently with arrays rather than DataFrames.
6.1.6 HTML:
HTML, an abbreviation for HyperText Markup Language, is the standard markup
language used to create web pages. HTML is written in the form of HTML elements consisting of
tags enclosed in angle brackets (like <html>). HTML tags most commonly come in pairs like <h1>
and </h1>, although some tags represent empty elements and so are unpaired, for example <img>.
The first tag in a pair is the start tag, and the second tag is the end tag (they are also called opening
tags and closing tags).
The purpose of a web browser is to read HTML documents and compose them into visible or
audible web pages. The browser does not display the HTML tags but uses the tags to interpret the
content of the page. HTML elements form the building blocks of all websites. HTML allows
images and objects to be embedded and can be used to create interactive forms. It provides a means
to create structured documents by denoting structural semantics for text such as headings,
paragraphs, lists, links, quotes, and other items. It can embed scripts written in languages such as
JavaScript which affects the behavior of HTML web pages.
6.1.7 CSS:
CSS stands for Cascading Style Sheet. CSS is used to design HTML tags. CSS is a widely
used language on the web. HTML, CSS, and JavaScript are used for web designing. It helps web
designers to apply a style to HTML tags. It can also be used to allow the web page to display
differently depending on the screen size or device on which it is being viewed.
With plain HTML you define the colors and sizes of text and tables throughout your pages. If you
want to change a certain element, you therefore have to work your way through the document
and change it. With CSS you define the colors and sizes in "styles"; then, as you write your
documents, you refer to those styles. Another big advantage is that CSS offers much more detailed
attributes than plain HTML for defining the look and feel of your site.
6.1.8 JavaScript:
JavaScript is a lightweight scripting language that runs in the web browser. Together with HTML
and CSS, it is used to make web pages dynamic and interactive, for example by validating form
input and updating page content without reloading the page.
6.1.9 DJANGO:
Django is a high-level Python web framework that facilitates rapid development and clean,
pragmatic design. One of the key features of Django is its ability to seamlessly integrate the
frontend and backend components of a web application. Key features of Django include its
powerful ORM (Object-Relational Mapping) system, which simplifies database management and
interactions, along with a robust admin interface that automates administrative tasks.
Additionally, Django's URL routing mechanism enables clean and concise URL patterns,
enhancing the overall readability of web applications. Moreover, Django boasts a comprehensive
authentication system, aiding in the implementation of user authentication and authorization
functionalities. Overall, Django's emphasis on rapid development, security, and scalability,
combined with its rich feature set and thriving community, solidifies its position as one of the
leading web frameworks for building robust and maintainable web applications.
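As a minimal, hedged illustration of the ORM and URL-routing features mentioned above (the field list and URL names are guesses based on the views shown later in this chapter, not the project's exact code):

# models.py - a simple ORM model for registered users (illustrative only)
from django.db import models

class ClientRegister_Model(models.Model):
    username = models.CharField(max_length=100)
    email = models.EmailField()
    password = models.CharField(max_length=100)
    phoneno = models.CharField(max_length=20)
    country = models.CharField(max_length=50)
    state = models.CharField(max_length=50)
    city = models.CharField(max_length=50)

# urls.py - clean URL patterns mapping requests to views (illustrative only)
from django.urls import path
from Remote_User import views

urlpatterns = [
    path('login/', views.login, name='login'),
    path('Register1/', views.Register1, name='Register1'),
    path('Search_DataSets2/', views.Search_DataSets2, name='Search_DataSets2'),
]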
#Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("C:/Users/PERSONAL/Desktop/New folder (5)/Fake-Job-Prediction-Model-main/Job_Posting_DataSets.csv", encoding='latin1')
data.drop(['job_id', 'salary_range', 'telecommuting', 'has_company_logo', 'has_questions'], axis=1, inplace=True)
data.fillna(' ', inplace=True)
#Create independent and Dependent Features
columns = data.columns.tolist()  # store all columns of data into the columns variable
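The listing stops here; a plausible continuation of the "independent and dependent features" step (an assumption, since the report does not show it) would be:

# Separate the independent features from the dependent class label (assumed continuation)
target = 'fraudulent'
X = data[[c for c in columns if c != target]]   # independent features
y = data[target]                                # dependent label: 0 = real, 1 = fake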
DATA PREPARATION:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
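# The following step is not shown in the report: the 'text' column used below has to be
# built first, e.g. by joining the textual attributes (an assumed step, not project code).
text_cols = ['title', 'location', 'company_profile', 'description', 'requirements', 'benefits', 'industry']
data['text'] = data[text_cols].astype(str).agg(' '.join, axis=1)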
stop_words = set(stopwords.words('english'))  # stop_words must be defined before it is used below
data['text'] = data['text'].apply(lambda x: x.lower())
data['text'] = data['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
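The fitted vectorizer loaded later by the web application ('Remote_User/vectorizer.pkl') is produced by a step not shown in the report; a hedged sketch of how it might be created is:

from sklearn.feature_extraction.text import TfidfVectorizer
import joblib

# Fit a TF-IDF vectorizer on the cleaned text and persist it for the Django views (assumed step)
vect = TfidfVectorizer(stop_words='english', max_features=5000)
features = vect.fit_transform(data['text'])
joblib.dump(vect, 'Remote_User/vectorizer.pkl')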
Django project entry point (manage.py):
import os
import sys

def main():
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'aComparative_study_onFake_jobPost.settings')
    try:
        from django.core.management import execute_from_command_line
    except ImportError as exc:
        raise ImportError(
            "Couldn't import Django. Are you sure it's installed and "
            "available on your PYTHONPATH environment variable?"
        ) from exc
    execute_from_command_line(sys.argv)

if __name__ == '__main__':
    main()
1. Remote User:
<div class="topnav">
</div>
<div class="mainholder">
  {% block userblock %}
  {% endblock %}
</div>
2. Service Provider:
<div class="topnav">
<a href="{% url 'train_model' %}">Train and Test Data Sets </a>
<a href="{% url 'likeschart' 'bar' %}">View Trained and Tested Accuracy in Bar Chart </a>
<a href="{% url 'charts1' 'line' %}"> View Trained and Tested Accuracy Results </a>
</div>
1. Remote User:
import joblib

# Load the trained classifier and the fitted text vectorizer saved during training
dt_from_joblib = joblib.load('Remote_User/filename.pkl')
vect = joblib.load('Remote_User/vectorizer.pkl')

def login(request):
    username = request.POST.get('username')
    password = request.POST.get('password')
    try:
        enter = ClientRegister_Model.objects.get(username=username, password=password)
        request.session["userid"] = enter.id
        return redirect('Add_DataSet_Details')
    except:
        pass
    return render(request, 'RUser/login.html')
def Register1(request):
    if request.method == "POST":
        username = request.POST.get('username')
        email = request.POST.get('email')
        password = request.POST.get('password')
        phoneno = request.POST.get('phoneno')
        country = request.POST.get('country')
        state = request.POST.get('state')
        city = request.POST.get('city')
        ClientRegister_Model.objects.create(username=username, email=email,
            password=password, phoneno=phoneno, country=country, state=state, city=city)
        return render(request, 'RUser/Register1.html')  # assumed return after successful registration
    else:
        return render(request, 'RUser/Register1.html')
def Search_DataSets2(request):
    if request.method == "POST":
        title = request.POST.get('title')
        location = request.POST.get('location')
        company_profile = request.POST.get('company_profile')
        description = request.POST.get('description')
        requirements = request.POST.get('requirements')
        benefits = request.POST.get('benefits')
        industry = request.POST.get('industry')

        # Combine the submitted fields into one text string, vectorize it and predict
        text = [title + " " + location + " " + company_profile + " " + description + " "
                + requirements + " " + benefits + " " + industry]
        input_data_features = vect.transform(text)
        prediction = dt_from_joblib.predict(input_data_features)
        print(prediction)
        if prediction[0] == 1:
            result = 'Fraudulent Job'
        else:
            result = 'Real Job'
        X_live = [[job_id,
                   title,
                   location,
                   department,
                   salary_range,
                   company_profile,
                   description,
                   requirements,
                   benefits,
                   telecommuting,
                   has_company_logo,
                   has_questions,
                   employment_type,
                   required_experience,
                   required_education,
                   industry,
                   function,
                   fraudulent
                   ]]
2. Service Provider:
def serviceproviderlogin(request):
    if request.method == "POST":
        admin = request.POST.get('username')
        password = request.POST.get('password')
        detection_accuracy.objects.all().delete()
        return redirect('View_Remote_Users')
    return render(request, 'SProvider/serviceproviderlogin.html')
def View_Remote_Users(request):
    obj = ClientRegister_Model.objects.all()
    return render(request, 'SProvider/View_Remote_Users.html', {'objects': obj})

def charts(request, chart_type):
    chart1 = detection_ratio.objects.values('names').annotate(dcount=Avg('ratio'))
    return render(request, "SProvider/charts.html", {'form': chart1, 'chart_type': chart_type})  # assumed return

def charts1(request, chart_type):
    chart1 = detection_accuracy.objects.values('names').annotate(dcount=Avg('ratio'))
    return render(request, "SProvider/charts1.html", {'form': chart1, 'chart_type': chart_type})  # assumed return
title: {
    text: ""
},
data: [{
    {% if chart_type == "line" %}
    type: "line",
    {% elif chart_type == "pie" %}
    type: "pie",
    {% else %}
    type: "spline",
    {% endif %}
    startAngle: 240,
    yValueFormatString: "##0.00\"%\"",
    dataPoints: [
        {% for o in form %}
        {% endfor %}
    ]
}]
chart.render();
}
CHAPTER-7
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of looking for any flaws
or weaknesses in a piece of work. It is the process of testing software to make sure that it satisfies
user expectations and meets requirements without failing in an unacceptable way. There are
various types of tests. Each test type addresses a specific testing requirement.
Software components that have been integrated are tested in integration tests to see if they
actually operate as a single program. Event-driven testing is more concerned with the core outcome
of fields or screens. Unit tests that were successful show that the individual components were
satisfactory, but integration tests show that the combination of the components is correct and
consistent. Integration testing is specifically made to draw attention to problems that arise when
combining components.
7.1.3 Functional test:
Functional tests provide methodical evidence that the tested functions are available and meet
all technical and business requirements as well as those listed in the system documentation and
user manuals. Functional testing is centered on the following items:
Organization and preparation of functional tests are focused on requirements, key functions,
or special test cases. In addition, systematic coverage pertaining to identifying business process
flows, including data fields, predefined processes, and successive processes, must be considered
for testing. Before functional testing is complete, additional tests are identified and the effective
value of current tests is determined.
System testing ensures that the entire integrated software system meets requirements. It tests
a configuration to ensure known and predictable results. An example of system testing is the
configuration-oriented system integration test. System testing is based on process descriptions and
flows, emphasizing pre-driven process links and integration points.
White box testing is a form of testing in which the tester has knowledge of the inner workings,
structure and language of the software, or at least its purpose. It is employed to test regions that
are inaccessible at the black box level.
White box testing, also known as structural testing or glass box testing, is a software testing
technique that examines the internal structure of the software being tested. Unlike black box
testing, which focuses on testing the functionality of the software without knowledge of its internal
workings, white box testing requires knowledge of the internal code and logic.
In white box testing, testers design test cases based on the internal structure of the software,
including code paths, branches, and conditions. The goal is to ensure that all paths through the code
are tested, and that the software behaves as expected under different conditions. Techniques used
in white box testing include code coverage analysis, path testing, and mutation testing.
White box testing is particularly useful for identifying errors or bugs in the code, as well as
ensuring code quality and reliability. It is commonly used during the development phase of
software engineering, although it can also be applied during maintenance or debugging processes.
The term "black box testing" refers to software testing done without any prior knowledge
of the inner workings, architecture, or language of the module being tested. Black box tests, like
the majority of other types of tests, must be created from a clear source document, such as a
specification or requirements document. It is a type of testing in which the software being tested is
treated as a black box that cannot be "seen" into: the test provides inputs and responds to outputs
without considering how the software works internally.
Black box testing is a software testing technique where the internal structure, design, and
implementation of the software being tested are not known to the tester. Instead, the tester focuses
solely on the external behavior of the software, treating it as a "black box" whose internal workings
are not visible.
In black box testing, testers design test cases based on the software's specifications, requirements,
and functionality. They do not have access to the source code or knowledge of the underlying
algorithms or logic. The goal is to verify that the software behaves correctly and produces the
expected outputs based on different inputs and conditions.
Black box testing is useful for validating the functionality, usability, and reliability of the software
from an end-user perspective. It helps identify issues such as incorrect or missing features, user
interface flaws, and compatibility issues. Common techniques used in black box testing include
equivalence partitioning, boundary value analysis, and error guessing.
Black box testing is typically performed at various stages of the software development life cycle,
including requirements analysis, system testing, and acceptance testing. It complements white box
testing by providing a different perspective on the software's quality and ensuring thorough test
coverage.
Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as two
distinct phases.
Unit testing is a software testing technique where individual units or components of a software
system are tested in isolation to verify that they behave as expected. A unit can be a function,
method, class, or module, depending on the granularity of the software being tested.
In unit testing, test cases are designed to exercise specific functionalities or behaviors of the unit
under test. These test cases are typically automated, allowing them to be executed repeatedly and
efficiently. Unit tests are written to cover various scenarios, including normal inputs, boundary
conditions, and error cases.
The main goals of unit testing are to:
1. Verify that each unit of the software performs correctly according to its specifications.
2. Detect and fix defects early in the development process, reducing the cost of debugging
and maintenance.
3. Provide a safety net for refactoring and code changes, ensuring that existing
functionalities remain intact.
Unit testing frameworks, such as JUnit for Java, pytest for Python, and NUnit for .NET,
provide tools and utilities to facilitate the creation, execution, and organization of unit tests.
Incorporating unit testing into the software development process promotes code quality, reliability,
and maintainability.
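As a small, hypothetical illustration of the pytest style mentioned above (the classify_posting helper and its rules are assumptions for demonstration, not the project's actual prediction code):

# test_prediction.py - run with: pytest test_prediction.py
def classify_posting(text):
    # Toy stand-in for the project's prediction step
    suspicious = ["wire transfer", "registration fee", "quick money"]
    return "Fraudulent Job" if any(s in text.lower() for s in suspicious) else "Real Job"

def test_fraudulent_posting_detected():
    assert classify_posting("Pay a registration fee to start earning quick money") == "Fraudulent Job"

def test_legitimate_posting_passes():
    assert classify_posting("Software engineer role, full-time, on-site in Hyderabad") == "Real Job"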
Test strategy and approach
Field testing will be performed manually, and functional tests will be documented in detail.
Test objectives
Features to be tested
● Verify that the entries are of the correct format
● No duplicate entries should be allowed
Software integration testing is the incremental testing of two or more integrated software
components on a single platform, in order to produce failures brought on by interface flaws. The
task of the integration test is to check that components or software applications, e.g., components
in a software system or, one step up, software applications at the company level, interact without error.
7.6 TESTING METHODOLOGIES
The following are the testing methodologies:
● Unit Testing.
● Integration Testing.
● User Acceptance Testing.
● Output Testing.
Unit testing focuses verification efforts on the smallest unit of software design,
which is the module. In order to ensure complete coverage and maximum error detection,
unit testing exercises particular paths in a module's control structure. This test focuses on
each module separately to make sure that they all work together properly. Hence, the name
is Unit Testing.
Each module is tested separately during this testing, and the module interfaces are checked
for consistency with the design specification. All significant processing routes are
examined for the anticipated outcomes. Every error-handling path is tested as well.
Integration testing addresses the issues associated with the dual problems of
verification and program construction. After the software has been integrated, a series of
high-level tests are conducted. The main objective of this testing process is to take unit-
tested modules and build a program structure that has been dictated by design.
This method begins the construction and testing with the modules at the lowest level
in the program structure. Since the modules are integrated from the bottom up, processing
required for modules subordinate to a given level is always available, and the need for stubs
is eliminated. The bottom-up approach tests each module individually, then each module is
integrated with the main module and tested for functionality.
User Acceptance of a system is the key factor for the success of any system. The
system under consideration is tested for user acceptance by constantly keeping in touch
with the prospective system users at the time of development and making changes wherever
required. The system developed provides a friendly user interface that can be easily
understood even by a person who is new to the system.
3. DevOps: In DevOps practices, output testing is automated and integrated into the
continuous integration/continuous deployment (CI/CD) pipeline. Automated tests are
executed automatically upon code commits or deployments to ensure that changes do not
introduce regressions or errors in the output. This approach enables rapid and reliable
delivery of software while maintaining high quality.
Regardless of the methodology used, output testing aims to ensure that the software
meets the desired functionality, performance, and quality standards, ultimately delivering
value to users and stakeholders. It involves designing appropriate test cases, executing tests,
and analyzing results to validate the correctness of the output produced by the software.
Validation checks are performed on the following fields.
1. Text Field: The text field can contain only a number of characters less than or equal to its size.
An incorrect entry always flashes an error message.
2. Numeric Field: The numeric field can contain only numbers from 0 to 9. An entry of any other
character flashes an error message. Testing involves executing the program with real data; the
existence of any program defect is inferred from the output.
1. User Acceptance Testing (UAT): This involves testing the software from the perspective
of end-users to ensure that it meets their requirements and expectations. UAT is often
conducted by actual users or representatives of the target audience in a real-world
environment.
2. Functional Testing: Functional testing verifies that the software performs the functions
and tasks specified in the requirements and design documents. Test cases are designed to
cover various functional aspects of the software to ensure its correctness and completeness.
3. Usability Testing: Usability testing assesses the ease of use, intuitiveness, and user-
friendliness of the software interface. It evaluates factors such as navigation, accessibility,
and user satisfaction to ensure a positive user experience.
4. Compatibility Testing: Compatibility testing ensures that the software works correctly
across different platforms, devices, browsers, and operating systems. It verifies that the
software is compatible with various configurations and environments commonly used by
its target audience.
6. Performance Testing: Performance testing evaluates the software's performance,
scalability, and responsiveness under different load conditions. It ensures that the software
can handle expected levels of user traffic and data volume without degradation in
performance.
Overall, validation testing plays a critical role in ensuring that the software meets quality
standards, delivers value to users, and satisfies the business requirements effectively and
efficiently. It helps identify any discrepancies between the expected and actual behavior of
the software, enabling necessary adjustments and improvements to be made before
deployment.
CHAPTER-8
RESULTS AND OUTPUT SCREENS
Figure. 8.4.1: Real/Fraudulent Job Post Prediction screen
2. View your Profile: This allows the user to view their profile details like their name, email,
mobile number, password, city, state and country.
Figure 8.5.1: Accuracy of the algorithms used screen
2. View Trained and Tested Accuracy in Bar Chart:
3. View Trained and Tested Accuracy Results (in Pie and Line Charts):
Figure 8.5.3: Line and Pie Chart of Algorithms screen
4. View all remote users: The admin can view all the registered users with their name, email, mobile
number, country, state and city.
CHAPTER-9
CONCLUSION
Job scams have become a great concern all over the world at present. In this project, we
have analyzed the impact of job scams, which is a very promising area of research that
creates many challenges in detecting fraudulent job posts. The EMSCAD dataset, which contains
job posts collected from real situations and published between 2012 and 2014, has been
experimented with. We have experimented with machine learning algorithms (SVM, Naive Bayes,
Random Forest and Logistic Regression). This work shows a comparative study on the evaluation
of traditional machine learning classifiers. Logistic Regression is seen to have the highest
classification accuracy among the traditional machine learning algorithms.
This is a versatile and practical tool in the fight against fraudulent jobs. It helps users avoid
financial losses and protects their personal information. This will guide job-seekers to get only
legitimate offers from companies. It adds value to the reputation and credibility of online job
platforms and ensures that legitimate job opportunities receive the attention they deserve. It has a
user-friendly interface and high accuracy, as we have used multiple algorithms to predict the
output.
In essence, this project offers a valuable solution to the challenges posed by using online job
recruitment portals, promoting a more efficient, secure, and focused digital environment for users
across various contexts, whether freshers or senior professionals.
CHAPTER-10
FUTURE ENHANCEMENT
This project has a wide scope as it is not intended for a particular platform, but can be applied to
any job recruitment platform through the job description or the details provided.
1. A mobile application could be developed catering to users who primarily access job platforms
through mobile devices.
2. Integration of image processing, to detect postings that may contain misleading multimedia
content. This could be used for job posts made via Instagram, LinkedIn, Twitter, WhatsApp, etc.
4. Aim to deploy this alongside job portals to flag fraudulent job postings. It could be used as a tool
alongside job portals like Naukri, LinkedIn, Glassdoor, Indeed, etc., so users can securely access
them and focus their time on valid and legitimate job postings, leading to an efficient job search
process.
5. User Feedback Mechanism: Implement a user feedback mechanism where users can report
suspicious job postings, providing valuable data for model retraining and improving the accuracy
of the detection system over time.
CHAPTER-11
REFERENCES