CHAPTER I
INTRODUCTION
1.1 Introduction
In the world of the Internet, the Internet of Things (IoT) is regarded as a transformative advance that can make our lifestyles more convenient and our lives simpler. Chang et al. (2014) state that an IoT device is a physical device connected to the Internet, built on traditional telecommunications, with an address through which it can be discovered and can communicate with other devices. IoT devices represent the general ability of networked devices to sense and collect data from the world around us and then share that data across the Internet, where it can be processed and used for various purposes. One particularly dangerous form of cybercrime is the threat posed by IoT botnets. Based on the McAfee Sample Database (2018) report, the total number of malware files discovered kept growing in 2018. The number of malicious files detected daily reflects the average activity of cybercriminals involved in the creation and distribution of malware.

Figure 1.1 Total Malware Files in Year 2018 (McAfee Lab, 2018)
According to Kaspersky Lab (2018), the amount of malware targeting IoT devices more than doubled in 2018. Figure 1.2 shows IoT botnet attacks by country in 2018. Based on the Kaspersky Lab report, the top 10 regions by number of botnet C&C servers underwent some significant changes. Top spot went to the US with almost half of all C&C centers (44.75%, against 29.32% in Q1). South Korea (11.05%) sank from first to second, losing nearly 20 percentage points. China also dropped significantly (from 8.0% to 5.52%). Its place was taken by Italy, whose share climbed from 6.83% in the previous quarter to 8.84%. The top 10 saw the departure of Hong Kong but was joined for the first time by Vietnam, whose 3.31% was good enough for seventh place.


Figure 1.2 IoT Botnet Attack by Country in Year 2018 (Kaspersky Lab Report, 2018)
In fact, the large number of unsecured devices with high computational power makes them an easy and attractive target for attackers seeking to compromise these devices and use them to create large-scale IoT botnets (Elisa and Nayeem, 2017). Robert and Eric (2017) state that users are often unaware that their systems are infected, as infected devices stay idle until they receive commands from their controller to start an attack. An IoT botnet is a set of hijacked Internet-connected devices, each of which is affected by threats. Without the knowledge of the device's rightful owner, this process allows an attacker to remotely control the unsecured device. Felix's report describes how data privacy leaks arise from security issues in IoT devices. According to Bernard (2017), the Mirai IoT botnet is not new; to mount these attacks, hackers have built botnets by gaining access to unsecured IoT devices. The main purposes of IoT botnet attacks are spamming, identity theft, information stealing, reputation theft, botnet hosting services, click fraud, manipulating online polls and attacking bank computers (Sorensen, 2017).

1.2 Problem Statement (PS)
The number of malicious programs attacking IoT devices increased in 2017. Worldwide, smart devices now number 6 billion, and many of them are vulnerable, making them a juicy prospect for intruders (Kaspersky Lab, 2017). The main problem with IoT devices is security: they are often designed with poor security or none at all. The Internet is already very complex to secure, and with an additional 9+ billion insecure IoT devices the task becomes even more difficult (Angrishi, 2017). As such devices are often designed to be plugged in and forgotten, users rarely apply security updates, and it is easy for an attack on these devices to go unnoticed. Besides, machine learning is used to enhance the accuracy of data analysis so that IoT botnet detection becomes more scalable; the challenge is to separate and filter the important data from the rest and interpret it in a valuable way (Michael, 2017).
Table 1.1 Summary of Problem Statement (PS)
PS Problem Statement (PS)
PS1 Lack of security on IoT devices.

PS2 Machine learning is not comprehensive enough for IoT botnet detection.

1.3 Project Question (PQ)
Table 1.2 Summary of Project Question (PQ)
PS PQ Project Question (PQ)
PS1 PQ1 What kinds of IoT botnet attacks can possibly infect the devices?
PS1 PQ2 What is the behavior of an IoT botnet attack?
PS2 PQ3 How easy and effective are the machine learning detection methods being used?
1.4 Project Objectives (PO)
Table 1.3: Summary of Project Objectives (PO)
PS PQ PO Project Objective (PO)
PS1 PQ1 PO1 To study possible attacks which are used to infect IoT devices.

PS1 PQ1 & PQ2 PO2 To analyze the behavior of IoT botnet attacks based on their basic modes of operation and communication.

PS2 PQ1, PQ2 & PQ3 PO3 To evaluate the best machine learning method for network-based IoT botnet detection.

1.5 Project Scope
The project scope is as follows:
Focusing on the IoT botnet attack, namely Mirai, that exploits devices and may affect user behavior.

This project uses machine learning as the platform that will detect the attack.

This project will also focus on the Windows 7 operating system.

1.6 Expected Output
The project's purpose is to evaluate machine learning classifiers for effective detection of IoT botnet flows with high predictive accuracy. This project also aims to study, understand, analyze and summarize the behavior of IoT botnet attacks using machine learning. Moreover, this project tests machine learning based classification techniques on flow data captured from the Mirai botnet only. In addition, this project also needs to test the machine learning techniques in large-scale network set-ups.

1.7 Thesis Organization
Chapter I
Introduction

Chapter II
Literature Review
This chapter explains this project in more detail, supported by the reading materials. Moreover, other related projects are also included, such as examples of possible attacks, the purpose of attacks, the classification of IoT botnets, the behavior of IoT botnets and the types of machine learning.

Chapter III
Project Methodology
This chapter provides the methodology of the analysis process that will be used as part of this project. The project methodology will ease the task of analyzing and organizing the project.

Chapter IV
Analysis and Design
This chapter discusses the analysis of the problem and the requirements needed for the IoT botnets. This section also briefly covers the high-level design, the user interface design and the system architecture. This chapter also elaborates on the machine learning classifiers, scenario design, machine learning methodology, dataset description, and the results based on performance and effectiveness measures.

Chapter V
Result and Discussion
This chapter provides an initial overview of the IoT botnets and the suitability of the machine learning involved to ensure this project can be set up within the specified time without any problems.

Chapter VI
Project Conclusion
The last chapter describes the overall project summary, the project contribution and the project limitations. This chapter also explains additional work that could be done in the future.

1.8 Summary
Overall, Chapter I helps the reader comprehend the project background and the issues that arose before this project began. The related topics in Chapter I, such as the problem statement, project questions, project objectives, project scope and expected output, lead to the conclusion that this study proposes a new machine learning approach capable of detecting IoT botnets. Besides, the growing number of IoT botnet programs seen in previous security cases shows how serious the issue of smart device security is. The following chapter, Chapter II, focuses on the literature review, which covers the methodology approach and related work on IoT botnets that affect unsecured devices, indirectly giving users an awareness of how important advanced security features are. Besides, this project presents a comparative analysis of machine learning methods, the best results and concluding remarks.

CHAPTER II
LITERATURE REVIEW
2.1 Introduction
Chapter II presents the literature review for this project, laying out the background of research on IoT botnets and machine learning. The literature review compiles work from various authors and studies related to this project; comparing and contrasting them makes the selection of the right methodology possible, which is crucial to obtaining the best outcome for this project. In this chapter, published information on topics related to this project is reviewed, and the problems related to this project are studied and analyzed. Further information regarding the definition of IoT, IoT issues, malware analysis, the botnet life cycle, types of botnet attacks, DDoS attacks, characteristics of botnets used to arrange DDoS attacks, an overview of Mirai attacks, an overview of IDS, IDS approaches, analysis techniques for IoT botnet detection using machine learning, and previous research in this area is studied, and a possible solution to the problem is proposed.

2.2 Related Work
2.2.1 Domain Related to this Project
(a) Network Security

Based on Keung (2014), a simple but widely applicable security model is the CIA triad: Confidentiality, Integrity and Availability. These principles should be assured in any kind of security system.

Figure 2.1 CIA Principle (Keung, 2014)
Confidentiality
Confidentiality in network security is the capability to protect data from being viewed by unauthorized users. The techniques used to guarantee the confidentiality of data transferred from one PC to another are cryptography and encryption. It is possibly the most obvious aspect of the CIA triad when it comes to security, but it is also the one attacked most often.

Integrity
An accurate and unchanged representation of the original secure data guarantees the integrity of the data. A security attack may intercept the important data and alter it before it reaches the intended receiver. Data encryption and hashing are used as security mechanisms to implement integrity.

Availability
The availability principle in network security is important to guarantee that the data concerned is readily accessible to authorized viewers at all times. A security attack may attempt to deny access to the appropriate user, either for some secondary effect or simply for the sake of inconveniencing them. For example, by taking down the website of a particular search engine, a rival may become more popular.

(b) IoT Security
Based on June (2017), three areas of IoT security are examined: IoT vulnerabilities, the connected workplace and IoT management.

IoT Vulnerabilities
Internet-enabled devices such as medical devices and PCs have grown significantly in number, while seemingly harmless devices such as printers can provide an easy route into a network for a hacker. Such a device can be the route a hacker uses to cripple a network or to access important data, even though these Internet-enabled devices were never primarily designed to be protected.

The Connected Workplace
The growing number of Internet-connected devices is increasingly exposed to malicious threats because of a lack of security. For instance, printers can be connected in the workplace without the security updates and patches that laptops and mobile phones receive. IoT devices must be managed to identify malicious threats and must be treated as endpoints, just like tablets, computers and mobile phones.

IoT Management
Designers need to start from scratch with every new application because there is no common platform to leverage for the development of IoT applications. Systems maintain different technological standards for communication between the network architecture and devices, which means IoT devices end up being managed on entirely different platforms.

(c) Security Incident
As the IoT is closely related to communication and information technology, it is justified to consider the security and privacy challenges already known in information security and examine how these concerns transfer to the current and future state of the IoT. According to the Embitel Report (2017), a piece of malware called Mirai (a Japanese word meaning "future") was developed to attack Linux-based devices connected to a network and turn them into remotely controlled bots. The botnet was used to attack various IoT devices, primarily home routers and IP cameras. According to a white-hat malware research group, it was the largest attack of its kind, leading to a widespread Distributed Denial of Service (DDoS) attack, in which an online business becomes unavailable because it is overwhelmed with traffic from multiple sources. The program identified vulnerable IoT devices using a table of 60 default usernames and passwords, signed in, and infected the devices with the Mirai malware. The malware persisted on a system unless the system was rebooted and the password was changed right after the boot.

(d) Malware Analysis
Yanhui (2017) stated that Kaspersky's annual report released at the end of 2016 showed that its Internet-based malware database had reached 1 billion entries, including viruses, Trojans, worms and other malicious objects. The growth rate had risen from 7.53% in 2012 to 40.5% in 2016, and the number of malware samples found each day had increased from 70,000 in 2011 to 323,000 in 2016.

Malicious software exploits vulnerabilities in computing systems. Malware includes viruses, worms, Trojan horses and spyware that gather information about a computer user and access a system without permission. It can appear in the form of code, scripts, active content or other software. According to Sanjeev and Ankur (2017), malware programs are divided into two classes: the first class needs a host program (viruses, Trojan horses, logic bombs, trapdoors) and the second class consists of independent programs (worms, zombies). Another categorization separates malware that does not replicate (activated by a trigger) from malware that produces copies of itself.
Malware, especially viruses and worms, is self-replicating. Viruses require user interaction and therefore propagate more slowly than worms, which need no user interaction and propagate quickly. All bots are under the control of a BotMaster. A bot residing on a computer is not harmful until it receives a command from the BotMaster; after receiving the command, it becomes dangerous for the system. These bots do not propagate from one network to another by themselves; they stay in an idle state. After receiving commands from the BotMaster, they propagate from system to system and network to network and carry out malicious activities.

Type of Malware Analysis
Malware analysis is the study of malware behavior to identify its different components. The components may be different variants that affect a system in different ways. Malware analysis has two types: static analysis and dynamic analysis.

Anusha et al. (2015) stated that static analysis of software is performed without actually executing the program. Feature sets can be used individually or in combination for malware detection. One malware detection technique relies on static analysis and is based on control flow graphs; this approach focuses on detecting obfuscation patterns in malware and achieves good accuracy. Machine learning techniques have also been applied to malware detection in the context of static detection, for example clustering based on features derived from static analysis for malware classification.
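As a brief illustration of this static approach, the sketch below (an assumption for illustration only, not taken from the project; the sample file name is hypothetical) computes a file hash and extracts printable strings from a binary without ever executing it, two of the most basic static-analysis steps.

```python
# Minimal static-analysis sketch: fingerprint a sample and pull printable
# strings from it without executing it. Purely illustrative.
import hashlib
import re

def static_summary(path):
    with open(path, "rb") as f:
        data = f.read()
    sha256 = hashlib.sha256(data).hexdigest()      # hash used for lookups and signatures
    strings = re.findall(rb"[ -~]{6,}", data)       # printable ASCII runs of 6+ characters
    return sha256, [s.decode("ascii") for s in strings[:20]]

# Example usage with a hypothetical sample file name:
# digest, first_strings = static_summary("suspicious_sample.bin")
# print(digest, first_strings)
```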
Dynamic techniques were developed to overcome the limitations of signature-based methods. Based on Anusha et al. (2015), dynamic analysis requires executing the program, often in a virtual environment. The interaction of the malware with the system is thoroughly analyzed by executing the malware in a completely isolated environment. The major advantage of dynamic analysis is that, in contrast to static analysis, executables do not need to be disassembled to perform the analysis. Dynamic analysis has certain limitations, such as extensive time and resource consumption, which limit the scalability of the analysis.

2.2.2 Keywords
Internet of Things: refers to scenarios where network connectivity and computing capability extend to objects, sensors and everyday items not normally considered computers, allowing these devices to generate, exchange and consume data with minimal human intervention.

Botnet: formed from the words 'robot' and 'network'. Cybercriminals use special Trojan viruses to breach the security of several users' computers, take control of each computer and organize all of the infected machines into a network of 'bots' that the criminal can remotely manage.

IoT Botnet: a set of hijacked Internet-connected devices, each of which is affected by threats. Without the knowledge of the device's rightful owner, this process allows an attacker to remotely control unsecured devices.

Machine Learning: the development of systems with the capacity to learn from past experience.

2.3 Critical Review

Basically, a critical review is a piece of writing that summarizes and evaluates a text, which can be a book, a journal article or another medium. In order to present a fair and reasonable evaluation of the chosen content, the selected text must be examined in detail. The material must be clearly understood so that its analysis and evaluation can be done properly using appropriate criteria. Therefore, several journal papers were used as guidelines for this project, as follows:
(a) Malware Trends
Based on the Kaspersky Lab report (2017), the reason behind the rise is that the IoT is fragile and exposed in the face of cybercriminals. The vast majority of smart devices run Linux-based operating systems, which makes attacks on them easier because criminals can write generic malicious code that targets a huge number of devices simultaneously. Manufacturers usually do not release security updates or new firmware, and most devices do not even have a security solution. This implies there is a huge number of potentially vulnerable devices, some of which may already have been compromised.

Figure 2.2 Malware Analysis in year 2013 – 2017 (Kaspersky Lab, 2017)
Smart devices such as smartwatches, smart TVs, routers and cameras are connecting to each other and building the growing IoT phenomenon, a network of devices equipped with embedded technology that allows them to interact with each other or with the external environment. Because of the large number and variety of devices, the IoT has become an attractive target for cybercriminals. By successfully hacking IoT devices, criminals are able to spy on people, blackmail them, and even discreetly make them their partners in crime. What is worse, botnets such as Mirai have indicated that the threat is on the rise.

Kaspersky Lab (2017) conducted research into IoT malware to examine how serious the risk is. Most of the attacks registered by the company's experts targeted digital video recorders or IP cameras (63%), and 20% of hits were against network devices, including routers and DSL modems. About 1% of targets were people's most common devices, like printers and smart home devices.

Figure 2.3 Distribution of Attack Sources by Device Type (Kaspersky Lab, 2017)
(b) Overview of the Internet of Things (IoT)
The term "Internet of Things", meaning the connection of devices with one another, was introduced by Kevin in 1998 (Effy et al., 2016). The word "Things" in IoT can refer to a wide variety of devices, such as mobile phones and remote controls. Basically, the IoT is a revolution that builds connections among the various things people come across in their day-to-day lives and their everyday interactions with the network, without human help.
Based on the Kaspersky Lab (2017) report, IoT devices often have weak security that is very easy to bypass, and in that year the number of malicious programs attacking IoT devices expanded. Besides, the Cisco Internet Business Solutions Group (2017) report stated that the IoT is basically the point in time when more things or objects are connected to the Internet than people, so that uniquely identifiable objects or "things" with a digital presence can be connected by anyone, on any network, anytime and anyplace.
According to Karen et al. (2015), five areas of IoT issues are examined to explore some of the most pressing challenges and questions related to the technology. Table 2.1 below describes in detail security, privacy, interoperability and standards, legal, regulatory and rights issues, and lastly emerging economies and development.

Table 2.1 IoT Issues (Karen, 2015)
IoT Issues Description
Security Users need to trust that IoT devices and related data services are secure from vulnerabilities, especially as this technology becomes more pervasive and integrated into users' daily lives.
Poorly secured IoT devices and services can serve as potential entry points for cyber-attack and expose user data to theft by leaving data streams inadequately protected.

Privacy The full potential of the IoT depends on strategies that respect individual privacy choices across a broad spectrum of expectations.
The data streams and user specificity afforded by IoT devices can unlock incredible and unique value to IoT users but concerns about privacy and potential harms might hold back full adoption of the IoT.

This means that privacy rights and respect for user privacy expectations are integral to ensuring user trust and confidence in the Internet, connected devices, and related services.

Interoperability/ Standards In addition, poorly designed and configured IoT devices may have negative consequences for the networking resources they connect to and the broader Internet.
The use of generic, open, and widely available standards as technical building blocks for IoT devices and services will support greater user benefits, innovation, and economic opportunity.

Legal, Regulatory and Rights The use of IoT devices raises many new regulatory and legal questions as well as amplifies existing legal issues around the Internet.
One set of issues surrounds cross border data flows, which occur when IoT devices collect data about people in one jurisdiction and transmit it to another jurisdiction with different data protection laws for processing.

Further, data collected by IoT devices is sometimes susceptible to misuse, potentially causing discriminatory outcomes for some users.

Emerging Economy and Development Issues The Internet of Things holds significant promise for delivering social and economic benefits to emerging and developing economies.

In addition, the unique needs and challenges of implementation in less-developed regions will need to be addressed, including infrastructure readiness, market and investment incentives, technical skill requirements, and policy resources.

(c) Botnet Life Cycle
According to Sanjay (2006), botnets have been around since early 2004. The attacker machines are usually running the Microsoft Windows operating system. Sheharbano et al. (2013) stated that a botnet is a collection of compromised machines (bots) receiving and responding to commands from a server (the C&C server) that serves as a rendezvous mechanism for commands from a human controller (the BotMaster). A bot, meaning robot, is also known as a zombie. The BotMaster can remotely control the affected computer by executing a few requests through the received commands to install new malware. The computer turns into a bot or zombie after the bot code is successfully installed on it. Hence, current malware such as worms and viruses, which concentrate on attacking the infected host, can use bots to receive commands from the BotMaster and serve as a distributed attack platform.

Figure 2.4 Structure of a Typical Botnet (Sheharbano, 2014)
Generally, a specific attacker creates a botnet by using one piece of malware to infect a large number of compromised machines. A botnet is a number of Internet-connected devices used by the botnet's owner to carry out various tasks. The botnet's owner can manage the attack through command and control (C&C) software. The compromised computers that form a botnet can be directed to redirect traffic to a particular computer. It is also assumed that C&C communication is very flexible, so it is hard for any botnet detection method to rely on particular communication features. Moreover, the main difference between a botnet and other kinds of malware is the presence of C&C.

Type of Botnet Attacks
Based on Hongmei et al. (2009), botnets can perform various tasks such as Distributed Denial of Service attacks, sending spam and spreading malware, stealing data through information leakage, click fraud and, lastly, identity fraud.

DDoS Attacks
A botnet can disable the network services of a victim system by consuming its data transfer capacity, which is regularly done in DDoS attacks. Firstly, a perpetrator may direct the botnet to connect to a victim's IRC channel, and this target can then be flooded by a huge number of service requests from the botnet. The victim's IRC network is taken down by this kind of DDoS attack. UDP flooding and TCP SYN attacks are, as the evidence reveals, the attacks usually performed by botnets. Based on the Kaspersky Lab Report (2018), UDP attacks are in second place (10.6%), while TCP, HTTP and ICMP constitute a relatively small proportion.

Figure 2.5 DDoS Attacks by Type (Kaspersky Lab, 2018)
According to Kaspersky Lab (2018), in Q2 2018, Sunday went from being the quietest day for cybercriminals to the second most active: it accounted for 14.99% of attacks, up from 10.77% in the previous quarter. But gold in terms of number of attacks went to Tuesday, which braved 17.49% of them. Thursday, meanwhile, went in the opposite direction: only 12.75% of attacks were logged on this day. Overall, as can be seen from the graph, in the period April-June the attack distribution over the days of the week was more even than at the beginning of the year.

Figure 2.6 DDoS Attacks by Day of the Week 2018 (Kaspersky Lab, 2018)
Spamming and Spreading Malware

Nowadays, botnets can also be used for spamming and spreading malware. According to the most experienced practitioners in the Internet security industry, about 70% to 90% of the world's spam is created by botnets. Since the victims' systems might not have activated the ISS services, a botnet can launch the Witty worm to attack the ICQ protocol. Kaspersky Lab (2018) stated that in 2018 the largest share of spam was recorded in January (54.50%), while the average share of spam in global email traffic in 2017 was 51.82%.

Figure 2.7 Proportion of Spam in Global Email Traffic in Year 2017 and 2018 (Kaspersky Lab, 2018)
Information Leakage
In fact, some bots may sniff not only the traffic passing by the compromised machines but also the command data within the victims, so perpetrators can easily retrieve sensitive information such as usernames and passwords from the botnet. Since the bots rarely affect the performance of the infected systems, they often remain outside the surveillance area and are hard to catch. This enables the attacker to steal thousands of pieces of private information and credential data.

Click Fraud
With the help of the botnet, perpetrators are able to install advertisement add-ons and browser helper objects (BHOs) for business purposes. Each affected host owns a unique IP address scattered across the globe, which can be used in online polls or games. Every single click is regarded as a valid action from an authorized person.

Identity Fraud
Identity fraud, also known as identity theft, is a rapidly growing crime on the Internet. Usually, it distributes reliable-looking URLs through spamming mechanisms or asks the receiver to submit confidential information personally. Next, the botnet pretends to be an official business site and sets up various fake websites to gather the victims' data. Once a fake site is closed by its owner, another one can pop up, until the computer is shut down.
Based on the Consumer Sentinel Network (2016), which tracks consumer fraud and identity theft complaints filed with private organizations and with state, federal and local law enforcement agencies, about 3.1 million complaints were received in 2016, of which 1.3 million were fraud-related, costing consumers over $744 million. In this report, they accounted for almost 28% of all the complaints reported to the FTC and 66% of all fraud complaints. In 2016, 13% of all complaints were related to identity theft. Identity theft complaints were the third most reported to the FTC and had increased by more than 47% from 2013 to 2015, but fell about 19% from 2015 to 2016.

Figure 2.8 The Identity Theft and Fraud Complaints in year 2013-2016 (Consumer Sentinel Network, 2016)
IoT Botnet: Mirai Attacks
Based on statistics from the Malaysia Computer Emergency Response Team (MyCERT), the timeline below illustrates the emergence of Mirai from late 2016 to early 2017. Donno et al. (2017) also stated that Mirai infected hundreds of thousands of connected devices all over the world in 2016. Beginning in September 2016, DDoS attacks successively disabled Krebs on Security, OVH and Dyn. Besides, the initial attack on OVH using the Mirai botnet exceeded 1 Tbps in volume, among the largest on record. MyCERT observed a large number of IP addresses from Malaysia infected with the Mirai botnet that were recruited to launch the DDoS attacks. The Mirai infections in Malaysia are visualized from October 2016, the first month, until September 2017. The graph is categorized by state, port number and variant.

Figure 2.9 Mirai Infections in Malaysia 2016 – 2017 (Roziah, 2017)
The most predominant malware of the last few years is Mirai, which attacks IoT devices (Nicola et al., 2017). The Mirai botnet can be classified as a worm-like family of malware that infects IoT devices and corrals them into a DDoS botnet (Roziah and Sahrom, 2017). Once exploited, the devices report to a control server in order to be used as part of a large-scale botnet. Hence, the botnet can be used to perpetrate several types of DDoS attacks exploiting a wide range of protocols.

Figure 2.10 IoT Malwares with DDoS Capabilities (Michele, 2018)
The Mirai botnet is perhaps the most famous of all IoT malware, having taken down a significant portion of the Internet. In order to gain shell access, Mirai uses default passwords for the telnet or SSH accounts. After successfully accessing an account, it installs malware on that system. This malware spawns delayed processes and then deletes files that might signal its behavior to antivirus software. Without a memory analysis, it can be difficult to identify an infected system. Mirai opens ports, establishes a connection with the BotMaster and then begins searching for other devices it can infect. Next, it waits for further instructions. Since it shows no activity while it waits and leaves no files on the system, it is hard to recognize.

The Mirai botnet's source code was released to the public, which gave intrusion analysts insight into the attack and the associated intrusion detection (Anna, 2016). The DDoS traffic was produced by different types of IoT devices. When the malware recognizes an insecure IoT device, it tries to access it with a series of common default passwords set by manufacturers. If those passwords do not work, Mirai uses brute-force attacks to guess the password. After that, the infected device connects to the C&C infrastructure and can divert different amounts of traffic toward a DDoS target.
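One observable side effect of this scanning stage is that a single infected host contacts many distinct destinations on the telnet ports. The sketch below is an illustrative flow-based heuristic assumed for this discussion only (it is not the project's detection method, and the threshold and field names are hypothetical): it flags sources that contact an unusually large number of distinct destinations on ports 23 or 2323.

```python
# Illustrative heuristic: flag hosts showing Mirai-like telnet scanning, i.e.
# one source contacting many distinct destinations on ports 23/2323.
from collections import defaultdict

TELNET_PORTS = {23, 2323}   # ports Mirai probes for default credentials
SCAN_THRESHOLD = 50         # distinct targets per observation window (assumed value)

def flag_telnet_scanners(flows):
    """flows: iterable of dicts with hypothetical 'src', 'dst' and 'dport' keys."""
    targets = defaultdict(set)
    for flow in flows:
        if flow["dport"] in TELNET_PORTS:
            targets[flow["src"]].add(flow["dst"])
    return [src for src, dsts in targets.items() if len(dsts) >= SCAN_THRESHOLD]

# Example with synthetic flow records: one host probing 60 telnet targets.
flows = [{"src": "192.0.2.7", "dst": f"198.51.100.{i}", "dport": 23} for i in range(60)]
print(flag_telnet_scanners(flows))   # ['192.0.2.7']
```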

According to Kambourakis et al. (2017), the bot component is used for probing the IP space for new victims and is also responsible for unleashing one of several DDoS attacks. In fact, Mirai mostly targets Linux-based IoT devices. Based on Kambourakis et al. (2017), Mirai's infrastructure is composed of a C&C module that provides the multiple attacks with a management console, a "report" or "collector" server that gathers and maintains information about the active bots in the botnet, as well as "loader" devices that facilitate the propagation of the malware to newly discovered victims.

Figure 2.11 Overview of Mirai Communication and Basic Components (Kambourakis, 2017)
(d) Intrusion Detection System (IDS)
Vijayarani et al. (2015) stated that an IDS is a software application which monitors network or system activities and detects whether any malicious operations occur. IDS are deployed in systems to recognize the presence of intruders, especially those that bypass, or attempt to bypass, security barrier layers such as anti-virus, firewall and access control, so that preventive measures can be taken. An IDS is also a mechanism to detect and prevent unauthorized access or malicious traffic that may cause a system crash or data loss.

Generally, these attacks are partitioned into two classifications: network-based attacks and host-based attacks. Host-based attack detection typically uses system call data from an audit process that tracks all system calls made on behalf of every user on a specific machine. Network-based attack detection methods, for the most part, utilize network traffic data from a network packet sniffer.

Table 2.2 Comparison of HIDS and NIDS performance (Xavier, 2016)
Performance in terms of: Host-Based IDS (HIDS) Network-Based IDS (NIDS)
Intruder deterrence Strong deterrence for inside intruders Strong deterrence for outside intruders
Threat response time Weak real-time response but performs better for a long-term attack Strong response time against outside intruders
Assessing damage Excellent in determining the extent of damage Very weak in determining the extent of damage
Intruder prevention Good at preventing inside intruders Good at preventing outside intruders
Threat anticipation Good at trending and detecting suspicious behavior patterns Good at trending and detecting suspicious behavior patterns
IDS Approach
IDS appliances can be used for examination purposes. Beyond that, they simply identify whether specific software or a specific protocol is being used on the observed network.

Anomaly-based
Anomaly-based detection is a method generally used with protocols, since all the expected forms of a protocol are clearly described in RFCs. An anomaly is a deviation from those forms, and anomaly detection intends to recognize patterns that do not conform to expected behavior. Moreover, anomaly detection techniques may use normal traffic as a baseline from which to observe attack traffic. The disadvantage of this strategy is obvious: even if the traffic follows the defined standards, its content cannot be assumed to be non-malicious. Basically, machine learning techniques fall under the anomaly-based approach.

Behavior-based
Behavior-based detection is a mechanism which watches ongoing network activity and identifies uncommon situations. In addition, behavior-based detection establishes a baseline of everyday activity and looks for anything that deviates from it. This technology allows any difference to be detected, including unknown issues such as zero-day attacks.

Signature-based
This detection mechanism compares event patterns against known signatures and attack patterns. The detection capability is therefore limited to known signatures and malicious activity; the comparison with antivirus software comes to mind. Consequently, constant updates are important.
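As a toy illustration of the signature-based idea (not drawn from the source; the patterns below are made-up placeholders rather than real attack signatures), matching boils down to checking observed payloads against a list of known byte patterns:

```python
# Minimal signature-matching sketch: report which known byte patterns occur
# in an observed payload. The signatures here are invented placeholders.
SIGNATURES = {
    "example-telnet-bruteforce": b"admin\r\nadmin\r\n",
    "example-malware-marker": b"\xde\xad\xbe\xef",
}

def match_signatures(payload: bytes):
    """Return the names of all signatures found in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]

print(match_signatures(b"login: admin\r\nadmin\r\n"))  # ['example-telnet-bruteforce']
```

The limitation described above is visible even in this sketch: any payload that does not contain one of the stored patterns passes through undetected, which is why constant signature updates matter.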

Analysis Approach
Based on Silvia et al. (2017), malware analysis methods are classified by the mode of analysis: static, dynamic or a mix of both (hybrid). Static analysis examines the software without executing it; it looks at the file itself and tries to obtain information about the structure and the data in the file, such as when the program was compiled, which compiler was used, and details about its structure and data. Static analysis can be done either on the binary executable or on the source code. Analysis of the binary is more complicated because some information is lost when the source code is compiled into binary code. Furthermore, static analysis can be classified as either basic or advanced.

Dynamic analysis tries to find errors in the program while it runs, by executing it in real time. Basic dynamic analysis actually runs the malware to understand its behavior and functionality and to recognize technical indicators which can be used in detection signatures. Technical indicators uncovered with basic dynamic analysis can include IP addresses, registry keys, file path locations and domain names. Additionally, interaction with an attacker-controlled external server for C&C purposes, or attempts to download additional malware, can be identified.

Table 2.3 Comparison between Static Analysis and Dynamic Analysis Methods (Ayman, 2017)
Factors | Static Analysis | Dynamic Analysis
Time | Less time if automated, but more time if conducted manually | More time is needed
Input | Source code, bytecode of an interpreted language, or binary code of a compiled application | Memory snapshots and run-time data
Resource consumption | More cost efficient | Needs more resources in memory and processing
Accuracy | Less than dynamic analysis | Better, because it detects run-time vulnerabilities
Advantages | Faster; code weaknesses are found earlier in the development lifecycle; more cost efficient than dynamic analysis | Finds vulnerabilities at runtime; more flexible and accurate; more attractive than static analysis because it is concerned with actual code execution
Limitations | Cannot find vulnerabilities at run-time; hard to perform | Analyzes only a single malware sample at a time
(e) Introduction to Machine Learning
Based on Anand et al. (2016), WEKA is an open-source framework and programming tool consisting of various built-in classification algorithms such as J48, Random Forest, Decision Tree, Random Tree, NaiveBayes, SimpleNaiveBayes, DecisionStump and others. The comparison of these algorithms can be made with the help of measures that include correctly classified instances, incorrectly classified instances, accuracy and many other parameters. WEKA is a machine learning tool, thereby forming a basis for data mining. It was developed by professionals at the University of Waikato, New Zealand, in 1997.
WEKA implements algorithms for data preprocessing, association rules, visualization, regression and clustering. It comprises 49 data preprocessing tools, 3 association rule learners and nearly 76 classification algorithms. The analysis is performed through a Graphical User Interface (GUI) known as the Explorer, which helps in investigating scenarios from the information contained in the dataset. In addition, various dataset formats such as .csv and .arff are supported by WEKA in order to extract the relevant information from the raw data.
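For readers working outside the WEKA GUI, the short sketch below shows one assumed way of inspecting a WEKA-format dataset from Python; the file name botnet_flows.arff and the "class" attribute are hypothetical placeholders rather than artifacts of this project.

```python
# Illustrative sketch: read a WEKA .arff flow dataset into Python for inspection.
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("botnet_flows.arff")   # returns the records plus header metadata
df = pd.DataFrame(data)

# Nominal attributes are loaded as bytes; decode them to plain strings.
for col in df.select_dtypes([object]).columns:
    df[col] = df[col].str.decode("utf-8")

print(meta)                           # attribute names and types declared in the ARFF header
print(df["class"].value_counts())     # distribution of the (assumed) class label
```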

Machine Learning Classifiers
Machine learning was introduced in the late 1950s as a technique for artificial intelligence (AI) (Yue, 2015). Machine learning algorithms are programs that learn from collected data, and different algorithms exist to learn from different kinds of data. According to Koroniotis et al. (2017), several machine learning techniques are suitable for IoT botnet detection. The machine learning classifiers considered here are RandomForest, J48, JRip, Naive Bayes and BayesNet. An explanation of each machine learning technique is given first, and then this project provides an analysis of the results obtained based on accuracy and false alarm rate.
A RandomForest is a collection or ensemble of decision trees. Random forest builds multiple decision trees and merges them to obtain a more accurate and stable prediction (Shahaboddin et al., 2013). Random forests are an instance of the general technique of random decision forests, an ensemble learning technique for classification, regression and other tasks. Anjay et al. (2016) stated that in a random forest, every node is split using the best among a subset of predictors randomly chosen at that node. Random Forests are considered efficient, general-purpose tools. Besides high prediction accuracy, Random Forest is efficient, interpretable and non-parametric for various types of datasets (Nasir et al., 2012).

Decision trees are the base classifiers for Random Forests. According to Yael et al. (2010), the J48 algorithm is effective in that it provides human-readable classification rules. Based on Shahaboddin et al. (2013), J48 is a renowned, relatively simple classifier; it is popular since it is easy to interpret and explain. The decision is made based on whether a data record belongs to a branch or not. A J48 tree is constructed from nodes, represented by circles, and branches, represented by the segments that connect the nodes. According to Koroniotis (2017), the J48 technique was the best at distinguishing between botnet and normal network traffic.
JRip is also a well-known algorithm for supervised data classification. Its rule sets are easy to understand and usually better than those of decision tree learners (Salman, 2013). In RIPPER classifiers, the training data is randomly divided into a growing set and a pruning set. Classes are examined in increasing order of size, and an initial set of rules for each class is generated using incremental reduced-error pruning (Ramesh et al., 2011). Each rule keeps growing until no further information gain is possible. In JRip, instances of the dataset are evaluated in increasing order, and for a given dataset of threats a set of rules is generated. The JRip algorithm treats each class of the given database in turn and generates a set of rules covering all the attributes of the class. Then the next class is evaluated in the same way as the previous one, and this process continues until all the classes have been covered.

Moreover, the NaiveBayes algorithm is also used in machine learning systems to classify new or testing data (Vigan et al., 2016). According to Shahaboddin et al. (2013), it is a simple probabilistic classifier based on Bayes' theorem with a strong feature-independence assumption. While this assumption is clearly false in most real-world tasks, naive Bayes often performs classification very well. Because of the independence assumption, the parameters for each attribute can be learned separately, which greatly simplifies learning, especially when the number of attributes is large (Albert and Ting, 2011). Naive Bayes models allow each attribute to contribute towards the final decision equally and independently of the other attributes, which makes them more computationally efficient than other text classifiers. Thus, some studies employ the Naive Bayes approach as the text classifier for document classification and evaluate its classification performance against other classifiers.

Based on previous research, Priyanka and Sujata (2015) stated that Bayesian classifiers are statistical classifiers. BayesNet can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. It uses various search algorithms and quality measures based on the BayesNet classifier and provides a data structure. According to Nigel et al. (2006), a BayesNet is structured as a combination of a directed acyclic graph of nodes and links and a set of conditional probability tables. Nodes represent features or classes, while links between nodes represent the relationships between them. In a BayesNet classifier, the conditional probability at each node is calculated first and then a Bayesian network is formed (Vaithiyanathan et al., 2013). The assumption made in BayesNet is that all attributes are nominal and there are no missing values; any such value is replaced globally. Moreover, the output of BayesNet can be visualized as a graph. The power of Bayesian networks as a representational tool stems from this ability to represent large probability distributions compactly. When there is a lot of missing data, BayesNets can be very effective since they model the joint distribution (Darwiche et al., 2008).
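To make the comparison concrete, the sketch below trains a few of these classifiers under stated assumptions: it uses synthetic data in place of the project's flow dataset, and scikit-learn stand-ins rather than WEKA's exact implementations (DecisionTreeClassifier for J48, GaussianNB for Naive Bayes; JRip and BayesNet have no direct scikit-learn counterpart). It illustrates the evaluation workflow, not the project's results.

```python
# Compare a few classifiers with 10-fold cross-validation on synthetic data
# standing in for labelled network-flow features (attack vs. normal).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=42)

classifiers = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "DecisionTree (J48-like)": DecisionTreeClassifier(random_state=42),
    "NaiveBayes": GaussianNB(),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

Ten-fold cross-validation is used here because it gives a less optimistic accuracy estimate than a single train/test split, which matters when classes such as attack and normal traffic are imbalanced.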

Detection Measures
According to Affendey et al. (2010), the detection of attacks can be measured by the following metrics. The accuracy of an intrusion detection system is measured in terms of its detection rate and false alarm rate.

False positive (FP): the number of instances detected as attacks that are in fact normal.

False negative (FN): the number of instances detected as normal that are actually attacks; in other words, these attacks are the target of intrusion detection systems.

True positive (TP): the number of instances detected as attacks that are in fact attacks.

True negative (TN): the number of instances detected as normal that are actually normal.

Precision: the proportion of instances correctly assigned to a particular class out of all instances classified into that class.
Precision = TP / (TP + FP)
Recall: the proportion of instances belonging to a class that are correctly classified, out of all instances actually in that class.
Recall = TP / (TP + FN)
F-Measure: computed by combining Recall and Precision.
F-Measure = (2 * Precision * Recall) / (Precision + Recall)
According to Elaheh et al. (2014), classification accuracy metrics such as the detection rate and the false positive rate are used to evaluate the effectiveness of the considered features. The accuracy of IoT botnet attack detection is measured in terms of the detection rate and the false alarm rate.

Detection rate: refers to the percentage of detected attacks among all attack data, and is defined as follows:
Detection Rate = TP / (TP + FN)
False alarm rate (FAR): refers to the percentage of normal data which is wrongly recognized as attack, and is defined as follows:
FAR = FP / (FP + TN)
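A small sketch of these formulas is given below; it is illustrative only (the counts in the example are made up), but it shows that the detection rate defined here coincides with recall.

```python
# Compute the measures defined above from raw TP, FP, TN, FN counts.
def detection_measures(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # identical to the detection rate
    f_measure = 2 * precision * recall / (precision + recall)
    far = fp / (fp + tn)                 # false alarm rate
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "detection_rate": recall, "far": far}

# Example with made-up counts: 90 attacks caught, 10 missed, 5 false alarms, 95 normal.
print(detection_measures(tp=90, fp=5, tn=95, fn=10))
```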
Kappa Statistic
Kappa is a chance-corrected measure of agreement between the predicted classification and the true classes: the agreement expected by chance is subtracted from the observed agreement, and the result is divided by the maximum possible agreement beyond chance. A value greater than zero indicates better performance than chance. According to Suman et al. (2007), when the dataset is small the Kappa statistic tends to be high, and when the dataset is large the Kappa statistic tends to be low. Surabhi (2015) stated that:
Poor agreement = Less than 0.20
Fair agreement = 0.20 to 0.40
Moderate agreement = 0.40 to 0.60
Good agreement = 0.60 to 0.80
Very good agreement = 0.80 to 1.00
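As a brief, assumed illustration (the toy labels below are invented), the Kappa statistic can be computed with scikit-learn's cohen_kappa_score and read against the agreement bands above.

```python
# Cohen's kappa for a toy set of true vs. predicted labels.
from sklearn.metrics import cohen_kappa_score

y_true = ["attack", "attack", "normal", "normal", "attack", "normal"]
y_pred = ["attack", "normal", "normal", "normal", "attack", "normal"]

kappa = cohen_kappa_score(y_true, y_pred)
print(f"kappa = {kappa:.2f}")   # ~0.67 here, i.e. "good agreement" on the scale above
```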
Confusion Matrix
A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class. According to Jesse and Mark 2006, the confusion matrix has four categories: True positives (TP) are examples correctly labeled as positives. False positives (FP) refer to negative examples incorrectly labeled as positive. True negatives (TN) correspond to negatives correctly labeled as negative. Finally, false negatives (FN) refer to positive examples incorrectly labeled as negative.

Table 2.4 Classification Table (Jesse, 2006)
Predicted: Attack Normal
Actual Attack: TP FN
Actual Normal: FP TN
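The sketch below (illustrative; the toy labels are invented) shows how this table maps onto scikit-learn's confusion_matrix output when attack is treated as the positive class.

```python
# Build the binary attack/normal confusion matrix and unpack its cells.
from sklearn.metrics import confusion_matrix

y_true = ["attack", "attack", "normal", "normal", "attack", "normal"]
y_pred = ["attack", "normal", "normal", "normal", "attack", "normal"]

# With labels ordered as [normal, attack], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["normal", "attack"]).ravel()
print(tn, fp, fn, tp)   # 3 0 1 2 for this toy example
```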
2.4 Proposed Solution / Further Project
In this project, it was found that machine learning for locally detecting IoT botnet attacks is not comprehensive enough. More complicated botnet cases can be considered to improve the detection rate of IoT botnet attacks. Hence, to the best of our knowledge, there is currently no systematic evaluation of IoT botnet complication, and several critical questions are yet to be answered, such as whether IoT botnets are complicated in a similar way to traditional botnets, and how limited resources influence the complication methods. In addition, new attack-image extraction methods can be proposed for classification to obtain more representative features of botnets.
2.5 Summary
Overall, this literature review provides the details of the whole project in order to make sure that the studies have been done based on the topics and subtopics mentioned. The next chapter elaborates on the proposed solution methodology. This chapter covered several domains, such as network security, IoT security, security incidents and malware analysis. Furthermore, this chapter also helps ensure that the project to be developed can make a contribution and that the stated objectives of the project are successfully achieved. In addition, some previous studies have been used as references for this project, to reinforce the reasons why this project should be implemented. All related or past research, references, case studies and other findings that relate to this project title will be used for the purpose of successfully completing the study on time and without mistakes.

CHAPTER III
PROJECT METHODOLOGY
3.1 Introduction
In Chapter III, the project method or approach identified as part of completing this task effectively is described in more detail. Moreover, the issues related to the project methodology are also described briefly. The goal of proposing the methodology is to verify that the tasks are performed and conducted in the correct sequence and in the correct way. Each of the phases is discussed and explained in Chapter III, and each phase of the milestone is listed in the form of a Gantt chart. The discussion also includes the requirements for the project. For this section, the high-level requirements consist of the hardware and software necessary for this project.

3.2 Methodology
In analyzing this project, the framework of the methodology model gives an outline for each phase, from the beginning until the end of the work. Besides, the methodology model is used to ensure that the objectives of the project can be fulfilled successfully. This section defines the actions taken to study the project problem and the reasoning for the use of the particular methodologies or techniques used to classify, select and examine all the information applied in order to understand the problem. Moreover, Chapter III includes the project methodology of the dissertation. Figure 3.1 below shows the six phases required to complete the project, starting with the analysis of previous research, followed by information gathering, methodology, design, analysis of results and, lastly, project evaluation.

Figure 3.1 Methodology Model
3.2.1 Phase I: Analysis of Previous Research
Firstly, the analysis of previous research is done in order to identify the specifications made based on previous research and to build proper studies related to the topic of the project. Data is gathered from trusted websites such as Kaspersky Lab, MyCERT and many more to obtain current analyses of the IoT, IoT issues and the botnet life cycle. There are several important things to highlight, such as the types of botnet attacks, DDoS attacks, the characteristics of botnets used to arrange DDoS attacks, Mirai attacks, IDS approaches and machine learning techniques. All the domains involved in this project are explored first to get the main idea before the project is continued. The architecture, framework and structure of the domains are examined, and the next step is explained in Phase II.

3.2.2 Phase II: Information Gathering
After all the domains have been explored, the issues related to the vulnerabilities and threats posed by IoT botnets are identified. Throughout this phase, the data obtained is analyzed, categorized and observed before a definite hypothesis is drawn. The limited features of IoT devices have made it easier for intruders to attack within the network, especially by using default usernames and passwords. One attack that is able to access devices this way is the Mirai botnet, which is used to mount DDoS attacks from IoT devices. In a DDoS attack, an online service is made inaccessible by overwhelming it with traffic from multiple sources. Machine learning techniques will be used in the project to make better decisions for botnet detection.

3.2.3 Phase III: Methodology
During this phase, the methodology provides the guidelines and the sequence of the workflow to ensure that the project stays on track within the timeline. This phase focuses on the IoT botnet attack that exploits devices and may affect user behavior. In this project, machine learning techniques will be used to detect the IoT botnet more easily; however, the current techniques are not efficient enough for botnet detection.
3.2.4 Phase IV: Design
This phase is usually the longest and most extensive part of the process. It starts with the specification of the machine learning requirements that fulfill the users' needs, and continues with a decision on the suitable machine learning techniques to be used. During the specification of requirements, the data, functional and non-functional requirements are determined. The data requirements indicate what data is input to the system and what output the techniques should produce.

3.2.5 Phase V: Analysis of Results
During this phase, the data obtained is analyzed to measure the botnet's behavior and the consequences of an IoT botnet attack on the user experience by examining Windows 7. The analysis result should make plenty of sense: it should explain all of the sub-problem symptoms and, most importantly, offer a new way forward that, if the root causes are anywhere close to correct, will work.

3.2.6 Phase VI: Project Evaluation
This phase mainly focuses on defining the quality of the expected output. Moreover, this phase proves that the project meets all requirements and objectives, including those for efficiency and effectiveness. If the user evaluates the techniques and is not satisfied with them, the current techniques are refined according to the requirements and the additional information provided by the user. This process is repeated until the machine learning techniques are able to fulfill every requirement stated by the user.
3.3 Project Flow
A project flow is a graphical representation of a process or system that details the sequencing of steps required to create output. A typical project flow uses a set of basic symbols to represent various functions and shows the sequence and interconnection of functions with lines and arrows. It begins with the input of data or materials into the system and traces all the procedures needed to convert the input into its final output form. Flow charts may include different levels of detail as needed, from a high-level overview of an entire system to a detailed diagram of one component process within a larger project. In any case, the flow chart shows the overall structure of the process, traces the flow of information and work through it, and highlights key processing and decision points.

Figure 3.2 Project Flow
3.4Project Milestone and Gantt Chart
In project management, a project milestone is essential as it marks a specific point along the project timeline, such as the start or end date of a task. Milestones do not affect the project duration; instead, they are used to focus attention on the major progress points that must be reached. The Gantt chart is also widely used in project management. In this project, the Gantt chart helps to plan, organize and track specific tasks through a graphical representation of the schedule. Both the Gantt chart and the project milestones are very important as they are used to make sure that the project is completed on the expected date, and the development of the project is carried out based on them.

Based on the milestones and the Gantt chart, the developers know what they need to achieve at any one time. For this project, the Gantt chart is used to schedule both the project work and the documentation. It is an easy way to construct and schedule activities, giving a visual timeline for beginning and wrapping up particular tasks to keep the project developer on track. It also helps the decision makers look ahead and ensure that each task works toward the project's long-term goals.

Table 3.1 Project Milestone
Week Activity Note / Action
1 Proposal PSM: Discussion Deliverable – Proposal
Action – Student
Proposal Assessment & Verification Action – Supervisor
2 Proposal Correction / Improvement Action – Student
List of supervisor / title Action – PSM / PD Committee
3 Proposal Presentation & Submission via PSM Online System
Chapter 1 Deliverable – Proposal Presentation (PP)
Action – Student
4
Chapter 1
Chapter 2 Deliverable – Chapter 1
Action – Student, Supervisor
5 Chapter 2 Action – Student
MID SEMESTER BREAK
6
Chapter 2
Chapter 3 Deliverable – Chapter 2
Progress Presentation 1 /
Pembentangan Kemajuan 1 (PK 1)
Action – Student, Supervisor
Student Status Warning Letter 1
Action – Supervisor, PSM / PD Committee
7 Chapter 3
Chapter 4 Action – Student
8 Chapter 4
Project Demo Deliverable: Chapter 3
Action – Student, Supervisor
9 Chapter 4
Project Demo Deliverable – Progress Presentation 2 / Pembentangan Kemajuan 2 (PK 2)
Action – Student, Supervisor
Student Status Warning Letter 2
Action – Supervisor, PSM / PD Committee
10 Project Demo Action – Student
Determination of student status (Continue / Withdraw) Submit student status to Committee
Action – Supervisor, PSM / PD Committee
11 Project Demo
PSM 1 Report Action – Student, Supervisor
12 Project Demo
PSM 1 Report
Schedule the Presentation Action – Student, Supervisor
Action – PSM / PD Committee
Presentation Schedule
13 Project Demo Deliverable – Complete PSM1 Draft Report
Action – Student, Supervisor
14 FINAL PRESENTATION
Submission of the PSM1 Report onto the PSM e-Repository online system Action – Student, Supervisor, Evaluator, PSM / PD Committee
15 REVISION WEEK
Correction on the draft report based on the comments by the Supervisor and Evaluator during the final presentation Session. Submit PSM1 Logbooks to PSM e-Repository online System Deliverable – Complete PSM1 Logbooks
Action – Student, Supervisor
Submission of overall marks to PSM/PD committee Deliverable: Overall PSM1 score sheet
Action – Supervisor, Evaluator, PSM/PD
Committee
16 & 17 FINAL EXAMINATION WEEKS
3.5Summary
Overall, Chapter III has described the project methodology or approach used in this project. The methodology is very important as it shows what the project is based on, and it is one of the few elements that must be considered in a project because it conveys the project's objectives. This chapter has also discussed the phases involved in the methodology model approach. The approach is used because all the detailed data related to the input and output requirements are already known, and it allows the user to observe how the project operates. The approach is iterative, with a trial-and-error process carried out between the developers and the users; this process continues until the final decision is made by the user. Users criticize the improvements that should be made and the developers alter the project until it is successful. The six phases involved in this methodology approach are analysis of previous research, information gathering, methodology definition, design, analysis of results and, lastly, project evaluation.

CHAPTER IV
ANALYSIS AND DESIGN
4.1Introduction
Analysis and design are crucial stages in the progress of the project. This chapter defines both the preliminary design and the detailed design of this project. System analysis is important in determining the objectives of a system or its parts. Before proceeding to the next stage, the previous work should be understood thoroughly, and the best way to use machine learning for botnet detection must be determined so that it is able to operate efficiently. In this chapter, the analysis and design of IoT botnet attack detection using machine learning are carried out, which involves collecting the requirements and performing the design phase once all requirements are defined. Six parts will be discussed in this chapter: problem analysis, requirement analysis, design of experiment, high-level design, analysis result and the overall TCP flags segment for this project. Requirement analysis consists of the data requirement, functional requirement, non-functional requirement and other requirements such as the hardware and software requirements.

4.2Problem Analysis
Given that IoT devices need to be connected to the internet, manufacturers have to ensure that what they produce poses minimal risk to the buying public. In the case of the test device, manufacturers should make sure that ports connecting to the device cannot be accessed directly from the internet. Manufacturers should also secure the data that is stored or compiled by these IoT devices and conduct security testing, because security on IoT devices is currently lacking. In this project, machine learning is used for detecting IoT botnet attacks; however, the existing machine learning approaches are not comprehensive enough.

4.3Requirement Analysis

This section will describe in detail the components involved in the project. Requirements must be actionable, measurable, testable, related to identified project needs and defined to a level of detail sufficient for system design.

4.3.1Data Requirement
This section covers the data needed for the implementation of machine learning. Wireshark is used to capture the dataset for analysis. Wireshark is an open source tool with many capabilities, including packet capture, filtering and sorting, together with graphical user interface feedback. Data samples of the Mirai botnet can be collected by running the attack in a simulation, and the traffic of the intrusions is captured with Wireshark. After the analysis, the result of the behavior analysis becomes the signature inserted into the machine learning custom rules in order to make those attacks detectable.
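As an illustration of this step, the short Python sketch below reads a Wireshark capture and writes the per-packet attributes used later in the dataset (time, source, destination, protocol and length) to a CSV file. It is a minimal sketch only, assuming the scapy library is available; the file names are placeholders rather than the actual files used in this project.

# Minimal sketch: extract per-packet attributes from a Wireshark capture (placeholder file names).
import csv
from scapy.all import rdpcap, IP, TCP, UDP

packets = rdpcap("botnet_sample.pcap")          # placeholder capture file

with open("botnet_sample.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["time", "source", "destination", "protocol", "length"])
    for pkt in packets:
        if not pkt.haslayer(IP):
            continue                            # keep only IP traffic in this sketch
        proto = "TCP" if pkt.haslayer(TCP) else "UDP" if pkt.haslayer(UDP) else "IP"
        writer.writerow([float(pkt.time), pkt[IP].src, pkt[IP].dst, proto, len(pkt)])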

4.3.2Hardware Requirement
This section identifies the hardware items necessary for the experiment, which must fulfill a certain specification in order to run the machine learning. The hardware items used in the project are listed in Table 4.1 below.

Table 4.1 Hardware Requirements
Hardware Type Minimum Requirement
Processor Dell OptiPlex 7010 PC
Memory 4 GB RAM
Monitor Dell
Keyboard and Mouse Any
Printer HP Deskjet 3635
Laptop Asus A550L
Memory 4 GB RAM
4.3.3Software Requirement
This section identifies the specific software necessary to perform the project activities. All the requirements for the software are listed in Table 4.2 below.

Table 4.2 Software Requirements
No Software Function
1. Windows 7 Professional Windows 7 is a personal computer operating system developed by Microsoft. Additional features include support for up to 192 GB of random-access memory (increased from 16 GB), operating as a Remote Desktop server, location aware printing, backup to a network location, Encrypting File System, Presentation Mode, Software Restriction Policies and Windows XP Mode.

2. Wireshark This software will act as traffic capturing application for datagram analysis of the botnet attack.

3. Microsoft Office Word 2016 It is very common and trusted tool. This tool is used in documentation of the project.

4. Microsoft Office Visio 2013 This is software for creating professional diagrams; it uses vector graphics to produce diagrams such as data flow diagrams and to simplify complex information.

5. Weka Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data pre-processing, classification, regression, clustering, association rules and visualization.

4.4Design of Experiment

Generally, the physical design refers to the actual layout of the physical part of the project. This includes the router, switch, workstations etc. Figure 4.1 below will show the physical setting for the project’s environment.

Figure 4.1 IoT Botnet Network Setup
4.5High-Level Design
This section proposes the design to be implemented in this project. The proposed framework shows the structure of the botnet attack environment, and the architecture diagram is shown in Figure 4.2. Firstly, the malicious file is downloaded from VirusShare.com. Next, the file is extracted and the malicious sample is run by double-clicking it. The signature and behavior of the malicious file then become visible in Wireshark, and a report is produced.

Figure 4.2 High-Level Design Diagram
4.6Analysis Result
Malware Sample
The malware samples collected for the experiment originated from the online sources virusshare.com and malwares.com. The malware analyzed in this project is a Trojan. A computer Trojan is a standalone malicious program that does not replicate itself in order to spread to other computers; instead, it is often disguised as legitimate software. Trojans contain malicious code that, when triggered, can cause loss or even theft of data. Trojans can be employed by cyber-thieves and hackers trying to gain access to users' systems. Users are typically tricked by some form of social engineering into loading and executing Trojans on their systems. Once activated, Trojans can enable cyber-criminals to spy on the user, steal sensitive data and gain backdoor access to the system. The Trojan samples analyzed here are identified according to their MD5 hashes.

File Sample Botnet I

Figure 4.3 Malware Sample I
File Sample Botnet II

Figure 4.4 Malware Sample II
Malware Analysis in Wireshark
Wireshark is a network packet analyzer: it captures network packets and displays the packet data in as much detail as possible.

The Three-Way Handshake
TCP utilizes a number of flags, or 1-bit Boolean fields, in its header to control the state of a connection. The flags most relevant to the three-way handshake and to the connection teardown observed here are:
SYN – (Synchronize) Initiates a connection
FIN – (Final) Cleanly terminates a connection
ACK – Acknowledges received data
PSH – Push Function

Figure 4.5 Three-way Handshake Data Flow
A packet can have multiple flags set. Wireshark captures packets and lets the user examine their contents. First, select any packet in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. Here all of the TCP flags are broken down. Once the three-way handshake is complete, the data is communicated by exchanging segments, and the PSH flag is set to indicate that the data flows on the connection as a stream of octets.
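The same flag inspection can be scripted outside the Wireshark GUI. The sketch below is an illustration only: it decodes the individual flag bits of one captured TCP segment with the Python scapy library, and the capture file name is a placeholder.

# Illustrative sketch: decode the TCP flag bits of a captured segment.
from scapy.all import rdpcap, TCP

FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10}

packets = rdpcap("botnet_sample.pcap")          # placeholder capture file
tcp_segments = [p for p in packets if p.haslayer(TCP)]

segment = tcp_segments[0]                       # pick any TCP segment
flags = int(segment[TCP].flags)
set_flags = [name for name, bit in FLAG_BITS.items() if flags & bit]
print("Flags set:", ", ".join(set_flags))       # e.g. "PSH, ACK"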

TCP Window Size
The TCP window size is very important as a speed and congestion factor. A small window can slow down the transmission, and if a computer advertises very low window sizes, or window sizes of zero, it may be in trouble. A zero window size typically indicates a busy receiver that cannot accept more data for the moment. The TCP window scale option increases the receive window size allowed by TCP above its former maximum value of 65,535 bytes.

Sequence and Acknowledgement Numbers
Sequence number is 32 bits long. Each sequence number identifies the byte in the stream of data from the sending TCP to the receiving TCP that the first byte of data in this segment represents. After the Three-way handshake, the connection is open and the participant computers start sending data using the agreed sequence and acknowledge numbers.

TCP Retransmission
Each byte of data sent in a TCP connection has an associated sequence number. When the receiving socket detects an incoming segment of data, it uses the acknowledgement number in the TCP header to indicate receipt. After sending a packet of data, the sender will start a retransmission timer of variable length. If it does not receive an acknowledgment before the timer expires, the sender will assume the segment has been lost and will retransmit it.
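Wireshark marks these events with its own heuristics (for example the tcp.analysis.retransmission display filter). As a rough, simplified sketch of the same idea, the example below flags segments whose sequence number repeats within the same flow; it is only an approximation of the real analysis, and the capture file name is a placeholder.

# Simplified sketch: flag possible retransmissions by repeated sequence numbers per flow.
from scapy.all import rdpcap, IP, TCP

packets = rdpcap("botnet_sample.pcap")          # placeholder capture file
seen = set()
for pkt in packets:
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
        continue
    key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport, pkt[TCP].seq)
    if key in seen and len(pkt[TCP].payload) > 0:
        print("Possible retransmission:", key)
    seen.add(key)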
Summary panel
The summary panel captures information about the source and destination ports involved in the communication, the next sequence number to look out for and the different flags. The user can also see the time at which each packet was sent or received, the source IP address, the destination IP address, the protocol being used, the length of the packet and an info column, which is often one of the more important pieces of data. Each result of the analysis is shown with a different color in the GUI:
Black: Bad TCP. This is quite normal in Wireshark captures, for example when acknowledgment packets are transmitted via a slower path or the network load is very high.

Dark Grey: This is normal information that helps the user understand what has happened, such as a TCP segment with the FIN flag, which is used when a host wants to end an established connection.

Sky blue: This is TCP protocol traffic. It can include typical TCP data, such as segments with SYN or ACK flags, as well as everything below it (IP, Ethernet and frame protocols). Sometimes it includes an extra layer when the transmission uses TLSv1; this extra protocol is the Secure Sockets Layer, which provides application-layer encryption.

File Sample Botnet I

Figure 4.6 TCP Layer Analysis Sample Botnet I
Frame 3717

Figure 4.7 TCP Data Communication for Frame 3717
Select packet #3717 in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. Here the user can see the acknowledgement that the previously sent data packet was received. FIN is sent by a host when it wants to terminate the connection; the TCP protocol requires both endpoints to send a termination request. Note that the ACK and FIN flags are set (1). The server is sending data to the client, as shown in the packet.

A lot of information is displayed below, covering the frame, Ethernet, IP and TCP details. The source port is 80, while the destination port 49268 is a private port. The private (dynamic) ports range from 49152 to 65535; this range cannot be registered with IANA. When a web browser connects to a web server, the browser allocates itself a port in this range.
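The IANA ranges referred to here can be expressed as a small helper function; the sketch below is purely illustrative.

# Helper sketch: classify a TCP/UDP port into the IANA ranges mentioned in the text.
def port_range(port: int) -> str:
    if 0 <= port <= 1023:
        return "well-known"
    if 1024 <= port <= 49151:
        return "registered"
    if 49152 <= port <= 65535:
        return "dynamic/private (cannot be registered with IANA)"
    raise ValueError("port out of range")

print(port_range(80))     # well-known (HTTP)
print(port_range(49268))  # dynamic/private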

The IP address 192.229.232.200 is an unallocated (external) address. After analysis, this IP address was found to be located in the United States according to the website malwares.com, as shown in Figure 4.8. The website also lists about 62,601 records of this malicious sample's communication history.

Table 4.3 Frame 3717 Key values for the TCP Three-Way handshake
SYN flag 0
ACK flag 1
FIN flag 1
Sequence Number 1
Source IP Address 192.229.232.200
Destination IP Address 10.73.32.172
Source TCP Port Number 80 (Well-known port number for HTTP, through which a computer sends and receives web client-based communication)
Destination TCP Port Number 49268
Window Size 65535

Figure 4.8 Location of IP Address
Frame 3726
Figure 4.9 TCP Data Communication for Frame 3726
Select packet #3726 in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. Here the user can see the acknowledgement that the previously sent data packet was received. FIN is sent by a host when it wants to terminate the connection; the TCP protocol requires both endpoints to send a termination request. Note that the ACK and FIN flags are set (1). The server is sending data to the client, as shown in the packet.

A lot of information is displayed below, covering the frame, Ethernet, IP and TCP details. The source port is 80, while the destination port 49268 is a private port. The private (dynamic) ports range from 49152 to 65535; this range cannot be registered with IANA. When a web browser connects to a web server, the browser allocates itself a port in this range.

Table 4.4 Frame 3726 Key values for the TCP Three-Way handshake
SYN flag 0
ACK flag 1
FIN flag 1
Sequence Number 1
Source IP Address 192.229.232.200
Destination IP Address 10.73.32.172
Source TCP Port Number 80 (Well-known port number for HTTP, through which a computer sends and receives web client-based communication)
Destination TCP Port Number 49268
Window Size 65535
Frame 4647

Figure 4.10 TCP Data Communication for Frame 4647
Select packet #4647 in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. Here the user can see the acknowledgement that the previously sent data packet was received. The Push flag exists to ensure that the data is given the priority it deserves and is processed promptly at the sending or receiving end. Note that the ACK and PSH flags are set (1). The server is sending data to the client, as shown in the packet.

A lot of information is displayed below, covering the frame, Ethernet, IP and TCP details. The source port is 443, while the destination port 49270 is a private port. The private (dynamic) ports range from 49152 to 65535; this range cannot be registered with IANA. When a web browser connects to a web server, the browser allocates itself a port in this range.

The IP address 172.217.194.99 is an unallocated (external) address. After analysis, this IP address was found to be located in the United States according to the website malwares.com, as shown in Figure 4.11. The website also lists information about the malicious sample's communication history.

Table 4.5 Frame 4647 Key values for the TCP Three-Way handshake
SYN flag 0
ACK flag 1
FIN flag 0
Sequence Number 1
Source IP Address 172.217.194.99
Destination IP Address 10.73.32.172
Source TCP Port Number 443 (Well-known port number for HTTPS, where the Web Server is listening for incoming requests)
Destination TCP Port Number 49270
Window Size 250

Figure 4.11 Location of IP Address
Frame 4648

Figure 4.12 TCP Data Communication for Frame 4648
Select packet #4648 in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. Here the user can see the acknowledgement that the previously sent data packet was received. FIN is sent by a host when it wants to terminate the connection; the TCP protocol requires both endpoints to send a termination request. Note that the ACK and FIN flags are set (1). The server is sending data to the client, as shown in the packet.

A lot of information is displayed below, covering the frame, Ethernet, IP and TCP details. The source port is 443, while the destination port 49270 is a private port. The private (dynamic) ports range from 49152 to 65535; this range cannot be registered with IANA. When a web browser connects to a web server, the browser allocates itself a port in this range.

Table 4.6 Frame 4648 Key values for the TCP Three-Way handshake
SYN flag 0
ACK flag 1
FIN flag 1
Sequence Number 64
Source IP Address 172.217.194.99
Destination IP Address 10.73.32.172
Source TCP Port Number 443 (Well-known port number for HTTPS, where the Web Server is listening for incoming requests)
Destination TCP Port Number 49270
Window Size 250
Frame 4677

Figure 4.13 TCP Data Communication for Frame 4677
Select packet #4677 in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. Here the user can see the acknowledgement that the previously sent data packet was received. FIN is sent by a host when it wants to terminate the connection; the TCP protocol requires both endpoints to send a termination request. Note that the FIN, PSH and ACK flags are set (1). The server is sending data to the client, as shown in the packet.

A lot of information is displayed below, covering the frame, Ethernet, IP and TCP details. The source port is 443, while the destination port 49270 is a private port. The private (dynamic) ports range from 49152 to 65535; this range cannot be registered with IANA. When a web browser connects to a web server, the browser allocates itself a port in this range.

Table 4.7 Frame 4677 Key values for the TCP Three-Way handshake
SYN flag 0
ACK flag 1
FIN flag 1
Sequence Number 1
Source IP Address 172.217.194.99
Destination IP Address 10.73.32.172
Source TCP Port Number 443 (Well-known port number for HTTPS, where the Web Server is listening for incoming requests)
Destination TCP Port Number 49270
Window Size 250
The botnet DDoS attack has been analyzed using Wireshark. Initially, as a testing method, only two IP addresses out of three distinct IP addresses were taken to examine the DDoS attack. The data log was generated with the Wireshark network analyzer. In conclusion, the data indicate that the IP addresses 192.229.232.200 and 172.217.194.99 frequently communicate with 10.73.32.172, which is the infected machine. The same applies to the other IP addresses as well.
Table 4.8: Summary Sample Data of Botnet I
Packet # Source Destination Protocol
#3717 192.229.232.200 10.73.32.172 TCP
#3726 192.229.232.200 10.73.32.172 TCP
#4647 172.217.194.99 10.73.32.172 TCP
#4648 172.217.194.99 10.73.32.172 TCP
#4677 172.217.194.99 10.73.32.172 TCP
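A conversation summary such as Table 4.8 can also be produced programmatically. The sketch below is illustrative only: it counts how often each source/destination pair appears in the capture, which makes frequent talkers such as 192.229.232.200 to 10.73.32.172 stand out; the capture file name is a placeholder.

# Sketch: count source/destination pairs to surface the most frequent conversations.
from collections import Counter
from scapy.all import rdpcap, IP

packets = rdpcap("botnet_sample.pcap")          # placeholder capture file
pairs = Counter((pkt[IP].src, pkt[IP].dst) for pkt in packets if pkt.haslayer(IP))
for (src, dst), count in pairs.most_common(5):
    print(f"{src} -> {dst}: {count} packets")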
File Sample Botnet II
Each result of the analysis is shown with a different color in the GUI:
Black: Bad TCP. This is quite normal in Wireshark captures, for example when acknowledgment packets are transmitted via a slower path or the network load is very high.

Dark Grey: This is normal information that helps the user understand what has happened, such as a TCP segment with the SYN or FIN flag; the FIN flag is used when a host wants to end an established connection.
Sky blue: This is TCP protocol traffic. It can include typical TCP data, such as segments with SYN or ACK flags, as well as everything below it (IP, Ethernet and frame protocols). Sometimes it includes an extra layer when the transmission uses TLSv1; this extra protocol is the Secure Sockets Layer, which provides application-layer encryption.

Figure 4.14 TCP Layer Analysis Sample Botnet II
Frame 1920

Figure 4.15 TCP Data Communication for Frame 1920
Select packet #1920 in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. The server receives a SYN packet but cannot answer any more because it is overwhelmed; this connection will be ended after the server timeout, as described earlier. Note that the SYN flag is set (1); the source host is attempting to initiate a connection, as shown in the packet.

A lot of information is displayed below, covering the frame, Ethernet, IP and TCP details. The source port 27611 lies in the range 27500 to 27900, which this analysis associates with UDP services, while the destination port 7680 is a TCP port.

Table 4.9 Frame 1920 Key values for the TCP Three-Way handshake
SYN flag 1
ACK flag 0
FIN flag 0
Sequence Number 0
Source IP Address 10.73.39.131 (private IP Address)
Destination IP Address 10.73.33.48
Source TCP Port Number 27611
Destination TCP Port Number 7680
Window Size 64240
Frame 2408

Figure 4.16 TCP Data Communication for Frame 2408
Select packet #2408 in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. Here the user can see the acknowledgement that the previously sent data packet was received. The Push flag exists to ensure that the data is given the priority it deserves and is processed promptly at the sending or receiving end. Note that the ACK and PSH flags are set (1). The server is sending data to the client, as shown in the packet.

A lot of information is displayed below, covering the frame, Ethernet, IP and TCP details. The source port is 443, while the destination port 49304 is a private port. The private (dynamic) ports range from 49152 to 65535; this range cannot be registered with IANA. When a web browser connects to a web server, the browser allocates itself a port in this range.

The IP address 172.217.194.94 is an unallocated (external) address. After analysis, this IP address was found to be located in the United States according to the website malwares.com, as shown in Figure 4.17. The website also lists information about the malicious sample's communication history.

Table 4.10 Frame 2408 Key values for the TCP Three-Way handshake
SYN flag 0
ACK flag 1
FIN flag 0
Sequence Number 1
Source IP Address 172.217.194.94
Destination IP Address 10.73.33.234
Source TCP Port Number 443 (Well-known port number for HTTPS, where the Web Server is listening for incoming requests)
Destination TCP Port Number 49304
Window Size 246

Figure 4.17 Location of IP Address
Frame 2409
Figure 4.18 TCP Data Communication for Frame 2409
Select packet #2409 in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. Here the user can see the acknowledgement that the previously sent data packet was received. FIN is sent by a host when it wants to terminate the connection; the TCP protocol requires both endpoints to send a termination request. Note that the ACK and FIN flags are set (1). The server is sending data to the client, as shown in the packet.

A lot of information is displayed below, covering the frame, Ethernet, IP and TCP details. The source port is 443, while the destination port 49304 is a private port. The private (dynamic) ports range from 49152 to 65535; this range cannot be registered with IANA. When a web browser connects to a web server, the browser allocates itself a port in this range.

Table 4.11 Frame 2409 Key values for the TCP Three-Way handshake
SYN flag 0
ACK flag 1
FIN flag 1
Sequence Number 64
Source IP Address 172.217.194.94
Destination IP Address 10.73.33.234
Source TCP Port Number 443 (Well-known port number for HTTPS, where the Web Server is listening for incoming requests)
Destination TCP Port Number 49304
Window Size 246
Frame 2412

Figure 4.19 TCP Data Communication for Frame 2412
Select packet #2412 in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. Here the user can see the acknowledgement that the previously sent data packet was received. FIN is sent by a host when it wants to terminate the connection; the TCP protocol requires both endpoints to send a termination request. Note that the ACK and FIN flags are set (1). The server is sending data to the client, as shown in the packet.

A lot of information is displayed below, covering the frame, Ethernet, IP and TCP details. The source port is 443, while the destination port 49304 is a private port. The private (dynamic) ports range from 49152 to 65535; this range cannot be registered with IANA. When a web browser connects to a web server, the browser allocates itself a port in this range.

Table 4.12 Frame 2412 Key values for the TCP Three-Way handshake
SYN flag 0
ACK flag 1
FIN flag 1
Sequence Number 64
Source IP Address 172.217.194.94
Destination IP Address 10.73.33.234
Source TCP Port Number 443 (Well-known port number for HTTPS, where the Web Server is listening for incoming requests)
Destination TCP Port Number 49304
Window Size 246
Frame 2419

Figure 4.20 TCP Data Communication for Frame 2419
Select packet #2419 in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. Here the user can see the acknowledgement that the previously sent data packet was received. The Push flag exists to ensure that the data is given priority and is processed at the sending or receiving end, while the FIN flag appears in the last packets exchanged on a connection and is sent by a host when it wants to terminate the connection. Note that the FIN, PSH and ACK flags are set (1). The server is sending data to the client, as shown in the packet.

A lot of information is displayed below, covering the frame, Ethernet, IP and TCP details. The source port is 443, while the destination port 49304 is a private port. The private (dynamic) ports range from 49152 to 65535; this range cannot be registered with IANA. When a web browser connects to a web server, the browser allocates itself a port in this range.

Table 4.13 Frame 2419 Key values for the TCP Three-Way handshake
SYN flag 0
ACK flag 1
FIN flag 1
Sequence Number 1
Source IP Address 172.217.194.94
Destination IP Address 10.73.33.234
Source TCP Port Number 443 (Well-known port number for HTTPS, where the Web Server is listening for incoming requests)
Destination TCP Port Number 49304 (Private Port Number, between 49152–65535)
Window Size 246
Frame 9802

Figure 4.21 TCP Data Communication for Frame 9802
Select packet #9802 in Wireshark, expand the TCP layer analysis in the middle pane and further expand the “Flags” field within the TCP header. The server receives a SYN packet but cannot answer any more because it is overwhelmed; this connection will be ended after the server timeout, as described earlier. Note that the SYN flag is set (1); the source host is attempting to initiate a connection, as shown in the packet.

A lot of information is displayed below, covering the frame, Ethernet, IP and TCP details. The source port 27602 lies in the range 27500 to 27900, which this analysis associates with UDP services, while the destination port 7680 is a TCP port.
Table 4.14 Frame 9802 Key values for the TCP Three-Way handshake
SYN flag 1
ACK flag 0
FIN flag 0
Sequence Number 0
Source IP Address 10.73.39.131 (private IP Address)
Destination IP Address 10.73.33.137
Source TCP Port Number 27602
Destination TCP Port Number 7680
Window Size 64240
The botnet DDoS attack has been analyzed using Wireshark. Initially, as a testing method, only one external IP address out of five distinct IP addresses was taken to examine the DDoS attack. The data log was generated with the Wireshark network analyzer. In conclusion, the data indicate that the IP address 172.217.194.94 frequently communicates with 10.73.33.234, while 10.73.39.131 communicates with 10.73.33.48 and 10.73.33.137; these 10.73.33.x hosts are the infected machines. The same applies to the other IP addresses as well.
Table 4.15: Summary Sample Data of Botnet II
Packet # Source Destination Protocol
#1920 10.73.39.131 10.73.33.48 TCP
#2408 172.217.194.94 10.73.33.234 TCP
#2409 172.217.194.94 10.73.33.234 TCP
#2412 172.217.194.94 10.73.33.234 TCP
#2419 172.217.194.94 10.73.33.234 TCP
#9802 10.73.39.131 10.73.33.137 TCP
Protocol Hierarchy Statistic
The user can see exactly which protocols are being used on the network from the Protocol Hierarchy tool, located under the Statistics menu. Each row contains the statistical values of one protocol, and the Display filter field shows the current display filter.
File Sample Botnet I
Packets usually contain multiple protocols, so more than one protocol is counted for each packet. For example, IP may account for 100% and TCP for 100%, which together sum to much more than 100%. The difference is due to TCP packets that have no data, known as “pure TCP” or sometimes “naked TCP”.

Figure 4.22 Protocol Hierarchy Statistic Sample Botnet I
These would include the ACK packets with no data and the FIN packets. If a packet has no data, Wireshark does not consider it to be HTTP even if it uses port 80 and even if it is part of an HTTP session; it is counted as TCP only. This is how Wireshark treats all higher-level protocols that run on TCP. To see these packets, apply a display filter of “tcp.len==0”.
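The same "pure TCP" packets can be counted outside Wireshark as well. The sketch below mirrors the tcp.len==0 display filter by selecting TCP segments that carry no payload; it is a rough illustration, and the capture file name is a placeholder.

# Sketch: count TCP segments with no payload ("pure TCP"), mirroring the tcp.len==0 filter.
from scapy.all import rdpcap, TCP

packets = rdpcap("botnet_sample.pcap")          # placeholder capture file
pure_tcp = [p for p in packets if p.haslayer(TCP) and len(p[TCP].payload) == 0]
print(f"{len(pure_tcp)} of {len(packets)} packets are TCP segments without data")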

Figure 4.23 Higher-Level Protocol in Sample Botnet I
File Sample Botnet II
Packets usually contain multiple protocols, so more than one protocol is counted for each packet. For example, IP may account for 100% and TCP for 100%, which together sum to much more than 100%. The difference is due to TCP packets that have no data, known as “pure TCP” or sometimes “naked TCP”.

Figure 4.24 Protocol Hierarchy Statistic Sample Botnet II
These would include the SYN packets, ACK packets with no data and FIN packets. If a packet has no data, then Wireshark does not consider it to be HTTP even if it uses port 80 and even if it is part of an HTTP session. It is TCP only. This is how Wireshark treats all higher-level protocols that run on TCP. To see these packets, apply a display filter of “tcp.len==0”.

Figure 4.25 Higher-Level Protocol in Sample Botnet II

Flow Graph
This tool enables the user to track the behavior of TCP connections because it intuitively illustrates how sequence and acknowledgement numbers are used throughout the duration of a TCP session. It shows a detailed flow graph of every message used in a particular TCP stream, together with comments that help in understanding the flow of messages.
Navigate to Statistics > Flow Graph and select TCP flow. Wireshark automatically builds a graphical summary of the TCP flow. Each row represents a single TCP packet. The left column indicates the direction of the packet, the TCP ports, the segment length and the flag(s) set. The column at the right lists the relative sequence and acknowledgement numbers in decimal. Selecting a row in this column also highlights the corresponding packet in the main window.

File Sample Botnet I

Figure 4.26 Flow Graph Sample I
Figure 4.26 shows the connection initiation process between the server and the client. Once the connection is established, the data frames start to flow. The essential details of a frame are shown in the flow graph. For instance, the time of transmission, the size of the frame, the sequence number of the frame and the TCP ports used for the connection. General flow will show all captured or displayed packets. In this flow graph, the flags that are involved are FIN, ACK, PSH, ACK and FIN, PSH, ACK.

Within a short period of time there are a number of connection attempts from the IP address 192.229.232.200 (source) port 80 to port 49268 of machine 10.73.32.172 (destination). The FIN, ACK flags require both endpoints to send a termination request, and the user can see the acknowledgement that the previously sent data packet was received. The server has tried to resolve the MAC address of the client machine several times, but no response is received, so it does not have the physical address of the host. The PSH flag tells the receiver's network stack to “push” the data straight to the receiving socket and not to wait for any more packets before doing so. Once the connection is established, the ACK flag being set has very little significance on its own. Next, there are also a number of connection attempts from the IP address 172.217.194.99 (source) port 443 to port 49270 of machine 10.73.32.172 (destination). The FIN, PSH, ACK combination may also be seen at the beginning of a graceful teardown.
File Sample Botnet II

Figure 4.27 Flow Graph Sample II
Figure 4.27 shows the connection initiation process between the server and the client. Once the connection is established, the data frames start to flow. The essential details of a frame are shown in the flow graph. For instance, the time of transmission, the size of the frame, the sequence number of the frame and the TCP ports used for the connection. General flow will show all captured or displayed packets. In this flow graph, the flags that are involved are SYN, PSH, ACK, FIN, ACK and FIN, PSH, ACK.

Within a short period of time there are a number of connection attempts from the IP address 10.73.39.131 (source) port 27611 to port 7680 of machine 10.73.33.48 (destination). A large number of TCP segments with the SYN flag are sent from the client, but no response is received from the server. The server has tried to resolve the MAC address of the client machine several times, but no response is received, so it does not have the physical address of the host and cannot send the SYN, ACK reply needed to continue with the three-step connection. Next, there are also a number of connection attempts from the IP address 172.217.194.94 (source) port 443 to port 49304 of machine 10.73.33.234 (destination). The PSH flag tells the receiver's network stack to “push” the data straight to the receiving socket and not to wait for any more packets before doing so. Once the connection is established, the ACK flag being set has very little significance on its own. The FIN, ACK flags require both endpoints to send a termination request, and the user can see the acknowledgement that the previously sent data packet was received. Besides this, the FIN, PSH, ACK combination may also be seen at the beginning of a graceful teardown.

I/O Graph
Wireshark IO Graphs show the overall traffic seen in a capture file, usually measured as a rate per second in bytes or packets. This can be useful to graph the occurrence of events or packet exchanges over time, or to graph the relationship between multiple types of packets over time. It automates many analysis scenarios, eliminating manual compilation of such data, and makes it possible to see the highs and lows in the traffic, which can be used to rectify problems or for monitoring purposes. By default the x-axis is the tick interval (one second) and the y-axis is the number of packets per tick. The scale of the x and y axes can be altered if needed: the x-axis interval can range from 10 down to 0.001 seconds, and the y-axis can be shown in packets, bytes or bits. To create the IO graph, select any TCP packet in the capture file and then click IO Graph under the Statistics menu. Refer to the following screenshot:
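The same per-second view can be approximated with a few lines of Python. The sketch below bins packets into one-second intervals and plots the counts; it is only an approximation of the Wireshark IO Graph, and the capture file name is a placeholder.

# Sketch: approximate the Wireshark I/O Graph by binning packets into 1-second intervals.
from collections import Counter
import matplotlib.pyplot as plt
from scapy.all import rdpcap

packets = rdpcap("botnet_sample.pcap")          # placeholder capture file
start = float(packets[0].time)
bins = Counter(int(float(p.time) - start) for p in packets)     # second -> packet count

seconds = sorted(bins)
plt.plot(seconds, [bins[s] for s in seconds])
plt.xlabel("Time (s)")
plt.ylabel("Packets per second")
plt.title("I/O graph (1-second interval)")
plt.show()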
File Sample Botnet I
Figure 4.28 shows the I/O graph of the packets captured while running botnet sample I, with an interval of one second. The highest packet counts occur at around 170 seconds and 290 seconds, where they exceed 100 packets per second. Moreover, the highest number of TCP errors in the traffic occurs at around 150 seconds.

Figure 4.28 I/O Graph for Sample Botnet I

File Sample Botnet II
Figure 4.29 shows the I/O graph of the packets captured while running botnet sample II, with an interval of one second. The highest packet count occurs at around 135 seconds, where it exceeds 100 packets per second. TCP errors occur only between 45 and 90 seconds.

Figure 4.29 I/O Graph for Sample Botnet II
4.7Overall of TCP Flags Segment
For sample botnet I, the results in general depend on specific packet types such as FIN, ACK; PSH, ACK; and FIN, PSH, ACK. According to Guillaume (2017), the standard way to close a TCP session is to send a FIN packet and then wait for a FIN response from the other party. The FIN flag serves as a connection termination request to the other device, while possibly also carrying data like a regular segment. The connection as a whole is not considered terminated until both sides have finished the shutdown procedure by sending a FIN and receiving an ACK.
Table 4.16 Summary of TCP Flags Segment for both Samples
TCP Flags Segment Sample Botnet I Sample Botnet II
SYN Not found 18%
PSH, ACK 6% 9%
FIN, ACK 59% 18%
FIN, PSH, ACK 35% 55%
As shown in figure 4.30, the highest TCP flags segment for sample botnet I is FIN, ACK at 59% of the segments that communicated with the server, while FIN, PSH, ACK accounts for 35% and PSH, ACK for about 6%; no SYN segments were found in this sample.

Figure 4.30 Overall TCP Flags for Sample Botnet I
For sample botnet II, the results in general depend on specific packet types such as SYN; PSH, ACK; FIN, ACK; and FIN, PSH, ACK. According to Saravanan (2012), SYN combined with FIN is probably the best-known illegal combination, since SYN is used to start a connection while FIN is used to end an existing one; the same study states that packets carrying both SYN and FIN are malicious. When the malicious user sends a signal to the compromised hosts, they begin to attack the same server. As shown in figure 4.31, the highest TCP flags segment for this sample is FIN, PSH, ACK at 55% of the segments communicating with the server, while SYN and FIN, ACK each account for 18% and PSH, ACK for about 9%.
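The flag-combination percentages reported above can be recomputed directly from a capture. The sketch below tallies the SYN, PSH/ACK, FIN/ACK and FIN/PSH/ACK combinations and expresses each as a share of the TCP segments; it is illustrative only, and the capture file name is a placeholder.

# Sketch: tally TCP flag combinations and report each as a percentage of all TCP segments.
from collections import Counter
from scapy.all import rdpcap, TCP

FIN, SYN, PSH, ACK = 0x01, 0x02, 0x08, 0x10

def label(flags: int) -> str:
    parts = [name for name, bit in (("FIN", FIN), ("SYN", SYN), ("PSH", PSH), ("ACK", ACK))
             if flags & bit]
    return ", ".join(parts) if parts else "none"

packets = rdpcap("botnet_sample.pcap")          # placeholder capture file
combos = Counter(label(int(p[TCP].flags)) for p in packets if p.haslayer(TCP))
total = sum(combos.values())
for combo, count in combos.most_common():
    print(f"{combo}: {100 * count / total:.0f}%")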

Figure 4.31 Overall TCP Flags for Sample Botnet II
4.8Summary
This chapter is basically about the designs used before, during and after analyzing the project. It outlines the analysis and design part of this project to give a big picture of how the project is portrayed and works within its environment. Lastly, the thorough problem analysis facilitates the gathering of the data, functional, hardware and software requirements. Well-defined requirements ensure that the development stays on the right track to achieve its objective. The results and discussion will be presented in the next chapter.

CHAPTER V
RESULT AND DISCUSSION
5.1Introduction
This chapter elaborates on Weka, known as the Waikato Environment for Knowledge Analysis. Weka is a collection of machine learning algorithms for data mining tasks; the algorithms can be applied directly to a dataset, and Weka contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. The analysis and design made in the previous chapter are applied in this chapter, which describes the dataset, the environment setup and the discussion for each activity, including how the design came about and how it was completed. The basic design of this experiment was an isolated environment run on a real device but not connected to the internet. The first phase of the experiment gathered the needed data by running the mentioned attack from the attacker host and capturing the traffic with the network capture software Wireshark.
5.2Scenario Design
Before the experiment started, the host had to be set up to establish the botnet. This setup consists of only one item, a personal computer (PC) installed with the Windows operating system. The project uses the website VirusShare.com, which provides samples of live malicious code; the use of bots, scripts or other methods to scrape data from the site or to download samples at an excessive rate is discouraged by the site. When computers communicate, either on the local network or across the internet, they send bits of information called 'packets' to one another.
Wireshark is a very powerful tool with varied applications for capturing packets. It can filter the traffic based on the IP address of a device using its built-in filters. The user can right-click on any packet to inspect it, follow the conversation between both ends and filter the whole capture by IP address. Wireshark also shows the ports being used, so the port number can be looked up to see which applications use it. Wireshark saves captures in the .pcap file format.
For the detection method, Weka, a popular suite of machine learning software written in Java, is used. According to Anand et al. (2015), Weka operates on the assumption that the user data is available as a flat file or relation; this means that each data object is described by a fixed number of attributes, usually of a specific type, normally alphanumeric or numeric values. Weka contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. Weka prefers to load data in the ARFF format, an extension of the CSV file format in which a header provides metadata about the data types in the columns.
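As an illustration of this conversion, the sketch below writes the six attributes of this dataset (time, source, destination, protocol, length and class) from a labelled CSV export into a minimal ARFF file that Weka can load. The attribute typing and the file names are assumptions made for the sketch, not the exact files used in the project.

# Sketch: convert a labelled CSV export (time, source, destination, protocol, length, class)
# into a minimal ARFF file for Weka. Attribute types and file names are illustrative assumptions.
import csv

header = """@relation botnet_traffic

@attribute time numeric
@attribute source string
@attribute destination string
@attribute protocol string
@attribute length numeric
@attribute class {attack,normal}

@data
"""

with open("botnet_labelled.csv") as src, open("botnet_sample.arff", "w") as dst:
    dst.write(header)
    for row in csv.DictReader(src):
        dst.write("{time},'{source}','{destination}','{protocol}',{length},{cls}\n".format(
            time=row["time"], source=row["source"], destination=row["destination"],
            protocol=row["protocol"], length=row["length"], cls=row["class"]))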

The training and testing data are already selected and kept in separate files. After loading both the training and testing file, the classifier and its parameters are chosen and the classification is carried out. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a test file.

Figure 5.1 Project Flow of Statistical Test
5.3Methodology
For this project, an experimental methodology has been used. After a deep analysis of the botnet samples, a few attributes that can have a great impact on detection were identified; these attributes are referred to here as information variables. All the data were gathered and then filtered with the help of some manual techniques, and the filtered data were converted into the format used by Weka. After five minutes of monitoring, the raw packet data were passed to the sniffer server for calculation and statistics, and the results were output into a CSV text file. Weka then analyzes these identified attributes along with the corresponding implementation. After the analysis of the attributes, the machine learning methodology works in the sequence shown in figure 5.2.

Figure 5.2 Machine Learning Methodology
5.4Dataset Description
These datasets are taken from the trusted website VirusShare.com. Of the 10,355 packets in total, 9,890 are malicious while the remaining 465 packets are non-malicious. Both the malicious and non-malicious samples include six attributes. In addition, both kinds of traffic are labeled so that the detection model can be evaluated effectively.

Table 5.1 shows the size of each dataset, the number of attributes (the number of values contained in one instance) and the number of classes (the number of categories to be classified). The dataset has the attributes time, source, destination, protocol, length and class. The two output classes are categorized as attack and normal.

Table 5.1 List of the Dataset
Dataset Size (Bytes) Number of Attributes Number of Class
Malicious 569,739 bytes 6 2
Non-Malicious 24,996 bytes 6 2
Total 594,735 bytes 6 2
5.5Weka Explorer Preprocess Interface
The first view of Weka Explorer is the Preprocess tab. This interface is designed to manage the data sources.
A dataset is loaded using the Open file button (a random dataset can also be generated).

This frame summarizes the dataset characteristics, such as the relation (the name of the dataset), the number of attributes and the number of instances. Since the instances are not weighted, the sum of weights is always equal to the number of instances.

Each line of this frame corresponds to an attribute descriptor; selecting a line updates the information displayed in frames 4 and 5.
The Selected attribute frame summarizes statistics about the attribute selected in area 3: the minimum and maximum observed values, the mean and the standard deviation.

This frame provides a histogram of the observed values of the selected attribute.
The Status frame is used to display information about the current run.

Figure 5.3 Weka Preprocess Interface
This view provides histograms of all the attributes, namely time, source, destination, protocol, length and class. The histograms can be colored using the class information of the observed values: blue represents the attack (malicious) class while red represents the normal (non-malicious) class.

Figure 5.4 Histogram based on All Attributes
5.6Weka Classify Interface
Classifier
The Classifier frame is used to select a particular data mining method. The Choose button opens a tree of folders in which the methods are organized, and a click on the text box opens the configuration interface of the selected algorithm. The classification algorithms used in this project are RandomForest, J48, JRip, NaiveBayes and BayesNet. The Start button launches the calculations.

Figure 5.5 Weka Classify Interface
The Test options frame is dedicated to the different types of model validation. It is possible to assess the performance of a model on the training set, in cross-validation, or by splitting the initial dataset into training and test sets using a given percentage of the data for training.

Figure 5.6 Weka Test Options Interface
Test Options
It is possible to assess the performances of a model on training set or in cross-validation.

Cross-validation
Cross-validation is a simple method which guarantees that there is no overlap between the training and test sets, and also that there is no overlap between the k test sets, where k is the number of splits made in the dataset. Moreover, cross-validation trains and tests on every instance an equal number of times while reducing the variance of the accuracy score. For example, choosing k=10 splits the dataset into 10 parts (10 folds) and runs the algorithm 10 times; each time, the algorithm is trained on 90% of the data and tested on the remaining 10%, and each run changes which 10% of the data is used for testing.
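The same 10-fold procedure can be reproduced outside Weka; the sketch below uses scikit-learn classifiers that are roughly analogous to the Weka algorithms used here (a random forest, a C4.5-style decision tree and naive Bayes). It is a hedged illustration rather than the exact Weka runs reported below: the CSV file name is a placeholder, and the single 'length' feature is chosen only to keep the example short.

# Sketch: 10-fold cross-validation comparison, analogous in spirit to the Weka experiments.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedKFold, cross_val_score

data = pd.read_csv("botnet_labelled.csv")        # placeholder labelled dataset
X = data[["length"]]                             # a single numeric feature, for illustration
y = data["class"]                                # "attack" / "normal" labels

models = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "DecisionTree (J48-like)": DecisionTreeClassifier(random_state=0),
    "NaiveBayes": GaussianNB(),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {scores.mean() * 100:.3f}% accuracy")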

Correctly and Incorrectly Instance Classification
From Table 5.2, it can be concluded that for each classification algorithm (RandomForest, J48, JRip, NaiveBayes and BayesNet) the proportion of correctly classified instances is far higher than the proportion of incorrectly classified instances; together, the correctly and incorrectly classified instances account for all instances.
The important numbers to focus on here are the Correctly Classified Instances, 99.073%, and the Incorrectly Classified Instances, 0.723%, for the JRip algorithm. The classification results show that JRip gives the best results with the highest percentage, while NaiveBayes has the lowest at 97.847%. RandomForest at 98.909%, J48 at 98.580% and BayesNet at 98.262% all classify around 98% of the instances correctly. In conclusion, the accuracy on this dataset is good, with around 99% of the data records classified correctly.
Table 5.2 Correctly and Incorrectly Instance Classification of Cross-Validation
Algorithms Correctly Classified Instance Incorrectly Classified Instance
RandomForest 98.909% 1.091%
J48 98.580% 1.420%
JRip 99.073% 0.723%
NaiveBayes 97.847% 2.153%
BayesNet 98.262% 1.738%
The dataset in this project is tested and analyzed with five classification algorithms, namely RandomForest, J48, JRip, NaiveBayes and BayesNet, using the cross-validation test. To avoid overfitting, the accuracy is obtained using 10-fold cross-validation, which uses 9/10 of the data for training the algorithm and the remaining 1/10 for testing, repeating the process 10 times. All the statistical results are provided in figure 5.7. Moreover, a comparison of the accuracy of all classifiers shows that the JRip technique performs best with an accuracy of 99.073%.

Figure 5.7 Comparison of Correctly and Incorrectly Instances Based on Cross-Validation
Kappa Statistics
Table 5.3 shows the performance in terms of the Kappa statistic on the sample botnet dataset. It can be clearly seen that JRip has the highest Kappa statistic of 0.898, followed by RandomForest at 0.871. Next, the BayesNet algorithm has 0.816 while NaiveBayes has 0.778. The J48 algorithm has the lowest Kappa statistic, at only 0.024. According to Anthony et al. (2005), 1 is perfect agreement, 0 is exactly what would be expected by chance, and negative values indicate agreement less than chance. Overall, the performance under the cross-validation test option can be considered good because the values are greater than 0.
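For reference, the kappa statistic reported by Weka corresponds to Cohen's kappa; a minimal sketch of computing it from true and predicted labels with scikit-learn is shown below, using placeholder label lists.

# Sketch: Cohen's kappa from true and predicted labels (Weka's "Kappa statistic").
from sklearn.metrics import cohen_kappa_score

y_true = ["attack", "attack", "normal", "attack", "normal"]   # placeholder labels
y_pred = ["attack", "attack", "normal", "normal", "normal"]   # placeholder predictions
print(cohen_kappa_score(y_true, y_pred))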

Table 5.3 Kappa Statistics Comparison among Algorithms of Cross-Validation
Algorithms Kappa Statistics
RandomForest 0.871
J48 0.024
JRip 0.898
NaiveBayes 0.778
BayesNet 0.816

Using this scale, the kappa statistics of JRip (0.898), RandomForest (0.871) and BayesNet (0.816) are in the very good agreement range. Only NaiveBayes, at 0.778, falls in the good agreement range, while the J48 algorithm is in the poor agreement range because its kappa statistic of 0.024 is less than 0.20. According to Anthony et al. (2005), 1 is perfect agreement, 0 is exactly what would be expected by chance, and negative values indicate agreement less than chance. Overall, the performance can be considered good because all values are greater than 0, with the exception of J48, whose agreement is poor.

Figure 5.8 Kappa Statistics Size Comparison Based on Cross-Validation
Confusion Matrix
From the confusion matrix, the TPR, FPR, precision, recall and accuracy of the different algorithms have been obtained. The number of correctly classified instances is the sum of the diagonal elements in the matrix; the others are incorrectly classified. Table 5.4 shows the confusion matrices of RandomForest, J48, JRip, NaiveBayes and BayesNet under the cross-validation option.

Table 5.4 Values Confusion Matrix Comparison among Algorithms of Cross-Validation
Algorithms A (attack) B (normal)
RandomForest 9841 49
64 401
J48 9823 67
80 385
JRip 9811 79
17 448
NaiveBayes 9715 175
48 417
BayesNet 9750 140
40 425
The True Positive (TP) rate is the proportion of examples of a class (attack or normal) which were correctly classified as that class; it is equivalent to recall. It can be concluded that the RandomForest algorithm achieved the highest TP rate of 0.995 for the attack class, while the NaiveBayes algorithm reached the lowest value of 0.982. For the normal class, the highest TP rate is from the JRip algorithm at 0.963, while the lowest is from J48 at 0.828. In the confusion matrix, the TP rate is the diagonal element divided by the sum over the relevant row: 9,841/(9,841+49) = 0.995 for the attack class (RandomForest) and 448/(17+448) = 0.963 for the normal class (JRip).
The False Positive (FP) rate is the proportion of examples which were classified as a class (attack or normal) but belong to the other class. The J48 algorithm achieved the highest FP rate for the attack class, 80/(80+385) = 0.172, while the JRip algorithm reached the lowest value, 17/(17+448) = 0.037. For the normal class, the highest FP rate is from the NaiveBayes algorithm, 175/(175+9,715) = 0.018, while the lowest FP rate is from RandomForest, 49/(49+9,841) = 0.005.
From another perspective, the JRip algorithm reached the highest precision of 0.998 for the attack class while J48 has the lowest at 0.992, which indicates that JRip has a stable FP rate. In the matrix, precision is the diagonal element divided by the sum over the relevant column: 9,811/(9,811+17) = 0.998 for JRip and 9,823/(9,823+80) = 0.992 for J48. For the normal class, the highest precision is from RandomForest at 401/(401+49) = 0.891, while the lowest is from NaiveBayes at only 417/(417+175) = 0.704.

The F-Measure is 2 × Precision × Recall / (Precision + Recall), a combined measure of precision and recall. For the attack class, the F-Measure obtained for JRip is 2 × 0.998 × 0.992 / (0.998 + 0.992) = 0.995, which shows a highly accurate model, while for the normal class RandomForest obtains 2 × 0.891 × 0.862 / (0.891 + 0.862) = 0.876. These measures are useful for comparing classifiers.
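The per-class figures in Table 5.5 follow directly from the confusion-matrix counts. As a worked illustration, the sketch below recomputes TPR, FPR, precision, recall and F-measure from the RandomForest counts in Table 5.4, treating "attack" as the positive class.

# Sketch: recompute the metrics from the RandomForest confusion-matrix counts in Table 5.4.
tp, fn = 9841, 49     # attack instances classified as attack / as normal
fp, tn = 64, 401      # normal instances classified as attack / as normal

tpr = recall = tp / (tp + fn)          # true positive rate (recall)
fpr = fp / (fp + tn)                   # false positive rate
precision = tp / (tp + fp)
f_measure = 2 * precision * recall / (precision + recall)

print(f"TPR/Recall={tpr:.3f}  FPR={fpr:.3f}  Precision={precision:.3f}  F-Measure={f_measure:.3f}")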
Table 5.5 Values of TP, FP, Precision, Recall and F-Measure of Cross-Validation
Algorithms TPR FPR Precision Recall F-Measure Class
RandomForest 0.992 0.138 0.994 0.992 0.992 Attack
0.862 0.008 0.891 0.862 0.876 Normal
J48 0.993 0.172 0.992 0.993 0.993 Attack
0.828 0.007 0.852 0.828 0.840 Normal
JRip 0.995 0.037 0.998 0.995 0.992 Attack
0.963 0.005 0.891 0.963 0.926 Normal
NaiveBayes 0.982 0.103 0.995 0.982 0.989 Attack
0.897 0.018 0.704 0.897 0.789 Normal
BayesNet 0.986 0.086 0.996 0.986 0.991 Attack
0.914 0.014 0.752 0.914 0.825 Normal
Training Set
The training set is the set of examples used for learning, that is, to fit the parameters of the classifier. When the training set is also used as the test option, the model is evaluated on the same data it learned from. This tends to overestimate performance, because a classifier can simply memorise the training instances, and the resulting accuracy is therefore rarely reported on its own in publications; it is used here only as a comparison against the cross-validation results.
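The sketch below illustrates this comparison in Weka's Java API, assuming the same placeholder ARFF file as before: it evaluates a classifier on its own training data (resubstitution) and under 10-fold cross-validation, so the optimistic gap between the two accuracies can be observed directly.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainVsCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("botnet.arff").getDataSet();  // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Resubstitution: build on the full dataset and test on the same data
        RandomForest rf = new RandomForest();
        rf.buildClassifier(data);
        Evaluation trainEval = new Evaluation(data);
        trainEval.evaluateModel(rf, data);

        // 10-fold cross-validation on the same data for comparison
        Evaluation cvEval = new Evaluation(data);
        cvEval.crossValidateModel(new RandomForest(), data, 10, new Random(1));

        System.out.printf("Training set accuracy:     %.3f%%%n", trainEval.pctCorrect());
        System.out.printf("Cross-validation accuracy: %.3f%%%n", cvEval.pctCorrect());
    }
}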

Correctly and Incorrectly Instance Classification
Under the training set test option, the RandomForest classifier achieves 100% Correctly Classified Instances and 0% Incorrectly Classified Instances, so it classifies every record of this dataset correctly. Comparing the Correctly Classified Instances from cross-validation (around 98%) with the 100% obtained on the training set, the training-set figure is higher, but it does not by itself guarantee that the model will hold up on unknown or future data, because evaluation on the training data is optimistic.

For the JRip algorithm, 99.199% of instances are correctly classified and 0.802% are incorrectly classified. J48, NaiveBayes and BayesNet each classify around 98% of the data correctly.
Table 5.6 Correctly and Incorrectly Instance Classification of Training Set
Algorithm       Correctly Classified Instances    Incorrectly Classified Instances
RandomForest    100%                              0%
J48             98.735%                           1.265%
JRip            99.199%                           0.802%
NaiveBayes      97.875%                           2.125%
BayesNet        98.416%                           1.584%
The dataset in this project is tested and analysed with five classification algorithms, RandomForest, J48, JRip, NaïveBayes and BayesNet, using the training set test option. All the statistics are provided in Figure 5.9. A comparison of the accuracy of all classifiers shows that the RandomForest technique performs best, with an accuracy of 100%: the classifier is perfect, at least on the training data, with all instances classified correctly and all errors zero. As is usually the case, the training set accuracy is over-optimistic.

Figure 5.9 Comparison of Correctly and Incorrectly Instances Based on Training Set
Kappa Statistics
Table 5.7 shows the performance of the Kappa Statistic on the sample botnet dataset. RandomForest achieves a kappa statistic of 1.00, both J48 and JRip achieve 0.849, BayesNet achieves 0.831 and NaiveBayes 0.781. Overall, the performance can be considered good because the kappa statistic of every algorithm is well above 0 under the training set test option.

Table 5.7 Kappa Statistics Comparison among Algorithms of Training Set
Algorithm       Kappa Statistic
RandomForest    1.00
J48             0.849
JRip            0.849
NaiveBayes      0.781
BayesNet        0.831
According to Igor and Alexandre (2012), the kappa statistic is an analogue of the correlation coefficient. RandomForest achieves a perfect kappa statistic of 1.00; recall that perfect agreement corresponds to a kappa of 1.00 and chance agreement to 0. Using this scale, the kappa statistics of J48 (0.849), JRip (0.849) and BayesNet (0.831) fall in the very good agreement range, while NaiveBayes (0.781) reaches only good agreement. Overall, the performance can be considered good because all values fall within the 0.60 to 0.80 or 0.80 to 1.00 ranges.

Figure 5.10 Kappa Statistics Size Comparison Based on Training Set
Confusion Matrix
Table 5.8 presents the confusion matrices of RandomForest, J48, JRip, NaiveBayes and BayesNet under the training set test option. Treating the normal class as the positive class, there are 388 true positives, 9,836 true negatives, 54 false positives and 77 false negatives for the J48 algorithm.

Table 5.8 Values Confusion Matrix Comparison among Algorithms of Training Set
Algorithm       Actual class    Classified as attack (A)    Classified as normal (B)
RandomForest    attack          9,890                       0
                normal          0                           465
J48             attack          9,836                       54
                normal          77                          388
JRip            attack          9,816                       74
                normal          9                           456
NaiveBayes      attack          9,717                       173
                normal          47                          418
BayesNet        attack          9,764                       126
                normal          38                          427
Accuracy Measurement
The True Positive (TP) rate is the proportion of instances of a class (attack or normal) that were correctly classified as that class; it is equivalent to Recall. The RandomForest algorithm achieved the highest TP Rate of 1.000 for the attack class, while the NaïveBayes algorithm reached the lowest value of 0.983. For the normal class, the highest TP Rate also comes from RandomForest (1.000) and the lowest from J48 (0.834). In the confusion matrix, this is the diagonal element divided by the sum over the relevant row: 9,890/(9,890+0) = 1.000 for the attack class and 465/(0+465) = 1.000 for the normal class of RandomForest.
The False Positive (FP) rate is the proportion of instances classified as a class (attack or normal) that actually belong to the other class. The J48 algorithm recorded the highest FP Rate for the attack class, 77/(77+388) = 0.166, while RandomForest recorded the lowest, 0/(0+465) = 0.000. For the normal class, the highest FP Rate comes from NaiveBayes, 173/(173+9,717) = 0.017, and the lowest from RandomForest, 0/(0+9,890) = 0.000.
From another perspective, the RandomForest algorithm reached the highest precision of 1.000 for the attack class while J48 recorded the lowest, 0.992, which indicates that RandomForest maintains a stable FP Rate. In the matrix, precision is the diagonal element divided by the sum over the relevant column: 9,890/(9,890+0) = 1.000 for RandomForest and 9,836/(9,836+77) = 0.992 for J48. For the normal class, the highest precision comes from RandomForest, 465/(465+0) = 1.000, and the lowest from NaiveBayes, 418/(418+173) = 0.707.

The F-Measure is 2*Precision*Recall/(Precision+Recall), a combined measure of precision and recall. The RandomForest classifier obtains an F-Measure of 2*1.000*1.000/(1.000+1.000) = 1.000 for the attack class, reflecting a perfect fit on the training data, while NaiveBayes obtains 2*0.707*0.899/(0.707+0.899) = 0.792 for the normal class. These measures are useful for comparing classifiers.
Table 5.9 Values of TP, FP, Precision, Recall and F-Measure of Training Set
Algorithm       TPR     FPR     Precision   Recall   F-Measure   Class
RandomForest    1.000   0.000   1.000       1.000    1.000       Attack
                1.000   0.000   1.000       1.000    1.000       Normal
J48             0.995   0.166   0.992       0.995    0.993       Attack
                0.834   0.005   0.987       0.834    0.856       Normal
JRip            0.993   0.019   0.999       0.993    0.996       Attack
                0.981   0.007   0.860       0.981    0.917       Normal
NaiveBayes      0.983   0.101   0.995       0.983    0.989       Attack
                0.899   0.017   0.707       0.899    0.792       Normal
BayesNet        0.987   0.082   0.996       0.987    0.992       Attack
                0.918   0.013   0.772       0.918    0.839       Normal
Results
Performance Measure
Time is defined as how long it takes to build the final model in the training phase. The emphasis here is on the 10-fold cross-validation time, as it is the more realistic measure: 10-fold validation builds and tests the model ten times on the data and averages the results, whereas the training set method applies the algorithm to the data only once, which naturally makes it faster.
JRip produced the best time under cross-validation with 0.07 seconds, and RandomForest the best time under the training set option with 0.03 seconds. In cross-validation, NaïveBayes and J48 are second and third with 0.08 and 0.41 seconds respectively, followed by BayesNet and RandomForest with 0.55 and 2.97 seconds. Under the training set option, J48 and BayesNet are second and third with 0.05 and 0.06 seconds respectively, followed by NaiveBayes and JRip with 0.11 and 0.62 seconds.
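The build times above were taken from the Weka output; a comparable measurement can be sketched with the Java API by timing buildClassifier for each algorithm, as below. The figures obtained will vary with hardware and Weka version, and the ARFF file name is again a placeholder.

import weka.classifiers.Classifier;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildTimeComparison {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("botnet.arff").getDataSet();  // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = {
            new RandomForest(), new J48(), new JRip(), new NaiveBayes(), new BayesNet()
        };

        for (Classifier c : classifiers) {
            long start = System.nanoTime();
            c.buildClassifier(data);                       // training-set build only
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%-14s %.2f s%n", c.getClass().getSimpleName(), seconds);
        }
    }
}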

The timing results for both test options are shown in Figure 5.11. In conclusion, JRip (under cross-validation) and RandomForest (under the training set option) exhibit the best performance in terms of processing time and rank first among the classifiers for their respective test options.

Figure 5.11 Comparison of processing time of classifiers in seconds
Effectiveness Measure
Based on Table 5.10, the RandomForest classifier evaluated on the training set predicts 100% of the instances correctly, with a 0% detection rate and a 0% false alarm rate for the IoT botnet attack. The accuracy of J48 is 98.74% and of JRip 99.25%, while NaiveBayes and BayesNet produce accuracies of 97.88% and 98.42% respectively. Under the cross-validation method, the highest accuracy is 99.07% for the JRip classifier, followed by RandomForest with 98.91% and J48 with 98.58%; BayesNet and NaïveBayes reach 98.26% and 97.85% respectively.

Under cross-validation, the JRip classifier has the lowest detection rate with 10.90%, followed by J48 with 14.80% and RandomForest with 15%, compared with BayesNet and NaiveBayes at 24.80% and 29.60%. Under the training set option, RandomForest produces a 0% detection rate because it makes no errors on the training data, while J48 records 12.20% and JRip 14%; NaiveBayes and BayesNet produce detection rates of 29.30% and 22.80% respectively. Based on Table 5.10, it can be concluded that the JRip and RandomForest classifiers have the lowest detection rate values compared with the other classifiers.

The average false alarm rate under cross-validation is 0.50% for the JRip algorithm, 0.68% for J48 and 0.80% for RandomForest; BayesNet records 1.42% and NaiveBayes the highest FAR at 1.77%. Under the training set option, the average false alarm rate is 0% for RandomForest, 0.55% for J48 and 0.75% for JRip, with BayesNet at 1.27% and NaiveBayes again the highest at 1.75%. These results show that JRip and RandomForest perform best on false alarm rate, recording the lowest values across the test options.
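To connect Table 5.10 with the earlier per-class results, the sketch below recomputes, from the J48 training-set confusion matrix in Table 5.8, the overall accuracy (98.735%, matching Table 5.6) and the two per-class false positive rates (0.166 and 0.005, matching Table 5.9). The text does not spell out how the single false alarm rate figure in Table 5.10 is aggregated; it is worth noting that the 0.55% reported for J48 on the training set coincides with the FP rate of its normal class, but that is an observation rather than a definition given in the report.

public class EffectivenessMeasures {
    public static void main(String[] args) {
        // J48 training-set confusion matrix (Table 5.8); rows are the actual classes
        double attackAsAttack = 9836, attackAsNormal = 54;
        double normalAsAttack = 77,   normalAsNormal = 388;
        double total = attackAsAttack + attackAsNormal + normalAsAttack + normalAsNormal;

        // Accuracy = correctly classified / total (98.735%, as in Table 5.6)
        double accuracy = (attackAsAttack + normalAsNormal) / total * 100;

        // Per-class false positive rates, as listed in Table 5.9
        double fprAttack = normalAsAttack / (normalAsAttack + normalAsNormal) * 100;  // ~16.6% (0.166)
        double fprNormal = attackAsNormal / (attackAsNormal + attackAsAttack) * 100;  // ~0.55% (0.005)

        System.out.printf("Accuracy = %.3f%%  FPR(attack) = %.3f%%  FPR(normal) = %.3f%%%n",
                          accuracy, fprAttack, fprNormal);
    }
}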

Table 5.10 Performance Measure based on Accuracy, Detection Rate and False Alarm Rate
                    Cross-Validation                    Training Set
Classifier          A         DR        FAR             A         DR        FAR
RandomForest        98.91%    15%       0.50%           100%      0%        0%
J48                 98.58%    14.80%    0.68%           98.74%    12.20%    0.55%
JRip                99.07%    10.90%    0.80%           99.25%    14%       0.75%
NaiveBayes          97.85%    29.60%    1.77%           97.88%    29.30%    1.75%
BayesNet            98.26%    24.80%    1.42%           98.42%    22.80%    1.27%
(A = Accuracy, DR = Detection Rate, FAR = False Alarm Rate; all performance measures in %)

The JRip classifier under cross-validation produced an accuracy of 99.07%. However, the best result in this project was achieved with the RandomForest classifier on the training set, at 100%. Although 10-fold validation usually produces the more reliable estimate, its best result in this study is 99.07%, while the best training-set result is 100%. The comparison of accuracies indicates that an improvement has been achieved in this project by selecting the most suitable network traffic features, which is also reflected in the detection rate.

Figure 5.12 Comparison of Accuracy
In this project, 9,890 malicious and 465 non-malicious data samples were used, and the results are expressed in terms of the performance measures above. Detection rate is conventionally defined in terms of the true positive rate (TPR), the probability of correctly detecting an instance as malware; in this project, however, the detection-rate values reported in Table 5.10 are used such that a lower value indicates a more effective classifier. The detection rate results are shown in Figure 5.13. The JRip and RandomForest classifiers produced detection rates of 10.90% and 0% respectively: the best result under 10-fold validation is 10.90%, while the best result under the training set option is 0%.

Figure 5.13 Comparison of Detection Rate
The false alarm rate corresponds to the number of instances detected as attacks that are in fact normal. Under cross-validation, the test results are collected and averaged over all folds, giving the cross-validation estimate of this measure. All the statistics are provided in the figure below. A comparison of the false alarm rate of all classifiers shows that the JRip technique performs best under cross-validation, with the lowest false alarm rate of 0.50%.
Figure 5.14 shows that the RandomForest classifier is able to reach a 0% false alarm rate for the IoT botnet attack on the training set. The improvement of the overall detections in the signature-based module, compared with the classification results in the data mining module, indicates that this signature-based approach is technically effective for attack detection. Therefore, it can be summarised that RandomForest is a more effective classifier than JRip in this respect, because it can reach a 0% false alarm rate on the dataset.

Figure 5.14 Comparison of False Alarm Rate
5.7Conclusion
As a conclusion, this chapter gives a clear picture of how the project was carried out, the methods needed and the management techniques applied. The activities in the implementation phase transform the output of the data analysis and design phases using the Weka software. The suitability of the machine learning techniques involved ensures that this project can be completed within the specified time without any problem. The environment is based on the network design and software design respectively. Last but not least, the overall conclusion of the project is discussed in the next chapter.

CHAPTER VI
PROJECT CONCLUSION
6.1Introduction
This chapter discusses and concludes the overall outcomes obtained in this project.

This is the final phase of the project, in which the objectives are reviewed to confirm that they have been met and fulfilled. The project is summarised by restating its objectives and how they were achieved. Apart from that, the contribution of the project, how it was analysed and its limitations are also explained in this chapter, so that improvements can be made in future studies and research.
6.2Project Summarization
IoT devices have become an attractive target because of the rapid development in compute-intensive device technologies. Ultimately, these trends have opened the door for cybercriminals to extend their malicious activities to this evolving platform. In order to detect possible attacks on IoT devices, such as DDoS attacks, proper analysis is needed. The first objective of this project is to study the possible attacks used to infect IoT devices. The IoT botnet phenomenon aims to gain illegitimate access to IoT devices to carry out various malicious activities. Once this behaviour has been identified, the second objective is to analyse the behaviour of the IoT botnet attack in terms of its basic modes of operation and communication.

This project uses Weka, a collection of machine learning algorithms for data mining tasks. The framework is decomposed into two components: a dynamic analysis component and a learning component. During dynamic analysis, the applications are executed in a controlled environment and the traffic captured with Wireshark is collected for further classification in Weka. Finally, in the learning component, the sample of a known botnet dataset is trained with five classifiers: RandomForest, J48, JRip, NaiveBayes and BayesNet. These machine learning classifiers are compared to determine the most suitable classification algorithm to draw a clear line between botnet traffic and other types of traffic.

The analysis of the botnet and its threat is carried out through experiments, and the machine learning models should be able to detect a botnet attack. Once the signature of the attack has been identified, the last objective is to measure the best machine learning method for network-based botnet detection using Weka. Therefore, it can be summarised that this signature-based detection has good predictive capability and is able to distinguish between normal and attack events across thousands of records for each variant.

6.3Project Contribution
This project helps users study the IoT botnet attack using machine learning classifiers in a short time. It also contributes to identifying a specific IoT botnet attack and its behaviour. In addition, a comparison between the classifiers has been made to improve the time used to train and generalise the machine learning models, as shown in the experiment section. The project also demonstrates how machine learning can be used as a simple detection method in a real environment.

6.4Project Limitation
There were some constraints and limitations discussed in the project:
The Weka machine learning software cannot easily be applied to very large datasets; there is a limit to the size of dataset that the Weka Explorer can handle.
This approach lacks proper and adequate documentation. When terms are captured from these documents, thousands of terms are found; however, some of them are useless or uninteresting for the results, so it is important to discover and interpret which features are useful and critical.

6.5Future Work
Every project has weaknesses and strengths, and this project is no exception. Because of these, the study of this field can always be analysed further, which is why future work is very important. There are several things that need to be explored further, especially in the approach itself and the experiments.
As this project represents a preliminary investigation, there are a number of potential avenues for further work, such as an in-depth evaluation of why different algorithms exhibit different classification accuracy and computational performance. In a wider context, investigating the robustness of machine learning classification and comparing machine learning with non-machine-learning techniques on an identical dataset would also be valuable. Future work should also explore different methods for sampling and constructing training datasets.

6.6Conclusion
Ultimately, this project succeeds in fulfilling all the objectives set, by analysing the IoT botnet behaviour and measuring the best machine learning classifier for IoT botnet detection. Beyond the objectives, several expected results were proposed from the beginning of the project, namely to evaluate machine learning classifiers for effective detection of IoT botnet flows with high predictive accuracy, and to study, understand, analyse and summarise the behaviour of the IoT botnet attack using machine learning. Furthermore, detecting a botnet often requires advanced analysis capabilities related to the data selected for analysis and the characteristics of the issues examined.

Chapter II presented the literature review justifying the methodology, techniques and parameters used in past research related to the project title. It covered the methodology approached and related work on IoT botnets that affect unsecured devices, which indirectly raises users' awareness of how important advanced security features are. Overall, the literature review provided the background of the whole project to make sure the study was carried out based on the topics and subtopics mentioned.

Next, Chapter III concluded the project methodology followed from the beginning to the end of the project. The methodology is easy to understand and apply: it starts with the analysis of previous research and information gathering, then methodology, design, analysis of results and lastly project evaluation. With this methodology, the project is divided into small phases, each of which must be completed before the next can begin. The chapter also discussed the project milestones used to keep an eye on the progress of the project.

Chapter IV discussed the problem analysis, requirement analysis, high-level design and detailed design for this project. It also explained in detail the problem statement mentioned in Chapter I and described the requirement analysis, covering the hardware and software requirements. One of the methods used in the project is Wireshark, in which the network traffic and ports are observed to analyse the malware behaviour.

Chapter V elaborated on Weka and discussed the scenario design, consisting of the project flow and the machine learning methodology. The dataset, environment setup and discussion of each activity were described in this chapter, showing how the design was realised and completed. In Weka, it is possible to assess the performance of a model on the training set or with cross-validation.
