# Grokking Algorithms (算法图解)

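Selection sort
- Complexity
- Algorithm
What is recursion?
What is the call stack?
What is divide and conquer (D&C)?
Common algorithms
- Binary search
Quicksort
Hash functions
Breadth-first search
Dijkstra's algorithm
Approximation algorithms
Dynamic programming
O(1), O(n), O(log n), O(n log n) explained
K-nearest neighbors (KNN)
Binary search trees
Inverted indexes
Bloom filters
HyperLogLog
SHA algorithms
SimHash
Diffie-Hellman and RSA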

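# Selection sort

# Complexity

Selection sort is a neat algorithm, but it is not very fast: it runs in O(n^2) time. Quicksort is faster, with a running time of O(n log n).

# Algorithm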

# Finds the smallest value in an array
def findSmallest(arr):
    # Stores the smallest value
    smallest = arr[0]
    # Stores the index of the smallest value
    smallest_index = 0
    for i in range(1, len(arr)):
        if arr[i] < smallest:
            smallest = arr[i]
            smallest_index = i
    return smallest_index


# Sort array
def selectionSort(arr):
    newArr = []
    for i in range(len(arr)):
        # Finds the smallest element in the array and adds it to the new array
        smallest = findSmallest(arr)
        # Pop the smallest element out of arr and append it to the new array
        newArr.append(arr.pop(smallest))
    return newArr


print(selectionSort([5, 3, 6, 2, 10]))
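
To make the O(n^2) cost concrete, here is a minimal sketch of an in-place variant that also counts comparisons. The name `selection_sort_in_place` and the comparison counter are illustrative assumptions, not part of the original notes.

```python
def selection_sort_in_place(values):
    """Sort values in place and return the number of comparisons made."""
    comparisons = 0
    n = len(values)
    for i in range(n - 1):
        # Assume the first unsorted element is the smallest, then scan the rest.
        smallest_index = i
        for j in range(i + 1, n):
            comparisons += 1
            if values[j] < values[smallest_index]:
                smallest_index = j
        # Move the smallest remaining element to the front of the unsorted part.
        values[i], values[smallest_index] = values[smallest_index], values[i]
    return comparisons


data = [5, 3, 6, 2, 10]
count = selection_sort_in_place(data)
print(data)   # [2, 3, 5, 6, 10]
print(count)  # 10, which is n * (n - 1) / 2 for n = 5
```

The comparison count is always n * (n - 1) / 2 regardless of the input order, which is why the running time is quadratic.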

# What is recursion?

Recursion is when a function calls itself.
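
As a minimal sketch of the idea, the hypothetical `countdown` function below calls itself with a smaller argument until it hits a base case. It assumes a non-negative integer argument.

```python
def countdown(i):
    """Return the numbers from i down to 0 as a list, built recursively."""
    # Base case: stop recursing once we reach 0.
    if i <= 0:
        return [0]
    # Recursive case: the function calls itself with a smaller argument.
    return [i] + countdown(i - 1)


print(countdown(3))  # [3, 2, 1, 0]
```

Without the base case the function would keep calling itself until Python raises a `RecursionError`.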

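# What is the call stack?

Internally, the computer uses a stack called the call stack. Suppose you call greet("maggie"): the computer first allocates a block of memory for that function call. When the function returns, the block on top of the stack is popped off.

When greet calls greet2, greet has only partially executed. This is the key idea of this section: when you call another function, the current function is paused in a partially completed state. Once greet2 returns, greet resumes and eventually returns as well. This stack, which stores the variables of multiple functions, is called the call stack.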
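
The notes mention greet and greet2 without showing them, so the bodies below are assumptions for illustration only. The sketch records each event in a list (the hypothetical `calls`) so the pause-and-resume order is visible.

```python
calls = []  # records the order in which things happen


def greet2(name):
    calls.append("greet2: how are you, " + name + "?")


def bye():
    calls.append("bye: ok bye!")


def greet(name):
    calls.append("greet: hello, " + name + "!")
    greet2(name)  # greet is paused here until greet2 returns
    calls.append("greet: getting ready to say bye...")
    bye()         # greet is paused again until bye returns


greet("maggie")
for line in calls:
    print(line)
```

The second "greet:" entry appears only after greet2 has finished, which shows that greet was waiting on the call stack in the meantime.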

A recursive factorial function:

def fact(x):
    # Base case: the factorial of 1 is 1
    if x == 1:
        return 1
    # Recursive case: x! = x * (x - 1)!
    else:
        return x * fact(x - 1)

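Each call to fact has its own copy of x. One call cannot access another call's x.

A stack has two operations: push and pop. Every function call goes onto the call stack. The call stack can grow very long, which consumes a lot of memory.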
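
To see push and pop at work, the sketch below computes a factorial with a hand-managed stack (a plain Python list) instead of recursion. The function name and the returned depth are illustrative assumptions; it expects x >= 1.

```python
def fact_with_explicit_stack(x):
    """Compute x! for x >= 1 using a list as a hand-managed stack.

    Returns (result, max_depth), where max_depth is how many entries
    were waiting on the stack at its deepest point.
    """
    stack = []
    # Push phase: the pending multiplications for x, x - 1, ..., 2 wait on the stack.
    while x > 1:
        stack.append(x)
        x -= 1
    max_depth = len(stack)
    # Pop phase: start from the base case 1! == 1 and multiply as each entry is popped.
    result = 1
    while stack:
        result *= stack.pop()
    return result, max_depth


print(fact_with_explicit_stack(5))  # (120, 4)
```

The stack depth grows linearly with x, which mirrors the memory cost of a deep recursive call chain.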

Recursion can also be used to sum a list, count its elements, and find its maximum.
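
A minimal sketch of the count and maximum cases, assuming plain Python lists as input; the names `count` and `maximum` are illustrative, and `maximum` assumes a non-empty list.

```python
def count(items):
    """Count the elements of a list recursively."""
    # Base case: an empty list has no elements.
    if not items:
        return 0
    # Recursive case: one for the first element plus the count of the rest.
    return 1 + count(items[1:])


def maximum(items):
    """Find the largest element of a non-empty list recursively."""
    # Base case: a single element is its own maximum.
    if len(items) == 1:
        return items[0]
    # Recursive case: compare the first element with the maximum of the rest.
    rest_max = maximum(items[1:])
    return items[0] if items[0] > rest_max else rest_max


print(count([2, 4, 6]))    # 3
print(maximum([2, 9, 6]))  # 9
```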

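# What is divide and conquer (D&C)?

Divide and conquer is the first general problem-solving technique you learn here; it is a well-known recursive approach to solving problems.

How divide and conquer works:

- Figure out a simple base case; it must be as simple as possible.
- Keep dividing the problem (shrinking its size) until it reaches the base case.

Recursive sum: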

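def recursive_sum(items):
    # Base case: an empty list sums to 0
    if not items:
        return 0
    # Recursive case: the first element plus the sum of the rest
    return items[0] + recursive_sum(items[1:])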

Binary search implemented recursively:

l = [2, 3, 5, 10, 15, 16, 18, 22, 26, 30, 32, 35, 41, 42, 43, 55, 56, 66, 67, 69, 72, 76, 82, 83, 88]


def find(l, aim, start=0, end=None):
    # On the first call, default end to the last valid index; recursive calls pass it explicitly
    end = len(l) - 1 if end is None else end
    # Midpoint of the current search range
    mid_index = (end - start) // 2 + start
    if start <= end:
        if l[mid_index] < aim:
            return find(l, aim, start=mid_index + 1, end=end)
        elif l[mid_index] > aim:
            return find(l, aim, start=start, end=mid_index - 1)
        else:
            return mid_index
    else:
        # The range is empty, so the value is not in the list
        return None


ret = find(l, 3)
print(ret)

# Common algorithms

# Binary search

def binary_search(list, item):
    # low and high keep track of which part of the list you'll search in.
    low = 0
    high = len(list) - 1

    # While you haven't narrowed it down to one element ...
    while low <= high:
        # ... check the middle element
        mid = (low + high) // 2
        guess = list[mid]
        # Found the item.
        if guess == item:
            return mid
        # The guess was too high.
        if guess > item:
            high = mid - 1
        # The guess was too low.
        else:
            low = mid + 1

    # Item doesn't exist
    return None


my_list = [1, 3, 5, 7, 9]
print(binary_search(my_list, 3))  # => 1

# 'None' means nil in Python. We use it to indicate that the item wasn't found.
print(binary_search(my_list, -1))  # => None

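# Quicksort

Quicksort runs in O(n log n) time on average. It uses recursion to break the array down: pick an element of the array (here, the first one) as the pivot, then find the elements smaller than the pivot and the elements larger than it. This step is called partitioning. The array now consists of three parts:

- a sub-array of elements less than the pivot
- the pivot
- a sub-array of elements greater than the pivot

Worst case: O(n^2)

Best case: O(n log n)

The best case is also the average case, provided you pick a random element of the array as the pivot each time.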

def quicksort(array):
  if len(array) < 2:
    # base case, arrays with 0 or 1 element are already "sorted"
    return array
  else:
    # recursive case
    pivot = array[0]
    # sub-array of all the elements less than or equal to the pivot
    less = [i for i in array[1:] if i <= pivot]
    # sub-array of all the elements greater than the pivot
    greater = [i for i in array[1:] if i > pivot]
    return quicksort(less) + [pivot] + quicksort(greater)

print(quicksort([10, 5, 2, 3]))
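
The notes recommend a random pivot, while the version above always takes the first element. A minimal sketch of the random-pivot variant follows; the name `quicksort_random_pivot` is an assumption, and it builds new lists rather than sorting in place.

```python
import random


def quicksort_random_pivot(array):
    """Return a sorted copy of array, choosing the pivot at random."""
    # Base case: arrays with 0 or 1 element are already sorted.
    if len(array) < 2:
        return array
    # Pick a random position so that no particular input order is always bad.
    pivot_index = random.randrange(len(array))
    pivot = array[pivot_index]
    rest = array[:pivot_index] + array[pivot_index + 1:]
    less = [x for x in rest if x <= pivot]
    greater = [x for x in rest if x > pivot]
    return quicksort_random_pivot(less) + [pivot] + quicksort_random_pivot(greater)


print(quicksort_random_pivot([10, 5, 2, 3]))  # [2, 3, 5, 10]
```

With a fixed first-element pivot, an already sorted input triggers the O(n^2) worst case; a random pivot makes that outcome unlikely for any fixed input.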

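# Hash functions

A hash table is a data structure with extra logic built in: it uses a hash function to decide where each element is stored.
The simplest way to handle a collision, where several keys map to the same slot, is to store a linked list at that slot.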
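
A minimal sketch of collision handling by chaining. The class `ChainedHashTable` and its method names are hypothetical, and each chain is a Python list standing in for the linked list described above; Python's built-in `hash` plays the role of the hash function.

```python
class ChainedHashTable:
    """A toy hash table that resolves collisions by chaining."""

    def __init__(self, slot_count=8):
        # One chain (a list of (key, value) pairs) per slot.
        self.slots = [[] for _ in range(slot_count)]

    def _chain_for(self, key):
        # The hash function decides which slot the key belongs to.
        return self.slots[hash(key) % len(self.slots)]

    def put(self, key, value):
        chain = self._chain_for(key)
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))       # new key: extend the chain

    def get(self, key, default=None):
        for existing_key, value in self._chain_for(key):
            if existing_key == key:
                return value
        return default


prices = ChainedHashTable(slot_count=2)  # three keys in two slots forces a collision
prices.put("apple", 0.67)
prices.put("milk", 1.49)
prices.put("avocado", 1.49)
print(prices.get("milk"))    # 1.49
print(prices.get("durian"))  # None
```

Lookups stay fast only while chains are short, which is why real hash tables grow the slot array as they fill up.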

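# Breadth-first search

If you search your entire network for a mango seller, you will follow every edge (remember, an edge is the arrow or connection from one person to another), so the running time is at least O(number of edges). You also keep a queue of every person to check. Adding one person to the queue takes constant time, O(1), so doing this for every person takes O(number of people) in total. Breadth-first search therefore runs in O(number of people + number of edges), usually written O(V + E), where V is the number of vertices and E is the number of edges.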

from collections import deque

def person_is_seller(name):
    return name[-1] == 'm'

graph = {}
graph["you"] = ["alice", "bob", "claire"]
graph["bob"] = ["anuj", "peggy"]
graph["alice"] = ["peggy"]
graph["claire"] = ["thom", "jonny"]
graph["anuj"] = []
graph["peggy"] = []
graph["thom"] = []
graph["jonny"] = []

def search(name):
    search_queue = deque()
    search_queue += graph[name]
    # This array is how you keep track of which people you've searched before.
    searched = []
    while search_queue:
        person = search_queue.popleft()
        # Only search this person if you haven't already searched them.
        if person not in searched:
            if person_is_seller(person):
                print(person + " is a mango seller!")
                return True
            else:
                search_queue += graph[person]
                # Marks this person as searched
                searched.append(person)
    return False

search("you")
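
Because breadth-first search visits people in order of distance, it can also report how many hops away the nearest match is. The sketch below is one way to do that under stated assumptions: the function name `shortest_hops`, the `is_target` callback, and the use of a set for visited nodes are illustrative choices, and the graph is copied from the example above so the block runs on its own.

```python
from collections import deque


def shortest_hops(graph, start, is_target):
    """Return (node, hops) for the nearest node satisfying is_target, or None."""
    queue = deque([(start, 0)])
    visited = {start}  # a set gives O(1) membership checks
    while queue:
        node, hops = queue.popleft()
        if node != start and is_target(node):
            return node, hops
        # Each vertex is enqueued once and each edge is examined once: O(V + E).
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


friends = {
    "you": ["alice", "bob", "claire"],
    "bob": ["anuj", "peggy"],
    "alice": ["peggy"],
    "claire": ["thom", "jonny"],
    "anuj": [], "peggy": [], "thom": [], "jonny": [],
}
print(shortest_hops(friends, "you", lambda name: name.endswith("m")))  # ('thom', 2)
```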

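# Dijkstra's algorithm

Four steps:

- Find the cheapest node, that is, the node you can reach in the least time.
- For each neighbor of that node, check whether there is a shorter path to it through this node; if so, update its cost.
- Repeat until you have done this for every node in the graph.
- Calculate the final path.

To find the shortest path in an unweighted graph, use breadth-first search; to find the shortest path in a weighted graph, use Dijkstra's algorithm.

Create a table that lists the cost of each node. Here, the cost is how much you have to pay to reach that node.

As Dijkstra's algorithm runs, you keep updating this table. To compute the final path, you also need a column in the table that records each node's parent.

If the graph has negative-weight edges, you cannot use Dijkstra's algorithm: once a node has been processed its cost is treated as final, and a negative edge can later reveal a cheaper path to it.

Implementation: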

# the graph
graph = {}
graph["start"] = {}
graph["start"]["a"] = 6
graph["start"]["b"] = 2

graph["a"] = {}
graph["a"]["fin"] = 1

graph["b"] = {}
graph["b"]["a"] = 3
graph["b"]["fin"] = 5

graph["fin"] = {}

# the costs table
infinity = float("inf")
costs = {}
costs["a"] = 6
costs["b"] = 2
costs["fin"] = infinity

# the parents table
parents = {}
parents["a"] = "start"
parents["b"] = "start"
parents["fin"] = None

processed = []


def find_lowest_cost_node(costs):
    lowest_cost = float("inf")
    lowest_cost_node = None
    # Go through each node.
    for node in costs:
        cost = costs[node]
        # If it's the lowest cost so far and hasn't been processed yet...
        if cost < lowest_cost and node not in processed:
            # ... set it as the new lowest-cost node.
            lowest_cost = cost
            lowest_cost_node = node
    return lowest_cost_node


# Find the lowest-cost node that you haven't processed yet.
node = find_lowest_cost_node(costs)
# If you've processed all the nodes, this while loop is done.
while node is not None:
    cost = costs[node]
    # Go through all the neighbors of this node.
    neighbors = graph[node]
    for n in neighbors.keys():
        new_cost = cost + neighbors[n]
        # If it's cheaper to get to this neighbor by going through this node...
        if costs[n] > new_cost:
            # ... update the cost for this node.
            costs[n] = new_cost
            # This node becomes the new parent for this neighbor.
            parents[n] = node
    # Mark the node as processed.
    processed.append(node)
    # Find the next node to process, and loop.
    node = find_lowest_cost_node(costs)

print("Cost from the start to each node:")
print(costs)

# result
# {'a': 5, 'b': 2, 'fin': 6}
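
The parents table is what lets you recover the actual route, not just its cost. A minimal sketch follows: the helper name `build_path` is an assumption, and `final_parents` is written out by hand to match what the parents table holds after the run above.

```python
def build_path(parents, start, finish):
    """Walk the parents table backwards from finish to start."""
    path = [finish]
    node = finish
    while node != start:
        node = parents.get(node)
        if node is None:
            return None  # finish is not reachable from start
        path.append(node)
    path.reverse()
    return path


# State of the parents table once the algorithm has finished on the example graph.
final_parents = {"a": "b", "b": "start", "fin": "a"}
print(build_path(final_parents, "start", "fin"))  # ['start', 'b', 'a', 'fin']
```

The edge weights along this route add up to 2 + 3 + 1 = 6, which matches the final cost recorded for `fin`.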

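Breadth-first search finds the shortest path in an unweighted graph. Dijkstra's algorithm finds the shortest path in a weighted graph. Dijkstra's algorithm works only when no edge has a negative weight. If the graph contains negative-weight edges, use the Bellman-Ford algorithm.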
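
For reference, a minimal sketch of Bellman-Ford on a small graph with one negative edge. The graph, the edge-list representation, and the function name `bellman_ford` are all assumptions made for illustration.

```python
def bellman_ford(nodes, edges, start):
    """Return the cheapest cost from start to every node.

    edges is a list of (source, target, weight) tuples; weights may be negative.
    """
    costs = {node: float("inf") for node in nodes}
    costs[start] = 0
    # Relax every edge len(nodes) - 1 times; that is enough for any shortest path.
    for _ in range(len(nodes) - 1):
        for source, target, weight in edges:
            if costs[source] + weight < costs[target]:
                costs[target] = costs[source] + weight
    # If an edge can still be relaxed, the graph has a negative-weight cycle.
    for source, target, weight in edges:
        if costs[source] + weight < costs[target]:
            raise ValueError("graph contains a negative-weight cycle")
    return costs


nodes = ["start", "a", "b", "fin"]
edges = [
    ("start", "a", 4),
    ("start", "b", 6),
    ("b", "a", -3),  # negative edge: reaching a via b costs 3 instead of 4
    ("a", "fin", 2),
]
print(bellman_ford(nodes, edges, "start"))
# {'start': 0, 'a': 3, 'b': 6, 'fin': 5}
```

Bellman-Ford runs in O(V * E) time, slower than Dijkstra's algorithm, which is the price of tolerating negative edges.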

# Approximation algorithms

# You pass an array in, and it gets converted to a set.
states_needed = set(["mt", "wa", "or", "id", "nv", "ut", "ca", "az"])

stations = {}
stations["kone"] = set(["id", "nv", "ut"])
stations["ktwo"] = set(["wa", "id", "mt"])
stations["kthree"] = set(["or", "nv", "ca"])
stations["kfour"] = set(["nv", "ut"])
stations["kfive"] = set(["ca", "az"])

final_stations = set()

while states_needed:
  best_station = None
  states_covered = set()
  for station, states in stations.items():
    covered = states_needed & states
    if len(covered) > len(states_covered):
      best_station = station
      states_covered = covered

  states_needed -= states_covered
  final_stations.add(best_station)

print(final_stations)


## output
## {'ktwo', 'kone', 'kfive', 'kthree'}

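Greedy algorithms pick the locally optimal choice at each step, hoping to arrive at a globally optimal solution. No fast solution has been found for NP-complete problems. When you face an NP-complete problem, the best approach is to use an approximation algorithm. Greedy algorithms are easy to implement and run fast, which makes them good approximation algorithms.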

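# Dynamic programming

Dynamic programming starts by solving small subproblems and builds up to the big problem.

In each row of the grid, the items you can steal are that row's item plus the items from all previous rows.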
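
The grid described above can be sketched as a 0/1 knapsack solver. The item names, weights, and values are assumed sample data, the function name `knapsack` is illustrative, and weights are assumed to be whole numbers.

```python
def knapsack(items, capacity):
    """Return the best total value that fits in the given capacity.

    items is a list of (name, weight, value) tuples with integer weights.
    """
    # grid[row][cap] = best value using the first `row` items with capacity `cap`.
    grid = [[0] * (capacity + 1) for _ in range(len(items) + 1)]
    for row, (_name, weight, value) in enumerate(items, start=1):
        for cap in range(1, capacity + 1):
            # Option 1: skip this row's item and keep the answer from the previous row.
            best = grid[row - 1][cap]
            # Option 2: take this item plus the best use of the remaining capacity.
            if weight <= cap:
                best = max(best, value + grid[row - 1][cap - weight])
            grid[row][cap] = best
    return grid[len(items)][capacity]


items = [("guitar", 1, 1500), ("stereo", 4, 3000), ("laptop", 3, 2000)]
print(knapsack(items, 4))  # 3500 (guitar + laptop)
```

Each cell depends only on the row above it, which is how the small subproblems feed the larger ones.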

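# O(1), O(n), O(log n), O(n log n) explained

When describing algorithmic complexity, O(1), O(n), O(log n), and O(n log n) are commonly used to express an algorithm's time complexity. This section summarizes what they mean. The notation describes both time and space complexity, not time complexity alone.

The function in the parentheses after O describes how an algorithm's running time or memory use grows with the size of the input; n stands for the input size.

For example:

- O(n): the time grows in proportion to the input. If the input grows k-fold, the time grows k-fold too. A simple traversal is the usual example.
- O(n^2): if the input grows k-fold, the time grows k^2-fold, which is worse than linear. Bubble sort is a typical O(n^2) algorithm: sorting n numbers takes on the order of n × n comparisons.
- O(log n): the time grows with the logarithm of the input size (base 2 here), which is better than linear. Binary search is O(log n) because every step rules out half of the remaining candidates, so searching 256 elements takes only about 8 steps.
- O(n log n): n multiplied by log n, so 256 elements need roughly 256 × 8 = 2048 steps. This is worse than linear but better than quadratic. Merge sort runs in O(n log n) time.
- O(1): the lowest complexity. The time or space used does not depend on the input size; however much the input grows, the cost stays the same.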
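
To compare these growth rates side by side, the sketch below prints rough step counts for a few input sizes. The helper `growth` is hypothetical, and the numbers are idealized counts, not measured running times.

```python
import math


def growth(n):
    """Return idealized step counts for an input of size n."""
    return {
        "O(1)": 1,
        "O(log n)": math.log2(n),
        "O(n)": n,
        "O(n log n)": n * math.log2(n),
        "O(n^2)": n ** 2,
    }


for n in (16, 256, 65536):
    print(n, growth(n))
```

For n = 256 the logarithmic entry is 8 and the n log n entry is 2048, matching the figures quoted above.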

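Hash table lookup is a typical O(1) operation: regardless of the data size, the target is found after a single hash computation (ignoring collisions).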